Transform any website into a structured knowledge repository with this crawler: it extracts hyperlinks from the homepage, separates image assets from content pages, and aggregates full Markdown-formatted content, ready for feeding AI agents or building company dossiers without manual effort.
📋 What This Template Does
This advanced workflow acts as a lightweight web crawler: it scrapes the homepage to discover all internal links (mimicking a sitemap extraction), deduplicates and validates them, separates image assets from textual pages, then fetches and converts non-image page content to clean Markdown. Results are seamlessly appended to Google Sheets for easy analysis, export, or integration into vector databases.
- Automatically discovers and processes subpage links from the homepage
- Filters out duplicates and non-HTTP links for efficient crawling (a sketch of this step follows the list)
- Converts scraped content to Markdown for AI-ready formatting
- Categorizes and stores images, links, and full content in a single sheet row per site
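To make the link-discovery step concrete, here is a minimal sketch of how it could look inside an n8n Code node. The node reference "Set Website", the input field `html`, and the extension list are assumptions for illustration; the template's actual nodes and field names may differ, and in the real workflow these steps may be spread across separate HTTP Request, filter, and Markdown nodes.

```javascript
// Hypothetical n8n Code node sketch of the link-discovery step.
// Node name "Set Website" and the input field "html" are assumptions;
// the template's actual node and field names may differ.
const html = $input.first().json.html || '';
const baseUrl = $('Set Website').first().json.website_url;
const host = new URL(baseUrl).hostname;

// Pull href values out of <a> tags with a simple regex
const hrefs = [...html.matchAll(/<a\b[^>]*href=["']([^"']+)["']/gi)].map(m => m[1]);

// Resolve relative links, keep http(s) links on the same host, deduplicate
const unique = [...new Set(
  hrefs
    .map(h => { try { return new URL(h, baseUrl).href; } catch { return null; } })
    .filter(u => u && u.startsWith('http') && new URL(u).hostname === host)
)];

// Separate image assets from content pages by file extension
const images = unique.filter(u => /\.(png|jpe?g|gif|svg|webp)(\?|$)/i.test(u));
const pages  = unique.filter(u => !images.includes(u));

return [{ json: { images, pages } }];
```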
🔧 Prerequisites
- Google account with Sheets access for data storage
- n8n instance (cloud or self-hosted)
- Basic understanding of URLs and web links
🔑 Required Credentials
Google Sheets OAuth2 API Setup
- Go to console.cloud.google.com → APIs & Services → Credentials
- Click "Create Credentials" → Select "OAuth client ID" → Choose "Web application"
- Add authorized redirect URIs: https://your-n8n-instance.com/rest/oauth2-credential/callback (replace with your n8n URL)
- Download the client ID and secret, then add them to n8n as the "Google Sheets OAuth2 API" credential type
- During setup, grant access to Google Sheets scopes (e.g., spreadsheets) and test the connection by listing a sheet
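For quick reference, the two values from the steps above look like this (substitute your own n8n host; n8n may request additional Drive scopes alongside the Sheets scope):

```
# Authorized redirect URI (replace the host with your n8n instance URL)
https://your-n8n-instance.com/rest/oauth2-credential/callback

# Core Google Sheets scope granted during the OAuth consent screen
https://www.googleapis.com/auth/spreadsheets
```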
⚙️ Configuration Steps
- Import the workflow JSON into your n8n instance
- In the "Set Website" node, update the
website_url
value to your target site (e.g., https://example.com)
- Assign your Google Sheets credential to the three "Add ... to Sheet" nodes
- Update the `documentId` and `sheetName` in those nodes to your target spreadsheet ID and sheet name/ID
- Ensure your sheet has columns: "Website", "Links", "Scraped Content", "Images" (a sketch of the resulting row follows these steps)
- Activate the workflow and trigger manually to test scraping
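For clarity on the column mapping, the data appended for each site ends up shaped roughly like the object below. The exact serialization of the link and image lists is an assumption; what matters is that the keys match the sheet headers exactly.

```javascript
// Hypothetical n8n Code node output showing the row shape expected by the
// "Add ... to Sheet" nodes. Values are illustrative placeholders.
const pages = ['https://example.com/about', 'https://example.com/blog'];
const images = ['https://example.com/logo.png'];
const markdown = '# Example\n\nConverted page content...';

return [{
  json: {
    'Website': 'https://example.com',
    'Links': pages.join('\n'),                    // one URL per line
    'Scraped Content': markdown.slice(0, 50000),  // stay under the ~50k character cell limit
    'Images': images.join('\n'),
  },
}];
```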
🎯 Use Cases
- Knowledge base creation: Crawl a company's site to aggregate all content into Sheets, then export to Notion or a vector DB for internal wikis
- AI agent training: Extract structured Markdown from industry sites to fine-tune LLMs on domain-specific data like legal docs or tech blogs
- Competitor intelligence: Build dossiers by crawling rival websites, separating assets and text for SEO audits or market analysis
- Content archiving: Preserve dynamic sites (e.g., news portals) as static knowledge dumps for compliance or historical research
⚠️ Troubleshooting
- No links extracted: Verify the homepage contains <a> tags; test with a simple site like example.com and inspect the HTTP response in the workflow's execution log
- Sheet update fails: Confirm the column names match exactly (they are case-sensitive) and that the credential has edit permissions; try a new blank sheet
- Content truncated: Google Sheets limits cells to roughly 50,000 characters; adjust the `.slice(0, 50000)` in "Add Scraped Content to Sheet" or split the content into multiple rows (see the chunking sketch after this list)
- Rate limiting errors: Add a "Wait" node after "Scrape Links" with a 1-2 second delay if the site blocks rapid requests
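If truncation is not acceptable, one option is to chunk the Markdown and append several rows instead of one. Below is a minimal Code-node sketch placed before the sheet append, assuming incoming fields named `website` and `markdown` (both names are assumptions, not the template's exact fields).

```javascript
// Split long Markdown into chunks that fit comfortably inside the
// ~50,000-character Google Sheets cell limit, emitting one item per chunk.
const { website, markdown } = $input.first().json; // assumed field names
const CHUNK = 45000;
const items = [];

for (let i = 0; i < (markdown || '').length; i += CHUNK) {
  items.push({
    json: {
      'Website': website,
      'Scraped Content': markdown.slice(i, i + CHUNK),
    },
  });
}

// Fall back to a single empty row if there was no content at all
return items.length ? items : [{ json: { 'Website': website, 'Scraped Content': '' } }];
```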