π‘π Essential Multipage Website Scraper with Jina.ai
Use responsibly and follow local rules and regulations
This N8N workflow enables automated multi-page website scraping using Jina.ai's powerful web scraping capabilities, with seamless integration to Google Drive for content storage. Here's how it works:
Main Features
The workflow automatically scrapes multiple pages from a website's sitemap and saves each page's content as a separate Google Drive document.
Key Components
Input Configuration
- Starts with a sitemap URL (default: https://ai.pydantic.dev/sitemap.xml)
- Processes the sitemap to extract individual page URLs
- Includes filtering options to target specific topics or pages
Scraping Process
- Uses Jina.ai's web scraper to extract content from each URL
- Converts webpage content into clean markdown format
- Extracts page titles automatically for document naming
Storage Integration
- Creates individual Google Drive documents for each scraped page
- Names documents using the format "URL - Page Title"
- Saves content in markdown format for better readability
Usage Instructions
- Set your target website's sitemap URL in the "Set Website URL" node
- Configure the "Filter By Topics or Pages" node to select specific content
- Adjust the "Limit" node (default: 20 pages) to control batch size
- Connect your Google Drive account
- Run the workflow to begin automated scraping
Additional Features
- Built-in rate limiting through the Wait node to prevent overloading servers
- Batch processing capability for handling large sitemaps
The workflow requires no API key for Jina.ai, making it accessible for immediate use while maintaining responsible scraping practices.