Back to Templates

💡🌐 Essential Multipage Website Scraper with Jina.ai

Created by

Created by: Joseph LePage || joe

Joseph LePage

Last update

Last update 4 months ago

Categories

Share


💡🌐 Essential Multipage Website Scraper with Jina.ai

Use responsibly and follow local rules and regulations

This N8N workflow enables automated multi-page website scraping using Jina.ai's powerful web scraping capabilities, with seamless integration to Google Drive for content storage. Here's how it works:

Main Features

The workflow automatically scrapes multiple pages from a website's sitemap and saves each page's content as a separate Google Drive document.

Key Components

Input Configuration

  • Starts with a sitemap URL (default: https://ai.pydantic.dev/sitemap.xml)
  • Processes the sitemap to extract individual page URLs
  • Includes filtering options to target specific topics or pages

Scraping Process

  • Uses Jina.ai's web scraper to extract content from each URL
  • Converts webpage content into clean markdown format
  • Extracts page titles automatically for document naming

Storage Integration

  • Creates individual Google Drive documents for each scraped page
  • Names documents using the format "URL - Page Title"
  • Saves content in markdown format for better readability

Usage Instructions

  1. Set your target website's sitemap URL in the "Set Website URL" node
  2. Configure the "Filter By Topics or Pages" node to select specific content
  3. Adjust the "Limit" node (default: 20 pages) to control batch size
  4. Connect your Google Drive account
  5. Run the workflow to begin automated scraping

Additional Features

  • Built-in rate limiting through the Wait node to prevent overloading servers
  • Batch processing capability for handling large sitemaps

The workflow requires no API key for Jina.ai, making it accessible for immediate use while maintaining responsible scraping practices.