Automatically extracts all page URLs from website sitemaps, filters out unwanted sitemap links, and saves clean URLs to Google Sheets for SEO analysis and reporting.
This workflow automates the process of discovering and extracting all page URLs from a website's sitemap structure. Here's how it works step-by-step:
Step 1: URL Input
The workflow starts when you submit a website URL through a simple form interface.
Step 2: Sitemap Discovery
The system automatically generates and tests multiple possible sitemap URLs, including /sitemap.xml, /sitemap_index.xml, and other common variations, along with /robots.txt, which can list sitemap locations.
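The candidate generation in this step can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual node code: the function name candidate_sitemap_urls and the CANDIDATE_PATHS list are assumptions, and the real workflow may test more variations.

```python
from urllib.parse import urlsplit

# Hypothetical list of common locations; the workflow's own list may be longer.
CANDIDATE_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/robots.txt"]


def candidate_sitemap_urls(site_url: str) -> list[str]:
    """Build candidate sitemap URLs from the site's origin (scheme + host)."""
    # Assume https when the user submits a bare domain such as "example.com".
    if "://" not in site_url:
        site_url = "https://" + site_url
    parts = urlsplit(site_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    # Any path on the submitted URL is ignored; sitemaps usually sit at the root.
    return [origin + path for path in CANDIDATE_PATHS]
```

For example, submitting https://example.com/blog/ would produce candidates under https://example.com/, regardless of the /blog/ path.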
Step 3: Valid Sitemap Identification
It sends HTTP requests to each potential sitemap URL and filters out empty or invalid responses, keeping only accessible sitemaps.
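One way to express the "empty or invalid" check is a small predicate over the response status and body. The sketch below is an assumption about what counts as valid (a 200 status plus a recognizable sitemap marker); the workflow's own filter conditions may differ, and looks_like_sitemap is a hypothetical name.

```python
def looks_like_sitemap(status: int, body: str) -> bool:
    """Return True when a response plausibly contains sitemap data."""
    # Assumption: only a plain 200 response counts as accessible.
    if status != 200:
        return False
    text = body.strip().lower()
    if not text:
        return False  # empty responses are discarded
    # Markers: a URL set, a sitemap index, or a robots.txt "Sitemap:" directive.
    return "<urlset" in text or "<sitemapindex" in text or "sitemap:" in text
```

The status and body would come from whichever HTTP client performs the request; keeping the check separate makes it easy to adjust the markers without touching the request logic.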
Step 4: Nested Sitemap Processing
For sitemap index files, the workflow extracts all nested sitemap URLs and processes each one individually to ensure complete coverage.
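As an illustration of this step, a sitemap index can be told apart from a regular sitemap by its root element, and its nested sitemap URLs read from the <loc> child of each <sitemap> entry. The sketch below uses Python's standard XML parser; the function names are hypothetical and the workflow itself may parse the response differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def nested_sitemaps(xml_text: str) -> list[str]:
    """Return the nested sitemap URLs if xml_text is a sitemap index, else []."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "sitemapindex":
        return []  # a regular sitemap has no nested sitemaps to follow
    urls = []
    for entry in root:
        if local_name(entry.tag) != "sitemap":
            continue
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return urls
```

Each returned URL would then be fetched and processed like any other sitemap.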
Step 5: Page URL Extraction
From each valid sitemap, it parses the XML content and extracts all individual page URLs using both XML <loc> tags and HTML links.
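A rough sketch of extracting URLs from both sources is shown below. It uses regular expressions rather than a full parser so that it tolerates responses that are not well-formed XML; the patterns and the name extract_urls are assumptions for illustration, and this simple version does not handle CDATA-wrapped <loc> values.

```python
import re
from html import unescape

# <loc>...</loc> entries from XML sitemaps.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)
# Absolute http(s) links from HTML anchors.
HREF_RE = re.compile(r"""href\s*=\s*["'](https?://[^"']+)["']""", re.IGNORECASE)


def extract_urls(content: str) -> list[str]:
    """Collect URLs from <loc> tags first, then from HTML href attributes."""
    found = LOC_RE.findall(content) + HREF_RE.findall(content)
    # Sitemaps escape '&' as '&amp;'; unescape so the URLs are usable as-is.
    return [unescape(url.strip()) for url in found]
```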
Step 6: URL Filtering
The system removes any URLs containing "sitemap" to ensure only actual content pages (like product, service, or blog pages) are retained.
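The filter itself amounts to a substring check. In the sketch below the match is case-insensitive, which is an assumption rather than something the workflow description states. Note that a substring filter of this kind also drops any content page whose URL happens to contain "sitemap", such as an HTML sitemap page.

```python
def drop_sitemap_links(urls: list[str]) -> list[str]:
    """Keep only URLs that do not contain 'sitemap' (case-insensitive here)."""
    return [url for url in urls if "sitemap" not in url.lower()]
```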
Step 7: Google Sheets Integration
Finally, all clean page URLs are saved automatically to a Google Sheets document, with duplicate prevention, for easy analysis and reporting.
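Duplicate prevention can be thought of as appending only the URLs that are not already in the sheet. The sketch below shows that logic on plain lists; it does not call the Google Sheets API, and rows_to_append is a hypothetical helper name. In n8n the Google Sheets node would typically handle this itself, for example by matching on the URL column.

```python
def rows_to_append(existing: list[str], incoming: list[str]) -> list[str]:
    """Return incoming URLs not yet in the sheet, in order, without repeats."""
    seen = set(existing)
    fresh = []
    for url in incoming:
        if url not in seen:
            seen.add(url)  # also guards against repeats within this batch
            fresh.append(url)
    return fresh
```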
Estimated Setup Time: 10-15 minutes
1. Import the Workflow:
Import the provided JSON file into your n8n instance.
2. Configure Google Sheets Integration:
3. Test the Workflow:
4. Customize (Optional):
For technical support or questions about this workflow, email the author or fill out the Contact Us form.