Extract Website URLs from Sitemap.XML for SEO Analysis

Created by: Le Thua Phu (lethuaphu)

Last update: 22 days ago

Overview

This n8n workflow automates crawling a website's sitemap to extract its URLs, which is useful for SEO analysis, website auditing, or content monitoring. Using n8n's built-in nodes, it fetches the sitemap from a specified URL, parses the XML, and extracts the individual URLs, which can then be saved to a downloadable file or sent to tools like Google Sheets.

How It Works

The workflow runs sequentially, using a series of nodes to fetch, parse, and process the sitemap data:

  1. Trigger: Initiates when the user clicks "Test workflow" (Manual Trigger node).
  2. Set URL: Defines the base domain (e.g., https://phu.io.vn/) for the sitemap (Set URL node).
  3. Crawl Sitemap: Fetches the main sitemap file (sitemap.xml) from the specified domain using an HTTP request (Crawl sitemap node).
  4. Parse XML: Converts the sitemap XML into a JSON format for easier processing (XML node).
  5. Split Sitemap: Extracts individual sitemap entries (e.g., <sitemap> tags) from the parsed data (Split Out node).
  6. Crawl Sub-Sitemap: Fetches each sub-sitemap URL listed in the main sitemap (Crawl sitemap 2 node).
  7. Parse Sub-Sitemap XML: Converts the sub-sitemap XML into JSON (XML 2 node).
  8. Split URLs: Extracts individual URLs (e.g., <url> tags) from the sub-sitemap (Split Out 2 node).
  9. Convert to File: Saves the extracted URLs into a file for download or further use (Convert to File node).

This workflow supports both single sitemap files and sitemap indexes that reference multiple sub-sitemaps, ensuring comprehensive URL extraction.
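For reference, the node chain above maps onto a short standalone script. The following is a minimal Python sketch of the same steps (fetch sitemap.xml, parse the XML, follow any sub-sitemaps, extract the URLs, write them to a file); the domain and output filename are placeholders rather than values taken from the workflow.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder domain, playing the role of the Set URL node's Domain parameter
DOMAIN = "https://example.com/"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    # Roughly the Crawl sitemap + XML nodes: HTTP GET with a 10-second timeout, then parse
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return ET.fromstring(response.content)

root = fetch_xml(DOMAIN + "sitemap.xml")

urls = []
if root.tag.endswith("sitemapindex"):
    # Sitemap index: follow each sub-sitemap (Split Out -> Crawl sitemap 2 -> XML 2 -> Split Out 2)
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        sub_root = fetch_xml(loc.text.strip())
        urls += [u.text.strip() for u in sub_root.findall("sm:url/sm:loc", NS)]
else:
    # Single sitemap: take the <loc> value of each <url> entry directly
    urls = [u.text.strip() for u in root.findall("sm:url/sm:loc", NS)]

# Roughly the Convert to File node: one URL per line
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))
```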

How to Use

To implement this workflow in n8n, follow these steps:

  1. Set Up n8n: Ensure you have an active n8n instance (Cloud, npm, or self-hosted). Refer to the n8n documentation for setup instructions.
  2. Import Workflow: Copy the JSON from the provided Extract Website URLs from Sitemap.XML for SEO Analysis.json file and import it into your n8n instance via the workflow editor.
  3. Configure the Domain:
    • In the Set URL node, update the Domain parameter with the target website's base URL (e.g., https://example.com/).
    • Alternatively, in the Crawl sitemap node, directly paste the full sitemap URL if known (e.g., https://example.com/sitemap.xml).
  4. Test the Workflow:
    • Click "Test workflow" to execute the Manual Trigger node.
    • Verify that the workflow fetches the sitemap and processes the URLs correctly (a quick reachability check is sketched after this list).
  5. Download or Integrate:
    • The Convert to File node generates a file containing the extracted URLs.
    • Optionally, replace this node with a Google Sheets node to append URLs to a spreadsheet. Refer to the Google Sheets node documentation for setup.
  6. Save and Activate: Save the workflow and activate it for production use if needed, using a trigger like a schedule or webhook (see Trigger Nodes).
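Before testing, it can also help to confirm the sitemap is reachable from your environment. This small Python sketch makes the same request the Crawl sitemap node will make (base domain plus sitemap.xml, 10-second timeout); the domain is a placeholder.

```python
import requests

# Placeholder; use the same value you enter in the Set URL node's Domain parameter
domain = "https://example.com/"

response = requests.get(domain + "sitemap.xml", timeout=10)
print(response.status_code, response.headers.get("Content-Type"))
```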

Requirements

  • n8n Instance: An active n8n instance (version 1.0 or later recommended) on n8n Cloud, npm, or self-hosted (Docker). See Choose your n8n for details.
  • Technical Knowledge: Basic understanding of n8n's editor UI and node configuration. Familiarity with XML sitemaps is helpful but not mandatory.
  • Permissions: For self-hosted setups, ensure the n8n process has network access to fetch the sitemap URL. For Docker deployments, verify permissions as outlined in the n8n v1.0 migration guide.
  • Optional: If integrating with Google Sheets, valid Google Sheets credentials are required (see Credentials).
  • Timeout Configuration: The HTTP Request nodes (Crawl sitemap and Crawl sitemap 2) have a 10-second timeout. Adjust the timeout parameter in the node settings if dealing with slow-responding servers.

FAQ

Q: What happens if the sitemap is large or contains many sub-sitemaps?
A: The workflow handles sitemap indexes by splitting and processing each sub-sitemap individually. For very large sitemaps, ensure your n8n instance has sufficient resources (memory and CPU) to avoid performance issues. See Scaling n8n for optimization tips.

Q: Can I use this workflow with a specific sitemap URL instead of a domain?
A: Yes, in the Crawl sitemap node, replace the url parameter ({{ $json.Domain }}sitemap.xml) with the direct sitemap URL (e.g., https://example.com/sitemap.xml). Update the node’s notes for clarity.

Q: Why am I getting a timeout error?
A: The HTTP Request nodes have a default timeout of 10 seconds. If the target server is slow, increase the timeout value in the options parameter of the Crawl sitemap or Crawl sitemap 2 nodes.
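As a rough illustration of what that setting controls, the equivalent request in Python looks like the snippet below (the URL is a placeholder); in n8n itself, the value is changed under the node's Options, not in code.

```python
import requests

sitemap_url = "https://example.com/sitemap.xml"  # placeholder

# The workflow's default is 10 seconds; raise it for slow-responding servers
response = requests.get(sitemap_url, timeout=30)
```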

Q: How can I save the URLs to Google Sheets instead of a file?
A: Replace the Convert to File node with a Google Sheets node. Configure it with your Google Sheets credentials and map the loc field from the Split Out 2 node to the desired spreadsheet column. Refer to the Google Sheets node documentation.

Q: Is this workflow compatible with older n8n versions?
A: The workflow uses nodes compatible with n8n version 1.0 and later. For older versions, check for deprecated features (e.g., MySQL support) in the n8n v1.0 migration guide.

Q: Can I automate this workflow to run periodically?
A: Yes, replace the Manual Trigger node with a Schedule Trigger node to run the workflow at set intervals. See Trigger Nodes for configuration details.

For further assistance, consult the n8n Community Forum or submit an issue on the n8n GitHub repository.

Need help customizing?

Contact me for consulting and support, or reach me on Facebook or by email.