This workflow contains community nodes that are only compatible with the self-hosted version of n8n.
This workflow allows users to extract sitemap links using the ScrapingBee API. It only needs a domain name (e.g. www.example.com) and automatically checks robots.txt and sitemap.xml to find the links. It is also designed to run itself recursively when new .xml links are found while scraping the sitemap.
How it works:
- Trigger the workflow by calling the webhook with domain=www.example.com as a query parameter.
- The workflow first scrapes the robots.txt file; if that is not found, it checks sitemap.xml instead (see the sketch after this list).
- When the workflow is finished, you will see the output in the links column of the Google Sheet that we added to the workflow.
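The traversal logic is easier to see outside n8n. Below is a minimal standalone Python sketch of the same idea, assuming an https site and using only the standard library; the function names are illustrative and not part of the workflow:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

def fetch(url: str) -> str:
    """Download a URL and return its body as text (no retries, kept short)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def sitemap_urls_from_robots(domain: str) -> list[str]:
    """Collect 'Sitemap:' lines from robots.txt; fall back to /sitemap.xml."""
    try:
        robots = fetch(f"https://{domain}/robots.txt")  # assumes https
        found = re.findall(r"(?im)^sitemap:\s*(\S+)", robots)
        if found:
            return found
    except Exception:
        pass  # robots.txt missing or unreadable -> try the default location
    return [f"https://{domain}/sitemap.xml"]

def extract_links(sitemap_url: str, links: set[str], seen: set[str]) -> None:
    """Parse one sitemap; recurse into nested .xml sitemaps, collect page links."""
    if sitemap_url in seen:  # guard against sitemap loops
        return
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    # Sitemap <loc> elements are namespaced, so match on the local tag name.
    for el in root.iter():
        if el.tag.endswith("loc") and el.text:
            url = el.text.strip()
            if url.endswith(".xml"):   # nested sitemap -> recurse, as the workflow does
                extract_links(url, links, seen)
            else:                      # ordinary page link
                links.add(url)

links: set[str] = set()
for sm in sitemap_urls_from_robots("www.example.com"):
    extract_links(sm, links, set())
print(f"{len(links)} links found")
```

In the workflow itself, the recursion is handled by re-invoking the workflow for each newly discovered .xml link rather than by a function call.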
Set up steps:
- Create a Google Sheet with a column named links. Connect to the sheet by signing in with your Google credential and add the link to your sheet.
- Trigger the workflow by calling the webhook with domain as a query parameter. Example: curl "https://webhook_link?domain=scrapingbee.com"
- Add your ScrapingBee credentials to the Scrape robots.txt file, Scrape sitemap.xml file, and Scrape xml file nodes.

Customization:
- If you don't want to use Google Sheets, replace the Append links to sheet node with a relevant node.
- If you wish to scrape the pages behind the extracted links, you can implement a new workflow that reads the sheet or file generated by this workflow and, for each link, sends a request to ScrapingBee's HTML API and saves the returned data (see the sketch after this list).
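That follow-up step amounts to a loop like the Python sketch below. It assumes the sheet has been exported to a local links.csv with one URL per row; YOUR_API_KEY is a placeholder for your ScrapingBee API key, and the endpoint is ScrapingBee's HTML API:

```python
import csv
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: your ScrapingBee API key

# Assumes links.csv is an export of the Google Sheet, one URL per row.
with open("links.csv", newline="") as f:
    links = [row[0] for row in csv.reader(f) if row]

for link in links:
    # ScrapingBee HTML API: pass the API key and target URL as query params.
    query = urllib.parse.urlencode({"api_key": API_KEY, "url": link})
    endpoint = f"https://app.scrapingbee.com/api/v1/?{query}"
    with urllib.request.urlopen(endpoint, timeout=60) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Save each page under a filename derived from its URL.
    safe_name = urllib.parse.quote(link, safe="") + ".html"
    with open(safe_name, "w", encoding="utf-8") as out:
        out.write(html)
```

The same loop could equally be built as a second n8n workflow with an HTTP Request node per link; the script form is just the shortest way to show the shape of it.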
NOTE: Some heavy sitemaps could cause a crash if the workflow consumes more memory than your n8n plan or self-hosted system provides. If this happens, we recommend either upgrading your plan or using a self-hosted instance with more memory.