HTTP Request node
+9

Scrape and store data from multiple website pages

Published 3 years ago

Template description

This workflow allows extracting data from multiple pages website.

The workflow:

  1. Starts in a country list at https://www.theswiftcodes.com/browse-by-country/.
  2. Loads every country page (https://www.theswiftcodes.com/albania/)
  3. Paginates every page in the country page.
  4. Extracts data from the country page.
  5. Saves data to MongoDB.
  6. Paginates through all pages in all countries.

It uses getWorkflowStaticData('global') method to recover the next page (saved from the previous page), and it goes ahead with all the pages.

There is a first section where the countries list is recovered and extracted.

Later, I try to read if a local cache page is available and I recover the cached page from the disk.

Finally, I save data to MongoDB, and we paginate all the pages in the country and for all the countries.

I have applied a cache system to save a visited page to n8n local disk. If I relaunch workflow, we check if a cache file exists to discard non-required requests to the webpage.

If the data present in the website changes, you can apply a Cron node to check the website once per week.

Finally, before inserting data in MongoDB, the best way to avoid duplicates is to check that swift_code (the primary value of the collection) doesn't exist.

I recommend using a proxy for all requests to avoid IP blocks. A good solution for proxy plus IP rotation is scrapoxy.io.

This workflow is perfect for small data requirements. If you need to scrape dynamic data, you can use a Headless browser or any other service.

If you want to scrape huge lists of URIs, I recommend using Scrapy + Scrapoxy.

Share Template

More Engineering workflow templates

Webhook node
Respond to Webhook node

Creating an API endpoint

Task: Create a simple API endpoint using the Webhook and Respond to Webhook nodes Why: You can prototype or replace a backend process with a single workflow Main use cases: Replace backend logic with a workflow
jon-n8n
Jonathan
Merge node

Joining different datasets

Task: Merge two datasets into one based on matching rules Why: A powerful capability of n8n is to easily branch out the workflow in order to process different datasets. Even more powerful is the ability to join them back together with SQL-like joining logic. Main use cases: Appending data sets Keep only new items Keep only existing items
jon-n8n
Jonathan
GitHub node
HTTP Request node
Merge node
+11

Back Up Your n8n Workflows To Github

This workflow will backup your workflows to Github. It uses the public api to export all of the workflow data using the n8n node. It then loops over the data checks in Github to see if a file exists that uses the workflow name. Once checked it will then update the file on Github if it exists, Create a new file if it doesn't exist and if it's the same it will ignore the file. Config Options repo_owner - Github owner repo_name - Github repository name repo_path - Path within the Github repository >This workflow has been updated to use the n8n node and the code node so requires at least version 0.198.0 of n8n
jon-n8n
Jonathan
Google Sheets node
HTTP Request node
Item Lists node
+5

Google Maps Scraper

This workflow allows to scrape Google Maps data in an efficient way using SerpAPI. You'll get all data from Gmaps at a cheaper cost than Google Maps API. Add as input, your Google Maps search URL and you'll get a list of places with many data points such as: phone number website rating reviews address And much more. Full guide to implement the workflow is here: https://lempire.notion.site/Scrape-Google-Maps-places-with-n8n-b7f1785c3d474e858b7ee61ad4c21136?pvs=4
lucasperret
Lucas Perret
HTTP Request node
Merge node
+3

Backup n8n workflows to Google Drive

Temporary solution using the undocumented REST API for backups using Google drive. Please note that there are issues with this workflow. It does not support versioning, so please know that it will create multiple copies of the workflows so if you run this daily it will make the folder grow quickly. Once I figure out how to version in Gdrive I'll update it here.
djangelic
Angel Menendez
HTTP Request node
Merge node
+13

AI Agent To Chat With Files In Supabase Storage

Video Guide I prepared a detailed guide explaining how to set up and implement this scenario, enabling you to chat with your documents stored in Supabase using n8n. Youtube Link Who is this for? This workflow is ideal for researchers, analysts, business owners, or anyone managing a large collection of documents. It's particularly beneficial for those who need quick contextual information retrieval from text-heavy files stored in Supabase, without needing additional services like Google Drive. What problem does this workflow solve? Manually retrieving and analyzing specific information from large document repositories is time-consuming and inefficient. This workflow automates the process by vectorizing documents and enabling AI-powered interactions, making it easy to query and retrieve context-based information from uploaded files. What this workflow does The workflow integrates Supabase with an AI-powered chatbot to process, store, and query text and PDF files. The steps include: Fetching and comparing files to avoid duplicate processing. Handling file downloads and extracting content based on the file type. Converting documents into vectorized data for contextual information retrieval. Storing and querying vectorized data from a Supabase vector store. File Extraction and Processing: Automates handling of multiple file formats (e.g., PDFs, text files), and extracts document content. Vectorized Embeddings Creation: Generates embeddings for processed data to enable AI-driven interactions. Dynamic Data Querying: Allows users to query their document repository conversationally using a chatbot. Setup N8N Workflow Fetch File List from Supabase: Use Supabase to retrieve the stored file list from a specified bucket. Add logic to manage empty folder placeholders returned by Supabase, avoiding incorrect processing. Compare and Filter Files: Aggregate the files retrieved from storage and compare them to the existing list in the Supabase files table. Exclude duplicates and skip placeholder files to ensure only unprocessed files are handled. Handle File Downloads: Download new files using detailed storage configurations for public/private access. Adjust the storage settings and GET requests to match your Supabase setup. File Type Processing: Use a Switch node to target specific file types (e.g., PDFs or text files). Employ relevant tools to process the content: For PDFs, extract embedded content. For text files, directly process the text data. Content Chunking: Break large text data into smaller chunks using the Text Splitter node. Define chunk size (default: 500 tokens) and overlap to retain necessary context across chunks. Vector Embedding Creation: Generate vectorized embeddings for the processed content using OpenAI's embedding tools. Ensure metadata, such as file ID, is included for easy data retrieval. Store Vectorized Data: Save the vectorized information into a dedicated Supabase vector store. Use the default schema and table provided by Supabase for seamless setup. AI Chatbot Integration: Add a chatbot node to handle user input and retrieve relevant document chunks. Use metadata like file ID for targeted queries, especially when multiple documents are involved. Testing Upload sample files to your Supabase bucket. Verify if files are processed and stored successfully in the vector store. Ask simple conversational questions about your documents using the chatbot (e.g., "What does Chapter 1 say about the Roman Empire?"). Test for accuracy and contextual relevance of retrieved results.
lowcodingdev
Mark Shcherbakov

More Building Blocks workflow templates

Webhook node
Respond to Webhook node

Creating an API endpoint

Task: Create a simple API endpoint using the Webhook and Respond to Webhook nodes Why: You can prototype or replace a backend process with a single workflow Main use cases: Replace backend logic with a workflow
jon-n8n
Jonathan
Customer Datastore (n8n training) node

Very quick quickstart

Want to learn the basics of n8n? Our comprehensive quick quickstart tutorial is here to guide you through the basics of n8n, step by step. Designed with beginners in mind, this tutorial provides a hands-on approach to learning n8n's basic functionalities.
deborah
Deborah
HTTP Request node
Item Lists node

Pulling data from services that n8n doesn’t have a pre-built integration for

You still can use the app in a workflow even if we don’t have a node for that or the existing operation for that. With the HTTP Request node, it is possible to call any API point and use the incoming data in your workflow Main use cases: Connect with apps and services that n8n doesn’t have integration with Web scraping How it works This workflow can be divided into three branches, each serving a distinct purpose: 1.Splitting into Items (HTTP Request - Get Mock Albums): The workflow initiates with a manual trigger (On clicking 'execute'). It performs an HTTP request to retrieve mock albums data from "https://jsonplaceholder.typicode.com/albums." The obtained data is split into items using the Item Lists node, facilitating easier management. 2.Data Scraping (HTTP Request - Get Wikipedia Page and HTML Extract): Another branch of the workflow involves fetching a random Wikipedia page using an HTTP request to "https://en.wikipedia.org/wiki/Special:Random." The HTML Extract node extracts the article title from the fetched Wikipedia page. 3.Handling Pagination (The final branch deals with handling pagination for a GitHub API request): It sends an HTTP request to "https://api.github.com/users/that-one-tom/starred," with parameters like the page number and items per page dynamically set by the Set node. The workflow uses conditions (If - Are we finished?) to check if there are more pages to retrieve and increments the page number accordingly (Set - Increment Page). This process repeats until all pages are fetched, allowing for comprehensive data retrieval.
jon-n8n
Jonathan
Merge node

Joining different datasets

Task: Merge two datasets into one based on matching rules Why: A powerful capability of n8n is to easily branch out the workflow in order to process different datasets. Even more powerful is the ability to join them back together with SQL-like joining logic. Main use cases: Appending data sets Keep only new items Keep only existing items
jon-n8n
Jonathan
HTTP Request node
WhatsApp Business Cloud node
+10

Building Your First WhatsApp Chatbot

This n8n template builds a simple WhatsApp chabot acting as a Sales Agent. The Agent is backed by a product catalog vector store to better answer user's questions. This template is intended to help introduce n8n users interested in building with WhatsApp. How it works This template is in 2 parts: creating the product catalog vector store and building the WhatsApp AI chatbot. A product brochure is imported via HTTP request node and its text contents extracted. The text contents are then uploaded to the in-memory vector store to build a knowledgebase for the chatbot. A WhatsApp trigger is used to capture messages from customers where non-text messages are filtered out. The customer's message is sent to the AI Agent which queries the product catalogue using the vector store tool. The Agent's response is sent back to the user via the WhatsApp node. How to use Once you've setup and configured your WhatsApp account and credentials First, populate the vector store by clicking the "Test Workflow" button. Next, activate the workflow to enable the WhatsApp chatbot. Message your designated WhatsApp number and you should receive a message from the AI sales agent. Tweak datasource and behaviour as required. Requirements WhatsApp Business Account OpenAI for LLM Customising this workflow Upgrade the vector store to Qdrant for persistance and production use-cases. Handle different WhatsApp message types for a more rich and engaging experience for customers.
jimleuk
Jimleuk
GitHub node
HTTP Request node
Merge node
+11

Back Up Your n8n Workflows To Github

This workflow will backup your workflows to Github. It uses the public api to export all of the workflow data using the n8n node. It then loops over the data checks in Github to see if a file exists that uses the workflow name. Once checked it will then update the file on Github if it exists, Create a new file if it doesn't exist and if it's the same it will ignore the file. Config Options repo_owner - Github owner repo_name - Github repository name repo_path - Path within the Github repository >This workflow has been updated to use the n8n node and the code node so requires at least version 0.198.0 of n8n
jon-n8n
Jonathan

Implement complex processes faster with n8n

red icon yellow icon red icon yellow icon