How uProc scraped a multi-page website with n8n
uProc saved time and engineering resources by using n8n to collect and wrangle data from a multi-page website.
Miquel Colomer is a self-proclaimed “IT, data, and open-source freak”. He is also the founder of uProc (and creator of the uProc node), a company that offers data solutions such as collection, cleaning, and automation.
One of his projects involved collecting and wrangling data from a multi-page website, a task he accomplished with n8n. Learn about the common challenges in web-scraping, how Miquel built a low-code workflow for this use case, and what his advice is for getting started with workflow automation.
The use case: scraping banking information
uProc creates tools that simplify data access, collect data (about persons, companies, products, etc.), and use the Internet as a data source. For one project, Miquel had to create two tools for collecting banking-related data: one that returns financial data for a given Swift code, and one that returns the Swift code for a given IBAN account number.
The advantage of using the Internet as a data source is that it provides a tremendous amount of data, so Miquel could find the information he needed. However, when it came to using this information in his application, he encountered several challenges.
Challenges of web-scraping
When trying to collect data from the Internet, Miquel usually encounters three main challenges:
- The data is spread around different sources, which makes it difficult to collect and maintain.
- The data is available in different formats (e.g. HTML, RSS, CSV, XML), which makes it difficult to combine and process.
- The data is sometimes outdated, which makes it difficult to build reliable, useful applications.
Eventually, Miquel found the Swift codes he needed for his application at https://www.theswiftcodes.com. The next step was to collect this data in a structured way. At first, he used Python scripts built on dedicated web-crawling libraries such as Scrapy.
Though the scripts were up to the task, writing the code involved repetitive manual work and was time-consuming, as it included selecting the right tags and selectors, formatting, and processing the data in a way that could eventually be used in the end application.
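To give a sense of the manual work described above, here is a minimal sketch of pulling rows out of an HTML table in Python. It uses only the standard library rather than Scrapy, and the sample markup, the field names in `COLUMNS`, and the `SwiftTableParser` class are all assumptions made for illustration; they are not taken from Miquel's scripts or from the real page structure of theswiftcodes.com.

```python
from html.parser import HTMLParser

# Hypothetical markup and codes; the real site's structure may differ.
SAMPLE_HTML = """
<table class="swift">
  <tr><th>Bank</th><th>City</th><th>Swift code</th></tr>
  <tr><td>Example Bank</td><td>Andorra la Vella</td><td>EXMPADAD</td></tr>
  <tr><td>Sample Credit</td><td>Encamp</td><td>SMPLADAD</td></tr>
</table>
"""

COLUMNS = ("bank", "city", "swift_code")  # assumed field names


class SwiftTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, one dict per data row
        self._cells = []       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td":
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and len(self._cells) == len(COLUMNS):
            # Header rows use <th>, so they produce no cells and are skipped.
            cleaned = (cell.strip() for cell in self._cells)
            self.rows.append(dict(zip(COLUMNS, cleaned)))


def extract_rows(html):
    """Parse the HTML and return one dict per data row."""
    parser = SwiftTableParser()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row["swift_code"], "-", row["bank"])
```

Even in this small case, the tag handling, whitespace cleanup, and field mapping all have to be written and maintained by hand for each page layout, which is the kind of repetition the scripts involved.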
In an effort to avoid manually writing extensive code, Miquel turned to workflow automation with n8n.
The low-code solution for scraping multi-page websites
Miquel built a 22-node low-code workflow that scrapes static websites with pagination. The workflow extracts data from every country page on the https://www.theswiftcodes.com/browse-by-country/ website and stores the collected information in MongoDB. Key nodes in the workflow include:
- the Execute Command node (to automatically create a local cache directory before starting the web-scraping process and avoid scraping the same pages)
- the HTTP Request node (to access data from the https://www.theswiftcodes.com website)
- the HTML Extract node (to extract the desired content from the website based on its HTML tags)
- the Set node (to set the necessary fields before transferring the data)
- the IF node (to filter information based on conditional logic, for example, checking whether a Swift code already exists in the database)
- the Read Binary File and Write Binary File nodes (to read and write data collected from the website)
- the Split In Batches node (to loop through the data)
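For readers who think in code, the roles of the nodes listed above can be approximated in a short Python sketch. This is not the n8n workflow itself: the function names, the page URLs, the `<td class="code">` markup, and the sample codes are hypothetical, a plain dictionary stands in for MongoDB, and the HTTP call is replaced by an injected `fetch` function so the example runs without network access. The regular expression is a shortcut that only suits this assumed markup.

```python
import hashlib
import re
import tempfile
from pathlib import Path


def fetch_cached(url, cache_dir, fetch):
    """Return the page body, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)  # roughly the Execute Command step
    path = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():                             # roughly Read Binary File
        return path.read_text(encoding="utf-8")
    body = fetch(url)                             # roughly HTTP Request
    path.write_text(body, encoding="utf-8")       # roughly Write Binary File
    return body


def batches(items, size):
    """Yield consecutive slices of at most `size` items (Split In Batches)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_codes(html):
    """Pull codes out of the assumed markup (HTML Extract)."""
    return re.findall(r'<td class="code">([A-Z0-9]{8,11})</td>', html)


def scrape(urls, cache_dir, fetch, store, batch_size=2):
    """Walk all pages and add only the codes that `store` does not hold yet."""
    for batch in batches(urls, batch_size):
        for url in batch:
            html = fetch_cached(url, cache_dir, fetch)
            for code in extract_codes(html):
                if code not in store:             # IF: skip existing codes
                    store[code] = {"swift_code": code, "source": url}  # Set
    return store


# Stand-in for the network: two hypothetical pages of one country listing.
FAKE_PAGES = {
    "https://example.com/country/page/1":
        '<td class="code">EXMPADAD</td><td class="code">SMPLADAD</td>',
    "https://example.com/country/page/2":
        '<td class="code">SMPLADAD</td><td class="code">DEMOADADXXX</td>',
}
calls = []  # records every simulated download


def fake_fetch(url):
    calls.append(url)
    return FAKE_PAGES[url]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    store = scrape(list(FAKE_PAGES), cache, fake_fetch, {})
    # A second run reads every page from the cache instead of downloading again.
    scrape(list(FAKE_PAGES), cache, fake_fetch, store)

print(sorted(store))
print(len(calls), "downloads")
```

In a real run, `fetch` would perform the HTTP request and `store` would be a database collection; passing both in as parameters keeps the caching and deduplication logic easy to try out in isolation.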
With this workflow, Miquel was able to not only accomplish his project but also save precious time and resources by automating away repetitive coding.
How to get started with workflow automation
Miquel’s advice for anyone interested in using workflow automation is:
Use no-code or low-code solutions to create MVPs or tasks quickly. I avoid coding, I only code what I need.
Use your imagination to create your side project. Think about a problem that you need to solve and try to solve it with n8n.
If you want to get in touch with Miquel Colomer, you can contact him via email, LinkedIn, or Telegram @mcolomer1975. If you are a Spanish speaker, feel free to also join the Telegram group @comunidadn8n.