๐ŸŒ Firecrawl website content extractor

Created by: Aashit Sharma (aashitsharma)

Last update: 3 months ago


๐ŸŒ Firecrawl Website Content Extractor (n8n Workflow)

This n8n automation workflow uses the Firecrawl API to extract structured data (e.g., quotes and authors) from web pages such as Quotes to Scrape, and it retries the result lookup when extraction is still in progress.


๐Ÿ” Workflow Overview

๐ŸŽฏ Purpose:

  • Crawl and extract structured web data using Firecrawl
  • Wait for asynchronous scraping to complete
  • Retrieve and validate results
  • Support retries if content is not ready

🔧 Step-by-Step Node Breakdown

1. 🧪 Manual Trigger

  • Node: When clicking ‘Test workflow’
  • Used to manually test or execute the workflow during setup or debugging.

2. 📤 Firecrawl Extract API Request

  • Node: Extract
  • Sends a POST request to https://api.firecrawl.dev/v1/extract
  • Payload includes:
    • urls: List of pages to crawl (https://quotes.toscrape.com/*)
    • prompt: "Extract all quotes and their corresponding authors from the website."
    • schema: JSON schema defining expected structure (quotes[], each with text and author)

📌 Uses an HTTP Header Auth credential for the Firecrawl API.
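
As a rough illustration of what the Extract node sends, the request can be sketched in Python. The JSON Schema body below is an assumption reconstructed from the description (quotes[], each with text and author), and the Bearer-style Authorization header is an assumed form of the header credential. The helper name build_extract_request and the placeholder key are hypothetical.

```python
import json
import urllib.request

FIRECRAWL_EXTRACT_URL = "https://api.firecrawl.dev/v1/extract"


def build_extract_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request described in step 2."""
    payload = {
        "urls": ["https://quotes.toscrape.com/*"],
        "prompt": "Extract all quotes and their corresponding authors from the website.",
        # Assumed JSON Schema: a quotes array whose items have text and author.
        "schema": {
            "type": "object",
            "properties": {
                "quotes": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "text": {"type": "string"},
                            "author": {"type": "string"},
                        },
                        "required": ["text", "author"],
                    },
                }
            },
            "required": ["quotes"],
        },
    }
    return urllib.request.Request(
        FIRECRAWL_EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the header credential carries a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request("fc-YOUR-API-KEY")  # placeholder key
```

Passing the request to urllib.request.urlopen would submit the job. The response is expected to carry the id field that step 4 reads.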


3. โฑ๏ธ Wait for 30 Seconds

  • Node: 30 Secs
  • Gives Firecrawl time to finish processing in the background
  • Prevents hitting the API before results are ready

4. 📥 Get Results

  • Node: Get Results
  • Performs a GET request to the status URL using {{ $('Extract').item.json.id }} to retrieve extraction results.
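
A minimal sketch of how the status URL might be derived from the Extract response. It assumes the status endpoint is the extract URL followed by the job id, which the template does not spell out. The function name status_url is hypothetical.

```python
from urllib.parse import quote

FIRECRAWL_EXTRACT_URL = "https://api.firecrawl.dev/v1/extract"


def status_url(extract_response: dict) -> str:
    """Return the URL to poll for the job started by the Extract node."""
    job_id = extract_response["id"]  # same field as $('Extract').item.json.id
    # Assumption: status endpoint = extract URL + "/" + job id.
    return f"{FIRECRAWL_EXTRACT_URL}/{quote(str(job_id), safe='')}"
```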

5. โœ…โŒ Condition Check

  • Node: If
  • Checks if the data array is empty (i.e., no results yet)
  • If data is empty:
    • Waits 10 more seconds and retries
  • If data is available:
    • Passes data to the next step (e.g., processing or storage)

6. ๐Ÿ” Retry Delay

  • Node: 10 Seconds
  • Waits briefly before sending another GET request to Firecrawl
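
The wait, fetch, check, and retry loop of steps 3 to 6 could be sketched as follows. The fetch callable stands in for the Get Results request. The max_attempts cap is an addition for safety and is not part of the workflow as described, which loops until data arrives. All names here are hypothetical.

```python
import time
from typing import Callable


def poll_until_ready(
    fetch: Callable[[], dict],
    initial_wait: float = 30.0,
    retry_wait: float = 10.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Wait, fetch, and retry while the response's data is empty."""
    sleep(initial_wait)  # step 3: give Firecrawl time to process
    for attempt in range(1, max_attempts + 1):
        result = fetch()  # step 4: GET the status URL
        if result.get("data"):  # step 5: non-empty data means results are ready
            return result
        if attempt < max_attempts:
            sleep(retry_wait)  # step 6: short delay before the next GET
    raise TimeoutError(f"no data after {max_attempts} attempts")
```

Injecting sleep keeps the loop testable without real delays. In production the default time.sleep applies.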

7. ๐Ÿ› ๏ธ Edit Fields (Optional Output Formatting)

  • Node: Edit Fields
  • Placeholder to structure or format the extracted results (quotes and authors)
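
One possible formatting step, sketched under the assumption that the extracted quotes arrive under data.quotes in the Get Results response. The output field names quote and author are illustrative, not taken from the template.

```python
def to_rows(result: dict) -> list[dict]:
    """Flatten the extraction result into one row per quote."""
    data = result.get("data") or {}  # tolerate a missing or empty data field
    quotes = data.get("quotes", [])  # assumption: quotes live under data.quotes
    return [
        {"quote": q["text"].strip(), "author": q["author"].strip()}
        for q in quotes
    ]
```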

🧾 Sticky Note: Firecrawl Setup Guide

Included as an embedded reference:

  • 🔗 10% Firecrawl Discount
  • 🧰 Instructions to:
    • Add Firecrawl API credentials in n8n
    • Use Firecrawl Community Node for self-hosted instances
    • Set up the schema and prompt for targeted data extraction

✅ Key Features

  • 🔌 API-based crawling with schema-structured output
  • ⏱️ Smart waiting + retry mechanism
  • 🧠 AI prompt integration for intelligent data parsing
  • ⚙️ Flexible for different URLs, prompts, and schemas

📦 Sample Output

{
  "quotes": [
    {
      "text": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
      "author": "Albert Einstein"
    },
    {
      "text": "It is our choices, Harry, that show what we truly are, far more than our abilities.",
      "author": "J.K. Rowling"
    }
  ]
}
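
To check that a result has the shape shown above before passing it downstream, a small structural check could look like this. It is a hand-written sketch, not a full JSON Schema validator, and the name is_valid_output is hypothetical.

```python
def is_valid_output(payload: object) -> bool:
    """Return True if payload has a quotes list of {text, author} string pairs."""
    if not isinstance(payload, dict):
        return False
    quotes = payload.get("quotes")
    if not isinstance(quotes, list):
        return False
    # An empty quotes list passes; emptiness is handled by the retry check.
    return all(
        isinstance(q, dict)
        and isinstance(q.get("text"), str)
        and isinstance(q.get("author"), str)
        for q in quotes
    )
```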