Automatically discover and extract article URLs from any website using AI to identify valid content links while filtering out navigation, category pages, and irrelevant content—perfect for building content pipelines, news aggregators, and research databases.
What Makes This Different:
- AI-Powered Intelligence - Uses GPT-5-mini to understand webpage context and identify actual articles vs navigation pages, eliminating false positives
- Browser Spoofing - Includes realistic User-Agent headers and request patterns to avoid bot detection on publisher sites
- Smart URL Normalization - Automatically strips tracking parameters (utm_*, fbclid, etc.), removes duplicates, and standardizes URLs
- Source Categorization - AI assigns logical source names based on domain and content type for easy filtering
- Rate Limiting Built-In - Configurable delays between requests prevent IP blocking and respect website resources
- Deduplication on Save - Google Sheets append-or-update pattern ensures no duplicate URLs in your database
Key Benefits of AI-Powered Content Discovery:
- Save 10+ Hours Weekly - Automate manual link hunting across dozens of publisher sites
- Higher Quality Results - AI filters out 95%+ of junk links (nav pages, categories, footers) that rule-based scrapers miss
- Scale Effortlessly - Add new seed URLs to your sheet and the same workflow handles any website structure
- Industry Agnostic - Works for news, blogs, research papers, product pages—any content type
- Always Up-to-Date - Schedule daily runs to catch new content as it's published
- Full Audit Trail - Track discovered URLs with timestamps and sources in Google Sheets
Who's it for
This template is designed for content marketers, SEO professionals, researchers, media monitors, and anyone who needs to aggregate content from multiple sources. It's perfect for organizations that need to track competitor blogs, curate industry news, build research databases, monitor brand mentions, or aggregate content for newsletters without manually checking dozens of websites daily or writing complex scraping rules for each source.
How it works / What it does
This workflow creates an intelligent content discovery pipeline that automatically finds and extracts article URLs from any webpage. The system:
- Reads Seed URLs - Pulls a list of webpages to crawl from your Google Sheets (blog indexes, news feeds, publication homepages)
- Fetches with Stealth - Downloads each webpage's HTML using browser-like headers to avoid bot detection
- Converts for AI - Transforms messy HTML into clean Markdown that the AI can easily process
- AI Extraction - GPT-5-mini analyzes the content and identifies valid article URLs while filtering out navigation, categories, and junk links
- Normalizes & Saves - Cleans URLs (removes tracking params), deduplicates, and saves to Google Sheets with source tracking
Key Innovation: Context-Aware Link Filtering - Unlike traditional scrapers that rely on CSS selectors or URL patterns (which break when sites update), the AI understands the semantic difference between an article link and a navigation link. It reads the page like a human would, identifying content worth following regardless of the website's structure.
How to set up
1. Create Your Google Sheets Database
- Create a new Google Spreadsheet with two sheets:
- "Seed URLs" - Add column
URL with webpages to crawl (blog homepages, news feeds, etc.)
- "Discovered URLs" - Add columns:
URL, Source, Status, Discovered At
- Add 3-5 seed URLs to start (e.g.,
https://abc.com/, https://news.xyz.com/)
2. Connect Your Credentials
- Google Sheets: Click the "Read Seed URLs" and "Save Discovered URLs" nodes → Select your Google Sheets account
- OpenAI: Click the "OpenAI GPT-5-mini" node → Add your OpenAI API key
- Select your spreadsheet and sheet names in both Google Sheets nodes
3. Customize the AI Prompt (Optional)
4. Test Your Configuration
- Click "Test Workflow" or use the Manual Trigger
- Check the execution to verify:
- Seed URLs are being read correctly
- HTML is fetched successfully (check for 200 status)
- AI returns valid JSON array of URLs
- URLs are saved to your output sheet
- Review the "Discovered URLs" sheet for results
5. Schedule and Monitor
- Adjust the Schedule Trigger (default: daily at 6 AM)
- Enable the workflow to run automatically
- Monitor execution logs for errors:
- Rate limiting: Increase wait time if sites block you
- Empty results: Check if seed URLs have changed structure
- AI errors: Review AI output in execution data
- Set up error notifications via email or Slack (add nodes after Completion Summary)
Requirements
- Google Sheets Account - OAuth2 connection for reading seed URLs and saving results
- OpenAI API Key - For GPT-5-mini (or swap for any LangChain-compatible LLM)