Detect semantic duplicate website pages with Google Drive, Postgres and Ollama

Created by

Last update

Last update 3 days ago

Quick overview

This workflow scans HTML files in a Google Drive folder, extracts and stores page text in Postgres, generates local vector embeddings with Ollama, and uses PGVector similarity searches to produce CSV reports that flag semantically duplicate website pages.

How it works

Starts manually and clears the existing PGVector embeddings table and the scraped page text table in Postgres.
Lists files in a specified Google Drive folder, filters to the target documents, and processes them in batches.
Downloads each HTML file from Google Drive, extracts the main body text, cleans it, and upserts the results into a Postgres table for scraped pages.
Reads the scraped page text back from Postgres in batches, splits it into overlapping chunks, and attaches page metadata (sheet_id, file_name, file_url) to each chunk.
Generates embeddings locally with Ollama and inserts the chunk vectors and metadata into Postgres (PGVector), deduplicating already-processed pages.
Builds an HNSW index in Postgres, computes chunk-to-chunk similarity matches and a pairwise page report, and exports the results as a CSV file.
Computes page-level centroid embeddings, finds highly similar page pairs, and exports a page-level duplicate report as a CSV file.

Setup

Add Google Drive OAuth2 credentials and set the Google Drive folder URL/ID used to scan for your HTML files.
Add Postgres credentials for a database with the pgvector extension enabled and permissions to create/alter tables and indexes (including HNSW indexes).
Add an Ollama credential and ensure the embedding model mxbai-embed-large:latest is available on your Ollama instance.
Confirm your source files are HTML documents and that the workflow’s text extraction and similarity thresholds match your content and desired duplicate sensitivity.

Requirements

Working instance of n8n, either self-hosted or on the cloud. Remember, this workflow can be computationally expensive.
Google Drive API (with OAuth setup in n8n credentials section)
Ollama (for open source models) or any Embedding model API
PostgreSQL with PGVector or any other vector database
PgAdmin (for PostgreSQL) or your interface to access database tables via SQL for troubleshooting (optional).

Additional info

Limitations and Enhancements:
Physical system memory
mxbai-embed-large Running through Ollama is free and private, but the embedding generation speed depends entirely on your hardware. The more system memory you have, the more data you can process in batches in the loop node.

Similarity threshold and boilerplate content
The cosine distance used in this workflow is 0.15 for chunk-level matching. And 0.05 (similarity above 95%) of the threshold is used for page-level centroid matching. This is only the starting point. Once you have the data, and especially if your data has more noise, you might need to tweak these thresholds for better matching.

This workflow needs HTML files to extract text
This workflow doesn't crawl a website or fetch pages by entering a URL. You need to download HTML files (rendered or source) for consumption.

Use parallel processing and Cloud APIs
Two sub-processes take the most time:

Downloading HTML files from Google Drive

Creating vector embeddings
If you can use parallel processing in n8n and execute these sub-processes in parallel, the process will be done much faster. Additionally, if you can use cloud APIs for embedding, it may save some you some processing time as well.

Use efficient SQL queries
Since I am from a non-tech background and not a coder, I used a mix of Gemini, Perplexity and Claude to create SQL codes for this workflow. If you're better at it, you can run computationally efficient queries that would help you achieve better results with less computation expense and time.