Process Tour PDF from Google Drive to Pinecone Vector DB with OpenAI Embeddings
This workflow automates the process of extracting tour information from PDF files stored in a Google Drive folder, processes and vectorizes the extracted data, and stores it in a Pinecone vector database for efficient querying. This is especially useful for building AI-powered search or recommendation systems for travel packages.
A folder in Google Drive with PDF tour package brochures.
Pinecone account + API key
OpenAI API key
n8n cloud or self-hosted instance
Manual Trigger (When clicking 'Test workflow'): Used for manual testing and execution of the workflow.
Step 1: Store Tour Packages in PDF Format
Upload your curated tour packages containing the tours, activities and sight-seeings in PDF format into a designated Google Drive folder.
Step 2: Search Folder
Node: PDF Tour Package Folder (Google Drive)
This node searches the designated folder for files (filter by MIME type = application/pdf if needed).
Step 3: Download PDFs
Node: Download Package Files (Google Drive)
Downloads each matching PDF file found in the previous step.
Step 4: Loop Through Files
Node: Loop Over each PDF file
Iterates through each downloaded PDF file to extract, clean, split, and embed.
Step 5: Data Loader
Node: Data Loader
Reads each PDF’s content using a compatible loader. It passes clean raw text to the next node.
Often integrated with document loaders like pdf-loader, Unstructured, or pdfplumber.
Step 6: Recursive Text Splitter
Node: Recursive Character Text Splitter
Splits large chunks of text into manageable segments using overlapping window logic (e.g., 500 tokens with 50 token overlap).
This ensures contextual preservation for long documents during embedding.
Step 7: Generate Embeddings
Node: Embeddings OpenAI
Uses text-embedding-3-small model to vectorize the split chunks.
Outputs vector representations for each content chunk.
Step 8: Pinecone Vector Store
Node: Pinecone Vector Store - Store...
Stores each embedding along with its metadata (source PDF name, chunk ID, etc.).
This becomes the basis for fast, semantic search via RAG workflows or agents.
Google Drive (Search & Download)
Searches for all PDF files in a specified Google Drive folder.
Downloads each file for processing.
SplitInBatches (Loop Over Items)
Loops through each file found in the folder, ensuring each is processed individually.
Default Data Loader (LangChain)
Reads and extracts text from the PDF files.
Recursive Character Text Splitter (LangChain)
Splits the extracted text into manageable chunks for embedding.
OpenAI Embeddings (LangChain)
Converts each text chunk into a vector using OpenAI’s embedding model.
Pinecone Vector Store (LangChain)
Stores the resulting vectors in a Pinecone index for fast similarity search and querying.
The workflow starts manually for testing or can be scheduled.
Finds all PDF files in the specified folder.
Each file is processed one at a time using the SplitInBatches node.
Downloads the current PDF file from Google Drive.
The Default Data Loader node reads the PDF and extracts its text content.
The Recursive Character Text Splitter breaks the text into chunks (e.g., 1000 characters with 50 overlap) to optimize embedding quality.
**Each chunk is sent to the OpenAI Embeddings node to generate vector representations.
The vectors are inserted into a Pinecone index, making them available for semantic search and recommendations.
Add error handling nodes to manage failed downloads or extraction issues gracefully.
Ensure only PDF files are processed by adding a filter node.
Store additional metadata (e.g., file name, tour ID) alongside vectors in Pinecone for richer search results.
Optimize for large folders by processing multiple files in parallel (with care for API rate limits).
Replace manual trigger with a time-based or webhook trigger for full automation.
Add checks to ensure extracted text contains valid tour data before vectorization.
Integrate notifications (e.g., email or Slack) to inform when processing is complete or if issues arise.
This workflow demonstrates how n8n can orchestrate a powerful AI data pipeline using Google Drive, LangChain, OpenAI, and Pinecone. It’s a great foundation for building intelligent search or recommendation features for travel and tour data.
Feel free to ask for more details or share your improvements!
Let me know if you want to see a specific part of the workflow or need help with a particular node!