Quick overview
This workflow runs on a schedule and syncs files from a Google Drive folder into a Pinecone vector index for retrieval-augmented generation (RAG). It extracts text from PDFs, XLSX files, Google Sheets, Google Docs, and plain-text files, generates embeddings with Google Gemini, and tracks file state in a Google Sheets log so that updates and deletions in Drive are carried over to the index.
How it works
- Runs on a schedule and fetches the current file list from a target Google Drive folder and the existing file log from Google Sheets.
- Compares Google Drive files with the Google Sheets log to detect new/updated files to ingest and files that were deleted from Drive.
- For new or updated files, deletes any existing vectors in Pinecone for the file ID, downloads the file from Google Drive, and routes it by MIME type.
- Extracts text from PDFs, XLSX/Google Sheets, and plain text/Google Docs files and maps the extracted content with file metadata (file ID, name, modified time, and MIME type).
- Chunks the document text, generates embeddings with Google Gemini, and inserts the resulting vectors and metadata into a Pinecone index.
- Appends or updates the Google Sheets log with the latest file metadata. For files deleted from Drive, deletes the matching vectors in Pinecone and removes the corresponding log rows.
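
The comparison and routing steps above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual node logic: the names plan_sync, route_by_mime, and SyncPlan and the route labels are hypothetical, and the sketch assumes that a file counts as updated whenever its Drive modified time differs from the logged one.

```python
from typing import Dict, List, NamedTuple

# MIME types for the formats named above. The route labels on the right
# are hypothetical names for the extraction branches.
MIME_ROUTES: Dict[str, str] = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheet",
    "application/vnd.google-apps.spreadsheet": "spreadsheet",
    "application/vnd.google-apps.document": "text",
    "text/plain": "text",
}


class SyncPlan(NamedTuple):
    to_ingest: List[str]  # file IDs that are new or changed in Drive
    to_delete: List[str]  # file IDs in the log that are no longer in Drive


def plan_sync(drive_files: Dict[str, str], logged_files: Dict[str, str]) -> SyncPlan:
    """Compare Drive against the log.

    Both arguments map a file ID to its modified timestamp string.
    Assumption: any timestamp mismatch is treated as an update.
    """
    to_ingest = sorted(
        fid for fid, modified in drive_files.items()
        if logged_files.get(fid) != modified
    )
    to_delete = sorted(fid for fid in logged_files if fid not in drive_files)
    return SyncPlan(to_ingest, to_delete)


def route_by_mime(mime_type: str) -> str:
    """Pick an extraction branch, or 'unsupported' for unknown types."""
    return MIME_ROUTES.get(mime_type, "unsupported")
```

Calling plan_sync with the current Drive listing and the log contents yields the file IDs to re-ingest and the file IDs whose vectors and log rows should be removed. Files in to_ingest would then be passed through route_by_mime to choose the extractor.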
Setup
- Connect Google Drive OAuth2 credentials and set the folder ID to the Drive folder you want to sync.
- Connect Google Sheets OAuth2 credentials and set the spreadsheet and sheet used as the sync log (it must include at least file_id, name, and modifiedTime columns).
- Create or select a Pinecone index (for example, gdrive-rag), add Pinecone API credentials, and make sure the request that deletes vectors points at your index's delete endpoint and sends your API key in the Api-Key header.
- Add Google Gemini (PaLM) API credentials for the embeddings model used by the workflow.
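
As a rough guide to what a correctly configured Pinecone delete call looks like, the sketch below builds (but does not send) a request that deletes vectors by a metadata filter. It rests on several assumptions: the function name build_delete_request is hypothetical, the index host is a placeholder you would copy from the Pinecone console, and the metadata key file_id is a guess at how the workflow tags vectors. Whether delete-by-filter is available can depend on your index type and API version, so check the Pinecone documentation for your index.

```python
import json
from typing import Dict


def build_delete_request(
    index_host: str,
    api_key: str,
    file_id: str,
    metadata_key: str = "file_id",  # assumed metadata field name
) -> Dict[str, object]:
    """Describe a Pinecone delete-by-metadata request without sending it."""
    # Filter that matches every vector whose metadata field equals file_id.
    body = {"filter": {metadata_key: {"$eq": file_id}}}
    return {
        "method": "POST",
        "url": f"https://{index_host}/vectors/delete",
        "headers": {
            "Api-Key": api_key,  # Pinecone expects the key in this header
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }
```

The returned method, URL, headers, and body can be compared field by field with the settings of the request that performs the deletion in your copy of the workflow.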