Extract and Structure Thai Documents to Google Sheets using Typhoon OCR and Llama 3.1

Created by Jaruphat J.

Last updated a month ago

Who is this for?

This template is for developers, operations teams, and automation builders in Thailand (or any Thai-speaking environment) who regularly process PDFs or scanned documents in Thai and want to extract structured text into a Google Sheet.

It is ideal for:

Local government document processing
Thai-language enterprise paperwork
AI automation pipelines requiring Thai OCR

What problem does this solve?

Typhoon OCR is one of the most accurate OCR tools for Thai text. However, integrating it into an end-to-end workflow usually requires manual scripting and data wrangling.

This template solves that by:

Running Typhoon OCR on PDF files
Using AI to extract structured data fields
Automatically storing results in Google Sheets

What this workflow does

Trigger: Run manually or from any automation source
Read Files: Load local PDF files from a doc/ folder
Execute Command: Run Typhoon OCR on each file using a Python command
LLM Extraction: Send the OCR markdown to an LLM (e.g., Llama 3.1 via OpenRouter, or GPT-4) to extract the target fields
Code Node: Parse the LLM output as JSON
Google Sheets: Append structured data into a spreadsheet
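
The "parse the LLM output as JSON" step above can be sketched as follows. This is a minimal illustration of the logic, not the template's actual Code node (which may be written differently); the function name parse_llm_json is hypothetical. It assumes the model may wrap its answer in a Markdown code fence or add surrounding text.

```python
import json
import re

# Built at runtime so this example contains no literal code fence of its own.
FENCE = "`" * 3


def parse_llm_json(reply: str) -> dict:
    """Return the first JSON object found in an LLM reply as a dict."""
    # If the model wrapped its answer in a Markdown fence, keep only the inside.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, reply, flags=re.DOTALL)
    text = fenced.group(1) if fenced else reply

    # Keep the span from the first "{" to the last "}" to drop stray prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(text[start:end + 1])
```

Raising an error on a reply without JSON, rather than returning an empty record, keeps malformed rows out of the spreadsheet.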

Setup

1. Install Requirements

Python 3.10+
typhoon-ocr: pip install typhoon-ocr
Poppler, installed and added to your system PATH (provides pdftoppm and pdfinfo)
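
Before running the workflow, it can help to confirm these requirements from Python. The sketch below is one possible check, not part of the template; the function name missing_requirements is hypothetical, and it assumes the typhoon-ocr package is importable as typhoon_ocr.

```python
import importlib.util
import shutil
import sys


def missing_requirements(min_python=(3, 10),
                         tools=("pdftoppm", "pdfinfo"),
                         modules=("typhoon_ocr",)) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH (install Poppler)")
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module {module} is not installed")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print("MISSING:", problem)
```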

2. Create folders

Create a folder called doc in the directory where n8n runs (or, if n8n runs in Docker, mount the folder as a volume)

3. Google Sheet

Create a Google Sheet with the following column headers:

book_id	date	subject	detail	signed_by	signed_by2	contact	download_url

You can use this example Google Sheet as a reference.
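
As a rough illustration of how an extracted record lines up with these headers, the sketch below orders a parsed dict into a row. The helper name to_row is hypothetical; in the workflow itself the Google Sheets node handles the mapping.

```python
# Column order copied from the sheet headers listed above.
COLUMNS = [
    "book_id", "date", "subject", "detail",
    "signed_by", "signed_by2", "contact", "download_url",
]


def to_row(record: dict) -> list:
    """Order a parsed record by COLUMNS; missing fields become empty cells."""
    row = []
    for column in COLUMNS:
        value = record.get(column)
        row.append("" if value is None else str(value))
    return row
```

Keys that are not in the header list are ignored, so an extra field in the LLM output does not shift the columns.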

4. API Key

Export TYPHOON_OCR_API_KEY and OPENAI_API_KEY as environment variables, or set them inside the command string in the Execute Command node.
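
A quick way to confirm that both keys are visible to the process is sketched below. The helper name missing_keys is hypothetical; only the two variable names come from this template.

```python
import os

# Variable names taken from the setup step above.
REQUIRED_KEYS = ("TYPHOON_OCR_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS) -> list:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    for name in missing_keys():
        print(f"{name} is not set")
```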

How to customize this workflow

Replace the LLM provider in the Basic LLM Chain node (the template is currently set up for OpenRouter)
Change output fields to match your data structure (adjust the prompt and Google Sheet headers)
Add trigger nodes (e.g., Dropbox Upload, Webhook) to automate input
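
When changing the output fields, the prompt and the sheet headers need to stay in sync. One way to keep them aligned is to generate the prompt from a single field list, as in the sketch below. The prompt wording and the function name build_extraction_prompt are illustrative assumptions, not the prompt shipped with the template.

```python
import json


def build_extraction_prompt(ocr_markdown: str, fields: list) -> str:
    """Build an extraction prompt that asks for exactly the given JSON keys."""
    # An empty template shows the model the expected keys and value type.
    template = {name: "" for name in fields}
    return (
        "Extract the following fields from the Thai document below.\n"
        "Reply with a single JSON object using exactly these keys; "
        "use an empty string for any field that is not present.\n\n"
        f"{json.dumps(template, ensure_ascii=False, indent=2)}\n\n"
        "Document:\n"
        f"{ocr_markdown}"
    )
```

Passing the same list that defines the sheet headers means a new field only has to be added in one place.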

About Typhoon OCR

Typhoon is a multilingual LLM and toolkit optimized for Thai NLP. It includes typhoon-ocr, an open-source Python OCR library designed for Thai-centric documents that fits well into automation pipelines. It is well suited to government paperwork, PDF reports, and multilingual documents in Southeast Asia.
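
For orientation, calling the library directly from Python might look like the sketch below. It assumes typhoon-ocr exposes an ocr_document function that accepts a file path and a page_num argument and reads the API key from TYPHOON_OCR_API_KEY; check the package README for the exact signature in your installed version. The helper names list_pdfs and ocr_pdf are hypothetical.

```python
from pathlib import Path


def list_pdfs(folder="doc") -> list:
    """Return the PDF files in the given folder, sorted by path."""
    return sorted(p for p in Path(folder).glob("*.pdf") if p.is_file())


def ocr_pdf(path, page_num=1) -> str:
    """Run Typhoon OCR on one page of a PDF and return the markdown text."""
    # Imported lazily so list_pdfs works without typhoon-ocr installed.
    # ASSUMPTION: ocr_document and its page_num parameter match the README.
    from typhoon_ocr import ocr_document
    return ocr_document(str(path), page_num=page_num)


if __name__ == "__main__":
    for pdf in list_pdfs("doc"):
        print(f"--- {pdf.name} ---")
        print(ocr_pdf(pdf))
```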