Easy image captioning with Gemini 1.5 Pro

Created by

Jimleuk

Last update

Last update a year ago

How it works

For this demo, we'll import a public image from a popular stock photography website, Pexel.com, into our workflow using the HTTP request node.
With multimodal LLMs, there is little do preprocess other than ensuring the image dimensions fit within the LLMs accepted limits. Though not essential, we'll resize the image using the Edit image node to achieve fast processing.
The image is used as an input to the basic LLM node by defining a "user message" entry with the binary (data) type.
The LLM node has the Gemini 1.5 Pro language model attached and we'll prompt it to generate a caption title and text appropriate for the image it sees.
Once generated, the generated caption text is positioning over the original image to complete the task. We can calculate the positioning relative to the amount of characters produced using the code node.

Not using Google Gemini? n8n's basic LLM node supports the standard syntax for image content for models that support it - try using GPT4o, Claude or LLava (via Ollama).
Google Drive is only used for demonstration purposes. Feel free to swap this out for other triggers such as webhooks to fit your use case.