Multimodal Telegram bot with voice, image & video analysis using Claude & Gemini

Last update a month ago

Quick overview

This is a starting point for building a Telegram AI agent. The base handles four input types: voice, pictures, video, and text, through the AI models of your choice. From here you connect tools to expand what the agent can do inside your n8n workflows.

How it works

Input: a message sent to the bot chat.
A Switch node sorts the message by type:
Voice message
Picture message
Video message
Text message
The workflow currently uses OpenAI for voice and Gemini for pictures and video, but you can swap in other models. The model reads the message, generates a response guided by the system prompt, and the workflow sends that response back as a Telegram message.
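
To make the Switch step concrete, here is a minimal TypeScript sketch of the same routing logic outside n8n. It relies on the voice, photo, video, and text fields of the Telegram Bot API Message object. The routeMessage function, the Route type, and the "unsupported" fallback are hypothetical names chosen for illustration and are not part of the workflow itself.

```typescript
// Minimal subset of the Telegram Bot API Message object; only the fields
// needed for routing are declared here.
interface TelegramMessage {
  text?: string;
  voice?: { file_id: string };
  photo?: Array<{ file_id: string }>;
  video?: { file_id: string };
}

// The four branches from the Switch node, plus an assumed fallback for
// message types the workflow does not handle (stickers, documents, ...).
type Route = "voice" | "picture" | "video" | "text" | "unsupported";

// Hypothetical helper: decide which branch a message would take.
function routeMessage(message: TelegramMessage): Route {
  if (message.voice) return "voice";
  if (message.photo && message.photo.length > 0) return "picture";
  if (message.video) return "video";
  if (typeof message.text === "string") return "text";
  return "unsupported";
}
```

Media fields are checked before text because a picture or video message carries its caption in a separate field, so a plain text match is the natural last branch.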

Setup

Create the Telegram bot. In Telegram, search for "BotFather", send /newbot, follow the prompts, and copy the access token.
Add the Telegram credential in n8n. Open the Telegram trigger node, create a credential, paste the access token, and save.
Add the LLM credentials. Add your OpenAI and Gemini API keys (or keys for any other models you prefer) to the LLM nodes, pick a model in each node, and make sure each account has credits. Guides: OpenAI (voice) → https://winflowai.com/blog/get-openai-api-key/ and Google Gemini (images and video) → https://winflowai.com/blog/get-gemini-api-key/
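
Before pasting the access token into n8n, it can help to confirm that Telegram accepts it. The sketch below calls the Bot API getMe method, which returns a JSON body with ok set to true for a valid token. The names buildGetMeUrl, checkBotToken, and FetchLike are hypothetical, and the fetch function is passed in as a parameter so the check can run against the global fetch (Node 18 or later) or a stub.

```typescript
const TELEGRAM_API_BASE = "https://api.telegram.org";

// Build the getMe URL for a bot token (Bot API methods live under /bot<token>/).
function buildGetMeUrl(token: string): string {
  return `${TELEGRAM_API_BASE}/bot${token}/getMe`;
}

// Smallest fetch shape this sketch needs; the global fetch satisfies it.
type FetchLike = (
  url: string
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical helper: resolves to true when Telegram accepts the token.
async function checkBotToken(
  token: string,
  fetchFn: FetchLike
): Promise<boolean> {
  const response = await fetchFn(buildGetMeUrl(token));
  if (!response.ok) return false; // e.g. HTTP 401 for a rejected token
  const body = (await response.json()) as { ok?: boolean };
  return body.ok === true;
}
```

A call such as checkBotToken(token, fetch) would perform the real request. Treat the token like a password and keep it out of logs and source control.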

Requirements

Telegram bot access token
OpenAI API key (voice)
Google Gemini API key (pictures and video)
n8n instance (Cloud or self-hosted)

Customization

Adjust the system prompt to shape the agent's output, and add tools to take it beyond conversation.