Narrating over a Video using Multimodal AI

This n8n template takes a video, extracts frames from it, and uses a multimodal LLM to generate a narration script from those frames. The script is then sent back to OpenAI to generate a voiceover clip.

This template was inspired by the OpenAI cookbook example "Processing and narrating a video with GPT's visual capabilities and the TTS API".

How it works

  • The video is downloaded using the HTTP Request node.
  • A Python Code node extracts frames from the video using OpenCV (see the sketch after this list).
  • A Loop node batches the frames so the LLM can generate partial scripts.
  • All partial scripts are combined into the full script, which is then sent to OpenAI to generate audio (also sketched below).
  • The finished voiceover clip is uploaded to Google Drive.
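
For reference, the frame-extraction step inside the Code node might look something like the following. This is a minimal sketch, assuming the video has already been saved to disk and that sampling one frame out of every 25 is enough for narration; the function name and parameters are illustrative, not part of the template itself.

```python
# Minimal sketch of the OpenCV frame-extraction step (names and sampling rate are assumptions).
import base64
import cv2

def extract_frames(video_path: str, every_n_frames: int = 25) -> list:
    """Sample frames from the video and return them as base64-encoded JPEGs."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while capture.isOpened():
        success, frame = capture.read()
        if not success:
            break
        if index % every_n_frames == 0:
            # Re-encode the frame as JPEG so it can be passed to the multimodal LLM as an image.
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    capture.release()
    return frames
```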
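Similarly, the text-to-speech step that turns the combined script into audio could be sketched as below; the model and voice names are assumptions, so swap in whichever OpenAI TTS model and voice you prefer.

```python
# Rough sketch of the OpenAI text-to-speech call (model and voice are assumed, not prescribed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def narrate(script: str, out_path: str = "voiceover.mp3") -> str:
    """Generate a voiceover clip from the full script and save it to disk."""
    response = client.audio.speech.create(
        model="tts-1",   # assumed TTS model
        voice="alloy",   # assumed voice
        input=script,
    )
    # Write the returned audio bytes to disk before uploading to Google Drive.
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```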

Sample the finished product here: https://drive.google.com/file/d/1-XCoii0leGB2MffBMPpCZoxboVyeyeIX/view?usp=sharing

Requirements

  • An OpenAI account for the LLM and text-to-speech.
  • Ideally, a mid-range machine (16 GB RAM) for acceptable performance!

Customising this workflow

  • For larger videos, consider splitting them into smaller clips for better performance (a rough splitting sketch follows below).
  • Use a multimodal LLM which fully supports video, such as Google's Gemini.
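
If you do split longer videos first, one possible approach is to chop the source into fixed-length clips with OpenCV before feeding each clip through the workflow. This is only a sketch; the clip length and file naming are assumptions.

```python
# Rough sketch of pre-splitting a long video into fixed-length clips (clip length and names are assumptions).
import cv2

def split_video(video_path: str, seconds_per_clip: int = 60) -> list:
    """Split the video into clips of roughly seconds_per_clip and return the clip paths."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25
    width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames_per_clip = int(fps * seconds_per_clip)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")

    clips, writer, index = [], None, 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if index % frames_per_clip == 0:
            # Start a new output clip every frames_per_clip frames.
            if writer:
                writer.release()
            clip_path = f"clip_{index // frames_per_clip:03d}.mp4"
            writer = cv2.VideoWriter(clip_path, fourcc, fps, (width, height))
            clips.append(clip_path)
        writer.write(frame)
        index += 1
    if writer:
        writer.release()
    capture.release()
    return clips
```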