This workflow lets you evaluate and compare the outputs of two large language models (LLMs) before choosing one for production.
In the chat interface, both model outputs are shown side by side. The responses are also logged to a Google Sheet, where they can be evaluated manually or automatically using a more advanced model.
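If you want to prototype the same comparison loop outside n8n, here is a minimal Python sketch assuming the official openai SDK; the prompt, the model names, and the local CSV (a stand-in for the Google Sheet) are illustrative, not part of the template:

```python
import csv

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODELS = ["gpt-4.1", "gpt-4.1-mini"]  # the two candidates to compare


def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


prompt = "Summarize the plot of Hamlet in two sentences."
answers = {model: ask(model, prompt) for model in MODELS}

# One row per prompt, with both answers side by side for later review.
with open("comparisons.csv", "a", newline="") as f:
    csv.writer(f).writerow([prompt, answers[MODELS[0]], answers[MODELS[1]]])
```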
You're developing an AI agent, and since LLMs are non-deterministic, you want to determine which one performs best for your specific use case. This template is designed to help you compare them effectively.
Note: This version is set up for two models. If you want to compare more, you’ll need to extend the workflow logic and update the sheet.
You can use OpenRouter or Vertex AI to test models across providers.
If you're using a node for a specific provider, like OpenAI, you can compare different models from that provider (e.g., gpt-4.1 vs gpt-4.1-mini).
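Because OpenRouter exposes an OpenAI-compatible endpoint, the same comparison can span providers with only the base URL, key, and model slugs changed. A hedged sketch (the key placeholder and model slugs below are examples):

```python
from openai import OpenAI

# Same SDK as above; OpenRouter routes provider-prefixed model slugs.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

for model in ["openai/gpt-4.1", "anthropic/claude-3.5-sonnet"]:  # example slugs
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain DNS in one paragraph."}],
    )
    print(f"--- {model} ---\n{response.choices[0].message.content}\n")
```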
This is ideal for teams, allowing non-technical stakeholders (not just data scientists) to evaluate responses based on real-world needs.
Advanced users can automate this evaluation using a more capable model (like o3 from OpenAI), but note that this will increase token usage and cost.
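A sketch of what that automated pass could look like, using a simple LLM-as-judge pattern; the judge prompt and the A/B/TIE scale are assumptions for illustration, not the template's built-in logic:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; adjust the criteria to your use case.
JUDGE_PROMPT = """You are evaluating two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Which answer is better? Reply with exactly one word: A, B, or TIE."""


def judge(question: str, a: str, b: str) -> str:
    """Ask a stronger model (o3 here) to pick the better answer."""
    response = client.chat.completions.create(
        model="o3",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, a=a, b=b),
        }],
    )
    return response.choices[0].message.content.strip()
```

The verdict can then be written back to the same sheet row alongside the two answers.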
Since each input is processed by two different models, the workflow consumes roughly twice the tokens of a single-model run. Keep an eye on usage, especially with longer prompts or repeated evaluations, as this directly affects cost.
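Each API response reports its own token counts, so you can measure the per-comparison cost before scaling up. A small self-contained sketch (same placeholder models as above):

```python
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4.1", "gpt-4.1-mini"]  # placeholder candidates
prompt = "Summarize the plot of Hamlet in two sentences."

# Every chat.completions response carries a `usage` object; summing
# total_tokens across both calls shows the doubled cost per input.
total = 0
for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    total += response.usage.total_tokens
print(f"Tokens consumed for this comparison: {total}")
```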