A technical evaluation of workflow-based automation tooling for building enterprise-grade agentic systems using LLMs. This is the second iteration of the report, conducted by independent research analyst Andrew Green in Q2 2026.

Workflow-based AI Agent Development Tools are products for enterprises which offer a no-code/low-code development environment to automate business logic using LLMs. They allow users to define an automation sequence using both deterministic actions and self-governing agents.
A common critique is that these tools do not create authentic agents, as they are not fully self-governing and require users to have prior knowledge of how a flow looks. I therefore want to state clearly that the intent of this report is to evaluate agent-based automation for enterprises. While it may be acceptable for solopreneurs to delegate their calendars and emails to fully autonomous agents, that is not an acceptable scenario in an enterprise.

This report evaluates the enterprise qualities of these AI agents. I therefore distinguish between an enterprise-grade agent and an enterprise-grade agent development tool, as these capabilities cut both ways. I find that both humans and agents interpret them whichever way suits them in a given scenario.
For example, take authentication and authorization. This takes two forms
This report exclusively focuses on the second aspect. The same distinction applies across triggers, code execution, sandboxing, filesystem access, API call logs, kill switches, rate limits, and the rest.
Writing a prompt asking the LLM not to hallucinate or disclose sensitive data does not qualify as a security feature.
I have also excluded other non-AI product features such as tool hosting and form factor, or monitoring and error handling of the wider workflow.

Only a handful of vendors offer a sandbox as a security boundary for untrusted, LLM-generated code. While roughly half of the vendors offer some incarnation of code execution, even fewer have sandboxing. Of those with sandboxing, most rely on third-party services, most commonly E2B.
CrewAI notably deprecated its native code execution service and suggested customers use E2B as a purpose-built sandbox. Some don’t offer a conventional microVM or virtual kernel, but rather use process isolation through a self-hosted configuration.
Most marketing assets conflate "the agent uses an API key to call Anthropic" with "the agent provides credentials when accessing third party services." Only Google, Langflow, Workato, CrewAI, Sim.ai, and Gumloop score 2.
Lineage, which refers to the ability to trace an agent to a human identity, is essentially non-existent across the market, with only Google, Workato, and Gumloop scoring anything.
Secrets management is similarly thin: only Google, Sim.ai, and Gumloop score 2, with Make and Retool scoring 1. This matters most in enterprise contexts where agents are calling third-party APIs or accessing internal systems.
Most tools don’t really have a security-first mindset. Google and Gumloop stood out as the tools most concerned with security, being the only ones to offer all of the following: proxy-based filtering and firewalling, policy definition, tool ABAC, authentication and authorization, lineage, and secrets management.
Some vendors use evaluations as a way to define guardrails and security policies. For example, some use a “does the answer contain PII” evaluation to enforce a “don’t disclose PII” guardrail, which in effect becomes a data loss prevention security policy.
This can, for example, use an LLM-as-judge to detect PII, whose verdict is then sent to a summarizer agent that non-deterministically decides not to share the PII.
This is different from running a deterministic regex rule that detects social security numbers and replaces them with asterisks and a slap on the wrist.
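To make the contrast concrete, here is a minimal sketch of the deterministic approach. It assumes US social security numbers written in the `123-45-6789` layout; the pattern and the `redact_ssn` helper are illustrative names, not taken from any vendor in this report.

```python
import re

# Assumption: SSNs appear in the dashed 3-2-4 digit layout.
# Real DLP rules usually cover more formats and validate number ranges.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(text: str) -> str:
    """Replace every SSN-shaped token with asterisks, keeping the layout."""
    return SSN_PATTERN.sub("***-**-****", text)


if __name__ == "__main__":
    print(redact_ssn("Customer SSN is 123-45-6789, please confirm."))
```

The same input always yields the same output, which is what makes a rule like this auditable in a way that an LLM-as-judge followed by a summarizer agent is not.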
MCP host/client functionality, where agents consume external MCP servers, is commonplace. MCP server functionality, where the platform itself is exposed as an MCP server for other agents to call, is similarly widespread. By contrast, Google's agent-to-agent protocol is only employed by Google (obvs), CrewAI, Retool, and Sim.ai.
There is nuance in MCP implementation, but generally speaking it is a commoditized feature.
However much you want to position these tools for citizen developers, you cannot ignore that sophisticated automation still requires technical knowledge. Most people with technical knowledge know how to code, so offering a coding environment, which can be as simple as a “run script” action, is a huge asset to the product.
Between the rather lackluster agent code execution and running human-written code, there are NO PLATFORMS that do both to a good degree. Not even Google.
Considering how important evaluations are for determining how an agent and workflow are performing, surprisingly many vendors do not implement them.
Evaluations were hard to define for the report. There are many ways you can implement evaluations, and it is difficult to write an exhaustive list. This report has mainly focused on evaluating agents against known answers, as this is one of the better ways of preventing hallucinations. These evaluations include matches, semantic similarity/relevancy, and factual correctness. More generic LLM-as-judge and custom evaluations, which can mean anything, are also included.
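As a rough illustration of evaluating against known answers, the sketch below implements a normalized exact match and a token-overlap score. The overlap score is only a lexical stand-in for semantic similarity, which in practice is usually computed with embeddings; the function names are hypothetical and not drawn from any evaluated tool.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(answer: str, expected: str) -> bool:
    """True when answer and expected agree after normalization."""
    return normalize(answer) == normalize(expected)


def token_overlap(answer: str, expected: str) -> float:
    """Jaccard overlap of token sets, a crude proxy for semantic similarity."""
    a, e = set(normalize(answer)), set(normalize(expected))
    if not a and not e:
        return 1.0
    return len(a & e) / len(a | e)


if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))
    print(token_overlap("The capital is Paris", "Paris"))
```

A platform would typically run checks like these over a dataset of prompts with reference answers and flag runs that fall below a chosen threshold.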

This report is considerably different from the one written last year. While the premise remains consistent and most of the landscape includes familiar names, there are a number of changes which I will state and explain below:
The code-based and no-code duality continues this year. Users can use these tools to mix and match between the two. Generally speaking, the no-code canvas is used to define the high-level logic, while code-based snippets (usually hosted within a node in an automation tool) perform custom tasks which cannot be defined with the canvas.
However, we introduce another dimension, which is AI-generated code execution. This refers to LLMs autonomously producing code to complete a task, with the platform offering a code execution environment to run the actions. Agents may self-determine how best to complete a task without needing pre-defined tools. For example, given two different data sources, the LLM may write a Python script to transform and normalize data before performing correlation or analysis. To support agents writing and reading files, we’ve also added a filesystem access metric.
Alongside the code execution environment, which really means that the agent can write anything, you need to offer a safe and sanitized way of running the code prior to executing it in production. So, we’ve added the agent sandbox metric.
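The sketch below shows the weakest form of this idea: running generated code in a separate Python process with a timeout and a throwaway working directory. This is plain process isolation, not a security boundary in the microVM sense, and the `run_generated_code` helper is a hypothetical name used only for illustration.

```python
import subprocess
import sys
import tempfile


def run_generated_code(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run untrusted Python source in a child process.

    Assumption: this only limits runtime and working directory.
    It does not restrict network, filesystem, or syscalls.
    """
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
            cwd=workdir,           # files the code writes land in a temp dir
            capture_output=True,   # collect stdout and stderr for logging
            text=True,
            timeout=timeout_s,     # raises subprocess.TimeoutExpired on overrun
        )


if __name__ == "__main__":
    result = run_generated_code("print(sum([1, 2, 3]))")
    print(result.returncode, result.stdout.strip())
```

A purpose-built sandbox adds what this sketch lacks, namely kernel-level isolation and control over network and filesystem access.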
In the previous iteration of the report, we distinguished between native agentic AI development tools and workflow automation tools that pivoted into AI agents. One year on, this differentiation is still accurate but should not influence buying decisions. AI-native products have had time to develop enterprise-grade features, and workflow-based automation platforms have had time to commit to implementing AI agents to a high standard in their product.
I also noted that almost every tool on the market has opted for a no-code workflow development GUI. This is further validated by more vendors developing these exact capabilities, including the likes of OpenAI and Google, as well as comparably smaller players who developed them over the past year, such as CrewAI. Today, there are more of these tools than I can count, so this report will only evaluate a representative slice of the market.
Vanilla LLM services such as Claude and ChatGPT offer features such as web search and reasoning natively. I therefore removed those from the evaluation list. Basic no-code functionalities such as swappable components and sequential agents are also removed. Low-level controls over temperature and top-k on an LLM are easy to integrate and provide little benefit, so they are now removed.
“Integrations” is a very 2022 iPaaS-style mindset that people in the AI community don’t seem to identify with. I’ve therefore removed the Integratability axis and repurposed some of the integration features across the Codability and Enterprise Readiness axes. The integration burden has also shifted unequally to the server side, where those who want to expose their services to AI agents need to publish an MCP server. This differs from APIs, where servers had to expose their services via APIs and consumers had to integrate with these exposed APIs. In an MCP world, MCP clients autonomously decide when and how to interact with the servers.

This section defines the evaluation criteria used to compare a selection of tools in the AI Agent development market. It provides a comprehensive list of features that can support developers in creating production-ready AI Agent applications and integrating them in their existing business and technology stack.
For each feature, vendors will be scored as follows:
0 - Feature is absent or unstated
1 - Feature is partially available or achieved via third-party integrations
2 - Feature is available natively in the tool
The depth of the feature only goes as far as the definition below. For example, if a tool offers data loss prevention features, it will get a 2, regardless of how simple or complex the feature is.
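As a worked illustration of how per-feature scores could roll up into an axis percentage, the sketch below assumes each feature is scored 0, 1, or 2 and that the percentage is the total divided by the maximum possible total. That aggregation formula is an assumption made for this example; the report does not spell it out.

```python
def axis_percentage(scores: list[int]) -> float:
    """Convert 0/1/2 feature scores into a percentage of the maximum.

    Assumption: every feature carries equal weight.
    """
    if not scores:
        raise ValueError("at least one feature score is required")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("scores must be 0, 1, or 2")
    return 100 * sum(scores) / (2 * len(scores))


if __name__ == "__main__":
    # Four hypothetical features: native, partial, absent, native.
    print(axis_percentage([2, 1, 0, 2]))
```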
To assess each vendor, I did the following:
First, I went through all the documentation manually and populated the spreadsheet with everything I could find.
Second, for the criteria where I did not find the information, I checked the website and other resources, used the search function in the docs, or wrote a site:tools.com query with keywords for the capabilities I was looking for.
Lastly, for those with AI-powered documentation search, I asked the AI whether the tool offered the features.
Vendors that cannot be assessed based on publicly available documentation will be excluded.
This section evaluates tools’ capabilities for configuring AI models, leveraging frameworks and optimizing the use of AI.
Triggers
Developer code management
Agent code management
Human-Agent interaction
RAG support
Agentic system building
LLM evaluations
Context management
Integrations
Enterprise Readiness is a catch-all term that defines how an LLM can be deployed and configured in a responsible way. It makes the difference between a crude personal agent used by consumers or solopreneurs and a responsible deployment suitable for organizations that actually deal with customer data.
Traceability and Observability
Security
Agent Identity
Guardrails
LLM hosting
Third-party API Management
*Google’s EAP is not comparable with the other tools here in terms of procurement, provisioning, and management.
| | CrewAI | Dify | Flowise | Google Gemini EAP | Gumloop | Langflow | Make | n8n | OpenAI | Retool | Sim.ai | StackAI | Tines | Workato |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codability (%) | 65 | 59 | 63 | 80 | 71 | 35 | 41 | 72 | 65 | 55 | 76 | 62 | 58 | 42 |
| Enterprise (%) | 48 | 46 | 37 | 74 | 48 | 30 | 46 | 54 | 48 | 50 | 54 | 52 | 63 | 54 |

Considering the research is based on vendors’ technical documentation, the scoring is directly tied to the quality of the technical documentation. This means that undocumented features may still be present in the tool, in which case, they won’t be reflected in the scoring.
The assessment is not conducted through user testing, so user experience is not in scope. This is comparable to evaluating cars without driving them. We can have an intuitive understanding of how a hatchback, performance SUV, or electric people carrier differ in terms of both usage and experience. With enough low level detail, we can compare cars in the same category, such as differences between an M5 F90 and G90.
The assessment is not based on benchmarking, which means that the evaluation does not include the tools’ behavior under stress.
The evaluation criteria are intended to be as comprehensive as possible, which means that some, or many, of the features we evaluate may not be relevant or applicable to your use cases. We therefore recommend looking at the complete scores rather than the final average.
We have not engaged with any of the vendors featured in the report prior to publishing it. If any vendors have corrections they want to make, I invite them to send me their comments, which I will evaluate when updating the report.
Triggers - this metric evaluates how AI agents are triggered within a process or workflow
Developer code management - this metric evaluates features for managing code produced by human developers
Agent code management - this metric evaluates features for managing code produced by coding agents
Human-Agent interaction - this evaluates how end-users can interact with the agents
RAG support - a selection of features which improve retrieval augmented generation
Agentic system building - these are capabilities that allow end-users to define flexible, efficient, and predictable multi-agent architectures
LLM evaluations
Context management
Integrations
Traceability and Observability
Security
Agent Identity
Guardrails
LLM hosting
Third-party API Management