A Re-evaluation of
Workflow-based AI Agent Development Tools

A technical evaluation of workflow-based automation tooling for building enterprise-grade agentic systems using LLMs. This is the second iteration of the report, conducted by independent research analyst Andrew Green in Q2 2026.

Workflow-based AI Agent Development Tools are products for enterprises which offer a no-code/low-code development environment to automate business logic using LLMs. They allow users to define an automation sequence using both deterministic actions and self-governing agents.

A common critique is that these tools do not create authentic agents, as they are not fully self-governing and require users to know in advance what a flow looks like. I therefore want to state clearly that the intent of this report is to evaluate agent-based automation for enterprises. While it may be acceptable for solopreneurs to delegate their calendars and emails to fully autonomous agents, that is not an acceptable scenario in an enterprise.

What enterprise-grade means:

This report is evaluating the enterprise qualities of these AI agents. I therefore distinguish between an enterprise-grade agent and an enterprise-grade agent development tool, as these capabilities cut both ways. I find that both humans and agents interpret them whichever way they want in a given scenario.

For example, take authentication and authorization. This takes two forms:

Auth for the agent development tool, where human users have accounts provisioned, inherit permissions from the organization’s identity provider, use SSO and MFA to sign in, etc.
Auth for agents, where the code-execution and tool-calling component of an agent has its own authentication mechanism, which could be API keys, JWTs, mTLS, or SPIFFE. This ensures that the component has been explicitly authorized to perform an action and can demonstrate it by presenting a token or similar.

This report focuses exclusively on the second aspect. This applies across triggers, code execution, sandboxing, filesystem access, API call logs, killswitches, rate limits, and the rest.
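To make the second form concrete, here is a minimal sketch of an agent proving it is authorized before a tool call runs. It assumes a hypothetical HMAC-signed token that carries an agent ID and a list of allowed tools. The names `SIGNING_KEY`, `issue_token`, and `authorize_tool_call`, as well as the token layout, are illustrative assumptions and not any vendor's API. A production system would more likely rely on JWTs, mTLS, or SPIFFE identities.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical signing key held by the platform; illustrative only.
SIGNING_KEY = b"demo-signing-key"


def issue_token(agent_id: str, allowed_tools: list[str]) -> str:
    """Issue a signed token that lists the tools an agent may call."""
    payload = json.dumps(
        {"agent": agent_id, "tools": allowed_tools}, sort_keys=True
    ).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def authorize_tool_call(token: str, tool: str) -> bool:
    """Return True only if the token is authentic and grants the tool."""
    try:
        encoded_payload, signature = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(encoded_payload.encode())
    except ValueError:
        # Malformed token: wrong shape or invalid base64.
        return False
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking signature bytes.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    claims = json.loads(payload)
    return tool in claims.get("tools", [])
```

The point of the sketch is that the check happens outside the LLM: the tool-calling component either presents a valid credential for the action or the call is refused.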

Writing a prompt asking the LLM not to hallucinate or disclose sensitive data does not qualify as a security feature.

I have also excluded other non-AI product features such as tool hosting and form factor, or monitoring and error handling of the wider workflow.

Scoring observations

Agent code management is surprisingly underdeveloped

Only a handful of vendors offer a sandbox as a security boundary for untrusted, LLM-generated code. Roughly half of the vendors offer some incarnation of code execution, but far fewer have sandboxing. Of those with sandboxing, most rely on third-party services, most commonly E2B.

CrewAI notably deprecated its native code execution service and suggested customers use E2B as a purpose-built sandbox. Some vendors don’t offer conventional microVMs or virtual kernels, but rather use process isolation through a self-hosted configuration.

Agent authentication and identity is almost universally absent

Most marketing assets conflate "the agent uses an API key to call Anthropic" with "the agent provides credentials when accessing third party services." Only Google, Langflow, Workato, CrewAI, Sim.ai, and Gumloop score 2.

Lineage, which refers to the ability to trace an agent to a human identity, is essentially non-existent across the market, with only Google, Workato, and Gumloop scoring anything.

Secrets management is similarly thin: only Google, Sim.ai, and Gumloop score 2, with Make and Retool scoring 1. This matters most in enterprise contexts where agents are calling third-party APIs or accessing internal systems.

Security guardrails are shallow across the board

Most tools don’t really have a security-first mindset. Google and Gumloop stood out as the tools most concerned with security, being the only ones to offer all of the following: proxy-based filtering and firewalling, policy definition, tool ABAC, authentication and authorization, lineage, and secrets management.

Some Evaluations = Guardrails = Model Behavior Security

Some vendors use evaluations as a way to define guardrails and security policies. For example, some vendors use a “does answer contain PII” evaluation to enforce the “don’t disclose PII” guardrail, which in effect becomes a data loss prevention security policy.

This can, for example, use an LLM-as-judge to detect PII, whose outcome is then sent to a summarizer agent that non-deterministically determines not to share the PII.

This is different from running a deterministic regex rule that detects social security numbers and replaces them with asterisks and a slap on the wrist.
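For contrast, the deterministic approach can be as small as the sketch below. It assumes US social security numbers written in the common 3-2-4 digit layout. The names `SSN_PATTERN` and `redact_ssns` are illustrative, and a real data loss prevention rule set would cover more formats and more data types.

```python
import re

# US social security numbers in the common 3-2-4 digit layout, e.g. 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text: str) -> str:
    """Replace every SSN-shaped substring with asterisks, keeping the layout."""
    return SSN_PATTERN.sub("***-**-****", text)
```

The same input always produces the same output, which is what separates this kind of rule from a guardrail that depends on a model's judgment.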

MCP everywhere, A2A somewhere

MCP Host/Client functionality, where agents consume external MCP servers, is commonplace. MCP Server functionality, where the platform itself is exposed as an MCP server for other agents to call, is similarly widespread. By contrast, Google's agent-to-agent protocol is only employed by Google (obvs), CrewAI, Retool, and Sim.ai.

There is nuance in MCP implementation, but generally speaking it is a commoditized feature.

Tools don’t mix and match human and agent written code

However much you want to position these tools for citizen developers, you cannot ignore that sophisticated automation still requires technical knowledge. Most people with technical knowledge know how to code, so offering a coding environment, which can be as simple as a “run script” action, is a huge asset to the product.

Between the rather lackluster agent code execution and running human-written code, there are NO PLATFORMS that do both to a good degree. Not even Google.

Evaluations are surprisingly absent

Considering how important evaluations are to determine how an agent and workflow are performing, many vendors are not implementing them.

Evaluations were hard to define for the report. There are many ways you can implement evaluations, and it is difficult to write an exhaustive list. This report has mainly focused on evaluating agents against known answers, as this is one of the better ways of preventing hallucinations. These include Matches, Semantic similarity/relevancy, and Factual Correctness. More generic LLM-as-judge and Custom evaluations, which can mean anything, are also included.
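As a rough illustration of evaluating against known answers, the sketch below checks an agent's answer against an expected one in two ways: an exact match after normalization, and a similarity score with a pass threshold. Note the assumption: it uses lexical similarity from Python's `difflib` as a stand-in, whereas a real semantic similarity evaluation would typically compare embeddings. The function names and the 0.8 threshold are illustrative.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match(answer: str, expected: str) -> bool:
    """The 'Matches' style of evaluation: same text after normalization."""
    return normalize(answer) == normalize(expected)


def lexical_similarity(answer: str, expected: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, normalize(answer), normalize(expected)).ratio()


def evaluate(answer: str, expected: str, threshold: float = 0.8) -> dict:
    """Score one answer against its known-good reference."""
    score = lexical_similarity(answer, expected)
    return {
        "match": exact_match(answer, expected),
        "similarity": round(score, 2),
        "passed": score >= threshold,
    }
```

For example, `evaluate("Paris", " paris ")` passes on both checks, while `evaluate("Berlin", "Paris")` fails the threshold.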

Deltas from last year

This report is considerably different from the one written last year. While the premise remains consistent and most of the landscape includes familiar names, there are a number of changes which I will state and explain below:

The code-based and no-code duality continues this year. Users can use these tools to mix-and-match between the two. Generally speaking, the no-code canvas is used to define the high-level logic, while code-based snippets (usually hosted within a node in an automation tool) perform custom tasks which cannot be defined with the canvas.

However, we introduce another dimension, which is AI-generated code execution. This refers to LLMs autonomously producing code to complete a task, with the platform offering a code execution environment to run the actions. Agents may self-determine how best to complete a task without needing pre-defined tools. For example, given two different data sources, the LLM may write a Python script to transform and normalize data before performing correlation or analysis. To support agents writing and reading files, we’ve also added a filesystem access metric.

Alongside the code execution environment, which really means that the agent can write anything, you need to offer a safe and sanitized way of running the code prior to executing it in production. So, we’ve added the agent sandbox metric.
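To show the shape of the problem, here is a minimal sketch that runs a string of generated code in a separate interpreter with a time limit and a throwaway working directory. This is process isolation only and is explicitly not a security boundary: a real agent sandbox would add something like a microVM or virtual kernel, plus network and filesystem restrictions. The function name `run_untrusted` and the result layout are illustrative assumptions.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_seconds: float = 5.0) -> dict:
    """Run a code string in a separate interpreter with a time limit.

    Process isolation only; this is NOT a sandbox in the security sense.
    """
    with tempfile.TemporaryDirectory() as scratch_dir:
        try:
            completed = subprocess.run(
                # -I runs Python in isolated mode (ignores PYTHON* env vars
                # and the user site directory).
                [sys.executable, "-I", "-c", code],
                cwd=scratch_dir,  # relative file writes land in a throwaway dir
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "error": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "error": completed.stderr,
    }
```

Even this small wrapper shows why the metric matters: without a time limit and a separate process, a runaway script from the agent would take the workflow down with it.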

In the previous iteration of the report, we distinguished between native agentic AI development tools and workflow automation tools that pivoted into AI agents. One year on, this differentiation is still accurate, but should not influence buying decisions. AI-native products have had time to develop enterprise-grade features, and workflow-based automation platforms have had time to commit to implementing AI agents to a high standard in their product.

I also noted that almost every tool on the market has opted for a no-code workflow development GUI. This is further validated by more vendors developing these exact capabilities, including the likes of OpenAI and Google, as well as comparatively smaller players who developed them over the past year, such as CrewAI. Today, there are more of these tools than I can count, so this report will only evaluate a representative slice of the market.

Vanilla LLM services such as Claude and ChatGPT offer features such as web search and reasoning natively. I therefore removed those from the evaluation list. Basic no-code functionalities such as swappable components and sequential agents are also removed. Low-level controls over temperature and top-k on an LLM are easy to integrate and provide little benefit, so they are now removed.

“Integrations” is a very 2022, iPaaS-style mindset that people in the AI community don’t seem to identify with. I’ve therefore removed the Integratability axis and repurposed some of the integration features across the Codability and Enterprise Readiness axes. The integration burden has also shifted unequally to the server side, where those who want to expose their services to AI agents need to publish an MCP server. This differs from APIs, where servers had to expose their services via APIs and consumers had to integrate with these exposed APIs. In an MCP world, MCP clients autonomously decide when and how to interact with the servers.

Definitions and Methodology

This section defines the evaluation criteria used to compare a selection of tools in the AI Agent development market. It provides a comprehensive list of features that can support developers in creating production-ready AI Agent applications and integrating them in their existing business and technology stack.

For each feature, vendors will be scored as follows:

0 - Feature is absent or unstated

1 - Feature is partially available or achieved via third-party integrations

2 - Feature is available natively in the tool

The depth of the feature only goes as far as the definition below. For example, if a tool offers data loss prevention features, it will get a 2, regardless of how simple or complex the feature is.
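The report does not spell out how the per-feature scores roll up into the percentages shown later, so the sketch below is an assumption: it takes an axis percentage to be the points earned divided by the maximum possible points (2 per feature). The function name and the feature names in the usage example are hypothetical.

```python
MAX_SCORE_PER_FEATURE = 2


def axis_percentage(feature_scores: dict[str, int]) -> int:
    """Roll 0/1/2 feature scores up into a percentage for one axis.

    Assumed formula: points earned / maximum possible points.
    """
    if not feature_scores:
        raise ValueError("at least one feature score is required")
    for feature, score in feature_scores.items():
        if score not in (0, 1, 2):
            raise ValueError(f"{feature}: score must be 0, 1, or 2, got {score}")
    earned = sum(feature_scores.values())
    possible = MAX_SCORE_PER_FEATURE * len(feature_scores)
    return round(100 * earned / possible)
```

Under this assumption, a tool scoring 2, 2, 1, and 1 on four hypothetical features would earn 6 of 8 points, or 75%.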

To do the assessment for each vendor, I have done the following:

First, go through all the documentation manually and populate the spreadsheet with all that I can find.

Second, go through the criteria where I did not find the information and check the website and other resources, use the search function in the docs, or write a site:tools.com query with the right keywords of the capabilities I’m looking for.

Lastly, for those with AI-powered documentation search, I asked the AI whether the tool was offering the features.

Vendors that cannot be assessed based on publicly available documentation will be excluded.

Codability

This section evaluates tools’ capabilities for configuring AI models, leveraging frameworks and optimizing the use of AI.

Triggers

Developer code management

Agent code management

Human-Agent interaction

RAG support

Agentic system building

LLM evaluations

Context management

Integrations

Enterprisiness

This is a catch-all term that defines how an LLM can be deployed and configured in a responsible way. This will make the difference between a crude personal agent that consumers or solopreneurs are using, and responsible deployments that are suitable for organizations that actually deal with customer data and such.

Traceability and Observability

Security

Agent Identity

Guardrails

LLM hosting

Third-party API Management

Evaluation Outcomes

[Chart: Codability and Enterprise scores by vendor]

*Google’s EAP is not comparable with the other tools here in terms of procurement, provisioning, and management.

	CrewAI	Dify	Flowise	Google Gemini EAP	Gumloop	Langflow	Make	n8n	OpenAI	Retool	Sim.ai	StackAI	Tines	Workato
Codability (%)	65	59	63	80	71	35	41	72	65	55	76	62	58	42
Enterprise (%)	48	46	37	74	48	30	46	54	48	50	54	52	63	54

Full breakdown of scores

Vendor profiles

CrewAI
Codability score: 65%
Enterprisiness score: 48%
Since the last iteration of the report, CrewAI released their Crew Studio, making them eligible to be featured in the report. I found CrewAI’s product to be very coherent, well-documented, and well-explained. Some cool stuff includes:

Fingerprints - provide a way to uniquely identify and track components throughout their lifecycle. Each Agent, Crew, and Task automatically receives a unique fingerprint when created, which cannot be manually overridden. These fingerprints can be used for auditing and tracking component usage, ensuring component identity integrity, attaching metadata to components, or creating a traceable chain of operations.

Training - allows users to train AI agents using the command-line interface. During training, CrewAI utilizes techniques to optimize the performance of agents along with human feedback. This helps the agents improve their understanding, decision-making, and problem-solving abilities.

Agent Repositories - allow enterprise users to store, share, and reuse agent definitions across teams and projects. This feature enables organizations to maintain a centralized library of standardized agents, improving consistency and reducing duplication of effort.
Dify
Codability score: 59%
Enterprisiness score: 46%
The BYD of agentic workflows with >140k stars on GitHub, Dify does fairly well across scores in the report. I haven’t seen many deltas in Dify since last year, and I expected them to score slightly higher. They are quite limited in differentiators, but some of their cool stuff includes:

Available in AWS marketplace - you can get nifty stuff like publishing applications as WebApps, embedding them into websites, integrating via API, and applying custom branding.

Annotations - a way of creating a curated library of responses for specific questions. When users ask similar questions, Dify returns pre-written answers instead of generating new responses.

Dify’s DSL - flows can be exported as YAML files, so users can create Dify apps directly from these DSL files. This makes it easy to port apps to other Dify instances and share them with others.
Flowise
Codability score: 63%
Enterprisiness score: 37%
Recently acquired by Workday, Flowise is a tool synonymous with agentic workflows. I found Flowise’s RAG to be way above most other tools. It supports things like:

Text splitting - broadest range of splitter types in the evaluation. Character, Token, Recursive Character, Markdown, Code, and HTML-to-Markdown splitters are all available natively, with configurable chunk size and chunk overlap per splitter.

Data preview and refinement - Live preview of chunking output before processing, allowing iterative tuning of splitter, chunk size, and overlap before committing.

Post-processing chunk editing - individual chunks can be deleted or have content added after upsert.

Embedding - Multiple embedding model providers supported (OpenAI, Google, others).

Vector store - Multiple vector store options, including production-ready hosted options (e.g. Upstash).

Record management - Optional Record Manager component for tracking upserted chunks, enabling incremental updates and targeted deletions without full re-ingestion.

Retrieval tuning - Chunk overlap is explicitly designed to bridge the top K retrieval limit and preserve contextual continuity across chunk boundaries.
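To illustrate what chunk size and chunk overlap do, here is a minimal fixed-size character splitter. It is a simplified sketch and not Flowise's implementation: the splitters listed above (recursive, Markdown, code-aware) are more sophisticated, and the function name `split_with_overlap` is an assumption.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where neighbors share some characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks
```

With `split_with_overlap("abcdefghij", 4, 2)`, each chunk repeats the last two characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one retrieved chunk.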

I also like shared state, which provides a way to manage and share data dynamically throughout the execution of a single workflow instance. It’s a runtime, key-value store that is shared among the nodes in a single execution. It functions as temporary memory or a shared context that exists only for the duration of that particular run/execution. Its primary purpose is to enable explicit data sharing and communication between nodes, especially those that may not be directly connected in the workflow graph, or when data needs to be intentionally persisted and modified across multiple steps.
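The idea of a per-run shared state can be sketched in a few lines. This is a generic illustration and not Flowise's API: the `run_workflow` function and the three example nodes are hypothetical. Each node receives the same key-value store, and the store is created fresh for every execution.

```python
from typing import Any, Callable

State = dict[str, Any]
Node = Callable[[State], None]


def run_workflow(nodes: list[Node]) -> State:
    """Run nodes in order against a fresh state that lives for this run only."""
    state: State = {}
    for node in nodes:
        node(state)
    return state


def fetch_ticket(state: State) -> None:
    state["ticket_id"] = "T-1001"
    state["attempts"] = 0


def enrich(state: State) -> None:
    # Modifies a value written by an earlier node.
    state["attempts"] += 1


def summarize(state: State) -> None:
    # Reads values from nodes it is not directly connected to.
    state["summary"] = f"{state['ticket_id']} after {state['attempts']} attempt(s)"
```

Calling `run_workflow([fetch_ticket, enrich, summarize])` returns a state whose summary was built from values written two steps earlier, and a second call starts again from an empty store.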
Google Gemini EAP
Codability score: 80%
Enterprisiness score: 74%
Google delivers. They rebranded Vertex AI and combined it with ADK to release the Gemini Enterprise Agent Platform as I was researching this report. A part of this platform is Agent Studio, the visual designer with a drag-and-drop canvas.

However, what you can achieve with Agent Studio is a subset of the wider platform.

As such, I’ve evaluated the whole EAP against the report’s criteria for you, the reader, but I do not think Google’s EAP is comparable with the other tools here in terms of procurement, provisioning, and management. So they get an asterisk. Google themselves said users need a solid understanding of Python programming, and mandatory completion of the ADK Quickstart tutorial(s) or equivalent foundational knowledge of ADK basics.

Between the Vertex rebrand and all its modules, I am confused about their overall product taxonomy, including the pricing structure, which includes model inference. I spent a lot of time trying to figure it out, but as I could not draw clear lines between products, I have evaluated it all.

Regardless, there is a lot of cool enterprise-grade stuff, including:

Agent Gateway - the network entry and exit point for all agent interactions. It gives enterprise security administrators the ability to enforce security and governance policies for agents as a part of the platform infrastructure.

Model Armor - applies the organization's content security guardrails by inspecting all prompts and responses that pass through the Agent Gateway. This feature enforces content security guardrails consistently across all agents governed by the gateway. Model Armor helps mitigate risks such as prompt injection, jailbreak attempts, leakage of sensitive information, and generation of harmful content.

Agent Identity - provides an attested, cryptographic identity for each agent that is based on the SPIFFE standard. With Agent Identity, agents can securely authenticate to MCP servers, cloud resources, endpoints, and other agents, acting either on their own behalf or on behalf of an end user. Agent Identity uses the agent's own credential and the Agent Identity auth manager.

Policy testing - used to verify that a policy correctly filters traffic during ingress or egress based on the conditions defined.
Gumloop
Codability score: 71%
Enterprisiness score: 48%
I like Gumloop, not only because it’s a good product, but mainly because it takes agent security and authentication really seriously. Most other products have security features, like auth and RBAC and such, but they’re mainly for the tool itself rather than for agent behavior. But in Gumloop you get goodness like:

Per-tool authorization - Authorize different access levels for different types of tools by setting per-tool policies.

Credentials & Authentication - When an agent uses integrations and workflows, it needs credentials to access external services. Every node that requires authentication has a “Credentials to use” dropdown.

Lineage - Agents always use the credentials of the person running the agent, not the agent creator’s credentials. If a user runs the agent, it uses their credentials. Personal Agents always use the personal default credentials of whoever is running the agent.

AI Model Governance & Configuration - enable administrators to implement security policies, manage costs, ensure compliance, and maintain centralized control over AI automation workflows.

AI Proxy Routing - Route AI requests through custom proxy URLs.

Secrets management - No local plaintext keys, no mystery config files. Enforce credential flows, handle rotation and revocation, and plug into existing vaults.
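The per-tool authorization and lineage behavior described above can be sketched as follows. This is an illustration of the concept and not Gumloop's API: the `User` and `Agent` classes, the allow/deny policy values, and `call_tool` are all hypothetical. The key property is that the credential is looked up on whoever runs the agent, never on whoever created it.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    credentials: dict[str, str] = field(default_factory=dict)  # service -> token


@dataclass
class Agent:
    creator: User
    # tool name -> "allow" or "deny"; tools without a policy are denied
    tool_policies: dict[str, str] = field(default_factory=dict)


def call_tool(agent: Agent, runner: User, tool: str, service: str) -> str:
    """Check the per-tool policy, then act with the runner's own credential."""
    if agent.tool_policies.get(tool, "deny") != "allow":
        raise PermissionError(f"tool {tool!r} is not allowed for this agent")
    token = runner.credentials.get(service)  # never falls back to agent.creator
    if token is None:
        raise PermissionError(f"{runner.name} has no credential for {service!r}")
    return f"{tool} called on {service} as {runner.name}"
```

Because every call is tied to the runner's credential, each action an agent takes can be traced back to a human identity, which is what the lineage criterion asks for.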
Langflow
Codability score: 35%
Enterprisiness score: 30%
Langflow is another tool synonymous with agentic workflows. I think this is IBM’s doing, but Langflow really gives you a lot more back-end configuration capabilities than the other tools, which include:

Memory management options - Langflow provides flexible memory management options for storage and retrieval of data relevant to flows and Langflow servers. This includes essential Langflow database tables, file management, and caching, as well as chat memory.

Logging - Langflow produces logs for individual flows and the Langflow application itself using the structlog library for logging.

File management system - Each Langflow server has a file management system where files that need to be used can be stored. Files uploaded to Langflow file management are stored in Langflow's storage backend (local or AWS S3), and they are available to all flows. Uploading files to Langflow file management keeps files in a central location, and allows users to reuse files across flows without repeated manual uploads.

Proxy - Langflow can be deployed on a Linux-based server using Nginx as a reverse proxy, Let's Encrypt for SSL certificates, and Certbot for automated certificate management. This setup encrypts all communications between users and the Langflow server. SSL certificates ensure that sensitive data is protected from eavesdropping and tampering, and the automatic certificate management through Certbot eliminates the complexity of manual SSL configuration.
Make
Codability score: 41%
Enterprisiness score: 46%
I remember Make from last year as the one with three sets of docs. While this is neither a bonus nor a limitation, you should know about this if you’re evaluating them.

https://help.make.com/

https://developers.make.com/

https://apps.make.com/code

Some of their cool stuff includes:

Grid - a dependency map that helps businesses visually manage and optimize their entire automation and AI ecosystem. It’s like a simulation of the automation infrastructure - it shows a layout of how everything is built and wired. Make Grid creates a near real-time, auto-generated map that shows how scenarios, apps, data stores, and AI-powered components are connected, making it easier to debug, scale, and optimize workflows.

Rollback error handler - stops the scenario run and reverts changes made by modules that support transactions. Modules that support transactions always use a database app, like MySQL or Data store. Make cannot undo actions made by modules that don't support transactions, like Gmail > Send an email or Dropbox > Delete a file.

Make Managed Services (MMS) - a product offering that lets distributors of Make manage multiple organizations under a single entity. Users can manage child organizations in the Make platform or with the Make API. With MMS, organizations can easily distribute Enterprise licenses to clients and manage their operations in one place, or create automations for clients after an easy, quick user setup. Make Managed Services is useful to distributors that need to manage organizations on behalf of their clients, such as Make resellers or automation experts who build for organizations.

Make White Label - Make's White Label solution allows OEM customers to rebrand and manage their own instance of the automation platform. It provides tools for customizing appearance (logos, domain) and controlling user access roles, enabling partners to offer Make’s automation capabilities seamlessly under their own brand.
n8n
Codability score: 72%
Enterprisiness score: 54%
First, n8n is cool for commissioning, publishing, and hosting this independent report. It is synonymous with automation and is often the reference point when comparing tools, as I am constantly reminded by my YouTube feed with ‘New Tool just killed n8n’.

But beyond the strong brand, community, and community content, n8n has been serious about their enterprise-grade scalability and security. Some of the cooler stuff I’ve seen includes:

Self-hosted AI kit - a pre-packaged containerized image which includes n8n (obviously), Ollama as a cross-platform LLM platform to run alongside and as part of n8n workflows, Qdrant as an open-source vector store that integrates with the workflow via API, and PostgreSQL to handle data locally.

Docker Hardened Images - n8n migrated to DHI to offer a low-CVE foundation with continuous patching and verified provenance. This is particularly important for self-hosted versions for hardened assurance.

The OEM deployment option - lets users embed and surface n8n's interface inside their own product's UI. This allows end-users to build workflows, configure connections, and run workflow automation inside the user’s product.

Task isolation via task runners as sidecars in external mode - running tasks as separate containers to provide a fully isolated environment when executing the JavaScript defined in the Code node.

Distroless images - reduce the attack surface by only including the application and its runtime dependencies, excluding package managers, shells, and other utilities that aren't needed at runtime.

Nobody user - task runners can execute as the unprivileged nobody user to prevent the container process from running with root privileges and limit potential damage from security vulnerabilities.

Read-only root filesystem - Configure a read-only root filesystem to prevent any modifications to the container's filesystem at runtime. This helps protect against malicious code that might attempt to modify system files.
OpenAI
Codability score: 65%
Enterprisiness score: 48%
I initially thought OpenAI’s Agent Builder was a filler service meant to retain users who might otherwise turn to alternatives. But their capabilities are overall very good, and considering OpenAI is the model provider, you can do some nifty things with their agent builder.

While some of the capabilities evaluated include both features available with the Agent Builder as well as the SDKs, this gap is nowhere near as big as with Google’s EAP.

Agent visualization - generates a structured graphical representation of agents and their relationships using Graphviz. This is useful for understanding how agents, tools, and handoffs interact within an application. Users can generate an agent visualization using the draw_graph function.

Compaction - Manages long-running conversations with server-side and standalone compaction to reduce context size while preserving state needed for subsequent turns. Compaction helps balance quality, cost, and latency as conversations grow.

Counting tokens - Get accurate input token counts before sending requests. Token counting determines how many input tokens a request will use before sending it to the model. It is used to optimize prompts to fit within context limits and to estimate costs before making API calls.

Predicted Outputs - Reducing latency for model responses where much of the response is known ahead of time. Predicted Outputs can speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when regenerating a text or code file with minor modifications.
Retool
Codability score: 55%
Enterprisiness score: 50%
Retool is an app development tool first, for both backend and frontend. All the agent automation stuff you do with it is a bonus. It also has some really great monitoring features, like these:

The agent sidebar - includes an expandable list of all agents, small activity graphs showing recent runs, and a search bar. Clicking on an agent in the sidebar opens monitoring information for that particular agent, while expanding an agent and clicking on a particular run opens the monitoring information for that agent run.

Time range selector - shows runs and errors across all agents for the selected timeframe when hovering over the graph. It can show the run and error history of all agents or of an individual agent.

Real-time live events - pulls in live events and visualizes agents and tool calls in real time.

The Agent graph - shows agent-resource interactions in real time. Agents are displayed as primary nodes, and tools are shown as secondary nodes. Active connections are shown as animated lines during tool execution.

As part of the app building experience, Retool also offers external apps, which white-label an instance and deliver web applications to external users in a controlled, branded environment.
Sim.ai
Codability score: 76%
Enterprisiness score: 54%
I found Sim.ai late in the report writing process and I was surprised to see so much one-to-one mapping between the criteria defined and their capabilities. Outside of the report, there aren’t many notable features, but they’ve only been around for one year, so let them cook. Today, some of their distinguishing features include:

Mothership - I really want to understand how this is more than just a global copilot. The Mothership is fed all of the Sim environment as context to run things rather than just context of a task or process. Regardless, it is advertised to help users to build, edit, run and debug a workflow, run research, generate a presentation, query a table, schedule a recurring job, or send a Slack message.

Sim Mailer - creates a dedicated email address for the workspace. Users can forward or send emails to it and Sim will process them as tasks — reading the subject, body, and any attachments, then replying to the thread with the result. Users can interact with Sim directly from the email client without switching apps.

Secrets - better secrets management features than we evaluate, which include Workspace and Personal sections with inline key-value rows. External workspace members count as workspace members for workspace-scoped secrets. They can use workspace secrets according to their workspace permission level, even though they are not members of the organization.
StackAI (acquired by Asana)
Codability score: 62%
Enterprisiness score: 52%
In late May, Asana announced its acquisition of StackAI. I’m unclear on how the product will look a few months from now, but what is clear is that StackAI made the biggest jump in capabilities from last year. They’ve developed their product to cater to enterprises (as you can tell from their feature set and their ISO 27001 and SOC 2 certs), so I can see why Asana chose them over a whole lot of other similar products. A lot of their enterprise-grade features have to do with the tool itself rather than agents (see callout box at the beginning of this document), and considering their trajectory, I expect they’ll do their Agent ABAC and such soon. Some cool stuff includes:

Interfaces - While we evaluate interfaces in the report, StackAI’s explicit interface methods are comprehensive, and include Forms, Chat Assistant, Website Chatbot, Slack App, Microsoft Teams, WhatsApp / SMS with Twilio, APIs, iFrame, and React component.

The URL Node - allows users to add a URL to the flow and scrape the HTML or Metadata of a website to use as an input to the LLM. If an LLM node returns a URL as its output, it can feed into the URL node to scrape a website in a more complex workflow. The entire output of the URL Node will be given to the LLM as context.

The StackAI Node - an all-purpose node that does things like analyze data, browse the web, send emails, call APIs, and more. Pick the action, fill in the required inputs, and connect the outputs to the rest of the workflow.
Tines
Codability score: 58%
Enterprisiness score: 63%
Tines started as a security-first product which then extended its scope to non-security use cases. The good thing about this is that you get a tool with a security pedigree. It’s also a product for enterprises, which is reflected in the second-best score (after Google*) in Enterprise-Readiness.

On top of that, Tines is a product for operations teams. This means that you, the human, can handle things that happen in your environment from Tines in real-time.

Cases - a mechanism for creating and managing tickets. This is a whole workspace where users can investigate and respond to incidents, as well as invoking agents to automate the processes, such as enrichment, data transformation, automatic assignment, and orchestrating third party tools.

Self-hosting - Tines is a product for enterprises, and it can be self-hosted. The self-hosting package has three core components (web server, background worker and command runner), which run as Docker containers. Additional components used for optimization include a load balancer, which routes requests to the web server component(s); Postgres as a persistent data store; Redis for caching data from the database and queuing for the background worker(s); and an optional SMTP server, which sends emails for user management, notifications, monitoring and the “Send Email” action.
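The component list above can be written down as a small checklist. This is only an illustrative sketch: the component names and roles are taken from the paragraph above, while the identifiers (`CORE_COMPONENTS`, `ADDITIONAL_COMPONENTS`, `missing_core`) and the data layout are assumptions of mine and do not correspond to any Tines configuration format.

```python
# Core components named in the text; all three run as Docker containers.
CORE_COMPONENTS = {"web-server", "background-worker", "command-runner"}

# Additional components and their roles, as described in the text.
ADDITIONAL_COMPONENTS = {
    "load-balancer": "routes requests to the web server component(s)",
    "postgres": "persistent data store",
    "redis": "caches database data and queues work for the background worker(s)",
    "smtp-server": "optional; sends email for user management, notifications, "
                   "monitoring and the Send Email action",
}


def missing_core(deployed: set) -> list:
    """Return core components absent from a planned deployment, sorted by name."""
    return sorted(CORE_COMPONENTS - deployed)
```

For example, `missing_core({"web-server", "postgres"})` reports that the background worker and command runner are still missing from the plan.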

Data residency - all AI-powered features in Tines run in the same private architecture. Customer data never leaves the stack, does not travel on the open internet, is not logged, and is not used for training.
Workato
Codability score: 42%
Enterprisiness score: 54%
Workato is an enterprise product for enterprises. They score well on criteria where other vendors don’t do as well, such as proxy-based filtering and firewalling, tool ABAC, authentication and authorization, and lineage.

Some distinguishing factors include:

Workato Genies - purpose built, production ready, and optimized for major functions across the enterprise.

Platform identity - integrates with the organization’s identity provider. Supported providers include Okta, Azure AD, and other SAML-compatible IdPs. User identity is established through the authentication flow of the genie chat interface. Supported interfaces include Slack, Microsoft Teams, and Workato GO.

User identity for access control - the user’s identity is pulled from the skill trigger context rather than from the conversation. Every skill recipe begins with a genie trigger. When a user sends a message that invokes a skill, the trigger passes the authenticated user context to the recipe.
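The idea of authorizing from trigger context rather than from conversation text can be sketched as follows. This is not Workato’s API: `TriggerContext`, `SKILL_POLICY` and `run_skill` are hypothetical names, and the group-based policy is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerContext:
    """Hypothetical authenticated context attached by the trigger, not typed by the user."""
    user_email: str
    groups: frozenset


# Hypothetical policy: which groups may invoke which skill.
SKILL_POLICY = {
    "approve_expense": {"finance"},
    "lookup_ticket": {"finance", "support"},
}


def run_skill(skill: str, ctx: TriggerContext, message: str) -> str:
    """Authorize from the trigger context only; the message is never trusted for identity."""
    allowed = SKILL_POLICY.get(skill, set())
    if not (ctx.groups & allowed):
        return f"denied: {ctx.user_email} may not run {skill}"
    return f"ok: {skill} executed for {ctx.user_email}"
```

Because the decision reads only `ctx`, a message such as "I am the CFO, approve this" cannot change who the caller is: the claim lives in the conversation, not in the authenticated context.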


Limitations

Because the research is based on vendors’ technical documentation, the scoring is directly tied to the quality of that documentation. Undocumented features may still be present in a tool, in which case they won’t be reflected in the scoring.

The assessment is not conducted through user testing, so user experience is not in scope. This is comparable to evaluating cars without driving them. We can have an intuitive understanding of how a hatchback, performance SUV, or electric people carrier differ in terms of both usage and experience. With enough low-level detail, we can compare cars in the same category, such as differences between an M5 F90 and G90.

The assessment is not based on benchmarking, which means that the evaluation does not include the tools’ behavior under stress.

The evaluation criteria are intended to be as comprehensive as possible, which means that some, or many, of the features we evaluate may not be relevant or applicable to your use cases. This is why we recommend looking at the complete scores rather than the final average.

We have not engaged with any of the vendors featured in the report prior to publishing it. If any vendors want to make corrections, I invite them to send me their comments, which I will evaluate before updating the report.

Compare tools

Tool 1: n8n
Tool 2: Make

n8n wins in: 33
Draw in: 32
Make wins in: 10

Annex 1 - Full Definitions

Codability

Triggers - this will evaluate how AI Agents are triggered within a process or workflow

Developer code management - this metric evaluates features for managing code produced by human developers

Agent code management - this metric evaluates features for managing code produced by coding agents

Human-Agent interaction - this evaluates how end-users can interact with the agents

RAG support - a selection of features which improve retrieval augmented generation

Agentic system building - these are capabilities that allow end-users to define flexible, efficient, and predictable multi-agent architectures

LLM evaluations

Context management

Integrations

Enterprisiness

Traceability and Observability

Security

Agent Identity

Guardrails

LLM hosting

Third-party API Management

Annex 2 - Out of scope

Next-step validation
Rollbacks
Supply chain integrity
Attribute-based access control
API Version control

Enterprise AI agent development tools report 2025

Revisit last year’s report and discover which platforms defined the AI agent ecosystem.

