Paste any text and get a verdict on whether it was written by a human, AI, or a hybrid mix. Instead of trusting one black-box score, this workflow runs your text through statistical analysis and a three-agent debate where each agent challenges the others using hard numbers.
This is not another "detect AI with AI" template. The workflow measures six forensic markers first, then makes three separate agents argue about what those numbers mean. You see the raw data, the debate, and the final verdict with confidence scores.
The workflow runs in five stages:
Extract forensic metrics: A code node measures burstiness (sentence length variation), type-token ratio (vocabulary diversity), hapax rate (words appearing only once), repetition score (repeated phrases), transition density (filler words like "furthermore"), and AI fingerprints (100+ known LLM phrases stored in a data table). Short texts under 150 words get recalibrated thresholds because the metrics are less reliable on small samples.
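The core of these markers is plain text statistics. Here is a minimal sketch of how a few of them can be computed; the function name, the exact transition-word list, and the rounding are illustrative, not the template's actual code:

```python
import re
from collections import Counter

def stylometric_metrics(text):
    """Compute a handful of the forensic markers described above (sketch)."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    lengths = [len(s.split()) for s in sentences]

    # Burstiness: coefficient of variation of sentence length.
    # Humans vary rhythm a lot; LLM output tends to be flatter.
    mean = sum(lengths) / len(lengths)
    var = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    burstiness = (var ** 0.5) / mean if mean else 0.0

    counts = Counter(words)
    ttr = len(counts) / len(words)                               # type-token ratio
    hapax = sum(1 for c in counts.values() if c == 1) / len(words)

    transitions = {"furthermore", "moreover", "additionally", "consequently"}
    transition_density = sum(counts[t] for t in transitions) / len(words)

    return {"burstiness": round(burstiness, 3),
            "ttr": round(ttr, 3),
            "hapax_rate": round(hapax, 3),
            "transition_density": round(transition_density, 3)}
```

All four values are ratios, which is why they need a minimum word count to stabilize — hence the recalibration for short inputs.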
Agent 1 - The Scanner: Reads the text cold with zero metrics. Gives a gut impression (human/AI/hybrid) based purely on instinct. Acts like an editor who has read thousands of manuscripts.
Agent 2 - Forensic Analyst: Gets the text, all metrics, and Agent 1's verdict. Writes a data-driven report that must cite specific numbers. Either agrees or disagrees with Agent 1 and explains why using the forensic evidence.
Agent 3 - Devil's Advocate: Gets everything above and argues the opposite of whatever Agent 2 concluded. If Agent 2 said AI, Agent 3 must argue human. Finds holes in the logic and metrics that got ignored.
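The three agents differ only in what context they receive. A sketch of how that context accumulates, with illustrative field names rather than the template's actual prompt schema:

```python
def build_agent_contexts(text, metrics, scanner_verdict, analyst_report):
    """Sketch of the escalating context each agent sees (field names illustrative)."""
    scanner_in = {"text": text}                       # Agent 1: text only, no metrics
    analyst_in = {"text": text, "metrics": metrics,
                  "scanner_verdict": scanner_verdict}  # Agent 2: data + prior verdict
    # Agent 3 is forced to argue the opposite of Agent 2's conclusion.
    opposite = {"human": "AI", "AI": "human"}.get(analyst_report["verdict"], "hybrid")
    devil_in = {"text": text, "metrics": metrics,
                "scanner_verdict": scanner_verdict,
                "analyst_report": analyst_report,
                "must_argue": opposite}
    return scanner_in, analyst_in, devil_in
```

The forced-opposite mapping is what keeps Agent 3 from simply agreeing with the consensus.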
Weighted verdict: A code node scores all three agents (35% Analyst, 15% Scanner, 15% Devil's Advocate, 35% raw metrics) and classifies as human (score under 0.35), AI (score over 0.60), or AI-augmented (in between). Confidence is calculated separately so you get verdicts like "AI with 67% confidence."
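The weights and thresholds above translate directly into a scoring function. A minimal sketch, assuming each input is an AI-likelihood in [0, 1] (confidence is computed separately in the workflow, so this returns only the classification and raw score):

```python
def weighted_verdict(scanner, analyst, devil, metric_score):
    """Combine agent AI-likelihoods using the weights described above (sketch)."""
    score = (0.15 * scanner        # Agent 1: gut check
             + 0.35 * analyst      # Agent 2: data-driven report
             + 0.15 * devil        # Agent 3: devil's advocate
             + 0.35 * metric_score)  # raw forensic metrics
    if score < 0.35:
        return "human", round(score, 3)
    if score > 0.60:
        return "AI", round(score, 3)
    return "AI-augmented", round(score, 3)
```

Note that the Analyst and the raw metrics together carry 70% of the weight, so the debate adjusts the verdict at the margins rather than overruling the numbers.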
The chat response shows the verdict with a confidence bar, each stylometric metric with a human/AI flag, and every agent's individual reasoning. Example output for AI text:
🤖 Verdict: AI-Generated
Confidence: ████████░░ 87%
📊 Stylometric Metrics:
Burstiness: 0.18 🟥 AI
Vocabulary Diversity: 0.36 🟥 AI
Hapax Rate: 0.32 🟥 AI
Repetition: 0.21 🟥 AI
Transition Density: 0.024 🟥 AI
🔎 Agent 1 (Gut Check): AI (90%)
"Monotonous rhythm, corporate vocabulary, zero personality"
🔬 Agent 2 (Data): AI (95%)
"Five of six metrics flag AI. Burstiness of 0.18 well below human threshold..."
😈 Agent 3 (Critic): AI-AUGMENTED (65%)
"Could be human technical writing. Transition density alone not conclusive..."
A separate workflow branch runs weekly to keep the AI phrase list current.
Requires: A data table (Google Sheets, Airtable, or n8n Data Table) to store fingerprint words. The workflow includes a starter list of 100+ phrases like "delve into," "it's worth noting," "as of my last update."
LLM writing patterns shift fast. What worked for GPT-3 detection does not work for GPT-4. This keeps the detector current without manual updates.
At least one LLM provider: OpenAI, Anthropic, Google Gemini, Groq, or any other provider with JSON output support. Each agent can use a different provider, or all can use the same one.
Data storage for fingerprint phrases: n8n Data Table (built-in), Google Sheets, or Airtable. The workflow checks this table to identify known AI phrases during analysis.
The workflow works best on long-form content (500+ words). Short texts under 100 words produce less reliable metrics because statistical patterns need more data to emerge. The recalibration helps but is not perfect.
AI fingerprint phrases evolve as models improve. GPT-5 might not use "delve into" but will have new tells. The self-updating workflow helps but lags current releases by a few weeks.
The three-agent debate architecture assumes disagreement is meaningful. For extremely niche topics where only one agent has relevant training data, the minority opinion might be correct but gets outvoted. Review the individual agent reasoning when dealing with specialized content.