Everyone is calling everything "AI agents," but almost no one understands what they actually mean.
When a vendor says "we have AI agents," they could be talking about five completely different systems. The pricing, risk profile, governance requirements, and implementation complexity vary by an order of magnitude between them.
Deploy the wrong type for your use case, and you have a six-figure lesson in mismatched expectations.
Here are the five species of AI agents actually running in production today: what they do, when to use them, and why confusing them is the fastest way to an implementation disaster.
Species 1: Coding Harnesses (Task-Level Automation)
What it is: A single LLM agent with development tools—file access, code execution, search—that takes direction from a human and produces code.
The model: Human as manager, agent as individual contributor. The human decomposes problems, assigns specific tasks, reviews output.
Real examples:
- Andrej Karpathy running agents 16 hours/day on personal coding projects
- Individual developers using Claude Code or Cursor for task-level assistance
- Peter Steinberger managing multiple Codex agents simultaneously for OpenClaw development
When to use it: You have discrete, well-defined coding tasks. You need individual productivity multiplication, not organizational transformation.
The decomposition requirement: This only works if the human can break problems into clear, bounded tasks. Agents cannot handle ambiguity without guidance.
Scale limit: Individual contributor level. Works for one person managing multiple tasks. Does not solve team coordination or project-scale complexity.
Governance: Low. Human review is inherent. Agent cannot deploy without human approval.
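The pattern above can be sketched as a loop: the human hands the agent a bounded task, the agent iterates over tool calls, and nothing ships without explicit approval. This is a minimal illustration, not any vendor's implementation; the model call is a stub standing in for an LLM API, and `approve` stands in for human review.

```python
# Minimal sketch of a single-agent coding harness: human assigns a bounded
# task, the agent loops over tool calls, and output requires human approval.

def stub_model(task, observations):
    # Hypothetical stand-in for an LLM call: proposes the next tool action,
    # or reports done once it has a result to show the human.
    if not observations:
        return ("run_code", "print(2 + 2)")
    return ("done", observations[-1])

def run_code(source):
    # Tool: execute code and capture stdout (a trusted sandbox is assumed).
    import contextlib
    import io
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue().strip()

def coding_harness(task, approve):
    observations = []
    while True:
        action, payload = stub_model(task, observations)
        if action == "done":
            # Human review is inherent: the agent cannot ship on its own.
            return payload if approve(payload) else None
        if action == "run_code":
            observations.append(run_code(payload))

result = coding_harness("compute 2 + 2", approve=lambda out: out == "4")
print(result)  # "4" once the human-style check passes
```

The human stays in the loop at both ends: they decompose the problem into the task and they gate the result, which is why governance can stay light.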
Species 2: Project-Scale Coding Harnesses (Team-Level Orchestration)
What it is: Multiple agents coordinated by a planner agent that assigns work, tracks tasks, and manages context across a codebase over time.
The model: Agent as manager (planner), agents as workers (executors). Human involvement at specification and evaluation, minimal in the middle.
Real examples:
- Cursor implementing browsers and compilers with millions of lines of AI-generated code
- Short-running "grunt" agents spun up by a planner to hit specific problems, solve them, and terminate
- Agent tracking tasks and memory across a project, not just a single session
When to use it: You have genuine project complexity—8, 16, 20+ developers worth of coordination required. Single-threaded agents create human bottlenecks at this scale.
The complexity shift: Instead of speeding up human work and keeping all the bottlenecks, you reframe the project around enabling the agents to do the work, with simple configurations that scale.
Critical insight from Cursor: They tried three levels of management and it did not work. Simple scales. Planner + executors is sufficient.
Governance: Medium. Planner needs audit trails. Humans review completions. But the middle is agent-managed.
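The planner + executor shape can be sketched in a few lines. This is a toy illustration of the structure, not Cursor's system: `planner_decompose` and `executor` are hypothetical stand-ins for agent calls, and the task log plays the role of the audit trail.

```python
# Sketch of the planner + executor pattern: the planner decomposes a goal
# into bounded tasks, short-lived "grunt" executors solve one task each and
# terminate, and the planner tracks completions for later human review.

def planner_decompose(goal):
    # Hypothetical planner agent call: split the goal into bounded tasks.
    return [f"{goal}: part {i}" for i in range(1, 4)]

def executor(task):
    # Hypothetical short-running executor: solve one task, then terminate.
    return {"task": task, "status": "done"}

def run_project(goal):
    task_log = []                        # audit trail of planner decisions
    for task in planner_decompose(goal):
        task_log.append(executor(task))  # spin up, solve, terminate
    # Humans review completions at the end, not every step in the middle.
    return task_log

log = run_project("port templating engine")
print(len(log), all(t["status"] == "done" for t in log))  # 3 True
```

Note the flatness: one planner, N executors, no middle managers, which matches the "simple scales" finding.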
Species 3: Dark Factories (Specification-to-Output Autonomy)
What it is: A complete system where you put specification in at one end and—if the evals pass—production-ready output comes out the other. Human involvement: specification and final evaluation only.
The model: Human at specification (beginning) and evaluation (end). Agents handle everything in between. Agents iterate automatically until they pass specified tests.
Why it exists: Agents move so fast that humans become bottlenecks in the middle. Dark factories remove the human from the middle to eliminate that friction.
The eval is everything: Without robust evaluation criteria, a dark factory is a liability generator. The eval is the quality gate.
Production reality:
- Bold companies may auto-deploy from dark factories (high risk, high trust in evals)
- Most enterprises have human review before production (see Amazon's AI-generated incident learnings)
- Hybrid approaches: Dark factory middle, human eval at end
When to use it: Your specification and evaluation are so solid that you trust the middle to optimize toward passing the eval without human intervention. You have continuous monitoring for production quality.
Risk profile: High if evals are weak. The system will optimize toward passing tests, not toward correctness, safety, or business value.
Governance: High complexity. Specification governance, evaluation governance, monitoring for production drift, accountability for outputs.
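The spec-in, eval-gated, output-out loop can be sketched as follows. Everything here is a toy stand-in: `generate` fakes an agent's attempts and `evaluate` fakes the eval suite. The point the sketch makes is structural: the loop terminates only when the eval passes, so the eval is the entire quality gate.

```python
# Sketch of a dark factory's inner loop: specification in, agents iterate
# until the eval passes, output out. If the eval is weak, this loop will
# optimize toward passing tests rather than toward correctness.

def generate(spec, attempt):
    # Hypothetical generation step: in practice an agent writes real code.
    # Deliberately weak early attempts, improving with each iteration.
    return spec * attempt

def evaluate(output, spec):
    # The eval is the quality gate. Here: output must contain the spec
    # at least three times (a toy stand-in for a real test suite).
    return output.count(spec) >= 3

def dark_factory(spec, max_iters=10):
    for attempt in range(1, max_iters + 1):
        candidate = generate(spec, attempt)
        if evaluate(candidate, spec):
            return candidate, attempt  # passed the gate: ready for review
    raise RuntimeError("eval never passed; nothing ships")

output, iters = dark_factory("widget")
print(iters)  # 3: the system iterated until the eval passed
```

Swap in a weak `evaluate` and the loop happily ships garbage on attempt one, which is the "liability generator" failure mode in miniature.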
Species 4: Auto Research (Metric Optimization)
What it is: An agentic process relentlessly experimenting to optimize a specific metric—conversion rates, code runtime, LLM weights—not to produce software but to improve a measured outcome.
The model: Agent runs experiments at scale, validates successes/failures against metric, "hill climbs" toward optimal performance. Human reviews scalable successes.
Real examples:
- Tobi Lütke optimizing Shopify's Liquid template engine
- Andrej Karpathy's recent GPT-2-scale auto-research work
- Conversion rate optimization through automated landing page variants
When to use it: You have a clear metric, sufficient data, and the ability to run many small experiments. The problem is metric-shaped, not software-shaped.
The experiment cascade: Many experiments fail. Some succeed. Humans review scalable successes for viability. This is not about building working software—it is about climbing an optimization hill.
Common error: Using auto-research to build software. Do not do this. Auto-research optimizes metrics. Coding harnesses/factories build software. Different species.
Governance: Depends on metric risk. Financial metrics, customer-facing metrics require tight oversight. Internal optimization may be looser.
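Hill climbing on a metric can be sketched concretely. The metric here is a toy function with a single peak; in practice it might be conversion rate or code runtime. The perturbation step is a hypothetical stand-in for an agent proposing an experiment variant.

```python
# Sketch of auto research as hill climbing: run many small experiments,
# keep only the variants that improve the metric, discard the rest, and
# surface the accepted wins for human review.
import random

def metric(x):
    # Toy objective with a single peak at x = 5 (stand-in for a real KPI).
    return -(x - 5) ** 2

def auto_research(start, experiments=200, seed=0):
    rng = random.Random(seed)
    best, best_score = start, metric(start)
    wins = []                                 # scalable successes for review
    for _ in range(experiments):
        candidate = best + rng.uniform(-1, 1)  # small perturbation experiment
        score = metric(candidate)
        if score > best_score:                 # most experiments fail
            best, best_score = candidate, score
            wins.append(candidate)
    return best, wins

best, wins = auto_research(start=0.0)
print(round(best, 1), len(wins) > 0)  # converges near the peak at 5
```

The cascade is visible in the numbers: of 200 experiments, only the handful in `wins` improved the metric, and those are what a human would review for viability.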
Species 5: Orchestration Frameworks (Workflow Handoffs)
What it is: Multiple specialized agents with distinct roles (researcher, writer, marketer) handing work off through defined workflows with managed context.
The model: Specialist A completes work, hands to Specialist B, who hands to Specialist C. Each has narrow expertise. Coordination overhead is high.
Examples: LangGraph, CrewAI, customer success ticket routing through research → draft → review → close.
When to use it: You have genuinely specialized work requiring different capabilities, done at sufficient scale (thousands to millions of instances) to justify the coordination complexity.
The efficiency question: Is the coordination overhead worth it for your volume? Small scale (hundreds of instances) often means orchestration is overkill.
Governance: High complexity. Handoff points need context management. Approval gates at critical transitions. Audit trails across agent boundaries.
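The handoff structure can be sketched as specialists passing a shared context object through a fixed workflow. This is an illustrative reduction, not how LangGraph or CrewAI are implemented; those frameworks manage the state, wiring, and approval gates that are hand-rolled here.

```python
# Sketch of an orchestration handoff: specialists with narrow roles pass a
# shared context through a defined workflow (research -> draft -> review),
# with an audit trail maintained across agent boundaries.

def researcher(ctx):
    ctx["facts"] = ["fact A", "fact B"]       # narrow expertise: gather
    return ctx

def writer(ctx):
    ctx["draft"] = "; ".join(ctx["facts"])    # consumes researcher output
    return ctx

def reviewer(ctx):
    ctx["approved"] = len(ctx["draft"]) > 0   # approval gate at transition
    return ctx

def run_workflow(ticket):
    ctx = {"ticket": ticket, "audit": []}     # shared context + audit trail
    for stage in (researcher, writer, reviewer):
        ctx = stage(ctx)
        ctx["audit"].append(stage.__name__)   # trail across agent boundaries
    return ctx

result = run_workflow("customer ticket #1")
print(result["approved"], result["audit"])
# True ['researcher', 'writer', 'reviewer']
```

Even this toy version shows where the coordination overhead lives: every handoff is a point where context can be lost and where a gate or audit entry must be maintained, which is why small volumes rarely justify it.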
The Cheat Sheet: Which Agent for Which Problem?
| Your Goal | Use This | Human Role | Governance Level |
|---|---|---|---|
| Multiply individual developer productivity | Coding Harness (single agent) | Manager/reviewer | Low (inherent review) |
| Coordinate 20+ devs worth of code complexity | Project-Scale Harness (planner + executors) | Spec + eval reviewer | Medium (audit trails) |
| Specification → production with minimal human middle | Dark Factory (eval-driven iteration) | Spec + eval + monitoring | High (comprehensive) |
| Optimize a specific metric through experimentation | Auto Research (hill-climbing) | Define metric + review scalable wins | Medium (metric-dependent) |
| Route complex workflows through specialists at scale | Orchestration (handoff framework) | Workflow design + exception handling | High (multi-agent coordination) |
Why This Matters for Your Organization
For Web Agencies:
When a client says "we want AI agents," your first question should be: "Which species?"
- Coding harness for their dev team? $2-5K implementation.
- Project-scale coordination for their platform rebuild? $50-150K engagement.
- Dark factory for their content pipeline? $200K+ with months of eval development.
The scope, pricing, and governance requirements are an order of magnitude different. Confusing them means mismatched expectations on both sides.
For Family Offices:
Your AI use cases likely fall into specific categories:
- Document processing: Orchestration (specialist agents for classification, extraction, summarization)
- Investment research: Coding harness or project-scale (depending on research complexity)
- Knowledge management: Project-scale harness (planner coordinates retrieval across document corpus)
Understanding the species helps you evaluate vendor proposals. If someone pitches "AI agents" for knowledge management without mentioning planner/executor architecture, they do not understand what they are proposing.
The Governance Requirement for Each Species
The pattern: As you move from single agents → multi-agent projects → dark factories, governance is not just added—it is multiplied.
Coding Harness: Human-in-the-loop is sufficient. Inherent review.
Project-Scale Harness: Need audit trails for planner decisions, executor outputs, task completions.
Dark Factory: Specification governance, evaluation governance, production monitoring, drift detection, accountability frameworks. The eval is the gatekeeping system.
Auto Research: Metric definition governance, experiment review protocols, scalability validation.
Orchestration: Handoff context management, approval gates at transitions, multi-agent audit trails.
The mistake: Organizations implement higher-species agents with lower-species governance. This is where $140,000 lessons happen.
The Bottom Line
We are past the era where "AI agent" is a meaningful term on its own. Sophisticated implementations have diverged into distinct species with different capabilities, requirements, and risk profiles.
The organizations winning with AI agents are the ones that understand which species they are deploying, why that species matches their problem, and what governance architecture it requires.
The organizations losing are treating all "AI agents" as fungible—deploying orchestration frameworks for problems that need coding harnesses, or dark factories without the eval infrastructure to support them.
Understand the species. Match the tool to the problem. Build the governance that the species requires.
Your move.
Adapted from production agent implementations across enterprise software development, with thanks to the teams at Cursor, Cognition, and the independent researchers advancing this field.