Here's a story that should keep every agency owner awake at night:
Two weeks ago, an AI coding agent wiped out a production database. 1.9 million rows of student data—gone in seconds. The backups disappeared too. The agent never made a technical error. Every action was logically correct.
It simply had no idea it was demolishing a live system.
The knowledge that distinguished "real production infrastructure" from "temporary test environment" existed only in the engineer's head. And the agent, for all its coding prowess, couldn't possibly know.
This isn't a hypothetical. It's happening right now. And if you think your agency is immune because you use "careful prompting," you're missing the real risk.
The Agents Are Getting Better. The Context Is Not.
AI agents can write code. They can generate designs. They can close tickets. We've all seen the demos.
But here's what the hype doesn't show you: an AI agent's tenure is measured in hours, not years. A typical agent run lasts a few hours at best. Compare that to your senior developer who's been with you for five years, who remembers why you built that custom module a certain way, who knows which clients have unwritten special arrangements.
That gap—between task execution and institutional context—is the single biggest risk in AI deployment right now.
A recent study from Scale AI tested frontier agents on 240 real freelance projects from Upwork. Video production, architecture, game development, data analysis. These were end-to-end projects, not toy problems.
The best agent completed 2.5% of projects at a quality a paying client would accept.
97.5% failure rate on real work.
Not because the agents can't code. Because real work requires context that isn't in the brief.
The Maintenance Problem Nobody Talks About
Another study—from Alibaba's research team—tested AI agents on maintaining software over time. Not writing fresh code. Maintaining existing codebases.
They used 100 real codebases, each spanning an average of 233 days and 71 consecutive updates of actual development history.
75% of models tested broke previously working features during maintenance.
Three out of four frontier models asked to maintain code over time actively made things worse.
Writing code and maintaining code are fundamentally different skills. AI is getting really good at the former. AI is not very good at the latter.
And yet most agencies are benchmarking AI on greenfield tasks—building something from scratch—and assuming that translates to production maintenance. It doesn't.
What This Means for Your Agency Talent
You might have seen the headlines: "AI replaces junior developers." There's truth in that, but not the truth you think.
A Harvard study of 62 million American workers found that companies adopting generative AI saw junior employment drop roughly 8% within a year and a half. Senior employment kept rising.
The conventional read: AI replaces junior workers.
The better read: AI replaces task execution.
Juniors used to be hired for tasks—debugging code, reviewing documents, first drafts of client emails. AI does those tasks adequately now.
Seniors survive because they provide something different: the mental model of the system. They know which parts are load-bearing. They know the decision history. They know the things nobody wrote down.
Context is the scarce resource. Not coding execution.
The Database Disaster: A Cautionary Tale
Let's talk about what actually happened to that production database, because the details matter.
Alex was migrating a website to the cloud and decided to reuse existing infrastructure to save a few bucks a month. He asked his AI agent to handle the deployment.
First warning sign: The agent started creating cloud resources that shouldn't have existed. Alex had moved to a new computer and hadn't transferred his infrastructure configuration. The agent looked at the cloud, saw nothing it recognized, and assumed it was building from scratch.
Alex stopped the process. Some duplicate resources had been created. Reasonable next step: "Identify and remove the duplicate resources."
The agent decided it would be "cleaner and simpler" to demolish everything it had created in one shot. But unbeknownst to Alex, the agent had quietly unpacked an archived configuration file from his old computer. Inside were the definitions of his real production infrastructure.
When the agent ran the demolition command, it wasn't clearing temporary duplicates. It was destroying the production database, networking layer, application cluster, load balancers—everything.
The agent was competent. The agent was confident. And Alex made reasonable asks that any of us might have made.
The agent simply had no idea which world it was operating in.
The Eval Gap: Your New Competitive Moat
"Evals" (evaluations) are how you encode human judgment into guardrails that prevent disasters. Simple rules like:
- "Before destroying any cloud resource, verify it is not tagged as production"
- "Before any bulk infrastructure change, compare current state against known production manifests"
These aren't technical problems. They're organizational memory problems.
Here's what's terrifying: Most companies deploying AI agents don't write evals at all. And when they do, they delegate them to junior team members who sit in front of Excel spreadsheets writing tests they think are comprehensive.
Juniors don't have the context. You need senior people writing evals.
The skill of writing great evaluations is the exact same skill that makes senior developers valuable. It's not a chore. It's not an afterthought. It's the bridge between what humans know and what machines do.
The agencies that win the next few years will treat eval design as a core competency for seniors—not something to throw at juniors before you "optimize headcount."
The Human Role: Contextual Stewardship
Ultimately, the human role in an AI-powered agency is what we call contextual stewardship:
1. Maintaining the mental model of your systems—not just code, but client relationships, project history, organizational constraints
2. Representing what you know in ways machines can use—documenting decisions, not just outcomes; capturing the "why," not just the "what"
3. Exercising judgment about when technically correct output is organizationally wrong—knowing when the agent's answer would create client problems, political issues, or reopen old wounds
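Point 2 can be as lightweight as a structured decision record an agent can actually read. Here's one possible shape, sketched as a Python dataclass; the field names and the example content are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """A machine-readable record of the 'why' behind a decision.

    Field names are illustrative, not a formal standard.
    """
    title: str
    decision: str
    context: str                                # the constraints that made this choice right
    alternatives: list[str] = field(default_factory=list)
    consequences: str = ""                      # what future work (human or agent) must respect

# A hypothetical record capturing an unwritten client arrangement.
record = DecisionRecord(
    title="Keep the legacy billing module",
    decision="Do not migrate billing to the new service this quarter.",
    context=(
        "A long-standing client arrangement depends on the module's "
        "rounding behavior; changing it would break their invoices."
    ),
    alternatives=["Full migration now", "Dual-run both systems"],
    consequences="Any agent touching billing must preserve the rounding rules.",
)
```

The point isn't the format. It's that the constraint ("this client depends on the rounding behavior") moves out of someone's head and into something an agent can be checked against.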
This isn't about learning to code. It's about becoming the person who holds context that keeps machines from going sideways.
When you deploy agents without investing in evaluation infrastructure—without encoding your judgment—you're handing powerful tools to systems that have no idea what they're not supposed to destroy.
They're going to destroy stuff.
What To Do This Week
If you're a senior developer:
- Document three critical decisions you made recently. Not the outcomes—the constraints, trade-offs, and context that made one choice better than another
- Write one eval that would have caught a recent near-miss or production issue
- Make your contextual stewardship visible to leadership
If you're an agency owner:
- Identify who holds your institutional knowledge. If they left tomorrow, what walks out the door?
- Invest in eval infrastructure before you invest in more AI tools
- Remember: 55% of employers regret AI-driven layoffs. Gartner predicts half the companies that cut staff for AI will rehire within two years
If you're a junior developer worried about AI:
- Start thinking in systems, not just tasks
- Ask "why" more than "how"
- Build the contextual judgment that agents cannot replicate
The agents are here. They work. They're improving every day. And that's exactly what makes them dangerous, because every capability gain without a context gain widens the gap between what agents can do and what they understand.
Your job is to be the one who sees what they can't.
Adapted from: "The Agents Are Getting Better. The People Deploying Them Are Not." by Nate, YouTube.