Submitted by wren@wembassy.com on March 19, 2026

# When AI Agents Know the Answer But Still Get It Wrong

*Lessons from a landmark medical study on AI failure modes*

---

## The Mount Sinai Discovery

A recent study from Mount Sinai Health System revealed something alarming about AI agents: **they can correctly identify a problem in their reasoning, then recommend the exact opposite action.**

The study tested ChatGPT Health—a tool designed to provide responsible medical advice. Researchers found the system would correctly diagnose "early respiratory failure" in its internal analysis, then tell the patient to wait 24-48 hours before going to the ER.

This isn't a rare glitch. It's a structural problem with how AI agents make decisions.

---

## Four Critical Failure Modes

### 1. The Inverted U Problem

AI agents excel at textbook cases but fail at edge cases—the exact scenarios where stakes are highest.

- **Routine cases:** Handled perfectly (classical stroke, severe allergic reactions)
- **Edge cases:** Dangerous errors (atypical symptoms, complex presentations)

Your agent's accuracy dashboard might show 87%, but that average masks silent failures on the distribution tails where consequential decisions actually live.

**The risk:** Your accounts payable agent processes routine invoices flawlessly but misses the duplicate that's been slightly modified. Your claims agent handles fender benders but can't detect the third claim from the same address in 14 months.
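
One way to surface this in your own evaluations is to stratify accuracy by case type instead of reporting a single aggregate. A minimal sketch, where the strata labels and the eval data are hypothetical:

```python
from collections import defaultdict

def stratified_accuracy(results):
    """Compute accuracy per stratum instead of a single aggregate.

    `results` is a list of dicts with keys:
      stratum - e.g. "routine" or "edge" (labels are up to your eval design)
      correct - whether the agent's output matched ground truth
    """
    totals = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for r in results:
        totals[r["stratum"]][1] += 1
        if r["correct"]:
            totals[r["stratum"]][0] += 1
    return {s: c / t for s, (c, t) in totals.items()}

# Hypothetical eval run: 90% aggregate accuracy, but the tail tells
# a different story.
results = (
    [{"stratum": "routine", "correct": True}] * 87
    + [{"stratum": "routine", "correct": False}] * 3
    + [{"stratum": "edge", "correct": True}] * 3
    + [{"stratum": "edge", "correct": False}] * 7
)
print(stratified_accuracy(results))  # routine looks great; edge does not
```

The point of the sketch: the same run that reports 90% overall shows 30% on the edge stratum, which is exactly where the consequential decisions live.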

---

### 2. Knowing Without Acting

The agent's reasoning trace and final output operate as semi-independent processes. Research shows:

- Models can produce correct answers even when given incorrect reasoning chains
- The reasoning chain can be "anchored" to early states, ignoring later critical updates
- Over 50% of the time, models fail to update answers when their reasoning changes significantly

**The danger:** Your compliance screening agent correctly identifies an enhanced due diligence jurisdiction in its reasoning, then outputs "standard risk case." Your customer service agent spots a known billing error pattern, then recommends a generic 5-7 day review.

If you're not regularly comparing reasoning traces to actual outputs, these defects stay invisible.
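
A lightweight guard is to extract the key judgment from the reasoning trace, extract it again from the final output, and flag any large disagreement for review. A sketch under the assumption that both carry a machine-readable severity label; the label set and function names here are illustrative:

```python
def consistency_flag(reasoning_severity: str, output_severity: str) -> bool:
    """Return True when the reasoning trace and final output disagree.

    Severities are assumed to come from the same ordered label set.
    In a real system you would extract them via structured output or
    a classifier, not by trusting free text.
    """
    order = ["low", "moderate", "high", "critical"]
    gap = abs(order.index(reasoning_severity) - order.index(output_severity))
    return gap >= 2  # large jumps between trace and output get escalated

# Reasoning said "critical" (early respiratory failure), output said "low"
# (wait 24-48 hours): the Mount Sinai failure pattern.
print(consistency_flag("critical", "low"))   # True -> escalate
print(consistency_flag("high", "moderate"))  # False -> minor drift
```

A check like this doesn't make the chain of thought trustworthy as an explanation; it just turns trace/output divergence into a measurable signal instead of a silent one.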

---

### 3. Social Context Hijacks Judgment

When family members minimized patient symptoms, the AI's triage recommendations shifted dramatically—becoming 12x more likely to recommend less urgent care.

This is anchoring bias, and it generalizes to any agent processing structured data alongside unstructured human language:

- A senior VP's note saying "I'm confident this is the right choice" shifts vendor selection recommendations
- An employer's letter describing an applicant as a "valued longtime employee" changes lending risk assessments

The structured data should drive these decisions, but the unstructured framing creates bias.

**The insidious part:** Without study designs comparing scenarios with/without anchoring inputs, this bias is invisible to standard evaluations.
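
That comparison can be automated: run each case twice, once with and once without the unstructured framing, and measure how often the recommendation flips. A sketch with a stubbed agent; `call_agent` stands in for your actual model call, and the stub is deliberately biased to show what the metric catches:

```python
def anchoring_flip_rate(cases, call_agent):
    """Fraction of cases whose recommendation changes when the
    unstructured framing text is removed.

    Each case is (structured_data, framing_text); `call_agent` is
    whatever function invokes your model and returns a decision.
    """
    flips = 0
    for structured, framing in cases:
        with_framing = call_agent(structured, framing)
        without_framing = call_agent(structured, "")
        if with_framing != without_framing:
            flips += 1
    return flips / len(cases)

# Stub agent that (badly) lets a confident-sounding note sway it.
def stub_agent(structured, framing):
    if "confident" in framing:
        return "approve"
    return "approve" if structured["score"] > 0.5 else "review"

cases = [
    ({"score": 0.3}, "I'm confident this is the right choice"),
    ({"score": 0.9}, "I'm confident this is the right choice"),
    ({"score": 0.2}, ""),
]
print(anchoring_flip_rate(cases, stub_agent))  # nonzero = framing is leaking in
```

A flip rate meaningfully above zero means the unstructured text, not the structured data, is driving some decisions.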

---

### 4. Guardrails That Fire on Vibes, Not Risk

The Mount Sinai team found crisis intervention alerts were "inverted relative to clinical risk." The system fired more reliably when patients described vague emotional distress than when they articulated concrete self-harm threats.

**The problem:** Guardrails matched surface-level language patterns (keywords, emotional tone) rather than actual risk taxonomies. They tested for the *appearance* of safety, not actual safety.

This appears in other agents too—guardrails that trigger on how something sounds rather than what it actually means.
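
A toy illustration of the anti-pattern, with invented phrase lists: a fraud guardrail keyed on agitated language fires on tone while staying silent on a calmly worded, concrete risk signal.

```python
def tone_guardrail(message: str) -> bool:
    """Anti-pattern: matches how the message *sounds*, not what it means.
    Shown here as the thing NOT to build."""
    agitated = ["urgent", "immediately", "asap"]
    return any(word in message.lower() for word in agitated)

vague = "URGENT: please process this immediately!!!"
concrete = "Kindly remit all future payments to our updated account below."

print(tone_guardrail(vague))     # fires on tone alone
print(tone_guardrail(concrete))  # misses a classic payment-redirect signal
```

A guardrail worth shipping would classify against a risk taxonomy (payment redirection, credential requests, concrete threats) rather than surface vocabulary, and it would be evaluated on cases where tone and risk deliberately disagree.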

---

## Why This Matters for Business

These aren't medical-specific issues. They're structural properties of how LLMs generate outputs:

1. **Edge case blindness** — Your evaluation suite designed for average accuracy won't catch tail-risk failures
2. **Reasoning/output disconnect** — The chain of thought is fundamentally unreliable as an explanation
3. **Framing vulnerability** — Any agent combining structured data with unstructured text is susceptible
4. **Surface-level safety** — Guardrails often match patterns, not actual risk

---

## What To Do About It

**For AI-native companies:**
- Design evaluations that specifically test edge cases and tail risks
- Regularly compare reasoning traces to ground truth actions
- Test scenarios with and without potential anchoring inputs
- Distinguish between "appears safe" and "is safe"

**For enterprises using AI:**
- Don't trust aggregate accuracy metrics
- Implement human-in-the-loop for high-stakes decisions
- Monitor for systematic bias patterns, not just individual errors
- Build verification layers that check outputs against reasoning
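
The human-in-the-loop point can be made concrete with a routing gate: only low-stakes, high-confidence decisions auto-execute, and everything else goes to a reviewer. A minimal sketch; the thresholds and field names are illustrative, not a recommendation for specific values:

```python
def route_decision(decision: str, stakes: float, confidence: float,
                   stakes_threshold: float = 0.7,
                   confidence_floor: float = 0.9) -> str:
    """Auto-execute only low-stakes, high-confidence decisions.

    `stakes` and `confidence` are assumed to be normalized to [0, 1]
    by upstream scoring; how you produce them is the hard part.
    """
    if stakes >= stakes_threshold or confidence < confidence_floor:
        return "human_review"
    return "auto_execute"

print(route_decision("approve_invoice", stakes=0.2, confidence=0.95))  # auto_execute
print(route_decision("wire_transfer", stakes=0.9, confidence=0.97))    # human_review
```

Note the asymmetry: high stakes route to review even at high confidence, because the failure modes above show confident outputs and correct outputs are not the same thing.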

---

## The Bottom Line

AI agents are powerful tools, but they're not reliable decision-makers in high-stakes contexts. The Mount Sinai study shows they can know the right answer and still give dangerous advice.

The solution isn't to abandon AI—it's to build the right guardrails, evaluation frameworks, and human oversight. The agents are already out there. Now we need to learn how to use them responsibly.

---

*What failure modes have you seen in AI agents? How are you addressing them?*