LLM-as-Judge is Not Enough: Why Transcript Analysis Falls Short
Introduction
The appeal is obvious. Take your voice agent's transcripts, feed them to GPT-4, and ask: "Did the agent handle this conversation well?" You get a score, maybe some reasoning, and you move on. It's fast, it's cheap, and it feels like you're measuring something real.
But you're not.
LLM-as-Judge for transcript analysis has become the default evaluation approach for many voice AI teams. And while it's better than nothing, it systematically misses the failure modes that actually hurt your users and your business.
This isn't a theoretical concern. We've seen teams with "95% quality scores" from their LLM evaluators who still had 20% task failure rates in production. The gap between what transcript analysis measures and what users experience is larger than most teams realize.
What transcript analysis actually measures
When you pass a transcript to an LLM and ask it to evaluate quality, you're measuring a narrow slice of the conversation:
The words that were said. The LLM sees the text - what the agent said, what the user said, and the order of turns.
Coherence of responses. Does the agent's language make sense given what the user asked?
Politeness and tone. Is the agent courteous, professional, patient?
Apparent task completion. Based on the dialogue, does it look like the user's request was handled?
These are all reasonable things to measure. But they're proxies - and proxies that can be deeply misleading.
The five blind spots of transcript analysis
1. Transcripts don't capture what the user actually said
Here's a fundamental problem: the transcript you're evaluating isn't what the user said. It's what your ASR system thought the user said.
When a user says "I need to transfer fifty dollars to my savings account" and your ASR transcribes it as "I need to transfer fifteen dollars to my savings account," the transcript looks perfect. The agent handles a fifteen-dollar transfer request flawlessly. Your LLM evaluator gives it high marks.
But the user wanted fifty dollars transferred, and fifty is what they'll expect to see. The transcript shows a successful conversation. The user experiences a failure.
This isn't an edge case. ASR errors are endemic in production voice systems, especially for:
- Numbers and amounts (the most consequential data in many conversations)
- Names and proper nouns
- Homophones and near-homophones
- Accented speech
- Noisy environments
Transcript analysis evaluates the transcript. It can't evaluate the gap between transcript and reality.
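One partial mitigation is to look below the transcript at the ASR's own uncertainty. The sketch below assumes your ASR exposes word-level confidence scores (most production engines do, though the field names here are hypothetical) and flags near-homophone number words the recognizer wasn't sure about, so the agent can re-confirm instead of acting on a guess:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # hypothetical per-word ASR confidence, 0.0-1.0

# Near-homophone number pairs (fifteen/fifty, thirteen/thirty, ...) are
# exactly where a clean-looking transcript can silently diverge from
# what the user actually said.
NUMBER_WORDS = {
    "thirteen", "thirty", "fourteen", "forty", "fifteen", "fifty",
    "sixteen", "sixty", "seventeen", "seventy", "eighteen", "eighty",
    "nineteen", "ninety",
}

def flag_risky_amounts(words, threshold=0.85):
    """Return number words the ASR was not confident about."""
    return [w.text for w in words
            if w.text.lower() in NUMBER_WORDS and w.confidence < threshold]

turn = [Word("transfer", 0.97), Word("fifteen", 0.62), Word("dollars", 0.95)]
flag_risky_amounts(turn)  # → ["fifteen"]
```

A transcript-only evaluator never sees that 0.62; a pipeline that does can route the turn to an explicit confirmation ("Was that fifteen, one-five, dollars?") before money moves.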
2. Transcripts hide latency problems
Voice conversations happen in real time. Delays that would be imperceptible in text become painfully obvious in speech.
Consider this transcript:
> User: What's my current balance?
> Agent: Your current balance is $1,247.32.
Looks great. The agent answered correctly and concisely.
But what if there was a 4-second pause between the user's question and the agent's response? In text, it's invisible. In a real conversation, it's an eternity. The user might repeat their question. They might hang up. They might lose trust in the system's competence.
Transcript analysis is blind to timing. An LLM evaluating the text above has no way to know whether the exchange felt natural or agonizingly slow.
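Timing has to come from outside the transcript. A minimal sketch, assuming your telephony or audio pipeline attaches start/end timestamps to each turn (the field names are illustrative), is to measure the gap between the end of a user turn and the start of the agent's reply:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str    # "user" or "agent"
    start_s: float  # turn start, seconds from call start
    end_s: float    # turn end

def slow_responses(turns, max_gap_s=1.5):
    """Find agent turns that began too long after the user stopped speaking."""
    slow = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gap = cur.start_s - prev.end_s
            if gap > max_gap_s:
                slow.append((cur.start_s, round(gap, 2)))
    return slow

turns = [Turn("user", 0.0, 2.1), Turn("agent", 6.3, 8.0)]
slow_responses(turns)  # → [(6.3, 4.2)]
```

The "perfect" balance exchange above would fail this check with a 4.2-second gap, even though its transcript would sail through any LLM evaluator.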
3. Transcripts can't verify task completion
"Based on this transcript, did the agent successfully book the user's appointment?"
This seems like a straightforward question for an LLM evaluator. The transcript shows the agent confirming an appointment for Tuesday at 2pm. The user says thank you and hangs up. Clear success, right?
Not necessarily.
The transcript shows what was said. It doesn't show what was done. Maybe the API call to the booking system failed. Maybe the appointment was created for the wrong location. Maybe a duplicate booking was created. Maybe the confirmation was sent to the wrong email address.
Task completion requires verification against the actual state of the world - checking the database, confirming the API response, validating the downstream effects. A transcript only shows the conversation about the task, not whether the task actually happened.
4. Transcripts collapse multi-modal failures
Voice agents don't exist in isolation. They often work alongside SMS confirmations, email follow-ups, app integrations, and other channels. Users expect these channels to be consistent.
A transcript might show:
> Agent: I've sent a confirmation to your email at john.smith@email.com.
> User: Perfect, thanks!
The LLM evaluator sees a completed action and a satisfied user. But what if:
- The email was never sent due to a downstream failure?
- The email was sent to the wrong address?
- The email content didn't match what the agent described?
- The email arrived 3 hours later?
The transcript captures the promise. It can't capture whether the promise was kept.
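Checking promise against delivery means joining the conversation with logs from the other channel. As a sketch, assume your email provider exposes a delivery log (the dict shape here is hypothetical); then each "I've sent a confirmation to..." claim in a transcript can be verified rather than trusted:

```python
def verify_email_promise(promised_address, send_log):
    """Check a promise made in the call against what the email system did.

    Returns a reason string on mismatch, or None if the promise was kept.
    """
    matches = [e for e in send_log if e["to"] == promised_address]
    if not matches:
        return "no send attempt for the promised address"
    if not any(e["status"] == "delivered" for e in matches):
        return "send attempted but not delivered"
    return None

log = [{"to": "john.smith@email.com", "status": "bounced"}]
verify_email_promise("john.smith@email.com", log)
# → "send attempted but not delivered"
```

The same pattern generalizes to SMS, push notifications, and app state: extract the promise from the conversation, then check the system that was supposed to keep it.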
5. Transcripts miss the conversations that never happened
Perhaps the biggest blind spot: transcript analysis only evaluates conversations you captured. It tells you nothing about:
- Calls that were abandoned before connecting to the agent
- Users who hung up during long queue times or system messages
- Sessions that crashed or terminated unexpectedly
- Users who gave up after multiple failed attempts and never called back
These are often your most dissatisfied users - and they leave no transcript to evaluate. A team that only measures transcript quality can have excellent scores while hemorrhaging frustrated customers who never make it into the dataset.
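Measuring this gap requires a data source upstream of the transcripts: your telephony provider's call records. A rough sketch, assuming a per-call export with an `outcome` field (an illustrative shape, not any particular provider's API), is a funnel that counts the calls your transcript dataset never sees:

```python
def call_funnel(telephony_events):
    """Summarize where calls ended, including the ones with no transcript."""
    counts = {"connected": 0}
    for call in telephony_events:
        counts[call["outcome"]] = counts.get(call["outcome"], 0) + 1
    total = len(telephony_events)
    counts["pct_never_reached_agent"] = round(
        100 * (total - counts["connected"]) / total, 1) if total else 0.0
    return counts

events = ([{"outcome": "connected"}] * 8
          + [{"outcome": "abandoned_in_queue"}] * 2)
call_funnel(events)["pct_never_reached_agent"]  # → 20.0
```

If that percentage is material, your transcript-level quality score is being computed over a survivorship-biased sample.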
The "looks good" problem
LLM evaluators are trained on human preferences, which means they're optimized to identify conversations that look good to a reader. But looking good and being good are different things.
An agent that:
- Uses warm, empathetic language
- Mirrors the user's concerns back to them
- Sounds confident and professional
- Ends conversations with clear next steps
...will score well on transcript analysis even if it:
- Misunderstood the user's actual request
- Provided incorrect information
- Failed to complete the requested task
- Left the user worse off than before the call
LLM-as-Judge is essentially asking: "If I read this conversation, would I think it went well?" That's a very different question from: "Did this conversation achieve the user's goal?"
Beyond transcripts: what real evaluation requires
If transcript analysis is insufficient, what does sufficient evaluation look like?
End-to-end task verification
Real evaluation doesn't just check what was said - it checks what happened. Did the appointment appear in the system? Did the transfer complete? Did the email arrive? Did the account settings actually change?
This requires instrumenting your entire stack, not just your conversation layer. It means correlating agent actions with downstream system states. It's more complex than transcript analysis, but it's the only way to measure what matters.
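Concretely, verification means diffing the agent's claims against the system of record. A minimal sketch, where `fetch_booking` stands in for your real booking API client (any callable from booking id to stored record) and all field names are illustrative:

```python
def verify_booking(claimed, fetch_booking):
    """Compare what the agent said against what the booking system stored.

    Returns a list of discrepancies; empty means the claim checks out.
    """
    record = fetch_booking(claimed["booking_id"])
    if record is None:
        return ["booking not found in system of record"]
    return [f"{field}: said {claimed[field]!r}, stored {record[field]!r}"
            for field in ("time", "location")
            if record.get(field) != claimed[field]]

# A fake system of record, standing in for the real API.
db = {"bk-1": {"time": "Tue 14:00", "location": "Downtown"}}
claimed = {"booking_id": "bk-1", "time": "Tue 14:00", "location": "Uptown"}
verify_booking(claimed, db.get)
# → ["location: said 'Uptown', stored 'Downtown'"]
```

This catches exactly the failures transcript analysis cannot: the conversation that confirmed Tuesday at 2pm at one branch while the database quietly recorded another.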
Audio-level analysis
The transcript is a lossy compression of the conversation. Working with audio directly lets you capture:
- Speech recognition confidence and alternative hypotheses
- Timing and latency throughout the conversation
- Acoustic features like background noise and speech quality
- Interruptions, overlapping speech, and conversation dynamics
Audio-level evaluation is harder, but it measures what users actually experience rather than what the ASR thought they said.
Outcome correlation
The gold standard for voice agent evaluation is correlating against business outcomes: task completion rates, customer satisfaction scores, repeat call rates, and downstream metrics that matter to your business.
This requires tracking users across interactions and systems. It's not something you can do with a single prompt to an LLM. But it's the only way to know whether your agent is actually working.
Proactive testing
Transcript analysis is reactive - you can only evaluate conversations that already happened. Proactive evaluation uses synthetic callers to test your agent before real users encounter it.
With scenario-based testing, you can:
- Verify task completion end-to-end
- Test against specific acoustic conditions
- Explore edge cases systematically
- Catch regressions before they reach production
Reactive analysis tells you about yesterday's failures. Proactive testing helps you prevent tomorrow's.
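The shape of such a test suite can be simple even when the harness behind it isn't. In this sketch, `place_call` stands in for a synthetic-caller harness (it drives a real call from a script and returns a call id), and each scenario carries its own end-to-end verification against the world:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One synthetic-caller test: what to say, and what must end up true."""
    name: str
    caller_script: list  # utterances the synthetic caller will speak
    verify: callable     # checks the world after the call; True = pass

def run_scenarios(scenarios, place_call):
    """Run each scenario through the agent and verify the outcome."""
    results = {}
    for s in scenarios:
        call_id = place_call(s.caller_script)
        results[s.name] = s.verify(call_id)
    return results

# Fakes standing in for the real harness and system of record:
bookings = {}
def fake_call(script):
    bookings["call-1"] = {"time": "Tue 14:00"}  # pretend the agent booked it
    return "call-1"

s = Scenario("book_appointment", ["I'd like Tuesday at 2pm"],
             verify=lambda cid: bookings.get(cid, {}).get("time") == "Tue 14:00")
run_scenarios([s], fake_call)  # → {"book_appointment": True}
```

Run on every deploy, a suite like this turns "we think the agent still books appointments" into a checked invariant.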
When transcript analysis makes sense
This isn't to say transcript analysis is useless. It has legitimate applications:
Spot-checking at scale. When you have thousands of daily conversations, LLM analysis can surface transcripts worth human review. It's a filtering tool, not a measurement tool.
Tone and compliance auditing. For questions like "Did the agent use approved language?" or "Did the agent follow disclosure requirements?", transcript analysis works well.
Coaching and training. Reviewing transcripts with human agents is valuable for identifying coaching opportunities. LLM analysis can accelerate this by pre-categorizing conversations.
Directional quality signals. If your transcript quality scores drop suddenly, something probably changed. The absolute score may be unreliable, but the relative trend can be informative.
The key is understanding what you're measuring and what you're not. Use transcript analysis for what it's good at. Don't mistake it for comprehensive evaluation.
The path forward
Moving beyond transcript analysis requires a shift in how teams think about voice AI evaluation:
From text to behavior. Evaluate what the agent does, not just what it says.
From reactive to proactive. Don't wait for production failures to learn about problems.
From proxy metrics to outcomes. Measure task completion and user satisfaction, not just conversation quality scores.
From sampled review to systematic testing. Define test scenarios that cover your critical paths and run them automatically, repeatedly.
This is harder than passing transcripts to an LLM. It requires more infrastructure, more instrumentation, and more intentionality about what you're measuring.
But it's the difference between feeling confident and being confident. Between having good scores and having good outcomes. Between thinking your voice agent works and knowing it does.
Conclusion
LLM-as-Judge for transcript analysis is seductive because it's easy. You already have transcripts. You already have access to powerful LLMs. The setup is minimal and the output looks authoritative.
But the ease is deceptive. You're measuring a shadow of the conversation - a text representation that strips away timing, audio quality, ASR errors, and most importantly, whether anything actually worked.
The teams building voice agents that succeed in production are the ones who've moved beyond transcript analysis. They're measuring behavior, not words. Outcomes, not proxies. Reality, not representations.
Your LLM evaluator might say your agent is doing great. Your users might disagree. The question is: which signal are you building on?
Related Articles
Beyond the Demo: Why Voice Agents Break in the Real World
The demo went flawlessly. Then came the support tickets. Understanding the gap between controlled testing and production chaos is the first step toward building voice agents that actually work.
Stress-Testing Voice AI: Finding Your Agent's Breaking Points
Every voice agent has limits. The question isn't whether they exist—it's whether you've found them before your users do. A systematic approach to behavioral limit testing.