LLM-as-Judge is Not Enough: Why Transcript Analysis Falls Short
Introduction
The appeal is obvious. Take your voice agent's transcripts, feed them to GPT-4, and ask: "Did the agent handle this conversation well?" You get a score, maybe some reasoning, and you move on. It's fast, it's cheap, and it feels like you're measuring something real.
But you're not.
LLM-as-Judge for transcript analysis has become the default evaluation approach for many voice AI teams. And while it's better than nothing, it systematically misses the failure modes that actually hurt your users and your business.
This isn't a theoretical concern. We've seen teams with "95% quality scores" from their LLM evaluators who still had 20% task failure rates in production. The gap between what transcript analysis measures and what users experience is larger than most teams realize.
What transcript analysis actually measures
When you pass a transcript to an LLM and ask it to evaluate quality, you're measuring a narrow slice of the conversation:
The words that were said. The LLM sees the text - what the agent said, what the user said, and the order of turns.
Coherence of responses. Does the agent's language make sense given what the user asked?
Politeness and tone. Is the agent courteous, professional, patient?
Apparent task completion. Based on the dialogue, does it look like the user's request was handled?
These are all reasonable things to measure. But they're proxies - and proxies that can be deeply misleading.
The five blind spots of transcript analysis
1. Transcripts don't capture what the user actually said
Here's a fundamental problem: the transcript you're evaluating isn't what the user said. It's what your ASR system thought the user said.
When a user says "I need to transfer fifty dollars to my savings account" and your ASR transcribes it as "I need to transfer fifteen dollars to my savings account," the transcript looks perfect. The agent handles a fifteen-dollar transfer request flawlessly. Your LLM evaluator gives it high marks.
But the user wanted fifty dollars transferred, and fifty is what they'll expect to see. The transcript shows a successful conversation. The user experiences a failure.
This isn't an edge case. ASR errors are endemic in production voice systems, especially for:
- Numbers and amounts (the most consequential data in many conversations)
- Names and proper nouns
- Homophones and near-homophones
- Accented speech
- Noisy environments
Transcript analysis evaluates the transcript. It can't evaluate the gap between transcript and reality.
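One partial mitigation is to look below the transcript at the ASR's own uncertainty. The sketch below assumes your ASR exposes word-level confidence scores (most production engines do, though the field names here are hypothetical) and flags near-homophone number words the recognizer wasn't sure about, so the agent can re-confirm instead of acting on a guess:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # hypothetical per-word ASR confidence, 0.0-1.0

# Near-homophone number pairs (fifteen/fifty, thirteen/thirty, ...) are
# exactly where a clean-looking transcript can silently diverge from
# what the user actually said.
NUMBER_WORDS = {
    "thirteen", "thirty", "fourteen", "forty", "fifteen", "fifty",
    "sixteen", "sixty", "seventeen", "seventy", "eighteen", "eighty",
    "nineteen", "ninety",
}

def flag_risky_amounts(words, threshold=0.85):
    """Return number words the ASR was not confident about."""
    return [w.text for w in words
            if w.text.lower() in NUMBER_WORDS and w.confidence < threshold]

turn = [Word("transfer", 0.97), Word("fifteen", 0.62), Word("dollars", 0.95)]
flag_risky_amounts(turn)  # → ["fifteen"]
```

A transcript-only evaluator never sees that 0.62; a pipeline that does can route the turn to an explicit confirmation ("Was that fifteen, one-five, dollars?") before money moves.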
2. Transcripts hide latency problems
Voice conversations happen in real time. Delays that would be imperceptible in text become painfully obvious in speech.
Consider this transcript:
> User: What's my current balance?
> Agent: Your current balance is $1,247.32.
Looks great. The agent answered correctly and concisely.
But what if there was a 4-second pause between the user's question and the agent's response? In text, it's invisible. In a real conversation, it's an eternity. The user might repeat their question. They might hang up. They might lose trust in the system's competence.
Transcript analysis is blind to timing. An LLM evaluating the text above has no way to know whether the exchange felt natural or agonizingly slow.
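Timing has to come from outside the transcript. A minimal sketch, assuming your telephony or audio pipeline attaches start/end timestamps to each turn (the field names are illustrative), is to measure the gap between the end of a user turn and the start of the agent's reply:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str    # "user" or "agent"
    start_s: float  # turn start, seconds from call start
    end_s: float    # turn end

def slow_responses(turns, max_gap_s=1.5):
    """Find agent turns that began too long after the user stopped speaking."""
    slow = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gap = cur.start_s - prev.end_s
            if gap > max_gap_s:
                slow.append((cur.start_s, round(gap, 2)))
    return slow

turns = [Turn("user", 0.0, 2.1), Turn("agent", 6.3, 8.0)]
slow_responses(turns)  # → [(6.3, 4.2)]
```

The "perfect" balance exchange above would fail this check with a 4.2-second gap, even though its transcript would sail through any LLM evaluator.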
3. Transcripts can't verify task completion
"Based on this transcript, did the agent successfully book the user's appointment?"
This seems like a straightforward question for an LLM evaluator. The transcript shows the agent confirming an appointment for Tuesday at 2pm. The user says thank you and hangs up. Clear success, right?
Not necessarily.
The transcript shows what was said. It doesn't show what was done. Maybe the API call to the booking system failed. Maybe the appointment was created for the wrong location. Maybe a duplicate booking was created. Maybe the confirmation was sent to the wrong email address.
Task completion requires verification against the actual state of the world - checking the database, confirming the API response, validating the downstream effects. A transcript only shows the conversation about the task, not whether the task actually happened.
4. Transcripts collapse multi-modal failures
Voice agents don't exist in isolation. They often work alongside SMS confirmations, email follow-ups, app integrations, and other channels. Users expect these channels to be consistent.
A transcript might show:
> Agent: I've sent a confirmation to your email at john.smith@email.com.
> User: Perfect, thanks!
The LLM evaluator sees a completed action and a satisfied user. But what if:
- The email was never sent due to a downstream failure?
- The email was sent to the wrong address?
- The email content didn't match what the agent described?
- The email arrived 3 hours later?
The transcript captures the promise. It can't capture whether the promise was kept.
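Checking promise against delivery means joining the conversation with logs from the other channel. As a sketch, assume your email provider exposes a delivery log (the dict shape here is hypothetical); then each "I've sent a confirmation to..." claim in a transcript can be verified rather than trusted:

```python
def verify_email_promise(promised_address, send_log):
    """Check a promise made in the call against what the email system did.

    Returns a reason string on mismatch, or None if the promise was kept.
    """
    matches = [e for e in send_log if e["to"] == promised_address]
    if not matches:
        return "no send attempt for the promised address"
    if not any(e["status"] == "delivered" for e in matches):
        return "send attempted but not delivered"
    return None

log = [{"to": "john.smith@email.com", "status": "bounced"}]
verify_email_promise("john.smith@email.com", log)
# → "send attempted but not delivered"
```

The same pattern generalizes to SMS, push notifications, and app state: extract the promise from the conversation, then check the system that was supposed to keep it.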
5. Transcripts miss the conversations that never happened
Perhaps the biggest blind spot: transcript analysis only evaluates conversations you captured. It tells you nothing about:
- Calls that were abandoned before connecting to the agent
- Users who hung up during long queue times or system messages
- Sessions that crashed or terminated unexpectedly
- Users who gave up after multiple failed attempts and never called back
These are often your most dissatisfied users - and they leave no transcript to evaluate. A team that only measures transcript quality can have excellent scores while hemorrhaging frustrated customers who never make it into the dataset.
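Measuring this gap requires a data source upstream of the transcripts: your telephony provider's call records. A rough sketch, assuming a per-call export with an `outcome` field (an illustrative shape, not any particular provider's API), is a funnel that counts the calls your transcript dataset never sees:

```python
def call_funnel(telephony_events):
    """Summarize where calls ended, including the ones with no transcript."""
    counts = {"connected": 0}
    for call in telephony_events:
        counts[call["outcome"]] = counts.get(call["outcome"], 0) + 1
    total = len(telephony_events)
    counts["pct_never_reached_agent"] = round(
        100 * (total - counts["connected"]) / total, 1) if total else 0.0
    return counts

events = ([{"outcome": "connected"}] * 8
          + [{"outcome": "abandoned_in_queue"}] * 2)
call_funnel(events)["pct_never_reached_agent"]  # → 20.0
```

If that percentage is material, your transcript-level quality score is being computed over a survivorship-biased sample.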
The "looks good" problem
LLM evaluators are trained on human preferences, which means they're optimized to identify conversations that look good to a reader. But looking good and being good are different things.
An agent that:
- Uses warm, empathetic language
- Mirrors the user's concerns back to them
- Sounds confident and professional
- Ends conversations with clear next steps
...will score well on transcript analysis even if it:
- Misunderstood the user's actual request
- Provided incorrect information
- Failed to complete the requested task
- Left the user worse off than before the call
LLM-as-Judge is essentially asking: "If I read this conversation, would I think it went well?" That's a very different question from: "Did this conversation achieve the user's goal?"
Beyond transcripts: what real evaluation requires
If transcript analysis is insufficient, what does sufficient evaluation look like?
End-to-end task verification
Real evaluation doesn't just check what was said - it checks what happened. Did the appointment appear in the system? Did the transfer complete? Did the email arrive? Did the account settings actually change?
This requires instrumenting your entire stack, not just your conversation layer. It means correlating agent actions with downstream system states. It's more complex than transcript analysis, but it's the only way to measure what matters.
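Concretely, verification means diffing the agent's claims against the system of record. A minimal sketch, where `fetch_booking` stands in for your real booking API client (any callable from booking id to stored record) and all field names are illustrative:

```python
def verify_booking(claimed, fetch_booking):
    """Compare what the agent said against what the booking system stored.

    Returns a list of discrepancies; empty means the claim checks out.
    """
    record = fetch_booking(claimed["booking_id"])
    if record is None:
        return ["booking not found in system of record"]
    return [f"{field}: said {claimed[field]!r}, stored {record[field]!r}"
            for field in ("time", "location")
            if record.get(field) != claimed[field]]

# A fake system of record, standing in for the real API.
db = {"bk-1": {"time": "Tue 14:00", "location": "Downtown"}}
claimed = {"booking_id": "bk-1", "time": "Tue 14:00", "location": "Uptown"}
verify_booking(claimed, db.get)
# → ["location: said 'Uptown', stored 'Downtown'"]
```

This catches exactly the failures transcript analysis cannot: the conversation that confirmed Tuesday at 2pm at one branch while the database quietly recorded another.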
Audio-level analysis
The transcript is a lossy compression of the conversation. Working with audio directly lets you capture:
- Speech recognition confidence and alternative hypotheses
- Timing and latency throughout the conversation
- Acoustic features like background noise and speech quality
- Interruptions, overlapping speech, and conversation dynamics
Audio-level evaluation is harder, but it measures what users actually experience rather than what the ASR thought they said.
Outcome correlation
The gold standard for voice agent evaluation is correlating against business outcomes: task completion rates, customer satisfaction scores, repeat call rates, and downstream metrics that matter to your business.
This requires tracking users across interactions and systems. It's not something you can do with a single prompt to an LLM. But it's the only way to know whether your agent is actually working.
Proactive testing
Transcript analysis is reactive - you can only evaluate conversations that already happened. Proactive evaluation uses synthetic callers to test your agent before real users encounter it.
With scenario-based testing, you can:
- Verify task completion end-to-end
- Test against specific acoustic conditions
- Explore edge cases systematically
- Catch regressions before they reach production
Reactive analysis tells you about yesterday's failures. Proactive testing helps you prevent tomorrow's.
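The shape of such a test suite can be simple even when the harness behind it isn't. In this sketch, `place_call` stands in for a synthetic-caller harness (it drives a real call from a script and returns a call id), and each scenario carries its own end-to-end verification against the world:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One synthetic-caller test: what to say, and what must end up true."""
    name: str
    caller_script: list  # utterances the synthetic caller will speak
    verify: callable     # checks the world after the call; True = pass

def run_scenarios(scenarios, place_call):
    """Run each scenario through the agent and verify the outcome."""
    results = {}
    for s in scenarios:
        call_id = place_call(s.caller_script)
        results[s.name] = s.verify(call_id)
    return results

# Fakes standing in for the real harness and system of record:
bookings = {}
def fake_call(script):
    bookings["call-1"] = {"time": "Tue 14:00"}  # pretend the agent booked it
    return "call-1"

s = Scenario("book_appointment", ["I'd like Tuesday at 2pm"],
             verify=lambda cid: bookings.get(cid, {}).get("time") == "Tue 14:00")
run_scenarios([s], fake_call)  # → {"book_appointment": True}
```

Run on every deploy, a suite like this turns "we think the agent still books appointments" into a checked invariant.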
When transcript analysis makes sense
This isn't to say transcript analysis is useless. It has legitimate applications:
Spot-checking at scale. When you have thousands of daily conversations, LLM analysis can surface transcripts worth human review. It's a filtering tool, not a measurement tool.
Tone and compliance auditing. For questions like "Did the agent use approved language?" or "Did the agent follow disclosure requirements?", transcript analysis works well.
Coaching and training. Reviewing transcripts with human agents is valuable for identifying coaching opportunities. LLM analysis can accelerate this by pre-categorizing conversations.
Directional quality signals. If your transcript quality scores drop suddenly, something probably changed. The absolute score may be unreliable, but the relative trend can be informative.
The key is understanding what you're measuring and what you're not. Use transcript analysis for what it's good at. Don't mistake it for comprehensive evaluation.
The path forward
Moving beyond transcript analysis requires a shift in how teams think about voice AI evaluation:
From text to behavior. Evaluate what the agent does, not just what it says.
From reactive to proactive. Don't wait for production failures to learn about problems.
From proxy metrics to outcomes. Measure task completion and user satisfaction, not just conversation quality scores.
From sampled review to systematic testing. Define test scenarios that cover your critical paths and run them automatically, repeatedly.
This is harder than passing transcripts to an LLM. It requires more infrastructure, more instrumentation, and more intentionality about what you're measuring.
But it's the difference between feeling confident and being confident. Between having good scores and having good outcomes. Between thinking your voice agent works and knowing it does.
Conclusion
LLM-as-Judge for transcript analysis is seductive because it's easy. You already have transcripts. You already have access to powerful LLMs. The setup is minimal and the output looks authoritative.
But the ease is deceptive. You're measuring a shadow of the conversation - a text representation that strips away timing, audio quality, ASR errors, and most importantly, whether anything actually worked.
The teams building voice agents that succeed in production are the ones who've moved beyond transcript analysis. They're measuring behavior, not words. Outcomes, not proxies. Reality, not representations.
Your LLM evaluator might say your agent is doing great. Your users might disagree. The question is: which signal are you building on?
Related Articles
Beyond the Demo: Why Voice Agents Break in the Real World
The demo went flawlessly. Then came the support tickets. Understanding the gap between controlled testing and production chaos is the first step toward building voice agents that actually work.
Stress-Testing Voice AI: Finding Your Agent's Breaking Points
Every voice agent has limits. The question isn't whether they exist—it's whether you've found them before your users do. A systematic approach to behavioral limit testing.