Reviews

Challenge LLM judgements with human-in-the-loop reviews

When the AI gets a verdict wrong, flag it. Submit an appeal, provide context, and get the outcome corrected — keeping your evaluation scores accurate.

Appeal submitted

Corrected

SSR updated

Audit trail

Review: Pending

Refund reason collected

Success condition — Refund request handling

LLM verdict: Failed

Appealed by: QA Team

Appeal comment

"The caller confirmed the reason at turn 4 — the LLM missed the implicit confirmation."

What is a voice agent review?

A review is a human-in-the-loop correction. When an LLM scores a condition incorrectly, you submit an appeal explaining why the judgement is wrong. A reviewer examines the evidence, approves or rejects the appeal, and — if approved — corrects the outcome and recalculates your metrics.

Reviews ensure your evaluation scores stay accurate by letting domain experts correct the mistakes that automated scoring inevitably makes.

How does the review process work?

Flag a judgement

Select a condition from your evaluation results and challenge the LLM's verdict. See the evidence and transcript context before flagging.

Condition result

Condition: Refund reason collected

Type: Success

LLM verdict: Failed

Evidence from transcript

Turn 4: "Yeah it was because the item arrived damaged"

Flag for review →

Submit your appeal

Provide your comment explaining why the judgement is incorrect. Include references to specific turns or evidence the LLM missed.

Submit appeal

Condition

Refund reason collected

Current verdict

Failed → Should pass

Your comment

"The caller confirmed the reason at turn 4 — the LLM missed the implicit confirmation when the caller said 'it arrived damaged'."

Submit appeal →
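
As a rough illustration of what a structured appeal carries, the fields shown above can be modelled as a small record. This is a minimal sketch, not Evalgent's API: the `Appeal` and `Verdict` names, the field names, and the two validation rules are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"


@dataclass(frozen=True)
class Appeal:
    """Hypothetical appeal record; field names are illustrative only."""
    condition: str
    current_verdict: Verdict
    proposed_verdict: Verdict
    comment: str
    appealed_by: str

    def __post_init__(self) -> None:
        # Assumed rule: an appeal has to ask for a different outcome.
        if self.proposed_verdict == self.current_verdict:
            raise ValueError("an appeal must propose a different verdict")
        # Assumed rule: the reviewer needs an explanation to act on.
        if not self.comment.strip():
            raise ValueError("an appeal needs a comment explaining the error")


appeal = Appeal(
    condition="Refund reason collected",
    current_verdict=Verdict.FAILED,
    proposed_verdict=Verdict.PASSED,
    comment="The caller confirmed the reason at turn 4.",
    appealed_by="QA Team",
)
```

Requiring a comment and a changed verdict up front keeps empty or no-op appeals out of the reviewer's queue.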

Get a decision

A reviewer examines the appeal, the original evidence, and the transcript. They approve with a corrected outcome or reject with notes.

Review decision

Condition: Refund reason collected

Original verdict: Failed

Decision: Approved

Corrected outcome: Passed

Reviewer note

"Implicit confirmation is valid — turn 4 clearly states the reason."

SSR impact

78% → 82%
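
To make the score impact concrete, here is a minimal sketch of how approved appeals could move a metric. It assumes, purely for illustration, that the score is the share of passed conditions; this page does not define how SSR is calculated, and the `ConditionResult`, `pass_rate`, and `apply_decision` names and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConditionResult:
    name: str
    passed: bool  # current outcome; starts as the LLM verdict


def pass_rate(results):
    """Percentage of passed conditions (an assumed stand-in for the score)."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


def apply_decision(result, approved, corrected_passed):
    """Overwrite the outcome only when the reviewer approved the appeal."""
    if approved:
        result.passed = corrected_passed
    return result


# Illustrative data: 50 conditions, 39 of them passed by the LLM.
results = [ConditionResult(f"condition-{i}", passed=i < 39) for i in range(50)]
before = pass_rate(results)

apply_decision(results[39], approved=True, corrected_passed=True)
apply_decision(results[40], approved=True, corrected_passed=True)
apply_decision(results[41], approved=False, corrected_passed=True)  # rejected

after = pass_rate(results)
print(f"{before:.0f}% -> {after:.0f}%")
```

In this made-up data set, two approved appeals raise the rate by four points while the rejected appeal leaves its condition unchanged.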

What you get back

Corrected outcomes

Approved appeals replace the original LLM judgement with the correct outcome

Recalculated metrics

SSR scores and pass/fail verdicts update automatically after corrections

Audit trail

Every appeal, decision, and reviewer note is preserved for traceability
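
One way to picture the audit trail is as an append-only log of events per condition. The sketch below is an assumption about shape only: the `AuditTrail` class, its methods, and the event fields are hypothetical and do not describe how Evalgent stores review history.

```python
from datetime import datetime, timezone


class AuditTrail:
    """Hypothetical append-only log: events are added, never edited."""

    def __init__(self):
        self._events = []

    def record(self, kind, condition, **details):
        event = {
            "at": datetime.now(timezone.utc).isoformat(),
            "kind": kind,
            "condition": condition,
            **details,
        }
        self._events.append(event)
        return event

    def history(self, condition):
        # Return events for one condition in the order they were recorded.
        return [e for e in self._events if e["condition"] == condition]


trail = AuditTrail()
trail.record(
    "appeal_submitted",
    "Refund reason collected",
    by="QA Team",
    comment="The caller confirmed the reason at turn 4.",
)
trail.record(
    "decision",
    "Refund reason collected",
    decision="approved",
    corrected_outcome="passed",
    note="Implicit confirmation is valid.",
)

for event in trail.history("Refund reason collected"):
    print(event["kind"], event["at"])
```

Because events are only appended, the original LLM verdict, the appeal, and the reviewer's decision all remain visible after the outcome is corrected.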

The difference human reviews make

Without reviews

Trust the LLM blindly

Accept every LLM judgement at face value
No way to correct false positives or false negatives
Metrics drift from reality over time

With Evalgent Reviews

Human-corrected accuracy

Challenge any verdict with a structured appeal
Corrected outcomes feed back into your scores
Continuous improvement loop between human and AI

Know if your voice agent is ready for production

Functional

Behavioral

Limit