Roll Roll AI: Build // AI.
Posts
"Evals" Aren’t Just for Models, They’re for Products

"Evals" Aren’t Just for Models, They’re for Products

Jeremy Lezac
April 26, 2025

Evals: Everyone talks about them, few get them right. If your AI feels impressive but isn’t delivering clear business results, your eval process probably needs fixing.

As for the name? No need to panic, it’s just shorthand for evaluations. That’s it.

And evaluating something is about as basic (and crucial) as it gets. If you build a product, you want to know:

👉 Is this thing doing what it’s supposed to do?
👉 Is it actually useful, safe, and worth continuing?

In the world of AI, especially with the wave of large language models and generative tools, teams often skip this question entirely. They jump straight to demos, dashboards, or “let’s integrate this into production” without first pausing to ask: Does it work? And does it matter?

An evaluation isn’t just about technical performance.
It’s about viability.

It’s not enough for an AI to generate fluent text, identify images, or make recommendations. The real question is:

Is it solving a real problem in a way that moves the needle for users or the business?

When you treat evals as a checkbox, you end up optimizing for surface metrics. When you treat them as a process, you build better products.

Quick Takeaways:

Adversarial testing is critical, not optional.

Technical performance ≠ real-world usefulness.

Foundation models need human-driven, nuanced evals.

Clear evaluation = better AI products, fewer costly mistakes.

Traditional ML: Great Models, Meh Products?

Before foundation models were writing poems or helping draft legal contracts, we had “traditional” machine learning. I will use “traditional ML” to refer to all ML before foundation models (ie: genAI, LLM, LMM, SLM, …)

Think: regression models, decision trees, XGBoost, logistic classifiers… back when data scientists still felt cool showing off ROC curves.

In traditional ML, evals were crystal clear: you had your train/test split, ran your model, and checked metrics like accuracy, precision, recall, F₁-score, or AUC. All good, right?

Well, technically, yes. But also: not even close to enough.

Optimizing models to look good on paper provide no information about it usefulness in the real world.

A telco company builds a churn prediction model. After a while, the engineering teams hits an impressive 93% AUC. They are thrilled, the dashboards are beautiful and at first, even the leadership claps.

What went right:
- Solid modeling.
- Strong evaluation metrics.
- Excellent offline performance.

But there were a fatal flaw:
While the model identified likely churners, there was no next step:

- Marketing didn’t have personalized offers ready to catch-up the churners
- Ops didn’t have automation to trigger interventions.
- Customer success teams weren’t even looped in.

Result: Zero business impact. The customers still churned as before. The model while technically great was invisible business-wise.

Technical performance is the floor, not the ceiling.

The evaluation mindset is often siloed within the model development phase. What is truly needed is cross-functional evaluation that looks at:

Model quality (sure),
But also business impact,
Operational integration,
And the cost/benefit of maintaining it over time.

That’s the key shift: moving from “Does the model work?” to “Does the product built on this model create value?”

Spoiler: This problem didn’t go away with foundation models. In fact, it got more chaotic.

Foundation models change the game, including evaluations

Foundation models aren’t just “bigger”, they are a whole new beast. With traditional ML, evals were clear: you had labeled data, clean metrics and controlled outputs.

But with foundation models, you are not evaluating numbers, you are also evaluating behavior.

And that changes everything.

Why Classical Metrics Break Down

Ask an LLM to write a product spec or summarize a legal document, and… how exactly do you measure success? BLEU score? Nope. AUC? Not even close.

LLM outputs are:

Open-ended
Subjective
Contextual
Can be “technically wrong, but practically useful”

Which means that evaluating them requires human judgment, task-specific rubrics, and above all: process.

Too many teams slap together a helpfulness or correctness score and call it done. That’s like checking your heart rate and declaring yourself an Olympian. It’s not wrong… it’s just wildly insufficient.

LLM Product Evaluation Is Messy and That’s a Good Thing

High-performing teams do things differently:

They build simple, custom tools to inspect real outputs.
They use error analysis to spot high-leverage failures.
They involve domain experts (not just engineers) in refining prompts and outputs.

A real estate AI assistant struggling with rescheduling tours.

The fix didn’t come from a better model. It came from annotating real chat logs, finding the date handling bugs, and iterating based on actual errors. Result? 33% ➝ 95% success rate.

And they don’t stop there. They test outputs not just for correctness, but for trustworthiness, safety, and brand alignment.

Safety? Brand Alignment? Let’s go deeper with adversarial testing.

Adversarial testing: The Evals You Can’t Ignore

When you're deploying LLMs into users-facing products, there’s a whole class of risk that doesn’t show up on your eval dashboard:

Can your AI be hijacked with a tricky prompt?
Can it recommend illegal or misleading content?
Can it hallucinate a toxic answer and blow up your brand?

This is where adversarial testing (aka red teaming) comes in:
You intentionally stress-test your model with:

Harmful prompts
Brand-damaging scenarios
Forbidden content traps
Prompt leakage simulations
Misleading product offers

Think of it like Chaos Testing but for your foundational model.

It’s not paranoia. It’s defense.
Because every product is one weird query away from a viral PR disaster.

Oopsy. (Source: Wired)

“LLM-as-Judge” Won’t Save You Either

You’ve probably seen this trend: “We’ll use an LLM to evaluate other LLMs.”

Useful? Yes. Dangerous shortcut? Also yes.

An LLM-as-a-judge is only as good as your evaluation process.

If your criteria are vague or misaligned with user needs, automating them just speeds up your descent into irrelevance.

Automated evals work best when:

They're aligned with real human judgments (via binary pass/fail + detailed critique).
You regularly calibrate them against real examples.
You use them to scale insights, not replace critical thinking.

What This Means for You

If you're building with foundation models, your evaluation system is key to your product.
Without it, you’re just guessing its effectiveness.

You need:

Error logs that make sense to non-engineers
Evaluation frameworks that include business goals
Red teaming checklists for brand and compliance risks
Human + LLM judges that co-evolve as your product grows

Evals should not be treated as a final exam, passing the once-and-done checks and should be primed as an important part of your feedback loops.

Now, as a bonus, let’s explore a framework so that YOU can structure your evaluations to build better AI Products.

Bonus: How to Structure Evals That Actually Help You Build Better AI Products

You don’t need an academic paper or a 10-page rubric to start evaluating your AI solution.
But, if you have read this article, you know that you do need structure otherwise, your team will default to vibes, gut-checks, and subjective opinions.

Here’s a practical framework to structure your evals without drowning in complexity:

1. Define Success in Plain Language First

Ask:

What does “good” look like for this feature, to a user?
What happens if it goes wrong?

Write this out in normal words, like you're explaining it to someone in customer support. Because that’s who’ll deal with it if your AI misfires.

Example:
“Answer must be factually accurate, written in a professional tone, and avoid medical advice if user is not a doctor.”

The goal is not to define the most exhaustive success plan but to have a starting point.

2. Use Binary Check First, Add Nuance Later

Start with a simple Pass / Fail judgment:

Did it solve the user’s problem? Pass / Fail
Did it do it safely? Pass / Fail
Would we feel confident shipping this? Pass / Fail

Then optionally add qualitative notes to explain why it passed or failed. This is where nuance lives. And this is what teaches your team (and your automated evaluators) to get better.

3. Create a Judgment Set, Not Just a Test Set

A “judgment set” is a curated list of examples that:

Represent real use cases (not toy examples)
Include edge cases and failure modes
Reflect your personas and tone of voice
Are tied to your product’s goals and constraints

Each example should be short and auditable. Don’t test in theory, test in context.

4. Don’t forget adversarial testing

Add eval cases specifically designed to stress your system.
These should test:

Brand violations
Legal/regulatory compliance
Jailbreaks or adversarial prompts
Offensive or misleading outputs
Leaks of internal instructions or prompts

These aren’t “edge” cases anymore. They’re essential product risks.

5. Make Evals a Weekly Habit, Not a One-Time Event

The best AI teams don’t just evaluate once, they build evals into their workflows:

Use shared Airtables, spreadsheets, or custom data viewers to review outputs
Have cross-functional review sessions (PMs, UX, eng, legal)
Track trends over time: are we improving, regressing, or drifting?

Evaluation is a muscle. If you don’t train it regularly, it atrophies.

6. Automate With Guardrails

Use LLMs to scale your evals after you’ve built trust in your human evaluation process.

Train LLM judges using real human-labeled data
Validate regularly: do human and AI judges agree?
Use automated scoring for fast iteration, but always spot-check

Automation is your second line of defense, not your first.

🎯 Summary: Your Key Eval Components

Eval Component	Why It Matters
✅ Binary Check	Immediate clarity on usability & safety
🗣 Critique	Nuanced understanding of success & failure
📋 Judgment Sets	Real-world relevance, not theoretical accuracy
⚠️ Red Teaming	Preventing costly mistakes and brand risks
📈 Trend Tracking	Sustained product improvement over time
🤖 LLM-as-Judge	Scaling up evaluation without losing depth

What did you think of today’s edition? Hit reply or send me your feedback at [email protected] - I read and reply to every single email you send.

Reply

or to participate.