Blog
Gen AI

Episode 3: Stop Guessing, Start Measuring: A Framework for Enterprise AI Evaluation

This post is part of Incorta's Innovate with Intelligence webinar series, a four-part exploration of agentic AI built for enterprise teams. From design patterns to evaluation to governance, each session tackles a different layer of what it takes to move AI from demo to production. Catch the full series here.

There's a moment every AI team knows well. The demo goes perfectly, the agent answers everything correctly, stakeholders are impressed... Then you deploy it.
Users start phrasing questions in unexpected ways. Edge cases appear that nobody thought to test. A prompt tweak that improves one query quietly breaks three others. And suddenly, the team is back to manually checking outputs, hoping nothing slipped through.

In Episode 3 we tackled this problem head-on: how do you move from vibes-based testing to a rigorous, systematic evaluation framework for enterprise AI?

Watch the full episode, or keep reading for our step-by-step approach.

The Core Problem: Vibes don't scale

Traditional software is deterministic. You write a test, it passes or fails, and you know exactly why. Evaluation is largely static.

Agentic AI is fundamentally different. The same question, phrased slightly differently, can produce a different answer with no clear explanation. A change that sharpens performance on finance queries might silently break supply chain queries. And you won't know until a user complains, or worse, until a wrong answer causes a real business problem.

The shift required isn't just technical. It's philosophical: stop treating evaluation as a manual chore and start treating it as a governed data workload.

Build a golden dataset, not a random sample

The foundation of any serious evaluation framework is a golden dataset: a curated, governed set of test cases that lives in a structured table, not a loose CSV on someone's laptop.

But the contents of that dataset matter as much as the format. A stratified approach covers three layers:

  1. Happy Path: Does the agent get the basics right? These are the expected, well-formed questions with clean data. Necessary, but nowhere near sufficient.
  2. Noise and Edge Cases: What happens when a user misspells "revenue"? What if the data for a requested time period is missing? Does the agent fail gracefully, or does it hallucinate a number? Edge cases reveal how an agent behaves when the real world stops cooperating.
  3. Adversarial: Intentionally try to break it. Ask for data the agent shouldn't access. Try to override system instructions. Probe for prompt injection vulnerabilities. If your evaluation only covers the happy path, you're not testing. You're hoping.

What to Measure: three signals that actually matter

Vague metrics like "helpfulness" are hard to act on. Instead, measure the technical relationships between data components:

1. Context Recall: Did the agent retrieve the correct rows from the database? If the right data isn't in the context, the answer can't be right, no matter how well the LLM reasons.

2. Faithfulness: Is every claim in the agent's response actually supported by the retrieved data? This is the anti-hallucination check. An agent that sounds confident while making things up is worse than one that admits uncertainty.

3. Answer Relevance: Did the agent answer the specific question asked, or did it just summarize the data broadly and call it done?

Track these three signals independently. When something fails, you'll know immediately whether the problem is in retrieval or in reasoning, which tells you exactly where to fix it.

Automate the scoring with an LLM as judge

You can't have humans review thousands of test outputs every time you update a prompt. The solution: use a highly capable LLM to evaluate your agent's outputs against a strict rubric.

A well-designed judge prompt produces a structured JSON response with both a numerical score and a reasoning string, converting qualitative text into hard integers you can aggregate, average, and graph over time.

Tools like Promptfoo offer out-of-the-box assertions for common metrics (factual accuracy, format validity, keyword presence, LLM-as-judge scoring) without requiring teams to build evaluation logic from scratch.

Close the iteration loop: from hours to seconds

Here's where the architecture pays off. In a traditional setup, testing a prompt change means running evaluations, exporting to CSV, uploading to a BI tool, and waiting for a dashboard refresh. Context switching at every step.

When evaluation infrastructure lives in the same ecosystem as the agent (as Incorta's implementation does, using notebooks and dashboards on shared data), the loop collapses. Tweak a parameter, hit run, see the quality dashboard update instantly.

That speed of iteration is what separates teams that improve quickly from teams that stay stuck.

Where engineering meets business

Evaluation results shouldn't sit in an isolated data table. They should be the pulse of the product: visible, queryable, and directly tied to decisions.

A well-designed performance dashboard answers the questions stakeholders actually care about. Is the new model accurate enough to justify its higher cost? Did the latest prompt change improve performance or introduce regressions? Which specific test cases are failing most frequently, and what does that tell us to fix next?

Critically, this visibility should require zero manual effort. Test suites run automatically, nightly, or triggered by any model deployment or semantic layer change. Results push to the dashboard automatically. The health you see always reflects the current state of the system.

Offline evaluation is necessary - and it's not sufficient

There's a gap that even a rigorous golden dataset can't close: the difference between curated test inputs and real user behavior.

In the lab, inputs are clean and controlled. In production, users are messy, impatient, and unpredictable. Closing this gap requires online observability: watching what actually happens when real users interact with the agent.

The most valuable signals are often implicit. A user who interrupts the agent mid-generation is usually signaling a relevance failure. A user who asks the exact same question twice in a row didn't get what they needed the first time. High query volume paired with low active user count typically means people are getting stuck and retrying.

These silent signals are a rich source of data for improving agent performance. Every real-world interaction, including the ones the agent struggles with, is a high-value data point.

Feed those real-world failures back into your golden dataset. Use them to upgrade the system. This is the evaluation flywheel: production traffic doesn't just expose problems, it actively makes the agent smarter over time.

The 45% Rule: Why Complexity Is the Enemy

One research finding worth keeping front of mind: once a single agent crosses roughly 45% accuracy on a task, adding more agents to the system often makes overall performance worse. Errors cascade. Coordination overhead grows. System performance slips.

The implication: build simple, well-evaluated agents before layering complexity. Every added layer amplifies both the capability and the failure modes. Evaluation is what keeps you honest about which one is growing faster.

Deploying an AI agent is just the starting point. The teams that succeed in production are the ones that treat evaluation as a first-class engineering discipline: structured test cases, measurable metrics, automated scoring, continuous monitoring, and a feedback loop that turns real user behavior into system improvements.

Share this post

Get more from Incorta

Your data. No limits.