AI is popping up everywhere in software these days, and it’s getting smarter all the time with new “AI Agents.” Making sure they actually work well and do what they’re supposed to do is a big deal. If these smart systems aren’t tested properly, they could mess things up for users or even make unfair decisions – and nobody wants that.
So, as Quality Assurance (QA) folks in software development companies, getting our heads around how to approach AI agent evaluation and LLM evaluation techniques is crucial. It’s not just about learning something new; it’s about making sure the AI we’re putting out there is actually good and trustworthy.
This blog post is our take on how we, as QA, are tackling this new world of AI Agents, looking at how we can figure out if they’re working right using various AI metrics and tools, including exploring the potential of using LLM as Judge.
More than just finding bugs, we in QA bring our understanding of the business to the table. This helps us see whether these AI Agents are actually doing what they need to, both for the project and for the users.
First Things First: Seeing What’s Going On (Observability)
Before we can even think about checking if an AI Agent is doing a good job, we need to be able to see what it’s actually doing. That’s where “observability” comes in. Think of it as needing to see the pipes to fix a leak.
From simple logs to fancier tools that trace everything, observability helps us answer basic questions: “What’s happening?” and “Why?” It’s about getting clues from the system’s data without having to guess or dig around too much. We’ll talk about specific tools later on.
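To make that a bit more concrete, here’s a minimal sketch of the kind of structured logging we mean – the `run_agent` function and the logged fields are hypothetical stand-ins for whatever your agent framework actually exposes:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent_observability")

def run_agent(user_message: str) -> dict:
    # Hypothetical stand-in for the real agent call.
    return {"reply": "Sure, I can help with that.", "tokens_used": 42}

def observed_call(user_message: str) -> dict:
    start = time.perf_counter()
    result = run_agent(user_message)
    latency_ms = (time.perf_counter() - start) * 1000
    # One structured log line per interaction: easy to grep now,
    # easy to feed into a tracing/observability platform later.
    logger.info(json.dumps({
        "input": user_message,
        "output": result["reply"],
        "tokens_used": result["tokens_used"],
        "latency_ms": round(latency_ms, 1),
    }))
    return result

observed_call("I was charged twice for my order.")
```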
Figuring Out What “Good” Looks Like (Metrics and Evaluation)
Once we can see what our AI Agent is up to, the next step is to define what doing a good job looks like. This means creating evaluation metrics to measure its AI outputs and overall performance. We look at general AI evaluation concepts, but more importantly, we think about what makes sense for our specific project.
The key is to focus on the real problems we’re seeing with the agent. No need to grab every single measurement out there – let’s start with what really matters for us.
Also, it’s easy to want to build super complicated evaluation frameworks right away, but it’s better to start simple. Let’s figure out the real issues first before we build a giant testing machine.
For example, if we have an AI Agent helping customers, a standard thing to measure might be how many issues it solves (a performance metric). But we might find that a really important, less obvious thing to track is how many times it makes a customer angry right at the start. That could be a really important sign that something’s wrong.
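As a toy illustration, a metric like that doesn’t need a framework at all – a few lines over your conversation logs will do. The keyword list and data shape below are made up for the example:

```python
FRUSTRATION_MARKERS = ("this is useless", "i want a human", "ridiculous")

def early_frustration_rate(conversations, first_n_turns: int = 2) -> float:
    """Share of conversations where the user sounds frustrated early on."""
    flagged = 0
    for convo in conversations:
        user_turns = [t["text"].lower() for t in convo if t["role"] == "user"]
        early_text = " ".join(user_turns[:first_n_turns])
        if any(marker in early_text for marker in FRUSTRATION_MARKERS):
            flagged += 1
    return flagged / len(conversations) if conversations else 0.0

# Example: two conversations, one showing early frustration.
conversations = [
    [{"role": "user", "text": "This is useless, I want a human!"}],
    [{"role": "user", "text": "Hi, can you check my order status?"}],
]
print(early_frustration_rate(conversations))  # 0.5
```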
Two Ways to Let the Agent Prove Itself
- Code evaluators / assertions – Sometimes a plain, deterministic check is better than asking an LLM to judge (a minimal sketch follows this list).
  - Math or pricing calls: If the agent calculates a sales-tax total, a one-line `assert expected_total == agent_total` is faster and crystal-clear.
  - API-schema compliance: When the agent assembles a JSON payload for an external service, running it through a JSON-Schema validator tells you instantly if the structure is correct.
- LLM-as-Judge – Smart language models can also score our agent’s work. An LLM can look at how well it writes or how smoothly it talks to people. In a vacation-planning chatbot, for example, the LLM might judge whether the user was satisfied, the offer was good, the data was accurate, whether it hallucinated, and whether the tone felt right—using a 1-to-5 scale or simple yes/no tags.
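For the first bullet above, here’s a minimal sketch of what deterministic checks can look like – the totals, the payload and the schema are invented for the example, and the structure check uses the `jsonschema` library:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Deterministic check on a calculation (values are illustrative;
# in practice you may want a small tolerance for floats).
expected_total = 107.99
agent_total = 107.99
assert expected_total == agent_total, "Sales-tax total is off"

# Deterministic check on payload structure before it hits an external API.
payload_schema = {
    "type": "object",
    "required": ["order_id", "amount", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number"},
        "currency": {"type": "string"},
    },
}

agent_payload = {"order_id": "A-123", "amount": 107.99, "currency": "EUR"}

try:
    validate(instance=agent_payload, schema=payload_schema)
    print("Payload structure OK")
except ValidationError as err:
    print(f"Payload rejected: {err.message}")
```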
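And for the second bullet, here’s one way an LLM-as-Judge check could look for the vacation-planning example. This is a sketch assuming the OpenAI Python SDK – the rubric, model name and conversation are placeholders you’d adapt to your own setup:

```python
import json
from openai import OpenAI  # pip install openai; any chat-capable client works

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a vacation-planning chatbot conversation.
Rate each aspect from 1 (poor) to 5 (excellent) and answer yes/no for hallucination.
Return only JSON with keys: user_satisfaction, offer_quality, data_accuracy,
tone, hallucinated.

Conversation:
{conversation}
"""

def judge_conversation(conversation: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever judge model your project allows
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(conversation=conversation)}],
        temperature=0,
        response_format={"type": "json_object"},  # ask for parseable JSON back
    )
    return json.loads(response.choices[0].message.content)

scores = judge_conversation(
    "User: I want a beach holiday in June under 1000 EUR.\n"
    "Agent: I found a 7-night stay in Crete for 940 EUR, flights included."
)
print(scores)
```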
Keep a Bit of Human Judgment in the Loop
Even though having people label agent outputs costs time and money, doing a little manual review is priceless. Try a lightweight habit where the team spends ten focused minutes each session tagging a handful of conversations instead of relying solely on automated LLM scores. That tiny slice of human feedback keeps the metrics honest, catches blind spots, and builds the intuition you just can’t get from a robotic score alone.
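A low-effort way to feed that habit is to pull a small random sample of recent conversations into a file the team can tag together. The data shape below is hypothetical:

```python
import csv
import random

def sample_for_review(conversations, n=5, path="to_review.csv"):
    """Dump a random handful of conversations for manual tagging."""
    sample = random.sample(conversations, min(n, len(conversations)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["conversation_id", "transcript", "human_verdict", "notes"])
        for convo in sample:
            writer.writerow([convo["id"], convo["transcript"], "", ""])

sample_for_review([
    {"id": "c-001", "transcript": "User: Hi...\nAgent: Hello!"},
    {"id": "c-002", "transcript": "User: My order is late.\nAgent: Let me check."},
])
```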
Keeping Things Consistent (Regression Testing)
Just like with regular software, we need to make sure that when we update our AI Agents or change things, we don’t break what was already working. That’s where “regression testing” comes in.
And one way to help with that is using “golden data” (think of it as a set of perfect examples with the right answers). If we always test our AI Agent against this perfect data after we make changes, we can quickly see if it’s suddenly doing worse. With AI models changing all the time, having this kind of testing in place helps us keep things stable and reliable.
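In practice this can be as plain as a parametrized test over a small golden set. Here’s a sketch where `run_agent` and the golden examples are placeholders for your own:

```python
import pytest  # pip install pytest

# A tiny "golden" set: known inputs with the answers we expect to keep working.
GOLDEN_CASES = [
    ("What is your refund policy?", "30 days"),
    ("How do I reset my password?", "reset link"),
]

def run_agent(question: str) -> str:
    # Hypothetical stand-in for the real agent call.
    canned = {
        "What is your refund policy?": "You can request a refund within 30 days of purchase.",
        "How do I reset my password?": "Use the reset link we email you after you click 'Forgot password'.",
    }
    return canned[question]

@pytest.mark.parametrize("question,expected_snippet", GOLDEN_CASES)
def test_agent_against_golden_data(question, expected_snippet):
    answer = run_agent(question)
    # A loose containment check keeps the test robust to harmless rewording.
    assert expected_snippet.lower() in answer.lower()
```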
Our Evolving Role: Still Bringing the Brains
This whole AI thing is changing our QA roles a bit, but in a lot of ways, it’s still about what we’ve always been good at: understanding the full picture. We often have the most knowledge about how the whole project works, and that’s super helpful when we’re trying to figure out if an AI Agent is behaving correctly. For example, in AI quality assurance, if we have an AI agent for support tickets, our domain knowledge of support workflows is key.
Picking the Right Tools
As we get better at measuring AI metrics and testing our AI agents, the next step is choosing the right observability and evaluation platforms. There are plenty on the market—Arize AI, DeepEval, Promptfoo, Ragas and Langfuse, to name a few.
These tools let you define and run evaluation frameworks and metrics, then report on how the agent is doing. Some even track “thinking” (token usage) and latency. The best choice always comes down to your project’s needs. Some tools are generalists, while others zoom in on performance measurement. It’s a fast-moving space, so we aim for products that look like they’re ready to stick around and keep improving. We’ll dive into individual platforms more in our next post.
Before you decide, though, it helps to ask a few practical questions:
- How does the tool integrate with the way you build agents? Full-featured platforms such as Arize AI or Galileo offer automatic tracing hooks that plug straight into popular agent builders like LangChain, LlamaIndex and Hugging Face. If your code follows a clean architecture around those SDKs, turning observability on from day one is a huge win.
- Which elements of the agent flow matter most to you? Suppose your agent is heavy on RAG calls and you need custom retrieval-focused evaluations. A specialist tool like Ragas may be the right fit, while a broader platform could be overkill.
- Do you actually need a full-blown dashboard right now? If you’re happy to build the tracing stack yourself and a simple HTML report will do the job, Promptfoo is a solid option. It’s open source, widely recommended—even by folks at OpenAI—and keeps the tooling footprint light until you’re ready for something more sophisticated.
- Is the tool the best match for the current use-case? And are you willing to switch later? At the moment we lean on Arize AI because it offers many features plus a slick monitoring dashboard. For other scenarios, we’ll probably reach for different tools. The guiding principle never changes: pick the option that fits the problem and don’t hesitate to revisit the decision as requirements evolve.
Summary
Delivering Quality Assurance for AI agents is a new challenge QA teams face today. Luckily, we are not powerless: we already have tools and heuristics to tackle these problems. But our greatest strengths and assets will remain our minds and experience – and that’s unlikely to change in the near future.