Artificial intelligence is poised to shake up much of what we know about software development, but as it turns out, agent lifecycle management, also known as the agent development lifecycle (ADL), is fairly similar to traditional application lifecycle management (ALM). From ideation and configuration to testing and deployment, the path from drawing board to production will feel familiar. But of course, the devil is in the details, especially when generative AI is involved.
Generative models lend AI agents their capacity for creativity and reasoning, but it’s that very lack of rules-based logic that can sometimes lead to novel and unpredictable outcomes. It’s why the same prompt can generate countless different outputs, or even hallucinations. For some customers, that novelty could be considered a feature, but for most enterprise users, unpredictability is a bug. Predictable, consistent outputs are an understandably important criterion for enterprise-grade AI systems.
So, how do you know if your agent will behave the way you expect it to? How do you know when it’s production-ready? The answer takes a page straight out of the traditional ALM playbook: testing, testing, and more testing! But unlike a conventional software application, testing an agent means grappling with a few curveballs.
From unit to scale testing
Like traditional ALM, unit testing is an essential component of agentic lifecycle management. Whether you’re working with an out-of-the-box agent or building one from scratch, you’ll want to be sure it has the right tools for the job. In Agentforce, this means choosing from a library of topics and actions to define your agent’s scope and capabilities. Once you’ve selected your topics and actions, it’s important to validate that these composable building blocks work the way you expect — and play well together at run-time. This can be done directly in Agent Builder’s built-in Plan Tracer by entering an utterance in the prompt window and observing the agent in real time as it identifies the topic and orchestrates a sequence of actions that (hopefully) culminate in the desired outcome.
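To make the idea concrete, here’s a minimal sketch of what a unit-style check for topic and action selection could look like. The `run_agent` helper is hypothetical — a stand-in for whatever harness you use to invoke the agent and inspect its plan, not an Agentforce API.

```python
# Minimal sketch of a unit-style check for topic and action selection.
# `run_agent` is a hypothetical stand-in for whatever harness invokes
# your agent and parses the plan it produces -- not an Agentforce API.

def run_agent(utterance: str) -> dict:
    # Placeholder: a real harness would call the agent and extract the
    # chosen topic plus the ordered list of actions it planned.
    return {"topic": "Order Management", "actions": ["LookupOrder", "SendStatusEmail"]}

def test_order_status_utterance():
    result = run_agent("Where is my order #12345?")
    assert result["topic"] == "Order Management"
    assert result["actions"][0] == "LookupOrder"

if __name__ == "__main__":
    test_order_status_utterance()
    print("Topic and action selection check passed")
```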
Like any software test, the first one will probably fail, but Plan Tracer will help you quickly identify what went wrong and make any necessary changes. If you need to take a closer look at what’s happening, you can explore the event logs by clicking on the icon at the bottom of Agent Builder’s left-hand navigation menu.
Once you’re happy with the way your agent responds to an individual query, it’s time to see how it performs at scale and handles a variety of scenarios. Again, scale testing should be a familiar carryover from traditional ALM. Since no one wants to be stuck writing test cases for hours on end, Agentforce Testing Center is equipped to auto-generate hundreds of realistic test cases in a single click. Simply describe the type of use cases you want to test for using natural-language instructions, and Testing Center will generate a diverse set of representative test cases to evaluate, which you can edit or add to as needed. Based on the test results, your team can make any necessary changes to the agent’s instructions, topics, or actions.
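Conceptually, a scale test boils down to running every generated case through the agent and tallying pass rates. Here’s a rough sketch under the same assumption of a hypothetical `run_agent` helper; the test-case format is illustrative, not Testing Center’s actual schema.

```python
# Rough sketch of a scale-test loop: run every generated case through the
# agent and tally pass rates per expected topic. The case format and the
# `run_agent` stub are illustrative, not Testing Center's actual schema.
from collections import Counter

test_cases = [
    {"utterance": "Where is my order #12345?", "expected_topic": "Order Management"},
    {"utterance": "I'd like a refund for these shoes", "expected_topic": "Returns"},
    # ...hundreds more, auto-generated or hand-written
]

def run_agent(utterance: str) -> dict:
    return {"topic": "Order Management"}  # placeholder for the real harness

passes, totals = Counter(), Counter()
for case in test_cases:
    totals[case["expected_topic"]] += 1
    if run_agent(case["utterance"])["topic"] == case["expected_topic"]:
        passes[case["expected_topic"]] += 1

for topic, total in totals.items():
    print(f"{topic}: {passes[topic]}/{total} passed")
```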

Remember, testing is a journey, not a one-off. Any time you apply a patch or fix a bug, you’ll want to run a predefined set of regression tests to ensure the overall functionality of your agent is intact.
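One common pattern is to pin those known-good cases in a parametrized suite and rerun it after every change. The sketch below assumes pytest and reuses the same hypothetical `run_agent` stub, so wire it to your real harness before trusting the results.

```python
# Illustrative regression suite using pytest: keep a fixed set of
# known-good cases and rerun them after every patch or prompt change.
# `run_agent` is a hypothetical stub, not a real Agentforce API.
import pytest

REGRESSION_CASES = [
    ("Where is my order #12345?", "Order Management"),
    ("I'd like to return these shoes", "Returns"),
    ("Cancel my subscription", "Account Management"),
]

def run_agent(utterance: str) -> dict:
    return {"topic": "Order Management"}  # placeholder

@pytest.mark.parametrize("utterance,expected_topic", REGRESSION_CASES)
def test_regression(utterance, expected_topic):
    assert run_agent(utterance)["topic"] == expected_topic
```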
Now that we know what tools to use to evaluate our agents, let’s take a look under the hood at how an AI agent test actually works.
Under the hood of agentic testing
Traditional software operates on deterministic logic, which makes testing it straightforward. You put together your test cases and check whether the software behaves as expected in various conditions. Since you’ve already codified exactly what should happen for any given user input, a test can only pass or fail with little room for interpretation. Either the result maps back to the requirements you’ve defined, or it doesn’t. This is standard assertion-based testing.
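A toy example makes the point: with deterministic logic, each input has exactly one correct output, so the check is a simple equality assertion.

```python
# Classic assertion-based testing: deterministic code means each input
# maps to exactly one correct output, so a test simply passes or fails.
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(19.99, 0) == 19.99

if __name__ == "__main__":
    test_apply_discount()
    print("All assertions passed")
```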
Generative AI doesn’t work this way. There’s an infinite number of possible inputs and outputs, and the same input can yield countless slightly different yet still correct responses. That makes it impossible to anticipate and define rules for every situation.
As you can imagine, this isn’t an easy problem to solve. So what’s an agent developer to do? With Agentforce Testing Center, there’s an LLM for that!
Using an LLM to test an agent
To ensure the subjective elements of an agentic workflow run as expected, Agentforce Testing Center uses an LLM as a judge within the testing flow. With this method, the LLM compares the agent’s actual response against evaluation criteria defined by the user and determines whether the response is acceptable.
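Stripped down to its essence, an LLM-as-a-judge evaluation looks something like the sketch below. The `call_llm` function and the judging prompt are placeholders for whatever model endpoint and prompt you actually use — they are not Agentforce internals.

```python
# Simplified LLM-as-a-judge sketch. `call_llm` and the judging prompt are
# placeholders for whatever model endpoint and prompt you actually use --
# they are not Agentforce internals.

JUDGE_PROMPT = """You are grading an AI agent's response.
Criteria: {criteria}
Agent response: {response}
Answer PASS or FAIL, followed by a one-sentence reason."""

def call_llm(prompt: str) -> str:
    # Placeholder: call your model provider here.
    return "PASS - the response answers the question with no invented details."

def judge_response(response: str, criteria: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(criteria=criteria, response=response))
    return verdict.strip().upper().startswith("PASS")

if __name__ == "__main__":
    ok = judge_response(
        response="Your order shipped yesterday and should arrive Friday.",
        criteria="Accurately answers the shipping question; no invented details; polite tone.",
    )
    print("LLM judge verdict:", "pass" if ok else "fail")
```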
Testing an agentic workflow comprehensively requires both rules-based and subjective evaluations. Here’s what happens under the hood when you use Agentforce Testing Center to test an agent.

Assertion-based evaluations test the deterministic elements of the agent, such as topic choice, action choice, and latency. Meanwhile, LLM-based evaluations test the stochastic elements, such as response quality and accuracy, as well as checking for hallucinations and bias.
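Put together, a single test-case result ends up carrying both kinds of signal. The data structure below is one illustrative way to record them side by side; the field names are assumptions, not a Testing Center schema.

```python
# One illustrative way to record both kinds of evaluation for a single
# test case: deterministic checks (topic, actions, latency) alongside
# LLM-judged scores (quality, accuracy, hallucination/bias flags).
# Field names are assumptions, not a Testing Center schema.
from dataclasses import dataclass, field

@dataclass
class TestCaseResult:
    utterance: str
    # assertion-based (deterministic) checks
    topic_correct: bool
    actions_correct: bool
    latency_ms: float
    # LLM-judged (stochastic) evaluations
    quality_pass: bool
    accuracy_pass: bool
    flags: list[str] = field(default_factory=list)  # e.g. ["hallucination"]

    def passed(self, max_latency_ms: float = 3000.0) -> bool:
        return (self.topic_correct and self.actions_correct
                and self.latency_ms <= max_latency_ms
                and self.quality_pass and self.accuracy_pass
                and not self.flags)
```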
So, when is your agent production-ready?
Every company needs to determine its own acceptable pass/fail rates for various scenarios. A great first step is to measure how accurately a human would perform the same task; that becomes your baseline, and your goal is to meet or exceed it with your agent. In practice, targets differ by evaluation type. For deterministic elements like identifying the right topic and orchestrating the correct actions, a pass rate close to 100% should be your target. When you’re using an LLM to judge subjective tasks, an acceptable pass rate might look more like 70%-85%, depending on the complexity of the task, the specific criteria being judged, and the desired level of accuracy for your agent. Higher pass rates are warranted for use cases or industries where precision is essential.
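As a back-of-the-envelope readiness check, you can compare observed pass rates per evaluation type against thresholds along these lines. Every number below is purely illustrative — set your own baselines per use case.

```python
# Back-of-the-envelope readiness check: compare observed pass rates per
# evaluation type against your chosen thresholds. All numbers here are
# purely illustrative -- set your own baselines per use case.

results = {                               # (passed, total)
    "topic_selection": (982, 1000),       # deterministic
    "action_orchestration": (975, 1000),  # deterministic
    "response_quality": (812, 1000),      # LLM-judged
}

thresholds = {
    "topic_selection": 0.98,
    "action_orchestration": 0.98,
    "response_quality": 0.80,
}

for metric, (passed, total) in results.items():
    rate = passed / total
    target = thresholds[metric]
    status = "ready" if rate >= target else "needs work"
    print(f"{metric}: {rate:.1%} (target {target:.0%}) -> {status}")
```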
Of course, testing is just one aspect of the AI agent lifecycle. Check back soon for our follow-up post exploring monitoring, where we’ll cover tools like Agentforce Interaction Explorer (launching soon!) and how to leverage insights from production to continuously monitor, analyze and improve your agents.
