
Run evals and compare them side-by-side

Learn how to run and compare Retool Agent evals.

Evals allow you to:

  • Test your agent's effectiveness and accuracy.
  • Experiment with the output that different models provide.
  • Compare two eval runs side-by-side.

For more information about evals, refer to the Evals conceptual guide.

After you've created a dataset and test cases, you can create and run an eval. Running an eval of your dataset produces an Avg score based on whether the test cases in the dataset passed or failed.

Eval runs count towards billable agent runtime. For more information about billing, refer to the Billing and usage page.
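To make the Avg score concrete, the sketch below shows one plausible interpretation: the score as the fraction of test cases that passed. The `TestCaseResult` shape and `avgScore` function are hypothetical and for illustration only; they are not Retool's data model or API.

```ts
// Hypothetical shape of a single test case outcome; field names are illustrative.
interface TestCaseResult {
  name: string;
  passed: boolean;
}

// One plausible reading of the Avg score: the fraction of test cases that passed.
function avgScore(results: TestCaseResult[]): number {
  if (results.length === 0) return 0;
  const passedCount = results.filter((r) => r.passed).length;
  return passedCount / results.length;
}

// Example: 3 of 4 test cases pass, so the score would be 0.75.
const results: TestCaseResult[] = [
  { name: "greeting", passed: true },
  { name: "refund policy", passed: true },
  { name: "order lookup", passed: false },
  { name: "escalation", passed: true },
];
console.log(avgScore(results)); // 0.75
```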

Run an eval

To run an eval of your dataset items, if you're currently viewing a test case, use the breadcrumbs to go back to the Datasets tab first.

From the agent:

  1. Click the Evals tab.
  2. Click the Run button.
  3. Give the run a Name, check the box next to your dataset name, and click Run eval.
  4. When the status changes to Completed, the eval produces an Avg score.
  5. Click the run to view the logs for each test case. Clicking a line item opens the Test Case Details panel.

Hover over the Score to display the rationale for a success or failure. If a run fails, navigate back to the test case to examine or correct the expected output.

Compare evals

You can compare two runs side-by-side.

  1. On the Evals tab, check the box next to two runs and select Compare.
  2. If Eval A and Eval B look correct on the Compare evals modal, select Compare again.
  3. The run details are shown in side-by-side panels so you can more easily identify failures.

If you have completed multiple eval runs, you can select other runs from the dropdowns at the top of the Compare page to change which runs are displayed.
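Conceptually, comparing two runs means lining up the same test cases and looking for outcomes that changed between them. The sketch below illustrates that idea with hypothetical data structures; `EvalRun` and `diffRuns` are assumptions for illustration, not Retool's API.

```ts
// Hypothetical representation of an eval run; names are illustrative only.
interface EvalRun {
  name: string;
  results: Record<string, boolean>; // test case name -> passed
}

// Pair up test cases from two runs and flag outcomes that differ,
// mirroring what the side-by-side Compare view helps you spot.
function diffRuns(a: EvalRun, b: EvalRun): string[] {
  const differences: string[] = [];
  for (const testCase of Object.keys(a.results)) {
    const passedA = a.results[testCase];
    const passedB = b.results[testCase];
    if (passedB !== undefined && passedA !== passedB) {
      differences.push(
        `${testCase}: ${passedA ? "pass" : "fail"} in ${a.name}, ` +
          `${passedB ? "pass" : "fail"} in ${b.name}`
      );
    }
  }
  return differences;
}
```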