Create datasets to evaluate agent performance
Learn how to create a dataset for Retool Agents Evals.
Before you create an eval, you first need to add a dataset. A dataset is a collection of test cases. An agent can have many datasets, and each dataset can have many test cases. It can be useful to group test cases into datasets by use case (for example, agent accuracy or response time).
Within an eval, you can select one or more datasets to evaluate.
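Datasets and test cases are managed entirely in the Retool UI, but the hierarchy above is worth holding as a mental model. A minimal sketch in Python, with hypothetical agent and dataset names (this is not Retool's internal format):

```python
# Illustrative only: an agent has many datasets, and each dataset
# groups many test cases by use case. All names here are hypothetical.
agent_evals = {
    "weather-agent": {
        "accuracy": [  # one dataset, focused on agent accuracy
            {"input": "What's the weather in Tokyo?", "type": "tool_choice"},
        ],
        "response-time": [],  # a second dataset for the same agent
    }
}

# An eval can then select one or more of these datasets to run against.
print(len(agent_evals["weather-agent"]))  # → 2 datasets for this agent
```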
Create a dataset
To create a dataset, navigate to the agent you want to evaluate.
- Click the Datasets tab, and then click Add Dataset.
- Provide a Name and optionally add a Description for your dataset.
- Click Create.
Create a test case
Test cases provide input and expected output for the evaluations.
Datasets can have many test cases, and test cases can be one of two Types:
- Tool choice: Verifies that the agent selects the expected tool, and extracts the expected parameters, based on the specified input.
- Final answer: Requires choosing either a Programmatic or LLM-as-a-Judge reviewer to score the correctness of an agent's output. Use programmatic reviewers when the agent's expected output can be clearly defined (for example, Exact match), and LLM-as-a-Judge reviewers when it's not as clearly defined (for example, Tone detection).
For more information on reviewers, refer to the Reviewers section of the Evals concept page.
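To make the distinction concrete: a programmatic reviewer such as Exact match reduces to a plain predicate, while an LLM-as-a-Judge reviewer sends the output to a model with a rubric and lets the model score it. A rough sketch under those assumptions (not Retool's implementation):

```python
def exact_match(expected: str, actual: str) -> bool:
    """Programmatic reviewer sketch: the expected output is clearly
    defined, so a simple string comparison is enough."""
    return expected.strip() == actual.strip()

def llm_judge_prompt(criteria: str, actual: str) -> str:
    """LLM-as-a-Judge sketch: the criteria (e.g. tone) can't be checked
    mechanically, so a rubric prompt is built for a model to score."""
    return (
        f"Score the following agent response for: {criteria}\n"
        f"Response: {actual}\n"
        "Reply PASS or FAIL."
    )

print(exact_match("Sunny, 24°C", "Sunny, 24°C "))  # → True
```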
Tool choice test case
To create a Tool choice test case:
- Click the dataset, and then click Add Test Case.
- Select the dataset name from the Dataset dropdown.
- Enter a phrase you want to test in the Input field. The example below uses the Input phrase `What's the weather in Tokyo?`.
- Select Tool choice for the Type.
- In the Expected tool dropdown, select the tool you would expect your agent to choose given the input phrase. In the following example, the Expected tool is `Get weather`.
- Enter any Expected parameters. In the following example, the parameters are the City, `Tokyo`, and the current Date in `DD-MM-YYYY` format.
- Click Create.
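The steps above amount to a test case that pairs an input phrase with an expected tool and expected parameters, and the check passes when the agent matches both. A sketch of that idea in Python, with illustrative field names (not Retool's schema):

```python
from datetime import date

# Hypothetical representation of the tool-choice test case above.
test_case = {
    "input": "What's the weather in Tokyo?",
    "type": "tool_choice",
    "expected_tool": "Get weather",
    "expected_parameters": {
        "City": "Tokyo",
        "Date": date.today().strftime("%d-%m-%Y"),  # DD-MM-YYYY format
    },
}

def passes(agent_tool: str, agent_params: dict) -> bool:
    """A tool-choice case passes only when the agent selects the
    expected tool AND extracts the expected parameters."""
    return (agent_tool == test_case["expected_tool"]
            and agent_params == test_case["expected_parameters"])

print(passes("Get weather", test_case["expected_parameters"]))  # → True
```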
Final answer test case
To create a Final answer test case:
- Click the dataset, and then click Add Test Case.
- Select the dataset name from the Dataset dropdown.
- Enter a phrase you want to test in the Input field.
- Select Final answer for the Type.
- Choose a reviewer from the Programmatic or LLM-as-a-Judge options.
- Add the data you want to test with the reviewer (this varies based on the selected reviewer). The example below uses the `String contains` reviewer and searches for the Substring `access` based on the Input `What's my calendar look like today?`. Since the test case involves a weather agent, the agent would not be expected to have access to a calendar tool and will likely respond with `I do not have access`.
- Click Save on the Choose a reviewer modal.
- Click Create to add the test case to your dataset.
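The `String contains` reviewer used in this example reduces to a substring check against the agent's response: the case passes if the Substring appears anywhere in the output. As an illustration (not Retool's code, and the agent reply below is hypothetical):

```python
def string_contains(substring: str, actual: str) -> bool:
    """Programmatic 'String contains' reviewer sketch: pass when the
    configured substring appears in the agent's final answer."""
    return substring in actual

# A plausible reply from a weather agent asked about its calendar.
agent_response = "I do not have access to a calendar tool."
print(string_contains("access", agent_response))  # → True
```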
Next steps
Once you've created a dataset and test cases, you can create an eval. Check out Run and compare evals to learn how to create an eval from a dataset and compare two evals side-by-side.