Create datasets to evaluate agent performance
Learn how to create a dataset for Retool Agents Evals.
Before you create an eval, you first need to add a dataset. A dataset is a collection of test cases. An agent can have many datasets, and each dataset can have many test cases. It can be useful to group test cases into datasets by use case (for example, agent accuracy or response time).
Within an eval, you can select one or more datasets to evaluate.
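Datasets and test cases are managed entirely in the Retool UI, but the hierarchy above is worth holding as a mental model. A minimal sketch in Python, with hypothetical agent and dataset names (this is not Retool's internal format):

```python
# Illustrative only: an agent has many datasets, and each dataset
# groups many test cases by use case. All names here are hypothetical.
agent_evals = {
    "weather-agent": {
        "accuracy": [  # one dataset, focused on agent accuracy
            {"input": "What's the weather in Tokyo?", "type": "tool_choice"},
        ],
        "response-time": [],  # a second dataset for the same agent
    }
}

# An eval can then select one or more of these datasets to run against.
print(len(agent_evals["weather-agent"]))  # → 2 datasets for this agent
```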
Create a dataset
To create a dataset, navigate to the agent you want to evaluate.
- Click the Datasets tab, and then click Add Dataset.
- Provide a Name and optionally add a Description for your dataset.
- Click Create.
Create a test case
Test cases provide input and expected output for the evaluations.
Datasets can have many test cases, and test cases can be one of two Types:
- Tool choice: Verifies that the agent selects the expected tool, and extracts the expected parameters, based on the specified input.
- Final answer: Requires choosing either a Programmatic or LLM-as-a-Judge reviewer to score the correctness of an agent's output. Use programmatic reviewers when the agent's expected output can be clearly defined (for example, Exact match), and LLM-as-a-Judge reviewers when it's not as clearly defined (for example, Tone detection).
For more information on reviewers, refer to the Reviewers section of the Evals concept page.
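To make the distinction concrete: a programmatic reviewer such as Exact match reduces to a plain predicate, while an LLM-as-a-Judge reviewer sends the output to a model with a rubric and lets the model score it. A rough sketch under those assumptions (not Retool's implementation):

```python
def exact_match(expected: str, actual: str) -> bool:
    """Programmatic reviewer sketch: the expected output is clearly
    defined, so a simple string comparison is enough."""
    return expected.strip() == actual.strip()

def llm_judge_prompt(criteria: str, actual: str) -> str:
    """LLM-as-a-Judge sketch: the criteria (e.g. tone) can't be checked
    mechanically, so a rubric prompt is built for a model to score."""
    return (
        f"Score the following agent response for: {criteria}\n"
        f"Response: {actual}\n"
        "Reply PASS or FAIL."
    )

print(exact_match("Sunny, 24°C", "Sunny, 24°C "))  # → True
```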
Tool choice test case
To create a Tool choice test case:
- Click the dataset, and then click Add Test Case.
- Select the dataset name from the Dataset dropdown.
- Enter a phrase you want to test in the Input field. The example below uses the Input phrase `What's the weather in Tokyo?`.
- Select Tool choice for the Type.
- In the Expected tool dropdown, select the tool you would expect your agent to choose given the input phrase. In the following example, the Expected tool is `Get weather`.
- Enter any Expected parameters. In the following example, the parameters are the City, `Tokyo`, and the current Date in `DD-MM-YYYY` format.
- Click Create.
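The steps above amount to a test case that pairs an input phrase with an expected tool and expected parameters, and the check passes when the agent matches both. A sketch of that idea in Python, with illustrative field names (not Retool's schema):

```python
from datetime import date

# Hypothetical representation of the tool-choice test case above.
test_case = {
    "input": "What's the weather in Tokyo?",
    "type": "tool_choice",
    "expected_tool": "Get weather",
    "expected_parameters": {
        "City": "Tokyo",
        "Date": date.today().strftime("%d-%m-%Y"),  # DD-MM-YYYY format
    },
}

def passes(agent_tool: str, agent_params: dict) -> bool:
    """A tool-choice case passes only when the agent selects the
    expected tool AND extracts the expected parameters."""
    return (agent_tool == test_case["expected_tool"]
            and agent_params == test_case["expected_parameters"])

print(passes("Get weather", test_case["expected_parameters"]))  # → True
```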
Final answer test case
To create a Final answer test case:
- Click the dataset, and then click Add Test Case.
- Select the dataset name from the Dataset dropdown.
- Enter a phrase you want to test in the Input field.
- Select Final answer for the Type.
- Choose a reviewer from the Programmatic or LLM-as-a-Judge options.
- Add the data you want to test with the reviewer (this varies based on the selected reviewer). The example below uses the `String contains` reviewer and searches for the Substring `access` based on the Input `What's my calendar look like today?`. Since the test case involves a weather agent, the agent would not be expected to have access to a calendar tool and will likely respond with `I do not have access`.
- Click Save on the Choose a reviewer modal.
- Click Create to add the test case to your dataset.
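The `String contains` reviewer used in this example reduces to a substring check against the agent's response: the case passes if the Substring appears anywhere in the output. As an illustration (not Retool's code, and the agent reply below is hypothetical):

```python
def string_contains(substring: str, actual: str) -> bool:
    """Programmatic 'String contains' reviewer sketch: pass when the
    configured substring appears in the agent's final answer."""
    return substring in actual

# A plausible reply from a weather agent asked about its calendar.
agent_response = "I do not have access to a calendar tool."
print(string_contains("access", agent_response))  # → True
```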
Next steps
Once you've created a dataset and test cases, you can create an eval. Check out Run and compare evals to learn how to create an eval from a dataset and compare two evals side-by-side.