# Create datasets to evaluate agent performance

Learn how to create a dataset for Retool Agents Evals.
| Offering | Agents availability |
|---|---|
| Cloud | Public beta |
| Self-hosted Edge 3.234 or later | Public beta |
| Self-hosted Stable 3.253 or later | Public beta |
Before you create an eval, you first need to add a dataset. A dataset is a collection of test cases. An agent can have many datasets, and each dataset can have many test cases. It can be helpful to group test cases into datasets by use case (for example, agent accuracy or response time).
Within an eval, you can select one or more datasets to evaluate.
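Conceptually, a dataset is just a named collection of test cases attached to an agent. The following Python sketch models that structure to make the grouping idea concrete; the class and field names are illustrative only, not part of Retool's API:

```python
from dataclasses import dataclass, field

# Illustrative model only -- these names are not Retool's API.
@dataclass
class TestCase:
    input: str            # prompt sent to the agent
    expected_output: str  # expected tool choice or final answer

@dataclass
class Dataset:
    name: str
    description: str = ""
    test_cases: list[TestCase] = field(default_factory=list)

# Group test cases into datasets by use case, e.g. one dataset per concern.
accuracy = Dataset(
    name="agent-accuracy",
    description="Checks that answers are factually correct.",
)
accuracy.test_cases.append(
    TestCase(input="What is 2 + 2?", expected_output="4")
)
print(len(accuracy.test_cases))  # -> 1
```

An eval can then point at one or more such datasets and run every test case they contain.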
## Create a dataset
To create a dataset:

1. Navigate to the agent you want to evaluate.
2. Click the Datasets tab, and then click Add Dataset.
3. Provide a Name and, optionally, a Description for your dataset.
4. Click Create.
## Create a test case
Test cases provide the input and expected output for an evaluation. A dataset can have many test cases, and each test case has one of two Types:
- Tool choice: Verifies that the agent selects the expected tool, and extracts the expected parameters, based on the specified input.
- Final answer: Requires choosing either a Programmatic or LLM as a Judge reviewer to score the correctness of an agent's output. Use programmatic reviewers when the agent's expected output can be clearly defined (for example, Exact match), and LLM-as-a-Judge reviewers when it's not as clearly defined (for example, Tone detection).
For more information on reviewers, refer to the Reviewers section of the Evals concept page.
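To make the two test-case types concrete, the sketch below shows one plausible way each could be scored: the first function mirrors a tool-choice check (expected tool plus extracted parameters), and the second mirrors a Programmatic reviewer of the Exact match kind. These functions are illustrative assumptions, not Retool's internal implementation:

```python
# Illustrative scoring logic only -- not Retool's implementation.

def score_tool_choice(agent_call: dict, expected: dict) -> bool:
    """Pass if the agent picked the expected tool with the expected parameters."""
    return (
        agent_call.get("tool") == expected["tool"]
        and agent_call.get("parameters") == expected["parameters"]
    )

def score_exact_match(agent_answer: str, expected_answer: str) -> bool:
    """An 'Exact match'-style programmatic check, ignoring surrounding whitespace."""
    return agent_answer.strip() == expected_answer.strip()

# A tool-choice test case passes only when both tool and parameters match.
call = {"tool": "lookup_order", "parameters": {"order_id": "A-123"}}
print(score_tool_choice(call, {"tool": "lookup_order",
                               "parameters": {"order_id": "A-123"}}))  # True

# A final-answer test case with a programmatic exact-match reviewer.
print(score_exact_match("  4 ", "4"))  # True
```

An LLM-as-a-Judge reviewer replaces a deterministic function like `score_exact_match` with a model prompt that grades qualities, such as tone, that cannot be captured by string comparison.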