
Evals in Retool Agents

Learn about systematic tests using evals in Retool Agents.

Evals are systematic tests to assess large language models (LLMs) and agents on tasks like reasoning, accuracy, or safety. They use benchmarks or real-world scenarios to ensure models produce reliable, coherent, and ethical outputs.

When you run an eval, Retool provides an input to the LLM and a reviewer scores the LLM's output on a scale from 0 to 1. Some test cases have clearly defined expected outputs and can use programmatic reviewers such as exact match; others have less clearly defined expectations and use LLM-as-a-judge reviewers such as tone detection.

By tracking and scoring the results of your evals, you can determine whether you've introduced breaking changes to an agent, improved the agent's behavior, or left it working as expected.
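
As a rough illustration of how tracked scores can be used (not Retool's implementation), the following TypeScript sketch compares the average score of two runs over the same test cases to flag a possible regression. All names and data here are hypothetical.

```typescript
// Hypothetical sketch: compare the average eval score of two agent runs
// over the same test cases to spot regressions. Shapes and data are
// illustrative, not part of the Retool API.
type ScoredCase = { testCase: string; score: number }; // score is between 0 and 1

function averageScore(run: ScoredCase[]): number {
  return run.reduce((sum, r) => sum + r.score, 0) / run.length;
}

const baselineRun: ScoredCase[] = [
  { testCase: "schedule-meeting", score: 1 },
  { testCase: "cancel-meeting", score: 0.5 },
];

const candidateRun: ScoredCase[] = [
  { testCase: "schedule-meeting", score: 1 },
  { testCase: "cancel-meeting", score: 0 },
];

const delta = averageScore(candidateRun) - averageScore(baselineRun);
console.log(delta < 0 ? "possible regression" : "no regression detected"); // "possible regression"
```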

Evals tab

The Evals tab allows you to evaluate and compare agent runs. To evaluate your agent run, you must first add a Dataset, create a Test case, and add a Reviewer.

Retool Agents Evals tab.

You can compare two agent runs side by side by selecting the checkbox next to the Name of each run you want to compare and clicking Compare. You cannot compare more than two runs at a time.

Evals comparison of two agent runs.

Datasets and test cases

A dataset is a collection of test cases.

A test case contains an Input, which is provided to the LLM to prompt a response (for example, Schedule a meeting between Alice and Bob tomorrow). The expected output maps to the two types of actions an agent can take: choosing tools or returning a final answer.

  • When the test case Type is Tool choice, the evaluation verifies that the agent selects the Expected tool and extracts the Expected parameters from the Input.
  • When the test case Type is Final answer, the evaluation can assess the agent's response with a much broader set of Reviewers.

An agent can have multiple datasets, each containing multiple test cases.
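
For intuition, here is a hypothetical TypeScript sketch of what a dataset with both test case types might look like. The field names are assumptions for illustration, not the exact Retool schema.

```typescript
// Illustrative shapes for a dataset and its test cases; field names are
// assumptions for this sketch, not the exact Retool schema.
type ToolChoiceCase = {
  type: "Tool choice";
  input: string;                              // Prompt sent to the agent
  expectedTool: string;                       // Tool the agent should select
  expectedParameters: Record<string, string>; // Parameters extracted from the input
};

type FinalAnswerCase = {
  type: "Final answer";
  input: string;
  expectedOutput?: string; // Optional; some reviewers do not need an exact expected value
};

type Dataset = {
  name: string;
  testCases: Array<ToolChoiceCase | FinalAnswerCase>;
};

const scheduling: Dataset = {
  name: "Meeting scheduling",
  testCases: [
    {
      type: "Tool choice",
      input: "Schedule a meeting between Alice and Bob tomorrow.",
      expectedTool: "create_calendar_event", // hypothetical tool name
      expectedParameters: { attendees: "Alice, Bob", date: "tomorrow" },
    },
    {
      type: "Final answer",
      input: "What time zone is the office calendar in?",
      expectedOutput: "UTC",
    },
  ],
};
```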

Example dataset with several test cases.

Reviewers

Reviewers score the correctness of the agent's output and provide an explanation for the score. During an eval run, a reviewer accepts the output from the LLM, along with parameters that vary based on the selected reviewer type, and returns a score between 0 and 1.

Retool provides preconfigured Programmatic and LLM-as-a-judge reviewers that you can use to evaluate your agent.
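
The reviewer contract can be sketched as a function that takes the agent's output plus any reviewer-specific parameters and returns a score with an explanation. The following TypeScript is illustrative only; the types and names are assumptions, not the Retool API, using a "String contains" style reviewer (listed below) as the example.

```typescript
// Sketch of the reviewer contract: a reviewer receives the agent's output
// plus reviewer-specific parameters, and returns a score between 0 and 1
// with an explanation. Names and shapes here are illustrative.
type ReviewerResult = { score: number; explanation: string };

type Reviewer<Params> = (output: string, params: Params) => ReviewerResult;

// Example: a "String contains" style reviewer, which takes a searchString
// and an ignoreCase flag as parameters.
const stringContains: Reviewer<{ searchString: string; ignoreCase: boolean }> = (
  output,
  { searchString, ignoreCase }
) => {
  const haystack = ignoreCase ? output.toLowerCase() : output;
  const needle = ignoreCase ? searchString.toLowerCase() : searchString;
  const found = haystack.includes(needle);
  return {
    score: found ? 1 : 0,
    explanation: found
      ? `Output contains "${searchString}".`
      : `Output does not contain "${searchString}".`,
  };
};

// Usage: score one output.
console.log(
  stringContains("Meeting scheduled for 3pm.", { searchString: "scheduled", ignoreCase: true }).score
); // 1
```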

Programmatic

Programmatic reviewers use code to score output based on predefined rules. Use programmatic reviewers when you can clearly define the agent's expected output.

  • Exact match: Determines whether two values are the same. Parameters: none.
  • Valid JSON: Evaluates the validity of the JSON output. Parameters: none.
  • Valid XML: Evaluates the validity of the XML output. Parameters: none.
  • String contains: Determines whether a string contains a certain value. Parameters: searchString (the search string to look up) and ignoreCase (ignore capitalization).
  • JSON Schema match: Determines whether the output of the agent is valid JSON conforming to the schema you specify. Parameters: JSON Schema (the format of your JSON schema).
  • Levenshtein: Returns a similarity score based on the Levenshtein distance between two strings, using the formula similarity = 1 - (distance / length of the longer string). For example, the distance between Hello World and hello borld is 2, so the reviewer returns 1 - 2/11 ≈ 0.82. Parameters: none.
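
To make the Levenshtein formula concrete, this TypeScript sketch computes the similarity score for the example above. It illustrates the formula only and is not Retool's implementation.

```typescript
// Illustration of the Levenshtein similarity formula described above;
// function names are not the Retool implementation.
function levenshtein(a: string, b: string): number {
  // Standard dynamic-programming edit distance.
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

function levenshteinScore(output: string, expected: string): number {
  const distance = levenshtein(output, expected);
  const longer = Math.max(output.length, expected.length);
  return longer === 0 ? 1 : 1 - distance / longer;
}

// "Hello World" vs. "hello borld": distance 2, longer string length 11.
console.log(levenshteinScore("Hello World", "hello borld").toFixed(2)); // "0.82"
```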

LLM-as-a-judge

LLM-as-a-judge reviewers use an LLM to score the agent's output based on the Prompt you define and the Choice scores you assign between 0 and 1. Use LLM-as-a-judge reviewers when the agent's expected output is not clearly defined.

  • Tone detection: Evaluates the tone of an agent's output based on the user input prompt. Parameters: Prompt and Choice scores.
  • Factuality: Evaluates whether an output is factual. Parameters: Prompt and Choice scores.
  • Closed QA: Evaluates whether an output answers the input. Parameters: Prompt and Choice scores.
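
To illustrate how a Prompt and Choice scores combine, here is a hypothetical TypeScript sketch of a tone-detection judge. The configuration shape and the callLLM helper are assumptions for the example, not the Retool API.

```typescript
// Hypothetical sketch of an LLM-as-a-judge reviewer: a prompt instructs the
// judge LLM to pick a choice, and each choice maps to a score between 0 and 1.
interface JudgeConfig {
  prompt: string;                       // Instructions for the judge LLM
  choiceScores: Record<string, number>; // Each allowed choice maps to a score between 0 and 1
}

const toneDetection: JudgeConfig = {
  prompt:
    "Given the user's input and the agent's output, classify the tone of the " +
    "output as 'professional', 'neutral', or 'inappropriate'. Respond with the label only.",
  choiceScores: { professional: 1, neutral: 0.5, inappropriate: 0 },
};

// Hypothetical scoring flow: ask the judge LLM for a choice, then map it to a score.
async function judge(config: JudgeConfig, input: string, output: string): Promise<number> {
  const choice = await callLLM(`${config.prompt}\n\nInput: ${input}\nOutput: ${output}`);
  return config.choiceScores[choice.trim().toLowerCase()] ?? 0;
}

// Placeholder for whatever LLM client you use.
declare function callLLM(prompt: string): Promise<string>;
```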