# Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input_messages` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_messages` |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `sidecar` | No | Additional metadata passed to evaluators |

The simplest form of `input` is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
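Conceptually, the string shorthand is just a normalization step applied before the messages are sent to the target. A minimal sketch, assuming a simple expansion rule (the function name and exact message shape are illustrative, not the framework's API):

```python
def normalize_input(value):
    """Expand a bare string into a single user message; pass message arrays through."""
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value

# A bare string becomes one user message:
print(normalize_input("What is 15 + 27?"))
# A message array is left untouched:
print(normalize_input([{"role": "system", "content": "You are a helpful math tutor."}]))
```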

`expected_output` is an optional reference response that evaluators can compare the target's response against. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```

Use the `execution` field to override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```

The `workspace` field overrides the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.

Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
    workspace:
      setup:
        script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
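A setup script can therefore recover the per-case values by parsing stdin. A hedged sketch of what `checkout_repo.py` might do; the `case_metadata` key comes from the text above, but the rest of the payload's structure is an assumption:

```python
import io
import json

def read_case_metadata(stream) -> dict:
    """Parse the harness's stdin JSON and return the case_metadata object."""
    payload = json.load(stream)
    return payload.get("case_metadata", {})

# Simulated harness payload; a real setup script would pass sys.stdin instead:
example = io.StringIO(json.dumps(
    {"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}
))
meta = read_case_metadata(example)
print(f"checking out {meta['repo']} at {meta['base_commit']}")
# → checking out sympy/sympy at abc123def
```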

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```
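Purely illustrative: a custom evaluator could branch on the sidecar values above, e.g. applying a stricter check when `language: python` is declared. The evaluator signature shown here is an assumption, not the framework's interface; consult the evaluator reference for the real contract:

```python
def evaluate(output: str, sidecar: dict) -> bool:
    """Hypothetical evaluator that uses sidecar context to pick a check."""
    if sidecar.get("language") == "python":
        return "def " in output  # crude structural check for Python code
    return bool(output.strip())  # fallback: any non-empty response passes

print(evaluate("def sort_list(xs): return sorted(xs)", {"language": "python"}))
```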