# Tests
Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
## Fields

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input_messages` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_messages` |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `sidecar` | No | Additional metadata passed to evaluators |
## Input

The simplest form is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
## Expected Output

Optional reference response for comparison by evaluators. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```
## Per-Case Execution Overrides

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```
## Per-Case Workspace Config

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]

tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]

  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.
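Because replacement happens per field, a test that overrides only `setup` still inherits any other suite-level workspace fields. A sketch of that resolution, assuming a hypothetical suite config that also defines a `teardown`:

```yaml
# Suite level (hypothetical):
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]
  teardown:
    script: ["bun", "run", "default-teardown.ts"]

# Test level overrides only `setup`:
tests:
  - id: case-1
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]

# Effective config for case-1 (illustrative):
#   setup:    ["bun", "run", "custom-setup.ts"]     # from the test
#   teardown: ["bun", "run", "default-teardown.ts"] # inherited from the suite
```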
## Per-Case Metadata

Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
    workspace:
      setup:
        script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
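A setup script can read this metadata by parsing the JSON it receives on stdin. A minimal sketch of how a script like `checkout_repo.py` might do this; only the `case_metadata` key is documented above, so the rest of the payload shape is an assumption for illustration (a real script would use `json.load(sys.stdin)` instead of the inline sample string):

```python
import json

def read_case_metadata(payload: dict) -> dict:
    """Return the test's `metadata` block from the harness payload."""
    # Per-case metadata arrives under the documented `case_metadata` key.
    return payload.get("case_metadata", {})

# Sample payload standing in for what a setup script would read from stdin.
raw = '{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}'
meta = read_case_metadata(json.loads(raw))
print(meta["repo"], meta["base_commit"])  # → sympy/sympy abc123def
```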
## Sidecar Metadata

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```