# Tests
Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
## Fields

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input_messages` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_messages` |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `sidecar` | No | Additional metadata passed to evaluators |
## Input

The simplest form is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
## Expected Output

Optional reference response for comparison by evaluators. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```
## Per-Case Execution Overrides

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```
## Per-Case Workspace Config

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]

tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]

  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.
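Because replacement happens per field, a test that overrides only `setup` still inherits any other suite-level workspace fields. A sketch of that resolution, assuming a hypothetical suite config that also defines a `teardown`:

```yaml
# Suite level (hypothetical):
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]
  teardown:
    script: ["bun", "run", "default-teardown.ts"]

# Test level overrides only `setup`:
tests:
  - id: case-1
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]

# Effective config for case-1 (illustrative):
#   setup:    ["bun", "run", "custom-setup.ts"]     # from the test
#   teardown: ["bun", "run", "default-teardown.ts"] # inherited from the suite
```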
## Per-Case Metadata

Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
    workspace:
      setup:
        script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
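A setup script can read this metadata by parsing the JSON it receives on stdin. A minimal sketch of how a script like `checkout_repo.py` might do this; only the `case_metadata` key is documented above, so the rest of the payload shape is an assumption for illustration (a real script would use `json.load(sys.stdin)` instead of the inline sample string):

```python
import json

def read_case_metadata(payload: dict) -> dict:
    """Return the test's `metadata` block from the harness payload."""
    # Per-case metadata arrives under the documented `case_metadata` key.
    return payload.get("case_metadata", {})

# Sample payload standing in for what a setup script would read from stdin.
raw = '{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}'
meta = read_case_metadata(json.loads(raw))
print(meta["repo"], meta["base_commit"])  # → sympy/sympy abc123def
```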
## Sidecar Metadata

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```