Example Configurations¶
Simple Eval Set¶
A minimal evaluation that runs two tasks against GPT-4o-mini:
simple.eval-set.yaml
tasks:
- package: "git+https://github.com/UKGovernmentBEIS/inspect_evals@dac86bcfdc090f78ce38160cef5d5febf0fb3670"
name: inspect_evals
items:
- name: mbpp
- name: class_eval
models:
- package: openai
name: openai
items:
- name: gpt-4o-mini
limit: 1
Submit it:
Eval Set with Secrets¶
Declares required secrets so Hawk catches missing credentials before the job starts:
simple-with-secrets.eval-set.yaml
tasks:
- package: "git+https://github.com/UKGovernmentBEIS/inspect_evals@dac86bcfdc090f78ce38160cef5d5febf0fb3670"
name: inspect_evals
items:
- name: mbpp
- name: class_eval
models:
- package: openai
name: openai
items:
- name: gpt-4o-mini
runner:
secrets:
- name: OPENAI_API_KEY
description: "OpenAI API key for model access"
- name: HF_TOKEN
description: "HuggingFace token for dataset access"
limit: 1
Submit with secrets:
hawk eval-set examples/simple-with-secrets.eval-set.yaml \
--secret OPENAI_API_KEY --secret HF_TOKEN
Docker-in-Sandbox¶
Demonstrates running Docker containers inside sandbox pods:
docker-in-sandbox.eval-set.yaml
name: docker_in_sandbox_demo
tasks:
- package: "git+https://github.com/METR/inspect-action.git@docker-in-sandbox-demo#subdirectory=demo/docker_in_sandbox_task"
name: docker_in_sandbox_task
items:
- name: docker_in_sandbox_hello
args:
sample_count: 1
models:
- package: "git+https://github.com/METR/inspect-test-utils@fc65be62a7f9135781c835c83cea1a33f9b20279"
name: hardcoded
items:
- name: hardcoded
args:
tool_calls:
- tool_name: bash
tool_args:
cmd: "dockerd-entrypoint.sh >/tmp/dockerd.log 2>&1 & ..."
answer: "hello"
Simple Scan¶
Runs Scout scanners over transcripts from a previous evaluation:
simple.scan.yaml
scanners:
- package: "git+https://github.com/METR/inspect-agents@metr_scanners/v0.1.0#subdirectory=packages/scanners"
name: metr_scanners
items:
- name: reward_hacking_scanner
- name: sandbagging_scanner
- name: broken_env_scanner
models:
- package: openai
name: openai
items:
- name: gpt-5
transcripts:
sources:
- eval_set_id: inspect-eval-set-t03dzj2ejftj506u
filter:
where:
- eval_status: success
limit: 10
shuffle: true
Submit it: