Debugging Stuck Evaluations
This guide documents debugging techniques for evaluations that aren't progressing.
Quick Diagnosis Checklist
- **Verify authentication:** `hawk auth access-token > /dev/null || echo "Run 'hawk login' first"`
- **Check job status:** `hawk status <eval-set-id>` — JSON report with pod status, logs, and metrics
- **View logs:** `hawk logs <eval-set-id>`, or `hawk logs -f` for follow mode
- **List samples:** `hawk list samples <eval-set-id>` — see which samples completed/failed
- **Get transcript:** `hawk transcript <sample-uuid>` — view the conversation for a specific sample
- **Check the buffer in S3:** download the `.buffer/` segments to analyze eval state
- **Test the API directly:** reproduce the error outside Inspect to isolate the issue
System Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ hawk CLI │────>│ API Server │────>│ Helm │
│ (eval-set) │ │ (FastAPI) │ │ (releases) │
└──────────────┘ └──────────────┘ └──────────────┘
│
v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ S3 Logs │<────│ Runner Pod │────>│ Sandbox Pods │
│ (.eval, │ │ (runner ns) │ │ (eval-set ns)│
│ .buffer/) │ └──────────────┘ └──────────────┘
└──────────────┘ │
v
┌──────────────┐ ┌──────────────┐
│ Middleman │────>│ Provider APIs│
│ (auth proxy)│ │ │
└──────────────┘ └──────────────┘
Common Error Patterns
API 500 Errors with Retries
What it looks like:
Error code: 500 - Internal server error
Retry attempt 45 of 50 failed. Waiting 1800 seconds before next retry.
Diagnosis:
- Download the `.buffer/` from S3 to find the failing request
- Test the request through Middleman: `curl https://middleman.internal.metr.org/...`
- Test directly against the provider API to isolate whether it's a Middleman issue
Note
500 errors are NOT token limit issues. Token limits return 400 errors.
Token/Context Limit Errors (400)
Unlike the 500s above, this IS a token limit issue: check the message count, the configured `token_limit`, and the model's context window.
Retry Logs
Retry log messages include a context prefix identifying the sample:
[nWJu3Mz mmlu/42/1 openai/gpt-4o] -> openai/gpt-4o retry 3 (retrying in 24 seconds) [RateLimitError 429 rate_limit_exceeded]
The prefix format is [{sample_uuid} {task}/{sample_id}/{epoch} {model}].
Tip
The OpenAI SDK does not show the actual error in its retry messages. You must test with curl directly to see the real error.
Pod UID Mismatch
The sandbox pod was killed and restarted. There's nothing to do — Inspect will automatically retry the sample.
OOMKilled Pods
Check pod status via `hawk status` or `kubectl describe pod -n <runner-ns> <pod-name>`, and look for `OOMKilled` in the termination reason.
Accessing Eval State
From S3
# List eval set contents
aws s3 ls s3://<bucket>/evals/<eval-set-id>/
# Download buffer for analysis
aws s3 sync s3://<bucket>/evals/<eval-set-id>/.buffer/ /tmp/buffer/
# Download completed .eval files
aws s3 sync s3://<bucket>/evals/<eval-set-id>/ /tmp/results/ --exclude ".buffer/*"
Reading .eval Files
.eval files are zip archives. Read them using the inspect_ai library:
from inspect_ai.log import read_eval_log

log = read_eval_log("path/to/file.eval")
for sample in log.samples:
    print(sample.id, sample.scores)
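Because a `.eval` file is a plain zip, you can also peek inside one with standard zip tooling when `inspect_ai` isn't installed. A minimal sketch; the stand-in member name below is illustrative and a real archive's layout will differ:

```python
import io
import zipfile

def list_eval_members(data: bytes) -> list[str]:
    """List the member files inside a .eval archive (a plain zip)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()

# Build a stand-in archive to demonstrate; real .eval member names differ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("header.json", "{}")
members = list_eval_members(buf.getvalue())
```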
Sample Buffer Analysis
The .buffer/ directory contains SQLite databases with eval state.
Check overall progress:
SELECT COUNT(*) as total_events FROM events;
SELECT json_extract(data, '$.state') as state, COUNT(*)
FROM samples GROUP BY state;
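To experiment with these queries offline, you can sketch the layout in an in-memory SQLite database. The `samples`/`events` schema below is inferred from the queries above, not an exact copy of the buffer's real schema:

```python
import json
import sqlite3

# Toy buffer with the columns the queries above rely on.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, data TEXT)")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, data TEXT)")
for state in ["completed", "completed", "running"]:
    con.execute("INSERT INTO samples (data) VALUES (?)",
                (json.dumps({"state": state}),))
con.execute("INSERT INTO events (data) VALUES (?)",
            (json.dumps({"event": "model_output"}),))

total_events = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
by_state = con.execute(
    "SELECT json_extract(data, '$.state') AS state, COUNT(*) "
    "FROM samples GROUP BY state"
).fetchall()
```

The same `json_extract` calls work unchanged against a downloaded buffer database via `sqlite3 /tmp/buffer/<file>`.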
Detect if eval is progressing (FAIL-OK pattern):
SELECT
id,
CASE WHEN json_extract(data, '$.error') IS NULL THEN 'OK' ELSE 'FAIL' END as status
FROM events
WHERE json_extract(data, '$.event') = 'model_output'
ORDER BY id DESC LIMIT 50;
- `FAIL, OK, FAIL, OK...` = progressing (transient errors; retries succeed)
- `FAIL, FAIL, FAIL...` = stuck (something changed, and now all requests fail)
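The pattern check can be mechanized once you have the statuses out of the query. A small helper, assuming a newest-first list of `OK`/`FAIL` strings (the function name and window size are illustrative):

```python
def classify(statuses: list[str], window: int = 10) -> str:
    """Classify a newest-first list of 'OK'/'FAIL' model_output statuses.

    If any of the most recent `window` results is OK, retries are still
    succeeding; if they are all FAIL, the eval is likely stuck.
    """
    recent = statuses[:window]
    if not recent:
        return "unknown"
    return "stuck" if all(s == "FAIL" for s in recent) else "progressing"
```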
Find pending events (stuck API calls):
Direct API Testing
Test through Middleman to isolate whether the issue is Inspect AI, Middleman, or the provider.
TOKEN=$(hawk auth access-token)
# Test Anthropic
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"max_tokens": 100,
"messages": [{"role": "user", "content": "Say hello"}]
}'
# Test OpenAI-compatible APIs
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 100
}'
Interpretation:
| Result | Meaning |
|---|---|
| All tests pass | Issue is content-specific or Inspect AI bug |
| Middleman fails, direct to provider works | Middleman issue |
| Both fail | Provider API issue |
| Large request fails with 400 | Token limit |
| Large request fails with 500 | API bug with large context |
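The table above amounts to a small decision procedure; encoding it makes the triage repeatable. A sketch (the function and its parameters are illustrative, not part of hawk):

```python
def triage(middleman_ok: bool, provider_ok: bool,
           large_request_status: int | None = None) -> str:
    """Map direct-API test results to a likely culprit, per the table above."""
    if large_request_status == 400:
        return "token limit"
    if large_request_status == 500:
        return "API bug with large context"
    if middleman_ok and provider_ok:
        return "content-specific or Inspect AI bug"
    if not middleman_ok and provider_ok:
        return "Middleman issue"
    if not middleman_ok and not provider_ok:
        return "provider API issue"
    return "inconclusive"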
Early Warning Signs
HTTP Retry Count Growth
Task progress logs include an "HTTP retries" counter.
Tasks can complete successfully despite thousands of retries. A growing retry count indicates API instability, but isn't fatal until retries stop succeeding entirely.
Severity calculation: retry count × wait time = upper bound on time lost to waiting. E.g., 45 retries × 1800 s = 81,000 s, about 22.5 hours stuck.
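The arithmetic is trivial but worth keeping at hand when deciding whether to kill an eval; a one-liner sketch:

```python
def stuck_duration_hours(retries: int, wait_seconds: float) -> float:
    """Worst-case hours spent waiting: retry count x wait time per retry."""
    return retries * wait_seconds / 3600

# 45 retries at 1800 s each:
hours = stuck_duration_hours(45, 1800)
```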
Command Cheatsheet
Hawk CLI
hawk status <eval-set-id> # JSON monitoring report
hawk logs <eval-set-id> # View logs
hawk logs -f # Follow logs live
hawk list samples <eval-set-id> # List samples with status
hawk transcript <sample-uuid> # View sample conversation
hawk delete <eval-set-id> # Delete eval and cleanup
hawk eval-set <config.yaml> # Start/restart eval
S3 Access
aws s3 ls s3://<bucket>/evals/<eval-set-id>/
aws s3 sync s3://<bucket>/evals/<eval-set-id>/.buffer/ /tmp/buffer/
Kubectl (Advanced)
kubectl get pods -n <runner-ns> | grep <eval-set-id> # Find runner pod
kubectl logs -n <runner-ns> <pod-name> --tail=200 # Pod logs
kubectl get pods -n <eval-set-id> # Sandbox pods
kubectl describe pod -n <runner-ns> <pod-name> # Full pod details
Escalation Checklist
If you can't resolve the issue:
- **Gather evidence:** `hawk status` output, error patterns, sample buffer state, failing request payloads
- **Determine scope:** one eval or multiple? One task or all? One model or all?
- **Check external status:** provider status pages, AWS status, internal monitoring dashboards
- **Document the timeline:** when did the eval start? When did it get stuck? What was the last successful operation?