Skip to content

Debugging Stuck Evaluations

This guide documents debugging techniques for evaluations that aren't progressing.

Quick Diagnosis Checklist

  1. Verify authentication: hawk auth access-token > /dev/null || echo "Run 'hawk login' first"
  2. Check job status: hawk status <eval-set-id> — JSON report with pod status, logs, metrics
  3. View logs: hawk logs <eval-set-id> or hawk logs -f for follow mode
  4. List samples: hawk list samples <eval-set-id> — see which samples completed/failed
  5. Get transcript: hawk transcript <sample-uuid> — view conversation for a specific sample
  6. Check buffer in S3: Download .buffer/ segments to analyze eval state
  7. Test API directly: Reproduce the error outside Inspect to isolate the issue

System Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   hawk CLI   │────>│  API Server  │────>│    Helm      │
│ (eval-set)   │     │   (FastAPI)  │     │  (releases)  │
└──────────────┘     └──────────────┘     └──────────────┘
                                                v
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  S3 Logs     │<────│ Runner Pod   │────>│ Sandbox Pods │
│ (.eval,      │     │ (runner ns)  │     │ (eval-set ns)│
│  .buffer/)   │     └──────────────┘     └──────────────┘
└──────────────┘            │
                            v
                     ┌──────────────┐     ┌──────────────┐
                     │  Middleman   │────>│ Provider APIs│
                     │  (auth proxy)│     │              │
                     └──────────────┘     └──────────────┘

Common Error Patterns

API 500 Errors with Retries

What it looks like:

Error code: 500 - Internal server error
Retry attempt 45 of 50 failed. Waiting 1800 seconds before next retry.

Diagnosis:

  1. Download the .buffer/ from S3 to find the failing request
  2. Test the request through Middleman: curl https://middleman.internal.metr.org/...
  3. Test directly against the provider API to isolate whether it's a Middleman issue

Note

500 errors are NOT token limit issues. Token limits return 400 errors.

Token/Context Limit Errors (400)

Error code: 400 - invalid_request_error

This IS a token limit issue. Check message count, configured token_limit, and model context window.

Retry Logs

Retry log messages include a context prefix identifying the sample:

[nWJu3Mz mmlu/42/1 openai/gpt-4o] -> openai/gpt-4o retry 3 (retrying in 24 seconds) [RateLimitError 429 rate_limit_exceeded]

The prefix format is [{sample_uuid} {task}/{sample_id}/{epoch} {model}].

Tip

The OpenAI SDK does not show the actual error in its retry messages. You must test with curl directly to see the real error.

Pod UID Mismatch

Pod UID mismatch: expected abc123, got def456

The sandbox pod was killed and restarted. There's nothing to do — Inspect will automatically retry the sample.

OOMKilled Pods

Check pod status via hawk status or:

kubectl describe pod -n <runner-namespace> <pod-name> | grep -A5 "Last State"

Look for OOMKilled in the termination reason.

Accessing Eval State

From S3

# List eval set contents
aws s3 ls s3://<bucket>/evals/<eval-set-id>/

# Download buffer for analysis
aws s3 sync s3://<bucket>/evals/<eval-set-id>/.buffer/ /tmp/buffer/

# Download completed .eval files
aws s3 sync s3://<bucket>/evals/<eval-set-id>/ /tmp/results/ --exclude ".buffer/*"

Reading .eval Files

.eval files are zip archives. Read them using the inspect_ai library:

from inspect_ai.log import read_eval_log

log = read_eval_log("path/to/file.eval")
for sample in log.samples:
    print(sample.id, sample.score)

Sample Buffer Analysis

The .buffer/ directory contains SQLite databases with eval state.

Check overall progress:

SELECT COUNT(*) as total_events FROM events;

SELECT json_extract(data, '$.state') as state, COUNT(*)
FROM samples GROUP BY state;

Detect if eval is progressing (FAIL-OK pattern):

SELECT
  id,
  CASE WHEN json_extract(data, '$.error') IS NULL THEN 'OK' ELSE 'FAIL' END as status
FROM events
WHERE json_extract(data, '$.event') = 'model_output'
ORDER BY id DESC LIMIT 50;
  • FAIL, OK, FAIL, OK... = Progressing (transient errors, retries succeed)
  • FAIL, FAIL, FAIL... = Stuck (something changed, now all fail)

Find pending events (stuck API calls):

SELECT COUNT(*) FROM events WHERE json_extract(data, '$.pending') = 1;

Direct API Testing

Test through Middleman to isolate whether the issue is Inspect AI, Middleman, or the provider.

TOKEN=$(hawk auth access-token)

# Test Anthropic
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'

# Test OpenAI-compatible APIs
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 100
  }'

Interpretation:

Result Meaning
All tests pass Issue is content-specific or Inspect AI bug
Middleman fails, direct to provider works Middleman issue
Both fail Provider API issue
Large request fails with 400 Token limit
Large request fails with 500 API bug with large context

Early Warning Signs

HTTP Retry Count Growth

Task progress logs include an "HTTP retries" counter:

{"message": "Task: search_server, Steps: 12/12 100%, ... HTTP retries: 5,710", ...}

Tasks can complete successfully despite thousands of retries. A growing retry count indicates API instability, but isn't fatal until retries stop succeeding entirely.

Severity calculation: Retry count x wait time = stuck duration. E.g., 45 retries x 1800s = 22+ hours stuck.

Command Cheatsheet

Hawk CLI

hawk status <eval-set-id>                    # JSON monitoring report
hawk logs <eval-set-id>                      # View logs
hawk logs -f                                 # Follow logs live
hawk list samples <eval-set-id>              # List samples with status
hawk transcript <sample-uuid>                # View sample conversation
hawk delete <eval-set-id>                    # Delete eval and cleanup
hawk eval-set <config.yaml>                  # Start/restart eval

S3 Access

aws s3 ls s3://<bucket>/evals/<eval-set-id>/
aws s3 sync s3://<bucket>/evals/<eval-set-id>/.buffer/ /tmp/buffer/

Kubectl (Advanced)

kubectl get pods -n <runner-ns> | grep <eval-set-id>   # Find runner pod
kubectl logs -n <runner-ns> <pod-name> --tail=200      # Pod logs
kubectl get pods -n <eval-set-id>                      # Sandbox pods
kubectl describe pod -n <runner-ns> <pod-name>         # Full pod details

Escalation Checklist

If you can't resolve the issue:

  1. Gather evidence: hawk status output, error patterns, sample buffer state, failing request payloads
  2. Determine scope: One eval or multiple? One task or all? One model or all?
  3. Check external status: Provider status pages, AWS status, internal monitoring dashboards
  4. Document timeline: When did the eval start? When did it get stuck? Last successful operation?