Inspect Hawk

Run Inspect AI evaluations at scale on AWS.


Inspect Hawk is a platform for running Inspect AI evaluations on cloud infrastructure. You define tasks, agents, and models in a YAML config, and Hawk handles everything else: provisioning isolated Kubernetes pods, managing LLM API credentials, streaming logs, storing results in a PostgreSQL warehouse, and serving a web UI to browse them.

Hawk is built on Inspect AI, the open-source evaluation framework created by the UK AI Safety Institute. Inspect provides the evaluation primitives (tasks, solvers, scorers, sandboxes). Hawk provides the infrastructure to run those evaluations reliably at scale across multiple models and tasks, without manually provisioning machines or managing API keys.

The system is designed for teams that need to run evaluations regularly and at volume. It supports row-level security and access control per model, a managed LLM proxy, and a data warehouse for querying results across runs. It also supports Inspect Scout scans over previous evaluation transcripts.

Features

  • One YAML, full grid. Define tasks, agents, and models. Hawk runs every combination.
  • Kubernetes-native. Each eval gets its own pod and fresh virtualenv. Sandboxes run in separate pods with network isolation.
  • Built-in LLM proxy. Managed proxy for OpenAI, Anthropic, and Google Vertex with automatic token refresh. Bring your own keys if you prefer.
  • Live monitoring. hawk logs -f streams logs in real time. hawk status returns a structured JSON report.
  • Web UI. Browse eval sets, filter samples by score or full-text search, compare across runs, export to CSV.
  • Scout scanning. Run scanners over transcripts from previous evals.
  • Data warehouse. Results land in PostgreSQL with trigram search and covering indexes.
  • Access control. Model group permissions gate who can run models, view logs, and scan eval sets.
  • Sample editing. Batch edit scores, invalidate samples. Full audit trail.
  • Local mode. hawk local eval-set runs the same config on your machine. --direct lets you attach a debugger.
  • Resumable scans. Configs save to S3. hawk scan resume picks up where you left off.
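
To make the "full grid" idea above concrete, here is a minimal sketch of how a set of tasks, agents, and models expands into one run per combination. It is an illustration only: the dictionary keys, the placeholder names, and the expand_grid helper are hypothetical and do not reflect Hawk's actual YAML schema or internals.

```python
from itertools import product

# Hypothetical config contents. The keys and names are placeholders for
# illustration and are NOT Hawk's real config schema.
config = {
    "tasks": ["task_a", "task_b"],
    "agents": ["agent_x"],
    "models": ["model_1", "model_2", "model_3"],
}


def expand_grid(cfg):
    """Return one (task, agent, model) tuple per combination."""
    return list(product(cfg["tasks"], cfg["agents"], cfg["models"]))


runs = expand_grid(config)
for task, agent, model in runs:
    print(f"{task} + {agent} + {model}")

# 2 tasks x 1 agent x 3 models = 6 runs
print(len(runs))
```

The number of runs is the product of the three list lengths, so adding a single model to a config with many tasks and agents can add many evaluations at once.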

What Hawk Deploys

When you run pulumi up, Hawk creates the following infrastructure on AWS:

| Component | Service | Purpose |
|---|---|---|
| Compute (evals) | EKS | Runs evaluation jobs as isolated Kubernetes pods |
| Compute (API) | ECS Fargate | Hosts the Hawk API server and LLM proxy |
| Database | Aurora PostgreSQL Serverless v2 | Results warehouse with IAM auth, auto-pauses when idle |
| Storage | S3 | Eval logs, written directly by Inspect AI |
| Event processing | EventBridge + Lambda | Imports logs into the warehouse, manages access control |
| Web viewer | CloudFront | Browse and analyze evaluation results |
| Networking | VPC + ALB | Internet-facing load balancer with TLS (configurable) |
| DNS | Route53 | Service discovery and public DNS |

The infrastructure scales down to near-zero cost when idle (Aurora auto-pauses, Karpenter scales EKS nodes to zero) and scales up automatically when you submit evaluations.

Next Steps