Inspect Hawk¶
Run Inspect AI evaluations at scale on AWS.
Inspect Hawk is a platform for running Inspect AI evaluations on cloud infrastructure. You define tasks, agents, and models in a YAML config, and Hawk handles everything else: provisioning isolated Kubernetes pods, managing LLM API credentials, streaming logs, storing results in a PostgreSQL warehouse, and serving a web UI to browse them.
Hawk is built on Inspect AI, the open-source evaluation framework created by the UK AI Safety Institute. Inspect provides the evaluation primitives (tasks, solvers, scorers, sandboxes). Hawk provides the infrastructure to run those evaluations reliably at scale across multiple models and tasks, without manually provisioning machines or managing API keys.
The system is designed for teams that need to run evaluations regularly and at volume. It supports row-level security and access control per model, a managed LLM proxy, and a data warehouse for querying results across runs. It also supports Inspect Scout scans over previous evaluation transcripts.
Features¶
- One YAML, full grid. Define tasks, agents, and models. Hawk runs every combination.
- Kubernetes-native. Each eval gets its own pod and fresh virtualenv. Sandboxes run in separate pods with network isolation.
- Built-in LLM proxy. Managed proxy for OpenAI, Anthropic, and Google Vertex with automatic token refresh. Bring your own keys if you prefer.
- Live monitoring.
hawk logs -fstreams logs in real-time.hawk statusreturns a structured JSON report. - Web UI. Browse eval sets, filter samples by score and full-text search, compare across runs, export to CSV.
- Scout scanning. Run scanners over transcripts from previous evals.
- Data warehouse. Results land in PostgreSQL with trigram search and covering indexes.
- Access control. Model group permissions gate who can run models, view logs, and scan eval sets.
- Sample editing. Batch edit scores, invalidate samples. Full audit trail.
- Local mode.
hawk local eval-setruns the same config on your machine.--directlets you attach a debugger. - Resumable scans. Configs save to S3.
hawk scan resumepicks up where you left off.
What Hawk Deploys¶
When you run pulumi up, Hawk creates the following infrastructure on AWS:
| Component | Service | Purpose |
|---|---|---|
| Compute (evals) | EKS | Runs evaluation jobs as isolated Kubernetes pods |
| Compute (API) | ECS Fargate | Hosts the Hawk API server and LLM proxy |
| Database | Aurora PostgreSQL Serverless v2 | Results warehouse with IAM auth, auto-pauses when idle |
| Storage | S3 | Eval logs, written directly by Inspect AI |
| Event processing | EventBridge + Lambda | Imports logs into the warehouse, manages access control |
| Web viewer | CloudFront | Browse and analyze evaluation results |
| Networking | VPC + ALB | Internet-facing load balancer with TLS (configurable) |
| DNS | Route53 | Service discovery and public DNS |
The infrastructure scales down to near-zero cost when idle (Aurora auto-pauses, Karpenter scales EKS nodes to zero) and scales up automatically when you submit evaluations.
Next Steps¶
- Quick Start — Deploy Hawk from zero
- Installation — Install the CLI and run your first eval
- Architecture — Understand how the system works
- CLI Reference — Full command reference