Inspect Hawk¶

Run Inspect AI evaluations at scale on AWS.

Inspect Hawk is a platform for running Inspect AI evaluations on cloud infrastructure. You define tasks, agents, and models in a YAML config, and Hawk handles everything else: provisioning isolated Kubernetes pods, managing LLM API credentials, streaming logs, storing results in a PostgreSQL warehouse, and serving a web UI to browse them.

Hawk is built on Inspect AI, the open-source evaluation framework created by the UK AI Safety Institute. Inspect provides the evaluation primitives (tasks, solvers, scorers, sandboxes). Hawk provides the infrastructure to run those evaluations reliably at scale across multiple models and tasks, without manually provisioning machines or managing API keys.

The system is designed for teams that need to run evaluations regularly and at volume. It supports row-level security and access control per model, a managed LLM proxy, and a data warehouse for querying results across runs. It also supports Inspect Scout scans over previous evaluation transcripts.

Features¶

One YAML, full grid. Define tasks, agents, and models. Hawk runs every combination.
Kubernetes-native. Each eval gets its own pod and fresh virtualenv. Sandboxes run in separate pods with network isolation.
Built-in LLM proxy. Managed proxy for OpenAI, Anthropic, and Google Vertex with automatic token refresh. Bring your own keys if you prefer.
Live monitoring. hawk logs -f streams logs in real-time. hawk status returns a structured JSON report.
Web UI. Browse eval sets, filter samples by score and full-text search, compare across runs, export to CSV.
Scout scanning. Run scanners over transcripts from previous evals.
Data warehouse. Results land in PostgreSQL with trigram search and covering indexes.
Access control. Model group permissions gate who can run models, view logs, and scan eval sets.
Sample editing. Batch edit scores, invalidate samples. Full audit trail.
Local mode. hawk local eval-set runs the same config on your machine. --direct lets you attach a debugger.
Resumable scans. Configs save to S3. hawk scan resume picks up where you left off.

What Hawk Deploys¶

When you run pulumi up, Hawk creates the following infrastructure on AWS:

Component	Service	Purpose
Compute (evals)	EKS	Runs evaluation jobs as isolated Kubernetes pods
Compute (API)	ECS Fargate	Hosts the Hawk API server and LLM proxy
Database	Aurora PostgreSQL Serverless v2	Results warehouse with IAM auth, auto-pauses when idle
Storage	S3	Eval logs, written directly by Inspect AI
Event processing	EventBridge + Lambda	Imports logs into the warehouse, manages access control
Web viewer	CloudFront	Browse and analyze evaluation results
Networking	VPC + ALB	Internet-facing load balancer with TLS (configurable)
DNS	Route53	Service discovery and public DNS

The infrastructure scales down to near-zero cost when idle (Aurora auto-pauses, Karpenter scales EKS nodes to zero) and scales up automatically when you submit evaluations.

Next Steps¶

Quick Start — Deploy Hawk from zero
Installation — Install the CLI and run your first eval
Architecture — Understand how the system works
CLI Reference — Full command reference