Skip to content

Architecture

Hawk provides infrastructure for running Inspect AI evaluations in a Kubernetes environment using YAML configuration files.

Overview

flowchart TD
    User["<b>hawk eval-set config.yaml</b>"]
    API["Hawk API Server<br/><i>FastAPI on ECS Fargate</i>"]
    EKS["Kubernetes (EKS)<br/><i>Scaled by Karpenter</i>"]
    Runner["Runner Pod<br/><i>Creates virtualenv, runs inspect_ai.eval_set()</i>"]
    Sandbox["Sandbox Pod(s)<br/><i>Isolated execution · Cilium network policies</i>"]
    S3[("S3<br/><i>Eval logs</i>")]
    EB["EventBridge → Lambda<br/><i>Tag, import to warehouse</i>"]
    DB[("Aurora PostgreSQL<br/><i>Results warehouse</i>")]
    Viewer["Web Viewer<br/><i>CloudFront · Browse, filter, export</i>"]
    Middleman["Middleman<br/><i>LLM Proxy</i>"]
    LLMs["LLM Providers<br/><i>OpenAI · Anthropic · Google · etc.</i>"]

    User --> API
    API -- "Validates config & auth<br/>Creates Helm release" --> EKS
    EKS --> Runner
    EKS --> Sandbox
    Runner -- "Writes logs" --> S3
    Runner <-- "API calls" --> Middleman
    Middleman --> LLMs
    S3 -- "S3 event" --> EB
    EB --> DB
    Viewer --> DB

Detailed Architecture

graph TB
    subgraph "User's Computer"
        CLI[hawk eval-set]
    end

    subgraph "Hawk API Service"
        API[FastAPI Server]
        AUTH[JWT Auth]
    end

    subgraph "Kubernetes Control Plane"
        HELM1[Helm Release 1<br/>Hawk]
        CHART1[Helm Chart 1<br/>Inspect Runner Job Template]

        HELM2[Helm Release 2<br/>inspect-k8s-sandbox]
        CHART2[Helm Chart 2<br/>Sandbox Environment Template]
    end

    subgraph "Inspect Runner Pod"
        HAWKLOCAL[hawk local]
        VENV[Virtual Environment]
        RUNNER[hawk.runner]
        INSPECT[inspect_ai.eval_set]
        K8SSANDBOX[inspect_k8s_sandbox]
    end

    subgraph "Sandbox Environment"
        POD2[Sandbox Environment Pod]
    end

    subgraph "AWS Infrastructure"
        S3[S3 Bucket<br/>Log Storage]
        EB[EventBridge]
        L1[eval_updated<br/>Lambda]
        L2[eval_log_reader<br/>Lambda]
        L3[token_broker<br/>Lambda]
        OL[S3 Object Lambda<br/>Access Point]
    end

    CLI -->|HTTP Request| API
    API -->|Authenticate| AUTH
    API -->|Create Release| HELM1
    HELM1 -->|Deploy| CHART1
    CHART1 -->|Run| HAWKLOCAL
    HAWKLOCAL -->|Create venv| VENV
    VENV -->|Execute| RUNNER
    RUNNER -->|Call| INSPECT
    INSPECT -->|Invoke| K8SSANDBOX
    K8SSANDBOX -->|Create Release| HELM2
    HELM2 -->|Deploy| CHART2
    CHART2 -->|Create Pod| POD2
    INSPECT -->|Run Commands| POD2

    RUNNER -->|Get Scoped Credentials| L3
    L3 -->|Validate Permissions| S3
    INSPECT -->|Write Logs| S3
    S3 -->|Object Created Event| EB
    EB -->|Trigger| L1
    CLI -->|Read Logs| OL
    OL -->|Check Permissions| L2
    L2 -->|Authorized Access| S3

Components

CLI (hawk)

The CLI is the primary interface for users. It handles configuration file parsing, API communication, and credential storage using keyring.

API Server

FastAPI-based REST API running on ECS Fargate. Handles JWT authentication, job orchestration via Helm, and configuration validation using Pydantic models.

Helm Chart: Inspect Runner Job Template

Each job gets its own isolated namespace. Resources created per job:

  • Namespace — runner namespace, plus a separate sandbox namespace for eval sets
  • Job — the Kubernetes job that runs the evaluation
  • ConfigMap — stores the eval set configuration and per-job kubeconfig
  • Secret — API keys, common env vars, and user-provided secrets
  • ServiceAccount — with AWS IAM role annotation and RoleBinding
  • CiliumNetworkPolicy — network isolation allowing egress only to sandbox namespace, kube-dns, API server, and external services

Namespace Naming

  • Runner namespace: {prefix}-{job_id} (e.g., inspect-my-eval-123)
  • Sandbox namespace: {runner_namespace}-s (e.g., inspect-my-eval-123-s)
  • Maximum job ID: 43 characters (enforced to stay within Kubernetes' 63-char namespace limit)

Runner (hawk.runner.entrypoint)

The runner pod's entrypoint creates an isolated Python virtualenv, installs dependencies (inspect_k8s_sandbox, task/solver/model packages), then executes the evaluation. This isolation prevents dependency conflicts.

Sandbox Environments

Sandbox pods run within a StatefulSet (auto-recreated on crash) with resource constraints and limited network access, provided by inspect_k8s_sandbox.

Log Storage

Evaluation logs are written directly to S3 by Inspect AI:

  • *.eval — individual evaluation result files (zip archives)
  • logs.json — metadata mapping eval file paths to headers
  • .buffer/ — in-progress state (SQLite databases)

Lambda Functions

Lambda Purpose
eval_updated Triggered by S3 events on .eval and logs.json creation. Tags files by model, publishes completion events to EventBridge.
eval_log_reader S3 Object Lambda for authenticated log access. Checks user permissions via model group membership.
token_refresh Refreshes OAuth access tokens used by other Lambda functions.
token_broker Exchanges user JWTs for scoped AWS credentials, enforcing multi-tenant isolation at the credential level.

Log Access Flow

  1. User requests logs via hawk web or hawk transcript
  2. Request routes through S3 Object Lambda Access Point
  3. eval_log_reader Lambda validates user permissions against model groups
  4. Authorized users receive the requested log data

Users should never access the underlying S3 bucket directly — always through the Object Lambda Access Point.

Repository Structure

infra/                  Pulumi infrastructure (Python)
  __main__.py           Entrypoint
  core/                 VPC, EKS, ALB, ECS, RDS, Route53, S3
  k8s/                  Karpenter, Cilium, Datadog agent, RBAC
  hawk/                 Hawk API (ECS), Lambdas, EventBridge, CloudFront
  datadog/              Monitors, dashboards (optional)
  lib/                  Shared config, naming, tagging helpers

hawk/                   Hawk application (Python + React)
  cli/                  CLI (Click-based)
  api/                  API server (FastAPI)
  runner/               Kubernetes job runner
  core/                 Shared types, DB models, log importer
  www/                  Web viewer (React + TypeScript + Vite)
  terraform/            Lambda function modules (OpenTofu)
  examples/             Example eval and scan configs
  tests/                Unit, E2E, and smoke tests

middleman/              LLM proxy (OpenAI, Anthropic, Google Vertex)

Pulumi.example.yaml     Documented config reference