Phase 7: Benchmarks, CI Matrix, and Smarter Tooling

With the safety profile infrastructure in place (Phases 0-6), we need to actually measure the profiles' impact. This post covers the benchmark suite, cross-profile CI, and some quality-of-life tooling improvements.

Micro-benchmark suite

benchmarks/bench.c is a static musl binary included in the initramfs. It measures eight fundamental kernel operations:

| Benchmark  | What it measures                          |
|------------|-------------------------------------------|
| getpid     | Bare syscall round-trip                   |
| read_null  | read(/dev/null, 1) latency                |
| write_null | write(/dev/null, 1) latency               |
| pipe       | pipe read/write throughput (4 KB chunks)  |
| fork_exit  | fork() + waitpid() latency                |
| open_close | open() + close() a tmpfs file             |
| mmap_fault | Anonymous mmap + page fault throughput    |
| stat       | stat() latency                            |

Output is machine-parseable: BENCH <name> <iters> <total_ns> <per_iter_ns>. A --quick flag reduces iteration counts for QEMU TCG, where emulation adds ~10,000x overhead.
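
To make the line format concrete, here is a minimal sketch of how a runner could pull BENCH lines out of a noisy serial log. It is not the code in run-benchmarks.py; the names BenchResult, parse_bench_line, and parse_bench_output are hypothetical, and it assumes all three numeric fields are printed as integers.

```python
from typing import Dict, NamedTuple, Optional


class BenchResult(NamedTuple):
    name: str
    iters: int
    total_ns: int
    per_iter_ns: int


def parse_bench_line(line: str) -> Optional[BenchResult]:
    """Parse one 'BENCH <name> <iters> <total_ns> <per_iter_ns>' line.

    Returns None for anything else (boot messages, shell prompts, ...).
    Assumption: the three numeric fields are integers.
    """
    fields = line.split()
    if len(fields) != 5 or fields[0] != "BENCH":
        return None
    try:
        iters, total_ns, per_iter_ns = (int(f) for f in fields[2:])
    except ValueError:
        return None
    return BenchResult(fields[1], iters, total_ns, per_iter_ns)


def parse_bench_output(text: str) -> Dict[str, BenchResult]:
    """Collect every well-formed BENCH line in a log, keyed by benchmark name."""
    results: Dict[str, BenchResult] = {}
    for line in text.splitlines():
        parsed = parse_bench_line(line.strip())
        if parsed is not None:
            results[parsed.name] = parsed
    return results
```

Skipping malformed lines instead of raising keeps the parser usable on raw QEMU console output, where kernel log lines are interleaved with benchmark results.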

Python runner and comparison

benchmarks/run-benchmarks.py wraps the whole flow:

# Run on Kevlar (builds, boots QEMU, parses output)
python3 benchmarks/run-benchmarks.py run --profile balanced

# Run on native Linux for baseline
python3 benchmarks/run-benchmarks.py linux --binary ./bench

# Compare JSON result files side-by-side
python3 benchmarks/run-benchmarks.py compare kevlar.json linux.json

# Run all four safety profiles
python3 benchmarks/run-benchmarks.py all-profiles

Or via Make:

make bench PROFILE=balanced
make bench-all
make bench-compare BENCH_FILES="a.json b.json"

CI matrix: all four profiles

The CI workflow now tests all four safety profiles in parallel:

strategy:
  fail-fast: false
  matrix:
    profile: [fortress, balanced, performance, ludicrous]

Each profile gets its own cargo check step using the correct target spec (x64-unwind.json for fortress/balanced, x64.json for performance/ludicrous). A separate clippy job runs on the balanced profile, and rustfmt runs independently.
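
The profile-to-target-spec pairing described above can be restated as a small lookup. This is only an illustrative Python sketch; the real selection happens in the CI workflow, and TARGET_SPECS and target_spec_for are hypothetical names.

```python
# Pairing taken from the text: the two stricter profiles use the
# unwinding target spec, the two faster ones use the plain spec.
TARGET_SPECS = {
    "fortress": "x64-unwind.json",
    "balanced": "x64-unwind.json",
    "performance": "x64.json",
    "ludicrous": "x64.json",
}


def target_spec_for(profile: str) -> str:
    """Return the target spec file name for a safety profile."""
    try:
        return TARGET_SPECS[profile]
    except KeyError:
        raise ValueError(f"unknown safety profile: {profile!r}") from None
```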

QEMU port conflict handling

Previous QEMU sessions sometimes lingered, holding ports 20022 and 20080. run-qemu.py now detects port conflicts at startup using socket.bind(), identifies the holder via ss -tlnp, and kills stale QEMU processes automatically. This eliminates the "address already in use" failures that plagued iterative development.
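
The detection step can be sketched as a bind probe. This is an assumption-laden illustration, not the code in run-qemu.py: port_is_free and busy_ports are hypothetical helpers, the probe only checks 127.0.0.1, and the ss lookup and process-killing steps are left out.

```python
import socket

# Ports named in the text as the ones stale QEMU sessions hold on to.
QEMU_PORTS = (20022, 20080)


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


def busy_ports(ports=QEMU_PORTS):
    """List the ports from `ports` that cannot be bound right now."""
    return [p for p in ports if not port_is_free(p)]
```

A caller would run busy_ports() before launching QEMU and only go looking for the holder when the list is non-empty. The probe socket is closed immediately, so it does not hold the port itself.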

Build system fixes

  • INIT_SCRIPT override: The Makefile now conditionally sets INIT_SCRIPT=/bin/sh only when not already set, so make bench can override it to /bin/bench.
  • build.rs env tracking: kernel/build.rs declares cargo::rerun-if-env-changed=INIT_SCRIPT so Cargo recompiles when the init script changes — no more stale binaries after switching between shell and bench modes.
  • Docker context: The build context is now the repo root (not testing/), allowing the Dockerfile to COPY benchmarks/bench.c directly.

Early results (QEMU TCG, quick mode)

These numbers come from software emulation and are useful only for relative comparison between profiles, not as a measure of absolute performance:

| Benchmark  | Kevlar (ns/op) | Linux (ns/op) |
|------------|----------------|---------------|
| getpid     | 2,233,600      | 264           |
| read_null  | 4,289,000      | 306           |
| write_null | 4,164,600      | 288           |
| pipe       | 36,718,750     | 1,342         |

The gap of roughly 10,000x (from about 8,500x for getpid to about 27,000x for pipe) is pure TCG overhead. Real performance comparison requires KVM (make run KVM=1) or native boot, which is where this infrastructure will shine as Kevlar matures.
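
The per-benchmark slowdown can be recomputed directly from the table. The snippet below is a small worked example, not the compare subcommand; RESULTS_NS_PER_OP and slowdowns are hypothetical names, and the values are copied from the table above.

```python
# (Kevlar ns/op, Linux ns/op), copied from the results table.
RESULTS_NS_PER_OP = {
    "getpid": (2_233_600, 264),
    "read_null": (4_289_000, 306),
    "write_null": (4_164_600, 288),
    "pipe": (36_718_750, 1_342),
}


def slowdowns(results):
    """Map each benchmark name to its Kevlar/Linux ns-per-op ratio."""
    return {name: kevlar / linux for name, (kevlar, linux) in results.items()}


for name, ratio in slowdowns(RESULTS_NS_PER_OP).items():
    print(f"{name:<12}{ratio:>10,.0f}x")
```

Reporting a ratio per benchmark rather than a single average makes it easier to spot operations, such as pipe here, whose emulation cost is well above the rest.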

What's next

  • Fix the GPF-in-userspace bug that crashes fork_exit and every benchmark after it, which is why the results table stops at pipe
  • KVM-accelerated benchmark runs for meaningful Kevlar vs Linux numbers
  • Profile-to-profile comparison to quantify the cost of safety features