M10 Phase 9: BusyBox Tests, Benchmarks, and Three Kernel Bugs
We set out to make test-busybox pass and bench-busybox produce
comparable numbers to Linux on KVM. Along the way we found three kernel
bugs, removed Docker from the Linux build, and made KVM the default for
all test targets.
Bug 1: usercopy3 label misalignment
The most impactful bug. Every read from /dev/zero into a large buffer
crashed the kernel with a page fault panic.
The usercopy assembly in platform/x64/usercopy.S has labeled
instructions that the page fault handler recognizes as "safe" — if a
fault occurs at one of these labels, it's a user-space demand page fault
during a kernel usercopy, not a real kernel bug. The handler checks
frame.rip == usercopy3 to decide.
memset_user fills a user buffer with a byte value. It's used by
/dev/zero's read() to fill the user's buffer with zeros:
memset_user:
mov rcx, rdx
cld
usercopy3: ; <-- label HERE
mov al, sil ; <-- but THIS instruction doesn't fault
rep stosb ; <-- THIS one does (writes to user memory)
ret
The label pointed at mov al, sil (a register-to-register move that
never faults), but the actual user-space memory access is rep stosb,
the next instruction, three bytes later (mov al, sil needs a REX prefix
to address sil, so it encodes in three bytes). When rep stosb triggered
a demand page fault, the RIP was at usercopy3 + 3, the handler didn't
match it, and the kernel panicked.
The fix: move the label to the faulting instruction.
memset_user:
mov rcx, rdx
cld
mov al, sil
usercopy3: ; <-- label now at the faulting instruction
rep stosb
ret
This bug existed since the usercopy optimization pass (M6.6 Phase D)
but was invisible because /dev/zero reads only fault when the user
buffer straddles an unmapped page — which BusyBox dd does via
malloc (backed by mmap for large allocations) but the raw syscall
test doesn't (it uses stack buffers or pre-faulted heap).
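The label mismatch can be sketched in a few lines of Rust. This is a minimal model, not Kevlar's handler: `FaultKind`, `classify_fault`, the addresses, and `MOV_LEN` are all hypothetical. The only detail taken from the text is that the handler compares the faulting RIP against the label address for exact equality.

```rust
/// Outcome of a page fault taken while executing kernel code.
#[derive(Debug, PartialEq)]
enum FaultKind {
    /// Fault at a known usercopy instruction: demand-page the user buffer.
    UserCopy,
    /// Fault anywhere else in the kernel: treated as a real bug (panic).
    KernelBug,
}

/// Classify a kernel-mode fault by exact match against the usercopy labels.
fn classify_fault(rip: u64, usercopy_labels: &[u64]) -> FaultKind {
    if usercopy_labels.contains(&rip) {
        FaultKind::UserCopy
    } else {
        FaultKind::KernelBug
    }
}

// Hypothetical layout of memset_user, for illustration only.
const MOV_AL_SIL: u64 = 0xffff_8000_0010_0000; // register move, never faults
const MOV_LEN: u64 = 3; // assumed instruction length; any nonzero value shows the mismatch
const REP_STOSB: u64 = MOV_AL_SIL + MOV_LEN; // the instruction that writes user memory
```

With the label on the mov, `classify_fault(REP_STOSB, &[MOV_AL_SIL])` yields `KernelBug`, which corresponds to the panic. With the label on rep stosb, `classify_fault(REP_STOSB, &[REP_STOSB])` yields `UserCopy`. An exact-match check is cheap, but it makes the label position part of the handler's contract.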
Bug 2: kernel heap OOM on tmpfs writes
After fixing the usercopy crash, dd still panicked when writing 1MB
to tmpfs:
[PANIC] CPU=0 at platform/global_allocator.rs:24
tried to allocate too large object in the kernel heap (requested 2097152 bytes)
Tmpfs stores file data in a Vec<u8> on the kernel heap. Vec's growth
strategy doubles capacity: writing 4KB chunks to build a 1MB file
produces a Vec that goes 4K → 8K → 16K → ... → 512K → 1024K. At
1024K, Vec doubles to 2MB for the next resize — exceeding the 1MB heap
chunk limit.
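The growth sequence can be replayed with a small model. This sketch assumes an amortized-doubling policy (request the larger of twice the current capacity and the required size). Vec's exact growth factor is an implementation detail, so this models the behavior described here rather than calling the real allocator. `doubled_capacity` and `capacity_steps` are illustrative names.

```rust
const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

/// Capacity requested under an amortized-doubling policy:
/// at least double the current capacity, or `required` if that is larger.
fn doubled_capacity(cap: usize, required: usize) -> usize {
    std::cmp::max(cap * 2, required)
}

/// Replay appends of `chunk` bytes until the buffer holds `total` bytes,
/// returning every capacity the policy requests along the way.
fn capacity_steps(chunk: usize, total: usize) -> Vec<usize> {
    let mut cap = 0;
    let mut len = 0;
    let mut steps = Vec::new();
    while len < total {
        len += chunk;
        if len > cap {
            cap = doubled_capacity(cap, len);
            steps.push(cap);
        }
    }
    steps
}
```

`capacity_steps(4 * KIB, MIB)` walks from 4K up to 1024K. One more 4K append at that point gives `doubled_capacity(MIB, MIB + 4 * KIB)`, which is 2097152 bytes, the size reported in the panic message.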
Two fixes applied:
- Increased KERNEL_HEAP_CHUNK_SIZE from 1MB to 4MB
- Tmpfs write() now uses reserve_exact instead of letting Vec double:
let len = data.len();
if new_len > data.capacity() {
    // reserve_exact counts additional elements from len(), not from capacity()
    data.reserve_exact(new_len - len);
}
data.resize(new_len, 0);
This keeps tmpfs allocations tight to the actual file size. A 1MB file uses ~1MB of heap, not 2MB.
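As a self-contained illustration of the same idea, here is a hedged sketch of a write path over a Vec<u8>. `write_exact` is a hypothetical helper, not the tmpfs code. It computes the growth relative to `len()`, because `Vec::reserve_exact(additional)` guarantees capacity for `len() + additional` elements.

```rust
/// Write `buf` at `offset`, growing the backing Vec to exactly the new size.
/// Any gap between the old end of the data and `offset` is zero-filled.
fn write_exact(data: &mut Vec<u8>, offset: usize, buf: &[u8]) -> usize {
    let end = offset + buf.len();
    if end > data.len() {
        // Ask for exactly the missing bytes instead of letting resize()
        // fall back to the amortized growth policy.
        data.reserve_exact(end - data.len());
        data.resize(end, 0);
    }
    data[offset..end].copy_from_slice(buf);
    buf.len()
}
```

The trade-off is that every growing write may reallocate and copy the whole buffer, so this favors a tight memory footprint over append throughput.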
Bug 3: Docker caching failures
Docker's build context hashing invalidated the entire multi-stage build
whenever any file in testing/ changed. A one-line edit to
busybox_suite.c triggered a full rebuild of BusyBox, curl, dropbear,
bash, and systemd from source — minutes of wasted time.
Replaced the Docker pipeline with tools/build-initramfs.py, a native
Python builder that:
- Compiles test binaries directly with musl-gcc/gcc (parallel)
- Downloads and builds external packages once, cached in build/native-cache/ext-bin/
- Downloads Alpine packages directly from the CDN
- Assembles the rootfs and creates the CPIO archive
Incremental rebuild times: 1.5 seconds when a .c file changes,
65ms when nothing changed. Docker fallback preserved via
USE_DOCKER=1.
KVM by default
All test and benchmark targets now use --kvm unconditionally. Tests
that previously ran on TCG (software emulation, ~100x slower than KVM)
now run at hardware speed. No more KVM=1 flag needed.
Results
BusyBox test suite: 101/101 pass (unchanged)
BusyBox benchmarks (Kevlar KVM vs Linux KVM, lower = faster):
| Benchmark | Kevlar | Linux | Ratio |
|---|---|---|---|
| bb_exec_true | 340µs | 1.78ms | 0.19x |
| bb_shell_noop | 610µs | 3.66ms | 0.17x |
| bb_echo | 335µs | 1.88ms | 0.18x |
| bb_cp_small | 526µs | 2.97ms | 0.18x |
| bb_dd | 6.15ms | 4.89ms | 1.26x |
| bb_find_tree | 600µs | 3.14ms | 0.19x |
| bb_gzip | 1.27ms | 3.96ms | 0.32x |
| bb_tar_extract | 1.64ms | 6.44ms | 0.25x |
Kevlar is roughly 3-6x faster than Linux on seven of the eight BusyBox
workloads (ratios of 0.17x to 0.32x). The one exception is bb_dd
(1.26x slower), which is dominated by tmpfs Vec::resize allocations, a
known area for future optimization with page-backed storage.
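For reading the table: the Ratio column is Kevlar time divided by Linux time, so the speedup factor is its reciprocal. A small sketch of that arithmetic follows, with times in microseconds taken from the table. The function names are illustrative.

```rust
/// Ratio column: Kevlar time divided by Linux time (below 1.0 means Kevlar is faster).
fn ratio(kevlar_us: f64, linux_us: f64) -> f64 {
    kevlar_us / linux_us
}

/// Speedup factor: the reciprocal of the ratio.
fn speedup(kevlar_us: f64, linux_us: f64) -> f64 {
    linux_us / kevlar_us
}

/// Round to two decimal places, matching the precision used in the table.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, bb_shell_noop gives `round2(ratio(610.0, 3660.0))` = 0.17 and a speedup of 6x, while bb_dd gives `round2(ratio(6150.0, 4890.0))` = 1.26.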
Micro-benchmarks (42 syscalls, Kevlar KVM vs Linux KVM):
- 19 faster, 14 at parity, 5 marginally slower, 4 regressions
- Key wins: brk 450x, mmap_munmap 5x, signal_delivery 2x, mprotect 1.6x, stat 1.4x
- Regressions in workload benchmarks (exec_true 2.6x, shell_noop 5.4x, pipe_grep 15x, sed_pipeline 21x): these are fork+exec heavy and will be addressed in M9.6
Source fixes
Four test files had compilation errors masked by Docker's older musl:
- benchmarks/fork_micro.c: missing #include <sys/stat.h>
- testing/mini_storage.c: struct statx guarded with #ifndef STATX_BASIC_STATS for newer musl
- testing/busybox_suite.c: function name do_dd_diag used as lvalue, fixed to use dd_diag_mode variable
- testing/contracts/scheduling/futex_requeue.c: missing #include <time.h>
What's next
The micro-benchmark regressions in fork+exec workloads point to overhead in the process creation and pipe paths. M9.6 will be a focused optimization pass to bring these back to Linux parity. The Alpine integration test (layers 3-7) depends on chroot + dynamic linking from ext2, which is the next area of investigation.