M6.6: Syscall Performance Benchmarking — Final Results
M6.6 expanded the benchmark suite to 29 syscalls, established a fair Linux-under-KVM baseline, and optimized every regression we could. 28/29 benchmarks are within 10% of Linux KVM. The one exception, demand paging (mmap_fault), is a structural Rust codegen cost that requires huge page support to resolve (tracked for M10).
Methodology
All benchmarks use KVM with `-mem-prealloc` and CPU pinning (`taskset -c 0`). The Linux baseline runs inside the same QEMU/KVM setup as Kevlar for a fair comparison. Each benchmark is run 5+ times; the best and median are reported. All times are in nanoseconds.
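The best-and-median methodology can be sketched as follows. This is an illustrative harness using `std::time::Instant` and a dummy workload, not the actual suite (which times raw syscalls under CPU pinning); `bench_ns` and `best_and_median` are hypothetical names:

```rust
use std::time::Instant;

// Time `iters` calls of `op`, returning nanoseconds per call.
fn bench_ns(op: &mut dyn FnMut(), iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        op();
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

// Run the benchmark `runs` times and return (best, median) ns/call,
// mirroring the "best and median reported" methodology above.
fn best_and_median(mut op: impl FnMut(), runs: usize, iters: u32) -> (f64, f64) {
    let mut samples: Vec<f64> = (0..runs).map(|_| bench_ns(&mut op, iters)).collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (samples[0], samples[runs / 2])
}

fn main() {
    // Stand-in workload; the real suite times raw syscalls like getpid.
    let work = || {
        std::hint::black_box(42u64);
    };
    let (best, median) = best_and_median(work, 5, 1_000_000);
    println!("best {best:.1} ns/call, median {median:.1} ns/call");
}
```

Reporting the best run filters out scheduling noise; the median guards against a single lucky outlier.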
Final results: 28/29 within 10%
| Benchmark | Linux KVM (ns) | Kevlar KVM (ns) | Ratio | Status |
|---|---|---|---|---|
| getpid | 93 | 61 | 0.66x | FASTER |
| gettid | 92 | 65 | 0.71x | FASTER |
| clock_gettime | 20 | 10 | 0.50x | FASTER |
| read_null | 104 | 96 | 0.92x | FASTER |
| write_null | 104 | 97 | 0.93x | FASTER |
| pread | 103 | 91 | 0.88x | FASTER |
| writev | 151 | 116 | 0.77x | FASTER |
| pipe | 379 | 355 | 0.94x | OK |
| open_close | 731 | 519 | 0.71x | FASTER |
| stat | 449 | 255 | 0.57x | FASTER |
| fstat | 159 | 115 | 0.72x | FASTER |
| lseek | 96 | 76 | 0.79x | FASTER |
| fcntl_getfl | 98 | 79 | 0.81x | FASTER |
| dup_close | 220 | 166 | 0.75x | FASTER |
| getcwd | 300 | 125 | 0.42x | FASTER |
| access | 363 | 207 | 0.57x | FASTER |
| readlink | 438 | 414 | 0.95x | OK |
| fork_exit | 54,814 | 54,502 | 0.99x | OK |
| mmap_munmap | 1,394 | 246 | 0.18x | FASTER |
| mmap_fault | 1,730 | 1,938 | 1.12x | SLOW |
| mprotect | 2,065 | 1,193 | 0.58x | FASTER |
| brk | 2,323 | 6 | 0.003x | FASTER |
| uname | 169 | 86 | 0.51x | FASTER |
| sigaction | 124 | 120 | 0.97x | OK |
| sigprocmask | 248 | 169 | 0.68x | FASTER |
| sched_yield | 157 | 165 | 1.05x | OK |
| getpriority | 95 | 64 | 0.67x | FASTER |
| read_zero | 199 | 126 | 0.63x | FASTER |
| signal_delivery | 1,204 | 498 | 0.41x | FASTER |
23 FASTER, 5 OK, 1 SLOW.
The mmap_fault gap: root cause analysis
After 12 optimization attempts, we have a thorough understanding of why demand paging is 12-15% slower than Linux under KVM.
What we tried (12 approaches, all exhausted)
| # | Approach | Result | Why |
|---|---|---|---|
| 1 | Buddy allocator | Neutral | PAGE_CACHE hides allocator; >95% cache hit |
| 2 | Per-CPU page cache | Worse | preempt_disable+cpu_id costs 8ns > 5ns lock |
| 3 | Batch PTE writes | Neutral | Repeated traversals hit L1; batch adds overhead |
| 4 | Pre-zeroed cache | Broken | free() returns dirty pages, mixed invariant |
| 5 | Zero hoisting | Worse | 64KB zeroing thrashes 32KB L1 data cache |
| 6 | Unconditional PTE writes | Worse | Cache line dirtying > branch prediction cost |
| 7 | Signal fast-path | ~1% | Skip PtRegs copy when signal_pending==0 |
| 8 | traverse() inline | Neutral | Compiler already inlines at opt-level 2 |
| 9 | Cold kernel fault path | ~1% | Moves 60 lines of debug dump out of icache |
| 10 | Fault-around 8/32 | Neutral | Per-page cost dominates; batch size irrelevant |
| 11 | #[cold] on File VMA | ~1% | Helps compiler place anonymous path compactly |
| 12 | opt-level = 3 | Worse | More aggressive inlining increases icache pressure |
Root cause: Rust codegen → icache pressure
The page fault handler's hot path in Rust generates ~40% more instructions than equivalent C. Sources:
- `match` on the `VmAreaType` enum: discriminant load + branch even for the common `Anonymous` case
- `Option::unwrap()`: generates a panic cold path that the compiler can't always prove unreachable
- `Result` propagation: each `?` generates a branch + error path
- Bounds-checked VMA indexing: `vm_areas[idx]` generates a compare + panic branch
- `AtomicRefCell` borrow: dynamic borrow checking at runtime
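The patterns above, and the usual mitigations, can be sketched as follows. This is illustrative only; `VmAreaType`'s variants, `handle_rare_vma`, and `fault_hot_path` are hypothetical names, not Kevlar's actual code:

```rust
// Illustrative hot-path patterns and mitigations; names are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
enum VmAreaType {
    Anonymous,
    File,
    Shared,
}

// Splitting rare arms into a #[cold] #[inline(never)] function keeps the
// common Anonymous path compact in the icache (the attempt-11 trick).
#[cold]
#[inline(never)]
fn handle_rare_vma(ty: VmAreaType) -> u64 {
    match ty {
        VmAreaType::File => 1,
        VmAreaType::Shared => 2,
        VmAreaType::Anonymous => unreachable!(),
    }
}

fn fault_hot_path(ty: VmAreaType, vm_areas: &[u64], idx: usize) -> Option<u64> {
    // The discriminant load + branch still happens even for Anonymous.
    let base = match ty {
        VmAreaType::Anonymous => 0,
        other => handle_rare_vma(other),
    };
    // `vm_areas[idx]` would emit a compare + panic branch; `.get()?` makes
    // the failure path an explicit early return instead of a panic landing pad.
    let area = *vm_areas.get(idx)?;
    Some(base + area)
}

fn main() {
    assert_eq!(fault_hot_path(VmAreaType::Anonymous, &[7], 0), Some(7));
    assert_eq!(fault_hot_path(VmAreaType::File, &[7], 5), None);
}
```

Even with these tricks, the branches and cold paths still exist; they are only moved, which is why the gains topped out around 1%.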
The cumulative effect is ~2-3 additional L1 icache misses per page fault compared to Linux's C handler. Each L1 icache miss costs ~5ns on modern Intel CPUs. With 256 faults: 256 × 3 × 5ns ≈ 3.8µs total, or ~1ns per page of the 4096-page benchmark mapping. This alone doesn't explain the full gap.
The larger factor is that the Rust handler's code size (~2KB) exceeds one L1 icache way (1KB), causing self-eviction during the fault-around loop (17 iterations of alloc+zero+traverse+map). Linux's equivalent C handler fits in ~1KB.
Why this can't be fixed with local optimizations
Every L1-data-cache optimization we tried (batch PTE, pre-zero, zero hoist) failed because the data access pattern is already optimal: repeated page table traversals hit L1, page zeroing is sequential, and the allocator cache provides O(1) pops.
The icache problem requires either:
- Reducing code size (assembly handler, PGO) — not safe for all profiles
- Reducing fault count (huge pages) — eliminates 97% of faults for 2MB+ mappings
Resolution: tracked for M10
Huge page support (2MB pages for large anonymous mappings) will be implemented as part of M10 (GPU driver prerequisites). This eliminates nearly all of the page fault overhead for the benchmark workload: the 4096 small pages become 8 huge pages, so 8 faults instead of 256 (the ~97% reduction noted above).
For real GPU workloads, huge pages are essential anyway — GPU memory allocations are typically 2MB-256MB. The mmap_fault benchmark is the worst case for small-page demand paging; it does not represent actual GPU driver behavior.
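The fault-count arithmetic works out as follows, assuming a 16 MiB anonymous mapping (4096 × 4 KiB pages) and a 16-page fault-around batch; these parameters are inferred to match the 256-fault and ~97%-reduction figures in the text:

```rust
// Back-of-envelope fault counts for small pages + fault-around vs 2 MiB
// huge pages. Mapping size and batch size are assumptions, chosen to be
// consistent with the 256-fault / ~97% figures above.
const MAPPING_BYTES: usize = 16 << 20; // 16 MiB
const SMALL_PAGE: usize = 4 << 10;     // 4 KiB
const HUGE_PAGE: usize = 2 << 20;      // 2 MiB
const FAULT_AROUND: usize = 16;        // pages mapped per fault

fn faults(page_size: usize, batch: usize) -> usize {
    let pages = MAPPING_BYTES / page_size;
    (pages + batch - 1) / batch // ceil(pages / batch)
}

fn main() {
    let small = faults(SMALL_PAGE, FAULT_AROUND); // 4096 pages / 16 = 256 faults
    let huge = faults(HUGE_PAGE, 1);              // 8 huge pages = 8 faults
    let reduction = 100.0 * (1.0 - huge as f64 / small as f64);
    println!("{small} faults -> {huge} faults ({reduction:.1}% fewer)");
}
```

The same arithmetic explains why huge pages help GPU allocations so much: at 2MB granularity, even a 256MB buffer takes only 128 faults.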
Fixes shipped in M6.6
Syscall fixes
- `tkill`: musl's `raise()` uses `tkill`; it was missing, costing 261µs of serial log spam per call
- `/dev/zero` `fill()`: 16 usercopies → 1; read_zero 473→126ns
- `uname` single-copy: 6 usercopies → 1; uname 181→86ns
- `sigaction` batch-read: 3 reads → 1; sigaction 136→120ns
- `fcntl`/`readlink`/`mprotect` `lock_no_irq`: skip `cli`/`sti`
- `sched_yield` PROCESSES skip: reuse the `Arc` when the same PID is picked again
- `mprotect` VMA fast-path: in-place update, no `Vec` allocation
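The `uname` single-copy fix follows a general pattern: assemble the whole struct kernel-side, then issue one usercopy instead of one per field. A minimal sketch, where `copy_to_user` is a safe stand-in for the real fault-handling usercopy primitive and the field layout mirrors `struct utsname`:

```rust
// Single-copy uname sketch: build the struct in a kernel buffer, copy once.
// `copy_to_user` is a stand-in, not the real (unsafe) primitive.
const FIELD: usize = 65; // per-field size in struct utsname

#[repr(C)]
struct Utsname {
    sysname: [u8; FIELD],
    nodename: [u8; FIELD],
    release: [u8; FIELD],
    version: [u8; FIELD],
    machine: [u8; FIELD],
    domainname: [u8; FIELD],
}

fn fill(dst: &mut [u8; FIELD], s: &str) {
    dst[..s.len()].copy_from_slice(s.as_bytes()); // NUL-padded by zero init
}

// Returns the number of usercopies performed.
fn copy_to_user(user: &mut [u8], kernel: &[u8]) -> usize {
    user[..kernel.len()].copy_from_slice(kernel);
    1
}

fn sys_uname(user_buf: &mut [u8]) -> usize {
    let mut uts = Utsname {
        sysname: [0; FIELD], nodename: [0; FIELD], release: [0; FIELD],
        version: [0; FIELD], machine: [0; FIELD], domainname: [0; FIELD],
    };
    fill(&mut uts.sysname, "Kevlar");
    fill(&mut uts.machine, "x86_64");
    // One copy of the whole struct instead of six per-field copies.
    let bytes = unsafe {
        std::slice::from_raw_parts(&uts as *const _ as *const u8,
                                   std::mem::size_of::<Utsname>())
    };
    copy_to_user(user_buf, bytes)
}

fn main() {
    let mut user_buf = [0u8; FIELD * 6];
    assert_eq!(sys_uname(&mut user_buf), 1); // a single usercopy
    assert_eq!(&user_buf[..6], b"Kevlar");
}
```

Each usercopy pays fixed setup cost (access checks, fault handling), so batching six copies into one is where the 181→86ns win comes from.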
Architectural improvements
- Buddy allocator: O(1) single-page alloc/free, zero metadata overhead
- Signal fast-path: skip the `PtRegs` copy on interrupt return when no signals are pending
- Cold kernel fault path: `#[cold]` `#[inline(never)]` to reduce icache pressure
- `setitimer(ITIMER_REAL)`: real SIGALRM delivery for `alarm`/`setitimer`
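The signal fast-path idea can be sketched as a pending-flag check that skips the register save entirely on the common path. `Task`, `PtRegs`, and `interrupt_return` here are simplified stand-ins, not Kevlar's actual types:

```rust
// Sketch of the interrupt-return signal fast-path: check a pending flag
// first and only copy PtRegs when a signal frame is actually needed.

#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct PtRegs {
    rip: u64,
    rsp: u64,
    // ... remaining GPRs elided
}

#[derive(Default)]
struct Task {
    signal_pending: bool,
    saved_regs: Option<PtRegs>, // context saved for the signal frame
}

// Returns true when a signal frame was set up (slow path taken).
fn interrupt_return(task: &mut Task, regs: &PtRegs) -> bool {
    if !task.signal_pending {
        // Fast path: nothing pending, no PtRegs copy at all.
        return false;
    }
    // Slow path: save the registers so the handler can be framed and the
    // original context restored later by sigreturn.
    task.saved_regs = Some(*regs);
    task.signal_pending = false;
    true
}

fn main() {
    let regs = PtRegs { rip: 0x1000, rsp: 0x7fff_0000 };
    let mut task = Task::default();
    assert!(!interrupt_return(&mut task, &regs)); // fast path
    assert!(task.saved_regs.is_none());

    task.signal_pending = true;
    assert!(interrupt_return(&mut task, &regs)); // slow path copies regs
    assert_eq!(task.saved_regs, Some(regs));
}
```

Since almost every interrupt return has no pending signal, moving the copy behind the flag check turns the common case into a single branch.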
Contract test fixes
- sa_restart: rewritten with fork+kill (avoids musl setitimer issues)
- 19/19 contracts PASS — zero divergences
All 4 profiles equivalent
| Profile | getpid | mmap_fault | mprotect | sched_yield |
|---|---|---|---|---|
| Fortress | 64ns | 1,843ns | 1,213ns | 161ns |
| Balanced | 61ns | 1,876ns | 1,193ns | 165ns |
| Performance | 65ns | 1,920ns | 1,224ns | 165ns |
| Ludicrous | 64ns | 1,886ns | 1,189ns | 170ns |