# M9.5: 2MB Huge Pages, `mmap_fault` Parity, and a Benchmark Bug
## The Goal

The `mmap_fault` benchmark was the last syscall benchmark where Kevlar was significantly slower than Linux KVM. The plan: implement transparent 2MB huge pages to cut the page fault count for a 16MB mapping from 256 to 8, closing the gap.
## What We Built

### 2MB Huge Page Support (Phases 1-4)

A full transparent huge page implementation across six files:
- **Page table support** (`platform/x64/paging.rs`): `HUGE_PAGE` flag (PS bit 7), `traverse_to_pd()`, `map_huge_user_page()`, `unmap_huge_user_page()`, `split_huge_page()` (2MB PDE -> 512 x 4KB PTEs), and an `is_pde_empty()` guard. Updated `lookup_paddr()` and `traverse()` to handle the PS bit at level 2.
- **Demand paging** (`kernel/mm/page_fault.rs`): a huge page fast path before the 4KB fault-around. Checks 2MB alignment, VMA coverage, and PDE emptiness before mapping a 2MB page.
- **Fork CoW** (`platform/x64/paging.rs`): `duplicate_table` at level 2 detects the PS bit and shares the huge page read-only with a refcount. A write fault splits it into 512 x 4KB PTEs, then normal CoW handles the faulting page.
- **munmap/mprotect awareness**: detects huge pages at 2MB boundaries. Full huge pages are unmapped/updated directly; partial ranges are split first.
- **2MB-aligned mmap** (`kernel/mm/vm.rs`): `alloc_vaddr_range_aligned()` for large anonymous mappings, ensuring every 2MB region is fully within its VMA.
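As a sketch of the split path, assuming standard x86-64 PDE/PTE layouts; the function name and bit masks below are illustrative, not Kevlar's actual code:

```rust
const PAGE_SIZE: u64 = 4096;
const PTES_PER_PT: u64 = 512;
const PS_BIT: u64 = 1 << 7; // "huge page" bit in a PDE

// Given a 2MB PDE, compute the 512 4KB PTE values that cover the same
// physical range with the same low permission bits (PS cleared).
// Simplified: ignores PAT and the upper NX/reserved bits.
fn split_pde(pde: u64) -> Vec<u64> {
    let flags = pde & 0xFFF & !PS_BIT;          // low permission bits, PS dropped
    let base = pde & !0x1F_FFFF & !(1u64 << 63); // 2MB-aligned physical base
    (0..PTES_PER_PT)
        .map(|i| (base + i * PAGE_SIZE) | flags)
        .collect()
}

fn main() {
    // 2MB huge page at phys 0x4000_0000, present+writable+user+PS.
    let pde = 0x4000_0000u64 | 0b111 | PS_BIT;
    let ptes = split_pde(pde);
    assert_eq!(ptes.len(), 512);
    assert_eq!(ptes[0], 0x4000_0007);   // PS bit cleared, flags kept
    assert_eq!(ptes[511], 0x401F_F007); // last 4KB page of the 2MB range
    println!("first PTE = {:#x}, last PTE = {:#x}", ptes[0], ptes[511]);
}
```

After the split, only the one faulting PTE needs CoW treatment; the other 511 stay shared read-only.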
### Buddy Allocator Coalescing

The original buddy allocator had no coalescing on free -- freed pages went to order-0 lists, and higher-order blocks came only from untouched init-time regions. Under KVM, untouched pages have cold EPT entries (~13us per first access vs ~200ns for warm pages).

Added proper buddy coalescing: on free, check whether the buddy is also free via a free-list scan, merge into the higher order, and recurse up to MAX_ORDER. This ensures freed pages (with warm EPT from prior use) are consolidated into blocks that can be re-split for efficient allocation.
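A minimal sketch of the coalescing logic, using a free set per order instead of Kevlar's actual free lists; names and constants are illustrative:

```rust
use std::collections::HashSet;

const PAGE_SHIFT: u32 = 12; // 4KB base pages
const MAX_ORDER: usize = 10;

struct Buddy {
    free: Vec<HashSet<u64>>, // free[order] = set of free block base addresses
}

impl Buddy {
    fn new() -> Self {
        Buddy { free: vec![HashSet::new(); MAX_ORDER + 1] }
    }

    // Free a block of 2^order pages at `addr`, merging with its buddy
    // while the buddy is also free, up to MAX_ORDER.
    fn free_block(&mut self, mut addr: u64, mut order: usize) {
        while order < MAX_ORDER {
            // The buddy differs only in the bit that selects which half
            // of the next-order block this one is.
            let buddy = addr ^ (1u64 << (order as u32 + PAGE_SHIFT));
            if !self.free[order].remove(&buddy) {
                break; // buddy busy: stop merging
            }
            addr = addr.min(buddy); // merged block starts at the lower half
            order += 1;
        }
        self.free[order].insert(addr);
    }
}

fn main() {
    let mut b = Buddy::new();
    b.free_block(0x1000, 0); // free the page at 4KB
    b.free_block(0x0000, 0); // free its buddy: coalesces to one order-1 block
    assert!(b.free[1].contains(&0x0000));
    assert!(b.free[0].is_empty());
}
```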
### Fault-Around Improvements

- Capped fault-around at 2MB boundaries to avoid pre-populating PTEs in adjacent PDE regions (which would block future huge page mappings).
- Switched from per-page `try_map_user_page_with_prot` to `batch_try_map_user_pages_with_prot` (one page table traversal per 512-entry PT instead of one per page).
- Fixed a latent bug: fault-around pages were missing `page_ref_init()` calls, leaving refcounts uninitialized for CoW.
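The 2MB cap can be sketched as a window clamp; the 16-page window size and all names below are assumptions, not Kevlar's actual values:

```rust
const PAGE_SIZE: u64 = 4096;
const HUGE_SIZE: u64 = 2 * 1024 * 1024;
const FAULT_AROUND_PAGES: u64 = 16;

// Returns the [start, end) window to pre-populate around `fault_addr`,
// clamped to both the enclosing 2MB (PDE) region and the VMA bounds.
fn fault_around_window(fault_addr: u64, vma_start: u64, vma_end: u64) -> (u64, u64) {
    let page = fault_addr & !(PAGE_SIZE - 1);
    let pde_start = fault_addr & !(HUGE_SIZE - 1); // enclosing 2MB region
    let pde_end = pde_start + HUGE_SIZE;
    let half = FAULT_AROUND_PAGES / 2 * PAGE_SIZE;
    let start = page.saturating_sub(half).max(pde_start).max(vma_start);
    let end = (page + half).min(pde_end).min(vma_end);
    (start, end)
}

fn main() {
    // Fault just past a 2MB boundary: the window must not reach back
    // into the previous PDE region.
    let (s, e) = fault_around_window(0x20_1000, 0x0, 0x40_0000);
    assert_eq!(s, 0x20_0000); // clamped at the 2MB boundary
    assert_eq!(e, 0x20_9000); // fault page + 8 pages forward
    println!("window = {:#x}..{:#x}", s, e);
}
```

Keeping the window inside one PDE region means an untouched neighboring PDE stays empty and remains eligible for a future 2MB mapping.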
## The Deep Dive: Why Huge Pages Didn't Close the Gap

Initial benchmarks showed only a ~4% improvement from huge pages. Deeper investigation revealed three causes:
- **QEMU forces 4KB host pages**: QEMU calls `madvise(MADV_NOHUGEPAGE)` on guest memory during `-mem-prealloc`. This prevents KVM from creating 2MB EPT entries regardless of guest page table structure. Both Linux and Kevlar guests are equally affected.
- **Cold EPT for order-9 blocks**: the buddy allocator's `alloc_huge_page` returns contiguous 2MB blocks from init-time regions where only page 0 was ever accessed. Zeroing 511 cold-EPT pages costs ~6.8ms (vs ~0.8ms for warm pages). Chunked zeroing, user-mapping zeroing, and EPT pre-warming were all tried -- none helped, because the root issue is the per-page EPT violation cost under KVM.
- **The real bottleneck**: with 4KB EPT entries forced by QEMU, the cost of first-accessing each physical page (~1.5us per EPT violation) dominates regardless of guest page table granularity.
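As a back-of-envelope check of that claim, the cost model below uses the per-violation figure quoted above; the per-guest-fault overhead is an invented illustrative constant, not a measurement:

```rust
// Why guest page size doesn't matter when the host forces 4KB EPT
// entries: first-touch cost scales with pages touched, not guest faults.
fn first_touch_cost_us(pages_touched: u64, guest_faults: u64) -> f64 {
    const EPT_VIOLATION_US: f64 = 1.5; // per cold 4KB host page (measured above)
    const GUEST_FAULT_US: f64 = 0.5;   // illustrative per-guest-fault overhead
    pages_touched as f64 * EPT_VIOLATION_US + guest_faults as f64 * GUEST_FAULT_US
}

fn main() {
    let pages = 4096; // 16MB / 4KB
    let with_4k = first_touch_cost_us(pages, 256); // 4KB guest pages + fault-around
    let with_2m = first_touch_cost_us(pages, 8);   // 2MB guest pages
    // The 4096 * 1.5us = 6144us EPT term dominates both cases, so cutting
    // guest faults from 256 to 8 barely moves the total.
    println!("4KB guest pages: {:.0}us, 2MB guest pages: {:.0}us", with_4k, with_2m);
}
```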
## The Actual Bug: An Unfair Benchmark Comparison

After exhaustive optimization, we discovered that the Linux KVM baseline itself was wrong.
`run-all-benchmarks.py` Linux invocation:

```
-append "console=ttyS0 quiet panic=-1 rdinit=/init"
# /init is the bench binary; PID 1 defaults to QUICK mode (256 pages)
```

Kevlar invocation:

```
INIT_SCRIPT="/bin/bench --full"
# Always uses FULL mode (4096 pages)
```
Linux was benchmarking with 256 pages while Kevlar used 4096 pages -- a 16x iteration-count mismatch. The `ITERS(full, quick)` macro in `bench.c` uses quick mode when PID == 1 unless `--full` is explicitly passed.

Fix: added `-- --full` to the Linux guest's `rdinit=` kernel cmdline.
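The selection logic can be sketched as follows, assuming `ITERS(full, quick)` behaves as described; this Rust mirror is illustrative, not the actual `bench.c` macro:

```rust
// Quick mode when running as PID 1 (i.e. as the guest's init process)
// unless --full was explicitly passed on the command line.
fn iters(full: u64, quick: u64, pid: u32, full_flag: bool) -> u64 {
    if full_flag || pid != 1 { full } else { quick }
}

fn main() {
    // Linux guest: bench ran as rdinit (PID 1) with no --full -> quick mode.
    assert_eq!(iters(4096, 256, 1, false), 256);
    // Kevlar guest: launched with --full -> full mode.
    assert_eq!(iters(4096, 256, 1, true), 4096);
    println!("baselines differed by {}x", 4096 / 256);
}
```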
## Results
With the fair comparison (both using 4096 pages):
| Profile | Kevlar | Linux KVM | Ratio |
|---|---|---|---|
| Fortress | 1623ns | 1712ns | 0.95x |
| Balanced | 1581ns | 1712ns | 0.92x |
| Performance | 1699ns | 1712ns | 0.99x |
| Ludicrous | 1665ns | 1712ns | 0.97x |
Kevlar is 1-8% **faster** than Linux KVM on `mmap_fault`. 30/31 contract tests pass, and all 38 benchmarks pass.
## M10 Phase 1: Alpine rootfs

With `mmap_fault` at parity, we began M10 (Alpine Linux support):
- Added a `/dev/ttyS0` device node (serial console alias)
- Implemented `TIOCSCTTY` and `TIOCNOTTY` ioctl stubs
- Added an `rt_sigtimedwait` (syscall 128) stub
- Created `/etc/inittab` for BusyBox init with sysinit mounts
- Added `/etc/shadow`, `/etc/hostname`, `/etc/issue`
- BusyBox init successfully reads inittab, mounts proc/sys/tmpfs, and spawns a shell
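The actual inittab isn't shown in this log; a minimal BusyBox `/etc/inittab` matching the described behavior (sysinit mounts plus a shell, in BusyBox's `id::action:process` format) might look like:

```
::sysinit:/bin/mount -t proc proc /proc
::sysinit:/bin/mount -t sysfs sysfs /sys
::sysinit:/bin/mount -t tmpfs tmpfs /tmp
::askfirst:/bin/sh
```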
## Files Changed
| Area | Files |
|---|---|
| Huge pages | platform/x64/paging.rs, kernel/mm/page_fault.rs, kernel/mm/vm.rs, kernel/syscalls/{mmap,munmap,mprotect}.rs |
| Allocator | libs/kevlar_utils/buddy_alloc.rs, platform/page_allocator.rs, platform/page_ops.rs |
| Exports | platform/lib.rs, platform/x64/mod.rs |
| M10 Phase 1 | kernel/fs/devfs/{mod,tty}.rs, kernel/syscalls/mod.rs, testing/Dockerfile, testing/etc/* |
| Benchmark fix | tools/run-all-benchmarks.py |