M9.5: 2MB Huge Pages, mmap_fault Parity, and a Benchmark Bug

The Goal

The mmap_fault benchmark was the last syscall where Kevlar was significantly slower than Linux KVM. The plan: implement transparent 2MB huge pages to reduce page faults from 256 to 8 for a 16MB mapping, closing the gap.

What We Built

2MB Huge Page Support (Phases 1-4)

Full transparent huge page implementation across 6 files:

  • Page table support (platform/x64/paging.rs): HUGE_PAGE flag (PS bit 7), traverse_to_pd(), map_huge_user_page(), unmap_huge_user_page(), split_huge_page() (2MB PDE -> 512 x 4KB PTEs), is_pde_empty() guard. Updated lookup_paddr() and traverse() to handle PS bit at level 2.

  • Demand paging (kernel/mm/page_fault.rs): Huge page fast path before 4KB fault-around. Checks 2MB alignment, VMA coverage, and PDE emptiness before mapping a 2MB page.

  • Fork CoW (platform/x64/paging.rs): duplicate_table at level 2 detects PS bit, shares huge page read-only with refcount. Write fault splits into 512 x 4KB PTEs, then normal CoW handles the faulting page.

  • munmap/mprotect awareness: Detects huge pages at 2MB boundaries. Full huge pages are unmapped/updated directly; partial ranges split first.

  • 2MB-aligned mmap (kernel/mm/vm.rs): alloc_vaddr_range_aligned() for large anonymous mappings, ensuring every 2MB region is fully within the VMA.
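The fast-path checks described above (2MB alignment, VMA coverage, PDE emptiness) can be sketched as a single eligibility test. This is an illustrative rendering, not the actual kernel/mm/page_fault.rs API; the function and parameter names are hypothetical.

```rust
const HUGE_PAGE_SIZE: u64 = 2 * 1024 * 1024; // 2MB

/// Hypothetical sketch of the huge-page fast-path eligibility test:
/// the containing 2MB region must lie fully inside the VMA, and the
/// PDE must not already hold 4KB PTEs (e.g. from earlier fault-around).
fn huge_page_eligible(fault_addr: u64, vma_start: u64, vma_end: u64, pde_is_empty: bool) -> bool {
    // Round the faulting address down to its containing 2MB region.
    let region_start = fault_addr & !(HUGE_PAGE_SIZE - 1);
    let region_end = region_start + HUGE_PAGE_SIZE;
    // 1. The whole 2MB region must be covered by the VMA.
    let covered = region_start >= vma_start && region_end <= vma_end;
    // 2. The PDE must be empty so a 2MB PDE can be installed.
    covered && pde_is_empty
}

fn main() {
    // Region fully inside a 16MB VMA, empty PDE -> take the huge page path.
    assert!(huge_page_eligible(0x4020_0123, 0x4000_0000, 0x4100_0000, true));
    // PDE already populated -> fall back to the 4KB path.
    assert!(!huge_page_eligible(0x4020_0123, 0x4000_0000, 0x4100_0000, false));
    // 2MB region straddles the end of the VMA -> not eligible.
    assert!(!huge_page_eligible(0x40E0_0123, 0x4000_0000, 0x40F0_0000, true));
}
```

The VMA-coverage check is why alloc_vaddr_range_aligned() matters: without 2MB-aligned mappings, the first and last 2MB regions of a large mapping would fail this test and fall back to 4KB pages.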

Buddy Allocator Coalescing

The original buddy allocator had no coalescing on free -- freed pages went to order-0 lists and higher-order blocks came from untouched init-time regions. Under KVM, untouched pages have cold EPT entries (~13us per first access vs ~200ns for warm pages).

Added proper buddy coalescing: on free, check if the buddy is also free via free-list scan, merge into higher order, recurse up to MAX_ORDER. This ensures freed pages (with warm EPT from prior use) are consolidated into blocks that can be re-split for efficient allocation.
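The merge loop can be sketched as follows. This is a minimal illustration of the technique, assuming Vec-backed free lists; the real allocator in libs/kevlar_utils/buddy_alloc.rs differs in its data structures.

```rust
const PAGE_SIZE: u64 = 4096;
const MAX_ORDER: usize = 9; // order 9 = 2MB blocks

/// Minimal buddy-coalescing sketch (illustrative, not the real allocator).
struct Buddy {
    free_lists: Vec<Vec<u64>>, // free_lists[order] holds free block base addresses
}

impl Buddy {
    fn new() -> Self {
        Buddy { free_lists: vec![Vec::new(); MAX_ORDER + 1] }
    }

    /// Free a block of `order`, merging with its buddy while possible.
    fn free(&mut self, mut addr: u64, mut order: usize) {
        while order < MAX_ORDER {
            let block_size = PAGE_SIZE << order;
            // The buddy's address differs in exactly the block-size bit.
            let buddy = addr ^ block_size;
            // Scan the free list for the buddy; if absent, stop merging.
            if let Some(pos) = self.free_lists[order].iter().position(|&a| a == buddy) {
                self.free_lists[order].swap_remove(pos);
                addr = addr.min(buddy); // merged block starts at the lower half
                order += 1;
            } else {
                break;
            }
        }
        self.free_lists[order].push(addr);
    }
}

fn main() {
    let mut b = Buddy::new();
    // Freeing two adjacent order-0 pages coalesces them into one order-1 block.
    b.free(0x1000, 0);
    b.free(0x0000, 0);
    assert!(b.free_lists[0].is_empty());
    assert_eq!(b.free_lists[1], vec![0x0000]);
}
```

The XOR trick is the standard buddy-address computation: for a block of size S at an S-aligned address, flipping the S bit yields the only candidate it can merge with.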

Fault-Around Improvements

  • Capped fault-around at 2MB boundaries to prevent pre-populating PTEs in adjacent PDE regions (which would block future huge page mappings).
  • Switched from per-page try_map_user_page_with_prot to batch_try_map_user_pages_with_prot (one page table traversal per 512-entry PT instead of per page).
  • Fixed latent bug: fault-around pages were missing page_ref_init() calls, leaving refcounts uninitialized for CoW.
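The 2MB cap on fault-around amounts to clamping the window end at the next PDE boundary. A minimal sketch, assuming a page-count window parameter (the function name and the 16-page window are illustrative, not the actual fault-around API):

```rust
const PAGE_SIZE: u64 = 4096;
const PD_REGION: u64 = 2 * 1024 * 1024; // one PDE covers 2MB of VA space

/// Hypothetical sketch: clamp the fault-around window so it never
/// pre-populates PTEs in the next 2MB PDE region (which would make that
/// region ineligible for a future huge page mapping).
fn fault_around_end(fault_addr: u64, window_pages: u64, vma_end: u64) -> u64 {
    let naive_end = fault_addr + window_pages * PAGE_SIZE;
    let pde_end = (fault_addr & !(PD_REGION - 1)) + PD_REGION;
    naive_end.min(pde_end).min(vma_end)
}

fn main() {
    // A window starting just below a 2MB boundary is clamped to the boundary.
    assert_eq!(fault_around_end(0x1FF000, 16, 0x1000_0000), 0x200000);
    // Far from a boundary, the full window is used.
    assert_eq!(fault_around_end(0x100000, 16, 0x1000_0000), 0x110000);
}
```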

The Deep Dive: Why Huge Pages Didn't Close the Gap

Initial benchmarks showed only ~4% improvement from huge pages. Deep investigation revealed:

  1. QEMU calls madvise(MADV_NOHUGEPAGE) on guest memory during -mem-prealloc. This forces 4KB host pages, preventing KVM from creating 2MB EPT entries regardless of guest page table structure. Both Linux and Kevlar guests are equally affected.

  2. Cold EPT for order-9 blocks: The buddy allocator's alloc_huge_page returns contiguous 2MB blocks from init-time regions where only page 0 was ever accessed. Zeroing 511 cold-EPT pages costs ~6.8ms (vs ~0.8ms for warm pages). Chunked zeroing, user-mapping zeroing, and EPT pre-warming were all tried -- none helped because the root issue is per-page EPT violation cost under KVM.

  3. The real bottleneck: With 4KB EPT entries forced by QEMU, the cost of first-accessing each physical page (~1.5us per EPT violation) dominates regardless of guest page table granularity.
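A back-of-envelope check ties points 1-3 together: the cold-block zeroing cost in point 2 is essentially the per-page cold first-access cost (~13us, from the allocator section) times the 511 never-touched pages of an order-9 block.

```rust
fn main() {
    // Pages 1..=511 of an init-time order-9 block were never accessed.
    let cold_pages = 511.0_f64;
    let cold_first_access_us = 13.0; // ~13us per cold EPT first access under KVM
    let cold_cost_ms = cold_pages * cold_first_access_us / 1000.0;
    // ~6.6ms, in line with the measured ~6.8ms for zeroing a cold 2MB block.
    assert!((cold_cost_ms - 6.643).abs() < 0.01);
}
```

This is why chunked zeroing and EPT pre-warming could not help: they rearrange when the ~13us per-page cost is paid, not whether it is paid.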

The Actual Bug: Unfair Benchmark Comparison

After exhaustive optimization, we discovered the Linux KVM baseline was wrong:

run-all-benchmarks.py Linux invocation:
  -append "console=ttyS0 quiet panic=-1 rdinit=/init"
  # /init is the bench binary, PID 1 defaults to QUICK mode (256 pages)

Kevlar invocation:
  INIT_SCRIPT="/bin/bench --full"
  # Always uses FULL mode (4096 pages)

Linux was benchmarking with 256 pages while Kevlar used 4096 pages -- a 16x iteration count mismatch. The ITERS(full, quick) macro in bench.c uses quick mode when PID==1 unless --full is explicitly passed.

Fix: Added -- --full to the Linux guest's rdinit= kernel cmdline.
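The selection logic behind the mismatch is simple to state. The real logic is the C macro ITERS(full, quick) in bench.c; the following is an illustrative Rust rendering of the same decision:

```rust
/// Sketch of the iteration-count selection: quick mode when running as
/// PID 1 (rdinit=/init) unless --full is explicitly passed.
fn iters(full: u32, quick: u32, pid: u32, full_flag: bool) -> u32 {
    if pid == 1 && !full_flag { quick } else { full }
}

fn main() {
    // Linux guest before the fix: bench ran as PID 1 with no --full flag.
    assert_eq!(iters(4096, 256, 1, false), 256);
    // After appending `-- --full` to the kernel cmdline, both guests use 4096.
    assert_eq!(iters(4096, 256, 1, true), 4096);
}
```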

Results

With the fair comparison (both using 4096 pages):

Profile       Kevlar    Linux KVM   Ratio
Fortress      1623ns    1712ns      0.95x
Balanced      1581ns    1712ns      0.92x
Performance   1699ns    1712ns      0.99x
Ludicrous     1665ns    1712ns      0.97x

Kevlar is 3-8% faster than Linux KVM on mmap_fault. 30 of 31 contract tests pass, and all 38 benchmarks pass.

M10 Phase 1: Alpine rootfs

With mmap_fault at parity, began M10 (Alpine Linux support):

  • Added /dev/ttyS0 device node (serial console alias)
  • Implemented TIOCSCTTY and TIOCNOTTY ioctl stubs
  • Added rt_sigtimedwait (syscall 128) stub
  • Created /etc/inittab for BusyBox init with sysinit mounts
  • Added /etc/shadow, /etc/hostname, /etc/issue
  • BusyBox init successfully reads inittab, mounts proc/sys/tmpfs, spawns shell
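For reference, a BusyBox inittab along the lines described above looks like the following (illustrative; the actual testing/etc/inittab may differ in its mount points and respawn entries):

```
# BusyBox init format: <id>::<action>:<process>
::sysinit:/bin/mount -t proc proc /proc
::sysinit:/bin/mount -t sysfs sysfs /sys
::sysinit:/bin/mount -t tmpfs tmpfs /tmp
::respawn:/bin/sh
```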

Files Changed

Area            Files
Huge pages      platform/x64/paging.rs, kernel/mm/page_fault.rs, kernel/mm/vm.rs, kernel/syscalls/{mmap,munmap,mprotect}.rs
Allocator       libs/kevlar_utils/buddy_alloc.rs, platform/page_allocator.rs, platform/page_ops.rs
Exports         platform/lib.rs, platform/x64/mod.rs
M10 Phase 1     kernel/fs/devfs/{mod,tty}.rs, kernel/syscalls/mod.rs, testing/Dockerfile, testing/etc/*
Benchmark fix   tools/run-all-benchmarks.py