# M9.5: 2MB Huge Pages, `mmap_fault` Parity, and a Benchmark Bug
## The Goal

The `mmap_fault` benchmark was the last syscall benchmark where Kevlar was significantly slower than Linux KVM. The plan: implement transparent 2MB huge pages to cut the page fault count for a 16MB mapping from 256 to 8, closing the gap.
## What We Built

### 2MB Huge Page Support (Phases 1-4)

A full transparent huge page implementation across six files:
- **Page table support** (`platform/x64/paging.rs`): `HUGE_PAGE` flag (PS bit 7), `traverse_to_pd()`, `map_huge_user_page()`, `unmap_huge_user_page()`, `split_huge_page()` (2MB PDE -> 512 x 4KB PTEs), and an `is_pde_empty()` guard. Updated `lookup_paddr()` and `traverse()` to handle the PS bit at level 2.
- **Demand paging** (`kernel/mm/page_fault.rs`): a huge page fast path before the 4KB fault-around. Checks 2MB alignment, VMA coverage, and PDE emptiness before mapping a 2MB page.
- **Fork CoW** (`platform/x64/paging.rs`): `duplicate_table` at level 2 detects the PS bit and shares the huge page read-only with a refcount. A write fault splits it into 512 x 4KB PTEs, then normal CoW handles the faulting page.
- **munmap/mprotect awareness**: detects huge pages at 2MB boundaries. Full huge pages are unmapped/updated directly; partial ranges are split first.
- **2MB-aligned mmap** (`kernel/mm/vm.rs`): `alloc_vaddr_range_aligned()` for large anonymous mappings, ensuring every 2MB region is fully within its VMA.
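As a sketch of the split path, assuming standard x86-64 PDE/PTE layouts; the function name and bit masks below are illustrative, not Kevlar's actual code:

```rust
const PAGE_SIZE: u64 = 4096;
const PTES_PER_PT: u64 = 512;
const PS_BIT: u64 = 1 << 7; // "huge page" bit in a PDE

// Given a 2MB PDE, compute the 512 4KB PTE values that cover the same
// physical range with the same low permission bits (PS cleared).
// Simplified: ignores PAT and the upper NX/reserved bits.
fn split_pde(pde: u64) -> Vec<u64> {
    let flags = pde & 0xFFF & !PS_BIT;          // low permission bits, PS dropped
    let base = pde & !0x1F_FFFF & !(1u64 << 63); // 2MB-aligned physical base
    (0..PTES_PER_PT)
        .map(|i| (base + i * PAGE_SIZE) | flags)
        .collect()
}

fn main() {
    // 2MB huge page at phys 0x4000_0000, present+writable+user+PS.
    let pde = 0x4000_0000u64 | 0b111 | PS_BIT;
    let ptes = split_pde(pde);
    assert_eq!(ptes.len(), 512);
    assert_eq!(ptes[0], 0x4000_0007);   // PS bit cleared, flags kept
    assert_eq!(ptes[511], 0x401F_F007); // last 4KB page of the 2MB range
    println!("first PTE = {:#x}, last PTE = {:#x}", ptes[0], ptes[511]);
}
```

After the split, only the one faulting PTE needs CoW treatment; the other 511 stay shared read-only.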
### Buddy Allocator Coalescing

The original buddy allocator had no coalescing on free -- freed pages went to order-0 lists, and higher-order blocks came only from untouched init-time regions. Under KVM, untouched pages have cold EPT entries (~13us per first access vs ~200ns for warm pages).

Added proper buddy coalescing: on free, check whether the buddy is also free via a free-list scan, merge into the higher order, and recurse up to MAX_ORDER. This ensures freed pages (with warm EPT from prior use) are consolidated into blocks that can be re-split for efficient allocation.
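A minimal sketch of the coalescing logic, using a free set per order instead of Kevlar's actual free lists; names and constants are illustrative:

```rust
use std::collections::HashSet;

const PAGE_SHIFT: u32 = 12; // 4KB base pages
const MAX_ORDER: usize = 10;

struct Buddy {
    free: Vec<HashSet<u64>>, // free[order] = set of free block base addresses
}

impl Buddy {
    fn new() -> Self {
        Buddy { free: vec![HashSet::new(); MAX_ORDER + 1] }
    }

    // Free a block of 2^order pages at `addr`, merging with its buddy
    // while the buddy is also free, up to MAX_ORDER.
    fn free_block(&mut self, mut addr: u64, mut order: usize) {
        while order < MAX_ORDER {
            // The buddy differs only in the bit that selects which half
            // of the next-order block this one is.
            let buddy = addr ^ (1u64 << (order as u32 + PAGE_SHIFT));
            if !self.free[order].remove(&buddy) {
                break; // buddy busy: stop merging
            }
            addr = addr.min(buddy); // merged block starts at the lower half
            order += 1;
        }
        self.free[order].insert(addr);
    }
}

fn main() {
    let mut b = Buddy::new();
    b.free_block(0x1000, 0); // free the page at 4KB
    b.free_block(0x0000, 0); // free its buddy: coalesces to one order-1 block
    assert!(b.free[1].contains(&0x0000));
    assert!(b.free[0].is_empty());
}
```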
### Fault-Around Improvements

- Capped fault-around at 2MB boundaries to avoid pre-populating PTEs in adjacent PDE regions (which would block future huge page mappings).
- Switched from per-page `try_map_user_page_with_prot` to `batch_try_map_user_pages_with_prot` (one page table traversal per 512-entry PT instead of one per page).
- Fixed a latent bug: fault-around pages were missing `page_ref_init()` calls, leaving refcounts uninitialized for CoW.
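The 2MB cap can be sketched as a window clamp; the 16-page window size and all names below are assumptions, not Kevlar's actual values:

```rust
const PAGE_SIZE: u64 = 4096;
const HUGE_SIZE: u64 = 2 * 1024 * 1024;
const FAULT_AROUND_PAGES: u64 = 16;

// Returns the [start, end) window to pre-populate around `fault_addr`,
// clamped to both the enclosing 2MB (PDE) region and the VMA bounds.
fn fault_around_window(fault_addr: u64, vma_start: u64, vma_end: u64) -> (u64, u64) {
    let page = fault_addr & !(PAGE_SIZE - 1);
    let pde_start = fault_addr & !(HUGE_SIZE - 1); // enclosing 2MB region
    let pde_end = pde_start + HUGE_SIZE;
    let half = FAULT_AROUND_PAGES / 2 * PAGE_SIZE;
    let start = page.saturating_sub(half).max(pde_start).max(vma_start);
    let end = (page + half).min(pde_end).min(vma_end);
    (start, end)
}

fn main() {
    // Fault just past a 2MB boundary: the window must not reach back
    // into the previous PDE region.
    let (s, e) = fault_around_window(0x20_1000, 0x0, 0x40_0000);
    assert_eq!(s, 0x20_0000); // clamped at the 2MB boundary
    assert_eq!(e, 0x20_9000); // fault page + 8 pages forward
    println!("window = {:#x}..{:#x}", s, e);
}
```

Keeping the window inside one PDE region means an untouched neighboring PDE stays empty and remains eligible for a future 2MB mapping.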
## The Deep Dive: Why Huge Pages Didn't Close the Gap

Initial benchmarks showed only a ~4% improvement from huge pages. Deeper investigation revealed three causes:
- **QEMU forces 4KB host pages**: QEMU calls `madvise(MADV_NOHUGEPAGE)` on guest memory during `-mem-prealloc`. This prevents KVM from creating 2MB EPT entries regardless of guest page table structure. Both Linux and Kevlar guests are equally affected.
- **Cold EPT for order-9 blocks**: the buddy allocator's `alloc_huge_page` returns contiguous 2MB blocks from init-time regions where only page 0 was ever accessed. Zeroing 511 cold-EPT pages costs ~6.8ms (vs ~0.8ms for warm pages). Chunked zeroing, user-mapping zeroing, and EPT pre-warming were all tried -- none helped, because the root issue is the per-page EPT violation cost under KVM.
- **The real bottleneck**: with 4KB EPT entries forced by QEMU, the cost of first-accessing each physical page (~1.5us per EPT violation) dominates regardless of guest page table granularity.
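As a back-of-envelope check of that claim, the cost model below uses the per-violation figure quoted above; the per-guest-fault overhead is an invented illustrative constant, not a measurement:

```rust
// Why guest page size doesn't matter when the host forces 4KB EPT
// entries: first-touch cost scales with pages touched, not guest faults.
fn first_touch_cost_us(pages_touched: u64, guest_faults: u64) -> f64 {
    const EPT_VIOLATION_US: f64 = 1.5; // per cold 4KB host page (measured above)
    const GUEST_FAULT_US: f64 = 0.5;   // illustrative per-guest-fault overhead
    pages_touched as f64 * EPT_VIOLATION_US + guest_faults as f64 * GUEST_FAULT_US
}

fn main() {
    let pages = 4096; // 16MB / 4KB
    let with_4k = first_touch_cost_us(pages, 256); // 4KB guest pages + fault-around
    let with_2m = first_touch_cost_us(pages, 8);   // 2MB guest pages
    // The 4096 * 1.5us = 6144us EPT term dominates both cases, so cutting
    // guest faults from 256 to 8 barely moves the total.
    println!("4KB guest pages: {:.0}us, 2MB guest pages: {:.0}us", with_4k, with_2m);
}
```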
## The Actual Bug: An Unfair Benchmark Comparison

After exhaustive optimization, we discovered that the Linux KVM baseline itself was wrong.
`run-all-benchmarks.py` Linux invocation:

```
-append "console=ttyS0 quiet panic=-1 rdinit=/init"
# /init is the bench binary; PID 1 defaults to QUICK mode (256 pages)
```

Kevlar invocation:

```
INIT_SCRIPT="/bin/bench --full"
# Always uses FULL mode (4096 pages)
```
Linux was benchmarking with 256 pages while Kevlar used 4096 pages -- a 16x iteration-count mismatch. The `ITERS(full, quick)` macro in `bench.c` uses quick mode when PID == 1 unless `--full` is explicitly passed.

Fix: added `-- --full` to the Linux guest's `rdinit=` kernel cmdline.
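The selection logic can be sketched as follows, assuming `ITERS(full, quick)` behaves as described; this Rust mirror is illustrative, not the actual `bench.c` macro:

```rust
// Quick mode when running as PID 1 (i.e. as the guest's init process)
// unless --full was explicitly passed on the command line.
fn iters(full: u64, quick: u64, pid: u32, full_flag: bool) -> u64 {
    if full_flag || pid != 1 { full } else { quick }
}

fn main() {
    // Linux guest: bench ran as rdinit (PID 1) with no --full -> quick mode.
    assert_eq!(iters(4096, 256, 1, false), 256);
    // Kevlar guest: launched with --full -> full mode.
    assert_eq!(iters(4096, 256, 1, true), 4096);
    println!("baselines differed by {}x", 4096 / 256);
}
```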
## Results
With the fair comparison (both using 4096 pages):
| Profile | Kevlar | Linux KVM | Ratio |
|---|---|---|---|
| Fortress | 1623ns | 1712ns | 0.95x |
| Balanced | 1581ns | 1712ns | 0.92x |
| Performance | 1699ns | 1712ns | 0.99x |
| Ludicrous | 1665ns | 1712ns | 0.97x |
Kevlar is 1-8% **faster** than Linux KVM on `mmap_fault`. 30/31 contract tests pass, and all 38 benchmarks pass.
## M10 Phase 1: Alpine rootfs

With `mmap_fault` at parity, we began M10 (Alpine Linux support):
- Added a `/dev/ttyS0` device node (serial console alias)
- Implemented `TIOCSCTTY` and `TIOCNOTTY` ioctl stubs
- Added an `rt_sigtimedwait` (syscall 128) stub
- Created `/etc/inittab` for BusyBox init with sysinit mounts
- Added `/etc/shadow`, `/etc/hostname`, `/etc/issue`
- BusyBox init successfully reads inittab, mounts proc/sys/tmpfs, and spawns a shell
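The actual inittab isn't shown in this log; a minimal BusyBox `/etc/inittab` matching the described behavior (sysinit mounts plus a shell, in BusyBox's `id::action:process` format) might look like:

```
::sysinit:/bin/mount -t proc proc /proc
::sysinit:/bin/mount -t sysfs sysfs /sys
::sysinit:/bin/mount -t tmpfs tmpfs /tmp
::askfirst:/bin/sh
```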
## Files Changed
| Area | Files |
|---|---|
| Huge pages | platform/x64/paging.rs, kernel/mm/page_fault.rs, kernel/mm/vm.rs, kernel/syscalls/{mmap,munmap,mprotect}.rs |
| Allocator | libs/kevlar_utils/buddy_alloc.rs, platform/page_allocator.rs, platform/page_ops.rs |
| Exports | platform/lib.rs, platform/x64/mod.rs |
| M10 Phase 1 | kernel/fs/devfs/{mod,tty}.rs, kernel/syscalls/mod.rs, testing/Dockerfile, testing/etc/* |
| Benchmark fix | tools/run-all-benchmarks.py |