M9.6: Page Cache, Exec Prefaulting, and the Permission Bug That Hid Everything
Blog post 070 ended with a table of shame: pipe_grep at 15x slower than
Linux, sed_pipeline at 21x. Every benchmark that touched fork+exec was
an order of magnitude off. We set out to profile, fix, and verify — and
ended up finding that a latent VMA permissions bug was masking every
optimization we tried.
The profile says: page faults dominate
We added TSC-based page fault counters to the existing syscall profiler.
Two global atomics (PAGE_FAULT_COUNT, PAGE_FAULT_CYCLES) accumulate
across all CPUs. The profiler dump now includes a page_faults entry
alongside the per-syscall breakdown.
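A minimal sketch of what such counters might look like. The two static names come from the post; the `AtomicU64` type, the relaxed orderings, the `PROFILER_ENABLED` flag name, and both helper functions are assumptions for illustration. TSC values are passed in as plain integers so the sketch runs outside a kernel.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Counter names are from the post; types and orderings are assumed.
pub static PAGE_FAULT_COUNT: AtomicU64 = AtomicU64::new(0);
pub static PAGE_FAULT_CYCLES: AtomicU64 = AtomicU64::new(0);

// Hypothetical enable flag: one relaxed load on the fast path when disabled.
pub static PROFILER_ENABLED: AtomicBool = AtomicBool::new(false);

/// Record one handled fault given TSC readings taken at handler entry and exit.
pub fn record_page_fault(start_tsc: u64, end_tsc: u64) {
    if !PROFILER_ENABLED.load(Ordering::Relaxed) {
        return;
    }
    PAGE_FAULT_COUNT.fetch_add(1, Ordering::Relaxed);
    PAGE_FAULT_CYCLES.fetch_add(end_tsc.wrapping_sub(start_tsc), Ordering::Relaxed);
}

/// Average cycles per fault, or None if nothing has been recorded yet.
pub fn average_fault_cycles() -> Option<u64> {
    let count = PAGE_FAULT_COUNT.load(Ordering::Relaxed);
    if count == 0 {
        None
    } else {
        Some(PAGE_FAULT_CYCLES.load(Ordering::Relaxed) / count)
    }
}
```

Relaxed ordering is enough here because the two counters are only summed for reporting and never used to synchronize other memory.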
The numbers confirmed the hypothesis: each exec of BusyBox triggers ~100-300 demand-paging faults for text and rodata pages. Under KVM, each fault is a VM exit (~200ns) + handler (~300ns) + VM entry (~200ns) = ~700ns per page. At 300 pages, that's ~200µs per exec — more than 3x what Linux spends on the entire fork+exec+wait cycle.
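The estimate above is simple enough to write down as code. This sketch only restates the rough per-phase figures quoted in the paragraph; the constant and function names are made up for the example.

```rust
// Rough per-fault costs under KVM, in nanoseconds, as quoted in the text.
const VM_EXIT_NS: u64 = 200;
const HANDLER_NS: u64 = 300;
const VM_ENTRY_NS: u64 = 200;

/// Cost of one demand-paging fault: exit + handler + entry.
pub const fn fault_cost_ns() -> u64 {
    VM_EXIT_NS + HANDLER_NS + VM_ENTRY_NS
}

/// Total fault overhead for an exec that takes `faults` page faults.
pub const fn exec_fault_overhead_ns(faults: u64) -> u64 {
    faults * fault_cost_ns()
}
```

With 300 faults this gives 210,000 ns, which is the "~200µs per exec" figure; at the low end of the range, 100 faults still cost about 70µs.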
Fix 1: initramfs page cache
Linux keeps file pages in a global page cache so repeated execs of
/bin/busybox hit cached physical pages instead of re-reading from disk.
Kevlar's initramfs files are &'static [u8] — truly immutable. We can
do even better than Linux: share the physical pages directly across
processes, zero-copy.
The cache is a HashMap<(usize, usize), PAddr> keyed by (file_data_ptr, page_index) behind a single SpinLock. The file_data_ptr is the thin
pointer from Arc::as_ptr() on the VMA's Arc<dyn FileLike> — stable
because initramfs files are never deallocated.
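A sketch of the cache shape described above, under several assumptions: `std::sync::Mutex` stands in for the kernel `SpinLock`, `PAddr` is a stand-in newtype, and the `PageCache` type with its `key_for`, `lookup`, and `insert` methods is hypothetical. The part worth seeing is how a fat `dyn` pointer is narrowed to a thin data pointer for the key.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the kernel's physical address type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PAddr(pub usize);

/// Stand-in for the post's FileLike trait.
pub trait FileLike {}

pub struct InitramfsFile {
    pub data: &'static [u8],
}
impl FileLike for InitramfsFile {}

/// (file data pointer, page index within the file)
pub type CacheKey = (usize, usize);

pub struct PageCache {
    // std Mutex stands in for the single kernel SpinLock.
    pages: Mutex<HashMap<CacheKey, PAddr>>,
}

impl PageCache {
    pub fn new() -> Self {
        PageCache { pages: Mutex::new(HashMap::new()) }
    }

    pub fn key_for(file: &Arc<dyn FileLike>, page_index: usize) -> CacheKey {
        // Arc::as_ptr yields a fat pointer; casting to *const () drops the
        // vtable half and keeps only the data address.
        let data_ptr = Arc::as_ptr(file) as *const () as usize;
        (data_ptr, page_index)
    }

    pub fn lookup(&self, key: CacheKey) -> Option<PAddr> {
        self.pages.lock().unwrap().get(&key).copied()
    }

    pub fn insert(&self, key: CacheKey, paddr: PAddr) {
        self.pages.lock().unwrap().insert(key, paddr);
    }
}
```

Every clone of the same `Arc` produces the same key, so two processes mapping the same file land on the same cache entry.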
Three paths through the page fault handler:
- Cache miss: allocate page, read from file, insert into cache. page_ref_init(paddr) then page_ref_inc(paddr) gives refcount 2 (one for the mapping, one for the cache).
- Cache hit, read-only VMA: free the pre-allocated page, bump the cached page's refcount, map it directly. No allocation, no copy.
- Cache hit, writable VMA: copy from cached page to the fresh page. Skips the file read but still allocates. CoW handles later writes.
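The three paths can be modelled in a few lines. This is a toy model rather than the kernel's handler: the `Model` struct, the `FaultPath` enum, and the use of plain hash maps for the cache and refcounts are all assumptions, and the file read, page free, and page copy are elided as comments. The post does not say how a miss on a writable VMA is treated; the model handles every miss the same way.

```rust
use std::collections::HashMap;

type PAddr = usize;
type CacheKey = (usize, usize);

#[derive(Debug, PartialEq, Eq)]
pub enum FaultPath {
    Miss,      // fresh page filled from the file and inserted into the cache
    SharedHit, // cached page mapped directly
    CopiedHit, // cached page copied into the fresh page
}

pub struct Model {
    pub cache: HashMap<CacheKey, PAddr>,
    pub refcount: HashMap<PAddr, u32>,
}

impl Model {
    /// `fresh` is the page allocated before the cache lookup.
    /// Returns the page to map and the path taken.
    pub fn resolve(&mut self, key: CacheKey, fresh: PAddr, vma_writable: bool) -> (PAddr, FaultPath) {
        match self.cache.get(&key).copied() {
            None => {
                // (read file contents into `fresh` here)
                // init to 1, then inc: one reference for the mapping, one for the cache.
                self.refcount.insert(fresh, 2);
                self.cache.insert(key, fresh);
                (fresh, FaultPath::Miss)
            }
            Some(cached) if !vma_writable => {
                // (free `fresh` here) then share the cached page.
                *self.refcount.get_mut(&cached).expect("cached page has a refcount") += 1;
                (cached, FaultPath::SharedHit)
            }
            Some(_cached) => {
                // (copy cached page into `fresh` here); the private copy has one owner.
                self.refcount.insert(fresh, 1);
                (fresh, FaultPath::CopiedHit)
            }
        }
    }
}
```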
We added is_content_immutable() to the FileLike trait (defaults to
false), overriding to true in the initramfs. Only immutable files
enter the cache.
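The trait change is a default method plus one override. In this sketch the method name is from the post; the reduced `FileLike` trait, the `TmpfsFile` type, and the `may_enter_page_cache` helper are hypothetical stand-ins.

```rust
/// Reduced stand-in for the kernel trait; only the new method is shown.
pub trait FileLike {
    /// Whether the backing bytes can never change. Defaults to false,
    /// so existing file types stay out of the cache without any edits.
    fn is_content_immutable(&self) -> bool {
        false
    }
}

pub struct InitramfsFile {
    pub data: &'static [u8],
}

impl FileLike for InitramfsFile {
    fn is_content_immutable(&self) -> bool {
        true
    }
}

/// Hypothetical mutable file type that keeps the default.
pub struct TmpfsFile;
impl FileLike for TmpfsFile {}

/// Hypothetical gate used by the fault handler before touching the cache.
pub fn may_enter_page_cache(file: &dyn FileLike) -> bool {
    file.is_content_immutable()
}
```

Defaulting to `false` makes the cache opt-in: a new file type has to assert immutability explicitly before its pages can be shared.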
Result: pipe_grep 979µs → 825µs (16% faster), sed_pipeline 1370µs → 949µs (31% faster). Good, but still 10-15x off Linux.
Fix 2: exec-time prefaulting
The page cache eliminates the file-read overhead but not the VM exits.
Each demand-paging fault still costs ~700ns for the exit/entry round-trip.
Linux avoids this by mapping cached pages at execve() time, before the
process starts running.
We added prefault_cached_pages() to the exec path, called from
do_elf_binfmt() after load_elf_segments() creates the VMAs. It holds
the page cache lock once, iterates through file-backed VMAs, and for each
page-aligned full-page region checks the cache. Hits get mapped directly
via try_map_user_page_with_prot() with page_ref_inc() for the new
mapping.
A critical detail: prefaulted pages are mapped read-only
(PROT_READ|PROT_EXEC) regardless of the VMA's write permission. If the
process writes to a prefaulted page, the CoW path in the fault handler
allocates a private copy. This prevents shared-writable corruption across
processes.
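A simplified model of the prefault walk. Only the function name comes from the post; the `Vma` and `Mapping` structs are invented for the example, the cache is a plain map passed by reference (standing in for holding the lock once across the whole walk), and VMAs whose start or file offset is not page-aligned are skipped outright, which is cruder than the per-region check the post describes.

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

pub struct Vma {
    pub start: usize,       // virtual start address
    pub len: usize,         // length in bytes
    pub file_key: usize,    // file data pointer used in the cache key
    pub file_offset: usize, // file offset corresponding to `start`
    pub writable: bool,     // deliberately ignored when choosing the mapping
}

#[derive(Debug, PartialEq, Eq)]
pub struct Mapping {
    pub vaddr: usize,
    pub paddr: usize,
    pub writable: bool,
}

pub fn prefault_cached_pages(vmas: &[Vma], cache: &HashMap<(usize, usize), usize>) -> Vec<Mapping> {
    let mut mapped = Vec::new();
    for vma in vmas {
        // Simplification: only handle fully aligned VMAs.
        if vma.start % PAGE_SIZE != 0 || vma.file_offset % PAGE_SIZE != 0 {
            continue;
        }
        // Trailing partial pages are left to demand paging.
        for i in 0..vma.len / PAGE_SIZE {
            let page_index = vma.file_offset / PAGE_SIZE + i;
            if let Some(&paddr) = cache.get(&(vma.file_key, page_index)) {
                // Always read-only; a later write goes through the CoW path.
                // (The real code would also bump the page refcount here.)
                mapped.push(Mapping { vaddr: vma.start + i * PAGE_SIZE, paddr, writable: false });
            }
        }
    }
    mapped
}
```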
First attempt: zero improvement. The prefault function showed
checked=0.
The bug: all VMAs were writable
load_elf_segments() created file-backed VMAs via add_vm_area(), which
defaults to PROT_READ | PROT_WRITE | PROT_EXEC. Every VMA — including
BusyBox's .text segment — appeared writable.
This broke two things:
- The demand-paging cache path always took the "writable VMA" branch, copying from cache to a fresh page instead of sharing.
- Prefaulting skipped all VMAs (our safety filter excluded writable ones).
The fix: convert ELF p_flags to proper MMapProt values.
    fn elf_flags_to_prot(p_flags: u32) -> MMapProt {
        let mut prot = MMapProt::empty();
        if p_flags & 4 != 0 { prot |= MMapProt::PROT_READ; }
        if p_flags & 2 != 0 { prot |= MMapProt::PROT_WRITE; }
        if p_flags & 1 != 0 { prot |= MMapProt::PROT_EXEC; }
        prot
    }
And use add_vm_area_with_prot() instead of add_vm_area() for
file-backed segments.
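To see why this matters for sharing, here is a standalone version of the same mapping with the ELF flag bits named. The `PF_X`, `PF_W`, and `PF_R` values are the standard ELF program header flags; the `Prot` struct is a stand-in for `MMapProt` so the example compiles on its own.

```rust
// ELF program header flag bits.
const PF_X: u32 = 1;
const PF_W: u32 = 2;
const PF_R: u32 = 4;

/// Stand-in for MMapProt, spelled out as three booleans.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Prot {
    pub read: bool,
    pub write: bool,
    pub exec: bool,
}

pub fn elf_flags_to_prot(p_flags: u32) -> Prot {
    Prot {
        read: p_flags & PF_R != 0,
        write: p_flags & PF_W != 0,
        exec: p_flags & PF_X != 0,
    }
}
```

A typical text segment carries `PF_R | PF_X` and so comes out non-writable, which is what lets it take the shared read-only cache path; a data segment with `PF_R | PF_W` stays on the copy path.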
Fix 3: intermediate page table attributes
When the ELF prot fix went in, we found that read-only/NX leaf PTEs were propagating their restrictions upward through the page table hierarchy. On x86-64, effective permissions are the intersection of all four levels (PML4 → PDPT → PD → PT). If a PDE was written with NX set because the first mapping through it was NX, every other PTE in the page table under that PDE inherited the NX restriction, silently breaking execute permission for adjacent code pages.
The fix: intermediate entries (PML4E, PDPTE, PDE) always use permissive
flags (PRESENT | USER | WRITABLE, no NO_EXECUTE). Only leaf PTEs
carry the restrictive attributes from the VMA's protection flags.
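The intersection rule is easy to check with a small model. The bit positions below are the architectural x86-64 ones (present = bit 0, writable = bit 1, user = bit 2, NX = bit 63); the `Effective` struct and `effective_permissions` function are illustrative names, and the model covers user-mode accesses only.

```rust
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const USER: u64 = 1 << 2;
const NO_EXECUTE: u64 = 1 << 63;

/// Flags used for every intermediate entry (PML4E, PDPTE, PDE).
pub const INTERMEDIATE_FLAGS: u64 = PRESENT | USER | WRITABLE;

#[derive(Debug, PartialEq, Eq)]
pub struct Effective {
    pub writable: bool,
    pub user: bool,
    pub executable: bool,
}

/// Combine the entries seen on one walk: [PML4E, PDPTE, PDE, PTE].
/// Returns None if any level is not present.
pub fn effective_permissions(entries: [u64; 4]) -> Option<Effective> {
    if entries.iter().any(|e| e & PRESENT == 0) {
        return None;
    }
    Some(Effective {
        // Writable and user must be granted at every level.
        writable: entries.iter().all(|e| e & WRITABLE != 0),
        user: entries.iter().all(|e| e & USER != 0),
        // A single NX bit anywhere on the walk forbids execution.
        executable: entries.iter().all(|e| e & NO_EXECUTE == 0),
    })
}
```

With permissive intermediates the result is decided entirely by the leaf PTE, which is the property the fix relies on.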
This also improved the traverse() hot path: we now only conditionally
write back an intermediate entry if it doesn't already have the expected
permissive flags, avoiding unnecessary stores on the common path.
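A sketch of the conditional write-back, assuming the entry already points at a lower-level table (allocating a missing table is a separate path not shown). The `fix_intermediate` helper and its return value are hypothetical; the flag bits are the architectural x86-64 positions.

```rust
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const USER: u64 = 1 << 2;
const NO_EXECUTE: u64 = 1 << 63;
const PERMISSIVE: u64 = PRESENT | USER | WRITABLE;

/// Make sure an intermediate entry carries the permissive flags.
/// Returns true only when a store was actually needed.
pub fn fix_intermediate(entry: &mut u64) -> bool {
    let wanted = (*entry & !NO_EXECUTE) | PERMISSIVE;
    if *entry == wanted {
        // Common case: nothing to write.
        return false;
    }
    *entry = wanted;
    true
}
```

After the first walk through a given entry the comparison succeeds and the function becomes read-only, which is the saving described above.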
Fix 4: minor optimizations
Tmpfs read lock scope: for reads ≤ 4096 bytes, copy data to a stack
buffer under the spinlock, drop the lock, then usercopy. Reduces lock
hold time from the usercopy duration to a fast memcpy.
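A userspace sketch of the narrowed lock scope. `std::sync::Mutex` stands in for the kernel spinlock, a `&mut [u8]` stands in for user memory, and the `TmpfsFile` layout is invented for the example. The post does not describe the path for larger reads; this sketch simply copies under the lock in that case.

```rust
use std::sync::Mutex;

const STACK_BUF_LEN: usize = 4096;

pub struct TmpfsFile {
    data: Mutex<Vec<u8>>, // Mutex stands in for the kernel SpinLock
}

impl TmpfsFile {
    pub fn new(data: Vec<u8>) -> Self {
        TmpfsFile { data: Mutex::new(data) }
    }

    /// Read up to dst.len() bytes starting at `offset`; returns bytes read.
    pub fn read(&self, offset: usize, dst: &mut [u8]) -> usize {
        if dst.len() > STACK_BUF_LEN {
            // Assumed fallback for large reads: copy while holding the lock.
            let data = self.data.lock().unwrap();
            let start = offset.min(data.len());
            let n = (data.len() - start).min(dst.len());
            dst[..n].copy_from_slice(&data[start..start + n]);
            return n;
        }
        let mut buf = [0u8; STACK_BUF_LEN];
        let n = {
            let data = self.data.lock().unwrap();
            let start = offset.min(data.len());
            let n = (data.len() - start).min(dst.len());
            // Fast memcpy into the stack buffer while the lock is held.
            buf[..n].copy_from_slice(&data[start..start + n]);
            n
        }; // guard dropped here, before the slow copy
        // The "usercopy" happens with the lock released.
        dst[..n].copy_from_slice(&buf[..n]);
        n
    }
}
```

The extra copy is the price of a shorter critical section: the lock hold time no longer depends on how slow the destination copy is.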
Page fault profiler: accumulates TSC cycles per fault with
near-zero overhead when disabled (single AtomicBool check on the
fast path).
Fix 5: fork CoW bulk memcpy
The duplicate_table_cow function walked all 512 entries of each page
table level, zero-filled the new table first, then conditionally copied
non-null entries one at a time. For a sparse address space (BusyBox uses
~30 pages out of 512 possible per PT), that's 512 reads + ~30 writes +
a wasted 4KB zero-fill per level.
The fix replaces the zero+iterate pattern with a single 4KB
ptr::copy_nonoverlapping (bulk memcpy), then a fixup pass that only
touches entries needing modification:
- Read-only user pages: already correct from the copy, just need page_ref_inc. No write to the child table.
- Writable user pages: clear WRITABLE in both parent and child for CoW. Only these entries trigger writes.
- Kernel pages: shared, already correct from the copy.
The function also separates leaf (level 1) from intermediate paths at the top level, avoiding a per-entry level check in the inner loop.
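A sketch of the leaf-level copy-then-fixup pattern in safe Rust. The kernel version uses `ptr::copy_nonoverlapping`; here `copy_from_slice` plays that role. The `duplicate_leaf_cow` name, the `ref_inc` callback, and the address mask are assumptions, as is bumping the refcount for writable pages (the post only states it for read-only ones, but a page shared by parent and child needs a second reference either way).

```rust
const ENTRIES: usize = 512;
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const USER: u64 = 1 << 2;
const ADDR_MASK: u64 = 0x000f_ffff_ffff_f000;

pub type Table = [u64; ENTRIES];

/// Duplicate one leaf page table for fork.
/// `ref_inc` is called with the physical address of every shared user page.
pub fn duplicate_leaf_cow(parent: &mut Table, child: &mut Table, mut ref_inc: impl FnMut(u64)) {
    // One bulk 4KB copy instead of zero-fill plus per-entry stores.
    child.copy_from_slice(&parent[..]);

    // Fixup pass: only writable user entries cause any further writes.
    for i in 0..ENTRIES {
        let entry = parent[i];
        if entry & PRESENT == 0 || entry & USER == 0 {
            continue; // empty slot or shared kernel page: copy is already right
        }
        ref_inc(entry & ADDR_MASK);
        if entry & WRITABLE != 0 {
            // Drop WRITABLE on both sides so the first write faults into CoW.
            parent[i] = entry & !WRITABLE;
            child[i] = entry & !WRITABLE;
        }
    }
}
```

In a sparse table most iterations hit the early `continue`, so the loop is dominated by reads of the parent table.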
Page table teardown (work in progress)
We implemented teardown_user_pages() — a recursive page table walk
that decrements refcounts and frees intermediate table pages when a Vm is
dropped. Without it, every fork()+exec() leaks the old page table
pages and leaves stale refcounts on cached pages.
The implementation works for simple cases but causes hangs in the BusyBox test suite. It's disabled pending investigation. The leak is bounded (a few KB per process exit) and doesn't affect correctness for the benchmarks.
kwab crash dump integration
We integrated kwab, a structured crash dump manager built alongside Kevlar. kwab provides:
- kwab-format: no_std binary format with CRC32-checksummed sections for registers, syscall traces, flight recorder events, and memory maps
- kwab-cli: import Kevlar's JSONL debug events, inspect dumps, export to JSON, and browse crashes in a TUI
Kevlar already emits structured DBG events over serial for crashes,
panics, and syscall profiles. kwab can import these directly:
    kwab import serial.log -o crash.kwab
    kwab inspect crash.kwab
    kwab tui crashes/
The next step is adding kwab-format as a kernel dependency (it's
no_std) for direct binary emission, bypassing the JSONL intermediate.
Results
BusyBox test suite: 101/101 pass (unchanged)
Workload benchmarks (fork+exec-heavy, Kevlar KVM):
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| exec_true | 177µs | 118µs | 1.50x |
| shell_noop | 345µs | 162µs | 2.13x |
| pipe_grep | 979µs | 429µs | 2.28x |
| sed_pipeline | 1370µs | 526µs | 2.60x |
| fork_exit | 55µs | 43µs | 1.28x |
Syscall micro-benchmarks (selected, Kevlar KVM):
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| getpid | 116ns | 86ns | 1.35x |
| pipe | 528ns | 411ns | 1.28x |
| open_close | 759ns | 624ns | 1.22x |
| mmap_fault | 2040ns | 1830ns | 1.11x |
| mprotect | 1657ns | 1264ns | 1.31x |
| clock_gettime | 14ns | 11ns | 1.27x |
The intermediate page table fix had a surprisingly broad impact — every operation that traverses the page table (which is most of them) got faster. The fork CoW bulk-copy optimization shaved a further ~2µs off fork_exit.
What's next
The workload benchmarks are still roughly 2-8x slower than Linux, which spends ~65µs on a full fork+exec+wait cycle. The remaining gap breaks down as:
- Exec path overhead: ELF parsing + VMA creation + path resolution = ~70µs per exec. Linux does this in ~25µs.
- Page cache coverage: only ~62/289 BusyBox file pages are currently cached (the rest are partial pages at segment boundaries). Relaxing the full-page requirement would increase coverage.
- Page table teardown: fixing the hang to eliminate refcount leaks and reclaim memory on process exit.
- Fork optimization: 42µs per fork; sharing read-only intermediate page table pages could cut this further.