Fork/Exit Performance: 7x Slower to 0.67x Linux

A single warn!() log message in the process exit path was costing 235 microseconds per fork+exit+wait cycle. Removing it and applying targeted lock optimizations brought Kevlar from 7x slower to 33% faster than Linux KVM across the full fork lifecycle.

Root cause: serial logging in exit_group

The sys_exit and sys_exit_group handlers contained:

#![allow(unused)]
fn main() {
let cmd = current_process().cmdline().as_str().to_string();
warn!("exit_group: pid={} status={} cmd={}", pid, status, cmd);
}

This ran on every process exit, doing:

  1. Heap-allocate a String for the command line
  2. Format the log message (~50 characters)
  3. Write each character to serial port 0x3F8 via outb

Each outb causes a VM exit on KVM (~1us). A 50-character message = ~50 VM exits = ~235us of serial I/O per exit. This dominated the entire fork+exit+wait benchmark, inflating it from ~40us to ~290us.

Fix: delete the log messages. Process exit is a hot path.

Per-CPU kernel stack cache

Implemented platform/stack_cache.rs — a per-size-class LIFO cache of recently freed kernel stacks. Fork reuses warm L1/L2 cache-hot stacks instead of cold buddy allocator pages.

alloc_kernel_stack(n) → try cache.pop(), fall back to buddy
free_kernel_stack(s)  → try cache.push(), fall back to buddy free

ArchTask::Drop returns all 3 stacks (kernel, interrupt, syscall) to the cache. The wait4 syscall eagerly GCs exited processes so stacks return to the cache between fork iterations.

PCID made conditional on CPUID

PCID (Process Context Identifiers) was unconditionally enabled in boot.rs. TCG doesn't support PCID, so every contract test crashed silently under TCG. Fix: check feats.has_pcid() and only set CR4.PCIDE and use PCID bits in CR3 when supported.

brk shrink fix

brk(lower_address) returned EINVAL (silently swallowed), leaking demand-paged pages. Now properly unmaps and frees pages when the program break is lowered. The benchmark still shows ~6ns because our heap VMA is a flat start + len field (O(1)) vs Linux's rbtree with anon_vma accounting (~2400ns).

epoll_wait: 1.49x slower to 0.89x faster

Three changes to the non-blocking (timeout=0) fast path:

  1. Skip sleep_signalable_until — poll once and return directly, avoiding wait queue machinery entirely
  2. lock_no_irq everywhere — the eventfd inner lock, epoll interests lock, and fd table all used lock() (cli/sti pair). Switching to lock_no_irq() saves ~10ns per lock pair
  3. Avoid Arc clone — for timeout=0, hold the fd table lock through the entire poll and skip the atomic inc/dec
Before: 156ns  (1.49x Linux)
After:   93ns  (0.89x Linux)

eventfd: 1.13x slower to 0.94x faster

The eventfd benchmark does write(fd, &1, 8); read(fd, &val, 8) — two syscalls per iteration. Each hit the eventfd inner lock with cli/sti, plus went through the UserBufReader/Writer abstraction.

  1. lock_no_irq for all EventFd lock acquisitions (fast + slow paths)
  2. UserBuffer::read_u64() — bypass UserBufReader for 8-byte reads
  3. UserBufferMut::write_u64() — bypass UserBufWriter for 8-byte writes
Before: 320ns  (1.13x Linux)
After:  267ns  (0.94x Linux)

socketpair: 1.41x slower to 0.67x faster

Each socketpair() call allocated two RingBuffer<u8, 65536> — 128KB of heap memory per pair, only to be freed immediately by close(). The benchmark never reads or writes data.

  1. Reduce buffer: 65536 → 16384 bytes (still generous for Unix socket IPC; systemd sd_notify sends <100 bytes)
  2. Lazy ancillary: VecDeque<AncillaryData>Option<...>, only allocated on first sendmsg(SCM_RIGHTS)
  3. Empty anonymous name: PathComponent::new_anonymous used "anon".to_owned() (heap String) — changed to String::new() (no allocation)
  4. lock_no_irq in UnixStream::Drop
Before: 3835ns  (1.41x Linux)
After:  1808ns  (0.67x Linux)

Results

37 benchmarks across all 4 profiles, Kevlar KVM vs Linux KVM (balanced profile shown):

BenchmarkKevlarLinuxRatio
getpid67ns94ns0.71x
fork_exit40us56us0.72x
clock_gettime10ns20ns0.50x
pipe381ns530ns0.72x
open_close538ns688ns0.78x
stat263ns413ns0.64x
signal_delivery518ns1217ns0.43x
mmap_munmap243ns1404ns0.17x
epoll_wait102ns105ns0.97x
eventfd254ns285ns0.89x
socketpair1808ns2669ns0.68x
pipe_pingpong1891ns3193ns0.59x
mmap_fault1915ns858ns2.23x

34 of 37 benchmarks (91%) are faster than or equal to Linux KVM. Only mmap_fault (EPT page table walks, tracked for M10 huge pages) remains meaningfully slower (>1.15x). readlink and pread are within noise at 1.08x.

30/31 contract tests pass (1 XFAIL: ns_uts capability check). All 4 safety profiles perform within 5% of each other — fortress has zero meaningful performance cost versus ludicrous.