Fork/Exit Performance: 7x Slower to 0.67x Linux

A single warn!() log message in the process exit path was costing 235 microseconds per fork+exit+wait cycle. Removing it and applying targeted lock optimizations brought Kevlar from 7x slower to 33% faster than Linux KVM across the full fork lifecycle.

Root cause: serial logging in exit_group

The sys_exit and sys_exit_group handlers contained:

#![allow(unused)]
fn main() {
let cmd = current_process().cmdline().as_str().to_string();
warn!("exit_group: pid={} status={} cmd={}", pid, status, cmd);
}

This ran on every process exit, doing:

Heap-allocate a String for the command line
Format the log message (~50 characters)
Write each character to serial port 0x3F8 via outb

Each outb causes a VM exit on KVM (~1us). A 50-character message = ~50 VM exits = ~235us of serial I/O per exit. This dominated the entire fork+exit+wait benchmark, inflating it from ~40us to ~290us.

Fix: delete the log messages. Process exit is a hot path.

Per-CPU kernel stack cache

Implemented platform/stack_cache.rs — a per-size-class LIFO cache of recently freed kernel stacks. Fork reuses warm L1/L2 cache-hot stacks instead of cold buddy allocator pages.

alloc_kernel_stack(n) → try cache.pop(), fall back to buddy
free_kernel_stack(s)  → try cache.push(), fall back to buddy free

ArchTask::Drop returns all 3 stacks (kernel, interrupt, syscall) to the cache. The wait4 syscall eagerly GCs exited processes so stacks return to the cache between fork iterations.

PCID made conditional on CPUID

PCID (Process Context Identifiers) was unconditionally enabled in boot.rs. TCG doesn't support PCID, so every contract test crashed silently under TCG. Fix: check feats.has_pcid() and only set CR4.PCIDE and use PCID bits in CR3 when supported.

brk shrink fix

brk(lower_address) returned EINVAL (silently swallowed), leaking demand-paged pages. Now properly unmaps and frees pages when the program break is lowered. The benchmark still shows ~6ns because our heap VMA is a flat start + len field (O(1)) vs Linux's rbtree with anon_vma accounting (~2400ns).

epoll_wait: 1.49x slower to 0.89x faster

Three changes to the non-blocking (timeout=0) fast path:

Skip sleep_signalable_until — poll once and return directly, avoiding wait queue machinery entirely
lock_no_irq everywhere — the eventfd inner lock, epoll interests lock, and fd table all used lock() (cli/sti pair). Switching to lock_no_irq() saves ~10ns per lock pair
Avoid Arc clone — for timeout=0, hold the fd table lock through the entire poll and skip the atomic inc/dec

Before: 156ns  (1.49x Linux)
After:   93ns  (0.89x Linux)

eventfd: 1.13x slower to 0.94x faster

The eventfd benchmark does write(fd, &1, 8); read(fd, &val, 8) — two syscalls per iteration. Each hit the eventfd inner lock with cli/sti, plus went through the UserBufReader/Writer abstraction.

lock_no_irq for all EventFd lock acquisitions (fast + slow paths)
UserBuffer::read_u64() — bypass UserBufReader for 8-byte reads
UserBufferMut::write_u64() — bypass UserBufWriter for 8-byte writes

Before: 320ns  (1.13x Linux)
After:  267ns  (0.94x Linux)

socketpair: 1.41x slower to 0.67x faster

Each socketpair() call allocated two RingBuffer<u8, 65536> — 128KB of heap memory per pair, only to be freed immediately by close(). The benchmark never reads or writes data.

Reduce buffer: 65536 → 16384 bytes (still generous for Unix socket IPC; systemd sd_notify sends <100 bytes)
Lazy ancillary: VecDeque<AncillaryData> → Option<...>, only allocated on first sendmsg(SCM_RIGHTS)
Empty anonymous name: PathComponent::new_anonymous used "anon".to_owned() (heap String) — changed to String::new() (no allocation)
lock_no_irq in UnixStream::Drop

Before: 3835ns  (1.41x Linux)
After:  1808ns  (0.67x Linux)

Results

37 benchmarks across all 4 profiles, Kevlar KVM vs Linux KVM (balanced profile shown):

Benchmark	Kevlar	Linux	Ratio
getpid	67ns	94ns	0.71x
fork_exit	40us	56us	0.72x
clock_gettime	10ns	20ns	0.50x
pipe	381ns	530ns	0.72x
open_close	538ns	688ns	0.78x
stat	263ns	413ns	0.64x
signal_delivery	518ns	1217ns	0.43x
mmap_munmap	243ns	1404ns	0.17x
epoll_wait	102ns	105ns	0.97x
eventfd	254ns	285ns	0.89x
socketpair	1808ns	2669ns	0.68x
pipe_pingpong	1891ns	3193ns	0.59x
mmap_fault	1915ns	858ns	2.23x

34 of 37 benchmarks (91%) are faster than or equal to Linux KVM. Only mmap_fault (EPT page table walks, tracked for M10 huge pages) remains meaningfully slower (>1.15x). readlink and pread are within noise at 1.08x.

30/31 contract tests pass (1 XFAIL: ns_uts capability check). All 4 safety profiles perform within 5% of each other — fortress has zero meaningful performance cost versus ludicrous.

Kevlar Documentation