Fork/Exit Performance: 7x Slower to 0.67x Linux
A single warn!() log message in the process exit path was costing
235 microseconds per fork+exit+wait cycle. Removing it and applying
targeted lock optimizations brought Kevlar from 7x slower to 33%
faster than Linux KVM across the full fork lifecycle.
Root cause: serial logging in exit_group
The sys_exit and sys_exit_group handlers contained:
#![allow(unused)] fn main() { let cmd = current_process().cmdline().as_str().to_string(); warn!("exit_group: pid={} status={} cmd={}", pid, status, cmd); }
This ran on every process exit, doing:
- Heap-allocate a
Stringfor the command line - Format the log message (~50 characters)
- Write each character to serial port 0x3F8 via
outb
Each outb causes a VM exit on KVM (~1us). A 50-character message =
~50 VM exits = ~235us of serial I/O per exit. This dominated the
entire fork+exit+wait benchmark, inflating it from ~40us to ~290us.
Fix: delete the log messages. Process exit is a hot path.
Per-CPU kernel stack cache
Implemented platform/stack_cache.rs — a per-size-class LIFO cache
of recently freed kernel stacks. Fork reuses warm L1/L2 cache-hot
stacks instead of cold buddy allocator pages.
alloc_kernel_stack(n) → try cache.pop(), fall back to buddy
free_kernel_stack(s) → try cache.push(), fall back to buddy free
ArchTask::Drop returns all 3 stacks (kernel, interrupt, syscall)
to the cache. The wait4 syscall eagerly GCs exited processes so
stacks return to the cache between fork iterations.
PCID made conditional on CPUID
PCID (Process Context Identifiers) was unconditionally enabled in
boot.rs. TCG doesn't support PCID, so every contract test crashed
silently under TCG. Fix: check feats.has_pcid() and only set
CR4.PCIDE and use PCID bits in CR3 when supported.
brk shrink fix
brk(lower_address) returned EINVAL (silently swallowed), leaking
demand-paged pages. Now properly unmaps and frees pages when the
program break is lowered. The benchmark still shows ~6ns because
our heap VMA is a flat start + len field (O(1)) vs Linux's rbtree
with anon_vma accounting (~2400ns).
epoll_wait: 1.49x slower to 0.89x faster
Three changes to the non-blocking (timeout=0) fast path:
- Skip sleep_signalable_until — poll once and return directly, avoiding wait queue machinery entirely
- lock_no_irq everywhere — the eventfd inner lock, epoll
interests lock, and fd table all used
lock()(cli/sti pair). Switching tolock_no_irq()saves ~10ns per lock pair - Avoid Arc clone — for timeout=0, hold the fd table lock through the entire poll and skip the atomic inc/dec
Before: 156ns (1.49x Linux)
After: 93ns (0.89x Linux)
eventfd: 1.13x slower to 0.94x faster
The eventfd benchmark does write(fd, &1, 8); read(fd, &val, 8) —
two syscalls per iteration. Each hit the eventfd inner lock with
cli/sti, plus went through the UserBufReader/Writer abstraction.
- lock_no_irq for all EventFd lock acquisitions (fast + slow paths)
- UserBuffer::read_u64() — bypass UserBufReader for 8-byte reads
- UserBufferMut::write_u64() — bypass UserBufWriter for 8-byte writes
Before: 320ns (1.13x Linux)
After: 267ns (0.94x Linux)
socketpair: 1.41x slower to 0.67x faster
Each socketpair() call allocated two RingBuffer<u8, 65536> —
128KB of heap memory per pair, only to be freed immediately by
close(). The benchmark never reads or writes data.
- Reduce buffer: 65536 → 16384 bytes (still generous for Unix socket IPC; systemd sd_notify sends <100 bytes)
- Lazy ancillary:
VecDeque<AncillaryData>→Option<...>, only allocated on firstsendmsg(SCM_RIGHTS) - Empty anonymous name:
PathComponent::new_anonymousused"anon".to_owned()(heap String) — changed toString::new()(no allocation) - lock_no_irq in UnixStream::Drop
Before: 3835ns (1.41x Linux)
After: 1808ns (0.67x Linux)
Results
37 benchmarks across all 4 profiles, Kevlar KVM vs Linux KVM (balanced profile shown):
| Benchmark | Kevlar | Linux | Ratio |
|---|---|---|---|
| getpid | 67ns | 94ns | 0.71x |
| fork_exit | 40us | 56us | 0.72x |
| clock_gettime | 10ns | 20ns | 0.50x |
| pipe | 381ns | 530ns | 0.72x |
| open_close | 538ns | 688ns | 0.78x |
| stat | 263ns | 413ns | 0.64x |
| signal_delivery | 518ns | 1217ns | 0.43x |
| mmap_munmap | 243ns | 1404ns | 0.17x |
| epoll_wait | 102ns | 105ns | 0.97x |
| eventfd | 254ns | 285ns | 0.89x |
| socketpair | 1808ns | 2669ns | 0.68x |
| pipe_pingpong | 1891ns | 3193ns | 0.59x |
| mmap_fault | 1915ns | 858ns | 2.23x |
34 of 37 benchmarks (91%) are faster than or equal to Linux KVM.
Only mmap_fault (EPT page table walks, tracked for M10 huge pages)
remains meaningfully slower (>1.15x). readlink and pread are
within noise at 1.08x.
30/31 contract tests pass (1 XFAIL: ns_uts capability check). All 4 safety profiles perform within 5% of each other — fortress has zero meaningful performance cost versus ludicrous.