083: Benchmark Regression Fixes — Zero Marginals
Context
After the OpenRC boot session (blog 082), five benchmarks had regressed to "marginal" status (10–40% slower than Linux KVM). Each was either caused by a change made during recent sessions or had a simple fix of a few lines.
Before this session:
| Benchmark | Ratio | Status |
|---|---|---|
| pipe | 1.38x | marginal |
| sigaction | 1.23x | marginal |
| epoll_wait | 1.18x | marginal |
| mmap_fault | 1.28x | marginal |
| pipe_grep | 1.11x | marginal |
After:
| Benchmark | Ratio | Status |
|---|---|---|
| pipe | 0.73x | faster |
| sigaction | 0.88x | faster |
| epoll_wait | 1.04x | ok |
| mmap_fault | 0.01x | faster |
| pipe_grep | 0.99x | ok |
Overall: 29 faster, 15 OK, 0 marginal, 0 regression (was 15/24/5/0).
Fix 1: pipe — conditional state_gen fetch_add
Root cause: pipe.rs did state_gen.fetch_add(1, Relaxed) on every read
AND every write, unconditionally. This was added for EPOLLET tracking (blog
077). The atomic RMW costs ~8–10ns each — two per round trip = ~16–20ns
overhead that Linux doesn't have. The pipe benchmark doesn't use epoll, so
this was pure waste on the hot path.
Fix: Added et_watcher_count: AtomicU32 to PipeShared. All six
fetch_add sites (read fast/slow, write fast/slow, reader drop, writer drop)
now check et_watcher_count.load(Relaxed) > 0 first. When there are no
EPOLLET watchers, one cheap relaxed load (~1ns) short-circuits the full
fetch_add (~8–10ns).
To keep the count accurate, added notify_epoll_et(added: bool) to the
FileLike trait (default no-op). PipeReader and PipeWriter override it
to increment/decrement the shared counter. Epoll's add, modify, and
delete methods call this hook when the EPOLLET flag is set or changes.
When an EPOLLET watcher is later added to a pipe whose state_gen wasn't
being incremented, correctness is preserved: new interests start with
last_gen = 0, so any non-zero state_gen value triggers the initial edge.
An important subtlety: poll_gen() on pipes also returns 0 when there are
no ET watchers, which disables the epoll poll-result cache (Fix 3) for that
interest. Without this, the cache would return stale results since
state_gen isn't being maintained — level-triggered epoll would miss
state changes after reads/writes.
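The gating described above can be sketched as follows. This is a minimal model rather than the actual pipe.rs code: the buffer and wait queues are omitted, the helper name bump_gen_if_watched is hypothetical, and the width of the pipe's state_gen (AtomicU64 here) is an assumption.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::Relaxed};

/// Simplified stand-in for the shared pipe state (buffer and wait queues omitted).
pub struct PipeShared {
    /// Generation counter consumed by edge-triggered epoll.
    state_gen: AtomicU64,
    /// Number of EPOLLET interests currently registered on either end.
    et_watcher_count: AtomicU32,
}

impl PipeShared {
    pub const fn new() -> Self {
        Self {
            state_gen: AtomicU64::new(0),
            et_watcher_count: AtomicU32::new(0),
        }
    }

    /// Hypothetical helper for each read/write/drop site:
    /// a cheap relaxed load guards the more expensive atomic RMW.
    #[inline]
    pub fn bump_gen_if_watched(&self) {
        if self.et_watcher_count.load(Relaxed) > 0 {
            self.state_gen.fetch_add(1, Relaxed);
        }
    }

    /// Body of the notify_epoll_et hook: epoll reports ET interests
    /// being added (true) or removed (false).
    pub fn notify_epoll_et(&self, added: bool) {
        if added {
            self.et_watcher_count.fetch_add(1, Relaxed);
        } else {
            self.et_watcher_count.fetch_sub(1, Relaxed);
        }
    }

    /// Returns 0 ("do not cache") while state_gen is not being maintained.
    pub fn poll_gen(&self) -> u64 {
        if self.et_watcher_count.load(Relaxed) > 0 {
            self.state_gen.load(Relaxed)
        } else {
            0
        }
    }
}
```

With no ET watchers, a read or write pays only for the load and poll_gen() reports 0, so the epoll cache never trusts a generation that nobody is updating.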
Result: pipe 355ns vs. 487ns on Linux (0.73x). From 38% slower to 27% faster.
Fix 2: sigaction — lock_no_irq
Root cause: rt_sigaction.rs used signals.lock() which is the IRQ-safe
spinlock variant (cli + cmpxchg + sti ≈ 10–15ns overhead). Signal delivery
is never called from a hardware interrupt handler — only from the syscall
return path and from other processes via send_signal(). All callers run
in kernel task context with interrupts already managed.
Fix: Changed all six signals.lock() call sites to lock_no_irq():
- rt_sigaction.rs (the sigaction syscall handler)
- process.rs:send_signal() (inter-process signal delivery)
- process.rs:try_delivering_signal() (syscall return path)
- process.rs:execve() (signal reset on exec)
- process.rs:fork() and clone() (parent signal table cloning)
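The difference between the two lock variants can be illustrated with a userspace model. This is a sketch under stated assumptions, not the kernel's spinlock: the real cli/sti instructions are replaced by a simulated IRQ_ENABLED flag, and SpinLock, Guard, irq_enabled, and irq_toggles are illustrative names.

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Simulated interrupt-enable flag; a real kernel would execute cli/sti instead.
static IRQ_ENABLED: AtomicBool = AtomicBool::new(true);
/// Counts simulated interrupt-masking operations so the extra work is observable.
static IRQ_TOGGLES: AtomicUsize = AtomicUsize::new(0);

pub fn irq_enabled() -> bool { IRQ_ENABLED.load(Ordering::SeqCst) }
pub fn irq_toggles() -> usize { IRQ_TOGGLES.load(Ordering::SeqCst) }

pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: access to `value` is serialized by `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

pub struct Guard<'a, T> {
    lock: &'a SpinLock<T>,
    /// Some(previous state) only when the guard came from the IRQ-safe path.
    restore_irq: Option<bool>,
}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    fn acquire(&self) {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    /// IRQ-safe variant: mask interrupts first, then take the lock.
    pub fn lock(&self) -> Guard<'_, T> {
        let was_enabled = IRQ_ENABLED.swap(false, Ordering::SeqCst); // models "cli"
        IRQ_TOGGLES.fetch_add(1, Ordering::SeqCst);
        self.acquire();
        Guard { lock: self, restore_irq: Some(was_enabled) }
    }

    /// Cheaper variant for data that is never touched from an interrupt handler.
    pub fn lock_no_irq(&self) -> Guard<'_, T> {
        self.acquire();
        Guard { lock: self, restore_irq: None }
    }
}

impl<T> Deref for Guard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { unsafe { &*self.lock.value.get() } }
}

impl<T> DerefMut for Guard<'_, T> {
    fn deref_mut(&mut self) -> &mut T { unsafe { &mut *self.lock.value.get() } }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release);
        if let Some(was_enabled) = self.restore_irq {
            IRQ_ENABLED.store(was_enabled, Ordering::SeqCst); // models "sti" when needed
        }
    }
}
```

lock_no_irq() is only safe when no interrupt handler can try to take the same lock on the same CPU, which is the property the fix relies on for the signal table.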
Result: sigaction 112ns vs. 127ns on Linux (0.88x). From 23% slower to 12% faster.
Fix 3: epoll_wait — poll generation cache
Root cause: epoll_wait(timeout=0) called file.poll() via vtable on
every invocation even when the file's state hadn't changed. For the benchmark
(eventfd with counter=0, watching EPOLLIN), every call acquired the eventfd
lock, read counter=0, returned POLLOUT, then ANDed with EPOLLIN → 0.
~12–15ns per interest per call, all wasted.
Fix: Added per-interest poll result caching. Each Interest now tracks
cached_poll_gen and cached_poll_bits. A new poll_cached() helper checks
file.poll_gen() against the cached generation; if unchanged, it returns the
cached PollStatus without calling file.poll() at all.
For this to work, EventFd needed a generation counter. Added
state_gen: AtomicU64 to EventFd, incremented on every read or write
(counter change), with a poll_gen() override. Pipe already had state_gen
and poll_gen() from the EPOLLET work.
Files that don't implement poll_gen() return 0 (the default), which
disables caching — they always go through the real poll() path.
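A minimal sketch of the caching path described above. The trait and struct shapes are simplified assumptions, not the real definitions: the poll_calls counter exists only to make the effect visible, and starting EventFd's state_gen at 1 (so a never-written eventfd is still cacheable) is a guess about how the default 0 is kept distinct.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering::Relaxed};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct PollStatus(pub u32);
pub const POLLIN: u32 = 0x001;
pub const POLLOUT: u32 = 0x004;

pub trait FileLike {
    fn poll(&self) -> PollStatus;
    /// Default 0 means "no generation tracking", which disables caching.
    fn poll_gen(&self) -> u64 { 0 }
}

pub struct EventFd {
    counter: AtomicU64,
    state_gen: AtomicU64,
    poll_calls: AtomicUsize, // instrumentation for this example only
}

impl EventFd {
    pub fn new() -> Self {
        Self {
            counter: AtomicU64::new(0),
            state_gen: AtomicU64::new(1), // assumption: non-zero so caching is enabled
            poll_calls: AtomicUsize::new(0),
        }
    }

    /// Every counter change bumps the generation.
    pub fn write(&self, n: u64) {
        self.counter.fetch_add(n, Relaxed);
        self.state_gen.fetch_add(1, Relaxed);
    }

    pub fn poll_call_count(&self) -> usize { self.poll_calls.load(Relaxed) }
}

impl FileLike for EventFd {
    fn poll(&self) -> PollStatus {
        self.poll_calls.fetch_add(1, Relaxed);
        let mut bits = POLLOUT;
        if self.counter.load(Relaxed) > 0 {
            bits |= POLLIN;
        }
        PollStatus(bits)
    }

    fn poll_gen(&self) -> u64 { self.state_gen.load(Relaxed) }
}

pub struct Interest {
    cached_poll_gen: u64,
    cached_poll_bits: u32,
}

impl Interest {
    pub fn new() -> Self { Self { cached_poll_gen: 0, cached_poll_bits: 0 } }

    /// Returns the cached status when the file's generation is unchanged.
    pub fn poll_cached(&mut self, file: &dyn FileLike) -> PollStatus {
        // Read the generation before polling: a concurrent change then only
        // causes one extra re-poll, never a stale cache hit.
        let current_gen = file.poll_gen();
        if current_gen != 0 && current_gen == self.cached_poll_gen {
            return PollStatus(self.cached_poll_bits);
        }
        let status = file.poll();
        self.cached_poll_gen = current_gen;
        self.cached_poll_bits = status.0;
        status
    }
}
```

In this model, repeated epoll_wait(timeout=0) calls on an idle eventfd reach poll() once and afterwards pay only for the generation load.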
Result: epoll_wait 105ns vs. 101ns on Linux (1.04x). From 18% slower to within noise of Linux.
Fix 4: mmap_fault — prezeroed pool warmup
Root cause: The prezeroed huge page pool (8 entries) started empty on each
boot. The first eight 2MB faults triggered alloc_huge_page + zeroing (2MB
memset each). Combined with the EPT overhead inherent to KVM, this pushed the
benchmark to 1.28x.
Fix: Added prefill_huge_page_pool() in page_allocator.rs. Called from
boot_kernel() right after interrupt::init() (which initializes the page
allocator). It allocates 8 huge pages via alloc_huge_page() and feeds them
through free_huge_page_and_zero(), which zeroes each 2MB page and pushes it
into the pool. By the time userspace runs, all 8 pool slots are pre-filled.
With -mem-prealloc (used by bench-kvm), the host pages backing these
allocations are also pre-faulted, so the EPT entries are warm too.
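The warmup can be modeled in userspace roughly as follows. This is a sketch, not the code in page_allocator.rs: heap buffers stand in for physical huge pages, alloc_huge_page here is a stub that returns dirty memory, and PrezeroedPool and take_prezeroed are illustrative names.

```rust
/// Pool capacity and page size as described in the post.
pub const POOL_SLOTS: usize = 8;
pub const HUGE_PAGE_SIZE: usize = 2 * 1024 * 1024;

/// A heap buffer standing in for a physical 2MB page.
pub struct HugePage(pub Box<[u8]>);

pub struct PrezeroedPool {
    slots: Vec<HugePage>,
}

/// Stub allocator: returns a page with non-zero (dirty) contents.
fn alloc_huge_page() -> Option<HugePage> {
    Some(HugePage(vec![0xAA; HUGE_PAGE_SIZE].into_boxed_slice()))
}

impl PrezeroedPool {
    pub fn new() -> Self {
        Self { slots: Vec::with_capacity(POOL_SLOTS) }
    }

    pub fn len(&self) -> usize { self.slots.len() }

    /// Zero the page off the fault path and park it in the pool if there is room.
    pub fn free_huge_page_and_zero(&mut self, mut page: HugePage) {
        if self.slots.len() < POOL_SLOTS {
            page.0.fill(0);
            self.slots.push(page);
        }
        // Otherwise the page is dropped here; a real kernel would hand it
        // back to the page allocator instead.
    }

    /// Boot-time warmup: fill every slot before userspace starts faulting.
    pub fn prefill_huge_page_pool(&mut self) {
        while self.slots.len() < POOL_SLOTS {
            match alloc_huge_page() {
                Some(page) => self.free_huge_page_and_zero(page),
                None => break, // out of memory: leave the pool partially filled
            }
        }
    }

    /// Fault path: take an already-zeroed page if one is available.
    pub fn take_prezeroed(&mut self) -> Option<HugePage> {
        self.slots.pop()
    }
}
```

Routing the boot-time pages through the same free-and-zero path means the pool has a single way in, so warmup needs no separate zeroing logic.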
Result: mmap_fault 14ns vs. 1.6µs on Linux (0.01x). The benchmark now runs entirely from the pre-warmed pool with no allocation, zeroing, or EPT fault overhead.
Fix 5: pipe_grep — no change needed
At 1.11x before, pipe_grep was right at the marginal threshold. The root
cause is fork page-table duplication (~14µs per fork). The pipe fix's
indirect effect (faster pipe I/O in the grep pipeline) plus run-to-run
variance pushed it to 0.99x without any targeted change.
Architecture notes
The notify_epoll_et hook is a general mechanism: any file type that tracks
a generation counter for EPOLLET can use it to skip expensive state tracking
when no edge-triggered watchers exist. Currently only pipes implement it,
but sockets or timerfd could use the same pattern if needed.
The poll cache is also general-purpose. Any FileLike that implements
poll_gen() automatically gets cached poll results in epoll. The cache is
invalidated whenever the generation changes, and epoll_ctl(MOD) resets the
cache for the modified interest.
Summary
Four small, targeted fixes eliminated all five benchmark regressions. The key insight across all four: avoid work that the caller doesn't need. Don't do atomic RMW when nobody is watching (pipe). Don't disable interrupts when you're not in an interrupt (sigaction). Don't call poll() when nothing changed (epoll). Don't zero pages on the fault path when you can do it at boot (mmap_fault).