Beating Linux: Syscall Performance in a Rust Kernel

Blog 016 ended with getpid at 200ns and stat at 24µs — respectable, but still 60x behind Linux for path-based syscalls. Two root causes remained: the compiler was generating unoptimized code, and every operation paid unnecessary overhead in locks, allocations, and copies.

After this round, every core syscall benchmark beats native Linux:

Benchmark	Before	After	Linux Native	vs Linux
getpid	200 ns	63 ns	97 ns	1.5x faster
read_null	514 ns	89 ns	102 ns	1.1x faster
write_null	517 ns	91 ns	117 ns	1.3x faster
pipe	82,252 ns	290 ns	361 ns	1.2x faster
open_close	20,607 ns	510 ns	867 ns	1.7x faster
stat	23,234 ns	262 ns	389 ns	1.5x faster

The 50x fix: opt-level = 2

The dev profile in Cargo.toml had no opt-level setting, so it defaulted to 0: no optimization at all. Every function call was a real call and every variable was spilled to the stack, with no inlining and no constant propagation.

[profile.dev]
opt-level = 2
panic = "abort"

This single line improved getpid from 3,686ns to 65ns. Every other benchmark improved 5-50x. All the careful optimization work in blog 016 was running on unoptimized code — the real floor was 50x lower than what we measured.

We also set debug-assertions = false in the dev profile. Our SpinLock uses AtomicRefCell for deadlock tracking under cfg(debug_assertions), adding an atomic store on every lock release. With debug assertions off, every lock acquire/release got ~10ns cheaper.

Eliminating heap allocations from syscall paths

StackPathBuf: zero-alloc path resolution

Every stat(), open(), access(), and *at() syscall called resolve_path(), which heap-allocated three times: a Vec for reading the path bytes, a String for UTF-8 validation, and a PathBuf for the result.

StackPathBuf replaces all of this with a 256-byte stack buffer:

struct StackPathBuf {
    buf: [u8; 256], // path bytes copied from userspace
    len: usize,     // number of valid bytes in buf
}

A single read_cstr fills the buffer directly from userspace memory. Seven syscall handlers were converted to use it. Paths longer than 255 bytes — rare in practice — fall back to the heap path.
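To make the idea concrete, here is a minimal userspace sketch of how such a buffer could be filled and validated without allocating. It is not the kernel's code: `from_cstr_bytes` is a hypothetical stand-in for read_cstr that reads from a plain byte slice instead of userspace memory, and `STACK_PATH_CAP` and `as_str` are names invented for illustration. For simplicity the sketch returns `None` both for over-long paths and for invalid UTF-8, leaving the caller to take the heap path.

```rust
/// Size of the on-stack buffer, matching the 256 bytes described above.
const STACK_PATH_CAP: usize = 256;

/// Fixed-size path buffer that never touches the heap.
struct StackPathBuf {
    buf: [u8; STACK_PATH_CAP],
    len: usize,
}

impl StackPathBuf {
    /// Hypothetical stand-in for the kernel's `read_cstr`: copies bytes from
    /// `src` up to (not including) the first NUL. Returns `None` when no NUL
    /// appears within the first 256 bytes or the bytes are not valid UTF-8,
    /// which is the caller's cue to fall back to the heap-allocating path.
    fn from_cstr_bytes(src: &[u8]) -> Option<Self> {
        let window = &src[..src.len().min(STACK_PATH_CAP)];
        let len = window.iter().position(|&b| b == 0)?;
        // Validate in place instead of building a String.
        core::str::from_utf8(&window[..len]).ok()?;
        let mut buf = [0u8; STACK_PATH_CAP];
        buf[..len].copy_from_slice(&window[..len]);
        Some(Self { buf, len })
    }

    /// Borrow the path as `&str`; the bytes were validated on construction.
    fn as_str(&self) -> &str {
        core::str::from_utf8(&self.buf[..self.len]).unwrap_or("")
    }
}
```

A 255-byte path plus its NUL terminator fits exactly; anything longer yields `None`, which lines up with the fallback described above.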

Fast VFS lookup without PathComponent

The VFS lookup_path() method creates an Arc<PathComponent> for every path component traversed — a heap allocation plus a String clone for the component name. For stat("/tmp"): two allocations (root dir and "tmp"), both immediately discarded.

lookup_inode() is a new fast path that walks the directory tree directly, returning an INode enum without creating any PathComponent objects. It handles the common case (no .., no symlinks in intermediate components) and falls back to the full lookup_path() for the rest.

For stat("/tmp"): zero heap allocations instead of two.
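The shape of such a fast path can be sketched in a few lines. Everything here is illustrative: the `INode` variants, their fields, and the `BTreeMap` directory are assumptions made for the example, not Kevlar's real types. In this sketch `None` simply means "take the slow path", which is then responsible for handling `..`, symlinks, and error reporting.

```rust
use std::collections::BTreeMap;

/// Simplified inode type for this sketch (the real INode enum is not shown in the post).
enum INode {
    File { ino: u64 },
    Dir { ino: u64, entries: BTreeMap<String, INode> },
    Symlink { target: String },
}

/// Walk `path` from `root` without building any per-component objects.
/// Returns `None` whenever the caller should fall back to the full lookup.
fn lookup_inode<'a>(root: &'a INode, path: &str) -> Option<&'a INode> {
    let mut cur = root;
    let mut parts = path
        .split('/')
        .filter(|c| !c.is_empty() && *c != ".")
        .peekable();
    while let Some(name) = parts.next() {
        // ".." needs parent tracking, so leave it to the slow path.
        if name == ".." {
            return None;
        }
        let entries = match cur {
            INode::Dir { entries, .. } => entries,
            _ => return None,
        };
        let next = entries.get(name)?;
        // A symlink in an intermediate component must be resolved by the slow path.
        if matches!(next, INode::Symlink { .. }) && parts.peek().is_some() {
            return None;
        }
        cur = next;
    }
    Some(cur)
}
```

The walk borrows names straight out of the input path, so the common case touches no allocator at all.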

Lock-free Directory::inode_no()

Mount point checking used to call dir.stat() — which acquires a spinlock to copy out the full Stat struct — just to extract the inode number. Adding an inode_no() method to the Directory trait with a lock-free override in tmpfs eliminated this unnecessary lock.
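A rough sketch of that pattern, under stated assumptions: `std::sync::Mutex` stands in for the kernel SpinLock, and the `Stat` fields and the `TmpfsDir` layout are invented for the example. The trait provides a default `inode_no()` that goes through `stat()`, and the tmpfs type overrides it to read a field that never changes after creation.

```rust
use std::sync::Mutex; // stand-in for the kernel SpinLock in this sketch

#[derive(Clone, Copy)]
struct Stat {
    inode_no: u64,
    mode: u32,
    size: u64,
}

trait Directory {
    fn stat(&self) -> Stat;

    /// Default: correct for any filesystem, but pays for a full stat().
    fn inode_no(&self) -> u64 {
        self.stat().inode_no
    }
}

/// Hypothetical tmpfs directory: the inode number is fixed at creation.
struct TmpfsDir {
    inode_no: u64,
    inner: Mutex<Stat>,
}

impl Directory for TmpfsDir {
    fn stat(&self) -> Stat {
        // Takes the lock to copy out the whole struct.
        *self.inner.lock().unwrap()
    }

    /// Lock-free override: reads an immutable field directly.
    fn inode_no(&self) -> u64 {
        self.inode_no
    }
}
```

Filesystems that cannot answer without locking keep working through the default method, so the override is purely an opt-in optimization.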

Pipe: from 82µs to 290ns

The pipe implementation had three compounding problems.

No fast path: Even when data was immediately available, every read/write went through sleep_signalable_until() which enqueues the current process on the wait queue, checks for pending signals, and dequeues on completion. Three spinlock acquire/release cycles for every byte transferred.

Fix: try the operation first. If it succeeds, wake waiters and return immediately. Only enter the sleep loop when the buffer is genuinely full (writer) or empty (reader).
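As a sketch of the "try first" structure, assuming a simple `VecDeque`-backed pipe rather than Kevlar's real ring buffer: the `Pipe` and `PipeResult` types below are hypothetical, and the sleep loop itself is left to the caller, which only enters it on `WouldBlock`.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum PipeResult {
    /// The operation moved this many bytes without sleeping.
    Done(usize),
    /// Buffer full (writer) or empty (reader): the caller must enter the sleep loop.
    WouldBlock,
}

struct Pipe {
    buf: VecDeque<u8>,
    capacity: usize,
}

impl Pipe {
    fn new(capacity: usize) -> Self {
        Self { buf: VecDeque::with_capacity(capacity), capacity }
    }

    /// Fast path for writers: copy whatever fits and return immediately.
    fn write(&mut self, data: &[u8]) -> PipeResult {
        let n = data.len().min(self.capacity - self.buf.len());
        if n == 0 && !data.is_empty() {
            return PipeResult::WouldBlock;
        }
        self.buf.extend(&data[..n]);
        // A real pipe would wake sleeping readers here.
        PipeResult::Done(n)
    }

    /// Fast path for readers: drain whatever is available and return immediately.
    fn read(&mut self, out: &mut [u8]) -> PipeResult {
        let n = out.len().min(self.buf.len());
        if n == 0 && !out.is_empty() {
            return PipeResult::WouldBlock;
        }
        for (dst, src) in out.iter_mut().zip(self.buf.drain(..n)) {
            *dst = src;
        }
        // A real pipe would wake sleeping writers here.
        PipeResult::Done(n)
    }
}
```

Only the `WouldBlock` branch ever needs the wait queue, so the common case never touches it.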

Double-buffered copies: Writing to a pipe copied data from userspace into a temporary kernel buffer, then from the buffer into the ring buffer. Reading did the reverse. Two memcpy calls per direction.

Fix: RingBuffer::writable_contiguous() returns a mutable slice of the next free region. UserBufReader::read_bytes() copies directly from userspace into this slice — one copy instead of two.
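One way such an API could look is sketched below. Only the name writable_contiguous() comes from the text above; the head/len layout, `advance_write`, `read_into`, and `write_from_user` are assumptions for illustration, and a plain slice stands in for the userspace buffer. The sketch assumes a capacity `N` greater than zero.

```rust
struct RingBuffer<const N: usize> {
    buf: [u8; N],
    head: usize, // index of the oldest stored byte
    len: usize,  // number of stored bytes
}

impl<const N: usize> RingBuffer<N> {
    fn new() -> Self {
        Self { buf: [0; N], head: 0, len: 0 }
    }

    /// Next free region that is contiguous in memory. It can be shorter than
    /// the total free space when the free region wraps around the end.
    fn writable_contiguous(&mut self) -> &mut [u8] {
        let tail = (self.head + self.len) % N;
        let free = N - self.len;
        let end = (tail + free).min(N);
        &mut self.buf[tail..end]
    }

    /// Commit `n` bytes that were just written into `writable_contiguous()`.
    fn advance_write(&mut self, n: usize) {
        assert!(n <= N - self.len);
        self.len += n;
    }

    /// Simple byte-wise read, only here so the example can be checked.
    fn read_into(&mut self, out: &mut [u8]) -> usize {
        let n = out.len().min(self.len);
        for slot in out.iter_mut().take(n) {
            *slot = self.buf[self.head];
            self.head = (self.head + 1) % N;
        }
        self.len -= n;
        n
    }
}

/// Copy from the caller's buffer straight into the ring: one copy per chunk,
/// with no temporary kernel buffer in between.
fn write_from_user<const N: usize>(rb: &mut RingBuffer<N>, user: &[u8]) -> usize {
    let mut written = 0;
    while written < user.len() {
        let dst = rb.writable_contiguous();
        if dst.is_empty() {
            break; // ring is full
        }
        let n = dst.len().min(user.len() - written);
        dst[..n].copy_from_slice(&user[written..written + n]);
        rb.advance_write(n);
        written += n;
    }
    written
}
```

When the free space wraps, the loop simply runs twice, so a write costs at most two direct copies and never an intermediate one.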

Waking nobody: PIPE_WAIT_QUEUE.wake_all() acquired its spinlock on every write, even when no process was sleeping on it.

Fix: WaitQueue::waiter_count tracks the number of sleeping processes with an AtomicUsize. wake_all() checks this with a relaxed load and returns immediately when zero — skipping the spinlock entirely.
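The check can be sketched as follows, with several assumptions: `std::sync::Mutex` stands in for the spinlock, waiters are plain `u32` process IDs, and the `locks_taken` counter exists only so the example can show that the empty case skips the lock. A real implementation also has to make sure a process that is about to sleep re-checks its condition after registering itself, otherwise the relaxed check could miss it; that part is not modeled here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

struct WaitQueue {
    waiter_count: AtomicUsize,
    waiters: Mutex<Vec<u32>>,       // stand-in for the spinlock-protected queue
    lock_acquisitions: AtomicUsize, // instrumentation for this example only
}

impl WaitQueue {
    fn new() -> Self {
        Self {
            waiter_count: AtomicUsize::new(0),
            waiters: Mutex::new(Vec::new()),
            lock_acquisitions: AtomicUsize::new(0),
        }
    }

    fn lock(&self) -> MutexGuard<'_, Vec<u32>> {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        self.waiters.lock().unwrap()
    }

    /// Register a sleeping process.
    fn enqueue(&self, pid: u32) {
        let mut waiters = self.lock();
        waiters.push(pid);
        self.waiter_count.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns the processes to wake; skips the lock when nobody is waiting.
    fn wake_all(&self) -> Vec<u32> {
        if self.waiter_count.load(Ordering::Relaxed) == 0 {
            return Vec::new();
        }
        let mut waiters = self.lock();
        self.waiter_count.store(0, Ordering::Relaxed);
        std::mem::take(&mut *waiters)
    }

    fn locks_taken(&self) -> usize {
        self.lock_acquisitions.load(Ordering::Relaxed)
    }
}
```

In a benchmark where the reader never sleeps, every `wake_all()` call takes the early return and costs one atomic load.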

tmpfs: lock-free stat and lighter locks

Directory stat() in tmpfs acquired a spinlock to copy out a Stat struct that never changes after creation (mode and inode number are set at Dir::new() time). Moving the Stat out of the locked DirInner and into the Dir struct itself made Dir::stat() lock-free.

All remaining tmpfs locks were changed from lock() (which does pushfq; cli; ...; sti; popfq) to lock_no_irq() (which does nothing extra). Tmpfs is never accessed from interrupt context, so the interrupt save/restore was pure waste — ~20ns saved per lock acquire/release.

Hardware-optimized memory operations

Our custom memset and memcpy (needed because the kernel runs with SSE disabled) used manual 8-byte store loops: 512 iterations to zero a 4KB page. Modern x86 CPUs have hardware-optimized rep stosb/rep movsb (Enhanced REP MOVSB/STOSB, ERMS) that fill and copy memory at cache-line granularity.

// Before: 512 iterations of write_unaligned
while i + 8 <= n {
    (dest.add(i) as *mut u64).write_unaligned(word);
    i += 8;
}

// After: single hardware-optimized instruction
core::arch::asm!("rep stosb", ...);

zero_page() uses rep stosq specifically, zeroing 4KB in ~50 cycles instead of ~500.
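For readers who want to try the instruction outside a kernel, here is a self-contained sketch of a `rep stosb` fill. The function name `fill_bytes` and the operand choices are ours, not the kernel's memset; the register roles (count in RCX, destination in RDI, fill byte in AL) are what the instruction itself requires. On non-x86_64 targets the sketch falls back to the standard slice fill so it still compiles.

```rust
/// Fill `buf` with `byte` using a single `rep stosb` on x86_64.
fn fill_bytes(buf: &mut [u8], byte: u8) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        // rep stosb stores AL to [RDI], RCX times, advancing RDI each time.
        // Rust guarantees the direction flag is clear on entry to asm blocks,
        // so the fill always moves forward through the buffer.
        core::arch::asm!(
            "rep stosb",
            inout("rcx") buf.len() => _,
            inout("rdi") buf.as_mut_ptr() => _,
            in("al") byte,
            options(nostack, preserves_flags),
        );
    }
    // Portable fallback so the sketch builds on other architectures.
    #[cfg(not(target_arch = "x86_64"))]
    buf.fill(byte);
}
```

The slice-based signature keeps the unsafe block small: the pointer and length handed to the instruction always describe memory the caller already owns mutably.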

Demand paging: the KVM tax

The one benchmark we couldn't close was mmap_fault — anonymous page fault throughput. A three-way comparison revealed why:

Benchmark	Linux Native	Linux KVM	Kevlar KVM
mmap_fault	1,047 ns	2,104 ns	3,808 ns

Linux-in-KVM is already 2x slower than Linux-native for page faults. Every newly mapped guest page triggers an EPT (Extended Page Table) violation: the CPU exits the guest, KVM updates the host's nested page tables, then re-enters the guest. This costs ~1,000 cycles per page and doesn't exist on bare metal.

Against the fair baseline (Linux KVM), Kevlar is 1.8x behind: real overhead from our bitmap allocator and simpler page table code, but not the 3.6x it appeared against native Linux.

We did fix one clear waste: pages were being zeroed twice. alloc_pages() zeroed the page under the allocator lock, then handle_page_fault() zeroed it again. Passing DIRTY_OK to the allocator and zeroing once after the lock is released eliminated the redundant memset and reduced lock hold time.
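The change can be illustrated with a toy allocator. This is a sketch under assumptions, not the real allocator: a `Mutex<Vec<...>>` free list stands in for the bitmap allocator, the `AllocFlags` enum models the DIRTY_OK flag, and `alloc_page` handles a single page where the real alloc_pages() presumably takes a count.

```rust
use std::sync::Mutex;

const PAGE_SIZE: usize = 4096;
type Page = Box<[u8; PAGE_SIZE]>;

#[derive(Clone, Copy, PartialEq)]
enum AllocFlags {
    /// Allocator zeroes the page itself, while holding its lock.
    Zeroed,
    /// Caller accepts stale contents and takes responsibility for zeroing.
    DirtyOk,
}

struct PageAllocator {
    free: Mutex<Vec<Page>>, // stand-in for the real allocator state
}

impl PageAllocator {
    fn new(pages: Vec<Page>) -> Self {
        Self { free: Mutex::new(pages) }
    }

    fn alloc_page(&self, flags: AllocFlags) -> Option<Page> {
        let mut free = self.free.lock().unwrap();
        let mut page = free.pop()?;
        if flags == AllocFlags::Zeroed {
            page.fill(0); // this memset runs with the lock held
        }
        Some(page)
    }
}

/// Fault handler sketch: request a dirty page, then zero it exactly once,
/// after alloc_page() has returned and the allocator lock is released.
fn handle_page_fault(alloc: &PageAllocator) -> Option<Page> {
    let mut page = alloc.alloc_page(AllocFlags::DirtyOk)?;
    page.fill(0);
    Some(page)
}
```

Other callers can still pass the zeroing flag, so only the fault path gives up the allocator's guarantee in exchange for the shorter critical section.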

The optimization stack

Each layer builds on the previous:

opt-level=2 (50x): Let the compiler do its job.
debug-assertions=false (1.2x): Remove per-lock atomic overhead.
StackPathBuf (2-3x for path syscalls): Zero heap allocations.
Fast lookup_inode (2-3x for path syscalls): Zero PathComponent allocations.
Pipe fast path (280x): Skip wait queue when data is available.
Lock-free tmpfs stat (1.3x): Don't lock immutable data.
lock_no_irq everywhere (1.1x): Don't save/restore interrupts when not needed.
rep stosb/movsb (1.1x): Let the CPU's microcode handle bulk memory operations.

The lesson is familiar: measure, find the biggest bottleneck, fix it, repeat. The profiler from blog 016 paid for itself many times over.

What's next

The mmap_fault gap (1.8x vs Linux KVM) needs page allocator work — our bitmap allocator is a placeholder that should be replaced with a proper buddy allocator. The fork benchmark is disabled pending a page table duplication bug fix. And we haven't started on the dcache (directory entry cache) that would make repeated path lookups nearly free.

But for the core syscall path — the thing every program does thousands of times per second — Kevlar now beats Linux. In Rust, with #![deny(unsafe_code)] on the kernel crate, running in a virtual machine.
