From 13µs to 200ns: Four Rounds of KVM Performance Work

Our benchmarks showed getpid taking 13,000 ns per call on KVM — about 65x slower than native Linux. read(/dev/null) was 26 µs, stat was 264 µs. The kernel was functionally correct but unusably slow under virtualization.

Four rounds of targeted optimization, guided by a new profiling infrastructure we built along the way, brought these numbers down to near-Linux performance:

Benchmark	Start	Final	Speedup
getpid	13,000 ns	200 ns	65x
read_null	26,000 ns	514 ns	51x
write_null	28,000 ns	517 ns	54x
pipe	625,000 ns	82,252 ns	7.6x
stat	264,000 ns	23,234 ns	11x
open_close	95,000 ns	20,607 ns	4.6x

Round 1: Eliminating VM exits

Under KVM, port I/O (in/out) and MMIO writes cause VM exits — 1-10 µs each. We were generating thousands of unnecessary exits per second.

Serial TX busy-wait: QEMU's virtual UART is always ready, but we polled inb(LSR) before every character. Each poll is a VM exit. Fix: skip the poll, write directly.

VGA cursor updates: Every character printed to serial was also sent to VGA, where move_cursor() does 4 outb() calls. For 80 characters of output: 320 wasted VM exits. Fix: VGA only used at boot.

Interrupt trace logging: An unconditional trace!() in the interrupt handler wrote formatted strings to serial on every non-timer IRQ. Fix: remove; the structured debug event system handles tracing when explicitly enabled.

1000 Hz timer: One PIT interrupt per millisecond, each causing a VM exit for delivery plus MMIO for EOI acknowledgment. Fix: reduce to 100 Hz (same 30 ms preemption interval, 3 ticks instead of 30).

APIC spinlock: Every IRQ did APIC.lock().write_eoi() — our SpinLock disables interrupts, checks for deadlocks, acquires the lock, does the MMIO write, releases the lock, restores interrupts. On a single-CPU kernel with interrupts already disabled: pure overhead. Fix: inline the EOI write.

Signal spinlock per syscall: Every syscall exit acquired a spinlock to check for pending signals — even when none were pending. Fix: AtomicU32 mirror of the pending bitmask, checked with a relaxed load.

Result: getpid went from 13,000 ns to 200 ns. Everything else improved 1.5-5x. But we couldn't measure precisely — our clock only had 10 ms resolution.

Round 2: Nanosecond clock and profiling infrastructure

TSC calibration

clock_gettime(CLOCK_MONOTONIC) was tick-based at 100 Hz — 10 ms granularity. We calibrated the TSC against PIT channel 2 during early boot:

#![allow(unused)]
fn main() {
// Program PIT channel 2 for ~10ms one-shot
let tsc_start = rdtscp();
while inb(0x61) & 0x20 == 0 { spin_loop(); }  // wait for terminal count
let tsc_end = rdtscp();
let freq = (tsc_end - tsc_start) * PIT_HZ / pit_count;
}

Now nanoseconds_since_boot() is a single rdtscp instruction with lock-free atomic reads. Wired into clock_gettime(CLOCK_MONOTONIC) for ns-resolution userspace timing. Also fixed a latent bug where tv_nsec returned total nanoseconds instead of the sub-second component.

Per-syscall cycle profiler

512-entry fixed array indexed by syscall number, lock-free atomics tracking total cycles, call count, min, and max per syscall. Two rdtscp calls bracketing do_dispatch() — ~10 ns overhead when enabled, zero when disabled (single atomic bool check).

Enabled via KEVLAR_DEBUG="profile". On init process exit, dumps JSONL:

{"nr":39,"name":"getpid","calls":10001,"avg_ns":49,"min_ns":38,"max_ns":9950}
{"nr":0,"name":"read","calls":5032,"avg_ns":12798,"min_ns":11329,"max_ns":126032}

The profiler immediately revealed the next bottleneck: every syscall that touches a file pays ~12 µs for spinlock overhead, while getpid (no locks) costs only 49 ns. The lock is the problem.

Round 3: The spinlock backtrace tax

The profiler showed read/write/close all clustered at ~13 µs regardless of what the actual syscall did. /dev/null read returns Ok(0) immediately — the 13 µs was entirely in the surrounding infrastructure.

The culprit was in our SpinLock::lock():

#![allow(unused)]
fn main() {
// In debug builds, EVERY lock acquire:
#[cfg(debug_assertions)]
if is_kernel_heap_enabled() {
    *self.locked_by.borrow_mut() = Some(CapturedBacktrace::capture());
}
}

CapturedBacktrace::capture() does:

Box::new(ArrayVec::new()) — heap allocation
Walk the entire call stack frame by frame
Resolve each frame's symbol via the kernel symbol table

This ran on every lock acquire, even when uncontended. On a single-CPU kernel, locks are never contended (contention = deadlock). The backtrace was only useful when the deadlock detector fired — which never happens in normal operation.

Fix: remove the per-acquire capture. The deadlock detector still works (it prints the warning when is_locked() is true on entry).

Also removed unconditional trace!() calls from sys_read, sys_write, and sys_open that formatted PID, cmdline, inode Debug, and length on every call.

Result: read dropped from 12,798 to 391 ns (36x). The profiler paid for itself immediately.

Round 4: Eliminating hidden costs

The profiler showed the next bottlenecks clearly:

getpid:         49 ns   — pure syscall overhead floor
read:          391 ns   — fd table lock + dyn dispatch
clock_gettime: 1,702 ns — TSC read + usercopy

Three targeted fixes:

Fixed-point TSC conversion: nanoseconds_since_boot() was doing two u64 divisions per call — delta / freq and remainder * 10^9 / freq. Each div r64 is 30-80 cycles on x86_64. Fix: precompute a fixed-point multiplier during calibration (mult = 10^9 << 32 / freq), then convert via a single u128 multiply: ns = (delta * mult) >> 32. Two divisions (~100 cycles) replaced by one multiply (~6 cycles).

lock_no_irq() spinlock variant: Our SpinLock saves RFLAGS, disables interrupts (cli), acquires the lock, and restores interrupts (sti) on release. For locks never touched by interrupt handlers — fd tables, root_fs, signal state — the cli/sti is wasted work. lock_no_irq() skips the interrupt save/restore while keeping the deadlock detector.

Single usercopy in clock_gettime: Two separate 8-byte writes (tv_sec, tv_nsec) each paid the access_ok check and function call overhead. Packing both into a single 16-byte Timespec struct and writing it in one usercopy halved the overhead.

Result: clock_gettime dropped from 1,702 to ~750 ns (56% faster). read dropped from 391 to 311 ns (20% faster). getpid from 279 to 200 ns (28% faster from userspace).

The profiler's view of the final state

getpid:         45 ns   — pure syscall overhead floor
read:          311 ns   — fd table lock_no_irq + dyn dispatch
write:         806 ns   — fd table lock_no_irq + dyn dispatch + output
close:       1,513 ns   — fd table lock_no_irq + cleanup
clock_gettime:  750 ns  — fixed-point TSC + single usercopy
open:       19,021 ns   — path resolution dominates
stat:       23,928 ns   — path resolution + inode stat
fork:    2,820,909 ns   — page table copy + allocation

The gap between getpid (45 ns) and read (311 ns) is now ~7x — the fd table spinlock acquire + Arc clone + virtual dispatch through FileLike. Further closing this gap would require lock-free fd table access (safe on single-CPU) or amortizing the lock across multiple operations.

The gap between read (311 ns) and stat (24 µs) is path resolution — the VFS walk through string comparisons and directory inode lookups. Linux uses a dcache (directory entry cache) with RCU-protected hash lookups to make this fast. Building an equivalent is the next major optimization target.

What we learned

Measure before optimizing. The TSC profiler cost us ~30 minutes to build and immediately identified the backtrace capture as the bottleneck — something we'd never have found by reading code.
Debug instrumentation must be zero-cost when disabled. Our trace!() macros, backtrace capture, and VGA output all ran unconditionally. Each was "just a few microseconds" but they compounded to 100x overhead.
VM exits are the KVM tax. Every in/out instruction, every MMIO write, every interrupt costs 1-10 µs. Linux kernels are carefully optimized to minimize these; we had them scattered everywhere.
Division is the hidden tax. Two u64 divisions in the TSC conversion cost ~100 cycles — invisible until the profiler pointed at clock_gettime. Fixed-point arithmetic (precomputed multiply + shift) is standard in Linux's timekeeping for exactly this reason.
Not all locks need interrupt safety. Our SpinLock always did cli/sti, but most kernel locks are never touched by interrupt handlers. A lock_no_irq() variant that skips the interrupt save/restore gave 20% improvement on every fd-table-touching syscall.
The profiler is permanent infrastructure. Every future optimization can be validated with KEVLAR_DEBUG="profile" — we'll never again wonder "is this syscall slow?" without data.

Kevlar Documentation