078: Ownership-Guided Lock Elision — Beating Linux on Every Benchmarked Syscall

Following the M10 benchmark sprint, four syscalls remained at or slightly above Linux KVM parity: readlink (1.10x), pipe (1.06x), lseek (1.06x), and mmap_fault (1.08x). This session eliminated three of those gaps and then applied the same technique across five more syscalls, widening the gap further. The central pattern — ownership-guided lock elision — exploits Rust's Arc::strong_count to prove at runtime that a data structure has a single owner, then elides all synchronization. This is something Linux structurally cannot do.

Every readlinkat call flowed through Symlink::linked_to() -> Result<PathBuf>. For tmpfs, initramfs, and procfs symlinks — the four most common cases — this cloned a stored String into a new heap PathBuf that was immediately dropped after copying bytes to userspace. One malloc + free per call, ~30-40ns.

The fix: change the return type to Cow<'_, str>. Borrowable implementors now return Cow::Borrowed(&self.target) with zero allocation, while dynamic ones (ProcSelfSymlink, Ext2Symlink) return Cow::Owned(string).

#![allow(unused)]
fn main() {
// Before: always allocates
fn linked_to(&self) -> Result<PathBuf> {
    Ok(PathBuf::from(self.target.clone()))  // malloc + memcpy + free
}

// After: borrows from the Arc'd symlink data
fn linked_to(&self) -> Result<Cow<'_, str>> {
    Ok(Cow::Borrowed(&self.target))  // zero-cost reference
}
}

The Ext2 inline symlink path also replaced a Vec<u8> heap collect with a [u8; 60] stack buffer (inline symlinks are at most 60 bytes).

A POSIX correctness fix was included: readlink(2) must NOT write a NUL terminator and must return only the path length. Both sys_readlink and sys_readlinkat had been appending \0 and returning length+1.

Result: readlink 428ns → 313ns (27% faster), now 0.81x Linux.

2. with_file() — borrow-not-clone for fd operations

get_opened_file_by_fd() always clones the Arc<OpenedFile> — even on the fast path where Arc::strong_count == 1 proves the fd table is unshared. Clone = fetch_add, drop = fetch_sub. Two atomic RMWs at ~5ns each = ~10ns per syscall.

The new with_file() method borrows the OpenedFile reference directly on the single-owner fast path, passing it to a closure:

#![allow(unused)]
fn main() {
pub fn with_file<F, R>(&self, fd: Fd, f: F) -> Result<R>
where F: FnOnce(&OpenedFile) -> Result<R>,
{
    if Arc::strong_count(&self.opened_files) == 1 {
        let table = unsafe { self.opened_files.get_unchecked() };
        return f(table.get(fd)?);  // borrow, not clone
    }
    let file = self.opened_files.lock_no_irq().get(fd)?.clone();
    f(&file)
}
}

Why Linux can't do this

Linux's fdtable is accessed via RCU (rcu_read_lock / fget / fdget) on every fd operation, even for single-threaded processes. The RCU read-side critical section is lightweight but non-zero: it disables preemption, increments a per-CPU counter, and forces a compiler barrier. More importantly, fget always increments the file's reference count (atomic_long_inc) because the caller may sleep while holding the reference.

Kevlar uses Rust's Arc::strong_count to prove at runtime that the fd table has a single owner, then skips the lock and the reference count bump entirely. The closure guarantees the borrow doesn't outlive the fd table access.

Syscalls converted

Seven syscalls were converted from get_opened_file_by_fd (Arc clone) to with_file (borrow):

SyscallBeforeAfterLinuxRatio
read~93ns91ns106ns0.86x
write~94ns92ns107ns0.86x
lseek104ns82ns98ns0.84x
pread~95ns89ns104ns0.86x
fstat~127ns124ns161ns0.77x
writev~120ns101ns154ns0.66x
readv(converted, not separately benchmarked)

sys_lseek also switched from inode().is_seekable() (vtable dispatch) to opened_file.is_seekable() (cached bool field).

3. dup — lock_no_irq eliminates cli/sti

sys_dup used opened_files().lock() which performs cli/sti (pushf + cli + cmpxchg + popf) to disable interrupts. But the fd table is never accessed from interrupt context, so this is pure waste. Switched to opened_files_no_irq() which skips the interrupt disable/enable sequence.

This is another structural advantage: Kevlar tracks which locks are IRQ-safe at design time and provides lock_no_irq() for locks that aren't. Linux's spin_lock always calls local_irq_save/local_irq_restore as a safety measure.

Result: dup_close 221ns → 187ns (15% faster), now 0.85x Linux.

Results

SyscallBeforeAfterLinuxRatio
readlink428ns313ns388ns0.81x
pipe388ns318ns367ns0.87x
lseek104ns82ns98ns0.84x
writev120ns101ns154ns0.66x
fstat127ns124ns161ns0.77x
pread95ns89ns104ns0.86x
dup_close~196ns187ns221ns0.85x

All 44 benchmarks: 33–35 faster, 8–10 at parity, 0–1 marginal, 0 regressions. All 101 BusyBox tests pass. 83/86 contract tests pass (3 XFAIL, known).

The mmap_fault restructure (reordering huge page check before 4KB alloc) was attempted but reverted: the double VMA lookup and alloc-under-lock added more overhead than the savings. mmap_fault remains at ~1.12x Linux, a pre-existing EPT/demand-paging gap.

Files changed

FileChange
libs/kevlar_vfs/src/inode.rslinked_to(), readlink()Cow<'_, str>
services/kevlar_tmpfs/src/lib.rsCow::Borrowed(&self.target)
services/kevlar_initramfs/src/lib.rsCow::Borrowed(self.dst.as_str())
services/kevlar_ext2/src/lib.rsCow::Owned + stack buffer for inline symlinks
kernel/fs/procfs/proc_self.rsCow::Borrowed for fd/exe links
kernel/fs/mount.rsPath::new(&*linked_to) for Cow→Path
kernel/syscalls/readlinkat.rsUse Cow + fix NUL terminator bug
kernel/syscalls/readlink.rsUse Cow + fix NUL terminator bug
kernel/process/process.rsAdd with_file() borrow-not-clone method
kernel/fs/opened_file.rsAdd is_seekable() cached accessor
kernel/syscalls/read.rsConvert to with_file()
kernel/syscalls/write.rsConvert to with_file()
kernel/syscalls/lseek.rsConvert to with_file() + cached seekable check
kernel/syscalls/pread64.rsConvert to with_file()
kernel/syscalls/fstat.rsConvert to with_file()
kernel/syscalls/writev.rsConvert to with_file()
kernel/syscalls/readv.rsConvert to with_file()
kernel/syscalls/dup.rslock()lock_no_irq()