078: Ownership-Guided Lock Elision — Beating Linux on Every Benchmarked Syscall

Following the M10 benchmark sprint, four syscalls remained at or slightly above Linux KVM parity: readlink (1.10x), pipe (1.06x), lseek (1.06x), and mmap_fault (1.08x). This session eliminated three of those gaps and then applied the same technique across five more syscalls, widening the gap further. The central pattern — ownership-guided lock elision — exploits Rust's Arc::strong_count to prove at runtime that a data structure has a single owner, then elides all synchronization. This is something Linux structurally cannot do.

1. readlink — Cow eliminates heap allocation

Every readlinkat call flowed through Symlink::linked_to() -> Result<PathBuf>. For tmpfs, initramfs, and procfs symlinks — the four most common cases — this cloned a stored String into a new heap PathBuf that was immediately dropped after copying bytes to userspace. One malloc + free per call, ~30-40ns.

The fix: change the return type to Cow<'_, str>. Borrowable implementors now return Cow::Borrowed(&self.target) with zero allocation, while dynamic ones (ProcSelfSymlink, Ext2Symlink) return Cow::Owned(string).

#![allow(unused)]
fn main() {
// Before: always allocates
fn linked_to(&self) -> Result<PathBuf> {
    Ok(PathBuf::from(self.target.clone()))  // malloc + memcpy + free
}

// After: borrows from the Arc'd symlink data
fn linked_to(&self) -> Result<Cow<'_, str>> {
    Ok(Cow::Borrowed(&self.target))  // zero-cost reference
}
}

The Ext2 inline symlink path also replaced a Vec<u8> heap collect with a [u8; 60] stack buffer (inline symlinks are at most 60 bytes).

A POSIX correctness fix was included: readlink(2) must NOT write a NUL terminator and must return only the path length. Both sys_readlink and sys_readlinkat had been appending \0 and returning length+1.

Result: readlink 428ns → 313ns (27% faster), now 0.81x Linux.

2. with_file() — borrow-not-clone for fd operations

get_opened_file_by_fd() always clones the Arc<OpenedFile> — even on the fast path where Arc::strong_count == 1 proves the fd table is unshared. Clone = fetch_add, drop = fetch_sub. Two atomic RMWs at ~5ns each = ~10ns per syscall.

The new with_file() method borrows the OpenedFile reference directly on the single-owner fast path, passing it to a closure:

#![allow(unused)]
fn main() {
pub fn with_file<F, R>(&self, fd: Fd, f: F) -> Result<R>
where F: FnOnce(&OpenedFile) -> Result<R>,
{
    if Arc::strong_count(&self.opened_files) == 1 {
        let table = unsafe { self.opened_files.get_unchecked() };
        return f(table.get(fd)?);  // borrow, not clone
    }
    let file = self.opened_files.lock_no_irq().get(fd)?.clone();
    f(&file)
}
}

Why Linux can't do this

Linux's fdtable is accessed via RCU (rcu_read_lock / fget / fdget) on every fd operation, even for single-threaded processes. The RCU read-side critical section is lightweight but non-zero: it disables preemption, increments a per-CPU counter, and forces a compiler barrier. More importantly, fget always increments the file's reference count (atomic_long_inc) because the caller may sleep while holding the reference.

Kevlar uses Rust's Arc::strong_count to prove at runtime that the fd table has a single owner, then skips the lock and the reference count bump entirely. The closure guarantees the borrow doesn't outlive the fd table access.

Syscalls converted

Seven syscalls were converted from get_opened_file_by_fd (Arc clone) to with_file (borrow):

Syscall	Before	After	Linux	Ratio
read	~93ns	91ns	106ns	0.86x
write	~94ns	92ns	107ns	0.86x
lseek	104ns	82ns	98ns	0.84x
pread	~95ns	89ns	104ns	0.86x
fstat	~127ns	124ns	161ns	0.77x
writev	~120ns	101ns	154ns	0.66x
readv	(converted, not separately benchmarked)

sys_lseek also switched from inode().is_seekable() (vtable dispatch) to opened_file.is_seekable() (cached bool field).

3. dup — lock_no_irq eliminates cli/sti

sys_dup used opened_files().lock() which performs cli/sti (pushf + cli + cmpxchg + popf) to disable interrupts. But the fd table is never accessed from interrupt context, so this is pure waste. Switched to opened_files_no_irq() which skips the interrupt disable/enable sequence.

This is another structural advantage: Kevlar tracks which locks are IRQ-safe at design time and provides lock_no_irq() for locks that aren't. Linux's spin_lock always calls local_irq_save/local_irq_restore as a safety measure.

Result: dup_close 221ns → 187ns (15% faster), now 0.85x Linux.

Results

Syscall	Before	After	Linux	Ratio
readlink	428ns	313ns	388ns	0.81x
pipe	388ns	318ns	367ns	0.87x
lseek	104ns	82ns	98ns	0.84x
writev	120ns	101ns	154ns	0.66x
fstat	127ns	124ns	161ns	0.77x
pread	95ns	89ns	104ns	0.86x
dup_close	~196ns	187ns	221ns	0.85x

All 44 benchmarks: 33–35 faster, 8–10 at parity, 0–1 marginal, 0 regressions. All 101 BusyBox tests pass. 83/86 contract tests pass (3 XFAIL, known).

The mmap_fault restructure (reordering huge page check before 4KB alloc) was attempted but reverted: the double VMA lookup and alloc-under-lock added more overhead than the savings. mmap_fault remains at ~1.12x Linux, a pre-existing EPT/demand-paging gap.

Files changed

File	Change
`libs/kevlar_vfs/src/inode.rs`	`linked_to()`, `readlink()` → `Cow<'_, str>`
`services/kevlar_tmpfs/src/lib.rs`	`Cow::Borrowed(&self.target)`
`services/kevlar_initramfs/src/lib.rs`	`Cow::Borrowed(self.dst.as_str())`
`services/kevlar_ext2/src/lib.rs`	`Cow::Owned` + stack buffer for inline symlinks
`kernel/fs/procfs/proc_self.rs`	`Cow::Borrowed` for fd/exe links
`kernel/fs/mount.rs`	`Path::new(&*linked_to)` for Cow→Path
`kernel/syscalls/readlinkat.rs`	Use Cow + fix NUL terminator bug
`kernel/syscalls/readlink.rs`	Use Cow + fix NUL terminator bug
`kernel/process/process.rs`	Add `with_file()` borrow-not-clone method
`kernel/fs/opened_file.rs`	Add `is_seekable()` cached accessor
`kernel/syscalls/read.rs`	Convert to `with_file()`
`kernel/syscalls/write.rs`	Convert to `with_file()`
`kernel/syscalls/lseek.rs`	Convert to `with_file()` + cached seekable check
`kernel/syscalls/pread64.rs`	Convert to `with_file()`
`kernel/syscalls/fstat.rs`	Convert to `with_file()`
`kernel/syscalls/writev.rs`	Convert to `with_file()`
`kernel/syscalls/readv.rs`	Convert to `with_file()`
`kernel/syscalls/dup.rs`	`lock()` → `lock_no_irq()`

Kevlar Documentation