Blog 113: ext4 performance — 105x faster creates, reads at 1.3x Linux

Date: 2026-03-23 Milestone: M10 Alpine Linux

Summary

Three ext4 optimizations close the performance gap with Linux from 375-3600x down to 5-7x for metadata operations and 1.3x for sequential reads. File creation improved 105x, deletion 253x, open+close 81x. Sequential reads with large buffers reached 4.3 GB/s — within 30% of Linux KVM.

The Problem

Benchmarking Kevlar's ext4 implementation against Linux under identical KVM/QEMU conditions revealed catastrophic performance gaps:

Operation	Linux KVM	Kevlar	Ratio
seq_write (4K buf)	~3 GB/s	0.8 MB/s	3600x
seq_read (4K buf)	~5.4 GB/s	87 MB/s	62x
file create	~5 us	3,782 us	760x
open+close	~3 us	1,131 us	375x

Root causes: no block caching, synchronous metadata flush on every allocation, linear-scan data structures.

Optimization 1: Block Read Cache (LRU, 512 entries)

Added a 512-entry LRU read cache to Ext2Inner alongside the existing dirty write cache. Inode table blocks and directory blocks are read repeatedly during path resolution — the same block is re-read dozens of times for a single ls -la. The cache eliminates redundant disk reads.

read_block() now checks: dirty cache (BTreeMap, O(log n)) → read cache (Vec with access_count eviction) → block device.

Impact: stat improved from ~100us to ~5us (mostly from caching inode table blocks).

Optimization 2: Deferred Metadata Flush

The original code called flush_metadata() after every block or inode allocation. This wrote the entire superblock + group descriptor table to disk — 2 disk reads + multiple disk writes per allocation. Writing a 1MB file (256 block allocations) triggered 512 extra disk reads and 512 extra disk writes just for metadata.

Replaced all 5 flush_metadata() call sites in alloc_block, alloc_block_near, free_block, alloc_inode, and free_inode with a single mark_metadata_dirty() flag. The actual superblock + GDT write is deferred until flush_all(), called from fsync().

This is the single highest-impact change: file creation dropped from 3,782us to 36us (105x).

Optimization 3: BTreeMap Dirty Cache with Sorted Flush

Replaced the Vec<DirtyBlock> dirty write cache with BTreeMap<u64, Vec<u8>>:

O(log n) lookup instead of O(n) linear scan for duplicate detection
Naturally sorted iteration — flush writes blocks in ascending order, giving the block device sequential I/O patterns
Increased capacity from 64 to 1024 entries (4MB buffer before forced flush)

The sorted flush is important because virtio-blk batch reads are aligned to sector boundaries. Sequential writes hit the same batch window, reducing individual I/O requests.

Results

All 29 ext4 tests + Alpine apk install + curl HTTP pass.

Benchmark	Before	After	Speedup	vs Linux
seq_write (4K buf)	837 KB/s	1,110 KB/s	1.3x	~2700x
seq_write (128K buf)	1,719 KB/s	3,396 KB/s	2.0x	~880x
seq_read (4K buf)	110 MB/s	252 MB/s	2.3x	~21x
seq_read (32K buf)	161 MB/s	3.9 GB/s	24x	1.4x
seq_read (128K buf)	156 MB/s	4.3 GB/s	28x	1.3x
create	3,782 us	36 us	105x	~7x
delete	2,275 us	9 us	253x	—
open+close	1,131 us	14 us	81x	~5x
stat	4.7 us	4.6 us	~same	~9x

Sequential reads with 128K buffers (4.3 GB/s) are within 30% of Linux KVM (5.4 GB/s). This is near-parity — the remaining gap is VFS overhead and the Vec<u8> clone per block in read_block().

Remaining Gaps

Writes (~860x off): Every write still allocates a Vec<u8>, copies data into the BTreeMap dirty cache, and synchronously flushes to disk when the 1024- entry cache fills. To reach write parity, we need a VFS-level page cache (write to physical memory pages, background writeback) and async virtio-blk I/O.

Metadata (5-9x off): Create, open, and stat still re-read and re-parse inodes from block cache on every access. An in-memory inode cache and dentry cache (path → inode mapping) would eliminate most of this overhead.

Technical Notes

All code is clean-room (MIT/Apache-2.0/BSD-2-Clause), no GPL ext4 code
#![forbid(unsafe_code)] on the ext2 service crate
BTreeMap from alloc::collections works in no_std
The read cache uses access_count-based eviction (not true LRU, but simpler and effective for the hot-set workload pattern)
Dirty cache flush drains the entire BTreeMap, so concurrent writes during flush create fresh entries — no data loss race

Files Changed

services/kevlar_ext2/src/lib.rs — block read cache, BTreeMap dirty cache, deferred metadata flush, flush_all() method
Makefile — fixed test-ext4 init script path

Kevlar Documentation