Blog 113: ext4 performance — 105x faster creates, reads at 1.3x Linux

Date: 2026-03-23 Milestone: M10 Alpine Linux

Summary

Three ext4 optimizations close the performance gap with Linux from 375-3600x down to 5-7x for metadata operations and 1.3x for sequential reads. File creation improved 105x, deletion 253x, open+close 81x. Sequential reads with large buffers reached 4.3 GB/s — within 30% of Linux KVM.

The Problem

Benchmarking Kevlar's ext4 implementation against Linux under identical KVM/QEMU conditions revealed catastrophic performance gaps:

OperationLinux KVMKevlarRatio
seq_write (4K buf)~3 GB/s0.8 MB/s3600x
seq_read (4K buf)~5.4 GB/s87 MB/s62x
file create~5 us3,782 us760x
open+close~3 us1,131 us375x

Root causes: no block caching, synchronous metadata flush on every allocation, linear-scan data structures.

Optimization 1: Block Read Cache (LRU, 512 entries)

Added a 512-entry LRU read cache to Ext2Inner alongside the existing dirty write cache. Inode table blocks and directory blocks are read repeatedly during path resolution — the same block is re-read dozens of times for a single ls -la. The cache eliminates redundant disk reads.

read_block() now checks: dirty cache (BTreeMap, O(log n)) → read cache (Vec with access_count eviction) → block device.

Impact: stat improved from ~100us to ~5us (mostly from caching inode table blocks).

Optimization 2: Deferred Metadata Flush

The original code called flush_metadata() after every block or inode allocation. This wrote the entire superblock + group descriptor table to disk — 2 disk reads + multiple disk writes per allocation. Writing a 1MB file (256 block allocations) triggered 512 extra disk reads and 512 extra disk writes just for metadata.

Replaced all 5 flush_metadata() call sites in alloc_block, alloc_block_near, free_block, alloc_inode, and free_inode with a single mark_metadata_dirty() flag. The actual superblock + GDT write is deferred until flush_all(), called from fsync().

This is the single highest-impact change: file creation dropped from 3,782us to 36us (105x).

Optimization 3: BTreeMap Dirty Cache with Sorted Flush

Replaced the Vec<DirtyBlock> dirty write cache with BTreeMap<u64, Vec<u8>>:

  • O(log n) lookup instead of O(n) linear scan for duplicate detection
  • Naturally sorted iteration — flush writes blocks in ascending order, giving the block device sequential I/O patterns
  • Increased capacity from 64 to 1024 entries (4MB buffer before forced flush)

The sorted flush is important because virtio-blk batch reads are aligned to sector boundaries. Sequential writes hit the same batch window, reducing individual I/O requests.

Results

All 29 ext4 tests + Alpine apk install + curl HTTP pass.

BenchmarkBeforeAfterSpeedupvs Linux
seq_write (4K buf)837 KB/s1,110 KB/s1.3x~2700x
seq_write (128K buf)1,719 KB/s3,396 KB/s2.0x~880x
seq_read (4K buf)110 MB/s252 MB/s2.3x~21x
seq_read (32K buf)161 MB/s3.9 GB/s24x1.4x
seq_read (128K buf)156 MB/s4.3 GB/s28x1.3x
create3,782 us36 us105x~7x
delete2,275 us9 us253x
open+close1,131 us14 us81x~5x
stat4.7 us4.6 us~same~9x

Sequential reads with 128K buffers (4.3 GB/s) are within 30% of Linux KVM (5.4 GB/s). This is near-parity — the remaining gap is VFS overhead and the Vec<u8> clone per block in read_block().

Remaining Gaps

Writes (~860x off): Every write still allocates a Vec<u8>, copies data into the BTreeMap dirty cache, and synchronously flushes to disk when the 1024- entry cache fills. To reach write parity, we need a VFS-level page cache (write to physical memory pages, background writeback) and async virtio-blk I/O.

Metadata (5-9x off): Create, open, and stat still re-read and re-parse inodes from block cache on every access. An in-memory inode cache and dentry cache (path → inode mapping) would eliminate most of this overhead.

Technical Notes

  • All code is clean-room (MIT/Apache-2.0/BSD-2-Clause), no GPL ext4 code
  • #![forbid(unsafe_code)] on the ext2 service crate
  • BTreeMap from alloc::collections works in no_std
  • The read cache uses access_count-based eviction (not true LRU, but simpler and effective for the hot-set workload pattern)
  • Dirty cache flush drains the entire BTreeMap, so concurrent writes during flush create fresh entries — no data loss race

Files Changed

  • services/kevlar_ext2/src/lib.rs — block read cache, BTreeMap dirty cache, deferred metadata flush, flush_all() method
  • Makefile — fixed test-ext4 init script path