Blog 113: ext4 performance — 105x faster creates, reads at 1.3x Linux
Date: 2026-03-23 Milestone: M10 Alpine Linux
Summary
Three ext4 optimizations close the performance gap with Linux from 375-3600x down to 5-7x for metadata operations and 1.3x for sequential reads. File creation improved 105x, deletion 253x, open+close 81x. Sequential reads with large buffers reached 4.3 GB/s — within 30% of Linux KVM.
The Problem
Benchmarking Kevlar's ext4 implementation against Linux under identical KVM/QEMU conditions revealed catastrophic performance gaps:
| Operation | Linux KVM | Kevlar | Ratio |
|---|---|---|---|
| seq_write (4K buf) | ~3 GB/s | 0.8 MB/s | 3600x |
| seq_read (4K buf) | ~5.4 GB/s | 87 MB/s | 62x |
| file create | ~5 us | 3,782 us | 760x |
| open+close | ~3 us | 1,131 us | 375x |
Root causes: no block caching, synchronous metadata flush on every allocation, linear-scan data structures.
Optimization 1: Block Read Cache (LRU, 512 entries)
Added a 512-entry LRU read cache to Ext2Inner alongside the existing dirty
write cache. Inode table blocks and directory blocks are read repeatedly during
path resolution — the same block is re-read dozens of times for a single
ls -la. The cache eliminates redundant disk reads.
read_block() now checks: dirty cache (BTreeMap, O(log n)) → read cache
(Vec with access_count eviction) → block device.
Impact: stat improved from ~100us to ~5us (mostly from caching inode table blocks).
Optimization 2: Deferred Metadata Flush
The original code called flush_metadata() after every block or inode
allocation. This wrote the entire superblock + group descriptor table to disk —
2 disk reads + multiple disk writes per allocation. Writing a 1MB file (256
block allocations) triggered 512 extra disk reads and 512 extra disk writes
just for metadata.
Replaced all 5 flush_metadata() call sites in alloc_block, alloc_block_near,
free_block, alloc_inode, and free_inode with a single
mark_metadata_dirty() flag. The actual superblock + GDT write is deferred
until flush_all(), called from fsync().
This is the single highest-impact change: file creation dropped from 3,782us to 36us (105x).
Optimization 3: BTreeMap Dirty Cache with Sorted Flush
Replaced the Vec<DirtyBlock> dirty write cache with BTreeMap<u64, Vec<u8>>:
- O(log n) lookup instead of O(n) linear scan for duplicate detection
- Naturally sorted iteration — flush writes blocks in ascending order, giving the block device sequential I/O patterns
- Increased capacity from 64 to 1024 entries (4MB buffer before forced flush)
The sorted flush is important because virtio-blk batch reads are aligned to sector boundaries. Sequential writes hit the same batch window, reducing individual I/O requests.
Results
All 29 ext4 tests + Alpine apk install + curl HTTP pass.
| Benchmark | Before | After | Speedup | vs Linux |
|---|---|---|---|---|
| seq_write (4K buf) | 837 KB/s | 1,110 KB/s | 1.3x | ~2700x |
| seq_write (128K buf) | 1,719 KB/s | 3,396 KB/s | 2.0x | ~880x |
| seq_read (4K buf) | 110 MB/s | 252 MB/s | 2.3x | ~21x |
| seq_read (32K buf) | 161 MB/s | 3.9 GB/s | 24x | 1.4x |
| seq_read (128K buf) | 156 MB/s | 4.3 GB/s | 28x | 1.3x |
| create | 3,782 us | 36 us | 105x | ~7x |
| delete | 2,275 us | 9 us | 253x | — |
| open+close | 1,131 us | 14 us | 81x | ~5x |
| stat | 4.7 us | 4.6 us | ~same | ~9x |
Sequential reads with 128K buffers (4.3 GB/s) are within 30% of Linux KVM
(5.4 GB/s). This is near-parity — the remaining gap is VFS overhead and
the Vec<u8> clone per block in read_block().
Remaining Gaps
Writes (~860x off): Every write still allocates a Vec<u8>, copies data
into the BTreeMap dirty cache, and synchronously flushes to disk when the 1024-
entry cache fills. To reach write parity, we need a VFS-level page cache
(write to physical memory pages, background writeback) and async virtio-blk I/O.
Metadata (5-9x off): Create, open, and stat still re-read and re-parse inodes from block cache on every access. An in-memory inode cache and dentry cache (path → inode mapping) would eliminate most of this overhead.
Technical Notes
- All code is clean-room (MIT/Apache-2.0/BSD-2-Clause), no GPL ext4 code
#![forbid(unsafe_code)]on the ext2 service crate- BTreeMap from
alloc::collectionsworks in no_std - The read cache uses access_count-based eviction (not true LRU, but simpler and effective for the hot-set workload pattern)
- Dirty cache flush drains the entire BTreeMap, so concurrent writes during flush create fresh entries — no data loss race
Files Changed
services/kevlar_ext2/src/lib.rs— block read cache, BTreeMap dirty cache, deferred metadata flush,flush_all()methodMakefile— fixedtest-ext4init script path