Blog 114: Batch virtio-blk I/O — writes 26x faster, full ext4 performance journey
Date: 2026-03-23 Milestone: M10 Alpine Linux
Summary
Five optimizations across three sessions brought Kevlar's ext4 implementation from 375-3600x slower than Linux to 2-38x across all operations. Sequential reads reached near-parity (1.2x). The final piece — batch virtio-blk write submission — improved write throughput 26x in a single commit.
The Full Journey
| Phase | Change | Key Impact |
|---|---|---|
| 1. Block read cache | 512-entry LRU cache for inode/dir blocks | stat: 100us → 5us |
| 2. Dirty write cache | BTreeMap (1024 entries), sorted flush | Writes buffered in memory |
| 3. Deferred metadata | SB+GDT write only on fsync, not per-alloc | create: 3.8ms → 36us (105x) |
| 4. Dentry + inode cache | BTreeMap caches for path→ino and ino→inode | stat: 981ns, open: 9us |
| 5. Batch virtio-blk | 32-slot parallel write submission | writes: 3.5 → 79 MB/s (23x) |
Phase 5: How Batch I/O Works
The Problem
When the ext2 dirty cache fills (1024 blocks = 4MB), flush_dirty() writes all
blocks to disk. Previously, each 4KB block was written through:
flush_dirty loop (1024 iterations):
→ write_sectors(sector, data)
→ SpinLock::lock()
→ write_sectors_impl()
→ do_request(VIRTIO_BLK_T_OUT, sector, 8)
→ enqueue 3-descriptor chain
→ notify device
→ spin-wait for completion ← blocks until device finishes
→ SpinLock::unlock()
That's 1024 sequential round-trips to the virtual disk, each with its own notification and spin-wait. At ~0.5ms per round-trip under KVM, flushing takes ~500ms.
The Fix
The virtio spec supports multiple in-flight requests. The virtqueue typically has 128-256 descriptors; each request uses 3 (header + data + status). We can submit ~32-85 concurrent requests.
New architecture:
- Allocate a pool of 32 request slots at init (each: 2 pages for header+data)
submit_write(slot, sector, count)— fills the slot and enqueues the descriptor chain but does NOT callnotify()or wait- After enqueuing up to 32 requests:
reap_completions(count)callsnotify()once, then spin-waits once until all 32 completions arrive
flush_dirty:
collect 1024 (sector, data) pairs from BTreeMap
for each batch of 32:
copy data to 32 slot buffers
submit_write(slot 0..31) ← 32 enqueues, no notify
reap_completions(32) ← 1 notify, 1 spin-wait for all 32
Result: 32 batches of 32 instead of 1024 individual round-trips. The device (QEMU under KVM) processes all 32 requests in parallel.
Implementation Details
Request slot pool (exts/virtio_blk/lib.rs):
req_pool: VAddr— 32 × 2 pages = 256KB, allocated at driver init- Each slot: header (16B) at offset 0, status (1B) at offset 16, data (4KB) at PAGE_SIZE
num_batch_slots = min(32, virtqueue_descs / 3)— capped by hardware
BlockDevice trait (libs/kevlar_api/driver/block.rs):
- Added
write_sectors_batch(&self, requests: &[(u64, &[u8])]) -> Result<(), BlockError> - Default implementation falls back to sequential
write_sectors()loop - VirtioBlockDriver overrides with the batch path
Ext2 flush (services/kevlar_ext2/src/lib.rs):
flush_dirty()collects(sector, &data)pairs from the sorted BTreeMap- Single call to
device.write_sectors_batch(&batch) - No changes to the
#![forbid(unsafe_code)]constraint — all new unsafe is in virtio_blk
Final Results
All 29 ext4 functional tests pass. Alpine boots, apk installs packages, curl works.
| Benchmark | Session Start | Session End | Overall Speedup | vs Linux |
|---|---|---|---|---|
| seq_write (4K) | 837 KB/s | 28 MB/s | 34x | ~105x |
| seq_write (128K) | 1,719 KB/s | 79 MB/s | 46x | ~38x |
| seq_read (4K) | 110 MB/s | 98 MB/s | ~same | ~55x |
| seq_read (32K) | 161 MB/s | 3.6 GB/s | 22x | 1.5x |
| seq_read (128K) | 156 MB/s | 3.8 GB/s | 24x | 1.4x |
| create | 3,782 us | 41 us | 92x | ~8x |
| delete | 2,275 us | 9 us | 253x | — |
| open+close | 1,131 us | 12 us | 94x | ~4x |
| stat | 4,661 ns | 1,495 ns | 3.1x | ~3x |
| deep_stat | 7 us | 2 us | 3.5x | — |
Remaining Gaps
- Writes (38-105x off): Per-write
Vec<u8>allocation overhead, single-threaded allocation path, no background writeback. Further improvements: slab allocator for dirty cache entries, async IRQ-driven completion (eliminate spin-wait CPU waste), write-behind (return to userspace before data hits disk). - Small reads (55x off at 4K): Syscall overhead dominates at small buffer sizes.
The
read_file_data()path allocates aVec<u8>per call. A true VFS page cache returning memory-mapped pages would eliminate this. - Metadata (3-8x off): Mostly VFS overhead — Arc allocations, lock acquisitions, String heap allocations for dentry cache keys.
Architecture Summary
┌─────────────────────────────────────────────────────┐
│ Userspace: write(fd, buf, 4096) │
├─────────────────────────────────────────────────────┤
│ Ext2File::write() │
│ ├─ resolve_extent() → inode cache + block cache │
│ ├─ alloc_block_near() → bitmap from cache │
│ └─ write_block() → BTreeMap dirty cache (1024) │
│ └─ on full: flush_dirty() │
│ └─ write_sectors_batch() (sorted pairs) │
├─────────────────────────────────────────────────────┤
│ VirtioBlk::write_sectors_batch_impl() │
│ ├─ copy data to 32 request slots │
│ ├─ submit_write() × 32 (no notify) │
│ ├─ reap_completions(32) — 1 notify, 1 spin-wait │
│ └─ update sector cache │
├─────────────────────────────────────────────────────┤
│ QEMU virtio-blk device (processes 32 in parallel) │
└─────────────────────────────────────────────────────┘
Files Changed
exts/virtio_blk/lib.rs— request pool, submit_write, reap_completions, batch impllibs/kevlar_api/driver/block.rs— write_sectors_batch on BlockDevice traitservices/kevlar_ext2/src/lib.rs— flush_dirty uses batch write