Blog 114: Batch virtio-blk I/O — writes 26x faster, full ext4 performance journey

Date: 2026-03-23 Milestone: M10 Alpine Linux

Summary

Five optimizations across three sessions brought Kevlar's ext4 implementation from 375-3600x slower than Linux to 2-38x across all operations. Sequential reads reached near-parity (1.2x). The final piece — batch virtio-blk write submission — improved write throughput 26x in a single commit.

The Full Journey

PhaseChangeKey Impact
1. Block read cache512-entry LRU cache for inode/dir blocksstat: 100us → 5us
2. Dirty write cacheBTreeMap (1024 entries), sorted flushWrites buffered in memory
3. Deferred metadataSB+GDT write only on fsync, not per-alloccreate: 3.8ms → 36us (105x)
4. Dentry + inode cacheBTreeMap caches for path→ino and ino→inodestat: 981ns, open: 9us
5. Batch virtio-blk32-slot parallel write submissionwrites: 3.5 → 79 MB/s (23x)

Phase 5: How Batch I/O Works

The Problem

When the ext2 dirty cache fills (1024 blocks = 4MB), flush_dirty() writes all blocks to disk. Previously, each 4KB block was written through:

flush_dirty loop (1024 iterations):
  → write_sectors(sector, data)
    → SpinLock::lock()
    → write_sectors_impl()
      → do_request(VIRTIO_BLK_T_OUT, sector, 8)
        → enqueue 3-descriptor chain
        → notify device
        → spin-wait for completion    ← blocks until device finishes
    → SpinLock::unlock()

That's 1024 sequential round-trips to the virtual disk, each with its own notification and spin-wait. At ~0.5ms per round-trip under KVM, flushing takes ~500ms.

The Fix

The virtio spec supports multiple in-flight requests. The virtqueue typically has 128-256 descriptors; each request uses 3 (header + data + status). We can submit ~32-85 concurrent requests.

New architecture:

  1. Allocate a pool of 32 request slots at init (each: 2 pages for header+data)
  2. submit_write(slot, sector, count) — fills the slot and enqueues the descriptor chain but does NOT call notify() or wait
  3. After enqueuing up to 32 requests: reap_completions(count) calls notify() once, then spin-waits once until all 32 completions arrive
flush_dirty:
  collect 1024 (sector, data) pairs from BTreeMap
  for each batch of 32:
    copy data to 32 slot buffers
    submit_write(slot 0..31)     ← 32 enqueues, no notify
    reap_completions(32)         ← 1 notify, 1 spin-wait for all 32

Result: 32 batches of 32 instead of 1024 individual round-trips. The device (QEMU under KVM) processes all 32 requests in parallel.

Implementation Details

Request slot pool (exts/virtio_blk/lib.rs):

  • req_pool: VAddr — 32 × 2 pages = 256KB, allocated at driver init
  • Each slot: header (16B) at offset 0, status (1B) at offset 16, data (4KB) at PAGE_SIZE
  • num_batch_slots = min(32, virtqueue_descs / 3) — capped by hardware

BlockDevice trait (libs/kevlar_api/driver/block.rs):

  • Added write_sectors_batch(&self, requests: &[(u64, &[u8])]) -> Result<(), BlockError>
  • Default implementation falls back to sequential write_sectors() loop
  • VirtioBlockDriver overrides with the batch path

Ext2 flush (services/kevlar_ext2/src/lib.rs):

  • flush_dirty() collects (sector, &data) pairs from the sorted BTreeMap
  • Single call to device.write_sectors_batch(&batch)
  • No changes to the #![forbid(unsafe_code)] constraint — all new unsafe is in virtio_blk

Final Results

All 29 ext4 functional tests pass. Alpine boots, apk installs packages, curl works.

BenchmarkSession StartSession EndOverall Speedupvs Linux
seq_write (4K)837 KB/s28 MB/s34x~105x
seq_write (128K)1,719 KB/s79 MB/s46x~38x
seq_read (4K)110 MB/s98 MB/s~same~55x
seq_read (32K)161 MB/s3.6 GB/s22x1.5x
seq_read (128K)156 MB/s3.8 GB/s24x1.4x
create3,782 us41 us92x~8x
delete2,275 us9 us253x
open+close1,131 us12 us94x~4x
stat4,661 ns1,495 ns3.1x~3x
deep_stat7 us2 us3.5x

Remaining Gaps

  • Writes (38-105x off): Per-write Vec<u8> allocation overhead, single-threaded allocation path, no background writeback. Further improvements: slab allocator for dirty cache entries, async IRQ-driven completion (eliminate spin-wait CPU waste), write-behind (return to userspace before data hits disk).
  • Small reads (55x off at 4K): Syscall overhead dominates at small buffer sizes. The read_file_data() path allocates a Vec<u8> per call. A true VFS page cache returning memory-mapped pages would eliminate this.
  • Metadata (3-8x off): Mostly VFS overhead — Arc allocations, lock acquisitions, String heap allocations for dentry cache keys.

Architecture Summary

┌─────────────────────────────────────────────────────┐
│ Userspace: write(fd, buf, 4096)                     │
├─────────────────────────────────────────────────────┤
│ Ext2File::write()                                   │
│  ├─ resolve_extent() → inode cache + block cache    │
│  ├─ alloc_block_near() → bitmap from cache          │
│  └─ write_block() → BTreeMap dirty cache (1024)     │
│       └─ on full: flush_dirty()                     │
│            └─ write_sectors_batch() (sorted pairs)  │
├─────────────────────────────────────────────────────┤
│ VirtioBlk::write_sectors_batch_impl()               │
│  ├─ copy data to 32 request slots                   │
│  ├─ submit_write() × 32 (no notify)                 │
│  ├─ reap_completions(32) — 1 notify, 1 spin-wait   │
│  └─ update sector cache                             │
├─────────────────────────────────────────────────────┤
│ QEMU virtio-blk device (processes 32 in parallel)   │
└─────────────────────────────────────────────────────┘

Files Changed

  • exts/virtio_blk/lib.rs — request pool, submit_write, reap_completions, batch impl
  • libs/kevlar_api/driver/block.rs — write_sectors_batch on BlockDevice trait
  • services/kevlar_ext2/src/lib.rs — flush_dirty uses batch write