Blog 094: SO_RCVBUF fix, kernel stack corruption discovery

Date: 2026-03-19 Milestone: M10 Alpine Linux

Context

Continuing contract test fixes on both x86_64 and ARM64. x86_64 was at 104/118 PASS with 14 XFAIL; ARM64 at 101/118. This session targeted the most actionable XFAILs.


Fix 1: setsockopt_readback — SO_RCVBUF value (104 → 105 PASS)

Problem: getsockopt(SO_RCVBUF) returned 87380 (smoltcp's default receive buffer) while Linux returns 212992. Linux doubles the buffer value in getsockopt to account for kernel bookkeeping overhead — this is documented behavior.

Fix: One-line change in getsockopt.rs:

#![allow(unused)]
fn main() {
// Before:
write_int_opt(optval, optlen, 87380)?;
// After:
write_int_opt(optval, optlen, 212992)?;
}

Removed setsockopt_readback from known-divergences.json. x86_64 now at 105/118 PASS, 13 XFAIL.


Investigation: accept4_flags / unix_stream kernel panics

Both tests panic with rip=0, vaddr=0 in kernel mode (CS=0x8, ERR=0x10 = instruction fetch). The crash manifests as a null function pointer call in ring 0.

Narrowing down the crash

Using kevlar_platform::println! instrumentation (not ANSI-colored, so compare-contracts.py doesn't strip it), traced the exact execution:

  1. socket/bind/listen — all succeed
  2. fork() — creates child PID 2, parent PID 1
  3. Child: close(3), socket(), connect() — all succeed; connect wakes the parent's accept wait queue
  4. Child: write(fd=3, "hello", 5) — enters UnixSocket::writeUnixStream::write → write loop copies 5 bytes → POLL_WAIT_QUEUE.wake_all()returns Ok(5)
  5. Syscall return path: try_delivering_signal runs (no signals pending), returns with valid user RIP 0x4045c9
  6. CRASHrip=0x0, vaddr=0x0 in kernel mode

htrace reveals: it's a context switch

Enabling debug=htrace on the kernel cmdline showed:

  • Child's read(0) syscall enters sleep_signalable_untilswitch()
  • Scheduler picks PID 1 (parent, woken by connect's wake_all())
  • do_switch_thread restores PID 1's saved RSP → ret pops 0x0

Root cause: PID 1's kernel stack is zeroed

Added validation in switch() before do_switch_thread:

SWITCH BUG: next pid=1 has ret_addr=0 at rsp=0xffff80000ff033e8
  [rsp+0x00] = 0x0000000000000000
  [rsp+0x08] = 0x0000000000000000
  ... (all 16 qwords = 0)

PID 1's saved kernel stack (the syscall_stack, 2 pages / 8KB) has been completely zeroed while PID 1 was sleeping in accept()'s wait queue.

What was ruled out

TheoryCheckResult
Signal delivery to null handlerPrinted pending signals before/after try_delivering_signalpending=0x0, valid RIP
Syscall return path bugVerified SYSRETQ frame (RCX=user RIP, R11=RFLAGS)All valid
zero_page() zeroing the stackAdded check in zero_page() comparing paddr to PID 1's saved RSPNot triggered
alloc_page() double allocationAdded check in alloc_page() cache pathNot triggered
Page freed during sleepOwnedPages held by ArchTask held by alive ProcessRefcount verified ≥ 1
Ghost fork VM sharingGHOST_FORK_ENABLED is false by defaultConfirmed disabled

What we know

  • The corruption happens between the 1→2 switch and the 2→1 switch
  • It does NOT happen during any PID 2 syscall (pre/post checks clear)
  • It does NOT happen via zero_page() or the page cache alloc_page() path
  • The physical pages backing PID 1's syscall_stack are intact (valid mapping, accessible from kernel), but their content is all zeros
  • Something is writing zeros to those pages through a path we haven't instrumented yet

Next steps for this bug

  • Use debug=htrace + page-fault instrumentation to check if a demand fault's write_bytes(0, PAGE_SIZE) hits the stack pages
  • Check alloc_pages() slow path (buddy allocator refill) for the same double-allocation pattern
  • Use QEMU GDB (-s -S) to set a hardware watchpoint on the first qword of PID 1's saved stack frame — will catch the exact instruction that zeroes it

ARM64 mremap_grow: flush_tlb_all also insufficient

Changed the demand-fault TLB flush from flush_tlb_local (tlbi vale1) to flush_tlb_all (tlbi vmalle1; dsb sy; isb) — the most aggressive TLB invalidation available. Test still fails. This rules out the QEMU TCG "stale fault TLB entry" hypothesis entirely.

The physical page at the mapped PA shows byte0=0x0 at mremap entry, meaning the user's memset(addr, 0xAB, pgsz) writes never reached the physical page. Needs a different debugging approach (see plan).


Summary

ChangeImpact
SO_RCVBUF → 212992x86_64: 105/118 PASS (+1)
accept4_flags/unix_stream investigationRoot cause identified: kernel stack corruption (not yet fixed)
ARM64 flush_tlb_allRuled out TLB theory for mremap_grow