Blog 094: SO_RCVBUF fix, kernel stack corruption discovery
Date: 2026-03-19 Milestone: M10 Alpine Linux
Context
Continuing contract test fixes on both x86_64 and ARM64. x86_64 was at 104/118 PASS with 14 XFAIL; ARM64 at 101/118. This session targeted the most actionable XFAILs.
Fix 1: setsockopt_readback — SO_RCVBUF value (104 → 105 PASS)
Problem: getsockopt(SO_RCVBUF) returned 87380 (smoltcp's default
receive buffer) while Linux returns 212992. Linux doubles the buffer
value in getsockopt to account for kernel bookkeeping overhead — this is
documented behavior.
Fix: One-line change in getsockopt.rs:
#![allow(unused)] fn main() { // Before: write_int_opt(optval, optlen, 87380)?; // After: write_int_opt(optval, optlen, 212992)?; }
Removed setsockopt_readback from known-divergences.json.
x86_64 now at 105/118 PASS, 13 XFAIL.
Investigation: accept4_flags / unix_stream kernel panics
Both tests panic with rip=0, vaddr=0 in kernel mode (CS=0x8, ERR=0x10
= instruction fetch). The crash manifests as a null function pointer
call in ring 0.
Narrowing down the crash
Using kevlar_platform::println! instrumentation (not ANSI-colored, so
compare-contracts.py doesn't strip it), traced the exact execution:
- socket/bind/listen — all succeed
- fork() — creates child PID 2, parent PID 1
- Child: close(3), socket(), connect() — all succeed; connect wakes the parent's accept wait queue
- Child: write(fd=3, "hello", 5) — enters
UnixSocket::write→UnixStream::write→ write loop copies 5 bytes →POLL_WAIT_QUEUE.wake_all()→ returns Ok(5) - Syscall return path:
try_delivering_signalruns (no signals pending), returns with valid user RIP 0x4045c9 - CRASH —
rip=0x0, vaddr=0x0in kernel mode
htrace reveals: it's a context switch
Enabling debug=htrace on the kernel cmdline showed:
- Child's
read(0)syscall enterssleep_signalable_until→switch() - Scheduler picks PID 1 (parent, woken by connect's
wake_all()) do_switch_threadrestores PID 1's saved RSP →retpops 0x0
Root cause: PID 1's kernel stack is zeroed
Added validation in switch() before do_switch_thread:
SWITCH BUG: next pid=1 has ret_addr=0 at rsp=0xffff80000ff033e8
[rsp+0x00] = 0x0000000000000000
[rsp+0x08] = 0x0000000000000000
... (all 16 qwords = 0)
PID 1's saved kernel stack (the syscall_stack, 2 pages / 8KB) has been completely zeroed while PID 1 was sleeping in accept()'s wait queue.
What was ruled out
| Theory | Check | Result |
|---|---|---|
| Signal delivery to null handler | Printed pending signals before/after try_delivering_signal | pending=0x0, valid RIP |
| Syscall return path bug | Verified SYSRETQ frame (RCX=user RIP, R11=RFLAGS) | All valid |
zero_page() zeroing the stack | Added check in zero_page() comparing paddr to PID 1's saved RSP | Not triggered |
alloc_page() double allocation | Added check in alloc_page() cache path | Not triggered |
| Page freed during sleep | OwnedPages held by ArchTask held by alive Process | Refcount verified ≥ 1 |
| Ghost fork VM sharing | GHOST_FORK_ENABLED is false by default | Confirmed disabled |
What we know
- The corruption happens between the 1→2 switch and the 2→1 switch
- It does NOT happen during any PID 2 syscall (pre/post checks clear)
- It does NOT happen via
zero_page()or the page cachealloc_page()path - The physical pages backing PID 1's syscall_stack are intact (valid mapping, accessible from kernel), but their content is all zeros
- Something is writing zeros to those pages through a path we haven't instrumented yet
Next steps for this bug
- Use
debug=htrace+ page-fault instrumentation to check if a demand fault'swrite_bytes(0, PAGE_SIZE)hits the stack pages - Check
alloc_pages()slow path (buddy allocator refill) for the same double-allocation pattern - Use QEMU GDB (
-s -S) to set a hardware watchpoint on the first qword of PID 1's saved stack frame — will catch the exact instruction that zeroes it
ARM64 mremap_grow: flush_tlb_all also insufficient
Changed the demand-fault TLB flush from flush_tlb_local (tlbi vale1)
to flush_tlb_all (tlbi vmalle1; dsb sy; isb) — the most aggressive
TLB invalidation available. Test still fails. This rules out the
QEMU TCG "stale fault TLB entry" hypothesis entirely.
The physical page at the mapped PA shows byte0=0x0 at mremap entry, meaning the user's memset(addr, 0xAB, pgsz) writes never reached the physical page. Needs a different debugging approach (see plan).
Summary
| Change | Impact |
|---|---|
| SO_RCVBUF → 212992 | x86_64: 105/118 PASS (+1) |
| accept4_flags/unix_stream investigation | Root cause identified: kernel stack corruption (not yet fixed) |
| ARM64 flush_tlb_all | Ruled out TLB theory for mremap_grow |