M6 Phase 3.5: SMP Debug Tooling and the WaitQueue Race

After Phase 3 landed, 12/12 threading tests passed on a single vCPU. Under -smp 4 they hung — specifically at test 6, the mutex test, which would block forever waiting for a pthread_join that never returned.

A hanging mutex test on SMP almost always means a thread is lost: no longer in any scheduler queue or wait queue, so nobody will ever wake it. Diagnosing why required better crash-time visibility than we had, so we shipped four debug tooling improvements before touching any threading code.

Improvement 1: kernel register dump on fault

Before: a kernel page fault or general protection fault would print a one-line panic message and halt.

After: the interrupt handler dumps the full register set, the fault address (CR2), and the kernel stack contents at RSP before calling panic!:

kernel page fault — register dump:
  RIP=ffffffff80123456  RSP=ffffffff8012a000  RBP=ffffffff8012a0f0
  RAX=0000000000000000  RBX=ffff800040001234  RCX=0000000000000003  RDX=0000000000000000
  RSI=0000000000000001  RDI=ffff800040001234  R8 =0000000000000000  R9 =0000000000000000
  R10=0000000000000000  R11=0000000000000000  R12=0000000000000001  R13=0000000000000000
  R14=0000000000000000  R15=0000000000000000
  CS=0x8 (ring 0)  SS=0x10  RFLAGS=0x00000046  ERR=0x2
  CR2 (fault vaddr) = 0000000000000000
  kernel stack at RSP (ffffffff8012a000):
    [rsp+0x00] = ffffffff80123456
    [rsp+0x08] = 0000000000000000
    [rsp+0x10] = ffff800040001234
    …

The stack dump is particularly useful for identifying null function-pointer crashes: if RIP is 0, the return address chain in the stack usually points to the actual caller.

The same treatment was applied to GPF, invalid opcode, and the other synchronous exceptions — anything that previously just panicked with a bare {:?} of the packed InterruptFrame.

Improvement 2: unconditional page poison

Before: freed pages were only poisoned in debug builds. Release builds returned clean (or zero) memory, hiding use-after-free bugs until they caused data corruption far from the original site.

After: every freed page is written with 0xa5 in all build profiles, including profile-performance and profile-ludicrous. The cost is roughly one cache miss per freed page — negligible for kernel workloads.

The immediate effect: a use-after-free that previously looked like "wrong but plausible data" now produces a crash with RIP or a pointer containing 0xa5a5a5a5a5a5a5a5. Much faster to diagnose.

Improvement 3: per-CPU lock-free flight recorder

The most useful addition. The flight recorder is a fixed-size circular buffer of recent events per CPU, written at interrupt speed and dumped by the panic handler after all other CPUs are halted.

Design

platform/flight_recorder.rs:
  MAX_CPUS  = 8
  RING_SIZE = 64 entries per CPU

Entry layout (32 bytes = 4 × u64):
  [0] tsc         : u64   — raw TSC timestamp
  [1] kind:u8 | cpu:u8 | _pad:u16 | data0:u32  — packed descriptor
  [2] data1       : u64
  [3] data2       : u64

The static mut RINGS array is indexed [cpu][entry][word]. Only CPU n writes to RINGS[n] — so no synchronisation is needed on the write path. The index counter IDX[n] uses a single relaxed atomic increment. The dump path is safe because all peer CPUs are halted before dump() runs.

#![allow(unused)]
fn main() {
#[inline(always)]
pub fn record(kind: u8, data0: u32, data1: u64, data2: u64) {
    let cpu = crate::arch::cpu_id() as usize % MAX_CPUS;
    let raw_idx = IDX[cpu].fetch_add(1, Ordering::Relaxed);
    let idx = raw_idx % RING_SIZE;
    let tsc = crate::arch::read_clock_counter();
    unsafe {
        let slot = &mut RINGS[cpu][idx];
        slot[0] = tsc;
        slot[1] = ((kind as u64) << 56) | ((cpu as u64) << 48) | (data0 as u64);
        slot[2] = data1;
        slot[3] = data2;
    }
}
}

dump() collects all non-zero entries from all CPUs, insertion-sorts them by TSC (≤512 entries, O(n²) is fine in the panic path), and prints a cross-CPU timeline:

[FLIGHT RECORDER — last 64 events per CPU, sorted by TSC]
  (base TSC=0x1234abcd, showing 47 events)
  +       0 ticks  CPU=0  CTX_SWITCH   CPU=0 CTX_SWITCH  from_pid=1 to_pid=2
  +     412 ticks  CPU=1  PREEMPT      CPU=1 PREEMPT     pid=3
  +     430 ticks  CPU=1  CTX_SWITCH   CPU=1 CTX_SWITCH  from_pid=3 to_pid=4
  +    1024 ticks  CPU=0  SYSCALL_IN   CPU=0 SYSCALL_IN  nr=202 arg0=0x7f00
  …

Integration points

Location	Event	Data
`kernel/process/switch.rs`	`CTX_SWITCH`	from_pid, to_pid
`platform/x64/apic.rs` (`tlb_shootdown`)	`TLB_SEND`	target CPU mask, vaddr
`platform/x64/apic.rs` (`tlb_remote_full_flush`)	`TLB_SEND`	target CPU mask, 0
`platform/x64/interrupt.rs` (TLB IPI handler)	`TLB_RECV`	vaddr invalidated
`platform/x64/interrupt.rs` (LAPIC preempt)	`PREEMPT`	CPU id
`platform/x64/idle.rs`	`IDLE_ENTER` / `IDLE_EXIT`	—

The recorder costs nothing at runtime on the non-panicking path — no locks, no branches, no conditional compilation.

Improvement 4: serial-based crash dump

The original crash dump mechanism used boot2dump — a mini bootloader embedded in the binary that, on panic, wrote the kernel log to an ext4 file on a virtio-blk device and then rebooted. This never worked in our QEMU test setup (no virtio-blk) and added ~800 KB to the binary.

Replacement: the panic handler base64-encodes the KernelDump struct (magic + log length + 4 KiB of log) and emits it over the existing serial debug printer, framed by sentinel lines:

===KEVLAR_CRASH_DUMP_BEGIN===
AAECAw...base64...
===KEVLAR_CRASH_DUMP_END===

The encoder runs inline in the panic handler with no allocation — just a const alphabet slice and a loop over 3-byte groups.

run-qemu.py gains a --save-dump FILE flag. When set, it spawns a thread that intercepts QEMU's stdout, scans for the sentinels, base64-decodes on the fly, and writes the decoded bytes to FILE. make run now passes --save-dump kevlar.dump automatically, so crash dumps land in the working directory without any user action.

The bug: WaitQueue lost-thread race

With the tooling in place, we could observe what was actually happening during the mutex test hang. The flight recorder showed context switches between the four threads, but one thread's PID simply stopped appearing — it had been scheduled out and never rescheduled.

How threads sleep on a mutex

musl's pthread_mutex_lock eventually calls futex(addr, FUTEX_WAIT, val). The kernel's sys_futex creates or retrieves a WaitQueue for that address, then calls sleep_signalable_until. Here is the original code:

#![allow(unused)]
fn main() {
pub fn sleep_signalable_until<F, R>(&self, mut sleep_if_none: F) -> Result<R>
where F: FnMut() -> Result<Option<R>>
{
    loop {
        // ← WINDOW OPENS HERE
        current_process().set_state(ProcessState::BlockedSignalable); // (1)
        // ← LAPIC PREEMPT CAN FIRE HERE
        {
            let mut q = self.queue.lock();
            q.push_back(current_process().clone());          // (2)
            self.waiter_count.fetch_add(1, Ordering::Relaxed);
        }
        // …
        switch();
    }
}
}

The race

On x86_64, SpinLock::lock() calls cli before spinning, disabling hardware interrupts. The LAPIC preemption timer fires as an interrupt. So:

Thread A on CPU 1:
  set_state(BlockedSignalable)          ← removed from run queue
  [LAPIC timer IRQ fires — IF=1 here]
    → CPU 1 enters x64_handle_interrupt
    → LAPIC_PREEMPT_VECTOR handler
    → handle_ap_preempt() → switch()
    → switch() reads prev_state == BlockedSignalable
    → BlockedSignalable ≠ Runnable, so does NOT re-enqueue thread A
    → switches to thread B
  [IRQ returns — thread A is suspended mid-function]

Thread A, when eventually rescheduled to a CPU:
  push_back(current_process())          ← thread A is now in WaitQueue

But by the time thread A resumes and calls push_back, thread B may have already released the mutex and called wake_all() on the WaitQueue. wake_all finds an empty queue (thread A hasn't pushed yet) and returns. Thread A then pushes itself into the WaitQueue and goes to sleep — with nobody left to wake it. The mutex call that would wake it has already happened.

The thread is now permanently lost: not in any scheduler queue (because set_state(BlockedSignalable) removed it), not in the WaitQueue (it arrived after wake_all). Any thread waiting for it — via pthread_join — blocks forever.

The fix

Hold the WaitQueue's SpinLock across both set_state and push_back. SpinLock::lock() calls cli, so the LAPIC timer cannot fire between the two operations. They are atomic with respect to preemption:

#![allow(unused)]
fn main() {
{
    let mut q = self.queue.lock();    // ← cli
    current_process().set_state(ProcessState::BlockedSignalable);
    q.push_back(current_process().clone());
    self.waiter_count.fetch_add(1, Ordering::Relaxed);
}   // ← sti (SpinLock Drop restores IF)
}

Now the wake-versus-sleep ordering is guaranteed: either the thread is in the WaitQueue before wake_all runs (and will be woken), or wake_all runs first and the thread will re-check the condition in sleep_if_none on the next iteration (and return without sleeping).

A secondary fix in the early-return paths of sleep_signalable_until: where the condition is already met (so we don't actually need to sleep), the original code called resume() on the current process. resume() sets state to Runnable and then enqueues the process in the scheduler — but the process is already running, so it ends up in the scheduler queue twice. The fix is to call set_state(Runnable) directly, which changes the state without re-enqueueing.

Lock ordering

The fix holds queue.lock() while calling set_state, which takes no other locks. wake_all() holds queue.lock() while calling resume(), which acquires SCHEDULER.lock(). switch() acquires SCHEDULER.lock() and does not touch the WaitQueue. So the ordering queue → SCHEDULER is consistent and deadlock-free.

Results

After the WaitQueue fix:

=== Kevlar M6 Threading Tests (4 vCPUs) ===

TEST_PASS thread_create_join
TEST_PASS gettid_unique
TEST_PASS getpid_same
TEST_PASS shared_memory
TEST_PASS atomic_counter
TEST_PASS mutex
TEST_PASS tls
TEST_PASS condvar
TEST_PASS signal_group
TEST_PASS tgkill
TEST_PASS mmap_shared
TEST_PASS fork_from_thread

TEST_END 12/12

All four safety profiles (fortress, balanced, performance, ludicrous) compile cleanly with the flight recorder and serial dump active.

What's next

With solid crash-time diagnostics and the WaitQueue race fixed, the SMP threading substrate is stable enough to build on. Next: TLB shootdown infrastructure.

When one thread unmaps a page, the page-table change is immediately visible to the kernel (via the straight-mapped physical window), but peer CPUs may have the old translation cached in their TLBs. Any access through a stale TLB entry is undefined behaviour — either a silent wrong-address read or a spurious page fault.

Phase 4 will implement the IPI-based shootdown protocol: the unmap path sends TLB_SHOOTDOWN_VECTOR to all peer CPUs, each peer executes invlpg (or reloads CR3 for a full flush), and the sender spin-waits until every target has acknowledged.

Phase	Description	Status
M6 Phase 1	SMP boot (INIT-SIPI-SIPI, trampoline, MADT)	✅ Done
M6 Phase 2	Per-CPU run queues + LAPIC timer preemption	✅ Done
M6 Phase 3	clone(CLONE_VM\|CLONE_THREAD), tgid, futex wake-on-exit	✅ Done
M6 Phase 3.5	SMP debug tooling + WaitQueue race fix	✅ Done
M6 Phase 4	TLB shootdown + SMP thread safety	🔄 Next

Kevlar Documentation