M6 Phase 3: Threading
Kevlar now supports POSIX threads end-to-end. pthread_create, pthread_join,
mutexes, condition variables, TLS, tgkill, and fork from a threaded process
all work correctly under an SMP guest. Twelve integration tests pass on 4 vCPUs.
This one was a marathon.
What "threading" actually requires
fork() was already working. A thread is not a fork — it's closer, and in some
ways harder. The Linux ABI for thread creation goes through clone(2) with a
specific set of flags:
clone(CLONE_VM | CLONE_THREAD | CLONE_SIGHAND | CLONE_FILES |
CLONE_FS | CLONE_SETTLS | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID,
child_stack, &ptid, &ctid, newtls)
Each flag is a contract:
| Flag | Contract |
|---|---|
CLONE_VM | Share the address space (no copy-on-write) |
CLONE_THREAD | Same thread group → getpid() returns parent's PID |
CLONE_SETTLS | Set FS base (x86_64) / TPIDR_EL0 (ARM64) to newtls |
CLONE_CHILD_SETTID | Write child TID to ctid in child's address space |
CLONE_CHILD_CLEARTID | On thread exit: write 0 to ctid, wake futex waiters |
CLONE_CHILD_CLEARTID is what makes pthread_join work. musl's join
implementation sleeps on futex(ctid, FUTEX_WAIT, tid). When the thread
exits and clears ctid, the kernel wakes that futex. No CLEARTID, no join.
Kernel changes
Process struct: tgid and clear_child_tid
Two new fields on Process:
#![allow(unused)] fn main() { pub struct Process { pid: PId, tgid: PId, // thread group id; == pid for leaders clear_child_tid: AtomicUsize, // ctid address, or 0 // … } }
fork() sets tgid = pid (new process is its own group leader).
new_thread() sets tgid = parent.tgid (same thread group as creator).
getpid() returns tgid. gettid() returns pid. This is the Linux
invariant: all threads in a group see the same getpid().
sys_clone: the thread path
clone(CLONE_VM | …) routes to a dedicated code path that calls
Process::new_thread() instead of Process::fork():
#![allow(unused)] fn main() { if flags & CLONE_VM != 0 { let set_child_tid = flags & CLONE_CHILD_SETTID != 0; let clear_child_tid = flags & CLONE_CHILD_CLEARTID != 0; let newtls_val = if flags & CLONE_SETTLS != 0 { newtls as u64 } else { 0 }; let child = Process::new_thread( parent, self.frame, child_stack as u64, newtls_val, ctid, set_child_tid, clear_child_tid, )?; // … Ok(child.pid().as_i32() as isize) } }
Note the argument swap between architectures: x86_64 passes (ptid, ctid, newtls) but ARM64 passes (ptid, newtls, ctid). A single #[cfg] at the
top of the handler unpacks them into the right names.
new_thread(): what's shared, what's not
#![allow(unused)] fn main() { let child = Arc::new(Process { pid, tgid: parent.tgid, // same thread group vm: AtomicRefCell::new(parent.vm().as_ref().map(Arc::clone)), // shared opened_files: Arc::clone(&parent.opened_files), // shared signals: Arc::clone(&parent.signals), // shared signal_pending: AtomicU32::new(0), // per-thread (own pending bitmask) sigset: AtomicU64::new(parent.sigset_load().bits()), // inherited mask clear_child_tid: AtomicUsize::new(0), // … credentials, umask, comm all copied from parent }); }
Three things are shared via Arc: the virtual memory map (vm), the open
file table (opened_files), and the signal disposition table (signals).
The signal pending bitmask and signal mask are per-thread — threads have
independent delivery state even though they share handlers.
ArchTask::new_thread(): the stack layout
Every thread needs its own kernel stack, interrupt stack, and syscall stack —
three 1 MiB allocations. The initial kernel stack is pre-loaded with a fake
do_switch_thread context frame so the thread can be scheduled like any other:
#![allow(unused)] fn main() { // IRET frame for returning to userspace. rsp = push_stack(rsp, (USER_DS | USER_RPL) as u64); // SS rsp = push_stack(rsp, child_stack); // user RSP ← pthread stack rsp = push_stack(rsp, frame.rflags); // RFLAGS rsp = push_stack(rsp, (USER_CS64 | USER_RPL) as u64); // CS rsp = push_stack(rsp, frame.rip); // RIP ← clone() return addr // Registers popped before IRET (clone() returns 0 to child via RAX). rsp = push_stack(rsp, frame.rflags); // r11 rsp = push_stack(rsp, frame.rip); // rcx // … rsi, rdi, rdx, r8-r10 // do_switch_thread context frame. rsp = push_stack(rsp, forked_child_entry as *const u8 as u64); // "return" address rsp = push_stack(rsp, frame.rbp); // … callee-saves … rsp = push_stack(rsp, 0x02); // RFLAGS (interrupts disabled) }
When the scheduler first picks up the new thread, do_switch_thread pops the
callee-saves and returns to forked_child_entry, which pops the remaining
registers and executes iret — landing in userspace at clone()'s return
address with RSP pointing at the freshly-allocated pthread stack.
The ARM64 path is analogous, replacing the IRET frame with an eret-compatible
exception-return frame via SPSR_EL1 and ELR_EL1.
Thread exit: CLEARTID and futex wake
On thread exit, Process::exit() checks is_thread = (tgid != pid). For
threads:
- Skip sending
SIGCHLD(thread exits are invisible to the parent process). - Skip closing file descriptors (the table is shared with siblings).
- Write 0 to
clear_child_tidaddress and callfutex_wake_addr. - Push the
Arc<Process>ontoEXITED_PROCESSES(so the Arc stays alive through the upcoming context switch — the idle thread GCs it later).
#![allow(unused)] fn main() { let ctid_addr = current.clear_child_tid.load(Ordering::Relaxed); if ctid_addr != 0 { let _ = uaddr.write::<i32>(&0); futex_wake_addr(ctid_addr, 1); } }
Without the EXITED_PROCESSES push, switch() would free the thread's kernel
stacks while still executing on them:
PROCESSES.remove(&pid) → refcount drops to 1 (only CURRENT)
arc_leak_one_ref(&prev) → refcount 1 (CURRENT)
CURRENT.set(next) → drops CURRENT → refcount 0 → freed ← use-after-free
switch_thread(prev.arch, next.arch) ← executing on freed memory
exit_group
exit_group(2) terminates the entire thread group. The implementation
collects all sibling threads (same tgid, different pid), sends each
SIGKILL, then calls exit() on the current thread. The siblings receive
the signal on their next preemption and call their own exit().
The integration test
testing/mini_threads.c exercises twelve scenarios in order:
| # | Test | What it checks |
|---|---|---|
| 1 | thread_create_join | Basic create + join, return value |
| 2 | gettid_unique | Each thread has a distinct TID |
| 3 | getpid_same | All threads share the same TGID |
| 4 | shared_memory | Stack variable written by one thread read by another |
| 5 | atomic_counter | 4 threads × 1000 increments = 4000 (no data race) |
| 6 | mutex | pthread_mutex serialises 4 × 1000 increments |
| 7 | tls | __thread gives per-thread storage |
| 8 | condvar | pthread_cond_wait + pthread_cond_signal |
| 9 | signal_group | kill(getpid(), SIGUSR1) delivered to thread group |
| 10 | tgkill | Signal routed to a specific thread by TID |
| 11 | mmap_shared | Anonymous mmap written by child thread |
| 12 | fork_from_thread | fork() from a threaded process, waitpid() succeeds |
Tests 1–9 and 11–12 passed quickly. Test 10 took everything else in this post.
The debugging marathon
First: a deadlock hiding as a panic
With 4 vCPUs and all tests running, the kernel would panic somewhere in
tests 1–3 with double panic! — a second panic firing while the first
panic handler was still running.
Following the backtrace, the first panic address decoded to a Result::expect
in the kernel but with a return address of 0x46 — obviously corrupt. Stack
corruption at that level usually means either a stack overflow or a lock
deadlock that caused a CPU to spin until the watchdog fired.
Reading new_thread() and switch() side by side revealed a classic AB-BA
deadlock:
CPU 0 (new_thread): lock PROCESSES → ... → lock SCHEDULER
CPU 1 (switch): lock SCHEDULER → ... → lock PROCESSES
new_thread() was holding PROCESSES when it called SCHEDULER.lock().enqueue().
switch() was holding SCHEDULER when it called PROCESSES.lock().get() inside.
Under SMP, both could fire simultaneously.
The fix is one line — drop PROCESSES before touching SCHEDULER:
#![allow(unused)] fn main() { process_table.insert(pid, child.clone()); drop(process_table); // ← release before acquiring SCHEDULER SCHEDULER.lock().enqueue(pid); }
Applied in both fork() and new_thread(). Tests 1–9 and 11–12 passed.
Then: test 10 (tgkill) — the double-panic
tgkill test spins a child thread and has the main thread send it SIGUSR2
via tgkill(getpid(), child_tid, SIGUSR2). Consistently: panic, then
double panic!, then halt.
The first panic decoded to a kernel-mode General Protection Fault at
core::fmt::write + 0x23 — a movzbl 0x0(%r13), %eax with R13 holding a
non-canonical address. In other words, the kernel panicked while trying to
format a panic message, then panicked again while formatting that panic.
Two separate bugs caused this.
Bug 1: panic handler ordering
The panic handler structure was:
#![allow(unused)] fn main() { fn panic(info: &core::panic::PanicInfo) -> ! { if PANICKED.load() { /* double panic exit */ } // … capture msg_buf from info … begin_panic(Box::new(msg_buf.as_str().to_owned())); // ← unwind to catch frame PANICKED.store(true); // ← set AFTER begin_panic returned error!("{}", info); // ← use info directly } }
Two problems here. begin_panic (from the unwinding crate) scans the stack
for catch frames. It unwinds through x64_handle_interrupt's stack frame —
the frame that owns the fmt::Arguments referenced by PanicInfo. After
begin_panic returns (no catch frame found), info.message points into
destroyed stack data. The subsequent error!("{}", info) dereferences a
non-canonical pointer — the second GPF.
And because PANICKED.store(true) was after begin_panic, any exception
during begin_panic's unwinding wouldn't hit the double-panic guard — it
would fall through and try to panic again from scratch, eventually hitting the
second GPF and then the double-panic guard.
The fix: reorder all three operations:
#![allow(unused)] fn main() { fn panic(info: &core::panic::PanicInfo) -> ! { // 1. Disable interrupts immediately. unsafe { core::arch::asm!("cli", options(nomem, nostack, preserves_flags)); } if PANICKED.load(Ordering::SeqCst) { /* double panic */ } // 2. Set PANICKED before begin_panic — any exception during unwinding // is now caught as "double panic" rather than re-entering here. PANICKED.store(true, Ordering::SeqCst); // 3. Capture message NOW, before begin_panic can corrupt info. let mut msg_buf = arrayvec::ArrayString::<512>::new(); let _ = write!(msg_buf, "{}", info); begin_panic(Box::new(alloc::string::String::from(msg_buf.as_str()))); // 4. Use msg_buf from here on, not info. error!("{}", msg_buf.as_str()); // … } }
The cli at the top was already there (from the prior session's fix to prevent
hardware IRQs from firing during panic formatting). The new ordering ensures
that even if begin_panic corrupts the stack, the kernel either exits cleanly
via a catch frame or hits the double-panic guard.
(The to_owned() / to_string() calls fail to compile in no_std without the
trait explicitly in scope; alloc::string::String::from() bypasses that.)
Bug 2: signals never delivered to AP CPUs
Even with the panic handler fixed, tgkill would still fail: the signal was
sent, but the target thread — running on CPU 1, 2, or 3 — never received it.
The interrupt handler dispatches on the vector number:
#![allow(unused)] fn main() { match vec { LAPIC_PREEMPT_VECTOR => { ack_interrupt(); handler().handle_ap_preempt(); // schedules next thread // … (nothing else) } _ if vec >= VECTOR_IRQ_BASE => { ack_interrupt(); handle_irq(irq); // Deliver pending signals when returning to userspace. if frame.cs & 3 != 0 { handler().handle_interrupt_return(&mut pt); // ← try_delivering_signal } } // exceptions … } }
handle_interrupt_return calls try_delivering_signal. It was only in the
hardware IRQ arm.
Hardware timer IRQs (PIT/HPET via IOAPIC) route only to the BSP (CPU 0).
Application Processors only ever receive LAPIC_PREEMPT_VECTOR.
So: a thread running on CPU 1, 2, or 3 would be preempted by the LAPIC timer,
the kernel would schedule the next task, and return to userspace — but
try_delivering_signal was never called. tgkill set the target thread's
signal_pending atomic, but nobody ever checked it on the AP.
The fix is small: copy the signal delivery block into the LAPIC_PREEMPT_VECTOR
arm:
#![allow(unused)] fn main() { LAPIC_PREEMPT_VECTOR => { ack_interrupt(); handler().handle_ap_preempt(); // Deliver pending signals when returning to userspace. // Without this, threads on AP CPUs would never get signals. let cs = frame.cs; if cs & 3 != 0 { let mut pt = PtRegs { /* copy frame fields */ }; handler().handle_interrupt_return(&mut pt); frame.rip = pt.rip; frame.rsp = pt.rsp; // … } } }
With this in place, the LAPIC timer on each AP also checks for pending signals on every return to userspace — exactly as the BSP's hardware timer does.
Results
=== Kevlar M6 Threading Tests ===
PID=1 TID=1 CPUs=1
TEST_PASS thread_create_join
TEST_PASS gettid_unique
TEST_PASS getpid_same
TEST_PASS shared_memory
TEST_PASS atomic_counter
TEST_PASS mutex
TEST_PASS tls
TEST_PASS condvar
TEST_PASS signal_group
TEST_PASS tgkill
TEST_PASS mmap_shared
TEST_PASS fork_from_thread
TEST_END 12/12
Under -smp 4 (TCG), all twelve pass.
What's next
The threading implementation is functionally correct but still has rough edges for a production SMP kernel:
- TLB shootdowns: when one thread unmaps a page, other CPUs still have that mapping cached in their TLBs. Currently safe under TCG (single-threaded emulation), but required before any real hardware or KVM multi-thread workload.
- Per-thread signal pending:
tgkillsets the target'ssignal_pendingatomic, but the delivery races with other threads that share thesignalsArc. A thread could receive a signal intended for its sibling if the sibling checks first. Acceptable for now; fixing it requires splitting the pending bitmask out of the sharedSignalDelivery. pthread_cancel,pthread_barrier,pthread_rwlock: not yet implemented. musl falls back to futex-based implementations, so they may work partially.
The next milestone is TLB shootdown infrastructure — at which point the kernel will be safe to run under KVM with multiple vCPUs exercising real parallelism.
| Phase | Description | Status |
|---|---|---|
| M6 Phase 1 | SMP boot (INIT-SIPI-SIPI, trampoline, MADT) | ✅ Done |
| M6 Phase 2 | Per-CPU run queues + LAPIC timer preemption | ✅ Done |
| M6 Phase 3 | clone(CLONE_VM|CLONE_THREAD), tgid, futex wake-on-exit | ✅ Done |
| M6 Phase 4 | TLB shootdown + SMP thread safety | 🔄 Next |