Process & Thread Model
Process Structure
A Process (kernel/process/process.rs) is the unit of resource ownership:
    pub struct Process {
        pid: PId,
        tgid: PId, // Thread group leader PID
        state: AtomicCell<ProcessState>,
        parent: Weak<Process>,
        children: SpinLock<Vec<Arc<Process>>>,

        // Execution context
        arch: arch::Process, // Saved registers, kernel stack, xsave FPU area

        // Shared resources (Arc for thread sharing)
        vm: AtomicRefCell<Option<Arc<SpinLock<Vm>>>>,
        opened_files: Arc<SpinLock<OpenedFileTable>>,
        signals: Arc<SpinLock<SignalDelivery>>,
        root_fs: AtomicRefCell<Arc<SpinLock<RootFs>>>,

        // Lock-free signal state
        signal_pending: AtomicU32, // Mirror of signals.pending for fast-path check
        sigset: AtomicU64, // Signal mask (lock-free Relaxed ordering)
        signaled_frame: AtomicCell<Option<PtRegs>>,

        // Identity
        uid: AtomicU32,
        euid: AtomicU32,
        gid: AtomicU32,
        egid: AtomicU32,
        umask: AtomicCell<u32>,
        nice: AtomicI32,
        comm: SpinLock<Option<Vec<u8>>>,
        cmdline: AtomicRefCell<Cmdline>,

        // Containers
        cgroup: AtomicRefCell<Option<Arc<CgroupNode>>>,
        namespaces: AtomicRefCell<Option<NamespaceSet>>,
        ns_pid: AtomicI32, // Namespace-local PID

        // Thread support
        clear_child_tid: AtomicUsize, // CLONE_CHILD_CLEARTID futex address
        vfork_parent: Option<PId>,

        // Accounting
        start_ticks: u64,
        utime: AtomicU64,
        stime: AtomicU64,

        // Diagnostics
        syscall_trace: SyscallTrace, // Lock-free ring buffer of last 32 syscalls
        // ...
    }

    pub enum ProcessState {
        Runnable,
        BlockedSignalable,
        Stopped(Signal),
        ExitedWith(c_int),
    }
Atomic fields (AtomicU32, AtomicU64, AtomicCell) enable lock-free reads from
other CPUs — critical for signal delivery and scheduler decisions.
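As a rough illustration of the fast-path check, the sketch below pairs a pending-signal mirror with a signal mask, both read without taking a lock. The `SignalFastPath` type, its methods, and the bit encoding (bit n - 1 stands for signal n) are assumptions made for this example, not the kernel's actual API.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Hypothetical stand-in for the two lock-free signal fields of `Process`.
pub struct SignalFastPath {
    signal_pending: AtomicU32, // assumed: bit (n - 1) set => signal n pending
    sigset: AtomicU64,         // assumed: bit (n - 1) set => signal n blocked
}

impl SignalFastPath {
    pub fn new() -> Self {
        Self {
            signal_pending: AtomicU32::new(0),
            sigset: AtomicU64::new(0),
        }
    }

    /// Mark signal `signo` (1..=32 in this sketch) as pending.
    pub fn raise(&self, signo: u32) {
        self.signal_pending
            .fetch_or(1u32 << (signo - 1), Ordering::Release);
    }

    /// Add signal `signo` to the blocked mask.
    pub fn block(&self, signo: u32) {
        self.sigset.fetch_or(1u64 << (signo - 1), Ordering::Relaxed);
    }

    /// Fast path: two atomic loads, no lock. Only when this returns true
    /// would a caller need to take the slower, locked delivery path.
    pub fn has_deliverable(&self) -> bool {
        let pending = u64::from(self.signal_pending.load(Ordering::Acquire));
        let blocked = self.sigset.load(Ordering::Relaxed);
        (pending & !blocked) != 0
    }
}
```

The point of the mirror is that the common case (nothing pending) costs two loads on the return-to-user path instead of a spinlock acquisition.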
Lifecycle
fork
- Check the cgroup pids.max limit.
- Allocate a new PID from the global process table.
- Duplicate the page table with copy-on-write (writable pages get refcount bumped, WRITABLE bit cleared in both parent and child).
- Copy the xsave FPU area from parent to child (preserves SSE/AVX state).
- Clone the open file table, signal handlers, root filesystem, and CWD.
- Inherit the parent's cgroup and namespace set; allocate a namespace-local PID.
- Enqueue the child on the scheduler; child returns 0, parent returns child PID.
    let vm = parent.vm().lock().fork()?; // CoW page table copy
    let opened_files = parent.opened_files().lock().clone();
    let child = Arc::new(Process {
        pid,
        tgid: pid, // New thread group leader
        vm: Some(Arc::new(SpinLock::new(vm))),
        opened_files: Arc::new(SpinLock::new(opened_files)),
        signals: Arc::new(SpinLock::new(SignalDelivery::new())),
        // ...
    });
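The copy-on-write step can be pictured with a toy page table. This is only a sketch under simplifying assumptions: `Pte`, the `WRITABLE` bit value, and the `HashMap` frame refcounts are hypothetical stand-ins for the real page-table and frame-allocator structures.

```rust
use std::collections::HashMap;

pub const PRESENT: u8 = 0b01;  // hypothetical flag bits for this sketch
pub const WRITABLE: u8 = 0b10;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pte {
    pub frame: usize,
    pub flags: u8,
}

/// Build the child's table from the parent's. Writable pages lose the
/// WRITABLE bit in BOTH tables and their frame refcount is bumped, so the
/// first write on either side faults and can copy the page.
pub fn cow_fork(parent: &mut [Pte], refcounts: &mut HashMap<usize, u32>) -> Vec<Pte> {
    let mut child = Vec::with_capacity(parent.len());
    for pte in parent.iter_mut() {
        if (pte.flags & WRITABLE) != 0 {
            pte.flags &= !WRITABLE;
            // A frame not yet tracked is assumed to have one owner (the parent).
            *refcounts.entry(pte.frame).or_insert(1) += 1;
        }
        child.push(*pte);
    }
    child
}
```

Read-only pages are simply shared here; the write-fault handler that performs the actual copy is outside the scope of this sketch.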
vfork
Same as fork except:
- No page table copy — child shares the parent's address space.
- Parent is suspended until the child calls execve or _exit.
- Much faster than fork for the common fork+exec pattern.
execve
- Parse the ELF binary from the filesystem.
- For PIE binaries: choose a base address and apply relocations.
- For PT_INTERP (dynamic linking): load the interpreter (ld-musl-*.so.1 or ld-linux-*.so.2) as a second ELF.
- Kill all sibling threads (de_thread; POSIX requires execve to terminate all other threads in the thread group).
- Reset signal handlers to SIG_DFL (handler addresses are no longer valid).
- Rebuild the virtual memory map with ELF PT_LOAD segments.
- Push argv, envp, and the auxiliary vector onto the new user stack.
- Close O_CLOEXEC file descriptors.
- Switch to the new page table and jump to the entry point.
Auxiliary vector entries: AT_ENTRY, AT_BASE, AT_PHDR, AT_PHENT, AT_PHNUM,
AT_PAGESZ, AT_UID, AT_GID, AT_EUID, AT_EGID, AT_SECURE, AT_RANDOM,
AT_SYSINFO_EHDR, AT_HWCAP, AT_CLKTCK.
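To make the layout concrete, here is a minimal sketch of assembling part of the auxiliary vector as (type, value) pairs terminated by AT_NULL. The AT_* numbers are the standard Linux values; `ImageInfo` and `build_auxv` are hypothetical names for this example, and only a subset of the entries listed above is shown.

```rust
// Standard Linux auxiliary vector type codes.
pub const AT_NULL: u64 = 0;
pub const AT_PHDR: u64 = 3;
pub const AT_PHENT: u64 = 4;
pub const AT_PHNUM: u64 = 5;
pub const AT_PAGESZ: u64 = 6;
pub const AT_BASE: u64 = 7;
pub const AT_ENTRY: u64 = 9;

/// Hypothetical summary of what the ELF loader learned about the image.
pub struct ImageInfo {
    pub phdr: u64,        // user address of the program headers
    pub phent: u64,       // size of one program header entry
    pub phnum: u64,       // number of program headers
    pub interp_base: u64, // load base of the interpreter (0 if static)
    pub entry: u64,       // entry point of the main executable
}

/// Build the (type, value) pairs in the order they would be pushed.
/// The list must end with AT_NULL so user space knows where to stop.
pub fn build_auxv(info: &ImageInfo, page_size: u64) -> Vec<(u64, u64)> {
    vec![
        (AT_PHDR, info.phdr),
        (AT_PHENT, info.phent),
        (AT_PHNUM, info.phnum),
        (AT_PAGESZ, page_size),
        (AT_BASE, info.interp_base),
        (AT_ENTRY, info.entry),
        (AT_NULL, 0),
    ]
}
```

The dynamic linker relies on AT_PHDR, AT_PHNUM, and AT_ENTRY to find and start the main program, which is why they accompany AT_BASE whenever an interpreter is loaded.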
exit and wait
On exit(2), the process:
- Closes all open files and releases memory.
- Reparents children to the subreaper or init (PID 1).
- Clears the clear_child_tid address and wakes the futex (for pthread_join).
- Marks itself as a zombie and sends SIGCHLD to its parent.
- Wakes the parent's wait queue.
The parent collects the exit status via wait4. If the parent set
sigaction(SIGCHLD, SIG_IGN) (explicit ignore, not the default), children are
auto-reaped without becoming zombies (nocldwait flag).
exit_group kills all sibling threads (same tgid) before exiting.
exit_by_signal
Signal-induced exits collect crash diagnostics:
- The last 32 syscalls from the per-process trace ring buffer
- The VMA map (up to 64 entries)
- Register state at the faulting instruction
These are emitted as structured JSONL debug events before the process terminates
with status 128 + signal.
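One plausible shape for the 32-entry syscall trace is a fixed array of atomics indexed by a monotonically increasing counter. The sketch below is an assumption about how such a buffer could work, not the kernel's `SyscallTrace` type; it stores only syscall numbers and tolerates the benign race where a reader sees a slot mid-update.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

pub const TRACE_LEN: usize = 32;

/// Hypothetical lock-free ring holding the most recent syscall numbers.
pub struct SyscallRing {
    next: AtomicUsize,              // total number of records so far
    slots: [AtomicU64; TRACE_LEN],  // slot i holds record number k where k % 32 == i
}

impl SyscallRing {
    pub fn new() -> Self {
        Self {
            next: AtomicUsize::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Record one syscall number, overwriting the oldest entry when full.
    pub fn record(&self, nr: u64) {
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        self.slots[i % TRACE_LEN].store(nr, Ordering::Relaxed);
    }

    /// Return up to the last 32 entries, oldest first.
    pub fn snapshot(&self) -> Vec<u64> {
        let end = self.next.load(Ordering::Relaxed);
        let start = end.saturating_sub(TRACE_LEN);
        (start..end)
            .map(|i| self.slots[i % TRACE_LEN].load(Ordering::Relaxed))
            .collect()
    }
}
```

A crash handler would call `snapshot` once, after the faulting thread has stopped issuing syscalls, so the relaxed ordering is adequate for this diagnostic use.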
Threads
Threads are created via clone(CLONE_VM | CLONE_THREAD | CLONE_FILES | CLONE_SIGHAND).
A thread shares its parent's VM, file descriptor table, and signal handlers, but gets
its own PID (which serves as the TID), signal mask, and kernel stack:
    pub fn new_thread(parent: &Arc<Process>, ...) -> Result<Arc<Process>> {
        let child = Arc::new(Process {
            pid, // Unique TID
            tgid: parent.tgid, // Same thread group
            vm: parent.vm().clone(), // SHARED
            opened_files: Arc::clone(&parent.opened_files), // SHARED
            signals: Arc::clone(&parent.signals), // SHARED handlers
            sigset: AtomicU64::new(parent.sigset_load().bits()), // Independent mask
            // ...
        });
        // ...
    }
Thread exit clears clear_child_tid and performs a futex wake, enabling pthread_join
to detect thread completion.
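The flag combination above is not arbitrary: on Linux, CLONE_THREAD is only valid together with CLONE_SIGHAND, and CLONE_SIGHAND only together with CLONE_VM. The sketch below classifies a clone request under those rules. The flag values are the standard Linux ones; `classify_clone` and `CloneKind` are hypothetical names, and whether this kernel rejects invalid combinations in the same way is an assumption.

```rust
// Standard Linux clone flag values.
pub const CLONE_VM: u64 = 0x0000_0100;
pub const CLONE_FILES: u64 = 0x0000_0400;
pub const CLONE_SIGHAND: u64 = 0x0000_0800;
pub const CLONE_THREAD: u64 = 0x0001_0000;
pub const EINVAL: i32 = 22;

#[derive(Debug, PartialEq)]
pub enum CloneKind {
    Process, // new thread group (tgid == pid)
    Thread,  // joins the caller's thread group
}

/// Decide whether a clone request creates a thread or a process,
/// rejecting flag sets that Linux documents as invalid.
pub fn classify_clone(flags: u64) -> Result<CloneKind, i32> {
    let has = |f: u64| (flags & f) != 0;
    if has(CLONE_THREAD) && !has(CLONE_SIGHAND) {
        return Err(EINVAL);
    }
    if has(CLONE_SIGHAND) && !has(CLONE_VM) {
        return Err(EINVAL);
    }
    Ok(if has(CLONE_THREAD) {
        CloneKind::Thread
    } else {
        CloneKind::Process
    })
}
```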
SMP Scheduler
The scheduler (kernel/process/scheduler.rs) implements per-CPU round-robin with
work stealing:
    pub const MAX_CPUS: usize = 8;

    pub struct Scheduler {
        run_queues: [SpinLock<VecDeque<PId>>; MAX_CPUS],
    }
Each CPU has its own run queue. pick_next tries the local queue first for cache
warmth, then steals from other CPUs in round-robin order (stealing from the back
for fairness):
    fn pick_next(&self) -> Option<PId> {
        let local = cpu_id() % MAX_CPUS;

        // Try local queue first
        if let Some(pid) = self.run_queues[local].lock().pop_front() {
            return Some(pid);
        }

        // Work stealing: try other CPUs
        for i in 1..MAX_CPUS {
            let victim = (local + i) % MAX_CPUS;
            if let Some(pid) = self.run_queues[victim].lock().pop_back() {
                return Some(pid);
            }
        }

        None
    }
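To experiment with the stealing order outside the kernel, the same logic can be modelled in user space. This sketch substitutes `std::sync::Mutex` for `SpinLock`, takes the CPU index as a parameter instead of calling `cpu_id()`, and adds a hypothetical `enqueue` helper; those substitutions are assumptions for the sake of a runnable example.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

pub const MAX_CPUS: usize = 8;
pub type PId = i32; // simplified stand-in for the kernel's PId type

pub struct Scheduler {
    run_queues: [Mutex<VecDeque<PId>>; MAX_CPUS],
}

impl Scheduler {
    pub fn new() -> Self {
        Self {
            run_queues: std::array::from_fn(|_| Mutex::new(VecDeque::new())),
        }
    }

    /// Hypothetical helper: append `pid` to the run queue of `cpu`.
    pub fn enqueue(&self, cpu: usize, pid: PId) {
        self.run_queues[cpu % MAX_CPUS].lock().unwrap().push_back(pid);
    }

    /// Local queue first (front), then steal from the back of the others.
    pub fn pick_next(&self, cpu: usize) -> Option<PId> {
        let local = cpu % MAX_CPUS;
        if let Some(pid) = self.run_queues[local].lock().unwrap().pop_front() {
            return Some(pid);
        }
        for i in 1..MAX_CPUS {
            let victim = (local + i) % MAX_CPUS;
            if let Some(pid) = self.run_queues[victim].lock().unwrap().pop_back() {
                return Some(pid);
            }
        }
        None
    }
}
```

With PIDs 1, 2, 3 queued on CPU 0, an idle CPU 1 steals PID 3 (the most recently queued), while CPU 0 keeps running 1 and 2 in FIFO order.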
Preemption
The LAPIC timer fires at 100 Hz. Every 3 ticks (30 ms), the current process is
preempted and rescheduled. The scheduler implements the SchedulerPolicy trait,
allowing the algorithm to be replaced without touching the platform crate.
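The tick arithmetic works out as follows: at 100 Hz one tick is 10 ms, so a 3-tick slice is 30 ms. A per-CPU counter along these lines would be enough to drive it; `TickCounter` and its method names are hypothetical, and the real timer handler is not shown in this document.

```rust
pub const TIMER_HZ: u32 = 100;    // LAPIC timer frequency
pub const PREEMPT_TICKS: u32 = 3; // ticks per time slice

/// Length of one time slice in milliseconds.
pub const fn time_slice_ms() -> u32 {
    PREEMPT_TICKS * 1000 / TIMER_HZ
}

/// Hypothetical per-CPU tick counter consulted from the timer interrupt.
pub struct TickCounter {
    ticks: u32,
}

impl TickCounter {
    pub const fn new() -> Self {
        Self { ticks: 0 }
    }

    /// Called once per timer interrupt. Returns true when the current
    /// process has used up its slice and should be rescheduled.
    pub fn on_tick(&mut self) -> bool {
        self.ticks += 1;
        if self.ticks >= PREEMPT_TICKS {
            self.ticks = 0;
            true
        } else {
            false
        }
    }
}
```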
Per-CPU State
Each CPU maintains its own:
- CURRENT: the currently executing process (Arc<Process>)
- IDLE_THREAD: the idle thread (runs hlt when no work is available)
- Kernel stack cache for warm L1/L2 allocation during fork
Job Control
Processes are organized into process groups and sessions:
- setpgid/getpgid: set or get a process's process group
- setsid: create a new session (detach from controlling terminal)
- tcsetpgrp/tcgetpgrp: set/get the foreground group on a TTY
Background processes receive SIGTTOU on terminal write. Ctrl+Z sends SIGTSTP to
the foreground group. SIGCONT resumes stopped processes.
cgroups v2
Each process belongs to a cgroup node. The hierarchy is managed via cgroupfs
(mounted at /sys/fs/cgroup):
    pub struct CgroupNode {
        name: String,
        parent: Option<Weak<CgroupNode>>,
        children: SpinLock<BTreeMap<String, Arc<CgroupNode>>>,
        member_pids: SpinLock<Vec<PId>>,
        pids_max: AtomicI64, // Enforced: fork returns EAGAIN if exceeded
        memory_max: AtomicI64, // Stub
        cpu_max_quota: AtomicI64, // Stub
        cpu_max_period: AtomicI64, // Stub
    }
The pids controller is enforced: fork, vfork, and clone check the cgroup's
pids.max limit before allocating a PID. Memory and CPU controllers are stubs
(accepted but not enforced).
Children inherit their parent's cgroup membership on fork.
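A minimal sketch of the pids.max check might look like the following. It makes several assumptions: a separate `pids_current` counter is used instead of the length of `member_pids`, the value -1 encodes "max" (no limit), and only a single node is checked rather than every ancestor in the hierarchy.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

pub const EAGAIN: i32 = 11;
pub const PIDS_UNLIMITED: i64 = -1; // assumed encoding of "max"

/// Hypothetical single-node view of the pids controller.
pub struct PidsController {
    pids_max: AtomicI64,
    pids_current: AtomicI64,
}

impl PidsController {
    pub fn new(max: i64) -> Self {
        Self {
            pids_max: AtomicI64::new(max),
            pids_current: AtomicI64::new(0),
        }
    }

    /// Reserve a slot for a new task before a PID is allocated.
    /// Fails with EAGAIN when the limit is already reached.
    pub fn try_charge(&self) -> Result<(), i32> {
        let max = self.pids_max.load(Ordering::Relaxed);
        let prev = self.pids_current.fetch_add(1, Ordering::AcqRel);
        if max != PIDS_UNLIMITED && prev >= max {
            // Undo the optimistic increment so the count stays accurate.
            self.pids_current.fetch_sub(1, Ordering::AcqRel);
            return Err(EAGAIN);
        }
        Ok(())
    }

    /// Release a slot when a task exits and is reaped.
    pub fn uncharge(&self) {
        self.pids_current.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Charging before PID allocation, as the text describes, means a rejected fork never consumes a PID.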
Namespaces
Three namespace types are implemented:
UTS Namespace
Per-namespace hostname and domainname. Default hostname: "kevlar". Created via
clone(CLONE_NEWUTS) or unshare(CLONE_NEWUTS).
PID Namespace
Hierarchical PID isolation. Processes in a non-root PID namespace see namespace-local PIDs starting at 1:
    pub struct PidNamespace {
        parent: Option<Arc<PidNamespace>>,
        next_pid: AtomicI32,
        local_to_global: SpinLock<BTreeMap<PId, PId>>,
        global_to_local: SpinLock<BTreeMap<PId, PId>>,
    }
getpid() returns ns_pid in non-root namespaces, the global PID otherwise.
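The two maps exist so that translation is cheap in both directions. A simplified, single-threaded model (plain fields instead of atomics and spinlocks, with hypothetical method names `register`, `to_local`, and `to_global`) shows the bookkeeping:

```rust
use std::collections::BTreeMap;

pub type PId = i32; // simplified stand-in for the kernel's PId type

/// Single-threaded sketch of a PID namespace's translation tables.
pub struct PidNamespace {
    next_pid: PId,
    local_to_global: BTreeMap<PId, PId>,
    global_to_local: BTreeMap<PId, PId>,
}

impl PidNamespace {
    pub fn new() -> Self {
        Self {
            next_pid: 1, // the first process in a namespace sees itself as PID 1
            local_to_global: BTreeMap::new(),
            global_to_local: BTreeMap::new(),
        }
    }

    /// Assign the next namespace-local PID to a globally numbered process.
    pub fn register(&mut self, global: PId) -> PId {
        let local = self.next_pid;
        self.next_pid += 1;
        self.local_to_global.insert(local, global);
        self.global_to_local.insert(global, local);
        local
    }

    /// What a process inside the namespace would see for `global`.
    pub fn to_local(&self, global: PId) -> Option<PId> {
        self.global_to_local.get(&global).copied()
    }

    /// Resolve a PID supplied from inside the namespace.
    pub fn to_global(&self, local: PId) -> Option<PId> {
        self.local_to_global.get(&local).copied()
    }
}
```

A lookup that returns `None` corresponds to a process that is not visible from inside the namespace.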
Mount Namespace
Per-namespace mount table. pivot_root is supported for container-style filesystem
isolation.
NamespaceSet
    pub struct NamespaceSet {
        pub uts: Arc<UtsNamespace>,
        pub pid_ns: Arc<PidNamespace>,
        pub mnt: Arc<MountNamespace>,
    }
Namespaces are inherited on fork and can be selectively cloned with
CLONE_NEWUTS, CLONE_NEWPID, or CLONE_NEWNS.
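Selective cloning comes down to choosing, per namespace, between sharing the parent's `Arc` and allocating a fresh one. The sketch below uses the standard Linux flag values; the placeholder namespace types and the `clone_with_flags` method are assumptions for illustration, and the choice to copy the UTS and mount state into the new namespace follows Linux behaviour rather than anything stated here.

```rust
use std::sync::Arc;

// Standard Linux clone flag values.
pub const CLONE_NEWNS: u64 = 0x0002_0000;
pub const CLONE_NEWUTS: u64 = 0x0400_0000;
pub const CLONE_NEWPID: u64 = 0x2000_0000;

// Placeholder namespace types for this sketch.
#[derive(Clone)]
pub struct UtsNamespace {
    pub hostname: String,
}
#[derive(Default)]
pub struct PidNamespace;
#[derive(Clone, Default)]
pub struct MountNamespace;

pub struct NamespaceSet {
    pub uts: Arc<UtsNamespace>,
    pub pid_ns: Arc<PidNamespace>,
    pub mnt: Arc<MountNamespace>,
}

impl NamespaceSet {
    /// The initial namespace set, with the default hostname.
    pub fn root() -> Self {
        Self {
            uts: Arc::new(UtsNamespace { hostname: "kevlar".to_string() }),
            pid_ns: Arc::new(PidNamespace),
            mnt: Arc::new(MountNamespace),
        }
    }

    /// Share every namespace with the parent unless its flag asks for a new one.
    pub fn clone_with_flags(&self, flags: u64) -> Self {
        Self {
            uts: if (flags & CLONE_NEWUTS) != 0 {
                Arc::new((*self.uts).clone()) // new namespace starts as a copy
            } else {
                Arc::clone(&self.uts)
            },
            pid_ns: if (flags & CLONE_NEWPID) != 0 {
                Arc::new(PidNamespace::default()) // new, empty PID namespace
            } else {
                Arc::clone(&self.pid_ns)
            },
            mnt: if (flags & CLONE_NEWNS) != 0 {
                Arc::new((*self.mnt).clone()) // new namespace starts as a copy
            } else {
                Arc::clone(&self.mnt)
            },
        }
    }
}
```

A plain fork corresponds to `clone_with_flags(0)`: all three `Arc`s are shared with the parent.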
Capabilities
Linux capabilities are tracked as a bitmask. prctl(PR_CAP_AMBIENT_*) and
capset/capget manipulate the set. Operations requiring root (like mount)
check CAP_SYS_ADMIN. prctl(PR_SET_CHILD_SUBREAPER) designates the process
as the reaper for orphaned descendants.
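As an illustration of the bitmask representation, a capability check reduces to testing one bit. CAP_SYS_ADMIN is capability number 21 on Linux; the `CapSet` type and `check_mount_permission` function are hypothetical names for this sketch, and the kernel's real structure (for example, separate effective and permitted sets) may differ.

```rust
pub const CAP_SYS_ADMIN: u32 = 21; // Linux capability number
pub const EPERM: i32 = 1;

/// Hypothetical capability set: bit n set => capability n held.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct CapSet(u64);

impl CapSet {
    pub const fn empty() -> Self {
        CapSet(0)
    }

    /// Return a copy of the set with `cap` added.
    pub const fn with(self, cap: u32) -> Self {
        CapSet(self.0 | (1u64 << cap))
    }

    /// Return a copy of the set with `cap` removed.
    pub const fn without(self, cap: u32) -> Self {
        CapSet(self.0 & !(1u64 << cap))
    }

    pub const fn has(self, cap: u32) -> bool {
        (self.0 & (1u64 << cap)) != 0
    }
}

/// Gate a privileged operation such as mount on CAP_SYS_ADMIN.
pub fn check_mount_permission(caps: CapSet) -> Result<(), i32> {
    if caps.has(CAP_SYS_ADMIN) {
        Ok(())
    } else {
        Err(EPERM)
    }
}
```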