Process & Thread Model

Process Structure

A Process (kernel/process/process.rs) is the unit of resource ownership:

pub struct Process {
    pid: PId,
    tgid: PId,                  // Thread group leader PID
    state: AtomicCell<ProcessState>,
    parent: Weak<Process>,
    children: SpinLock<Vec<Arc<Process>>>,

    // Execution context
    arch: arch::Process,        // Saved registers, kernel stack, xsave FPU area

    // Shared resources (Arc for thread sharing)
    vm: AtomicRefCell<Option<Arc<SpinLock<Vm>>>>,
    opened_files: Arc<SpinLock<OpenedFileTable>>,
    signals: Arc<SpinLock<SignalDelivery>>,
    root_fs: AtomicRefCell<Arc<SpinLock<RootFs>>>,

    // Lock-free signal state
    signal_pending: AtomicU32,  // Mirror of signals.pending for fast-path check
    sigset: AtomicU64,          // Signal mask (lock-free Relaxed ordering)
    signaled_frame: AtomicCell<Option<PtRegs>>,

    // Identity
    uid: AtomicU32, euid: AtomicU32,
    gid: AtomicU32, egid: AtomicU32,
    umask: AtomicCell<u32>,
    nice: AtomicI32,
    comm: SpinLock<Option<Vec<u8>>>,
    cmdline: AtomicRefCell<Cmdline>,

    // Containers
    cgroup: AtomicRefCell<Option<Arc<CgroupNode>>>,
    namespaces: AtomicRefCell<Option<NamespaceSet>>,
    ns_pid: AtomicI32,          // Namespace-local PID

    // Thread support
    clear_child_tid: AtomicUsize,  // CLONE_CHILD_CLEARTID futex address
    vfork_parent: Option<PId>,

    // Accounting
    start_ticks: u64,
    utime: AtomicU64,
    stime: AtomicU64,

    // Diagnostics
    syscall_trace: SyscallTrace, // Lock-free ring buffer of last 32 syscalls
    // ...
}

pub enum ProcessState {
    Runnable,
    BlockedSignalable,
    Stopped(Signal),
    ExitedWith(c_int),
}

Atomic fields (AtomicU32, AtomicU64, AtomicCell) let other CPUs read this state without taking a lock, which is critical for signal delivery and scheduler decisions.
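
The fast-path idea behind signal_pending can be sketched in user-space Rust. This is a minimal model, not the kernel's code: Proc, post_signal, has_pending_signals, and take_signal are hypothetical names, std::sync::Mutex stands in for SpinLock, and the memory orderings are an assumption.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// Simplified stand-in for the locked signal state (hypothetical layout).
pub struct SignalDelivery {
    pending: u32, // bit n set means signal n + 1 is pending
}

pub struct Proc {
    signals: Mutex<SignalDelivery>,
    signal_pending: AtomicU32, // lock-free mirror of signals.pending
}

impl Proc {
    pub fn new() -> Self {
        Proc {
            signals: Mutex::new(SignalDelivery { pending: 0 }),
            signal_pending: AtomicU32::new(0),
        }
    }

    /// Slow path: update the authoritative state under the lock,
    /// then publish the new bitmask to the atomic mirror.
    /// `signo` must be in 1..=32 in this sketch.
    pub fn post_signal(&self, signo: u32) {
        let mut s = self.signals.lock().unwrap();
        s.pending |= 1u32 << (signo - 1);
        self.signal_pending.store(s.pending, Ordering::Release);
    }

    /// Fast path: a single atomic load, no lock taken.
    pub fn has_pending_signals(&self) -> bool {
        self.signal_pending.load(Ordering::Acquire) != 0
    }

    /// Dequeue the lowest-numbered pending signal, if any.
    pub fn take_signal(&self) -> Option<u32> {
        if !self.has_pending_signals() {
            return None; // common case: nothing to do, lock untouched
        }
        let mut s = self.signals.lock().unwrap();
        if s.pending == 0 {
            return None;
        }
        let bit = s.pending.trailing_zeros();
        s.pending &= !(1u32 << bit);
        self.signal_pending.store(s.pending, Ordering::Release);
        Some(bit + 1)
    }
}
```

The point of the mirror is that a return-to-userspace check only pays for one atomic load when no signal is pending.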

Lifecycle

fork

Check cgroup pids.max limit.
Allocate a new PID from the global process table.
Duplicate the page table with copy-on-write (writable pages get refcount bumped, WRITABLE bit cleared in both parent and child).
Copy the xsave FPU area from parent to child (preserves SSE/AVX state).
Clone the open file table, signal handlers, root filesystem, and CWD.
Inherit the parent's cgroup and namespace set; allocate a namespace-local PID.
Enqueue the child on the scheduler; child returns 0, parent returns child PID.

let vm = parent.vm().lock().fork()?;           // CoW page table copy
let opened_files = parent.opened_files().lock().clone();
let child = Arc::new(Process {
    pid, tgid: pid,  // New thread group leader
    vm: Some(Arc::new(SpinLock::new(vm))),
    opened_files: Arc::new(SpinLock::new(opened_files)),
    signals: Arc::new(SpinLock::new(SignalDelivery::new())),
    // ...
});
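
The copy-on-write step of fork can be illustrated with a toy page table. This is a sketch under stated assumptions: Pte, its cow flag, and fork_page_table are hypothetical, the frame refcounts live in a plain HashMap, and a frame missing from the map is treated as having a single owner.

```rust
use std::collections::HashMap;

/// Toy page table entry (hypothetical layout, not the real hardware format).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pte {
    pub frame: usize,   // physical frame number
    pub writable: bool, // hardware WRITABLE bit
    pub cow: bool,      // software bit: a write fault should copy the frame
}

/// Duplicate `parent` for a child. Writable (or already CoW) pages are
/// downgraded to read-only in BOTH tables and their frame refcount is bumped.
/// Pages that were read-only from the start are shared unchanged here.
pub fn fork_page_table(parent: &mut [Pte], refcounts: &mut HashMap<usize, u32>) -> Vec<Pte> {
    let mut child = Vec::with_capacity(parent.len());
    for pte in parent.iter_mut() {
        if pte.writable || pte.cow {
            pte.writable = false;
            pte.cow = true;
            // A frame not yet tracked has exactly one owner (the parent).
            *refcounts.entry(pte.frame).or_insert(1) += 1;
        }
        child.push(*pte);
    }
    child
}
```

A later write fault on a cow page would copy the frame (or simply restore WRITABLE once the refcount drops back to 1); that handler is outside this sketch.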

vfork

Same as fork except:

No page table copy — child shares the parent's address space.
Parent is suspended until the child calls execve or _exit.
Much faster than fork for the common fork+exec pattern.

execve

Parse the ELF binary from the filesystem.
For PIE binaries: choose a base address and apply relocations.
For PT_INTERP (dynamic linking): load the interpreter (ld-musl-*.so.1 or ld-linux-*.so.2) as a second ELF.
Kill all sibling threads (de_thread — POSIX requires execve to terminate all other threads in the thread group).
Reset signal handlers to SIG_DFL (handler addresses are no longer valid).
Rebuild the virtual memory map with ELF PT_LOAD segments.
Push argv, envp, and the auxiliary vector onto the new user stack.
Close O_CLOEXEC file descriptors.
Switch to the new page table and jump to the entry point.

Auxiliary vector entries: AT_ENTRY, AT_BASE, AT_PHDR, AT_PHENT, AT_PHNUM, AT_PAGESZ, AT_UID, AT_GID, AT_EUID, AT_EGID, AT_SECURE, AT_RANDOM, AT_SYSINFO_EHDR, AT_HWCAP, AT_CLKTCK.
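
As an illustration of how such a vector might be assembled, the sketch below builds a list of (tag, value) pairs for a subset of the entries above. The tag numbers follow the Linux ELF ABI; ImageInfo, Creds, and build_auxv are hypothetical names, the 4096-byte page size is assumed, and the AT_SECURE rule is a simplification. The terminating AT_NULL entry is required by the ABI even though it is not listed above.

```rust
// Auxiliary vector tags; numeric values follow the Linux ELF ABI (<elf.h>).
pub const AT_NULL: u64 = 0;
pub const AT_PHDR: u64 = 3;
pub const AT_PHENT: u64 = 4;
pub const AT_PHNUM: u64 = 5;
pub const AT_PAGESZ: u64 = 6;
pub const AT_BASE: u64 = 7;
pub const AT_ENTRY: u64 = 9;
pub const AT_UID: u64 = 11;
pub const AT_EUID: u64 = 12;
pub const AT_GID: u64 = 13;
pub const AT_EGID: u64 = 14;
pub const AT_SECURE: u64 = 23;
pub const AT_RANDOM: u64 = 25;

/// Facts about the loaded image (hypothetical struct for this sketch).
pub struct ImageInfo {
    pub entry: u64,       // program entry point
    pub interp_base: u64, // load base of the interpreter, 0 if static
    pub phdr: u64,        // user address of the program headers
    pub phent: u64,       // size of one program header
    pub phnum: u64,       // number of program headers
    pub random_ptr: u64,  // user address of 16 random bytes
}

pub struct Creds {
    pub uid: u32,
    pub euid: u32,
    pub gid: u32,
    pub egid: u32,
}

/// Build the auxiliary vector as (tag, value) pairs, terminated by AT_NULL.
pub fn build_auxv(img: &ImageInfo, creds: &Creds) -> Vec<(u64, u64)> {
    // Simplification: treat any real/effective ID mismatch as "secure" mode.
    let secure = creds.uid != creds.euid || creds.gid != creds.egid;
    vec![
        (AT_PHDR, img.phdr),
        (AT_PHENT, img.phent),
        (AT_PHNUM, img.phnum),
        (AT_PAGESZ, 4096),
        (AT_BASE, img.interp_base),
        (AT_ENTRY, img.entry),
        (AT_UID, creds.uid as u64),
        (AT_EUID, creds.euid as u64),
        (AT_GID, creds.gid as u64),
        (AT_EGID, creds.egid as u64),
        (AT_SECURE, secure as u64),
        (AT_RANDOM, img.random_ptr),
        (AT_NULL, 0),
    ]
}
```

On the user stack these pairs sit directly above the envp array, each as two machine words.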

exit and wait

On exit(2), the process:

Closes all open files and releases memory.
Reparents children to the subreaper or init (PID 1).
Writes 0 to the address stored in clear_child_tid and wakes the futex waiting on it (this is how pthread_join learns the thread has exited).
Marks itself as a zombie and sends SIGCHLD to its parent.
Wakes the parent's wait queue.

The parent collects the exit status via wait4. If the parent has explicitly set SIGCHLD to SIG_IGN with sigaction (as opposed to leaving the default disposition), its children are reaped automatically and never become zombies (the nocldwait flag).

exit_group kills all sibling threads (same tgid) before exiting.

exit_by_signal

Signal-induced exits collect crash diagnostics:

The last 32 syscalls from the per-process trace ring buffer
The VMA map (up to 64 entries)
Register state at the faulting instruction

These are emitted as structured JSONL debug events before the process terminates with status 128 + signal.
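
A "last 32 syscalls" buffer and the 128 + signal convention can be modelled as follows. This is a single-threaded sketch: the real SyscallTrace is described as lock-free, and TraceEntry, record, snapshot, and exit_status_for_signal are hypothetical names chosen for illustration.

```rust
pub const TRACE_LEN: usize = 32;

/// One recorded syscall (hypothetical fields).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct TraceEntry {
    pub nr: u32,     // syscall number
    pub result: i64, // return value
}

/// Fixed-size ring buffer that always holds the most recent TRACE_LEN entries.
pub struct SyscallTrace {
    entries: [TraceEntry; TRACE_LEN],
    total: usize, // number of entries ever recorded
}

impl SyscallTrace {
    pub fn new() -> Self {
        SyscallTrace { entries: [TraceEntry::default(); TRACE_LEN], total: 0 }
    }

    /// Overwrite the oldest slot once the buffer is full.
    pub fn record(&mut self, entry: TraceEntry) {
        self.entries[self.total % TRACE_LEN] = entry;
        self.total += 1;
    }

    /// Return the retained entries, oldest first.
    pub fn snapshot(&self) -> Vec<TraceEntry> {
        let len = self.total.min(TRACE_LEN);
        let start = self.total - len;
        (start..self.total).map(|i| self.entries[i % TRACE_LEN]).collect()
    }
}

/// Exit status reported for a signal-induced exit, as described above.
pub fn exit_status_for_signal(signo: i32) -> i32 {
    128 + signo
}
```

With this convention a process killed by SIGSEGV (signal 11) terminates with status 139.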

Threads

Threads are created via clone(CLONE_VM | CLONE_THREAD | CLONE_FILES | CLONE_SIGHAND). A thread shares its parent's VM, file descriptor table, and signal handlers, but gets its own PID (which serves as the TID), signal mask, and kernel stack:

pub fn new_thread(parent: &Arc<Process>, ...) -> Result<Arc<Process>> {
    let child = Arc::new(Process {
        pid,                                          // Unique TID
        tgid: parent.tgid,                            // Same thread group
        vm: parent.vm().clone(),                      // SHARED
        opened_files: Arc::clone(&parent.opened_files), // SHARED
        signals: Arc::clone(&parent.signals),         // SHARED handlers
        sigset: AtomicU64::new(parent.sigset_load().bits()), // Independent mask
        // ...
    });
    // ...
}

Thread exit clears clear_child_tid and performs a futex wake, enabling pthread_join to detect thread completion.
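
The difference between a forked child and a thread comes down to whether a resource's Arc is cloned or its contents are copied. The runnable model below shows this for the file table only; Task, FileTable, fork_task, and this simplified new_thread are hypothetical, and std::sync::Mutex stands in for SpinLock.

```rust
use std::sync::{Arc, Mutex};

/// Toy file descriptor table (hypothetical).
#[derive(Clone, Default)]
pub struct FileTable {
    pub fds: Vec<i32>,
}

pub struct Task {
    pub pid: u32,
    pub tgid: u32,
    pub files: Arc<Mutex<FileTable>>,
}

/// fork: the child gets a private COPY of the table and leads a new thread group.
pub fn fork_task(parent: &Task, pid: u32) -> Task {
    let copy = parent.files.lock().unwrap().clone();
    Task { pid, tgid: pid, files: Arc::new(Mutex::new(copy)) }
}

/// thread creation: the child SHARES the same table and joins the parent's group.
pub fn new_thread(parent: &Task, pid: u32) -> Task {
    Task { pid, tgid: parent.tgid, files: Arc::clone(&parent.files) }
}
```

A descriptor opened by the parent after the split is visible through the thread's table but not through the forked child's.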

SMP Scheduler

The scheduler (kernel/process/scheduler.rs) implements per-CPU round-robin with work stealing:

pub const MAX_CPUS: usize = 8;

pub struct Scheduler {
    run_queues: [SpinLock<VecDeque<PId>>; MAX_CPUS],
}

Each CPU has its own run queue. pick_next tries the local queue first for cache warmth, then steals from other CPUs in round-robin order (stealing from the back for fairness):

fn pick_next(&self) -> Option<PId> {
    let local = cpu_id() % MAX_CPUS;
    // Try local queue first
    if let Some(pid) = self.run_queues[local].lock().pop_front() {
        return Some(pid);
    }
    // Work stealing: try other CPUs
    for i in 1..MAX_CPUS {
        let victim = (local + i) % MAX_CPUS;
        if let Some(pid) = self.run_queues[victim].lock().pop_back() {
            return Some(pid);
        }
    }
    None
}
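
The same policy can be exercised outside the kernel with a small user-space model. In this sketch std::sync::Mutex replaces SpinLock, the CPU index is passed in explicitly instead of coming from cpu_id(), and enqueue is a hypothetical helper added so the queues can be filled.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

pub const MAX_CPUS: usize = 8;
pub type PId = i32;

pub struct Scheduler {
    run_queues: [Mutex<VecDeque<PId>>; MAX_CPUS],
}

impl Scheduler {
    pub fn new() -> Self {
        Scheduler {
            run_queues: std::array::from_fn(|_| Mutex::new(VecDeque::new())),
        }
    }

    /// Append a runnable PID to the given CPU's queue (hypothetical helper).
    pub fn enqueue(&self, cpu: usize, pid: PId) {
        self.run_queues[cpu % MAX_CPUS].lock().unwrap().push_back(pid);
    }

    /// Local queue first (front), then steal from the other CPUs (back).
    pub fn pick_next(&self, cpu: usize) -> Option<PId> {
        let local = cpu % MAX_CPUS;
        if let Some(pid) = self.run_queues[local].lock().unwrap().pop_front() {
            return Some(pid);
        }
        for i in 1..MAX_CPUS {
            let victim = (local + i) % MAX_CPUS;
            if let Some(pid) = self.run_queues[victim].lock().unwrap().pop_back() {
                return Some(pid);
            }
        }
        None
    }
}
```

Only one queue lock is held at a time, so two CPUs stealing from each other cannot deadlock.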

Preemption

The LAPIC timer fires at 100 Hz. Every 3 ticks (30 ms), the current process is preempted and rescheduled. The scheduler implements the SchedulerPolicy trait, allowing the algorithm to be replaced without touching the platform crate.
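
The tick arithmetic is simple enough to sketch directly: at 100 Hz one tick is 10 ms, so a 3-tick slice is 30 ms. TickCounter and on_timer_tick are hypothetical names, and keeping one counter per CPU is an assumption of this sketch.

```rust
pub const TIMER_HZ: u32 = 100;     // LAPIC timer frequency
pub const PREEMPT_TICKS: u32 = 3;  // ticks per time slice

/// Length of one time slice in milliseconds.
pub const fn time_slice_ms() -> u32 {
    PREEMPT_TICKS * 1000 / TIMER_HZ
}

/// Per-CPU tick counter (assumed; the real bookkeeping may differ).
pub struct TickCounter {
    ticks: u32,
}

impl TickCounter {
    pub fn new() -> Self {
        TickCounter { ticks: 0 }
    }

    /// Called from the timer interrupt. Returns true when the running
    /// process has used up its slice and should be rescheduled.
    pub fn on_timer_tick(&mut self) -> bool {
        self.ticks += 1;
        if self.ticks >= PREEMPT_TICKS {
            self.ticks = 0;
            true
        } else {
            false
        }
    }
}
```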

Per-CPU State

Each CPU maintains its own:

CURRENT: the currently executing process (Arc<Process>)
IDLE_THREAD: the idle thread (runs hlt when no work is available)
Kernel stack cache: recently freed stacks are reused by fork, so a new stack is likely still warm in L1/L2

Job Control

Processes are organized into process groups and sessions:

setpgid / getpgid: set or query a process's process group
setsid: create a new session (detach from the controlling terminal)
tcsetpgrp / tcgetpgrp: set or get the foreground process group on a TTY

Background processes receive SIGTTOU on terminal write. Ctrl+Z sends SIGTSTP to the foreground group. SIGCONT resumes stopped processes.

cgroups v2

Each process belongs to a cgroup node. The hierarchy is managed via cgroupfs (mounted at /sys/fs/cgroup):

pub struct CgroupNode {
    name: String,
    parent: Option<Weak<CgroupNode>>,
    children: SpinLock<BTreeMap<String, Arc<CgroupNode>>>,
    member_pids: SpinLock<Vec<PId>>,
    pids_max: AtomicI64,       // Enforced: fork returns EAGAIN if exceeded
    memory_max: AtomicI64,     // Stub
    cpu_max_quota: AtomicI64,  // Stub
    cpu_max_period: AtomicI64, // Stub
}

The pids controller is enforced: fork, vfork, and clone check the cgroup's pids.max limit before allocating a PID. Memory and CPU controllers are stubs (accepted but not enforced).

Children inherit their parent's cgroup membership on fork.
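
The pids.max check can be sketched as a single-node admission test. Several things here are assumptions: try_add is a hypothetical method, a negative pids_max is taken to mean "max" (unlimited), only the process's own node is checked rather than its ancestors, and std::sync::Mutex stands in for SpinLock.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;

pub const EAGAIN: i32 = 11; // Linux errno value

pub struct CgroupNode {
    member_pids: Mutex<Vec<i32>>,
    pids_max: AtomicI64, // negative = unlimited (assumed encoding of "max")
}

impl CgroupNode {
    pub fn new(pids_max: i64) -> Self {
        CgroupNode {
            member_pids: Mutex::new(Vec::new()),
            pids_max: AtomicI64::new(pids_max),
        }
    }

    /// Admit `pid` if the node is below its limit, otherwise fail with EAGAIN.
    /// The check and the insert happen under one lock so two concurrent
    /// forks cannot both slip past the limit.
    pub fn try_add(&self, pid: i32) -> Result<(), i32> {
        let mut members = self.member_pids.lock().unwrap();
        let max = self.pids_max.load(Ordering::Relaxed);
        if max >= 0 && members.len() as i64 >= max {
            return Err(EAGAIN);
        }
        members.push(pid);
        Ok(())
    }
}
```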

Namespaces

Three namespace types are implemented:

UTS Namespace

Per-namespace hostname and domainname. Default hostname: "kevlar". Created via clone(CLONE_NEWUTS) or unshare(CLONE_NEWUTS).

PID Namespace

Hierarchical PID isolation. Processes in a non-root PID namespace see namespace-local PIDs starting at 1:

pub struct PidNamespace {
    parent: Option<Arc<PidNamespace>>,
    next_pid: AtomicI32,
    local_to_global: SpinLock<BTreeMap<PId, PId>>,
    global_to_local: SpinLock<BTreeMap<PId, PId>>,
}

getpid() returns ns_pid in non-root namespaces, the global PID otherwise.
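
The two maps translate between namespace-local and global PIDs. A single-threaded sketch of that bookkeeping follows; alloc, to_local, to_global, and the free-standing getpid are hypothetical names, and the parent link and locking are left out.

```rust
use std::collections::BTreeMap;

pub type PId = i32;

/// Simplified PID namespace: no parent link, no locking.
pub struct PidNamespace {
    next_pid: PId,
    local_to_global: BTreeMap<PId, PId>,
    global_to_local: BTreeMap<PId, PId>,
}

impl PidNamespace {
    pub fn new() -> Self {
        PidNamespace {
            next_pid: 1, // the first process in a namespace sees itself as PID 1
            local_to_global: BTreeMap::new(),
            global_to_local: BTreeMap::new(),
        }
    }

    /// Assign the next namespace-local PID to a process with the given global PID.
    pub fn alloc(&mut self, global: PId) -> PId {
        let local = self.next_pid;
        self.next_pid += 1;
        self.local_to_global.insert(local, global);
        self.global_to_local.insert(global, local);
        local
    }

    pub fn to_local(&self, global: PId) -> Option<PId> {
        self.global_to_local.get(&global).copied()
    }

    pub fn to_global(&self, local: PId) -> Option<PId> {
        self.local_to_global.get(&local).copied()
    }
}

/// getpid as described above: the local PID inside a non-root namespace,
/// the global PID otherwise (`ns` is None for the root namespace).
pub fn getpid(global: PId, ns: Option<&PidNamespace>) -> PId {
    ns.and_then(|n| n.to_local(global)).unwrap_or(global)
}
```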

Mount Namespace

Per-namespace mount table. pivot_root is supported for container-style filesystem isolation.

NamespaceSet

pub struct NamespaceSet {
    pub uts: Arc<UtsNamespace>,
    pub pid_ns: Arc<PidNamespace>,
    pub mnt: Arc<MountNamespace>,
}

Namespaces are inherited on fork and can be selectively cloned with CLONE_NEWUTS, CLONE_NEWPID, or CLONE_NEWNS.
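
Selective cloning amounts to choosing, per namespace, between sharing the parent's Arc and creating a new object. The sketch below uses the Linux values of the CLONE_NEW* flags; clone_with_flags, root, and the contents of the namespace structs are hypothetical, and copying the parent's hostname and mount list into a new namespace follows Linux convention rather than anything stated above.

```rust
use std::sync::Arc;

// Flag values as defined by Linux (<sched.h>).
pub const CLONE_NEWNS: u64 = 0x0002_0000;
pub const CLONE_NEWUTS: u64 = 0x0400_0000;
pub const CLONE_NEWPID: u64 = 0x2000_0000;

#[derive(Clone)]
pub struct UtsNamespace {
    pub hostname: String,
}

pub struct PidNamespace {
    pub parent: Option<Arc<PidNamespace>>,
}

#[derive(Clone)]
pub struct MountNamespace {
    pub mounts: Vec<String>,
}

pub struct NamespaceSet {
    pub uts: Arc<UtsNamespace>,
    pub pid_ns: Arc<PidNamespace>,
    pub mnt: Arc<MountNamespace>,
}

impl NamespaceSet {
    /// Initial (root) namespaces.
    pub fn root() -> Self {
        NamespaceSet {
            uts: Arc::new(UtsNamespace { hostname: "kevlar".to_string() }),
            pid_ns: Arc::new(PidNamespace { parent: None }),
            mnt: Arc::new(MountNamespace { mounts: vec!["/".to_string()] }),
        }
    }

    /// Share each namespace unless the matching CLONE_NEW* flag is set.
    pub fn clone_with_flags(&self, flags: u64) -> NamespaceSet {
        NamespaceSet {
            uts: if (flags & CLONE_NEWUTS) != 0 {
                Arc::new((*self.uts).clone()) // new namespace, hostname copied
            } else {
                Arc::clone(&self.uts)
            },
            pid_ns: if (flags & CLONE_NEWPID) != 0 {
                Arc::new(PidNamespace { parent: Some(Arc::clone(&self.pid_ns)) })
            } else {
                Arc::clone(&self.pid_ns)
            },
            mnt: if (flags & CLONE_NEWNS) != 0 {
                Arc::new((*self.mnt).clone()) // new namespace, mount list copied
            } else {
                Arc::clone(&self.mnt)
            },
        }
    }
}
```

A plain fork corresponds to clone_with_flags(0): all three Arcs are shared with the parent.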

Capabilities

Linux capabilities are tracked as a bitmask. prctl(PR_CAP_AMBIENT_*) and capset/capget manipulate the set. Operations requiring root (like mount) check CAP_SYS_ADMIN. prctl(PR_SET_CHILD_SUBREAPER) designates the process as the reaper for orphaned descendants.
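
A capability bitmask and a CAP_SYS_ADMIN check can be sketched as below. CAP_SYS_ADMIN is capability number 21 on Linux; CapSet, its methods, and check_mount_permission are hypothetical names, and a single 64-bit set is used here instead of the separate permitted, effective, inheritable, and ambient sets.

```rust
pub const CAP_SYS_ADMIN: u32 = 21; // Linux capability number
pub const EPERM: i32 = 1;          // Linux errno value

/// One capability set stored as a bitmask (bit n = capability n).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct CapSet(u64);

impl CapSet {
    pub const fn empty() -> Self {
        CapSet(0)
    }

    pub fn raise(&mut self, cap: u32) {
        self.0 |= 1u64 << cap;
    }

    pub fn lower(&mut self, cap: u32) {
        self.0 &= !(1u64 << cap);
    }

    pub fn has(&self, cap: u32) -> bool {
        (self.0 & (1u64 << cap)) != 0
    }
}

/// Permission check a mount-like operation might perform.
pub fn check_mount_permission(effective: CapSet) -> Result<(), i32> {
    if effective.has(CAP_SYS_ADMIN) {
        Ok(())
    } else {
        Err(EPERM)
    }
}
```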

Kevlar Documentation