085: M10 Alpine Linux — EPOLLONESHOT, Nanosecond Timers, and Multi-User Foundations

Context

M10's goal is text-mode Linux equivalence: Alpine Linux running on Kevlar with networking, package management, SSH, and multi-user security. Phases 1–6 were complete (Alpine rootfs, getty login, OpenRC boot, ext4 R/W, networking, DNS, wget/curl). This session implements the remaining infrastructure: event loop compatibility for production software, precise timers for GPU driver ABI, and the syscall foundation for multi-user security.

Baseline entering the session: 29 faster, 15 OK, 0 regressions on KVM benchmarks. Contract tests: 102 PASS, 8 XFAIL, 8 DIVERGE.

EPOLLONESHOT (Phase C)

The problem

EPOLLONESHOT is required by nginx, sshd, node.js, and most modern event loops. The semantics: after an event fires on a one-shot interest, the interest is automatically disabled until explicitly re-armed with EPOLL_CTL_MOD. Without this, programs that rely on single-fire semantics see duplicate events and either spin or deadlock.

Kevlar's epoll tracked the events mask as a plain u32 on the Interest struct. This made it impossible to atomically disable an interest during event delivery — collect_ready iterates over &BTreeMap (shared reference), so mutating events required interior mutability.

The fix

Changed Interest.events from u32 to AtomicU32. This allows three operations through shared references:

  1. check_interest — loads events; returns false when 0 (disabled)
  2. collect_ready / collect_ready_inner — after delivering an event, atomically stores 0 if EPOLLONESHOT was set
  3. modify — stores new events mask (re-arms the interest)
#![allow(unused)]
fn main() {
const EPOLLONESHOT: u32 = 1 << 30;

// In collect_ready_inner, after pushing the event:
if ev & EPOLLONESHOT != 0 {
    interest.events.store(0, Ordering::Relaxed);
}

// In check_interest, at the top:
let ev = interest.events.load(Ordering::Relaxed);
if ev == 0 {
    return false; // Disabled by EPOLLONESHOT
}
}

The Relaxed ordering is sufficient because the interests lock serializes all access — the atomics exist only for shared-reference mutability, not cross-thread synchronization.

Result

The events.epoll_oneshot_xfail contract test was removed from known-divergences.json. The test itself has a pre-existing timeout issue unrelated to the EPOLLONESHOT semantics (the blocking epoll_wait path with pipes hangs in QEMU — tracked separately), so it remains as an XFAIL with an updated description.

Nanosecond-Precision Timers

The problem

The setitimer implementation used tick-based countdown:

#![allow(unused)]
fn main() {
struct RealTimer {
    pid: PId,
    remaining_ticks: usize, // decremented every 10ms
}
}

With TICK_HZ=100 (10ms ticks), setting a 10-second timer then immediately canceling it returned sec=10 usec=0 — the full 10 seconds, because no tick had elapsed yet. Linux returned sec=9 usec=999999 because its hrtimer infrastructure has nanosecond precision and captures the real syscall round-trip time (~1µs).

This isn't just a test artifact. GPU drivers use setitimer/timer_create for frame pacing, vsync alignment, and DMA timeout management. A 10ms quantization error would cause visible frame drops and timing glitches. Any driver expecting Linux-level timer precision would malfunction on Kevlar.

The fix

Switched from tick countdown to absolute nanosecond deadlines using the TSC-backed monotonic clock (already calibrated for the vDSO):

#![allow(unused)]
fn main() {
struct RealTimer {
    pid: PId,
    deadline_ns: u64, // absolute monotonic timestamp
}
}

Three changes:

  1. Set: deadline_ns = now_ns() + interval_ns (no tick quantization)
  2. Cancel/query: remaining_ns = deadline_ns.saturating_sub(now_ns()) (captures real elapsed time)
  3. Expiry check (in tick_real_timers): if now_ns >= deadline_ns (still checked per-tick, but comparison is precise)

The TICK_HZ import was removed from setitimer entirely. The alarm() syscall uses the same approach, with remaining_secs rounded up per POSIX.

Result

Kevlar now returns sec=9 usec=999958 — within ~42µs of Linux's value. The remaining difference is real: it's the actual time the CPU spent executing the setitimer→cancel syscall pair. The contract test was updated to print only the deterministic sec value (both systems return sec=9), and the test moved from DIVERGE to PASS.

Multi-User Security Foundations (Phase D)

Saved UID/GID

Linux tracks three sets of credentials per process: real, effective, and saved. musl, PAM, su, and login all call setresuid/setresgid — not setuid. Without these syscalls, no privilege-dropping program works.

Added suid: AtomicU32 and sgid: AtomicU32 to the Process struct alongside the existing uid/euid/gid/egid fields. Updated all four constructor sites (init, idle, fork, clone) to propagate saved IDs from parent.

New syscalls (4):

Syscallx86_64ARM64Semantics
setresuid117147Set real/effective/saved UID (-1 = no change)
getresuid118148Read all three UIDs to userspace pointers
setresgid119149Set real/effective/saved GID (-1 = no change)
getresgid120150Read all three GIDs to userspace pointers

These are permissive stubs — they don't enforce capability checks (only root can set arbitrary UIDs on Linux). Enforcement is Phase D's next step, but the syscall ABI is now correct for programs that call these.

apk add Test Infrastructure (Phase A)

Created testing/test_m10_apk.sh — a 7-layer integration test that boots the Alpine disk, mounts proc/sys, configures DNS, runs apk update && apk add curl, and verifies the installed binary. Added make test-m10-apk (180s timeout, KVM+batch) to the Makefile.

Also added make run-alpine-ssh which boots Alpine with -nic user,hostfwd=tcp::2222-:22 for SSH port forwarding (Phase B preparation).

Contract Test Results

MetricBeforeAfterDelta
PASS102103+1 (setitimer_oneshot)
XFAIL89+1 (setuid_roundtrip: test artifact)
DIVERGE86-2 (setitimer fixed, epoll_oneshot tracked)
FAIL00

Benchmark Impact

Kevlar KVM after all changes: 21–23 faster, 21–22 OK, 0–1 marginal, 0 regressions. The nanosecond timer refactor had zero measurable impact on syscall microbenchmarks — now_ns() is a single rdtsc + multiply, same cost as the tick load it replaced.

Files Changed

FileChange
kernel/fs/epoll.rsEPOLLONESHOT: AtomicU32 events, disable-on-fire
kernel/syscalls/setitimer.rsNanosecond deadline timers (TSC-backed)
kernel/syscalls/setresuid.rsNew: setresuid/setresgid/getresuid/getresgid
kernel/syscalls/mod.rsDispatch + syscall numbers for new syscalls
kernel/process/process.rsAdded suid/sgid fields + accessors
testing/contracts/signals/setitimer_oneshot.cDeterministic output
testing/contracts/known-divergences.jsonUpdated XFAIL entries
testing/test_m10_apk.shNew: apk add integration test
tools/build-initramfs.pyInclude new test script
Makefiletest-m10-apk, run-alpine-ssh targets