Blog 125: utimes, flock, cgroups PID leak fix, 66/66 Alpine tests pass
Date: 2026-03-28 Milestone: M10 Alpine Linux
Summary
Four major improvements to Alpine Linux compatibility:
- Real file timestamps --
utimes/utimensatnow modify ext4 inode atime/mtime/ctime on disk - Advisory file locking --
flock(2)with per-OFD lock table, contention, and auto-release on close - Socket bind duplicate checking -- TCP/UDP return EADDRINUSE on port conflicts
- Cgroups v2: 4 bugs fixed -- dead PID leak in member_pids, recursive spinlock hold, non-atomic migration, O_CREAT on control files
Test results: 66/66 PASS, 0 FAIL (up from 61).
utimes/utimensat: real file timestamps
The problem
utimes(2) and utimensat(2) were stubs -- they verified the file existed
but never modified timestamps. This broke touch, make (dependency
tracking), and APK's package management metadata.
The fix
Added set_times(atime_secs, mtime_secs) method to the VFS trait hierarchy:
FileLike,Directory,Symlinktraits (with default no-op)INodeenum dispatcher- Ext4 implementation: locks inode, updates atime/mtime/ctime, calls
write_inode()to persist to disk
Rewrote both syscalls:
utimes: parsesstruct timeval[2], callsset_times()utimensat: parsesstruct timespec[2], handlesUTIME_NOW,UTIME_OMIT,AT_SYMLINK_NOFOLLOW, fd-based operation viaCwdOrFd
Uses read_wall_clock().secs_from_epoch() for UTIME_NOW and times==NULL.
Files: libs/kevlar_vfs/src/inode.rs, libs/kevlar_vfs/src/stat.rs,
services/kevlar_ext2/src/lib.rs, kernel/syscalls/utimes.rs,
kernel/syscalls/utimensat.rs
flock(2): advisory file locking
The problem
flock(2) was a no-op stub -- it validated the fd and returned success. APK,
build tools, and databases rely on advisory locking for coordination.
The fix
Global lock table keyed by (dev_id, inode_no) with per-open-file-description
(OFD) tracking. The OFD identity is the raw Arc<OpenedFile> pointer, so
fork'd children sharing the same file description share the lock.
Operations:
LOCK_SH-- shared lock (multiple readers)LOCK_EX-- exclusive lock (single writer)LOCK_UN-- explicit unlockLOCK_NB-- non-blocking (returns EAGAIN on contention)- Upgrade (SH -> EX) and downgrade (EX -> SH) supported
- Auto-release via
DroponOpenedFilewhen last Arc reference drops
Files: kernel/syscalls/flock.rs, kernel/syscalls/mod.rs,
kernel/fs/opened_file.rs
Socket bind duplicate port checking
The problem
TCP and UDP bind() silently allowed duplicate port binds. Services like
nginx, sshd, and dropbear expect EADDRINUSE when a port is taken.
The fix
- TCP: Check
INUSE_ENDPOINTSset before bind, insert on success, remove inDrop - UDP: Reject non-zero port duplicates (random port assignment already
skipped in-use ports), added
Dropimpl to release port and smoltcp handle
Files: kernel/net/tcp_socket.rs, kernel/net/udp_socket.rs
Cgroups v2: 4 bugs fixed
Bug 1 (critical): dead PID leak in member_pids
Process::exit() never removed the dying process's PID from its cgroup's
member_pids list. Dead PIDs accumulated indefinitely, causing:
- Inflated
pids.currentcount cgroup.procslisting dead PIDsrmdirfailing on emptied cgroups (EBUSY)- Fork failures if
pids.maxwas set (EAGAIN from inflated count)
Fix: Added cg.member_pids.lock().retain(|p| *p != current.pid) before
set_state(ExitedWith) in Process::exit().
Bug 2: recursive spinlock hold in count_pids_recursive
count_pids_recursive() held self.children.lock() across recursive calls
into child cgroups. Under concurrent fork + cgroup.procs writes, this created
prolonged lock contention.
Fix: Collect children into a Vec under lock, then release lock before recursing:
#![allow(unused)] fn main() { let children: Vec<Arc<CgroupNode>> = self.children.lock().values().cloned().collect(); children.iter().fold(count, |acc, child| acc + child.count_pids_recursive()) }
Bug 3: non-atomic cgroup.procs migration
Writing to cgroup.procs removed the PID from the old cgroup and added to
the new in two separate lock acquisitions. Between them, the PID was in
neither cgroup.
Fix: Lock both cgroups atomically in pointer order to prevent deadlock:
#![allow(unused)] fn main() { if old_ptr < new_ptr { let mut old_pids = old_cgroup.member_pids.lock(); let mut new_pids = self.node.member_pids.lock(); // migrate atomically } }
Bug 4: O_CREAT on cgroupfs control files returned EPERM
BusyBox shell's echo 0 > cgroup.procs uses open(path, O_WRONLY|O_CREAT|O_TRUNC).
The kernel's open path calls create_file() first when O_CREAT is set. If
it returns EEXIST, open falls through to the existing-file lookup. But
CgroupDir::create_file() returned EPERM unconditionally, which didn't
match EEXIST and propagated as an error.
Fix: Return EEXIST for names that match existing control files or child
cgroup directories.
Bonus: PID 0 -> current process in cgroup.procs write
Writing "0" to cgroup.procs is the standard Linux way to move the current
process. The handler now maps PID 0 to current_process().pid().
Files: kernel/process/process.rs, kernel/cgroups/mod.rs,
kernel/cgroups/cgroupfs.rs
Test results
66/66 PASS, 0 FAIL:
| Category | Tests | Status |
|---|---|---|
| Boot + OpenRC | 3 | PASS |
| Cgroups v2 | 2 | PASS (NEW) |
| APK package management | 3 | PASS |
| curl HTTP/HTTPS | 3 | PASS |
| ext4 filesystem | 18 | PASS |
| File timestamps | 2 | PASS (NEW) |
| Advisory locking | 4 | PASS (NEW) |
| Dynamic linking | 5 | PASS |
| dlopen from C | 6 | PASS |
| mmap integrity | 4 | PASS |
| Long symlinks | 5 | PASS |
| Python 3.12 | 7 | PASS |
| Total | 66 | ALL PASS |
Benchmark results (Kevlar vs Linux KVM)
0 regressions, 23 faster, 21 at parity across 44 micro-benchmarks:
| Benchmark | Kevlar (ns) | Linux (ns) | Ratio | Verdict |
|---|---|---|---|---|
| getpid | 70 | 101 | 0.69x | FASTER |
| gettid | 1 | 102 | 0.01x | FASTER (vDSO) |
| clock_gettime | 12 | 22 | 0.55x | FASTER (vDSO) |
| brk | 6 | 2620 | 0.00x | FASTER |
| mmap_fault | 89 | 1805 | 0.05x | FASTER |
| mmap_munmap | 341 | 1699 | 0.20x | FASTER |
| socketpair | 971 | 2596 | 0.37x | FASTER |
| file_tree | 37337 | 74965 | 0.50x | FASTER |
| open_close | 642 | 792 | 0.81x | FASTER |
| exec_true | 91289 | 111204 | 0.82x | FASTER |
| shell_noop | 121580 | 156343 | 0.78x | FASTER |
| fork_exit | 59456 | 57152 | 1.04x | parity |
| tar_extract | 723270 | 641299 | 1.13x | parity |
Full regression suite
All test suites pass with zero regressions:
| Suite | Tests | Status |
|---|---|---|
| Alpine APK (ext4 + curl + Python + dlopen) | 66/66 | PASS |
| ext4 comprehensive | 42/42 | PASS |
| BusyBox applets | 100/100 | PASS |
| SMP threading (4 CPUs) | 14/14 | PASS |
| SMP regression (mini_systemd) | 16/16 | PASS |
| Cgroups + namespaces | 14/14 | PASS |
| VM contract tests | 20/20 | PASS |
OpenRC boot investigation
With the cgroups fixes, OpenRC itself now boots successfully — all three runlevels (sysinit, boot, default) complete with empty service lists.
However, individual service startup via openrc-run hangs after the
service function completes. The service itself succeeds (e.g., "Setting
hostname ... [ok]") but openrc-run never exits. This affects all services
tested: hostname, cgroups, bootmisc, seedrng.
The hang is NOT caused by:
- fd inheritance (redirecting all fds to /dev/null doesn't help)
- The
timeoutcommand (hang persists without timeout) - cgroups PID accounting (fixed in this session)
- cgroupfs O_CREAT (fixed in this session)
Detailed investigation found two issues:
Issue 1 (FIXED): Pipe close never woke POLL_WAIT_QUEUE. The pipe
implementation only woke its local waitq on state changes, not the global
POLL_WAIT_QUEUE used by poll(2). Added POLL_WAIT_QUEUE.wake_all() to
all 7 pipe wake points (PipeWriter/PipeReader read/write/drop).
Issue 2 (IDENTIFIED): openrc-run self-pipe SIGCHLD pattern. OpenRC uses
posix_spawn (falls back to fork+exec on musl) with a self-pipe pattern:
- Creates
pipe2(signal_pipe, O_CLOEXEC) - Forks child to run service
- SIGCHLD handler in parent calls
waitpid+write(signal_pipe[1]) - Parent does
poll(signal_pipe[0], POLLIN, -1)to detect child exit
The parent blocks in poll() waiting for POLLIN. When SIGCHLD arrives,
poll() returns EINTR, the signal handler runs, writes to the pipe, and
the re-entered poll() sees POLLIN. Syscall tracing confirmed the
openrc-run parent process (running /sbin/openrc) is stuck in an ioctl
loop querying terminal window size — suggesting the SIGCHLD/poll/handler
chain works but a subsequent output formatting step loops.
-
Root cause: cgroupfs
poll()not implemented (FIXED). Instrumented the openrc-run.sh shell script and traced the hang towhile read ... done < cgroup.events. TheCgroupControlFiletype used the defaultFileLike::poll()which returns empty events. BusyBox ash callspoll()on file descriptors before reading from shell redirects (< file). With empty poll events, ash blocks forever. Fix: implementpoll()returningPOLLIN | POLLOUT(matching regular file behavior). -
cgroupfs
read()offset handling (FIXED). The read implementation ignored the offset parameter. Fixed to respect position for sequential reads.
Result: OpenRC now boots Alpine with real services — hostname, cgroups, and seedrng all start successfully.
What's next
- Integrate full OpenRC boot into the main Alpine test suite
- Test more Alpine packages (gcc, make, git, openssh, nginx)
- ARM64 testing with updated kernel