Event Source FDs: Filling the Epoll Loop
Blog 018 gave Kevlar an epoll event loop. But an empty loop is useless — systemd needs event sources to monitor. This post covers the three fd types that systemd plugs into epoll before it does anything else: signalfd, timerfd, and eventfd.
eventfd: the simplest possible IPC
An eventfd is a counter wrapped in a file descriptor. Write adds to the counter, read returns it and resets to zero. Poll reports POLLIN when the counter is non-zero. systemd uses this for internal wake-up signaling between components.
```rust
pub struct EventFd {
    inner: SpinLock<EventFdInner>,
}

struct EventFdInner {
    counter: u64,
    semaphore: bool, // EFD_SEMAPHORE: read returns 1, decrements
}
```
The implementation follows the same pattern as pipes: the fast path tries the
operation under the lock, then falls back to POLL_WAIT_QUEUE.sleep_signalable_until
for blocking. Write blocks only if the addition would push the counter past
u64::MAX - 1 (effectively never in practice).
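To make the counter semantics concrete, here is a minimal userspace model of the read/write behavior described above. This is an illustrative sketch, not the kernel code: it swaps the kernel's SpinLock for std's Mutex, returns Err where the real fd would sleep on the wait queue, and the EventFdModel name is made up.

```rust
use std::sync::Mutex;

// Userspace model of the eventfd counter semantics (illustrative only).
struct EventFdModel {
    inner: Mutex<(u64, bool)>, // (counter, semaphore mode)
}

impl EventFdModel {
    fn new(semaphore: bool) -> Self {
        Self { inner: Mutex::new((0, semaphore)) }
    }

    // write(2): add to the counter; fail where the real fd would block,
    // i.e. when the sum would exceed u64::MAX - 1.
    fn write(&self, val: u64) -> Result<(), ()> {
        let mut g = self.inner.lock().unwrap();
        match g.0.checked_add(val) {
            Some(sum) if sum <= u64::MAX - 1 => {
                g.0 = sum;
                Ok(())
            }
            _ => Err(()), // caller would sleep on the poll wait queue
        }
    }

    // read(2): return the counter and reset it; in EFD_SEMAPHORE mode,
    // return 1 and decrement by 1.
    fn read(&self) -> Result<u64, ()> {
        let mut g = self.inner.lock().unwrap();
        if g.0 == 0 {
            return Err(()); // counter is zero: read would block
        }
        if g.1 {
            g.0 -= 1;
            Ok(1)
        } else {
            Ok(std::mem::replace(&mut g.0, 0))
        }
    }
}

fn main() {
    let efd = EventFdModel::new(false);
    efd.write(3).unwrap();
    efd.write(4).unwrap();
    assert_eq!(efd.read(), Ok(7)); // read returns the total, resets to zero
    assert!(efd.read().is_err());  // now empty: read would block

    let sem = EventFdModel::new(true);
    sem.write(2).unwrap();
    assert_eq!(sem.read(), Ok(1)); // EFD_SEMAPHORE: one unit per read
    assert_eq!(sem.read(), Ok(1));
    println!("ok");
}
```

Two writes of 3 and 4 yield a single read of 7 in normal mode, while semaphore mode hands out one unit per read — the distinction POLLIN readiness doesn't show.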
timerfd: lazy expiration checking
A timerfd becomes readable when a deadline passes. systemd uses this for scheduled service starts, watchdog timers, and rate limiting.
The obvious implementation would hook into the timer interrupt to check
armed timerfds on every tick. We chose a simpler approach: lazy
evaluation. The timerfd stores an absolute nanosecond deadline, and
poll()/read() compare it against the current monotonic clock:
```rust
fn check_expiry(inner: &mut TimerFdInner) {
    if inner.next_fire_ns == 0 {
        return; // disarmed
    }
    let now_ns = timer::read_monotonic_clock().nanosecs() as u64;
    if now_ns < inner.next_fire_ns {
        return; // not yet
    }
    if inner.interval_ns > 0 {
        // Periodic: count elapsed intervals
        let elapsed = now_ns - inner.next_fire_ns;
        let extra = elapsed / inner.interval_ns;
        inner.expirations += 1 + extra;
        inner.next_fire_ns += (1 + extra) * inner.interval_ns;
    } else {
        // One-shot
        inner.expirations += 1;
        inner.next_fire_ns = 0;
    }
}
```
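The periodic branch is worth a worked example. This standalone sketch extracts just the catch-up arithmetic (field names follow the post; the timestamps are made up) and shows what happens when the caller polls late:

```rust
// Catch-up arithmetic from the periodic branch of check_expiry,
// pulled out as a pure function for illustration.
fn catch_up(next_fire_ns: u64, interval_ns: u64, now_ns: u64) -> (u64, u64) {
    let elapsed = now_ns - next_fire_ns;  // how far past the deadline we are
    let extra = elapsed / interval_ns;    // whole extra intervals missed
    let expirations = 1 + extra;          // the deadline itself, plus the missed ones
    let new_deadline = next_fire_ns + expirations * interval_ns;
    (expirations, new_deadline)
}

fn main() {
    // Deadline was t=100 with a 10-unit period, and we check at t=135:
    // the timer "fired" at 100, 110, 120, and 130 -> 4 expirations,
    // and the next deadline advances past `now` to 140.
    let (exp, next) = catch_up(100, 10, 135);
    assert_eq!((exp, next), (4, 140));
    println!("expirations={exp} next={next}");
}
```

Because the new deadline always lands strictly after `now`, a single lazy check accounts for any number of missed periods — no per-tick bookkeeping required.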
This is correct because epoll_wait re-polls all interested fds on every wakeup. The question is: what causes the wakeup? Without something periodically nudging the wait queue, a sleeping epoll_wait would never notice the timer expired.
The fix: handle_timer_irq() now calls POLL_WAIT_QUEUE.wake_all() on
every tick (100 Hz on x86_64). This costs one atomic load per tick when
nobody is waiting (the fast path checks waiter_count), and at most one
reschedule per tick when someone is. This also fixes a latent bug where
poll()/select() timeouts were unreliable — they depended on some other
event waking the queue.
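The fast-path claim above can be sketched as follows. This is a hypothetical shape, not the real POLL_WAIT_QUEUE (which parks actual threads via the scheduler); it only demonstrates how an atomic waiter count keeps the no-waiter tick cheap:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Sketch of a wake_all fast path guarded by an atomic waiter count.
// `u64` stands in for a sleeping thread handle.
struct PollWaitQueue {
    waiter_count: AtomicUsize,     // read lock-free on the IRQ path
    waiters: Mutex<VecDeque<u64>>, // touched only on the slow path
}

impl PollWaitQueue {
    fn wake_all(&self) -> usize {
        // Fast path: one atomic load per tick when nobody is waiting.
        if self.waiter_count.load(Ordering::Acquire) == 0 {
            return 0;
        }
        // Slow path: drain the queue; the kernel would reschedule each waiter.
        let mut q = self.waiters.lock().unwrap();
        let n = q.len();
        q.clear();
        self.waiter_count.store(0, Ordering::Release);
        n
    }
}

fn main() {
    let q = PollWaitQueue {
        waiter_count: AtomicUsize::new(0),
        waiters: Mutex::new(VecDeque::new()),
    };
    assert_eq!(q.wake_all(), 0); // empty queue: just the atomic load

    q.waiters.lock().unwrap().push_back(42);
    q.waiter_count.store(1, Ordering::Release);
    assert_eq!(q.wake_all(), 1); // one waiter drained
    println!("ok");
}
```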
signalfd: zero modifications to signal delivery
signalfd was the design challenge. systemd uses it to handle SIGCHLD, SIGTERM, and SIGHUP through epoll instead of signal handlers. The normal approach would intercept signal delivery, check if a signalfd is watching, and redirect the signal. This would require threading signalfd state through the signal delivery path.
We chose a simpler design: don't touch signal delivery at all. The
user blocks signals via sigprocmask, creates a signalfd with the same
mask, and adds it to epoll. Blocked signals accumulate in the process's
existing pending bitmask. The signalfd's poll() and read() simply
check this bitmask:
```rust
fn poll(&self) -> Result<PollStatus> {
    let pending = current_process().signal_pending_bits();
    if pending & self.mask != 0 {
        Ok(PollStatus::POLLIN)
    } else {
        Ok(PollStatus::empty())
    }
}
```
On read, pop_pending_masked(mask) atomically dequeues matching signals
and fills in 128-byte signalfd_siginfo structs. No new data
structures, no hooks, no coordination — just reading from state that
already exists.
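A minimal model of that read path, assuming the post's bitmask convention (bit n-1 represents signal n) and standard x86_64 Linux signal numbers; this is a standalone sketch of the behavior, not the kernel's pop_pending_masked:

```rust
// Model of pop_pending_masked: consume the pending signals that match
// the signalfd's mask, leaving non-matching signals queued.
fn pop_pending_masked(pending: &mut u64, mask: u64) -> Vec<u32> {
    let taken = *pending & mask;
    *pending &= !mask; // matching signals are dequeued
    (0..64).filter(|b| taken & (1 << b) != 0).map(|b| b + 1).collect()
}

fn main() {
    const SIGHUP: u64 = 1 << 0;   // signal 1
    const SIGTERM: u64 = 1 << 14; // signal 15
    const SIGCHLD: u64 = 1 << 16; // signal 17 on x86_64
    let mut pending = SIGHUP | SIGTERM | SIGCHLD;

    // A signalfd watching only SIGCHLD consumes just that signal...
    let got = pop_pending_masked(&mut pending, SIGCHLD);
    assert_eq!(got, vec![17]);
    // ...while SIGHUP and SIGTERM stay pending for other consumers.
    assert_eq!(pending, SIGHUP | SIGTERM);
    println!("{got:?}");
}
```

In the real read path, each returned signal number would then be expanded into a 128-byte signalfd_siginfo struct before being copied to the caller's buffer.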
For epoll to notice new signals promptly, send_signal() now calls
POLL_WAIT_QUEUE.wake_all() after queuing a signal.
Fixing a signal delivery bug
While implementing signalfd, we found a bug in try_delivering_signal.
The old code called pop_pending() which unconditionally removed the
lowest-numbered pending signal, then checked if it was blocked:
```rust
// BEFORE (buggy): blocked signals are popped and silently discarded
let (signal, action) = sigs.pop_pending();
if !sigset.is_blocked(signal) {
    // deliver
}
// If blocked: signal is gone forever
```
The fix: pop_pending_unblocked(sigset) only pops signals that aren't
in the blocked set. Blocked signals remain pending for signalfd to
consume or for later delivery when unblocked.
We also fixed has_pending_signals() — used by sleep_signalable_until
to decide whether to return EINTR — to check pending & ~blocked
instead of just pending != 0. Without this, blocked signals would
cause spurious EINTR returns from every blocking syscall.
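Both fixes reduce to the same bitmask predicate. The sketch below models them as pure functions (same bit convention as before: bit n-1 is signal n; the real code operates on per-process state under lock):

```rust
// Model of the fixed pop: only signals outside the blocked set are
// eligible, lowest-numbered first. Blocked signals stay queued.
fn pop_pending_unblocked(pending: &mut u64, blocked: u64) -> Option<u32> {
    let deliverable = *pending & !blocked;
    if deliverable == 0 {
        return None; // nothing deliverable; blocked signals are NOT discarded
    }
    let bit = deliverable.trailing_zeros(); // lowest-numbered signal
    *pending &= !(1u64 << bit);
    Some(bit + 1)
}

// Model of the fixed has_pending_signals: a blocked-but-pending signal
// must not make sleep_signalable_until return EINTR.
fn has_pending_signals(pending: u64, blocked: u64) -> bool {
    pending & !blocked != 0
}

fn main() {
    let blocked = 1u64 << 16;                // SIGCHLD (17) blocked for signalfd
    let mut pending = (1 << 16) | (1 << 14); // SIGCHLD and SIGTERM (15) pending

    // SIGTERM is delivered; SIGCHLD survives the pop instead of vanishing.
    assert_eq!(pop_pending_unblocked(&mut pending, blocked), Some(15));
    assert_eq!(pop_pending_unblocked(&mut pending, blocked), None);
    assert_eq!(pending, 1 << 16);

    // Only the blocked signal remains: no spurious EINTR.
    assert!(!has_pending_signals(pending, blocked));
    println!("ok");
}
```

With the old pop_pending, the first call here would have consumed SIGCHLD (the lowest-numbered signal) and thrown it away; the fixed version leaves it pending for the signalfd to read.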
What's next
With epoll + signalfd + timerfd + eventfd, Kevlar has the complete I/O multiplexing substrate for systemd's main loop. Phase 3 tackles Unix domain sockets — the transport layer for D-Bus, which systemd uses for inter-process communication with every service it manages.