Phase 3: Tail-Call Elimination and Phi-Copy Coalescing
2026-03-19
Phase 3 ships two independent optimizations. Together they bring LCCC to +41% faster than CCC on register-pressure code and 139× faster on tail-recursive code.
Phase 3a — Tail-Call Elimination
The problem
sum(10000000, 0) takes 1.09 seconds in CCC. LCCC runs it in 8 ms. The difference is 10 million stack frames.
static long sum(int n, long acc) {
if (n <= 0) return acc;
return sum(n - 1, acc + n); // ← tail call: result returned immediately
}
A tail call is a recursive call whose result flows directly to a return — no further computation. The recursive call could be a loop back-edge, but CCC’s backend doesn’t know that.
The fix
TCE (Tail-Call Elimination) runs as an IR-level pass after inlining, before the main optimization loop. It:
- Detects self-recursive calls where the result feeds directly into a
Return - Inserts a loop header block with one Phi node per function parameter
- Renames all parameter references in the function body to the Phi outputs
- Replaces
call sum(n-1, acc+n); return resultwith assignments to the Phi inputs + an unconditional branch to the loop header
The result is a proper SSA loop that LICM, IVSR, and GVN can optimize in the normal pipeline.
Before TCE: After TCE:
loop_header:
entry: %n_cur = Phi [%n from entry, %n_next from body]
... %acc_cur = Phi [%acc from entry, %acc_next from body]
body: body:
%result = Call sum(%n-1, %acc+%n) %n_next = BinOp sub %n_cur, 1
Return %result %acc_next = BinOp add %acc_cur, %n_cur
Branch → loop_header
After LICM and IVSR, the resulting loop compiles to 3 instructions — identical to GCC’s output.
Results
| LCCC | CCC | GCC | |
|---|---|---|---|
sum(10000000, 0) |
8 ms | 1090 ms | 8 ms |
| vs CCC | 139× faster | baseline | 139× faster |
The pass lives in src/passes/tail_call_elim.rs (747 lines, 8 unit tests). Disable with CCC_DISABLE_PASSES=tce.
Phase 3b — Phi-Copy Stack Slot Coalescing
The problem
After Phase 2 (linear scan), arith_loop was 20% faster than CCC. But 20 redundant memory copies were still happening every loop iteration.
Here’s why. CCC’s phi elimination converts each SSA phi node to Copy instructions in predecessor blocks:
SSA form:
loop_header:
%b = Phi [2 from entry, %b_next from backedge]
After phi elimination:
entry: %b = Copy 2 ← initializes b's slot
backedge: %b = Copy %b_next ← updates b's slot for next iteration
Now %b has two definitions (entry + backedge). The existing copy coalescing marks multi-defined values as ineligible, so %b and %b_next get separate stack slots. The backedge copy becomes:
movq -280(%rbp), %rax ; load %b_next from its slot
movq %rax, -104(%rbp) ; store to %b's slot
With 32 loop variables, that’s 22 pairs of redundant movq mem, %rax; movq %rax, mem — every iteration, 10 million times.
The fix
The existing coalescing only merged copies in the forward direction: alias dest → src (dest borrows src’s slot). This is safe when both are in the same block. For phi copies, both conditions fail: dest is multi-defined, and dest’s uses span different blocks.
The key insight: reverse the alias direction. For the phi-copy pattern, alias src → dest instead — %b_next borrows %b’s slot. This is safe because:
%b_nexthas exactly one use: the Copy. After aliasing, that Copy is a same-slot no-op.%b’s slot is already live across all of%b’s uses (including the loop header), so%b_nextwriting to it in the backedge block is fine.%bbeing multi-defined just means the slot gets written in multiple places — correct by design.
The code change in src/backend/stack_layout/copy_coalescing.rs:
// Before: rejected because d (phi dest) is multi-defined
if multi_def_values.contains(&d) || multi_def_values.contains(&s) {
continue;
}
// After: d being multi-defined is OK when it's the slot owner (root)
// Only reject if s (the aliased value) is multi-defined
if multi_def_values.contains(&s) { continue; }
// ... detect phi-copy pattern ...
if src_in_copy_block && dest_cross_block {
raw_aliases.push((s, d)); // reversed: s borrows d's wider-live slot
continue;
}
// Standard coalescing: still requires d to be single-defined
if multi_def_values.contains(&d) { continue; }
The code generator already had a same-slot no-op check in generate_copy:
if ds.0 == ss.0 {
// same slot — update the accumulator cache and return
return;
}
So the “redundant copy” is simply never emitted.
Results
| Before (Phase 2 only) | After (Phase 2 + 3b) | CCC | |
|---|---|---|---|
arith_loop |
0.124s | 0.104s | 0.147s |
| vs CCC | +20% faster | +41% faster | baseline |
| Assembly lines | 550 | 507 | ~550 |
The arith_loop loop body loses ~43 assembly lines — the 22 redundant movq pairs, now gone.
Combined results
| Benchmark | LCCC | CCC | GCC -O2 | LCCC vs CCC |
|---|---|---|---|---|
arith_loop |
0.104s | 0.147s | 0.068s | +41% |
sieve |
0.037s | 0.044s | 0.024s | +19% |
qsort |
0.098s | 0.096s | 0.087s | ≈ equal |
fib(40) |
0.352s | 0.355s | 0.096s | ≈ equal |
matmul |
0.027s | 0.029s | 0.003s | ≈ equal |
tce_sum |
0.008s | 1.09s | 0.008s | 139× |
508 unit tests pass. All outputs are byte-identical to GCC’s.
What’s next
Phase 4 is loop unrolling. The arith_loop inner loop body is still ~350 instructions; unrolling 4× would reduce the branch overhead and expose more ILP to the CPU’s out-of-order engine. Target: close the arith_loop gap from 1.53× to ~1.2× vs GCC.
See the Roadmap for the full plan.