Optimization Passes
LCCC inherits CCC’s full optimizer. All optimization levels (-O0 through -O3, -Os, -Oz) run the same pipeline — by design.
Why One Pipeline?
Having separate optimization tiers creates distinct code paths through the optimizer. Bugs that only appear at -O2 can hide for months if most testing uses -O0. CCC and LCCC prioritize correctness over partial speedups: by always running all passes, every test exercises the full optimizer, and divergences are caught early.
The -O flags still control the __OPTIMIZE__ and __OPTIMIZE_SIZE__ preprocessor macros, which some projects (like the Linux kernel) use for conditional compilation.
Pass Order
The optimizer runs up to three full iterations. Each iteration runs this pipeline:
1. CFG simplification — remove dead blocks, thread jump chains, simplify branches
2. Copy propagation — replace uses of copies with original values
2a. Division-by-constant — replace idiv/div with multiply-shift (iter 0 only)
2b. Integer narrowing — shrink operation widths (e.g. i64 add → i32 add)
3. Algebraic simplification — strength reduction, identity removal
4. Constant folding — evaluate constant expressions at compile time
5. GVN + LICM + IVSR — shared CFG analysis (computed once per function):
├── GVN — global value numbering / CSE across dominated blocks
├── LICM — loop-invariant code motion
└── IVSR — induction variable strength reduction
7. If-conversion — convert branch+phi diamonds to cmov/Select
8. Copy propagation — clean up copies created by GVN, simplify, LICM
9. Dead code elimination — remove dead instructions (excluded from convergence check)
10. CFG simplification — clean up after DCE empties blocks
10.5 IPCP — interprocedural constant propagation (all iterations)
Phase 0 (before the loop):
- Tail-call elimination (TCE) — converts self-recursive tail calls to loops
- Function inlining
- mem2reg
- Initial constant-fold/copy-prop round
- Vectorization (iteration 0 only) — transforms reduction loops to AVX2/SSE2 SIMD
Phase 11 (after the loop): dead static function elimination — removes static inline functions that became unreferenced after inlining.
Convergence
Iterations stop when:
- No pass made any change (fixpoint reached), or
- This iteration made fewer than 1/20th the changes of the first iteration (diminishing returns).
DCE changes are excluded from the convergence check because DCE’s large removal counts inflate the first-iteration baseline and cause premature exit.
Pass Interactions
The passes form a dependency graph. LCCC uses a should_run! macro to skip passes when their upstream passes made no changes in the previous iteration:
| Pass | Only re-runs if… |
|---|---|
| CFG simplify | constfold changed (constant branches) or DCE changed (empty blocks) |
| Copy prop | CFG, GVN, LICM, or if-convert changed |
| Simplify | Copy prop or narrowing changed |
| Constfold | Copy prop, narrowing, simplify, or if-convert changed |
| GVN | CFG, copy prop, or simplify changed |
| LICM | CFG, copy prop, or GVN changed |
| DCE | GVN, LICM, if-convert, or copy prop2 changed |
Key Passes
GVN (Global Value Numbering)
Eliminates redundant computations across basic blocks. If two blocks both compute a + b where a and b have the same value, GVN replaces the second with a copy of the first. Operates on BinOp, UnaryOp, Cmp, Cast, GetElementPtr, and Load instructions within the dominator tree.
GVN, LICM, and IVSR share a single CfgAnalysis (dominator tree + loop nesting) per function per iteration, saving significant compile time on large translation units.
LICM (Loop-Invariant Code Motion)
Hoists computations out of loops when their operands don’t change within the loop. Identifies natural loops via the dominator tree, then moves invariant instructions to loop preheaders. Critical for inner-loop performance on code like matrix multiply.
IPCP (Interprocedural Constant Propagation)
When a function is always called with the same constant argument, IPCP specializes the function for that constant and folds the resulting dead branches. Important for Linux kernel code (IS_ENABLED(), cpucap_is_possible() chains) where static inline wrappers gate large blocks of dead code.
Division-by-Constant
Replaces idiv/div instructions (20–90 cycle latency on modern CPUs) with multiply-and-shift sequences. Runs only on the first iteration, before narrowing and constant folding can further simplify the expanded sequence. Disabled on i686 where 64-bit multiply overflow semantics differ.
Disabling Passes
For debugging, individual passes can be disabled:
CCC_DISABLE_PASSES="gvn,licm" ./target/release/ccc input.c -o output
CCC_DISABLE_PASSES="all" ./target/release/ccc input.c -o output
Pass names: cfg, copyprop, narrow, simplify, constfold, gvn, licm, ifconv, dce, ipcp, inline, ivsr, divconst.
Timing data is available via:
CCC_TIME_PASSES=1 ./target/release/ccc input.c -o output 2>&1 | grep PASS
LCCC-Specific Passes
LCCC adds two optimization passes that run before CCC’s main optimizer loop.
Tail-Call Elimination (tce)
Converts self-recursive tail calls to back-edge branches. A tail call is a recursive call whose result is returned immediately — return f(args) with no further computation.
// Before: 10M stack frames
long sum(int n, long acc) {
if (n <= 0) return acc;
return sum(n - 1, acc + n);
}
// After TCE: tight counted loop (identical to GCC output)
long sum(int n, long acc) {
loop:
if (n <= 0) return acc;
acc += n; n -= 1; goto loop;
}
TCE runs once after inlining, before the main optimization loop, so that LICM, IVSR, and GVN can subsequently optimize the resulting loop.
Pass name: tce (disable with CCC_DISABLE_PASSES=tce)
Implementation: src/passes/tail_call_elim.rs
Phi-Copy Stack Slot Coalescing (backend)
This is a backend optimization in src/backend/stack_layout/copy_coalescing.rs, not a pass in the traditional sense. It runs during stack layout, before code generation.
When CCC’s phi elimination lowers SSA phi nodes to Copy instructions, it creates separate stack slots for the phi destination and its backedge update value. For a 32-variable loop, this generates ~20 redundant stack-to-stack movq pairs per iteration.
LCCC detects the phi-copy pattern — where the source is defined and killed in the backedge block — and aliases the source to use the phi destination’s wider-live slot. The Copy becomes a same-slot no-op and is dropped by generate_copy.
Result: arith_loop (32 variables): 550 → 507 assembly lines; 0.124s → 0.104s.
LCCC-Specific: Reduction Vectorization
Added in Phase 8 — LCCC detects and transforms reduction loops into AVX2/SSE2 SIMD operations.
What Gets Vectorized
Simple reduction patterns:
// Sum reduction
double sum = 0.0;
for (int i = 0; i < n; i++)
sum += arr[i];
// Dot product
double dot = 0.0;
for (int i = 0; i < n; i++)
dot += a[i] * b[i];
Transformation Strategy
- Pattern detection: Identifies loops with a scalar accumulator PHI and a single reduction operation
- Loop splitting: Divides loop bound by vector width (4 for AVX2, 2 for SSE2)
- Vector body: Replaces scalar ops with vector intrinsics (VecLoad, VecAdd, VecMul)
- Horizontal reduction: Extracts scalar from final vector (
vextractf128+vunpckhpd+vaddsd) - Remainder loop: Handles
N % vec_width != 0with scalar operations - Correct return: Exit block returns scalar from remainder loop, not vector from main loop
Backend Implementation
Vector values are treated as first-class SSA values that:
- Get unique, never-reused stack slots (protected from slot recycler)
- Are excluded from GPR allocation (forced to stack)
- Use direct slot access in intrinsics (no pointer indirection)
- Support vector-to-vector Copy via ymm/xmm registers
Generated Code (AVX2 Example)
vxorpd %ymm0, %ymm0, %ymm0 # Zero vector accumulator
.loop:
vmovupd (%rax,%rcx), %ymm0 # Load 4 doubles
vaddpd %ymm1, %ymm0, %ymm0 # Add 4 doubles
; loop back...
; Horizontal reduction
vextractf128 $1, %ymm0, %xmm1 # Extract high 128 bits
vaddpd %xmm1, %xmm0, %xmm0 # Add high + low (2 doubles each)
vunpckhpd %xmm0, %xmm0, %xmm1 # Unpack high double
vaddsd %xmm1, %xmm0, %xmm0 # Final scalar
; Remainder loop (scalar)
.remainder:
movsd (%rbx,%r13,8), %xmm0 # Load single element
addsd -24(%rbp), %xmm0 # Add to scalar accumulator
; loop back...
; Return scalar result
movsd -24(%rbp), %xmm0 # Return scalar (not vector!)
Why It Beats GCC
GCC’s auto-vectorizer is conservative on simple reductions:
- Worries about aliasing even with clear array indexing
- Considers the pattern “too simple” to benefit
- Falls back to 2× scalar loop unrolling
LCCC’s pattern-based approach:
- Explicitly targets common reduction idioms
- Aggressively transforms when pattern matches
- Generates complete vectorization (4× for AVX2 vs GCC’s 2× unroll)
Result: LCCC vectorizes patterns GCC -O3 leaves scalar, achieving ~2.7× speedup.
Debug Flags
LCCC_DEBUG_VECTORIZE=1 # Show vectorization transformations
LCCC_DEBUG_PROTECT=1 # Show stack slot protection decisions