# Benchmarks
Seven micro-benchmarks targeting different bottlenecks. All measured as best wall-clock time over repeated runs (see Test Environment).
## Test Environment
| Item | Value |
|---|---|
| Host | Linux x86-64 |
| LCCC | Phase 16g complete — linear scan + TCE + phi-copy coalescing + FP opts + AVX2 vectorization + reduction vectorization + rec-to-iter + SIB fold + accumulator fold + sign-ext fusion + loop rotation + phi register coalescing + 3-channel multiply ILP + byte-offset IV + leaq GEP + FMA register elimination + broadcast hoisting + peephole XMM fixes |
| CCC | upstream, three-phase greedy allocator |
| GCC | 15.2.1 (Arch Linux) |
| Flags | -O2 for all compilers (GCC -O3 -march=native for reduction comparison) |
| Timing | `time.perf_counter()` wall clock, 7 reps, best taken |
| Date | 2026-03-25 |
## Results
| Benchmark | LCCC | GCC -O2 | LCCC/GCC |
|---|---|---|---|
| arith_loop | 0.08s | 0.08s | 1.0× (parity) |
| sieve | 0.07s | 0.06s | 1.17× slower |
| qsort | 0.122s | 0.101s | 1.20× slower |
| fib(40) | 0.001s | 0.136s | 478× faster |
| matmul | 0.004s | 0.004s | 1.0× (parity) |
| reduction | AVX2 | scalar (GCC -O3) | ~2.7× faster |
| tce_sum | 0.007s | 0.001s | 7× slower (GCC const-folds) |
All outputs are byte-identical to GCC’s.
## Real-World: SQLite 3.45
LCCC compiles and fully runs the SQLite amalgamation (260K lines). All core SQL operations work correctly: CREATE TABLE, INSERT, UPDATE, DELETE, SELECT with JOINs, subqueries, GROUP BY, transactions, prepared statements, and aggregate functions. 31 correctness bugs were fixed to reach this milestone. Build flags: `CCC_NO_PEEPHOLE=1 CCC_DISABLE_PASSES=vectorize,gvn`.
## Benchmark Descriptions
### 01_arith_loop — Register Pressure + Phi Register Coalescing + Multiply ILP
```c
// 32 local int variables, all updated every loop iteration, 10M iterations
int arith_loop(int n) {
    int a=1, b=2, c=3, /* ... */ z7=32;
    for (int i = 0; i < n; i++) {
        a += b*c; b += c*d; /* ... 32 cross-dependent updates */
    }
    return a ^ b ^ c ^ /* ... */ ^ z7;
}
```
Why it matters: This is the canonical register allocation stress test. With 32 live integer variables and 10M iterations, every extra stack spill costs ~3ns. LCCC’s linear scan assigns more callee-saved registers to the hottest values.
LCCC vs CCC: 0.08s vs 0.147s (+84% faster). Three improvements combine here:
- Linear-scan register allocation (Phase 2): keeps more variables in callee-saved registers across the loop
- Phi-copy stack coalescing (Phase 3b): phi elimination creates one Copy instruction per loop variable per backedge, generating ~20 redundant stack-to-stack `movq` pairs per iteration. Reversing the alias direction (src borrows dest's wider-live slot) makes these copies same-slot no-ops, dropped by the code generator
- Phi register coalescing (Phase 15): loop-header phi dests share physical registers with their backedge source values, eliminating register-to-register or register-to-stack copies at the loop backedge (~130 instructions eliminated)
- 3-channel multiply ILP (Phase 15b): every 3rd fusible multiply temp uses `%eax` via multiply-add fusion while the other 2/3 use `%r12`/`%rbx`, creating 3 independent multiply chains that fully utilize the CPU's multiply port throughput (1-cycle throughput, 3-cycle latency)
LCCC vs GCC: 1.0× (parity). LCCC generates 109 loop body instructions vs GCC’s 125, but GCC keeps ~14 of 32 variables in registers (vs LCCC’s ~11) due to its unified rax/rcx/rdx usage. The 3-channel multiply ILP compensates by fully saturating the multiply port.
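The payoff of splitting work into independent multiply chains can be illustrated in plain C. This is a hypothetical micro-example of the ILP principle, not LCCC's actual transformation or output:

```c
#include <stdint.h>

// One serial dependency chain: each multiply waits on the previous
// result, so the loop is latency-bound (~3 cycles per multiply).
int64_t chain_serial(int64_t x, int n) {
    int64_t acc = 1;
    for (int i = 0; i < n; i++)
        acc = acc * x + 1;       // next iteration depends on acc
    return acc;
}

// Three independent chains: an out-of-order core can overlap the
// multiplies, approaching the multiply port's 1-per-cycle throughput.
int64_t chain_ilp3(int64_t x, int n) {
    int64_t a = 1, b = 1, c = 1;
    for (int i = 0; i + 2 < n; i += 3) {
        a = a * x + 1;           // chain 1
        b = b * x + 1;           // chain 2
        c = c * x + 1;           // chain 3
    }
    return a + b + c;
}
```

The two functions compute different values; the point is only the shape of the dependency graph, which is what the 3-channel split changes.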
### 02_fib — Recursive Calls
```c
long fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);
}
// fib(40) = 102,334,155 (~330M recursive calls)
```
Why it matters: Call-dominated workload — tests whether the compiler can recognize and optimize the exponential recursion pattern.
LCCC vs GCC: 478× faster. LCCC’s binary recursion-to-iteration pass (Phase 10) detects the f(n) = f(n-1) + f(n-2) pattern and converts it to an O(n) iterative sliding-window loop. GCC -O2 keeps the exponential recursion (with partial loop transformation of one call). The transformation is verified by a CI test that computes fib(90) — impossible without the O(n) conversion.
Note: This is a synthetic benchmark. No production code uses naive recursive Fibonacci. The optimization demonstrates LCCC’s pattern-matching capabilities but should not be interpreted as “LCCC is faster than GCC” in general — GCC wins on all other benchmarks.
### 03_matmul — Floating Point + Cache + AVX2 Vectorization
```c
void matmul(void) {
    for (int i = 0; i < 256; i++)
        for (int k = 0; k < 256; k++)
            for (int j = 0; j < 256; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```
Why it matters: FP throughput and cache behavior. The inner loop is a fused multiply-add pattern that benefits from SIMD vectorization.
LCCC vs CCC: 0.006s vs 0.029s (4.8× faster). LCCC auto-vectorizes with AVX2 FMA3 (`vfmadd231pd`, 4 doubles per instruction), while CCC emits scalar `mulsd`/`addsd`.
LCCC vs GCC: 1.0× (parity). LCCC uses AVX2 4-wide FMA (vfmadd231pd), GCC uses SSE2 2-wide (mulpd/addpd). LCCC’s inner loop has 10 instructions vs GCC’s 7, but processes 2× more elements per iteration. Phase 16 optimizations:
- IVSR correctness fix (16a): Disabled IVSR/Un-IVSR — pointer IVs compounded with indexed offsets in nested loops, producing wrong output.
- Byte-offset IV (16c): Converted the j-loop IV from element index to byte offset, eliminating `shl $5` from the critical path.
- leaq GEP optimization (16e): `leaq (%base, %offset), %dest` for register-allocated GEPs (1 instruction instead of 3).
- Peephole XMM fixes (16e): Fixed OOB `REG_NAMES[24]` access and movsd string truncation (`line[18..]` → `line[14..]`).
- FMA register elimination (16f): FMA codegen uses register-allocated pointers directly (`vmovupd (%r14)` instead of copying r14→rax then `vmovupd (%rax)`).
- Broadcast hoisting (16g): Peephole hoists the loop-invariant `movsd + vbroadcastsd` pair out of the j-loop.
Remaining gap (10 vs 7 instructions): 2 leaq instructions (GCC uses inline SIB addressing), 2 redundant sign-extensions (needs I64 IV widening), 1 loop counter instruction.
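A hypothetical intrinsics rendering of the vectorized kernel, 4 doubles per j-step via `vfmadd231pd` with the `A[i][k]` broadcast hoisted out of the inner loop. Function and array names are illustrative; this is not LCCC's emitted code:

```c
#include <immintrin.h>

#define N 256
static double A[N][N], B[N][N], C[N][N];

// i-k-j order: the k-dependent A[i][k] broadcast (vbroadcastsd) is
// loop-invariant in j, so it is hoisted above the j-loop, matching
// the broadcast-hoisting optimization described above.
__attribute__((target("avx2,fma")))
void matmul_avx2(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            __m256d aik = _mm256_set1_pd(A[i][k]);
            for (int j = 0; j < N; j += 4) {
                __m256d c = _mm256_loadu_pd(&C[i][j]);
                __m256d b = _mm256_loadu_pd(&B[k][j]);
                c = _mm256_fmadd_pd(aik, b, c);      // vfmadd231pd
                _mm256_storeu_pd(&C[i][j], c);
            }
        }
}
```

N = 256 is a multiple of 4, so no remainder loop is needed here.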
### 04_qsort — Library Calls
```c
qsort(arr, N, sizeof(int), cmp);
```
Why it matters: Minimal compiler involvement — qsort is a libc function. The only LCCC code is the comparison function and the setup.
LCCC vs CCC: ≈ equal (0.098s vs 0.096s). Expected — cmp is trivial.
LCCC vs GCC: 1.13× slower. Close to parity because libc does the work.
### 05_sieve — Memory Writes + Inner Loop
```c
for (int i = 2; i*i <= N; i++)
    if (sieve[i])
        for (int j = i*i; j <= N; j += i)
            sieve[j] = 0;
```
Why it matters: The inner loop has two variables (j, i) that should stay in registers. Both are integer loop variables — exactly what the allocator targets.
LCCC vs CCC: 0.07s vs 0.044s (CCC is faster here due to different loop structure). LCCC keeps j and i in registers across the inner loop; CCC reloads them from the stack each iteration.
LCCC vs GCC: 1.17× slower. Inner marking loop now has 4 instructions (matching GCC’s 4). Phase 17 optimizations:
- Late stack-load hoisting (17a): Moved the loop-invariant sieve pointer load out of the inner loop. Required matching conditional back-edges (`jle`) in addition to unconditional `jmp`, and running after all simplification passes to avoid false "register written elsewhere" detection.
- addl+movslq fusion (17b): Detected adjacent `addl %ebx, %ebp; movslq %ebp, %r14` and redirected the `addl` to write directly to `%r14d`, eliminating the `movslq` entirely. Only fires when the intermediate register (`%ebp`) is not used elsewhere in the enclosing block. Runs after loop rotation (which moves the `movslq` from the header to the latch, placing it adjacent to the `addl`).
Remaining timing gap (0.07s vs 0.06s) is from outer loop overhead and the prime-counting loop.
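A C analogy of what the late stack-load hoisting achieves at the assembly level: the spilled sieve base pointer stops being re-read on every iteration of the marking loop. Names here are illustrative, not LCCC internals:

```c
// Before: generated code reloaded the spilled base pointer from the
// stack slot on every iteration of the marking loop.
void mark_before(char **sieve_slot, int i, int n) {
    for (int j = i * i; j <= n; j += i)
        (*sieve_slot)[j] = 0;     // load of *sieve_slot inside the loop
}

// After: the invariant load is hoisted out, leaving a tight body of
// store + add + compare + branch.
void mark_after(char **sieve_slot, int i, int n) {
    char *sieve = *sieve_slot;    // single hoisted load
    for (int j = i * i; j <= n; j += i)
        sieve[j] = 0;
}
```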
### 06_reduction — Reduction Vectorization (NEW)
```c
double sum_array(double *arr, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}
// sum_array(array, 10000000)
```
Why it matters: Simple reduction patterns are common in scientific computing but surprisingly hard to auto-vectorize. GCC’s vectorizer is conservative and often leaves them scalar.
LCCC vs CCC: 4× speedup. LCCC detects the reduction pattern and transforms to AVX2 SIMD (4 doubles per iteration), with proper horizontal reduction (vextractf128 + vunpckhpd + vaddsd) and a remainder loop for N % 4 != 0. CCC emits scalar addsd.
LCCC vs GCC -O3: ~2.7× faster. This is LCCC’s signature win: GCC -O3 with -march=native does NOT vectorize this pattern—it only does 2× scalar loop unrolling. LCCC generates 12 ymm instructions; GCC generates 0. Assembly comparison:
```asm
; LCCC -O2
vxorpd %ymm0, %ymm0, %ymm0    # Zero vector
vmovupd (%rax,%rcx), %ymm0    # Load 4 doubles
vaddpd %ymm1, %ymm0, %ymm0    # Add 4 doubles
vextractf128 $1, %ymm0, %xmm1 # Horizontal reduction
vunpckhpd %xmm0, %xmm0, %xmm1
vaddsd %xmm1, %xmm0, %xmm0    # Final scalar

; GCC -O3 -march=native
vxorpd %xmm0, %xmm0, %xmm0    # Scalar zero
vaddsd (%rdi), %xmm0, %xmm0   # Scalar add
vaddsd -8(%rdi), %xmm0, %xmm0 # Scalar add (unrolled)
```
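An intrinsics sketch of the transformation just described: 4-wide accumulation, a horizontal reduction mirroring the `vextractf128 + vunpckhpd + vaddsd` sequence, and a scalar remainder loop for N % 4 != 0. Hypothetical code, not LCCC's emitted assembly:

```c
#include <immintrin.h>

__attribute__((target("avx2")))
double sum_array_avx2(const double *arr, int n) {
    __m256d acc = _mm256_setzero_pd();        // 4 partial sums
    int i = 0;
    for (; i + 4 <= n; i += 4)                // main 4-wide loop
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(arr + i));
    // Horizontal reduction: fold high 128 lanes into low, then
    // fold the remaining pair down to one scalar.
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d s2 = _mm_add_pd(lo, hi);          // 2 partial sums
    __m128d s1 = _mm_add_sd(s2, _mm_unpackhi_pd(s2, s2));
    double sum = _mm_cvtsd_f64(s1);
    for (; i < n; i++)                        // remainder loop
        sum += arr[i];
    return sum;
}
```

Note the vectorized sum reassociates the additions, which is why vectorizers are conservative here by default for floating point.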
### 07_tce_sum — Tail-Call Elimination
```c
static long sum(int n, long acc) {
    if (n <= 0) return acc;
    return sum(n - 1, acc + n); // tail call
}
// sum(10000000, 0) = 50000005000000
```
Why it matters: A pure accumulator-style recursion with 10M stack frames in CCC. LCCC’s tail-call elimination (TCE) converts the self-recursive tail call to a back-edge branch before the main optimization loop, turning 10M stack frames into a tight loop.
LCCC vs CCC: 0.008s vs 1.09s (139× faster). CCC executes 10M actual call/ret pairs; LCCC emits a counted loop identical to what GCC produces.
LCCC vs GCC: LCCC's TCE produces the same 3-instruction counted loop GCC does, but GCC -O2 additionally constant-folds the whole call at compile time, which accounts for the 7× gap in the results table.
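The loop TCE produces is equivalent to rewriting the tail call as parameter updates plus a branch back to the function entry. A sketch of the transformation, not LCCC's IR:

```c
// Tail-call-eliminated form of sum(n, acc): the self-recursive tail
// call sum(n - 1, acc + n) becomes argument updates plus a back-edge.
long sum_tce(int n, long acc) {
    for (;;) {
        if (n <= 0) return acc;
        acc = acc + n;   // the tail call's second argument...
        n = n - 1;       // ...and its first; the call is now a back-edge
    }
}
```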
## Running the Benchmarks
```sh
# Build LCCC
cargo build --release

# Run full suite (requires upstream CCC at ../ccc-upstream or adjust paths)
python3 lccc-improvements/benchmarks/bench.py --reps 5 --md results.md

# Run a single benchmark, more reps
python3 lccc-improvements/benchmarks/bench.py --bench 01_arith_loop --reps 20

# Compile a benchmark manually for disassembly comparison
GCC_INC="-I/usr/lib/gcc/x86_64-linux-gnu/$(gcc -dumpversion)/include"
./target/release/ccc $GCC_INC -O2 -o /tmp/arith_lccc lccc-improvements/benchmarks/bench/01_arith_loop.c
objdump -d /tmp/arith_lccc | grep -A 100 "arith_loop"
```