TVM Matrix Multiplication Optimization - Step 6: Loop Unrolling
Step 6: Loop Unrolling
Achieved 1050 GFLOPS
Results
| Unroll Factor | 512x512 | 1024x1024 | Improvement |
|---|---|---|---|
| 4 | 1034 GFLOPS | 1038 GFLOPS | +0.9% |
| 8 | 940 GFLOPS | 1035 GFLOPS | +0.6% |
| 16 | 1028 GFLOPS | 1050 GFLOPS | +2.0% |
| 32 | 1022 GFLOPS | 1041 GFLOPS | +1.2% |
Best performance: 1050 GFLOPS (unroll factor 16, 1024x1024)
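For reference, a GFLOPS figure can be converted back into per-run time with a quick back-of-the-envelope calculation; a dense N×N matmul performs roughly 2·N³ floating-point operations. The helper below is my own illustration, not part of the repository:

```python
# Hypothetical helper (not from the repo): convert a GFLOPS figure into
# the wall-clock time of a single 1024x1024 matmul.
N = 1024
flops = 2 * N**3                      # one multiply + one add per inner-product term
gflops = 1050                         # best result from the table above
time_ms = flops / (gflops * 1e9) * 1e3
print(f"{N}x{N} matmul at {gflops} GFLOPS ~= {time_ms:.3f} ms")   # ~2.045 ms
```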
1. Compiler Theory: Loop Unrolling
Removing Loop Overhead
Loops have overhead every iteration.
Suppose we have the following loop:
```c
for (int k = 0; k < 32; k++) {
    C += A[k] * B[k];
}
```
Every time the loop iterates, it executes the following extra instructions:

- Counter increment: `k++`
- Condition comparison: is `k < 32`?
- Branch (jump): if the condition is true, return to the loop start; if false, exit the loop.

Since the loop iterates 32 times in total, 32 × 3 = 96 additional instructions are executed. Loop unrolling is a technique that unrolls the loop body and removes this per-iteration framework.

In the above loop, if we unroll 16 iterations at once (i.e., an unroll factor of 16), only 2 iterations remain, so just 2 × 3 = 6 overhead instructions are executed:
```c
// First block (16 operations)
C += A[0] * B[0];
C += A[1] * B[1];
// ... (13 more)
C += A[15] * B[15];

// Second block (16 operations)
C += A[16] * B[16];
// ... (14 more)
C += A[31] * B[31];
```
2. TVM TensorIR Implementation
```python
# Unroll the inner reduction loop
sch.unroll(k_inner)
```
`sch.unroll` replicates the loop body to reduce loop overhead and expose instruction-level parallelism (ILP).
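For context, here is a minimal, self-contained sketch (not the exact schedule from the repository) of how `sch.unroll` is typically applied in TensorIR: define a naive matmul, split the reduction loop so the inner extent matches the unroll factor, and unroll the inner loop. The `T.Buffer` annotation syntax may vary slightly between TVM versions.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(A: T.Buffer((1024, 1024), "float32"),
           B: T.Buffer((1024, 1024), "float32"),
           C: T.Buffer((1024, 1024), "float32")):
    for i, j, k in T.grid(1024, 1024, 1024):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
block = sch.get_block("C")
i, j, k = sch.get_loops(block)
# Split the reduction loop so the innermost extent equals the unroll factor.
k_outer, k_inner = sch.split(k, factors=[None, 16])
sch.unroll(k_inner)          # replicate the inner body 16 times
print(sch.mod.script())      # inspect the transformed TensorIR
```

In the printed TensorIR, the inner loop carries an unroll annotation, and the generated code repeats the multiply-accumulate body instead of branching every iteration.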
Execution
```bash
python test_individual/test_step6.py
```
Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.
Series Posts
- Previous: Step 5: Software Pipelining
- Next: Step 7: cuBLAS Comparison
