TVM Matrix Multiplication Optimization - Step 6: Loop Unrolling


Step 6: Loop Unrolling

Achieved 1050 GFLOPS

Results

Unroll Factor | 512x512     | 1024x1024   | Improvement
4             | 1034 GFLOPS | 1038 GFLOPS | +0.9%
8             |  940 GFLOPS | 1035 GFLOPS | +0.6%
16            | 1028 GFLOPS | 1050 GFLOPS | +2.0%
32            | 1022 GFLOPS | 1041 GFLOPS | +1.2%

Best performance: 1050 GFLOPS at unroll factor 16. Factor 32 trails slightly, consistent with larger unrolled bodies increasing code size and register pressure.

1. Compiler Theory: Loop Unrolling

Removing Loop Overhead

A loop pays control overhead on every iteration, on top of the useful work in its body.

Suppose we have the following loop:

for (int k = 0; k < 32; k++) {
    C += A[k] * B[k];
}
  • Each iteration executes three control instructions in addition to the body:
    • Counter increment: k++
    • Condition comparison: is k < 32?
    • Branch (jump): if the condition holds, return to the loop start; otherwise, exit the loop.
  • Over 32 iterations, this adds 32 * 3 = 96 control instructions.
  • Loop unrolling replicates the loop body so that most of this control overhead disappears.

In the above loop, if we unroll 16 operations per iteration (an unroll factor of 16), only 2 iterations remain, and the control overhead drops from 96 instructions to 2 * 3 = 6:

// Unroll factor 16: two iterations remain, each doing 16 operations
for (int k = 0; k < 32; k += 16) {
    C += A[k]      * B[k];
    C += A[k + 1]  * B[k + 1];
    // ... (13 more)
    C += A[k + 15] * B[k + 15];
}

2. TVM TensorIR Implementation

# Unroll inner loop
sch.unroll(k_inner)
  • sch.unroll marks the loop for unrolling: its body is replicated during lowering, which removes the loop overhead and exposes Instruction-Level Parallelism (ILP).
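For context, here is a minimal sketch of where sch.unroll sits in a TensorIR schedule. The matrix sizes, loop names, and the split factor of 16 are illustrative assumptions, not necessarily the repository's exact code:

import tvm
from tvm import te

# Define C[i, j] = sum_k A[i, k] * B[k, j]  (sizes assumed for illustration)
M = N = K = 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tvm.tir.Schedule(te.create_prim_func([A, B, C]))
block = sch.get_block("C")
i, j, k_loop = sch.get_loops(block)

# Split the reduction loop so the inner part has a fixed trip count of 16,
# then mark it for unrolling (the body is replicated during lowering)
k_outer, k_inner = sch.split(k_loop, factors=[None, 16])
sch.unroll(k_inner)

print(sch.mod.script())  # k_inner now carries the unroll annotation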

Execution

python test_individual/test_step6.py
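Continuing the sketch above (build and timing APIs vary across TVM versions, so treat this as an illustration rather than the repository's actual script), throughput can be measured with the runtime's time_evaluator. An M x N x K matmul performs 2 * M * N * K floating-point operations, one multiply and one add per inner-product term:

import numpy as np

lib = tvm.build(sch.mod, target="llvm")
dev = tvm.cpu()
a = tvm.nd.array(np.random.rand(M, K).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(K, N).astype("float32"), dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float32"), dev)

timer = lib.time_evaluator("main", dev, number=10)  # average over 10 runs
mean_s = timer(a, b, c).mean
print(f"{2 * M * N * K / mean_s / 1e9:.1f} GFLOPS")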

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

