TVM Matrix Multiplication Optimization - Step 2: Tiling + Loop Reordering
Achieved up to 482 GFLOPS (a 5.1x improvement over Step 1)
1. Compiler Theory: Tiling and Loop Transformation
Tiling (Blocking) - Loop Splitting for Cache
Tiling is a technique that divides large loops into small blocks (tiles). This keeps data resident in fast-access memory (cache, Shared Memory) for longer between uses.
Matrix multiplication is, at its core, the following triple loop:
```fortran
! Basic form
do i = 1,N
  do j = 1,N
    do k = 1,N
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    enddo
  enddo
enddo
```
- When `N` is large, the matrices `A`, `B`, and `C` don't all fit in cache, so data must be fetched again and again from slow main memory.
Applying Tiling - split with tile size `t`:
```fortran
! Split the j and k loops into tiles of size t
do j = 1,N,t
  do k = 1,N,t
    do i = 1,N
      do jj = j, min(j+t-1,N)
        do kk = k, min(k+t-1,N)
          C(i,jj) = C(i,jj) + A(i,kk) * B(kk,jj)
        enddo
      enddo
    enddo
  enddo
enddo
```
- When the loops are split with tile size `t`, the data processed at once shrinks from N × N to t × t, so it fits in cache (a rough footprint calculation is sketched below).
- Transforming the iteration space in this way is called an affine transform.
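To make the cache argument concrete, here is a rough back-of-the-envelope calculation. The matrix size, tile size, and float32 element size are illustrative assumptions, not measurements from this post:

```python
# Illustrative working-set arithmetic for the tiled loop above (float32 = 4 bytes).
N, t, elem = 1024, 32, 4

full_working_set = 3 * N * N * elem   # A, B, C all at once: ~12 MiB
b_tile = t * t * elem                 # one t x t tile of B: 4 KiB

print(f"untiled: {full_working_set / 2**20:.0f} MiB, B tile: {b_tile / 2**10:.0f} KiB")
# The 4 KiB B tile easily stays in L1 cache while the entire i loop reuses it;
# the 12 MiB untiled working set cannot.
```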
Loop Reordering - Maximizing Data Reuse
By changing the loop order along with Tiling, we can maximize reuse in registers.
Core Principle:
- Values touched by the innermost loop can stay in registers across iterations.
- Placing `k` in the innermost loop lets `A[i,k]`, `B[k,j]`, and `C[i,j]` all be served from registers, as sketched below.
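As a minimal illustration of this accumulation pattern (plain Python with toy data, not the actual TVM schedule):

```python
# k-innermost accumulation: each output element is computed in a local
# accumulator and written back exactly once.
N = 4
A = [[1.0] * N for _ in range(N)]
B = [[2.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

for i in range(N):
    for j in range(N):
        acc = 0.0                      # accumulator stays in a register
        for k in range(N):
            acc += A[i][k] * B[k][j]   # A and B elements stream past it
        C[i][j] = acc                  # one write-back per output element
```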
2. TVM TensorIR Implementation
On a GPU, tiling is applied at two levels:
- Block-level tiling: GPU thread-block granularity (sized for Shared Memory)
- Thread-level tiling: the workload per thread (sized for the register file)
2-Level Tiling
```python
# Block-level tiling (GPU thread block)
BM, BN, BK = 32, 32, 32   # small tiles (optimal for the A500's cache)

# Thread-level tiling (work per thread)
TM, TN = 8, 4             # high ILP
threads_x = BM // TM      # 4
threads_y = BN // TN      # 8

# Create the tiles with split
i_block, i_rest = sch.split(i, factors=[None, BM])
j_block, j_rest = sch.split(j, factors=[None, BN])
k_outer, k_inner = sch.split(k, factors=[None, BK])
i_thread, i_elem = sch.split(i_rest, factors=[threads_x, None])
j_thread, j_elem = sch.split(j_rest, factors=[threads_y, None])
```
- `sch.split` divides one long loop into multiple nested loops.
- `i_block, i_rest = sch.split(i, factors=[None, BM])` fixes the inner loop extent to `BM` (32) and lets TVM derive the outer extent automatically (`None`). `i_rest` becomes the inner loop that runs `BM` times, and `i_block` becomes the outer loop that runs `M / BM` times. A self-contained example is sketched after this list.
- The optimal values of BM, BN, BK, TM, and TN were discovered through the parameter sweep described in the Execution section below.
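For readers who want to try `sch.split` in isolation, here is a minimal self-contained sketch. The 256×256 matmul and the TensorIR boilerplate are my own illustrative scaffolding, not code from the repo, and the exact TVMScript syntax can vary between TVM versions:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(A: T.Buffer((256, 256), "float32"),
           B: T.Buffer((256, 256), "float32"),
           C: T.Buffer((256, 256), "float32")):
    for i, j, k in T.grid(256, 256, 256):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
i, j, k = sch.get_loops(sch.get_block("C"))
i_block, i_rest = sch.split(i, factors=[None, 32])  # outer extent inferred: 256/32 = 8
print(sch.mod.script())  # inspect the transformed loop nest
```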
3. Results
Performance
| Matrix Size | Step 1 | Step 2 | Improvement |
|---|---|---|---|
| 512x512 | 91 GFLOPS | 466 GFLOPS | 5.1x |
| 1024x1024 | 95 GFLOPS | 482 GFLOPS | 5.1x |
| 2048x2048 | - | 222 GFLOPS | - |
Average: 390 GFLOPS
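For reference, GFLOPS for an N×N×N matmul is conventionally computed as 2N³ floating-point operations (one multiply and one add per inner iteration) divided by runtime. A quick helper for sanity-checking the numbers above (my own, not from the repo):

```python
def gflops(n: int, seconds: float) -> float:
    """Throughput of an n x n x n matmul: 2*n^3 FLOPs per run."""
    return 2 * n**3 / seconds / 1e9

# e.g. a 1024x1024 matmul finishing in ~4.45 ms comes out to ~482 GFLOPS
print(f"{gflops(1024, 4.45e-3):.0f} GFLOPS")
```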
Optimal Configuration
```python
# Best-performing setting (482 GFLOPS)
BM, BN, BK = 32, 32, 32   # fits in L1 cache
TM, TN = 8, 4             # high ILP
threads = 32              # 4 x 8: high ILP with fewer threads
pattern = "k_innermost"   # maximize register reuse
```
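The snippet in Section 2 shows the splits but not the reorder and bind steps that realize the `k_innermost` pattern and the 4×8 thread layout. A typical TensorIR continuation would look like the following; this is my reconstruction of the standard pattern, not code from the repo:

```python
# Hypothetical continuation of the schedule from Section 2 (reconstruction).
# Order the loops so k runs innermost for each per-thread output element,
# keeping its accumulator in a register:
sch.reorder(i_block, j_block, i_thread, j_thread,
            i_elem, j_elem, k_outer, k_inner)
# Map block-level loops to GPU blocks and thread-level loops to threads:
sch.bind(i_block, "blockIdx.x")
sch.bind(j_block, "blockIdx.y")
sch.bind(i_thread, "threadIdx.x")   # 4 threads
sch.bind(j_thread, "threadIdx.y")   # 8 threads -> 32 threads per block
```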
Execution
```bash
# Basic execution
python test_individual/test_step2_improved.py

# Parameter search (138 configurations)
python test_individual/step2_parameter_sweep.py
```
Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.
Series Posts
- Previous: Step 1: Simple GPU Binding
- Next: Step 3: Shared Memory
