TVM Matrix Multiplication Optimization - Step 4: Vectorization + Local Memory


Step 4: Vectorization + Local Memory

Results

Matrix Size    Step 4        Improvement
512x512        804 GFLOPS    +94%
1024x1024      593 GFLOPS    +29%
2048x2048      549 GFLOPS    +23%

Average: 614 GFLOPS

1. Compiler Theory: Scalar Replacement (Register Pipelining)

Basic Idea

Replace repeated array accesses with scalar variables and assign them to registers.

In matrix multiplication, C[i,j] is accumulated K times (here, 1024 times):

# Original: Read and write C[i,j] from memory every time
for k in range(1024):
    C[i,j] = C[i,j] + A[i,k] * B[k,j]
  • This way, we make 1024 read requests and 1024 write requests to C[i, j]. That is, 2048 memory accesses.
  • By accumulating the value of C[i, j] in a scalar variable kept in local memory (on a GPU, a register), we can create the following situation:
# Replace C[i,j] with scalar variable temp
temp = C[i,j]  # 1 read
for k in range(1024):
    temp = temp + A[i,k] * B[k,j]
    # Accumulated in register
C[i,j] = temp  # 1 write
  • Assuming register accesses are not counted as memory accesses, this reduces the number of memory accesses to C[i, j] from 2048 to 2.
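To make the 2048-to-2 count concrete, here is a small self-contained sketch. The `Cell` wrapper is a hypothetical helper (not part of TVM) that counts element reads and writes to a single C[i, j] for K = 1024:

```python
K = 1024
a = [1.0] * K          # row A[i, :]
b = [1.0] * K          # column B[:, j]

class Cell:
    """Counts reads/writes to a single C[i, j] entry (illustrative helper)."""
    def __init__(self):
        self.value, self.reads, self.writes = 0.0, 0, 0
    def get(self):
        self.reads += 1
        return self.value
    def set(self, v):
        self.writes += 1
        self.value = v

# Original: C[i, j] is read and written on every k iteration
c = Cell()
for k in range(K):
    c.set(c.get() + a[k] * b[k])
print(c.reads, c.writes)       # 1024 1024 -> 2048 memory accesses

# Scalar replacement: one read before the loop, one write after
c2 = Cell()
temp = c2.get()                # 1 read
for k in range(K):
    temp = temp + a[k] * b[k]  # accumulates in a "register"
c2.set(temp)                   # 1 write
print(c2.reads, c2.writes)     # 1 1 -> 2 memory accesses
assert c.value == c2.value
```

Both variants compute the same value; only the number of accesses to C[i, j] changes.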

2. TVM TensorIR Implementation

C_local: Scalar Replacement on the GPU

# Cache C to local memory (register)
C_local = sch.cache_write(block_C, 0, "local")
  • sch.cache_write creates an intermediate buffer in local memory (registers): the accumulation runs against that buffer, and the final result is written to global memory only once.
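As a rough plain-Python analogy (a sketch, not actual TVM-generated code), the loop nest after cache_write behaves like the one below: each output element is accumulated in a local variable, and global memory is touched once per element.

```python
def matmul_local(A, B, n):
    """Matmul where each C[i][j] is accumulated in a local variable,
    mirroring what a 'local'-scope cache_write buffer does on a GPU."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c_local = 0.0                     # lives in a register on the GPU
            for k in range(n):
                c_local += A[i][k] * B[k][j]  # no global traffic for C here
            C[i][j] = c_local                 # single store to global memory
    return C

n = 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j) for j in range(n)] for i in range(n)]
C = matmul_local(A, B, n)
```

The result matches the naive triple loop; the schedule changes where the intermediate values live, not what is computed.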

Vectorization

# Vectorize memory loads
sch.vectorize(i_elem)
  • Vectorization loads several contiguous elements with a single instruction, improving memory-bandwidth utilization.
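The effect can be sketched with NumPy (used here only for illustration; on the GPU the transform emits wide loads such as 128-bit float4): a vectorized loop moves four contiguous floats per step instead of one.

```python
import numpy as np

x = np.arange(16, dtype=np.float32)
y = np.arange(16, dtype=np.float32)

# Scalar loop: one element loaded from each input per iteration (16 steps)
out_scalar = np.empty(16, dtype=np.float32)
for i in range(16):
    out_scalar[i] = x[i] + y[i]

# Vector width 4: four contiguous elements per iteration (4 steps),
# analogous to a float4 load / add / store
out_vec = np.empty(16, dtype=np.float32)
for i in range(0, 16, 4):
    out_vec[i:i + 4] = x[i:i + 4] + y[i:i + 4]

assert np.allclose(out_scalar, out_vec)
```

Same arithmetic, a quarter as many trips through the memory system per array.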

Execution

python test_individual/test_step4_final.py

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

