TVM Matrix Multiplication Optimization - Step 4: Vectorization + Local Memory
Step 4: Vectorization + Local Memory
Results
| Matrix Size | Step 4 | Improvement vs. Step 3 |
|---|---|---|
| 512x512 | 804 GFLOPS | +94% |
| 1024x1024 | 593 GFLOPS | +29% |
| 2048x2048 | 549 GFLOPS | +23% |
Average: 614 GFLOPS
1. Compiler Theory: Scalar Replacement (Register Pipelining)
Basic Idea
Replace repeated array accesses with scalar variables and assign them to registers.
In matrix multiplication, `C[i,j]` is accumulated K times (here, K = 1024):

```python
# Original: read and write C[i,j] from memory on every iteration
for k in range(1024):
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
```
- This issues 1024 read requests and 1024 write requests to `C[i,j]`, i.e., 2048 memory accesses in total.
- By assigning the value of `C[i,j]` to a scalar variable and keeping that variable in local memory (a register, on a GPU), we arrive at the following:
```python
# Replace C[i,j] with scalar variable temp
temp = C[i, j]                       # 1 read
for k in range(1024):
    temp = temp + A[i, k] * B[k, j]  # accumulates in a register
C[i, j] = temp                       # 1 write
```
- Under the premise that register accesses do not count as memory accesses, this reduces the memory access count from 2048 to 2.
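To make the counting concrete, here is a small, self-contained Python sketch (illustrative only, not code from the repo) that instruments reads and writes to C for a single output element; `CountingBuffer` and the sizes are hypothetical:

```python
# Illustrative only: instrument accesses to C to see the per-element
# 2K -> 2 reduction from scalar replacement (all names here are hypothetical).
class CountingBuffer:
    def __init__(self, n):
        self.data = [[0.0] * n for _ in range(n)]
        self.reads = 0
        self.writes = 0

    def get(self, i, j):
        self.reads += 1
        return self.data[i][j]

    def set(self, i, j, v):
        self.writes += 1
        self.data[i][j] = v

n = 8
A = [[1.0] * n for _ in range(n)]
B = [[2.0] * n for _ in range(n)]
C = CountingBuffer(n)

# Scalar-replaced accumulation for one element C[0,0]
temp = C.get(0, 0)               # 1 read
for k in range(n):
    temp += A[0][k] * B[k][0]    # runs in a local variable (a register)
C.set(0, 0, temp)                # 1 write

print(C.reads, C.writes)         # 1 1 (the naive loop would do n reads, n writes)
```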
2. TVM TensorIR Implementation
C_local: GPU’s Scalar Replacement
```python
# Cache C to local memory (registers)
C_local = sch.cache_write(block_C, 0, "local")
```
`sch.cache_write` creates a buffer in local memory (registers), so the accumulation happens in registers and the final result is written to global memory only once.
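For context, here is a minimal, self-contained sketch of how `cache_write` fits into a TensorIR schedule; the matmul definition is a generic stand-in, not the repo's exact Step 4 code:

```python
# Minimal sketch of cache_write (generic matmul, not the repo's full schedule)
import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tvm.tir.Schedule(te.create_prim_func([A, B, C]))
block_C = sch.get_block("C")

# The compute block now accumulates into a "local" (register) buffer;
# the returned block copies that buffer to global memory once.
C_local = sch.cache_write(block_C, 0, "local")
print(sch.mod.script())  # the printed IR shows the new C_local block
```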
Vectorization
```python
# Vectorize memory loads
sch.vectorize(i_elem)
```
- Vectorization loads multiple elements at once, improving memory bandwidth utilization (see the sketch below).
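Here is a minimal, self-contained sketch of the load-vectorization pattern: stage a tile through a cache stage and vectorize its innermost copy loop. Block and loop names (`A_shared`, `ax1_inner`) are illustrative stand-ins for the repo's `i_elem`:

```python
# Minimal sketch of vectorized loads (names are illustrative, not the repo's)
import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tvm.tir.Schedule(te.create_prim_func([A, B, C]))
block_C = sch.get_block("C")

# Stage A in shared memory, then vectorize the innermost loop of the copy
A_shared = sch.cache_read(block_C, 0, "shared")
ax0, ax1 = sch.get_loops(A_shared)
ax1_outer, ax1_inner = sch.split(ax1, factors=[None, 4])
sch.vectorize(ax1_inner)  # 4 floats per access -> 128-bit (float4) loads on CUDA
```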
Execution
```bash
python test_individual/test_step4_final.py
```
Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.
Series Posts
- Previous: Step 3: Shared Memory
- Next: Step 5: Software Pipelining
