TVM Matrix Multiplication Optimization - Step 5: Software Pipelining


Step 5: Software Pipelining

Achieved 1029 GFLOPS

Results

| Matrix Size | Step 5      | Improvement |
|-------------|-------------|-------------|
| 256x256     | 855 GFLOPS  | +68%        |
| 512x512     | 1020 GFLOPS | +27%        |
| 1024x1024   | 1029 GFLOPS | +74%        |

Average: 968 GFLOPS

1. Compiler Theory: Software Pipelining

Basic Idea - Multiple Iterations Simultaneously

Problem: Sequential execution is inefficient

Iteration 0: Load (400 cycles) → Compute (100 cycles) → Store
Iteration 1:                      Load (400) → Compute (100) → Store
Iteration 2:                                     Load (400) → Compute (100) → Store

Total time = N × (400 + 100) = 500N cycles
  • When executed sequentially, GPU cores sit idle while waiting for data to arrive from memory.

Solution: Software Pipelining - Execute multiple iterations overlapped.

Iteration 0: Load ──→ Compute ──→ Store
Iteration 1:      Load ──→ Compute ──→ Store
Iteration 2:           Load ──→ Compute ──→ Store
Iteration 3:                Load ──→ Compute ──→ Store

Total time ≈ max(Load, Compute) × N = 400N cycles
Savings: 100N cycles
  • We overlap computation with memory transfer time.
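The savings can be checked with a back-of-the-envelope cost model. The sketch below (plain Python, using the illustrative cycle counts from the diagrams above) compares the sequential cost N × (Load + Compute) against the pipelined cost, where steady-state throughput is limited by the longest stage:

```python
def sequential_cycles(n, load=400, compute=100):
    # Every iteration waits for its own load before computing.
    return n * (load + compute)

def pipelined_cycles(n, load=400, compute=100):
    # With loads overlapped against the previous iteration's compute,
    # the steady state retires one iteration per max(load, compute)
    # cycles; only the first load and last compute fall outside the overlap.
    if n == 0:
        return 0
    return load + (n - 1) * max(load, compute) + compute

print(sequential_cycles(1000))  # 500000 cycles = 500N
print(pipelined_cycles(1000))   # 400100 cycles ≈ 400N for large N
```

For large N the pipelined cost approaches max(Load, Compute) × N, matching the 400N estimate above.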

2. TVM TensorIR Implementation

# Software pipelining: overlap memory loads with computation
sch.annotate(k_outer, "software_pipeline_stage", [0, 0, 0, 1, 1])
sch.annotate(k_outer, "software_pipeline_order", [0, 1, 2, 3, 4])
  • software_pipeline_stage assigns each statement in the loop body to a pipeline stage (here, the three loads to stage 0 and the two compute statements to stage 1), so statements in different stages can come from different iterations.
  • software_pipeline_order specifies the relative execution order of the statements within the pipelined loop body.
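To illustrate what these annotations mean (a simplified plain-Python model, not TVM's actual lowering), the sketch below maps each statement in the loop body to a pipeline stage and groups the (iteration, statement) pairs that can run in the same pipeline step. With stages [0, 0, 0, 1, 1], the stage-1 compute of iteration i overlaps the stage-0 loads of iteration i+1:

```python
def pipeline_steps(n_iters, stages, order):
    # Map each (iteration, statement) pair to the pipeline step it runs in:
    # a statement assigned to stage s in iteration i executes at step i + s.
    # `order` fixes the relative ordering of statements within a step.
    steps = {}
    for it in range(n_iters):
        for stmt in sorted(range(len(stages)), key=lambda s: order[s]):
            steps.setdefault(it + stages[stmt], []).append((it, stmt))
    return steps

# Three loads in stage 0, two computes in stage 1, as annotated above.
sched = pipeline_steps(3, stages=[0, 0, 0, 1, 1], order=[0, 1, 2, 3, 4])
# Step 0 runs only iteration 0's loads; step 1 runs iteration 0's
# computes alongside iteration 1's loads, and so on.
```

This is only a scheduling illustration; in TVM the actual transformation is performed during lowering, which also inserts the prologue and epilogue around the overlapped steady-state loop.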

Execution

python test_individual/test_step5.py

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

