TVM Matrix Multiplication Optimization - Step 5: Software Pipelining


Step 5: Software Pipelining

Achieved 1029 GFLOPS

Results

Matrix Size   Step 5        Improvement
256x256       855 GFLOPS    +68%
512x512       1020 GFLOPS   +27%
1024x1024     1029 GFLOPS   +74%

Average: 968 GFLOPS

1. Compiler Theory: Software Pipelining

Basic Idea - Multiple Iterations Simultaneously

Problem: Sequential execution is inefficient

Iteration 0: Load (400 cycles) → Compute (100 cycles) → Store
Iteration 1:                      Load (400) → Compute (100) → Store
Iteration 2:                                     Load (400) → Compute (100)

Total time = N × (400 + 100) = 500N cycles
  • When executed sequentially, GPU cores sit idle while waiting for data to arrive from memory.

Solution: Software Pipelining - Execute multiple iterations overlapped.

Iteration 0: Load ──────→ Compute ──→ Store
Iteration 1:         Load ──────→ Compute ──→ Store
Iteration 2:                 Load ──────→ Compute ──→ Store
Iteration 3:                         Load ──────→ Compute ──→ Store

Total time ≈ max(Load, Compute) × N = 400N cycles
Savings: 100N cycles
  • We overlap computation with memory transfer time.
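The savings can be sanity-checked with a toy cycle-count model. The 400/100-cycle figures are the illustrative numbers from the diagrams above, not measurements, and the functions are made up for this sketch:

```python
# Toy cycle-count model using the illustrative latencies from the text.
LOAD_CYCLES = 400      # memory load latency per tile (illustrative)
COMPUTE_CYCLES = 100   # compute time per tile (illustrative)

def sequential_cycles(n):
    # Each iteration waits for its load to finish before computing.
    return n * (LOAD_CYCLES + COMPUTE_CYCLES)

def pipelined_cycles(n):
    # The load for iteration k+1 overlaps the compute for iteration k,
    # so steady-state throughput is bounded by the slower stage.
    return n * max(LOAD_CYCLES, COMPUTE_CYCLES)

n = 32  # e.g. 32 K-tiles
print(sequential_cycles(n))                        # 16000
print(pipelined_cycles(n))                         # 12800
print(sequential_cycles(n) - pipelined_cycles(n))  # 3200 cycles saved
```

This ignores pipeline fill/drain overhead, which is why the text writes "≈" for the pipelined total.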

2. TVM TensorIR Implementation

Pipeline Annotation

TVM implements pipelining through annotations.

  • The pipeline overlaps execution of Load, Compute, and Writeback. Here, Load is stage 0, Compute is stage 1, and Writeback is stage 2.
# Assign stages to 4 blocks inside k_outer
sch.annotate(k_outer, "software_pipeline_stage", [0, 0, 1, 2])

# Execution order
sch.annotate(k_outer, "software_pipeline_order", [0, 1, 2, 3])
  • sch.annotate(k_outer, "software_pipeline_stage", [0, 0, 1, 2]): The k_outer loop is the target of pipelining. The "software_pipeline_stage" annotation assigns each of the 4 blocks inside the loop to a stage; the two shared-memory loads share stage 0.
    • Stage 0: Load (A_shared, B_shared) - Preload the next tile
    • Stage 1: Compute (C_local) - Compute with the current tile
    • Stage 2: Writeback (C_local → C) - Write the result
  • sch.annotate(k_outer, "software_pipeline_order", [0, 1, 2, 3]): This annotation determines execution order; [0, 1, 2, 3] keeps the 4 blocks inside k_outer in their original program order.

  • Pipelining works as follows:
Time   Stage 0 (Load)   Stage 1 (Compute)   Stage 2 (Writeback)
T1     Load k=0 tile    -                   -
T2     Load k=1 tile    Compute k=0 tile    -
T3     Load k=2 tile    Compute k=1 tile    Writeback k=0 result
T4     Load k=3 tile    Compute k=2 tile    Writeback k=1 result
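The timetable above can be reproduced with a small stage-rotation sketch. This is plain Python, not TVM; `pipeline_timetable` is a hypothetical helper for illustration only:

```python
# Simulate a 3-stage software pipeline (Load / Compute / Writeback)
# over n_tiles K-tiles. At time step t, the pipeline loads tile t,
# computes tile t-1, and writes back tile t-2.
def pipeline_timetable(n_tiles):
    rows = []
    for t in range(n_tiles + 2):  # n_tiles + (stages - 1) time steps
        load = f"Load k={t}" if t < n_tiles else ""
        compute = f"Compute k={t - 1}" if 0 <= t - 1 < n_tiles else ""
        writeback = f"Writeback k={t - 2}" if 0 <= t - 2 < n_tiles else ""
        rows.append((load, compute, writeback))
    return rows

for step, row in enumerate(pipeline_timetable(4), start=1):
    print(f"T{step}:", row)
# T1: ('Load k=0', '', '')
# T2: ('Load k=1', 'Compute k=0', '')
# T3: ('Load k=2', 'Compute k=1', 'Writeback k=0')
# ...
```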

3. Generated Execution Pattern

Prologue-Kernel-Epilogue Structure

# Prologue: Preload first tile
k=0: 
  Load A_tile[0], B_tile[0] to Shared Memory
  __syncthreads()

# Kernel: Execute multiple stages simultaneously
for k in 1..31:
  Stage 0: Load A_tile[k], B_tile[k]     # Next tile
  Stage 1: Compute using A_tile[k-1], B_tile[k-1]  # Current tile
  Stage 2: Writeback C_local[k-2] → C   # Previous result
  __syncthreads()

# Epilogue: Drain the pipeline (no more tiles to load)
after k=31:
  Compute using A_tile[31], B_tile[31]
  Writeback remaining results
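The same prologue/kernel/epilogue rotation can be sketched host-side with plain NumPy. The staging buffers stand in for shared memory; nothing actually overlaps on the CPU, so this only shows the structure, not the speedup:

```python
import numpy as np

def tiled_matmul(A, B, tile_k=16):
    """Tiled matmul with an explicit prologue/kernel/epilogue structure.

    a_buf/b_buf model the shared-memory tiles: the kernel loop stages
    tile t while computing with the previously staged tile t-1.
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    n_tiles = k // tile_k

    # Prologue: stage the first K-tile
    a_buf = A[:, 0:tile_k].copy()
    b_buf = B[0:tile_k, :].copy()

    # Kernel: "load" tile t, compute with tile t-1, rotate buffers
    for t in range(1, n_tiles):
        a_next = A[:, t * tile_k:(t + 1) * tile_k].copy()
        b_next = B[t * tile_k:(t + 1) * tile_k, :].copy()
        C += a_buf @ b_buf            # compute with the staged tile
        a_buf, b_buf = a_next, b_next  # rotate staging buffers

    # Epilogue: drain the last staged tile
    C += a_buf @ b_buf
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```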

4. Experiment: Double Buffering

# Double Buffering
sch.annotate(A_shared, "double_buffer_scope", 0)
sch.annotate(B_shared, "double_buffer_scope", 0)

Results:

Size        Basic Pipeline   + Double Buffer   Difference
1024x1024   1029 GFLOPS      1028 GFLOPS       -0.1%

Conclusion: Double buffering is counterproductive given the A500's small shared memory. It needs twice the shared-memory space of the basic pipeline, which reduces occupancy.
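A back-of-envelope shared-memory budget illustrates the occupancy effect. The tile size and per-SM capacity below are assumptions for illustration, not measured values for the A500:

```python
# Rough occupancy arithmetic: double buffering doubles the
# shared-memory footprint per block, which can halve (or worse)
# the number of blocks resident on an SM.
TILE = 32                        # assumed 32x32 float32 tiles
BYTES_PER_TILE = TILE * TILE * 4  # 4096 bytes per tile
SMEM_PER_SM = 48 * 1024           # assumed 48 KB shared memory per SM

single = 2 * BYTES_PER_TILE   # A_shared + B_shared = 8192 bytes
double = 2 * single           # two copies of each buffer = 16384 bytes

print(SMEM_PER_SM // single)  # blocks per SM, basic pipeline: 6
print(SMEM_PER_SM // double)  # blocks per SM, double buffered: 3
```

Fewer resident blocks means less latency-hiding parallelism, which on a bandwidth-limited GPU cancels out what double buffering was supposed to gain.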

5. Results

Performance Analysis

Step 4 → Step 5:

  • Average +58% improvement

Why is it so effective?

A500 bottlenecks:

  • Small memory bandwidth (192 GB/s)
  • Memory latency is the performance limiting factor

Software Pipelining effects:

  • Completely hides memory latency with computation
  • Fully utilizes bandwidth

Execution

python test_individual/test_step5_pipelining.py

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

