TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison


Step 7: cuBLAS Comparison

Results

Overall Comparison (Average)

| Implementation | Performance | vs cuBLAS |
| --- | --- | --- |
| NumPy (CPU) | 13 GFLOPS | 0.6% |
| TVM Step 6 | 1039 GFLOPS | 50.1% |
| cuBLAS (NVIDIA) | 2074 GFLOPS | 100% |
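
The GFLOPS figures follow from the standard operation count for an NxN matrix multiply: 2·N³ floating-point operations (one multiply and one add per inner-loop iteration) divided by the measured runtime. A minimal helper illustrating the conversion (hypothetical, not taken from the repo):

```python
def gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n matmul: 2*n^3 FLOPs divided by runtime in seconds."""
    return 2 * n**3 / seconds / 1e9

# Example: a 1024x1024 matmul finishing in 1 ms sustains ~2147 GFLOPS,
# roughly the cuBLAS average reported above.
print(round(gflops(1024, 0.001), 1))  # 2147.5
```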

Size-Specific Details

| Size | TVM Step 6 (Unrolling) | cuBLAS | TVM/cuBLAS |
| --- | --- | --- | --- |
| 512x512 | 1028 GFLOPS | 1302 GFLOPS | ~79.0% |
| 1024x1024 | 1050 GFLOPS | 2846 GFLOPS | ~36.9% |
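
The TVM/cuBLAS percentages follow directly from the two GFLOPS columns; a quick check:

```python
# Efficiency = TVM GFLOPS / cuBLAS GFLOPS, using the numbers from the table.
results = {
    "512x512":   (1028, 1302),
    "1024x1024": (1050, 2846),
}
for size, (tvm_gflops, cublas_gflops) in results.items():
    ratio = 100 * tvm_gflops / cublas_gflops
    print(f"{size}: {ratio:.1f}%")  # 79.0% and 36.9%, matching the table
```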

Analysis

Performance Characteristics of TVM Step 6 (Unrolling)

Around 79% at 512x512:

  • For small matrices, the combination of Tiling, Shared Memory, Software Pipelining, and Loop Unrolling lets TVM reach a high fraction of cuBLAS performance.

Around 37% at 1024x1024:

  • For larger matrices, cuBLAS’s more advanced optimizations (e.g., Tensor Core utilization, more aggressive tiling and vectorization) remain more effective.
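
The tiling idea shared by Step 6 and cuBLAS can be sketched on the CPU with NumPy. This is an illustrative blocked matmul, not the actual GPU schedule: each output tile is accumulated from small input tiles that fit in fast memory, mirroring how the GPU kernel stages tiles in shared memory.

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Tiled matrix multiply: work on tile x tile blocks so each block fits
    in fast memory (analogous to CUDA shared memory on the GPU)."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Accumulate the (i, j) output tile from one pair of input tiles.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)
# Same result as a plain matmul, up to float32 rounding.
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-2)
```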

Execution

```bash
# cuBLAS benchmark
python benchmarks/cublas_baseline.py

# TVM vs cuBLAS comparison
python benchmarks/compare_all_with_cublas.py
```

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

