TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison
Results
Overall Comparison (Average)
| Implementation | Performance | vs cuBLAS |
|---|---|---|
| NumPy (CPU) | 13 GFLOPS | 0.6% |
| TVM Step 6 | 1053 GFLOPS | 50.7% |
| cuBLAS (NVIDIA) | 2074 GFLOPS | 100% |
Size-Specific Details
| Size | TVM Step 6 | cuBLAS | TVM/cuBLAS |
|---|---|---|---|
| 512x512 | 1115 GFLOPS | 1302 GFLOPS | 85.6% |
| 1024x1024 | 990 GFLOPS | 2846 GFLOPS | 34.8% |
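The GFLOPS figures above follow from the standard convention that an NxN matrix multiplication performs 2·N³ floating-point operations (one multiply and one add per inner-product term). A minimal sketch of how such a number is derived, using NumPy as the CPU baseline from the overall table (the exact measurement code in the repo may differ):

```python
import time
import numpy as np

def gemm_gflops(n: int, seconds: float) -> float:
    """An n x n dense matmul performs 2*n^3 FLOPs (one multiply and one
    add per inner-product term); divide by time and 1e9 for GFLOPS."""
    return 2.0 * n**3 / seconds / 1e9

# CPU baseline sketch (the NumPy row of the table); size kept small here.
n = 256
a = np.random.rand(n, n).astype("float32")
b = np.random.rand(n, n).astype("float32")

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

print(f"{n}x{n} NumPy matmul: {gemm_gflops(n, elapsed):.1f} GFLOPS")
```

For example, a 1024x1024 matmul finishing in 1 ms corresponds to 2·1024³ / 0.001 / 1e9 ≈ 2147 GFLOPS; this is the formula behind every entry in the tables.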
Analysis
Performance Characteristics of TVM Step 6
At 512x512, TVM Step 6 reaches 85.6% of cuBLAS:
- TVM's optimization techniques remain effective at small matrix sizes
- Tiling, shared memory, and software pipelining are a good fit at this scale
At 1024x1024, the ratio drops to 34.8%:
- cuBLAS's more advanced optimizations pay off on large matrices
- cuBLAS applies additional optimizations such as Tensor Core utilization
Execution
```bash
# cuBLAS benchmark
python benchmarks/cublas_baseline.py
# TVM vs cuBLAS comparison
python benchmarks/compare_all_with_cublas.py
```
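The repo's benchmark scripts are not reproduced here, but the shape of a cuBLAS baseline can be sketched with CuPy, whose dense `matmul` dispatches to cuBLAS. This is a hypothetical sketch, not the actual `cublas_baseline.py`; it assumes a CUDA GPU and the `cupy` package, and falls back gracefully when neither is present:

```python
import time

def gemm_gflops(n: int, seconds: float) -> float:
    """2*n^3 FLOPs for an n x n GEMM, reported in GFLOPS."""
    return 2.0 * n**3 / seconds / 1e9

def bench_cublas(n: int, iters: int = 20) -> float:
    """Time an n x n float32 GEMM on the GPU via CuPy (backed by cuBLAS)."""
    import cupy as cp  # assumption: cupy is installed and a CUDA GPU exists
    a = cp.random.rand(n, n, dtype=cp.float32)
    b = cp.random.rand(n, n, dtype=cp.float32)
    cp.matmul(a, b)                       # warm-up (allocator, kernel cache)
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        cp.matmul(a, b)
    cp.cuda.Stream.null.synchronize()     # GPU kernels run async; wait here
    elapsed = (time.perf_counter() - start) / iters
    return gemm_gflops(n, elapsed)

if __name__ == "__main__":
    for n in (512, 1024):                 # the two sizes compared above
        try:
            print(f"{n}x{n}: {bench_cublas(n):.0f} GFLOPS")
        except ImportError:
            print("cupy not available; skipping GPU benchmark")
```

The explicit synchronization calls matter: without them the timer would measure only kernel launch overhead, since CUDA kernels execute asynchronously.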
Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.
Series Posts
- Previous: Step 6: Loop Unrolling