TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison

Step 7: cuBLAS Comparison

Results

Overall Comparison (Average)

| Implementation | Performance | vs cuBLAS |
|---|---|---|
| NumPy (CPU) | 13 GFLOPS | 0.6% |
| TVM Step 6 | 1053 GFLOPS | 50.7% |
| cuBLAS (NVIDIA) | 2074 GFLOPS | 100% |

Size-Specific Details

| Size | TVM Step 6 | cuBLAS | TVM/cuBLAS |
|---|---|---|---|
| 512x512 | 1115 GFLOPS | 1302 GFLOPS | 85.6% |
| 1024x1024 | 990 GFLOPS | 2846 GFLOPS | 34.8% |
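
The ratio column and the overall averages can be re-derived from the per-size numbers. The sketch below assumes the averages are unweighted means over the two sizes. That assumption is consistent with the reported 1053 and 2074 GFLOPS, but the post does not state it. The helper name `percent_of` is only illustrative.

```python
# Reported throughput in GFLOPS, copied from the size-specific table.
results = {
    "512x512": {"tvm": 1115, "cublas": 1302},
    "1024x1024": {"tvm": 990, "cublas": 2846},
}


def percent_of(value, baseline):
    """Return value as a percentage of baseline."""
    return 100.0 * value / baseline


for size, r in results.items():
    ratio = percent_of(r["tvm"], r["cublas"])
    print(f"{size}: TVM Step 6 reaches {ratio:.1f}% of cuBLAS")

# Assumption: the "Average" rows are unweighted means over the two sizes.
tvm_avg = sum(r["tvm"] for r in results.values()) / len(results)
cublas_avg = sum(r["cublas"] for r in results.values()) / len(results)
print(f"average: {tvm_avg:.1f} vs {cublas_avg:.1f} GFLOPS "
      f"({percent_of(tvm_avg, cublas_avg):.1f}% of cuBLAS)")
```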

Analysis

Performance Characteristics of TVM Step 6

TVM Step 6 reaches 85.6% of cuBLAS throughput at 512x512:

  • TVM’s optimization techniques work effectively at small matrix sizes
  • Tiling, shared memory, and software pipelining suit this problem size well
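
As a reminder of what tiling means here, the following is a minimal CPU-side sketch of a blocked matrix multiply in plain Python. It is not the TVM schedule used in this series, and the function names and the `tile` parameter are illustrative. It only shows the loop structure. On a GPU, each tile of A and B would typically be staged in shared memory before the inner loops run.

```python
def tiled_matmul(a, b, tile=4):
    """Multiply a (n x k) by b (k x m), processing tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile rows of C
        for j0 in range(0, m, tile):      # tile columns of C
            for p0 in range(0, k, tile):  # walk the reduction axis tile by tile
                # Multiply one tile of A by one tile of B, accumulating into C.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c


def naive_matmul(a, b):
    """Reference triple loop used to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Small integer-valued inputs, so both versions agree exactly.
a = [[float(i + j) for j in range(6)] for i in range(5)]
b = [[float(i * j + 1) for j in range(7)] for i in range(6)]
assert tiled_matmul(a, b) == naive_matmul(a, b)
```

The shapes are deliberately not multiples of the tile size, so the `min(...)` bounds handle the partial tiles at the edges.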

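TVM Step 6 reaches only 34.8% of cuBLAS throughput at 1024x1024:

  • TVM Step 6 drops slightly from 1115 to 990 GFLOPS, while cuBLAS more than doubles from 1302 to 2846 GFLOPS, so most of the gap comes from cuBLAS scaling up with matrix size
  • cuBLAS’s advanced optimization techniques are more effective on large matrices
  • cuBLAS includes additional optimizations such as Tensor Core utilization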

Execution

# cuBLAS benchmark
python benchmarks/cublas_baseline.py

# TVM vs cuBLAS comparison
python benchmarks/compare_all_with_cublas.py

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.
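
The benchmark scripts are not reproduced here. As a rough illustration of how a GFLOPS figure is usually derived for matrix multiplication, the sketch below assumes the conventional count of 2·N³ floating-point operations for an N×N multiply and a best-of-several wall-clock timing. The helper names are hypothetical, and the repository's scripts may measure differently. When timing GPU kernels, the device must also be synchronized before the timer is stopped.

```python
import time


def matmul_gflops(n, seconds):
    """GFLOPS of an n x n matmul, assuming 2 * n**3 floating-point operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def seconds_per_matmul(n, gflops):
    """Inverse of matmul_gflops: time per multiply implied by a GFLOPS figure."""
    return 2.0 * n ** 3 / (gflops * 1e9)


def best_time(fn, repeats=5):
    """Smallest wall-clock time of fn() over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Time per multiply implied by the reported 1024x1024 results,
# under the 2 * n**3 assumption.
for name, gflops in [("TVM Step 6", 990), ("cuBLAS", 2846)]:
    ms = seconds_per_matmul(1024, gflops) * 1e3
    print(f"{name}: {ms:.2f} ms per 1024x1024 multiply")
```

To measure a real kernel, pass a zero-argument callable that runs one multiply to `best_time` and feed the result to `matmul_gflops`.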

