TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison


Published: December 02, 2025

Step 7: cuBLAS Comparison

Results

Overall Comparison (Average)

Implementation	Performance (average over sizes)	vs cuBLAS
NumPy (CPU)	13 GFLOPS	0.6%
TVM Step 6	1053 GFLOPS	50.7%
cuBLAS (NVIDIA)	2074 GFLOPS	100%
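For reference, GFLOPS figures for matrix multiplication are conventionally derived from the operation count 2·N³ divided by the measured run time. The sketch below illustrates that conversion; the helper names `matmul_gflops` and `best_time` are hypothetical and are not taken from the repository's benchmark scripts, which may measure time differently.

```python
import time


def matmul_gflops(n, seconds):
    """Convert the run time of one (n x n) @ (n x n) matmul into GFLOPS."""
    flops = 2.0 * n ** 3  # conventional count: n^3 multiplies and n^3 adds
    return flops / seconds / 1e9


def best_time(fn, repeats=10, warmup=2):
    """Return the fastest wall-clock time of fn() over several runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Illustration only: a 512x512 matmul that finished in 0.25 ms
# would correspond to roughly 1074 GFLOPS.
print(f"{matmul_gflops(512, 0.25e-3):.0f} GFLOPS")
```

When timing a GPU kernel this way, the callable passed to `best_time` has to block until the device has finished; otherwise only the launch overhead is measured.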

Size-Specific Details

Size	TVM Step 6	cuBLAS	TVM/cuBLAS
512x512	1115 GFLOPS	1302 GFLOPS	85.6%
1024x1024	990 GFLOPS	2846 GFLOPS	34.8%
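The percentages in both tables can be reproduced from the GFLOPS values alone. The short sketch below does that arithmetic using only the numbers reported above; the variable and function names are chosen here for illustration.

```python
# (TVM Step 6 GFLOPS, cuBLAS GFLOPS) per matrix size, copied from the table
results = {
    "512x512": (1115, 1302),
    "1024x1024": (990, 2846),
}


def ratio_pct(tvm, cublas):
    """TVM throughput as a percentage of cuBLAS throughput."""
    return 100.0 * tvm / cublas


for size, (tvm, cublas) in results.items():
    print(f"{size}: {ratio_pct(tvm, cublas):.1f}% of cuBLAS")

# Averages across sizes, as used in the overall comparison
tvm_avg = sum(t for t, _ in results.values()) / len(results)
cublas_avg = sum(c for _, c in results.values()) / len(results)
print(f"average: {tvm_avg:.1f} vs {cublas_avg:.1f} GFLOPS "
      f"({ratio_pct(tvm_avg, cublas_avg):.1f}% of cuBLAS)")
```

Note that the overall 50.7% is the ratio of the averaged GFLOPS values, not the average of the two per-size percentages, which would be about 60.2%.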

Analysis

Performance Characteristics of TVM Step 6

Reaches 85.6% of cuBLAS at 512x512:

The optimizations applied through Step 6 are effective at small matrix sizes
Tiling, shared memory, and software pipelining suit this problem size well

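Drops to 34.8% of cuBLAS at 1024x1024:

TVM Step 6 throughput falls from 1115 to 990 GFLOPS, while cuBLAS more than doubles from 1302 to 2846 GFLOPS
cuBLAS’s more advanced optimizations pay off more on large matrices
cuBLAS includes additional optimizations such as Tensor Core utilization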

Execution

# cuBLAS benchmark
python benchmarks/cublas_baseline.py

# TVM vs cuBLAS comparison
python benchmarks/compare_all_with_cublas.py

Code can be found at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

Series Posts

Previous: Step 6: Loop Unrolling
