TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison


Published: December 02, 2025

Step 7: cuBLAS Comparison

Results

Overall Comparison (Average)

Implementation	Performance	vs. cuBLAS
NumPy (CPU)	13 GFLOPS	0.6%
TVM Step 6	1053 GFLOPS	50.7%
cuBLAS (NVIDIA)	2074 GFLOPS	100%
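
The GFLOPS figures above follow from the usual operation count for matrix multiplication: an M x K by K x N product performs 2·M·N·K floating-point operations. The sketch below shows one way to turn a measured time into GFLOPS. The helper names `matmul_gflops` and `best_time` are hypothetical and are not taken from the repository's benchmark scripts, which may measure differently.

```python
import time
from typing import Callable


def matmul_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul in GFLOPS.

    Each of the m*n outputs needs k multiplies and k adds, so 2*m*n*k FLOPs.
    """
    return 2.0 * m * n * k / seconds / 1e9


def best_time(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn.

    Hypothetical helper: warm-up calls are run first and not timed.
    """
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For a CPU baseline, `fn` could be something like `lambda: a @ b` on two NumPy arrays. GPU kernels launch asynchronously, so a GPU measurement also has to wait for the device to finish before stopping the clock.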

Per-Size Details

Size	TVM Step 6	cuBLAS	TVM/cuBLAS
512x512	1115 GFLOPS	1302 GFLOPS	85.6%
1024x1024	990 GFLOPS	2846 GFLOPS	34.8%
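
The ratio column and the averaged row can be reproduced from the per-size numbers. The sketch below assumes the "average" is a plain arithmetic mean over the two sizes. The post does not say so explicitly, but the figures are consistent with that assumption. The variable names are illustrative only.

```python
# Per-size throughput in GFLOPS, copied from the table above.
results = {
    "512x512": {"tvm": 1115.0, "cublas": 1302.0},
    "1024x1024": {"tvm": 990.0, "cublas": 2846.0},
}


def percent_of_baseline(value: float, baseline: float) -> float:
    """Express value as a percentage of baseline, rounded to one decimal."""
    return round(100.0 * value / baseline, 1)


# TVM/cuBLAS ratio for each matrix size.
per_size = {
    size: percent_of_baseline(r["tvm"], r["cublas"]) for size, r in results.items()
}

# Assumption: the averaged row is the arithmetic mean across sizes.
avg_tvm = sum(r["tvm"] for r in results.values()) / len(results)
avg_cublas = sum(r["cublas"] for r in results.values()) / len(results)
overall = percent_of_baseline(avg_tvm, avg_cublas)

print(per_size)                      # per-size percentages
print(avg_tvm, avg_cublas, overall)  # averaged GFLOPS and overall percentage
```

Because the overall percentage is a ratio of averages, the larger cuBLAS number at 1024x1024 dominates it. Averaging the two per-size percentages instead would give a different figure.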

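Analysis

Performance Characteristics of TVM Step 6

85.6% of cuBLAS at 512x512:

At small matrix sizes, the TVM optimizations work effectively.
Tiling, shared memory, and software pipelining are a good fit for this size.

34.8% of cuBLAS at 1024x1024:

At large matrix sizes, cuBLAS's advanced optimizations are more effective: cuBLAS throughput more than doubles from 512x512 to 1024x1024 (1302 to 2846 GFLOPS), while TVM Step 6 drops slightly (1115 to 990 GFLOPS).
cuBLAS includes additional optimizations, such as the use of Tensor Cores.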

Running

# cuBLAS benchmark
python benchmarks/cublas_baseline.py

# TVM vs cuBLAS comparison
python benchmarks/compare_all_with_cublas.py

The code is available at https://github.com/kimm240/matrix-multiplication-optimization-with-tvm.

Series Posts

Previous: Step 6: Loop Unrolling


Hyun Gyu Kim