TVM

TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison (Korean)

less than 1 minute read

Published: December 02, 2025

We compare our TVM-optimized matrix multiplication against NVIDIA cuBLAS. The 1053 GFLOPS reached in Step 6 is 50.7% of cuBLAS throughput, and at 512x512 the TVM kernel reaches 85.6% of cuBLAS performance.

TVM Matrix Multiplication Optimization - Step 7: cuBLAS Comparison

less than 1 minute read

Published: December 02, 2025

We compare our TVM-optimized matrix multiplication against NVIDIA cuBLAS. The 1053 GFLOPS reached in Step 6 is 50.7% of cuBLAS throughput, and at 512x512 the TVM kernel reaches 85.6% of cuBLAS performance.

TVM Matrix Multiplication Optimization - Step 6: Loop Unrolling (Korean)

1 minute read

Published: December 02, 2025

Loop unrolling brings the kernel to 1050 GFLOPS. Unrolling removes loop overhead and improves instruction-level parallelism, which lifts the final performance.

TVM Matrix Multiplication Optimization - Step 6: Loop Unrolling

1 minute read

Published: December 02, 2025

Loop unrolling brings the kernel to 1050 GFLOPS. Unrolling removes loop overhead and improves instruction-level parallelism, which lifts the final performance.

TVM Matrix Multiplication Optimization - Step 5: Software Pipelining (Korean)

3 minute read

Published: December 02, 2025

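Software pipelining brings the kernel to 1029 GFLOPS. Hiding memory latency behind computation yields an average 58% performance improvement. This post covers software pipelining, a technique that overlaps the execution of multiple loop iterations.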

TVM Matrix Multiplication Optimization - Step 5: Software Pipelining

1 minute read

Published: December 02, 2025

Software pipelining brings the kernel to 1029 GFLOPS. Hiding memory latency behind computation yields an average 58% performance improvement. This post covers software pipelining, a technique that overlaps the execution of multiple loop iterations.

TVM Matrix Multiplication Optimization - Step 4: Vectorization + Local Memory (Korean)

2 minute read

Published: December 02, 2025

Vectorization and local memory (register) caching reach an average of 614 GFLOPS. This post covers register optimization through scalar replacement and better use of memory bandwidth through vectorization.

TVM Matrix Multiplication Optimization - Step 4: Vectorization + Local Memory

1 minute read

Published: December 02, 2025

Vectorization and local memory (register) caching reach an average of 614 GFLOPS. This post covers register optimization through scalar replacement and better use of memory bandwidth through vectorization.

TVM Matrix Multiplication Optimization - Step 3: Shared Memory (Korean)

3 minute read

Published: December 02, 2025

Using shared memory, we achieved a 101% performance improvement on large matrices (2048x2048). This post covers the GPU memory hierarchy, caching strategies built on shared memory, and cooperative fetching.

TVM Matrix Multiplication Optimization - Step 3: Shared Memory

2 minute read

Published: December 02, 2025

Using shared memory, we achieved a 101% performance improvement on large matrices (2048x2048). This post covers the GPU memory hierarchy, caching strategies built on shared memory, and cooperative fetching.

TVM Matrix Multiplication Optimization - Step 2: Tiling + Loop Reordering (Korean)

5 minute read

Published: December 02, 2025

Tiling and loop reordering reach 481 GFLOPS, a 5.1x improvement over Step 1. This post covers tiling for cache optimization and loop reordering to maximize register reuse.

TVM Matrix Multiplication Optimization - Step 2: Tiling + Loop Reordering

3 minute read

Published: December 02, 2025

Tiling and loop reordering reach 481 GFLOPS, a 5.1x improvement over Step 1. This post covers tiling for cache optimization and loop reordering to maximize register reuse.

TVM Matrix Multiplication Optimization - Step 1: Simple GPU Binding (Korean)

4 minute read

Published: December 02, 2025

A basic GPU implementation reaches 95 GFLOPS. That is a 6.3x improvement over the CPU, but only 3.1% of the A500 peak (3.072 TFLOPS). This post covers a basic GPU implementation built on data parallelism and 2D thread mapping.

TVM Matrix Multiplication Optimization - Step 1: Simple GPU Binding

4 minute read

Published: December 02, 2025

A basic GPU implementation reaches 95 GFLOPS. That is a 6.3x improvement over the CPU, but only 3.1% of the A500 peak (3.072 TFLOPS). This post covers a basic GPU implementation built on data parallelism and 2D thread mapping.
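
The percentages and speedups quoted in these summaries can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch that uses only figures quoted in the summaries; the helper names `percent_of` and `speedup` are hypothetical, and the implied cuBLAS and CPU throughputs are back-calculated from the quoted ratios, not measured values.

```python
# Figures quoted in the post summaries (all in GFLOPS).
STEP1_GFLOPS = 95.0
STEP2_GFLOPS = 481.0
STEP6_GFLOPS = 1053.0       # value quoted in the Step 7 summary
A500_PEAK_GFLOPS = 3072.0   # 3.072 TFLOPS


def percent_of(value, reference):
    """Return `value` as a percentage of `reference` (hypothetical helper)."""
    return 100.0 * value / reference


def speedup(new, old):
    """Return how many times faster `new` is than `old` (hypothetical helper)."""
    return new / old


# Step 1 relative to the quoted A500 peak (the summary says 3.1%).
step1_pct_of_peak = percent_of(STEP1_GFLOPS, A500_PEAK_GFLOPS)

# Step 2 relative to Step 1 (the summary says 5.1x).
step2_speedup = speedup(STEP2_GFLOPS, STEP1_GFLOPS)

# Assumption: back-calculated from "50.7% of cuBLAS", not a measurement.
implied_cublas_gflops = STEP6_GFLOPS / 0.507

# Assumption: back-calculated from "6.3x over CPU", not a measurement.
implied_cpu_gflops = STEP1_GFLOPS / 6.3

print(f"Step 1 vs A500 peak: {step1_pct_of_peak:.1f}%")
print(f"Step 2 vs Step 1:    {step2_speedup:.1f}x")
print(f"Implied cuBLAS:      {implied_cublas_gflops:.0f} GFLOPS")
print(f"Implied CPU:         {implied_cpu_gflops:.1f} GFLOPS")
```

The first two results reproduce the 3.1% and 5.1x figures quoted for Steps 1 and 2. The last two are only what the quoted ratios imply and should not be read as benchmark results.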