tile-ai/tilelang
Tile Language (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). By employing a Pythonic syntax with an underlying compiler infrastructure on top of TVM, tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.

Latest News
- 09/29/2025 🎉: Thrilled to announce that AscendC and AscendNPU IR backends targeting Huawei Ascend chips are now supported! Check out the preview here: 🔗 link. This includes implementations across two branches: ascendc_pto and npuir. Feel free to explore and share your feedback!
- 07/04/2025 🚀: Introduced `T.gemm_sp` for 2:4 sparse tensor core support; check out Pull Request #526 for details.
- 06/05/2025 ✨: Added an NVRTC backend to significantly reduce compilation time for cute templates!
- 04/14/2025 🚀: Added a high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with Aiter's hand-optimized assembly kernels! See example_mla_amd for details.
- 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see example_mla_decode.py)! We also provide documentation explaining how TileLang achieves this.
- 02/15/2025 ✨: Added WebGPU Codegen support, see Pull Request #86!
- 02/12/2025 ✨: Excited to announce the release of v0.1.0!
- 02/10/2025 🚀: Added debug tools for TileLang: `T.print` for printing variables/buffers (docs) and a memory layout plotter (examples/plot_layout).
- 01/20/2025 ✨: We are excited to announce that tile-lang, a DSL for high-performance AI workloads, is now open source and available to the public!
Tested Devices
Although tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following:
- NVIDIA GPUs: H100 (with Auto TMA/WGMMA support), A100, V100, RTX 4090, RTX 3090, RTX A6000
- AMD GPUs: MI250 (with Auto MatrixCore support), MI300X (with Async Copy support)
OP Implementation Examples
tile-lang provides the building blocks to implement a wide variety of operators. Some examples include:
- Matrix Multiplication
- Dequantization GEMM
- Flash Attention
- Flash Linear Attention
- Flash MLA Decoding
- Native Sparse Attention
Within the `examples` directory, you will also find additional complex kernels, such as convolutions and forward/backward passes for FlashAttention; more operators will be added continuously.
Benchmark Summary
TileLang achieves exceptional performance across a variety of computational patterns. Comprehensive benchmark scripts and settings are available at tilelang-benchmark. Below are selected results showcasing its capabilities:
- MLA Decoding Performance on H100
- Flash Attention Performance on H100
- Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)
- Dequantize Matmul Performance on A100
Installation
Method 1: Install with Pip
The quickest way to get started is to install the latest release from PyPI:
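```bash
pip install tilelang
```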
Alternatively, you can install directly from the GitHub repository:
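```bash
pip install git+https://github.com/tile-ai/tilelang
```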
Or install locally:
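```bash
# --recursive pulls in the bundled TVM submodule
git clone --recursive https://github.com/tile-ai/tilelang
cd tilelang
pip install .
```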
Method 2: Build from Source
We currently provide three ways to install tile-lang from source:
- Install from Source (using your own TVM installation)
- Install from Source (using the bundled TVM submodule)
- Install Using the Provided Script
Method 3: Install with Nightly Version
For users who want access to the latest features and improvements before official releases, we provide nightly builds of tile-lang.
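A typical nightly install uses pip with an extra package index; the find-links URL below is illustrative, so check the project's installation docs for the current one:

```bash
# The nightly index URL here is an assumption; see the install docs for the current one.
pip install tilelang -f https://tile-ai.github.io/whl/nightly/cu121/
```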
Note: Nightly builds contain the most recent code changes but may be less stable than official releases. They’re ideal for testing new features or if you need a specific bugfix that hasn’t been released yet.
Quick Start
In this section, you’ll learn how to write and execute a straightforward GEMM (matrix multiplication) kernel using tile-lang, followed by techniques for layout optimizations, pipelining, and L2-cache–friendly swizzling.
GEMM Example with Annotations (Layout, L2 Cache Swizzling, Pipelining, etc.)
Below is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.
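A minimal sketch of such an annotated kernel is shown below. It follows the style of the project's GEMM examples (primitives like `T.Kernel`, `T.Pipelined`, `T.copy`, and `T.gemm`); the exact signatures and the tile sizes chosen here are illustrative and may vary between versions:

```python
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
    @T.prim_func
    def gemm_kernel(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One thread block per (block_M, block_N) tile of the output C.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            # Layout annotation: swizzle shared memory to avoid bank conflicts.
            T.annotate_layout({
                A_shared: tilelang.layout.make_swizzled_layout(A_shared),
                B_shared: tilelang.layout.make_swizzled_layout(B_shared),
            })
            # Rasterize thread blocks in panels for better L2 cache locality.
            T.use_swizzle(panel_size=10)

            T.clear(C_local)
            # Software pipelining: overlap global-to-shared copies with compute.
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                # T.copy lowers to a parallelized, cooperative tile copy.
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                # Accumulate a tile-level GEMM on the staged operands.
                T.gemm(A_shared, B_shared, C_local)

            T.copy(C_local, C[by * block_M, bx * block_N])

    return gemm_kernel
```

In recent releases the resulting kernel can be JIT-compiled (for example via `tilelang.compile`) and launched on torch tensors; see the documentation for the exact entry point in your installed version.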
Dive Deep into TileLang Beyond GEMM
In addition to GEMM, we provide a variety of examples to showcase the versatility and power of TileLang, including:
- Dequantize GEMM: Achieve high-performance dequantization with fine-grained control over per-thread operations; many of these features have been adopted as default behaviors in BitBLAS, which uses layout transformations and intrinsics to accelerate dequantized GEMM.
- FlashAttention: Enable cross-operator fusion with simple and intuitive syntax, and we also provide an example of auto tuning.
- LinearAttention: Examples include RetNet and Mamba implementations.
- Convolution: Implementations of Convolution with IM2Col.
Upcoming Features
Check our tilelang v0.2.0 release plan for upcoming features.
TileLang is already used in the BitBLAS and AttentionEngine projects.
Join the Discussion
Welcome to join our Discord community for discussions, support, and collaboration!
Acknowledgements
We would like to express our gratitude to the TVM community for their invaluable contributions. The initial version of this project was mainly developed by LeiWang1999, chengyupku and nox-410 with supervision from Prof. Zhi Yang at Peking University. Part of this work was carried out during an internship at Microsoft Research, where Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang offered valuable advice and support. We deeply appreciate their mentorship and contributions.