<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>CUDA on Producthunt daily</title>
        <link>https://producthunt.programnotes.cn/en/tags/cuda/</link>
        <description>Recent content in CUDA on Producthunt daily</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Tue, 14 Oct 2025 15:29:21 +0800</lastBuildDate><atom:link href="https://producthunt.programnotes.cn/en/tags/cuda/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>llama.cpp</title>
        <link>https://producthunt.programnotes.cn/en/p/llama.cpp/</link>
        <pubDate>Tue, 14 Oct 2025 15:29:21 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/llama.cpp/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1641738876363-a0728bf25a8d?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NjA0MjY4ODR8&amp;ixlib=rb-4.1.0" alt="Featured image of post llama.cpp" /&gt;&lt;h1 id=&#34;ggml-orgllamacpp&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml-org/llama.cpp&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;llamacpp&#34;&gt;llama.cpp
&lt;/h1&gt;&lt;p&gt;&lt;img src=&#34;https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png&#34; loading=&#34;lazy&#34; alt=&#34;llama&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opensource.org/licenses/MIT&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://img.shields.io/badge/license-MIT-blue.svg&#34; loading=&#34;lazy&#34; alt=&#34;License: MIT&#34;&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/releases&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://img.shields.io/github/v/release/ggml-org/llama.cpp&#34; loading=&#34;lazy&#34; alt=&#34;Release&#34;&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml/badge.svg&#34; loading=&#34;lazy&#34; alt=&#34;Server&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/205&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Manifesto&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/ggml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ops&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;LLM inference in C/C++&lt;/p&gt;
&lt;h2 id=&#34;recent-api-changes&#34;&gt;Recent API changes
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/issues/9289&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Changelog for &lt;code&gt;libllama&lt;/code&gt; API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/issues/9291&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Changelog for &lt;code&gt;llama-server&lt;/code&gt; REST API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;hot-topics&#34;&gt;Hot topics
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/15396&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;guide : running gpt-oss with llama.cpp&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/15313&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Support for the &lt;code&gt;gpt-oss&lt;/code&gt; model with native MXFP4 format has been added | &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/15091&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PR&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Collaboration with NVIDIA&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/15095&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Comment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hot PRs: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr&amp;#43;label%3Ahot&amp;#43;&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;All&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr&amp;#43;label%3Ahot&amp;#43;is%3Aopen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Open&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Multimodal support arrived in &lt;code&gt;llama-server&lt;/code&gt;: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/12898&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;#12898&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;./docs/multimodal.md&#34; &gt;documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;VS Code extension for FIM completions: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.vscode&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.vscode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Vim/Neovim plugin for FIM completions: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.vim&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.vim&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Introducing GGUF-my-LoRA &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/10123&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/10123&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face Inference Endpoints now support GGUF out of the box! &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/9669&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/9669&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hugging Face GGUF editor: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/9268&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;discussion&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/CISCai/gguf-editor&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;quick-start&#34;&gt;Quick start
&lt;/h2&gt;&lt;p&gt;Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install &lt;code&gt;llama.cpp&lt;/code&gt; using &lt;a class=&#34;link&#34; href=&#34;docs/install.md&#34; &gt;brew, nix or winget&lt;/a&gt; (see the example after this list)&lt;/li&gt;
&lt;li&gt;Run with Docker - see our &lt;a class=&#34;link&#34; href=&#34;docs/docker.md&#34; &gt;Docker documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Download pre-built binaries from the &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;releases page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Build from source by cloning this repository - check out &lt;a class=&#34;link&#34; href=&#34;docs/build.md&#34; &gt;our build guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
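&lt;p&gt;For example, the package-manager route is a one-liner (shown here for Homebrew; the nix and winget equivalents are covered in the install docs):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# install the prebuilt CLI and server via Homebrew (assumes brew is set up)
brew install llama.cpp
&lt;/code&gt;&lt;/pre&gt;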
&lt;p&gt;Once installed, you&amp;rsquo;ll need a model to work with. Head to the &lt;a class=&#34;link&#34; href=&#34;#obtaining-and-quantizing-models&#34; &gt;Obtaining and quantizing models&lt;/a&gt; section to learn more.&lt;/p&gt;
&lt;p&gt;Example command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Use a local model file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m my_model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Or download and run a model directly from Hugging Face&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Launch OpenAI-compatible API server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
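&lt;/div&gt;
&lt;p&gt;Once the server from the snippet above is running, it exposes an OpenAI-compatible REST API. A minimal sketch of a request, assuming the server&amp;rsquo;s default address of &lt;code&gt;127.0.0.1:8080&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# query the chat completions endpoint (default host and port assumed)
curl http://127.0.0.1:8080/v1/chat/completions \
    -H &#34;Content-Type: application/json&#34; \
    -d &#39;{&#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Hello!&#34;}]}&#39;
&lt;/code&gt;&lt;/pre&gt;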
&lt;h2 id=&#34;description&#34;&gt;Description
&lt;/h2&gt;&lt;p&gt;The main goal of &lt;code&gt;llama.cpp&lt;/code&gt; is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plain C/C++ implementation without any dependencies&lt;/li&gt;
&lt;li&gt;Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks&lt;/li&gt;
&lt;li&gt;AVX, AVX2, AVX512 and AMX support for x86 architectures&lt;/li&gt;
&lt;li&gt;1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use&lt;/li&gt;
&lt;li&gt;Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)&lt;/li&gt;
&lt;li&gt;Vulkan and SYCL backend support&lt;/li&gt;
&lt;li&gt;CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
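&lt;p&gt;As a sketch of the hybrid-inference point above, the number of layers offloaded to the GPU is chosen per run; this assumes a GPU-enabled build and uses the &lt;code&gt;-ngl&lt;/code&gt; (&lt;code&gt;--n-gpu-layers&lt;/code&gt;) flag:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# offload 32 layers to the GPU and keep the remainder on the CPU
# (illustrative layer count; requires a CUDA/HIP/Metal/Vulkan build)
llama-cli -m my_model.gguf -ngl 32
&lt;/code&gt;&lt;/pre&gt;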
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; project is the main playground for developing new features for the &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/ggml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml&lt;/a&gt; library.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Models&lt;/summary&gt;
&lt;p&gt;Typically finetunes of the base models below are supported as well.&lt;/p&gt;
&lt;p&gt;Instructions for adding support for new models: &lt;a class=&#34;link&#34; href=&#34;docs/development/HOWTO-add-model.md&#34; &gt;HOWTO-add-model.md&lt;/a&gt;&lt;/p&gt;
&lt;h4 id=&#34;text-only&#34;&gt;Text-only
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; LLaMA 🦙&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; LLaMA 2 🦙🦙&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; LLaMA 3 🦙🦙🦙&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai/Mistral-7B-v0.1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mistral 7B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=mistral-ai/Mixtral&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mixtral MoE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/databricks/dbrx-instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DBRX&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=tiiuae/falcon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Falcon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ymcui/Chinese-LLaMA-Alpaca&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Chinese LLaMA / Alpaca&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/ymcui/Chinese-LLaMA-Alpaca-2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Chinese LLaMA-2 / Alpaca-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/bofenghuang/vigogne&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Vigogne (French)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/5423&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://bair.berkeley.edu/blog/2023/04/03/koala/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Koala&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=baichuan-inc/Baichuan&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Baichuan 1 &amp;amp; 2&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hiyouga/baichuan-7b-sft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;derivations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=BAAI/Aquila&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Aquila 1 &amp;amp; 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/3187&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Starcoder models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/smallcloudai/Refact-1_6B-fim&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Refact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/3417&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MPT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/3553&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Bloom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=01-ai/Yi&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yi models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/stabilityai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StableLM models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=deepseek-ai/deepseek&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Deepseek models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=Qwen/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/3557&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PLaMo-13B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=microsoft/phi&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/11003&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PhiMoE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/gpt2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPT-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/5118&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Orion 14B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=internlm2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternLM2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/WisdomShell/codeshell&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CodeShell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/state-spaces/mamba&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mamba&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/keyfan/grok-1-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Grok-1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=xverse&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Xverse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=CohereForAI/c4ai-command-r&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Command-R models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=sea-lion&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SEA-LION&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/GritLM/GritLM-7B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GritLM-7B&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/GritLM/GritLM-8x7B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GritLM-8x7B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://allenai.org/olmo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OLMo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://allenai.org/olmo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OLMo 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/allenai/OLMoE-1B-7B-0924&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OLMoE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Granite models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/EleutherAI/gpt-neox&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPT-NeoX&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://github.com/EleutherAI/pythia&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pythia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Snowflake-Arctic MoE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=Smaug&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Smaug&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/LumiOpen/Poro-34B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Poro 34B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/1bitLLM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Bitnet b1.58 models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=flan-t5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Flan T5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Open Elm models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/chatglm3-6b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ChatGLM3-6b&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/glm-4-9b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ChatGLM4-9b&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/glm-edge-1.5b-chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLMEdge-1.5b&lt;/a&gt; + &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/glm-edge-4b-chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLMEdge-4b&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-4-0414&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SmolLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;EXAONE-3.0-7.8B-Instruct&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FalconMamba Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/inceptionai/jais-13b-chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Jais&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Bielik-11B-v2.3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/BlinkDL/RWKV-LM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RWKV-6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;QRWKV-6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GigaChat-20B-A3B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/trillionlabs/Trillion-7B-preview&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Trillion-7B-preview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ling models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LFM2 models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hunyuan models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;multimodal&#34;&gt;Multimodal
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA 1.5 models&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA 1.6 models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=SkunkworksAI/Bakllava&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BakLLaVA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/NousResearch/Obsidian-3B-V0.5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Obsidian&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=Lin-Chen/ShareGPT4V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ShareGPT4V&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=mobileVLM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MobileVLM 1.7B/3B models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=Yi-VL&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yi-VL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=MiniCPM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mini CPM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/vikhyatk/moondream2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Moondream&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/BAAI-DCAI/Bunny&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Bunny&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?search=glm-edge&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-EDGE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2-VL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LFM2-VL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Bindings&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Python: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ddh0/easy-llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ddh0/easy-llama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Python: &lt;a class=&#34;link&#34; href=&#34;https://github.com/abetlen/llama-cpp-python&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;abetlen/llama-cpp-python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Go: &lt;a class=&#34;link&#34; href=&#34;https://github.com/go-skynet/go-llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;go-skynet/go-llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Node.js: &lt;a class=&#34;link&#34; href=&#34;https://github.com/withcatai/node-llama-cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;withcatai/node-llama-cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;JS/TS (llama.cpp server client): &lt;a class=&#34;link&#34; href=&#34;https://modelfusion.dev/integration/model-provider/llamacpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;lgrammel/modelfusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;JS/TS (Programmable Prompt Engine CLI): &lt;a class=&#34;link&#34; href=&#34;https://github.com/offline-ai/cli&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;offline-ai/cli&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;JavaScript/Wasm (works in browser): &lt;a class=&#34;link&#34; href=&#34;https://github.com/tangledgroup/llama-cpp-wasm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;tangledgroup/llama-cpp-wasm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Typescript/Wasm (nicer API, available on npm): &lt;a class=&#34;link&#34; href=&#34;https://github.com/ngxson/wllama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ngxson/wllama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ruby: &lt;a class=&#34;link&#34; href=&#34;https://github.com/yoshoku/llama_cpp.rb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;yoshoku/llama_cpp.rb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rust (more features): &lt;a class=&#34;link&#34; href=&#34;https://github.com/edgenai/llama_cpp-rs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;edgenai/llama_cpp-rs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rust (nicer API): &lt;a class=&#34;link&#34; href=&#34;https://github.com/mdrokz/rust-llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;mdrokz/rust-llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rust (more direct bindings): &lt;a class=&#34;link&#34; href=&#34;https://github.com/utilityai/llama-cpp-rs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;utilityai/llama-cpp-rs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rust (automated build from crates.io): &lt;a class=&#34;link&#34; href=&#34;https://github.com/ShelbyJenkins/llm_client&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ShelbyJenkins/llm_client&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;C#/.NET: &lt;a class=&#34;link&#34; href=&#34;https://github.com/SciSharp/LLamaSharp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SciSharp/LLamaSharp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;C#/VB.NET (more features - community license): &lt;a class=&#34;link&#34; href=&#34;https://docs.lm-kit.com/lm-kit-net/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LM-Kit.NET&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Scala 3: &lt;a class=&#34;link&#34; href=&#34;https://github.com/donderom/llm4s&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;donderom/llm4s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Clojure: &lt;a class=&#34;link&#34; href=&#34;https://github.com/phronmophobic/llama.clj&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;phronmophobic/llama.clj&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;React Native: &lt;a class=&#34;link&#34; href=&#34;https://github.com/mybigday/llama.rn&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;mybigday/llama.rn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Java: &lt;a class=&#34;link&#34; href=&#34;https://github.com/kherud/java-llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;kherud/java-llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Java: &lt;a class=&#34;link&#34; href=&#34;https://github.com/QuasarByte/llama-cpp-jna&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;QuasarByte/llama-cpp-jna&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zig: &lt;a class=&#34;link&#34; href=&#34;https://github.com/Deins/llama.cpp.zig&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deins/llama.cpp.zig&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Flutter/Dart: &lt;a class=&#34;link&#34; href=&#34;https://github.com/netdur/llama_cpp_dart&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;netdur/llama_cpp_dart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Flutter: &lt;a class=&#34;link&#34; href=&#34;https://github.com/xuegao-tzx/Fllama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;xuegao-tzx/Fllama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;PHP (API bindings and features built on top of llama.cpp): &lt;a class=&#34;link&#34; href=&#34;https://github.com/distantmagic/resonance&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;distantmagic/resonance&lt;/a&gt; &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/6326&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;(more info)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Guile Scheme: &lt;a class=&#34;link&#34; href=&#34;https://savannah.nongnu.org/projects/guile-llama-cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;guile_llama_cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Swift &lt;a class=&#34;link&#34; href=&#34;https://github.com/srgtuszy/llama-cpp-swift&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;srgtuszy/llama-cpp-swift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Swift &lt;a class=&#34;link&#34; href=&#34;https://github.com/ShenghaiWang/SwiftLlama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ShenghaiWang/SwiftLlama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Delphi &lt;a class=&#34;link&#34; href=&#34;https://github.com/Embarcadero/llama-cpp-delphi&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Embarcadero/llama-cpp-delphi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;UIs&lt;/summary&gt;
&lt;p&gt;&lt;em&gt;(to have a project listed here, it should clearly state that it depends on &lt;code&gt;llama.cpp&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/yaroslavyaroslav/OpenAI-sublime-text&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AI Sublime Text plugin&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/cztomsik/ava&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cztomsik/ava&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/alexpinel/Dot&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dot&lt;/a&gt; (GPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ylsdamxssjxxdd/eva&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;eva&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/iohub/coLLaMA&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;iohub/collama&lt;/a&gt; (Apache-2.0)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/janhq/jan&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;janhq/jan&lt;/a&gt; (AGPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/johnbean393/Sidekick&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;johnbean393/Sidekick&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/zhouwg/kantv?tab=readme-ov-file&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;KanTV&lt;/a&gt; (Apache-2.0)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/firatkiral/kodibot&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;KodiBot&lt;/a&gt; (GPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.vim&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.vim&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/abgulati/LARS&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LARS&lt;/a&gt; (AGPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/vietanhdev/llama-assistant&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama Assistant&lt;/a&gt; (GPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/guinmoon/LLMFarm?tab=readme-ov-file&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLMFarm&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/undreamai/LLMUnity&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLMUnity&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://lmstudio.ai/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LMStudio&lt;/a&gt; (proprietary)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/mudler/LocalAI&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LocalAI&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/LostRuins/koboldcpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LostRuins/koboldcpp&lt;/a&gt; (AGPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://mindmac.app&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MindMac&lt;/a&gt; (proprietary)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/MindWorkAI/AI-Studio&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MindWorkAI/AI-Studio&lt;/a&gt; (FSL-1.1-MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Mobile-Artificial-Intelligence/maid&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mobile-Artificial-Intelligence/maid&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Mozilla-Ocho/llamafile&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mozilla-Ocho/llamafile&lt;/a&gt; (Apache-2.0)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/nat/openplayground&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;nat/openplayground&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/nomic-ai/gpt4all&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;nomic-ai/gpt4all&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ollama/ollama&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/oobabooga/text-generation-webui&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;oobabooga/text-generation-webui&lt;/a&gt; (AGPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/a-ghorbani/pocketpal-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PocketPal AI&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/psugihara/FreeChat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;psugihara/FreeChat&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ptsochantaris/emeltal&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ptsochantaris/emeltal&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/pythops/tenere&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;pythops/tenere&lt;/a&gt; (AGPL)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/containers/ramalama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ramalama&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/semperai/amica&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;semperai/amica&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/withcatai/catai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;withcatai/catai&lt;/a&gt; (MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/blackhole89/autopen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Autopen&lt;/a&gt; (GPL)&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Tools&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/akx/ggify&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;akx/ggify&lt;/a&gt; – download PyTorch models from HuggingFace Hub and convert them to GGML&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/akx/ollama-dl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;akx/ollama-dl&lt;/a&gt; – download models from the Ollama library to be used directly with llama.cpp&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/crashr/gppm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;crashr/gppm&lt;/a&gt; – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;gpustack/gguf-parser&lt;/a&gt; - review/check the GGUF file and estimate the memory usage&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Styled Lines&lt;/a&gt; (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Infrastructure&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/intentee/paddler&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Paddler&lt;/a&gt; - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/gpustack/gpustack&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPUStack&lt;/a&gt; - Manage GPU clusters for running LLMs&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/onicai/llama_cpp_canister&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama_cpp_canister&lt;/a&gt; - llama.cpp as a smart contract on the Internet Computer, using WebAssembly&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/mostlygeek/llama-swap&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama-swap&lt;/a&gt; - transparent proxy that adds automatic model switching with llama-server&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/kalavai-net/kalavai-client&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kalavai&lt;/a&gt; - Crowdsource end to end LLM deployment at any scale&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/InftyAI/llmaz&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llmaz&lt;/a&gt; - ☸️ Easy, advanced inference platform for large language models on Kubernetes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Games&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/MorganRO8/Lucys_Labyrinth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Lucy&amp;rsquo;s Labyrinth&lt;/a&gt; - A simple maze game where agents controlled by an AI model will try to trick you.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;h2 id=&#34;supported-backends&#34;&gt;Supported backends
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Backend&lt;/th&gt;
          &lt;th&gt;Target devices&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#metal-build&#34; &gt;Metal&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Apple Silicon&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#blas-build&#34; &gt;BLAS&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;All&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/backend/BLIS.md&#34; &gt;BLIS&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;All&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/backend/SYCL.md&#34; &gt;SYCL&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Intel and Nvidia GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#musa&#34; &gt;MUSA&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Moore Threads GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#cuda&#34; &gt;CUDA&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Nvidia GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#hip&#34; &gt;HIP&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;AMD GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#vulkan&#34; &gt;Vulkan&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#cann&#34; &gt;CANN&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Ascend NPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/backend/OPENCL.md&#34; &gt;OpenCL&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Adreno GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/backend/zDNN.md&#34; &gt;IBM zDNN&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;IBM Z &amp;amp; LinuxONE&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md#webgpu&#34; &gt;WebGPU [In Progress]&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;All&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RPC&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;All&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
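&lt;p&gt;As an example of selecting a backend at build time, a CMake sketch for a CUDA build (flag taken from the build docs; assumes the CUDA toolkit is installed):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# configure and build with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
&lt;/code&gt;&lt;/pre&gt;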
&lt;h2 id=&#34;obtaining-and-quantizing-models&#34;&gt;Obtaining and quantizing models
&lt;/h2&gt;&lt;p&gt;The &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face&lt;/a&gt; platform hosts a &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?library=gguf&amp;amp;sort=trending&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;number of LLMs&lt;/a&gt; compatible with &lt;code&gt;llama.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?library=gguf&amp;amp;sort=trending&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Trending&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?sort=trending&amp;amp;search=llama&amp;#43;gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can either manually download the GGUF file or directly use any &lt;code&gt;llama.cpp&lt;/code&gt;-compatible models from &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face&lt;/a&gt; or other model hosting sites, such as &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ModelScope&lt;/a&gt;, by using this CLI argument: &lt;code&gt;-hf &amp;lt;user&amp;gt;/&amp;lt;model&amp;gt;[:quant]&lt;/code&gt;. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
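&lt;/div&gt;
&lt;p&gt;The optional &lt;code&gt;:quant&lt;/code&gt; suffix selects a specific quantization, assuming the repository publishes a file for it:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# pick the Q8_0 quantization explicitly (assumes the repo ships a Q8_0 file)
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;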
&lt;p&gt;By default, the CLI downloads from Hugging Face; you can point it at other sources with the &lt;code&gt;MODEL_ENDPOINT&lt;/code&gt; environment variable. For example, set &lt;code&gt;MODEL_ENDPOINT=https://www.modelscope.cn/&lt;/code&gt; to download model checkpoints from ModelScope or other model-sharing communities.&lt;/p&gt;
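&lt;p&gt;A sketch of a one-off endpoint override, assuming the same &lt;code&gt;&amp;lt;user&amp;gt;/&amp;lt;model&amp;gt;&lt;/code&gt; path exists on the mirror:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# download from ModelScope instead of Hugging Face for this invocation
MODEL_ENDPOINT=https://www.modelscope.cn/ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
&lt;/code&gt;&lt;/pre&gt;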
&lt;p&gt;After downloading a model, use the CLI tools to run it locally - see below.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; requires the model to be stored in the &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/ggml/blob/master/docs/gguf.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF&lt;/a&gt; file format. Models in other data formats can be converted to GGUF using the &lt;code&gt;convert_*.py&lt;/code&gt; Python scripts in this repo.&lt;/p&gt;
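&lt;p&gt;As a sketch, converting a local Hugging Face checkpoint with one of those scripts (script name and flags assumed from the repo; paths are illustrative):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# convert a Hugging Face model directory to a GGUF file
python convert_hf_to_gguf.py ./my-hf-model --outfile my_model-f16.gguf --outtype f16
&lt;/code&gt;&lt;/pre&gt;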
&lt;p&gt;The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with &lt;code&gt;llama.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/ggml-org/gguf-my-repo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF-my-repo space&lt;/a&gt; to convert to GGUF format and quantize model weights to smaller sizes&lt;/li&gt;
&lt;li&gt;Use the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/ggml-org/gguf-my-lora&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF-my-LoRA space&lt;/a&gt; to convert LoRA adapters to GGUF format (more info: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/10123&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/10123&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Use the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/CISCai/gguf-editor&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF-editor space&lt;/a&gt; to edit GGUF meta data in the browser (more info: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/9268&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/9268&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Use the &lt;a class=&#34;link&#34; href=&#34;https://ui.endpoints.huggingface.co/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Inference Endpoints&lt;/a&gt; to directly host &lt;code&gt;llama.cpp&lt;/code&gt; in the cloud (more info: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/9669&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/9669&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To learn more about model quantization, &lt;a class=&#34;link&#34; href=&#34;tools/quantize/README.md&#34; &gt;read this documentation&lt;/a&gt;.&lt;/p&gt;
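&lt;p&gt;For instance, a typical quantization step with the bundled tool looks like this (a sketch; &lt;code&gt;Q4_K_M&lt;/code&gt; is one common choice among the types listed in that documentation):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;# shrink an f16 GGUF to a 4-bit quantization
llama-quantize my_model-f16.gguf my_model-Q4_K_M.gguf Q4_K_M
&lt;/code&gt;&lt;/pre&gt;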
&lt;h2 id=&#34;llama-cli&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;tools/main&#34; &gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;a-cli-tool-for-accessing-and-experimenting-with-most-of-llamacpps-functionality&#34;&gt;A CLI tool for accessing and experimenting with most of &lt;code&gt;llama.cpp&lt;/code&gt;&amp;rsquo;s functionality.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details open&gt;
  &lt;summary&gt;Run in conversation mode&lt;/summary&gt;
&lt;p&gt;Models with a built-in chat template will automatically activate conversation mode. If this doesn&amp;rsquo;t occur, you can manually enable it by adding &lt;code&gt;-cnv&lt;/code&gt; and specifying a suitable chat template with &lt;code&gt;--chat-template NAME&lt;/code&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# &amp;gt; hi, who are you?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Hi there! I&amp;#39;m your helpful assistant! I&amp;#39;m an AI-powered chatbot designed to assist and provide information to users like you. I&amp;#39;m here to help answer your questions, provide guidance, and offer support on a wide range of topics. I&amp;#39;m a friendly and knowledgeable AI, and I&amp;#39;m always happy to help with anything you need. What&amp;#39;s on your mind, and how can I assist you today?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# &amp;gt; what is 1+1?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Easy peasy! The answer to 1+1 is... 2!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Run in conversation mode with custom chat template&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# use the &amp;#34;chatml&amp;#34; template (use -h to see the list of supported templates)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m model.gguf -cnv --chat-template chatml
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# use a custom template&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m model.gguf -cnv --in-prefix &lt;span class=&#34;s1&#34;&gt;&amp;#39;User: &amp;#39;&lt;/span&gt; --reverse-prompt &lt;span class=&#34;s1&#34;&gt;&amp;#39;User:&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Run simple text completion&lt;/summary&gt;
&lt;p&gt;To disable conversation mode explicitly, use &lt;code&gt;-no-cnv&lt;/code&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m model.gguf -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;I believe the meaning of life is&amp;#34;&lt;/span&gt; -n &lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; -no-cnv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don&amp;#39;t align with societal expectations. I think that&amp;#39;s what I love about yoga – it&amp;#39;s not just a physical practice, but a spiritual one too. It&amp;#39;s about connecting with yourself, listening to your inner voice, and honoring your own unique journey.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Constrain the output with a custom grammar&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-cli -m model.gguf -n &lt;span class=&#34;m&#34;&gt;256&lt;/span&gt; --grammar-file grammars/json.gbnf -p &lt;span class=&#34;s1&#34;&gt;&amp;#39;Request: schedule a call at 8pm; Command:&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# {&amp;#34;appointmentTime&amp;#34;: &amp;#34;8pm&amp;#34;, &amp;#34;appointmentDetails&amp;#34;: &amp;#34;schedule a a call&amp;#34;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;a class=&#34;link&#34; href=&#34;grammars/&#34; &gt;grammars/&lt;/a&gt; folder contains a handful of sample grammars. To write your own, check out the &lt;a class=&#34;link&#34; href=&#34;grammars/README.md&#34; &gt;GBNF Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For authoring more complex JSON grammars, check out &lt;a class=&#34;link&#34; href=&#34;https://grammar.intrinsiclabs.ai/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://grammar.intrinsiclabs.ai/&lt;/a&gt;&lt;/p&gt;
  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
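&lt;p&gt;To sketch what writing your own grammar looks like end to end, the following creates a one-rule GBNF file that restricts the model to answering &amp;#34;yes&amp;#34; or &amp;#34;no&amp;#34; (grammar and prompt are illustrative):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# write a minimal one-rule grammar: the output must be exactly yes or no
cat &amp;gt; yesno.gbnf &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
root ::= (&amp;#34;yes&amp;#34; | &amp;#34;no&amp;#34;)
EOF

# run a completion constrained by the grammar
llama-cli -m model.gguf --grammar-file yesno.gbnf -p &amp;#39;Is water wet? Answer: &amp;#39; -no-cnv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;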
&lt;h2 id=&#34;llama-server&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;tools/server&#34; &gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;a-lightweight-openai-api-compatible-http-server-for-serving-llms&#34;&gt;A lightweight, &lt;a class=&#34;link&#34; href=&#34;https://github.com/openai/openai-openapi&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAI API&lt;/a&gt; compatible, HTTP server for serving LLMs.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details open&gt;
  &lt;summary&gt;Start a local HTTP server with default configuration on port 8080&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf --port &lt;span class=&#34;m&#34;&gt;8080&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Basic web UI can be accessed via browser: http://localhost:8080&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Chat completion endpoint: http://localhost:8080/v1/chat/completions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Support multiple users and parallel decoding&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# up to 4 concurrent requests, each with 4096 max context&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf -c &lt;span class=&#34;m&#34;&gt;16384&lt;/span&gt; -np &lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Enable speculative decoding&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# the draft.gguf model should be a small variant of the target model.gguf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf -md draft.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Serve an embedding model&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# use the /embedding endpoint&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf --embedding --pooling cls -ub &lt;span class=&#34;m&#34;&gt;8192&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Serve a reranking model&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# use the /reranking endpoint&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf --reranking
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Constrain all outputs with a grammar&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# custom grammar&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf --grammar-file grammar.gbnf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# JSON&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server -m model.gguf --grammar-file grammars/json.gbnf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
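&lt;p&gt;Because the server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal sketch with plain &lt;code&gt;curl&lt;/code&gt;, assuming a server started as in the first example above on the default port 8080:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# send a chat request to the OpenAI-compatible endpoint of a running llama-server
curl http://localhost:8080/v1/chat/completions \
    -H &amp;#34;Content-Type: application/json&amp;#34; \
    -d &amp;#39;{&amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;}]}&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;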
&lt;h2 id=&#34;llama-perplexity&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;tools/perplexity&#34; &gt;&lt;code&gt;llama-perplexity&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;a-tool-for-measuring-the-perplexity--and-other-quality-metrics-of-a-model-over-a-given-text&#34;&gt;A tool for measuring the &lt;a class=&#34;link&#34; href=&#34;tools/perplexity/README.md&#34; &gt;perplexity&lt;/a&gt; &lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; (and other quality metrics) of a model over a given text.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details open&gt;
  &lt;summary&gt;Measure the perplexity over a text file&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-perplexity -m model.gguf -f file.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Final estimate: PPL = 5.4007 +/- 0.67339&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Measure KL divergence&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# TODO&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;llama-bench&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;tools/llama-bench&#34; &gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;benchmark-the-performance-of-the-inference-for-various-parameters&#34;&gt;Benchmark inference performance for various parameters.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details open&gt;
  &lt;summary&gt;Run default benchmark&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Output:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# | model               |       size |     params | backend    | threads |          test |                  t/s |&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# build: 3e0ba0e60 (4229)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
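&lt;p&gt;To benchmark a specific workload, the prompt-processing size, generation length, and thread count can be varied; a sketch (verify the flags with &lt;code&gt;llama-bench --help&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# sketch: measure 1024-token prompt processing and 256-token generation on 8 threads
llama-bench -m model.gguf -p 1024 -n 256 -t 8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;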
&lt;h2 id=&#34;llama-run&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;tools/run&#34; &gt;&lt;code&gt;llama-run&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;a-comprehensive-example-for-running-llamacpp-models-useful-for-inferencing-used-with-ramalama-&#34;&gt;A comprehensive example for running &lt;code&gt;llama.cpp&lt;/code&gt; models. Useful for inferencing. Used with RamaLama &lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Run a model with a specific prompt (by default it&#39;s pulled from the Ollama registry)&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-run granite-code
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;llama-simple&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;examples/simple&#34; &gt;&lt;code&gt;llama-simple&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;&lt;h4 id=&#34;a-minimal-example-for-implementing-apps-with-llamacpp-useful-for-developers&#34;&gt;A minimal example for implementing apps with &lt;code&gt;llama.cpp&lt;/code&gt;. Useful for developers.
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;details&gt;
  &lt;summary&gt;Basic text completion&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-simple -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called &amp;#34;The Art of&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;  &lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Contributors can open PRs&lt;/li&gt;
&lt;li&gt;Collaborators will be invited based on contributions&lt;/li&gt;
&lt;li&gt;Maintainers can push to branches in the &lt;code&gt;llama.cpp&lt;/code&gt; repo and merge PRs into the &lt;code&gt;master&lt;/code&gt; branch&lt;/li&gt;
&lt;li&gt;Any help with managing issues, PRs and projects is very appreciated!&lt;/li&gt;
&lt;li&gt;See &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue&amp;#43;is%3Aopen&amp;#43;label%3A%22good&amp;#43;first&amp;#43;issue%22&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;good first issues&lt;/a&gt; for tasks suitable for first contributions&lt;/li&gt;
&lt;li&gt;Read the &lt;a class=&#34;link&#34; href=&#34;CONTRIBUTING.md&#34; &gt;CONTRIBUTING.md&lt;/a&gt; for more information&lt;/li&gt;
&lt;li&gt;Make sure to read this: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/205&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Inference at the edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A bit of backstory for those who are interested: &lt;a class=&#34;link&#34; href=&#34;https://changelog.com/podcast/532&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Changelog podcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;other-documentation&#34;&gt;Other documentation
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;tools/main/README.md&#34; &gt;main (cli)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;tools/server/README.md&#34; &gt;server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;grammars/README.md&#34; &gt;GBNF grammars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;development-documentation&#34;&gt;Development documentation
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;docs/build.md&#34; &gt;How to build&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;docs/docker.md&#34; &gt;Running on Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;docs/android.md&#34; &gt;Build on Android&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;docs/development/token_generation_performance_tips.md&#34; &gt;Performance troubleshooting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&amp;amp;-Tricks&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGML tips &amp;amp; tricks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;seminal-papers-and-background-on-the-models&#34;&gt;Seminal papers and background on the models
&lt;/h4&gt;&lt;p&gt;If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLaMA:
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.facebook.com/blog/large-language-model-llama-meta-ai/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Introducing LLaMA: A foundational, 65-billion-parameter large language model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2302.13971&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA: Open and Efficient Foundation Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;GPT-3
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2005.14165&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Language Models are Few-Shot Learners&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;GPT-3.5 / InstructGPT / ChatGPT:
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://openai.com/research/instruction-following&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Aligning language models to follow instructions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2203.02155&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Training language models to follow instructions with human feedback&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;xcframework&#34;&gt;XCFramework
&lt;/h2&gt;&lt;p&gt;The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,
and macOS. It can be used in Swift projects without the need to compile the
library from source. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-swift&#34; data-lang=&#34;swift&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// swift-tools-version: 5.10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// The swift-tools-version declares the minimum version of Swift required to build this package.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kd&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nc&#34;&gt;PackageDescription&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kd&#34;&gt;let&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;package&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Package&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;MyLlamaPackage&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;targets&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;executableTarget&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;MyLlamaPackage&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;dependencies&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                &lt;span class=&#34;s&#34;&gt;&amp;#34;LlamaFramework&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;p&#34;&gt;]),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;binaryTarget&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;LlamaFramework&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;checksum&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The above example uses an intermediate build &lt;code&gt;b5046&lt;/code&gt; of the library. It can be modified
to use a different version by changing the URL and checksum.&lt;/p&gt;
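&lt;p&gt;When switching to a different release, the checksum Swift Package Manager expects can be computed locally; a sketch using the artifact from the example above:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# download the release artifact, then print the checksum for Package.swift
curl -L -O https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip
swift package compute-checksum llama-b5046-xcframework.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;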
&lt;h2 id=&#34;completions&#34;&gt;Completions
&lt;/h2&gt;&lt;p&gt;Command-line completion is available for some environments.&lt;/p&gt;
&lt;h4 id=&#34;bash-completion&#34;&gt;Bash Completion
&lt;/h4&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ build/bin/llama-cli --completion-bash &amp;gt; ~/.llama-completion.bash
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ &lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; ~/.llama-completion.bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Optionally this can be added to your &lt;code&gt;.bashrc&lt;/code&gt; or &lt;code&gt;.bash_profile&lt;/code&gt; to load it
automatically. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;gp&#34;&gt;$&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;source ~/.llama-completion.bash&amp;#34;&lt;/span&gt; &amp;gt;&amp;gt; ~/.bashrc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;dependencies&#34;&gt;Dependencies
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/yhirose/cpp-httplib&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;yhirose/cpp-httplib&lt;/a&gt; - Single-header HTTP server, used by &lt;code&gt;llama-server&lt;/code&gt; - MIT license&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/nothings/stb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;stb-image&lt;/a&gt; - Single-header image format decoder, used by multimodal subsystem - Public domain&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/nlohmann/json&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;nlohmann/json&lt;/a&gt; - Single-header JSON library, used by various tools/examples - MIT License&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/google/minja&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;minja&lt;/a&gt; - Minimal Jinja parser in C++, used by various tools/examples - MIT License&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;./tools/run/linenoise.cpp/linenoise.cpp&#34; &gt;linenoise.cpp&lt;/a&gt; - C++ library that provides readline-like line editing capabilities, used by &lt;code&gt;llama-run&lt;/code&gt; - BSD 2-Clause License&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://curl.se/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;curl&lt;/a&gt; - Client-side URL transfer library, used by various tools/examples - &lt;a class=&#34;link&#34; href=&#34;https://curl.se/docs/copyright.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CURL License&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/mackron/miniaudio&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;miniaudio.h&lt;/a&gt; - Single-header audio format decoder, used by multimodal subsystem - Public domain&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/transformers/perplexity&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://huggingface.co/docs/transformers/perplexity&lt;/a&gt;&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/containers/ramalama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RamaLama&lt;/a&gt;&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
        </item>
        <item>
        <title>audiblez</title>
        <link>https://producthunt.programnotes.cn/en/p/audiblez/</link>
        <pubDate>Fri, 29 Aug 2025 15:27:48 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/audiblez/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1688870559348-bfbad318db1f?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTY0NTI0MTZ8&amp;ixlib=rb-4.1.0" alt="Featured image of post audiblez" /&gt;&lt;h1 id=&#34;santinicaudiblez&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/santinic/audiblez&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;santinic/audiblez&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;audiblez-generate--audiobooks-from-e-books&#34;&gt;Audiblez: Generate  audiobooks from e-books
&lt;/h1&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/santinic/audiblez/actions/workflows/pip-install.yaml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/santinic/audiblez/actions/workflows/pip-install.yaml/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Installing via pip and running&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/santinic/audiblez/actions/workflows/git-clone-and-run.yml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/santinic/audiblez/actions/workflows/git-clone-and-run.yml/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Git clone and run&#34;
	
	
&gt;&lt;/a&gt;
&lt;img src=&#34;https://img.shields.io/pypi/pyversions/audiblez&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PyPI - Python Version&#34;
	
	
&gt;
&lt;img src=&#34;https://img.shields.io/pypi/v/audiblez&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PyPI - Version&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;v4-now-with-graphical-interface-cuda-support-and-many-languages&#34;&gt;v4 Now with Graphical interface, CUDA support, and many languages!
&lt;/h3&gt;&lt;p&gt;Audiblez generates &lt;code&gt;.m4b&lt;/code&gt; audiobooks from regular &lt;code&gt;.epub&lt;/code&gt; e-books,
using Kokoro&amp;rsquo;s high-quality speech synthesis.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hexgrad/Kokoro-82M&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kokoro-82M&lt;/a&gt; is a recently published text-to-speech model with just 82M parameters and very natural-sounding output.
It&amp;rsquo;s released under the Apache licence and was trained on &amp;lt; 100 hours of audio.
It currently supports these languages: 🇺🇸 🇬🇧 🇪🇸 🇫🇷 🇮🇳 🇮🇹 🇯🇵 🇧🇷 🇨🇳&lt;/p&gt;
&lt;p&gt;On a Google Colab T4 GPU via Cuda, &lt;strong&gt;it takes about 5 minutes to convert &amp;ldquo;Animal Farm&amp;rdquo; by Orwell&lt;/strong&gt; (which is about 160,000 characters) to an audiobook, at a rate of about 600 characters per second.&lt;/p&gt;
&lt;p&gt;On my M2 MacBook Pro, on CPU, it takes about 1 hour, at a rate of about 60 characters per second.&lt;/p&gt;
&lt;h2 id=&#34;how-to-install-the-command-line-tool&#34;&gt;How to install the Command Line tool
&lt;/h2&gt;&lt;p&gt;If you have Python 3 on your computer, you can install audiblez with pip.
You also need &lt;code&gt;espeak-ng&lt;/code&gt; and &lt;code&gt;ffmpeg&lt;/code&gt; installed on your machine:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install ffmpeg espeak-ng                   &lt;span class=&#34;c1&#34;&gt;# on Ubuntu/Debian 🐧&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install audiblez
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;brew install ffmpeg espeak-ng                       &lt;span class=&#34;c1&#34;&gt;# on Mac 🍏&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install audiblez
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then you can convert an .epub directly with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;audiblez book.epub -v af_sky
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;It will first create a bunch of &lt;code&gt;book_chapter_1.wav&lt;/code&gt;, &lt;code&gt;book_chapter_2.wav&lt;/code&gt;, etc. files in the same directory,
and at the end it will produce a &lt;code&gt;book.m4b&lt;/code&gt; file with the whole book, which you can listen to with VLC or any
audiobook player.
It will only produce the &lt;code&gt;.m4b&lt;/code&gt; file if you have &lt;code&gt;ffmpeg&lt;/code&gt; installed on your machine.&lt;/p&gt;
&lt;h2 id=&#34;how-to-run-the-gui&#34;&gt;How to run the GUI
&lt;/h2&gt;&lt;p&gt;The GUI is a simple graphical interface for using audiblez.
You need some extra dependencies to run it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install ffmpeg espeak-ng 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install libgtk-3-dev        # just for Ubuntu/Debian 🐧, Windows/Mac don&amp;#39;t need this
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install audiblez pillow wxpython
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then you can run the GUI with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;audiblez-ui
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-run-on-windows&#34;&gt;How to run on Windows
&lt;/h2&gt;&lt;p&gt;After many trials, on Windows we recommend installing audiblez in a Python venv:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open a Windows terminal&lt;/li&gt;
&lt;li&gt;Create a new folder: &lt;code&gt;mkdir audiblez&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Enter the folder: &lt;code&gt;cd audiblez&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Create a venv: &lt;code&gt;python -m venv venv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Activate the venv: &lt;code&gt;.\venv\Scripts\Activate.ps1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Install the dependencies: &lt;code&gt;pip install audiblez pillow wxpython&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Now you can run &lt;code&gt;audiblez&lt;/code&gt; or &lt;code&gt;audiblez-ui&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;For Cuda support, you need to install Pytorch accordingly: &lt;a class=&#34;link&#34; href=&#34;https://pytorch.org/get-started/locally/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://pytorch.org/get-started/locally/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;speed&#34;&gt;Speed
&lt;/h2&gt;&lt;p&gt;By default the audio is generated at normal speed, but you can make it up to twice as slow or as fast by specifying a speed argument between 0.5 and 2.0:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;audiblez book.epub -v af_sky -s 1.5
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;supported-voices&#34;&gt;Supported Voices
&lt;/h2&gt;&lt;p&gt;Use the &lt;code&gt;-v&lt;/code&gt; option to specify the voice to use. Available voices are listed in the table below.
The first letter is the language code and the second is the gender of the speaker, e.g. &lt;code&gt;im_nicola&lt;/code&gt; is an Italian male voice.&lt;/p&gt;
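&lt;p&gt;For example, to narrate a book with an Italian male voice:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;audiblez book.epub -v im_nicola
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;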
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://claudio.uk/posts/audiblez-v4.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;For hearing samples of Kokoro-82M voices, go here&lt;/a&gt;&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Language&lt;/th&gt;
          &lt;th&gt;Voices&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;🇺🇸 American English&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;af_alloy&lt;/code&gt;, &lt;code&gt;af_aoede&lt;/code&gt;, &lt;code&gt;af_bella&lt;/code&gt;, &lt;code&gt;af_heart&lt;/code&gt;, &lt;code&gt;af_jessica&lt;/code&gt;, &lt;code&gt;af_kore&lt;/code&gt;, &lt;code&gt;af_nicole&lt;/code&gt;, &lt;code&gt;af_nova&lt;/code&gt;, &lt;code&gt;af_river&lt;/code&gt;, &lt;code&gt;af_sarah&lt;/code&gt;, &lt;code&gt;af_sky&lt;/code&gt;, &lt;code&gt;am_adam&lt;/code&gt;, &lt;code&gt;am_echo&lt;/code&gt;, &lt;code&gt;am_eric&lt;/code&gt;, &lt;code&gt;am_fenrir&lt;/code&gt;, &lt;code&gt;am_liam&lt;/code&gt;, &lt;code&gt;am_michael&lt;/code&gt;, &lt;code&gt;am_onyx&lt;/code&gt;, &lt;code&gt;am_puck&lt;/code&gt;, &lt;code&gt;am_santa&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇬🇧 British English&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;bf_alice&lt;/code&gt;, &lt;code&gt;bf_emma&lt;/code&gt;, &lt;code&gt;bf_isabella&lt;/code&gt;, &lt;code&gt;bf_lily&lt;/code&gt;, &lt;code&gt;bm_daniel&lt;/code&gt;, &lt;code&gt;bm_fable&lt;/code&gt;, &lt;code&gt;bm_george&lt;/code&gt;, &lt;code&gt;bm_lewis&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇪🇸 Spanish&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;ef_dora&lt;/code&gt;, &lt;code&gt;em_alex&lt;/code&gt;, &lt;code&gt;em_santa&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇫🇷 French&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;ff_siwis&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇮🇳 Hindi&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;hf_alpha&lt;/code&gt;, &lt;code&gt;hf_beta&lt;/code&gt;, &lt;code&gt;hm_omega&lt;/code&gt;, &lt;code&gt;hm_psi&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇮🇹 Italian&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;if_sara&lt;/code&gt;, &lt;code&gt;im_nicola&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇯🇵 Japanese&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;jf_alpha&lt;/code&gt;, &lt;code&gt;jf_gongitsune&lt;/code&gt;, &lt;code&gt;jf_nezumi&lt;/code&gt;, &lt;code&gt;jf_tebukuro&lt;/code&gt;, &lt;code&gt;jm_kumo&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇧🇷 Brazilian Portuguese&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pf_dora&lt;/code&gt;, &lt;code&gt;pm_alex&lt;/code&gt;, &lt;code&gt;pm_santa&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;🇨🇳 Mandarin Chinese&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;zf_xiaobei&lt;/code&gt;, &lt;code&gt;zf_xiaoni&lt;/code&gt;, &lt;code&gt;zf_xiaoxiao&lt;/code&gt;, &lt;code&gt;zf_xiaoyi&lt;/code&gt;, &lt;code&gt;zm_yunjian&lt;/code&gt;, &lt;code&gt;zm_yunxi&lt;/code&gt;, &lt;code&gt;zm_yunxia&lt;/code&gt;, &lt;code&gt;zm_yunyang&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For more details about voice quality, check this document: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kokoro-82M voices&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;how-to-run-on-gpu&#34;&gt;How to run on GPU
&lt;/h2&gt;&lt;p&gt;By default, audiblez runs on CPU. If you pass the option &lt;code&gt;--cuda&lt;/code&gt;, it will try to use the Cuda device via Torch.&lt;/p&gt;
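&lt;p&gt;For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;audiblez book.epub -v af_sky --cuda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;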
&lt;p&gt;Check out this example: &lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/drive/164PQLowogprWQpRjKk33e-8IORAvqXKI?usp=sharing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Audiblez running on a Google Colab Notebook with Cuda&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We don&amp;rsquo;t currently support Apple Silicon, as there is not yet a Kokoro implementation in MLX. As soon as one is available, we will support it.&lt;/p&gt;
&lt;h2 id=&#34;manually-pick-chapters-to-convert&#34;&gt;Manually pick chapters to convert
&lt;/h2&gt;&lt;p&gt;Sometimes you want to manually select which chapters/sections in the e-book to read out loud.
To do so, you can use &lt;code&gt;--pick&lt;/code&gt; to interactively choose the chapters to convert (without running the GUI).&lt;/p&gt;
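&lt;p&gt;For instance:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;audiblez book.epub -v af_sky --pick
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;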
&lt;h2 id=&#34;help-page&#34;&gt;Help page
&lt;/h2&gt;&lt;p&gt;For all the options available, you can check the help page &lt;code&gt;audiblez --help&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;usage: audiblez [-h] [-v VOICE] [-p] [-s SPEED] [-c] [-o FOLDER] epub_file_path
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;positional arguments:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  epub_file_path        Path to the epub file
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;options:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -h, --help            show this help message and exit
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v VOICE, --voice VOICE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                        Choose narrating voice: a, b, e, f, h, i, j, p, z
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p, --pick            Interactively select which chapters to read in the audiobook
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -s SPEED, --speed SPEED
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                        Set speed from 0.5 to 2.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -c, --cuda            Use GPU via Cuda in Torch if available
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -o FOLDER, --output FOLDER
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                        Output folder for the audiobook and temporary files
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;example:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  audiblez book.epub -v af_sky
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;to use the GUI, run:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  audiblez-ui
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;author&#34;&gt;Author
&lt;/h2&gt;&lt;p&gt;by &lt;a class=&#34;link&#34; href=&#34;https://claudio.uk&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Claudio Santini&lt;/a&gt; in 2025, distributed under MIT licence.&lt;/p&gt;
&lt;p&gt;Related Article: &lt;a class=&#34;link&#34; href=&#34;https://claudio.uk/posts/audiblez-v4.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Audiblez v4: Generate Audiobooks from E-books&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>cutlass</title>
        <link>https://producthunt.programnotes.cn/en/p/cutlass/</link>
        <pubDate>Wed, 16 Jul 2025 15:32:12 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/cutlass/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1622441083964-daabbe5ef047?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTI2NTExMDJ8&amp;ixlib=rb-4.1.0" alt="Featured image of post cutlass" /&gt;&lt;h1 id=&#34;nvidiacutlass&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cutlass&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA/cutlass&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;overview&#34;&gt;Overview
&lt;/h1&gt;&lt;h1 id=&#34;cutlass-410&#34;&gt;CUTLASS 4.1.0
&lt;/h1&gt;&lt;p&gt;&lt;em&gt;CUTLASS 4.1.0 - July 2025&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
and related computations at all levels and scales within CUDA. It incorporates strategies for
hierarchical decomposition and data movement. CUTLASS decomposes these &amp;ldquo;moving parts&amp;rdquo; into reusable, modular
software components and abstractions.&lt;/p&gt;
&lt;p&gt;Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned
via custom tiling sizes, data types, and other algorithmic policies. The resulting flexibility simplifies
their use as building blocks within custom kernels and applications.&lt;/p&gt;
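&lt;p&gt;As a concrete, deliberately minimal sketch of these building blocks, the snippet below instantiates the classic 2.x-style device-wide GEMM API with &lt;code&gt;float&lt;/code&gt; operands in column-major layout, in the spirit of the &lt;code&gt;00_basic_gemm&lt;/code&gt; example; the &lt;code&gt;run_gemm&lt;/code&gt; wrapper is illustrative only, and real applications would also pick tile sizes, data types, and epilogues to match their problem:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;#include &amp;lt;cutlass/gemm/device/gemm.h&amp;gt;

// Minimal sketch: a device-wide SGEMM using CUTLASS&amp;#39;s default policies.
using ColumnMajor = cutlass::layout::ColumnMajor;
using Gemm = cutlass::gemm::device::Gemm&amp;lt;float, ColumnMajor,  // A
                                         float, ColumnMajor,  // B
                                         float, ColumnMajor&amp;gt;; // C

// Illustrative wrapper (not a CUTLASS API): C = alpha * A * B + beta * C.
cutlass::Status run_gemm(int M, int N, int K, float alpha,
                         float const *A, int lda, float const *B, int ldb,
                         float beta, float *C, int ldc) {
  Gemm gemm_op;
  return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc},
                  {alpha, beta}});
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;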
&lt;p&gt;CUTLASS has been providing CUDA C++ template abstractions for high-performance linear algebra since 2017 and
these abstractions provide extensive support for a wide range of computations including
mixed-precision computations, specialized data-movement (async copy) and
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FP32 emulation via tensor core instruction&lt;/a&gt;,
8b floating point types (e5m2 and e4m3),
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
narrow integer types (4 and 8b signed and unsigned integers),
and binary 1b data types (where architectures allow for the
native support of such data types) across NVIDIA&amp;rsquo;s Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.&lt;/p&gt;
&lt;p&gt;To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.&lt;/p&gt;
&lt;p&gt;Overall, we envision CUTLASS DSLs as a family of domain-specific languages. With the release of 4.0, we are releasing the first of these: the CuTe DSL. This is a low-level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.&lt;/p&gt;
&lt;p&gt;CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations
targeting the programmable, high-throughput &lt;em&gt;Tensor Cores&lt;/em&gt; implemented by
NVIDIA&amp;rsquo;s Ampere, Hopper, and Blackwell architectures.&lt;/p&gt;
&lt;p&gt;We believe it will become an indispensable tool for students, researchers, and performance
engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel
designs, and bringing optimized solutions into production.&lt;/p&gt;
&lt;p&gt;CuTe DSL is currently in public beta and will graduate out of beta by the end of summer 2025.&lt;/p&gt;
&lt;p&gt;To get started quickly, please refer to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS C++ Quick Start Guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/pythonDSL/quick_start.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CuTe DSL Quick Start Guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;whats-new-in-cutlass-41&#34;&gt;What&amp;rsquo;s New in CUTLASS 4.1
&lt;/h1&gt;&lt;h2 id=&#34;cute-dsl&#34;&gt;CuTe DSL
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;More examples demonstrating how to use CuTe DSL to write peak-performance kernels
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mamba2_ssd/mamba2_ssd.py&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Blackwell Mamba2 SSD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;API updates
&lt;ul&gt;
&lt;li&gt;for loop
&lt;ul&gt;
&lt;li&gt;Python built-in &lt;code&gt;range&lt;/code&gt; now always generates IR and executes at runtime&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cutlass.range&lt;/code&gt; is an advanced &lt;code&gt;range&lt;/code&gt; with IR-level unrolling and pipelining control&lt;/li&gt;
&lt;li&gt;Deprecated &lt;code&gt;cutlass.range_dynamic&lt;/code&gt;; replace it with &lt;code&gt;range&lt;/code&gt; or &lt;code&gt;cutlass.range&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experimental:&lt;/strong&gt; added &lt;code&gt;pipelining&lt;/code&gt; control for compiler-generated software-pipeline code&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;while/if
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;while&lt;/code&gt;/&lt;code&gt;if&lt;/code&gt; now by default generates IR and executes at runtime unless &lt;code&gt;cutlass.const_expr&lt;/code&gt; is specified for the predicate&lt;/li&gt;
&lt;li&gt;Deprecated &lt;code&gt;cutlass.dynamic_expr&lt;/code&gt;; simply remove it&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Rename mbarrier functions to reduce ambiguity&lt;/li&gt;
&lt;li&gt;Modify SyncObject API (&lt;code&gt;MbarrierArray&lt;/code&gt;, &lt;code&gt;NamedBarrier&lt;/code&gt;, &lt;code&gt;TmaStoreFence&lt;/code&gt;) to match &lt;code&gt;std::barrier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Change pipeline &lt;code&gt;create&lt;/code&gt; function to take only keyword arguments, and make &lt;code&gt;barrier_storage&lt;/code&gt; optional.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cutlass-c&#34;&gt;CUTLASS C++
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Further enhance Blackwell SM100 Attention kernels in &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;example 77&lt;/a&gt;.
&lt;ul&gt;
&lt;li&gt;Add variable sequence length support for the FMHA backward kernel.&lt;/li&gt;
&lt;li&gt;Add varlen test support to the backward runner.&lt;/li&gt;
&lt;li&gt;Support empty batch sequences.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;subbyte_iterator&lt;/code&gt; with &lt;code&gt;cute::recast_ptr&lt;/code&gt; when constructing logical iterators/arrays.&lt;/li&gt;
&lt;li&gt;CuTe changes:
&lt;ul&gt;
&lt;li&gt;Rewrite ArithTuple and ScaledBasis for robustness and clarity.&lt;/li&gt;
&lt;li&gt;Remove buggy and kludgy &lt;code&gt;get_layoutA|B|C_MN&lt;/code&gt; and friends from Atoms/TiledX.&lt;/li&gt;
&lt;li&gt;Factor out &lt;code&gt;print_latex&lt;/code&gt; and friends and rewrite.&lt;/li&gt;
&lt;li&gt;Factor out &lt;code&gt;print_svg&lt;/code&gt; and friends and rewrite.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Support Blackwell SM100 SIMT FFMA2 kernels.&lt;/li&gt;
&lt;li&gt;Support residual add for implicit gemm kernels.&lt;/li&gt;
&lt;li&gt;Various fixes for the CUTLASS C++ Python interface&amp;rsquo;s EVT tracer:
&lt;ul&gt;
&lt;li&gt;Add a verifier for SM90 that reports invalid input.&lt;/li&gt;
&lt;li&gt;When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.&lt;/li&gt;
&lt;li&gt;Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.&lt;/li&gt;
&lt;li&gt;Instead of raising a NotImplemented error, fall back to packing all nodes into a single topological visitor node.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Fix profiler bugs in exhaustive perf search.
&lt;ul&gt;
&lt;li&gt;Fix incorrect cluster shape output issue when doing exhaustive search.&lt;/li&gt;
&lt;li&gt;Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note: CUTLASS 4.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;See the &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/CHANGELOG.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CHANGELOG&lt;/a&gt; for details of all past releases and updates.&lt;/strong&gt;&lt;/p&gt;
&lt;h1 id=&#34;performance&#34;&gt;Performance
&lt;/h1&gt;&lt;p&gt;CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit nearly optimal utilization of peak theoretical throughput. The figure below
shows CUTLASS 3.8&amp;rsquo;s performance as a percentage of theoretical peak utilization
for various input and output data types when run on an NVIDIA Blackwell SM100 architecture GPU.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/media/images/cutlass-3.8-blackwell-gemm-peak-performance.svg&#34; loading=&#34;lazy&#34; alt=&#34;CUTLASS 3.8 GEMM performance as a percentage of theoretical peak on NVIDIA Blackwell SM100&#34;&gt;&lt;/p&gt;
&lt;p&gt;The two figures below show the continual CUTLASS performance improvements
on an &lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/data-center/h100/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA H100&lt;/a&gt; (NVIDIA Hopper architecture) since
CUTLASS 3.1.
CUTLASS 3.5.1 was compiled with the &lt;a class=&#34;link&#34; href=&#34;https://developer.nvidia.com/cuda-downloads&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUDA 12.5u1 Toolkit&lt;/a&gt;.
Tensor Core operations are implemented using CUDA&amp;rsquo;s
&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;mma&lt;/a&gt; and
&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;wgmma&lt;/a&gt; instructions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/media/images/cutlass-3.5.1-gemm-peak-performance.png&#34; loading=&#34;lazy&#34; alt=&#34;CUTLASS 3.5.1 GEMM peak performance on NVIDIA H100&#34;&gt;
&lt;img src=&#34;https://producthunt.programnotes.cn/media/images/cutlass-3.5.1-gemm-peak-performance-fp8.png&#34; loading=&#34;lazy&#34; alt=&#34;CUTLASS 3.5.1 FP8 GEMM peak performance on NVIDIA H100&#34;&gt;&lt;/p&gt;
&lt;h1 id=&#34;cute&#34;&gt;CuTe
&lt;/h1&gt;&lt;p&gt;CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides &lt;code&gt;Layout&lt;/code&gt; and &lt;code&gt;Tensor&lt;/code&gt; objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.&lt;/p&gt;
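&lt;p&gt;A minimal sketch (the shape, strides, and pointer below are illustrative, not taken from the CUTLASS docs) of composing a &lt;code&gt;Layout&lt;/code&gt; with a raw pointer to form a &lt;code&gt;Tensor&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;#include &amp;lt;cute/tensor.hpp&amp;gt;
using namespace cute;

// An 8x16 column-major layout: stride 1 down a column, stride 8 across a row.
auto layout = make_layout(make_shape(Int&amp;lt;8&amp;gt;{}, Int&amp;lt;16&amp;gt;{}),
                          make_stride(Int&amp;lt;1&amp;gt;{}, Int&amp;lt;8&amp;gt;{}));

float *ptr = /* an allocation of at least 8 * 16 floats */ nullptr;
auto tensor = make_tensor(ptr, layout); // packages pointer, shape, and strides

// tensor(i, j) performs the index arithmetic; no manual lda bookkeeping:
// float x = tensor(3, 5);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;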
&lt;p&gt;The core abstractions of CuTe are hierarchically multidimensional layouts
which can be composed with data arrays to represent tensors.
The representation of layouts is powerful enough to represent nearly
everything we need to implement efficient dense linear algebra.
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.&lt;/p&gt;
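&lt;p&gt;Continuing the sketch above, tiling is one such composition; &lt;code&gt;local_tile&lt;/code&gt; carves the tensor into 4x4 tiles and selects one by coordinate (again purely illustrative):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;// Partition the 8x16 tensor into 4x4 tiles and view tile (0, 0);
// the result is itself a Tensor over that sub-range.
auto tile = local_tile(tensor, Shape&amp;lt;_4, _4&amp;gt;{}, make_coord(0, 0));
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;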
&lt;p&gt;CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/cute/00_quickstart.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;dedicated documentation directory&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;compatibility&#34;&gt;Compatibility
&lt;/h1&gt;&lt;p&gt;Minimum requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture: Volta (compute capability 7.0)&lt;/li&gt;
&lt;li&gt;Compiler: Must support at least C++17&lt;/li&gt;
&lt;li&gt;CUDA Toolkit version: 11.4&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;CUTLASS requires a C++17 host compiler and
performs best when built with the &lt;a class=&#34;link&#34; href=&#34;https://developer.nvidia.com/cuda-downloads&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;CUDA 12.8 Toolkit&lt;/strong&gt;&lt;/a&gt;.
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.&lt;/p&gt;
&lt;h2 id=&#34;operating-systems&#34;&gt;Operating Systems
&lt;/h2&gt;&lt;p&gt;We have tested the following environments.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;&lt;strong&gt;Operating System&lt;/strong&gt;&lt;/th&gt;
          &lt;th&gt;&lt;strong&gt;Compiler&lt;/strong&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Ubuntu 18.04&lt;/td&gt;
          &lt;td&gt;GCC 7.5.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Ubuntu 20.04&lt;/td&gt;
          &lt;td&gt;GCC 10.3.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Ubuntu 22.04&lt;/td&gt;
          &lt;td&gt;GCC 11.2.0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC &amp;gt;= 9 is recommended.&lt;/p&gt;
&lt;p&gt;Note: CUTLASS 3.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.&lt;/p&gt;
&lt;h2 id=&#34;hardware&#34;&gt;Hardware
&lt;/h2&gt;&lt;p&gt;CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture-based NVIDIA GPUs.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/th&gt;
          &lt;th&gt;&lt;strong&gt;CUDA Compute Capability&lt;/strong&gt;&lt;/th&gt;
          &lt;th&gt;&lt;strong&gt;Minimum CUDA Toolkit Required by CUTLASS-3&lt;/strong&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA V100 Tensor Core GPU&lt;/td&gt;
          &lt;td&gt;7.0&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA TitanV&lt;/td&gt;
          &lt;td&gt;7.0&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA GeForce RTX 20x0 series&lt;/td&gt;
          &lt;td&gt;7.5&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA T4&lt;/td&gt;
          &lt;td&gt;7.5&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA A100 Tensor Core GPU&lt;/td&gt;
          &lt;td&gt;8.0&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA A10&lt;/td&gt;
          &lt;td&gt;8.6&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA GeForce RTX 30x0 series&lt;/td&gt;
          &lt;td&gt;8.6&lt;/td&gt;
          &lt;td&gt;11.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA GeForce RTX 40x0 series&lt;/td&gt;
          &lt;td&gt;8.9&lt;/td&gt;
          &lt;td&gt;11.8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA L40&lt;/td&gt;
          &lt;td&gt;8.9&lt;/td&gt;
          &lt;td&gt;11.8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA H100 Tensor Core GPU&lt;/td&gt;
          &lt;td&gt;9.0&lt;/td&gt;
          &lt;td&gt;11.8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA H200 Tensor Core GPU&lt;/td&gt;
          &lt;td&gt;9.0&lt;/td&gt;
          &lt;td&gt;11.8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA B200 Tensor Core GPU&lt;/td&gt;
          &lt;td&gt;10.0&lt;/td&gt;
          &lt;td&gt;12.8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;NVIDIA GeForce RTX 50x0 series&lt;/td&gt;
          &lt;td&gt;10.0&lt;/td&gt;
          &lt;td&gt;12.8&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;target-architecture&#34;&gt;Target Architecture
&lt;/h2&gt;&lt;p&gt;In general, PTX code generated for one target architecture can be run on future architectures
(i.e., it is forward compatible).
However, CUDA 12.0 introduced the concept of &amp;ldquo;architecture-accelerated features&amp;rdquo; whose
PTX does not have forward compatibility guarantees.
Several Hopper and Blackwell PTX instructions fall under this category of
architecture-accelerated features, and thus require an &lt;code&gt;sm_90a&lt;/code&gt; or &lt;code&gt;sm_100a&lt;/code&gt; target architecture
(note the &amp;ldquo;a&amp;rdquo; appended). For more details on this and other architecture-accelerated instructions,
please refer to the &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUDA Documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The target architecture information is passed to CUTLASS via the CMake flag
&lt;code&gt;CUTLASS_NVCC_ARCHS&lt;/code&gt;. To maximize performance on Hopper GH100,
users must build CUTLASS with &lt;code&gt;90a&lt;/code&gt; as the target architecture.
If a user accidentally builds a kernel that uses SM90a features
(e.g. Hopper Tensor Core instructions) with the plain SM90 target
(note the missing &amp;ldquo;a&amp;rdquo;) on either CUDA Toolkit 12 or 11.8,
the kernel is expected to fail with a runtime error.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake .. -DCUTLASS_NVCC_ARCHS=&amp;#34;90a&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake .. -DCUTLASS_NVCC_ARCHS=&amp;#34;100a&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
products has a different compute capability than the one underpinning
NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
compiled for Blackwell SM100 architecture with arch conditional features
(using &lt;code&gt;sm100a&lt;/code&gt;) are not compatible with RTX 50 series GPUs.&lt;/p&gt;
&lt;p&gt;Please refer to the &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/functionality.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;functionality documentation&lt;/a&gt;
for details on which kernels require which target architectures.&lt;/p&gt;
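&lt;p&gt;To make the arch-conditional split concrete, device code can be fenced at compile time; the sketch below is an illustration (not CUTLASS&amp;rsquo;s exact internal guard), keying off the feature macro that, to our understanding, nvcc defines when compiling for &lt;code&gt;sm_90a&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;// Illustration only: __CUDA_ARCH_FEAT_SM90_ALL is defined for sm_90a builds,
// so Hopper-only instructions never appear in plain sm_90 compilations.
__device__ void hopper_or_fallback() {
#if defined(__CUDA_ARCH__) &amp;amp;&amp;amp; __CUDA_ARCH__ == 900 &amp;amp;&amp;amp; \
    defined(__CUDA_ARCH_FEAT_SM90_ALL)
  // wgmma-based fast path, reachable only when built with the 90a target
#else
  // portable fallback
#endif
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;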
&lt;h1 id=&#34;documentation&#34;&gt;Documentation
&lt;/h1&gt;&lt;p&gt;CUTLASS is described in the following documents and the accompanying
&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cutlass&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Doxygen documentation&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quick Start Guide&lt;/a&gt; - basics of building and running CUTLASS&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/functionality.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Functionality&lt;/a&gt; - summarizes functionality available in CUTLASS&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/efficient_gemm.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Efficient GEMM in CUDA&lt;/a&gt; - describes how GEMM kernels may be implemented efficiently in CUDA&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/cutlass_3x_design.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS 3.x Design&lt;/a&gt; - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/gemm_api_3x.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GEMM API 3.x&lt;/a&gt; - describes the CUTLASS 3.x GEMM model and C++ template concepts&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/gemm_api.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GEMM API 2.x&lt;/a&gt; - describes the CUTLASS 2.x GEMM model and C++ template concepts&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/implicit_gemm_convolution.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Implicit GEMM Convolution&lt;/a&gt; - describes 2-D and 3-D convolution in CUTLASS&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/code_organization.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Code Organization&lt;/a&gt; - describes the organization and contents of the CUTLASS project&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/terminology.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Terminology&lt;/a&gt; - describes terms used in the code&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/programming_guidelines.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Programming Guidelines&lt;/a&gt; - guidelines for writing efficient modern CUDA C++&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/fundamental_types.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Fundamental types&lt;/a&gt; - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/layout.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Layouts&lt;/a&gt; - describes layouts of matrices and tensors in memory&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/tile_iterator_concept.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Tile Iterators&lt;/a&gt; - describes C++ concepts for iterating over tiles of matrices in memory&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/profiler.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS Profiler&lt;/a&gt; - command-line driven profiling application&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/utilities.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS Utilities&lt;/a&gt; - additional templates used to facilitate rapid development&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/dependent_kernel_launch.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dependent kernel launch&lt;/a&gt; - describes a new feature in Hopper which allows overlapping dependent
kernels in the same stream, and how it is used in CUTLASS.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;resources&#34;&gt;Resources
&lt;/h1&gt;&lt;p&gt;We have also described the structure of an efficient GEMM in our talk at the
&lt;a class=&#34;link&#34; href=&#34;http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPU Technology Conference 2018&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Accelerating Convolution with Tensor Cores in CUTLASS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS: Python API, Enhancements, and NVIDIA Hopper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;building-cutlass&#34;&gt;Building CUTLASS
&lt;/h1&gt;&lt;p&gt;CUTLASS is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS&amp;rsquo;s &lt;code&gt;include/&lt;/code&gt; directory in their include
paths.&lt;/p&gt;
&lt;p&gt;CUTLASS unit tests, examples, and utilities can be built with CMake.
The minimum version of CMake is given in the &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quickstart guide&lt;/a&gt;.
Make sure the &lt;code&gt;CUDACXX&lt;/code&gt; environment variable points to NVCC in the CUDA Toolkit installed
on your system.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ &lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;CUDACXX&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;${&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_INSTALL_PATH&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;/bin/nvcc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Create a build directory within the CUTLASS project, then run CMake. By default, CUTLASS builds kernels
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
To reduce compile time, you can specify
the architectures to build CUTLASS for by changing the CMake configuration setting
&lt;code&gt;CUTLASS_NVCC_ARCHS&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ mkdir build &lt;span class=&#34;o&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; build
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;80&lt;/span&gt;               &lt;span class=&#34;c1&#34;&gt;# compiles for NVIDIA&amp;#39;s Ampere Architecture&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;From the &lt;code&gt;build/&lt;/code&gt; directory, compile and run the CUTLASS unit tests by building the target &lt;code&gt;test_unit&lt;/code&gt; with make.&lt;/p&gt;
&lt;p&gt;The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
and they may be executed in parallel via make&amp;rsquo;s &lt;code&gt;-j&lt;/code&gt; command line argument.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make test_unit -j
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;----------&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; Global &lt;span class=&#34;nb&#34;&gt;test&lt;/span&gt; environment tear-down
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[==========]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;946&lt;/span&gt; tests from &lt;span class=&#34;m&#34;&gt;57&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;test&lt;/span&gt; cases ran. &lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;10812&lt;/span&gt; ms total&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;  PASSED  &lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;946&lt;/span&gt; tests.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;All tests should pass on supported platforms, though the exact number of tests may vary over time.&lt;/p&gt;
&lt;h1 id=&#34;project-structure&#34;&gt;Project Structure
&lt;/h1&gt;&lt;p&gt;CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cutlass&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Doxygen documentation&lt;/a&gt; provides a complete list of files, classes,
and template concepts defined in the CUTLASS project.&lt;/p&gt;
&lt;p&gt;A detailed explanation of the source code organization may be found in the
&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/code_organization.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS documentation&lt;/a&gt;, but several main components are summarized below.&lt;/p&gt;
&lt;h2 id=&#34;cutlass-template-library&#34;&gt;CUTLASS Template Library
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;35
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;36
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;37
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;38
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;39
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;include/                     # client applications should target this directory in their build&amp;#39;s include paths
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    conv/                    # code specialized for convolution
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    epilogue/                # code specialized for the epilogue of gemm/convolution
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    gemm/                    # code specialized for general matrix product computations
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    platform/                # CUDA-capable Standard Library components
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    reduction/               # bandwidth-limited reduction kernels that do not fit the &amp;#34;gemm&amp;#34; model
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    thread/                  # simt code that can be performed within a CUDA thread
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    transform/               # code specialized for layout, type, and domain transformations
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    *                        # core vocabulary types, containers, and basic numeric operations
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  cute/                      # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    algorithm/               # Definitions of core operations such as copy, gemm, and operations on cute::tuples
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    arch/                    # Bare bones PTX wrapper structs for copy and math instructions
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    atom/                    # Meta-information either link to or built from arch/ operators
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      mma_atom.hpp           # cute::Mma_Atom and cute::TiledMma
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      copy_atom.hpp          # cute::Copy_Atom and cute::TiledCopy
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      *sm*.hpp               # Arch specific meta-information for copy and math operations
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    *                        # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;cutlass-sdk-examples&#34;&gt;CUTLASS SDK Examples
&lt;/h3&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cutlass/tree/main/examples&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS SDK examples&lt;/a&gt; apply CUTLASS templates to implement basic computations.&lt;/p&gt;
&lt;h3 id=&#34;tools&#34;&gt;Tools
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-gdscript3&#34; data-lang=&#34;gdscript3&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tools&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;                   &lt;span class=&#34;c1&#34;&gt;# CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;include&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;cutlass&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;profiler&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;                  &lt;span class=&#34;c1&#34;&gt;# CUTLASS Profiler         - command-line utility for executing operations in the&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                             &lt;span class=&#34;c1&#34;&gt;#                            CUTLASS Library&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;util&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;                      &lt;span class=&#34;c1&#34;&gt;# CUTLASS Utilities        - contains numerous helper classes for&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;include&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;                 &lt;span class=&#34;c1&#34;&gt;#                            managing tensors in device memory, reference&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;cutlass&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;               &lt;span class=&#34;c1&#34;&gt;#                            implementations for GEMM, random initialization&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;util&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;                &lt;span class=&#34;c1&#34;&gt;#                            of tensors, and I/O.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;test&#34;&gt;Test
&lt;/h3&gt;&lt;p&gt;The &lt;code&gt;test/unit/&lt;/code&gt; directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.&lt;/p&gt;
&lt;p&gt;Instructions for building and running the Unit tests are described in the &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quickstart guide&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;performance-profiling&#34;&gt;Performance Profiling
&lt;/h1&gt;&lt;p&gt;The &lt;code&gt;tools/profiler/&lt;/code&gt; directory contains a command-line utility for launching each of the GEMM kernels.
It can be built as follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;building-all-gemm-and-convolution-kernels-long-build-times&#34;&gt;Building all GEMM and Convolution kernels (&lt;em&gt;long&lt;/em&gt; build times)
&lt;/h2&gt;&lt;p&gt;By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all of them, set the following CMake variable when running CMake from an empty &lt;code&gt;build/&lt;/code&gt; directory.
Beware, this results in &lt;em&gt;tens of thousands&lt;/em&gt; of kernels and long build times.
It also produces a large binary and, on some platforms, causes the linker to fail when building the library.
Therefore, it&amp;rsquo;s highly recommended to generate only a subset of kernels, as demonstrated in the sub-section below.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;90a -DCUTLASS_LIBRARY_KERNELS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;building-a-subset-of-gemm-and-convolution-kernels-reduced-build-times&#34;&gt;Building a subset of GEMM and Convolution kernels (&lt;em&gt;reduced&lt;/em&gt; build times)
&lt;/h2&gt;&lt;p&gt;To compile exactly one kernel or a small set of kernels, use a comma-delimited list of kernel names,
optionally with wildcard characters, to reduce the set of instantiated kernels. The following examples show how to build exactly one kernel
or a subset of kernels for the NVIDIA Ampere and Turing architectures:&lt;/p&gt;
&lt;h3 id=&#34;building-a-subset-tensor-core-gemm-kernels&#34;&gt;Building a subset of Tensor Core GEMM kernels
&lt;/h3&gt;&lt;p&gt;To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
use the CMake command line below:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;75;80&amp;#39;&lt;/span&gt; -DCUTLASS_LIBRARY_KERNELS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_tensorop_s*gemm_f16_*_nt_align8
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An example command line for profiling a subset of Tensor Core GEMM kernels follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ./tools/profiler/cutlass_profiler --kernels&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_tensorop_s*gemm_f16_*_nt_align8 --m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3456&lt;/span&gt; --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  Problem ID: &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Provider: CUTLASS
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   OperationKind: gemm
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Status: Success
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    Verification: ON
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     Disposition: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;reference_device: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          cuBLAS: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Arguments: --gemm_kind&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;universal --m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3456&lt;/span&gt; --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --A&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f16:column --B&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f16:row --C&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:column --alpha&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --beta&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; --split_k_slices&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --batch_count&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --op_class&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;tensorop --accum&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32 --cta_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;256&lt;/span&gt; --cta_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --cta_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;32&lt;/span&gt; --stages&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt; --warps_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;16&lt;/span&gt; --inst_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --inst_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --min_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;75&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --max_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1024&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           Bytes: &lt;span class=&#34;m&#34;&gt;118489088&lt;/span&gt;  bytes
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           FLOPs: &lt;span class=&#34;m&#34;&gt;115992428544&lt;/span&gt;  flops
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         Runtime: 1.55948  ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Memory: 70.7616 GiB/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Math: 74378.8 GFLOP/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;building-one-cuda-core-gemm-kernel&#34;&gt;Building one CUDA Core GEMM kernel
&lt;/h3&gt;&lt;p&gt;To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;75;80&amp;#39;&lt;/span&gt; -DCUTLASS_LIBRARY_KERNELS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_simt_sgemm_128x128_8x2_nn_align1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An example command line for profiling a single SGEMM CUDA kernel is as follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ./tools/profiler/cutlass_profiler --kernels&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;sgemm --m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3456&lt;/span&gt; --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;4096&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  Problem ID: &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Provider: CUTLASS
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   OperationKind: gemm
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Status: Success
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    Verification: ON
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     Disposition: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          cuBLAS: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Arguments: --m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3456&lt;/span&gt; --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt; --A&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:column --B&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:column --C&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:column --alpha&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --beta&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; --split_k_slices&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --batch_count&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --op_class&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;simt --accum&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32 --cta_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --stages&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --warps_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --min_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;50&lt;/span&gt; --max_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1024&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           Bytes: &lt;span class=&#34;m&#34;&gt;180355072&lt;/span&gt;  bytes
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           FLOPs: &lt;span class=&#34;m&#34;&gt;115992428544&lt;/span&gt;  flops
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         Runtime: 6.73655  ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Memory: 24.934 GiB/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Math: 17218.4 GFLOP/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;building-a-subset-of-tensor-core-convolution-kernels&#34;&gt;Building a subset of Tensor Core Convolution kernels
&lt;/h3&gt;&lt;p&gt;To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;75;80&amp;#39;&lt;/span&gt; -DCUTLASS_LIBRARY_KERNELS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_tensorop_s*fprop_optimized_f16
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An example command line for profiling a subset of Tensor Core convolution kernels is as follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ./tools/profiler/cutlass_profiler --kernels&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_tensorop_s*fprop_optimized_f16 --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --c&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --r&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --s&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  Problem ID: &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Provider: CUTLASS
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   OperationKind: conv2d
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Status: Success
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    Verification: ON
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     Disposition: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;reference_device: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Arguments: --conv_kind&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;fprop --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --c&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --r&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --s&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --p&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --q&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --pad_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --pad_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --stride_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --stride_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --dilation_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --dilation_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --Activation&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f16:nhwc --Filter&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f16:nhwc --Output&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:nhwc  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --conv_mode&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cross --iterator_algorithm&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;optimized --alpha&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --beta&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; --split_k_mode&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;serial --split_k_slices&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --eq_gemm_provider&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;none --op_class&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;tensorop --accum&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32 --cta_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;32&lt;/span&gt; --stages&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;5&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --warps_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;16&lt;/span&gt; --inst_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --inst_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;16&lt;/span&gt; --min_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;80&lt;/span&gt; --max_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1024&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           Bytes: &lt;span class=&#34;m&#34;&gt;1130659840&lt;/span&gt;  bytes
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           FLOPs: &lt;span class=&#34;m&#34;&gt;118482796544&lt;/span&gt;  flops
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         Runtime: 0.711496  ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Memory: 1479.99 GiB/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Math: &lt;span class=&#34;m&#34;&gt;166526&lt;/span&gt; GFLOP/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;building-one-convolution-cuda-kernel&#34;&gt;Building one Convolution CUDA kernel
&lt;/h3&gt;&lt;p&gt;To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cmake .. -DCUTLASS_NVCC_ARCHS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;75;80&amp;#39;&lt;/span&gt; -DCUTLASS_LIBRARY_KERNELS&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make cutlass_profiler -j16
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;An example command line for profiling one CUDA Core convolution kernel:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ./tools/profiler/cutlass_profiler --kernels&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --c&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --r&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --s&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  Problem ID: &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Provider: CUTLASS
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   OperationKind: conv2d
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Status: Success
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    Verification: ON
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     Disposition: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;reference_device: Passed
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       Arguments: --conv_kind&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;fprop --n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --c&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --r&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --s&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt; --p&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --q&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;224&lt;/span&gt; --pad_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --pad_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --stride_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --stride_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --dilation_h&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --dilation_w&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --Activation&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:nhwc --Filter&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:nhwc --Output&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32:nhwc  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --conv_mode&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;cross --iterator_algorithm&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;optimized --alpha&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --beta&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; --split_k_mode&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;serial --split_k_slices&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --eq_gemm_provider&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;none --op_class&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;simt --accum&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;f32 --cta_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;128&lt;/span&gt; --cta_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; --stages&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;  &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                  --warps_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; --warps_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_m&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_n&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --inst_k&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; --min_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;50&lt;/span&gt; --max_cc&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1024&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           Bytes: &lt;span class=&#34;m&#34;&gt;2055798784&lt;/span&gt;  bytes
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;           FLOPs: &lt;span class=&#34;m&#34;&gt;118482796544&lt;/span&gt;  flops
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         Runtime: 7.34266  ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          Memory: 260.752 GiB/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Math: 16136.2 GFLOP/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;=============================&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;more-details-on-compiling-cutlass-kernels-and-cutlass-profiler&#34;&gt;More Details on Compiling CUTLASS Kernels and the CUTLASS Profiler
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Follow these links for more CMake examples of selectively compiling CUTLASS kernels:
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html#gemm-cmake-examples&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GEMM CMake Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/quickstart.html#convolution-cmake-examples&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Implicit GEMM convolution CMake Examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cutlass/media/docs/cpp/profiler.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Further details about the CUTLASS Profiler are described here&lt;/a&gt; (a usage sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
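&lt;p&gt;As a convenience, the profiler can also sweep problem sizes and write results to a CSV file. A minimal sketch; the option spellings follow the CUTLASS Profiler documentation linked above, so verify them against your build:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# Sweep M over a range (start:end:increment) and save a CSV report
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 \
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --m=1024:4096:1024 --n=4096 --k=4096 \
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --output=report.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;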
&lt;h1 id=&#34;about&#34;&gt;About
&lt;/h1&gt;&lt;p&gt;CUTLASS is released by NVIDIA Corporation as Open Source software under the
&lt;a class=&#34;link&#34; href=&#34;LICENSE.txt&#34; &gt;3-clause &amp;ldquo;New&amp;rdquo; BSD license&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;contributors&#34;&gt;Contributors
&lt;/h1&gt;&lt;p&gt;The official list of CUTLASS developers and contributors is available here: &lt;a class=&#34;link&#34; href=&#34;CONTRIBUTORS.md&#34; &gt;CONTRIBUTORS&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;copyright&#34;&gt;Copyright
&lt;/h1&gt;&lt;p&gt;Copyright (c) 2017 - 2025 NVIDIA CORPORATION &amp;amp; AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  Redistribution and use in source and binary forms, with or without
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  modification, are permitted provided that the following conditions are met:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  1. Redistributions of source code must retain the above copyright notice, this
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  list of conditions and the following disclaimer.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  2. Redistributions in binary form must reproduce the above copyright notice,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  this list of conditions and the following disclaimer in the documentation
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  and/or other materials provided with the distribution.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  3. Neither the name of the copyright holder nor the names of its
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  contributors may be used to endorse or promote products derived from
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  this software without specific prior written permission.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS &amp;#34;AS IS&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>ZLUDA</title>
        <link>https://producthunt.programnotes.cn/en/p/zluda/</link>
        <pubDate>Mon, 07 Jul 2025 15:31:03 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/zluda/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1736452221254-ae8d76bf3c79?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTE4NzM0MjZ8&amp;ixlib=rb-4.1.0" alt="Featured image of post ZLUDA" /&gt;&lt;h1 id=&#34;vosenzluda&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/vosen/ZLUDA&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vosen/ZLUDA&lt;/a&gt;
&lt;/h1&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://discord.gg/sg6BNzXuc7&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&amp;amp;logo=discord&amp;amp;logoColor=white&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Discord&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&#34;zluda&#34;&gt;ZLUDA
&lt;/h1&gt;&lt;p&gt;ZLUDA is a drop-in replacement for CUDA on non-NVIDIA GPUs. It allows you to run unmodified CUDA applications on non-NVIDIA GPUs with near-native performance.&lt;/p&gt;
&lt;p&gt;ZLUDA supports AMD Radeon RX 5000 series and newer GPUs (both desktop and integrated).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/geekbench.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GeekBench 5.5.1 chart&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;ZLUDA is a work in progress. Follow its development here and say hi on &lt;a class=&#34;link&#34; href=&#34;https://discord.gg/sg6BNzXuc7&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Discord&lt;/a&gt;. For more details see the announcement: &lt;a class=&#34;link&#34; href=&#34;https://vosen.github.io/ZLUDA/blog/zludas-third-life/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://vosen.github.io/ZLUDA/blog/zludas-third-life/&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;usage&#34;&gt;Usage
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: This version of ZLUDA is under heavy development (more &lt;a class=&#34;link&#34; href=&#34;https://vosen.github.io/ZLUDA/blog/zludas-third-life/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;) and currently supports only Geekbench. ZLUDA probably will not work with your application just yet.&lt;/p&gt;
&lt;h3 id=&#34;windows&#34;&gt;Windows
&lt;/h3&gt;&lt;p&gt;You should have a recent AMD GPU driver (&amp;ldquo;AMD Software: Adrenalin Edition&amp;rdquo;) installed.&lt;br&gt;
To run your application, you should either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(Recommended approach) Copy the ZLUDA-provided &lt;code&gt;nvcuda.dll&lt;/code&gt; and &lt;code&gt;nvml.dll&lt;/code&gt; from &lt;code&gt;target\release&lt;/code&gt; (if built from sources) or &lt;code&gt;zluda&lt;/code&gt; (if you downloaded a zip package) into a path your application uses to load CUDA. Paths vary from application to application, but usually it&amp;rsquo;s the directory where the .exe file is located (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Use the ZLUDA launcher as shown below. The launcher is known to be buggy and incomplete:
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&amp;lt;ZLUDA_DIRECTORY&amp;gt;\zluda_with.exe -- &amp;lt;APPLICATION&amp;gt; &amp;lt;APPLICATIONS_ARGUMENTS&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
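&lt;p&gt;For the recommended approach, the copy step boils down to something like the following sketch, using the Windows &lt;code&gt;copy&lt;/code&gt; command; &lt;code&gt;&amp;lt;APPLICATION_DIRECTORY&amp;gt;&lt;/code&gt; is a placeholder for wherever your application loads CUDA from:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;:: both &amp;lt;...&amp;gt; paths below are placeholders, not real directories
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;copy &amp;lt;ZLUDA_DIRECTORY&amp;gt;\nvcuda.dll &amp;lt;APPLICATION_DIRECTORY&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;copy &amp;lt;ZLUDA_DIRECTORY&amp;gt;\nvml.dll &amp;lt;APPLICATION_DIRECTORY&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;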
&lt;h3 id=&#34;linux&#34;&gt;Linux
&lt;/h3&gt;&lt;p&gt;Run your application like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;LD_LIBRARY_PATH=&amp;lt;ZLUDA_DIRECTORY&amp;gt; &amp;lt;APPLICATION&amp;gt; &amp;lt;APPLICATIONS_ARGUMENTS&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;where &lt;code&gt;&amp;lt;ZLUDA_DIRECTORY&amp;gt;&lt;/code&gt; is the directory that contains the ZLUDA-provided &lt;code&gt;libcuda.so&lt;/code&gt;: &lt;code&gt;target/release&lt;/code&gt; if you built from sources, or &lt;code&gt;zluda&lt;/code&gt; if you downloaded the prebuilt package.&lt;/p&gt;
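&lt;p&gt;For example, assuming you built from sources and your binary is called &lt;code&gt;my_app&lt;/code&gt; (a hypothetical name), the invocation becomes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# my_app is a hypothetical stand-in for your CUDA application
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;LD_LIBRARY_PATH=target/release ./my_app &amp;lt;APPLICATIONS_ARGUMENTS&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;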
&lt;h3 id=&#34;macos&#34;&gt;MacOS
&lt;/h3&gt;&lt;p&gt;Not supported&lt;/p&gt;
&lt;h2 id=&#34;building&#34;&gt;Building
&lt;/h2&gt;&lt;h3 id=&#34;dependencies&#34;&gt;Dependencies
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Git&lt;/li&gt;
&lt;li&gt;CMake&lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;Rust compiler (recent version)&lt;/li&gt;
&lt;li&gt;C++ compiler&lt;/li&gt;
&lt;li&gt;(Optional, but recommended) &lt;a class=&#34;link&#34; href=&#34;https://ninja-build.org/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ninja build system&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;build-steps&#34;&gt;Build steps
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Git clone the repo (make sure to use &lt;code&gt;--recursive&lt;/code&gt; option to fetch submodules):&lt;br&gt;
&lt;code&gt;git clone --recursive https://github.com/vosen/ZLUDA.git&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Enter the freshly cloned &lt;code&gt;ZLUDA&lt;/code&gt; directory and build with cargo (this takes a while; the combined sequence is shown after this list):&lt;br&gt;
&lt;code&gt;cargo xtask --release&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
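&lt;p&gt;Put together, the build boils down to the following (a sketch combining the two steps above):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone --recursive https://github.com/vosen/ZLUDA.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cd ZLUDA
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cargo xtask --release
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;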
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;p&gt;The ZLUDA project has commercial backing and &lt;em&gt;does not&lt;/em&gt; accept donations.
It does accept pull requests and other non-monetary contributions.&lt;/p&gt;
&lt;p&gt;If you want to contribute a code fix or a documentation update, feel free to open a Pull Request.&lt;/p&gt;
&lt;h3 id=&#34;getting-started&#34;&gt;Getting started
&lt;/h3&gt;&lt;p&gt;There&amp;rsquo;s no architecture document (yet). The two most important crates in ZLUDA are &lt;code&gt;ptx&lt;/code&gt; (the PTX compiler) and &lt;code&gt;zluda&lt;/code&gt; (the AMD GPU runtime). A good starting point for tinkering with the project is to run one of the &lt;code&gt;ptx&lt;/code&gt; unit tests under a debugger and understand what it does. &lt;code&gt;cargo test -p ptx -- ::add_hip&lt;/code&gt; is a simple test that adds two numbers.&lt;/p&gt;
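&lt;p&gt;One minimal way to get that test under a debugger, as a sketch assuming a standard Rust toolchain (the exact test binary name varies per build, so the hash below is a placeholder):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# build the ptx test binary without running it, then run one test under gdb
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cargo test -p ptx --no-run
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rust-gdb --args target/debug/deps/ptx-&amp;lt;hash&amp;gt; ::add_hip
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;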
&lt;p&gt;GitHub issues tagged with &lt;a class=&#34;link&#34; href=&#34;https://github.com/vosen/ZLUDA/issues?q=is%3Aissue&amp;#43;is%3Aopen&amp;#43;label%3A%22help&amp;#43;wanted%22&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&amp;ldquo;help wanted&amp;rdquo;&lt;/a&gt; are tasks that are self-contained. Their level of difficulty varies and they are not always good beginner tasks, but they are defined unambiguously.&lt;/p&gt;
&lt;p&gt;If you have questions feel free to ask on &lt;a class=&#34;link&#34; href=&#34;https://discord.com/channels/1273316903783497778/1303329281409159270&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;#devtalk channel on Discord&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;p&gt;This software is dual-licensed under either the Apache 2.0 license or the MIT license. See &lt;a class=&#34;link&#34; href=&#34;LICENSE-APACHE&#34; &gt;LICENSE-APACHE&lt;/a&gt; or &lt;a class=&#34;link&#34; href=&#34;LICENSE-MIT&#34; &gt;LICENSE-MIT&lt;/a&gt; for details.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>less_slow.cpp</title>
        <link>https://producthunt.programnotes.cn/en/p/less_slow.cpp/</link>
        <pubDate>Mon, 21 Apr 2025 15:29:19 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/less_slow.cpp/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1485871981521-5b1fd3805eee?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NDUyMjA0NjF8&amp;ixlib=rb-4.0.3" alt="Featured image of post less_slow.cpp" /&gt;&lt;h1 id=&#34;ashvardanianless_&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ashvardanian/less_slow.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ashvardanian/less_slow.cpp&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;playing-around-less-slow-coding-practices-for-c-cuda-and-assembly-code&#34;&gt;Playing Around with &lt;em&gt;Less Slow&lt;/em&gt; Coding Practices for C++, CUDA, and Assembly Code
&lt;/h1&gt;&lt;blockquote&gt;
&lt;p&gt;The benchmarks in this repository don&amp;rsquo;t aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design.
It also provides an example of using some non-&lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Standard_Template_Library&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;STL&lt;/a&gt; but de facto standard libraries in C++, importing them via CMake and compiling from source.
For higher-level abstractions and languages, check out &lt;a class=&#34;link&#34; href=&#34;https://github.com/ashvardanian/less_slow.rs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;less_slow.rs&lt;/code&gt;&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/ashvardanian/less_slow.py&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;less_slow.py&lt;/code&gt;&lt;/a&gt;.
I needed many of these measurements to reconsider my own coding habits, but hopefully they&amp;rsquo;re helpful to others as well.
Most of the code is organized in very long, ordered, and nested &lt;code&gt;#pragma&lt;/code&gt; sections — not necessarily the preferred form for everyone.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Much of modern code suffers from common pitfalls — bugs, security vulnerabilities, and &lt;strong&gt;performance bottlenecks&lt;/strong&gt;.
University curricula and coding bootcamps tend to stick to traditional coding styles and standard features, rarely exposing the more fun, unusual, and potentially efficient design opportunities.
This repository explores just that.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/ashvardanian/ashvardanian/blob/master/repositories/less_slow.cpp.jpg?raw=true&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Less Slow C&amp;#43;&amp;#43;&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;The code leverages C++20 and CUDA features and is designed primarily for GCC, Clang, and NVCC compilers on Linux, though it may work on other platforms.
The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism.
Some of the highlights include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;100x cheaper random inputs?!&lt;/strong&gt; Discover how input generation sometimes costs more than the algorithm.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1% error in trigonometry at 1/40 cost:&lt;/strong&gt; Approximate STL functions like &lt;a class=&#34;link&#34; href=&#34;https://en.cppreference.com/w/cpp/numeric/math/sin&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;std::sin&lt;/code&gt;&lt;/a&gt; in just 3 lines of code (a sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;4x faster lazy-logic&lt;/strong&gt; with custom &lt;a class=&#34;link&#34; href=&#34;https://en.cppreference.com/w/cpp/ranges&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;std::ranges&lt;/code&gt;&lt;/a&gt; and iterators!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compiler optimizations beyond &lt;code&gt;-O3&lt;/code&gt;:&lt;/strong&gt; Learn about less obvious flags and techniques for another 2x speedup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiplying matrices?&lt;/strong&gt; Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling AI?&lt;/strong&gt; Measure the gap between theoretical &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Arithmetic_logic_unit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ALU&lt;/a&gt; throughput and your &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BLAS&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How many if conditions are too many?&lt;/strong&gt; Test your CPU&amp;rsquo;s branch predictor with just 10 lines of code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer recursion to iteration?&lt;/strong&gt; Measure the depth at which your algorithm will &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Segmentation_fault&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;SEGFAULT&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why avoid exceptions?&lt;/strong&gt; Take &lt;code&gt;std::error_code&lt;/code&gt; or &lt;a class=&#34;link&#34; href=&#34;https://en.cppreference.com/w/cpp/utility/variant&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;std::variant&lt;/code&gt;&lt;/a&gt;-like wrappers?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling to many cores?&lt;/strong&gt; Learn how to use &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/OpenMP&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenMP&lt;/a&gt;, Intel&amp;rsquo;s oneTBB, or your custom thread pool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to handle &lt;a class=&#34;link&#34; href=&#34;https://www.json.org/json-en.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;JSON&lt;/a&gt; while avoiding memory allocations?&lt;/strong&gt; Is it easier with C++20 or old-school C99 tools?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to properly use STL&amp;rsquo;s associative containers&lt;/strong&gt; with custom keys and transparent comparators?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to beat a hand-written parser&lt;/strong&gt; with &lt;a class=&#34;link&#34; href=&#34;https://en.cppreference.com/w/cpp/language/consteval&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;consteval&lt;/code&gt;&lt;/a&gt; RegEx engines?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is the pointer size really 64 bits&lt;/strong&gt; and how to exploit &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Tagged_pointer&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;pointer-tagging&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How many packets is &lt;a class=&#34;link&#34; href=&#34;https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;UDP&lt;/a&gt; dropping&lt;/strong&gt; and how to serve web requests in &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Io_uring&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;io_uring&lt;/code&gt;&lt;/a&gt; from user-space?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scatter and Gather&lt;/strong&gt; for 50% faster vectorized disjoint memory operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intel&amp;rsquo;s oneAPI vs Nvidia&amp;rsquo;s CCCL?&lt;/strong&gt; What&amp;rsquo;s so special about &lt;code&gt;&amp;lt;thrust&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;cub&amp;gt;&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CUDA C++, &lt;a class=&#34;link&#34; href=&#34;https://en.wikipedia.org/wiki/Parallel_Thread_Execution&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PTX&lt;/a&gt; Intermediate Representation, and SASS&lt;/strong&gt;: how do they differ from CPU code?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to choose between intrinsics, inline &lt;code&gt;asm&lt;/code&gt;, and separate &lt;code&gt;.S&lt;/code&gt; files&lt;/strong&gt; for your performance-critical code?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tensor Cores &amp;amp; Memory&lt;/strong&gt; differences on CPUs, and Volta, Ampere, Hopper, and Blackwell GPUs!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How coding for FPGAs differs from GPUs&lt;/strong&gt; and what are High-Level Synthesis, Verilog, and VHDL? 🔜 #36&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What are Encrypted Enclaves&lt;/strong&gt; and what&amp;rsquo;s the latency of Intel SGX, AMD SEV, and ARM Realm? 🔜 #31&lt;/li&gt;
&lt;/ul&gt;
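&lt;p&gt;As a hedged illustration of the trigonometry bullet above, here is a minimal 5th-order Taylor sketch, accurate to roughly 1% on [-π/2, π/2]. It captures the spirit of the bullet, not the exact kernel from &lt;code&gt;less_slow.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;#include &amp;lt;cmath&amp;gt;
#include &amp;lt;cstdio&amp;gt;

// Illustrative only: x - x^3/6 + x^5/120, factored Horner-style.
inline float approx_sin(float x) {
    float const x2 = x * x;
    return x * (1.f - x2 / 6.f * (1.f - x2 / 20.f));
}

int main() {
    // Compare against the libm reference at a few points:
    for (float x = -1.5f; x &amp;lt;= 1.5f; x += 0.5f)
        std::printf(&amp;#34;x=%+.2f  std=%+.5f  approx=%+.5f\n&amp;#34;, x, std::sin(x), approx_sin(x));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;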
&lt;p&gt;To read, jump to the &lt;a class=&#34;link&#34; href=&#34;https://github.com/ashvardanian/less_slow.cpp/blob/main/less_slow.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;less_slow.cpp&lt;/code&gt; source file&lt;/a&gt; and read the code snippets and comments.
Keep in mind that most modern IDEs have a navigation bar to help you view and jump between &lt;code&gt;#pragma region&lt;/code&gt; sections.
Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.&lt;/p&gt;
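&lt;p&gt;For orientation, here is a hypothetical skeleton of those nested sections; the region names are invented for illustration, not copied from the source:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;// Hypothetical layout sketch of a #pragma-structured source file.
#pragma region Numerics
#pragma region Trigonometry
// ... benchmarks and commentary live here ...
#pragma endregion // Trigonometry
#pragma endregion // Numerics
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;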
&lt;h2 id=&#34;running-the-benchmarks&#34;&gt;Running the Benchmarks
&lt;/h2&gt;&lt;p&gt;The project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, macOS, and Windows.
That said, to cover the broadest functionality, using GCC on Linux is recommended:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you are on Windows, it&amp;rsquo;s recommended that you set up a Linux environment using &lt;a class=&#34;link&#34; href=&#34;https://docs.microsoft.com/en-us/windows/wsl/install&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WSL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you are on macOS, consider using the non-native distribution of Clang from &lt;a class=&#34;link&#34; href=&#34;https://brew.sh&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Homebrew&lt;/a&gt; or &lt;a class=&#34;link&#34; href=&#34;https://www.macports.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MacPorts&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/ashvardanian/less_slow.cpp.git &lt;span class=&#34;c1&#34;&gt;# Clone the repository&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; less_slow.cpp                                            &lt;span class=&#34;c1&#34;&gt;# Change the directory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install cmake --upgrade                                 &lt;span class=&#34;c1&#34;&gt;# PyPI has a newer version of CMake&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get install -y build-essential g++                 &lt;span class=&#34;c1&#34;&gt;# Install default build tools&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get install -y pkg-config liburing-dev             &lt;span class=&#34;c1&#34;&gt;# Install liburing for kernel-bypass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get install -y libopenblas-base                    &lt;span class=&#34;c1&#34;&gt;# Install numerics libraries&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -B build_release -D &lt;span class=&#34;nv&#34;&gt;CMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;Release          &lt;span class=&#34;c1&#34;&gt;# Generate the build files&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake --build build_release --config Release                &lt;span class=&#34;c1&#34;&gt;# Build the project&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;build_release/less_slow                                     &lt;span class=&#34;c1&#34;&gt;# Run the benchmarks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The build will pull and compile several third-party dependencies from source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/google/benchmark&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Benchmark&lt;/a&gt; is used for profiling.&lt;/li&gt;
&lt;li&gt;Intel&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/uxlfoundation/oneTBB&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;oneTBB&lt;/a&gt; is used as the Parallel STL backend.&lt;/li&gt;
&lt;li&gt;Meta&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookexperimental/libunifex&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;libunifex&lt;/a&gt; is used for senders &amp;amp; executors.&lt;/li&gt;
&lt;li&gt;Eric Niebler&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/ericniebler/range-v3&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;range-v3&lt;/a&gt; replaces &lt;code&gt;std::ranges&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Victor Zverovich&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/fmtlib/fmt&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;fmt&lt;/a&gt; replaces &lt;code&gt;std::format&lt;/code&gt; (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Ash Vardanian&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/ashvardanian/stringzilla&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StringZilla&lt;/a&gt; replaces &lt;code&gt;std::string&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Hana Dusíková&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/hanickadot/compile-time-regular-expressions&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CTRE&lt;/a&gt; replaces &lt;code&gt;std::regex&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Niels Lohmann&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/nlohmann/json&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;json&lt;/a&gt; is used for JSON deserialization.&lt;/li&gt;
&lt;li&gt;Yaoyuan Guo&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/ibireme/yyjson&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;yyjson&lt;/a&gt; for faster JSON processing.&lt;/li&gt;
&lt;li&gt;Google&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/abseil/abseil-cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Abseil&lt;/a&gt; replaces STL&amp;rsquo;s associative containers.&lt;/li&gt;
&lt;li&gt;Lewis Baker&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/lewissbaker/cppcoro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cppcoro&lt;/a&gt; implements C++20 coroutines.&lt;/li&gt;
&lt;li&gt;Jens Axboe&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/axboe/liburing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;liburing&lt;/a&gt; to simplify Linux kernel-bypass.&lt;/li&gt;
&lt;li&gt;Chris Kohlhoff&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/chriskohlhoff/asio&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ASIO&lt;/a&gt; as a &lt;a class=&#34;link&#34; href=&#34;https://en.cppreference.com/w/cpp/experimental/networking&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;networking TS&lt;/a&gt; extension.&lt;/li&gt;
&lt;li&gt;Nvidia&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/nvidia/cccl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CCCL&lt;/a&gt; for GPU-accelerated algorithms.&lt;/li&gt;
&lt;li&gt;Nvidia&amp;rsquo;s &lt;a class=&#34;link&#34; href=&#34;https://github.com/nvidia/cutlass&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CUTLASS&lt;/a&gt; for GPU-accelerated Linear Algebra.&lt;/li&gt;
&lt;/ul&gt;
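&lt;p&gt;To make the &amp;ldquo;replaces&amp;rdquo; pattern concrete, here is a minimal, hedged sketch of swapping &lt;code&gt;fmt&lt;/code&gt; in for &lt;code&gt;std::format&lt;/code&gt;; the printed values are invented:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;#include &amp;lt;fmt/format.h&amp;gt; // pulled in via CMake; mirrors the &amp;lt;format&amp;gt; API

int main() {
    // Call sites written against std::format port to fmt almost verbatim:
    fmt::print(&amp;#34;GEMM of {}x{}x{} took {:.2f} ms\n&amp;#34;, 4, 4, 4, 0.35);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;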
&lt;p&gt;To build without Parallel STL, Intel TBB, and CUDA:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -B build_release -D &lt;span class=&#34;nv&#34;&gt;CMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;Release -D &lt;span class=&#34;nv&#34;&gt;USE_INTEL_TBB&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;OFF -D &lt;span class=&#34;nv&#34;&gt;USE_NVIDIA_CCCL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;OFF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake --build build_release --config Release
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To control the output or run specific benchmarks, use the following flags:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;build_release/less_slow --benchmark_format&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;json             &lt;span class=&#34;c1&#34;&gt;# Output in JSON format&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;build_release/less_slow --benchmark_out&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;results.json        &lt;span class=&#34;c1&#34;&gt;# Save the results to a file instead of `stdout`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;build_release/less_slow --benchmark_filter&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;std_sort         &lt;span class=&#34;c1&#34;&gt;# Run only benchmarks containing `std_sort` in their name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To enhance stability and reproducibility, disable Simultaneous Multi-Threading &lt;strong&gt;(SMT)&lt;/strong&gt; on your CPU and use the &lt;code&gt;--benchmark_enable_random_interleaving=true&lt;/code&gt; flag, which shuffles and interleaves benchmarks as described &lt;a class=&#34;link&#34; href=&#34;https://github.com/google/benchmark/blob/main/docs/random_interleaving.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;build_release/less_slow --benchmark_enable_random_interleaving&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Google Benchmark supports &lt;a class=&#34;link&#34; href=&#34;https://github.com/google/benchmark/blob/main/docs/perf_counters.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;User-Requested Performance Counters&lt;/a&gt; through &lt;code&gt;libpfm&lt;/code&gt;.
Note that collecting these may require &lt;code&gt;sudo&lt;/code&gt; privileges.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo build_release/less_slow --benchmark_enable_random_interleaving&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;true&lt;/span&gt; --benchmark_format&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;json --benchmark_perf_counters&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;CYCLES,INSTRUCTIONS&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Alternatively, use the Linux &lt;code&gt;perf&lt;/code&gt; tool for performance counter collection:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;true&lt;/span&gt; --benchmark_filter&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;super_sort
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;project-structure&#34;&gt;Project Structure
&lt;/h2&gt;&lt;p&gt;The primary file of this repository is clearly the &lt;code&gt;less_slow.cpp&lt;/code&gt; C++ file with CPU-side code.
Several other files cover different hardware-specific optimizations:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sh&#34; data-lang=&#34;sh&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ tree .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── CMakeLists.txt          &lt;span class=&#34;c1&#34;&gt;# Build &amp;amp; assembly instructions for all files&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── less_slow.cpp           &lt;span class=&#34;c1&#34;&gt;# Primary CPU-side benchmarking code with the majority of examples&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── less_slow_amd64.S       &lt;span class=&#34;c1&#34;&gt;# Hand-written Assembly kernels for 64-bit x86 CPUs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── less_slow_aarch64.S     &lt;span class=&#34;c1&#34;&gt;# Hand-written Assembly kernels for 64-bit Arm CPUs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── less_slow.cu            &lt;span class=&#34;c1&#34;&gt;# CUDA C++ examples for parallel algorithms for Nvidia GPUs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── less_slow_sm70.ptx      &lt;span class=&#34;c1&#34;&gt;# Hand-written PTX IR kernels for Nvidia Volta GPUs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;└── less_slow_sm90a.ptx     &lt;span class=&#34;c1&#34;&gt;# Hand-written PTX IR kernels for Nvidia Hopper GPUs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;memes-and-references&#34;&gt;Memes and References
&lt;/h2&gt;&lt;p&gt;Educational content without memes?!
Come on!&lt;/p&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src=&#34;https://github.com/ashvardanian/ashvardanian/blob/master/memes/ieee764-vs-gnu-compiler.jpg?raw=true&#34; alt=&#34;IEEE 754 vs GNU Compiler&#34;&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src=&#34;https://github.com/ashvardanian/ashvardanian/blob/master/memes/no-easter-bunny-no-free-abstractions.jpg?raw=true&#34; alt=&#34;No Easter Bunny, No Free Abstractions&#34;&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;h2 id=&#34;google-benchmark-functionality&#34;&gt;Google Benchmark Functionality
&lt;/h2&gt;&lt;p&gt;This benchmark suite uses most of the features provided by Google Benchmark.
If you write a lot of benchmarks and would rather not read through the full &lt;a class=&#34;link&#34; href=&#34;https://github.com/google/benchmark/blob/main/docs/user_guide.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;User Guide&lt;/a&gt;, here is a condensed list of the most useful features; a sketch combining several of them follows the list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Args({x, y})&lt;/code&gt; - Pass multiple arguments to parameterized benchmarks&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BENCHMARK()&lt;/code&gt; - Register a basic benchmark function&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BENCHMARK_CAPTURE()&lt;/code&gt; - Create variants of benchmarks with different captured values&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Counter::kAvgThreads&lt;/code&gt; - Specify thread-averaged counters&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DoNotOptimize()&lt;/code&gt; - Prevent compiler from optimizing away operations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ClobberMemory()&lt;/code&gt; - Force memory synchronization&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Complexity(oNLogN)&lt;/code&gt; - Specify and validate algorithmic complexity&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;SetComplexityN(n)&lt;/code&gt; - Set input size for complexity calculations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;ComputeStatistics(&amp;quot;max&amp;quot;, ...)&lt;/code&gt; - Calculate custom statistics across runs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Iterations(n)&lt;/code&gt; - Control exact number of iterations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;MinTime(n)&lt;/code&gt; - Set minimum benchmark duration&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;MinWarmUpTime(n)&lt;/code&gt; - To warm up the data caches&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Name(&amp;quot;...&amp;quot;)&lt;/code&gt; - Assign custom benchmark names&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Range(start, end)&lt;/code&gt; - Profile for a range of input sizes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;RangeMultiplier(n)&lt;/code&gt; - Set multiplier between range values&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;ReportAggregatesOnly()&lt;/code&gt; - Show only aggregated statistics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state.counters[&amp;quot;name&amp;quot;]&lt;/code&gt; - Create custom performance counters&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state.PauseTiming()&lt;/code&gt;, &lt;code&gt;ResumeTiming()&lt;/code&gt; - Control timing measurement&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state.SetBytesProcessed(n)&lt;/code&gt; - Record number of bytes processed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state.SkipWithError()&lt;/code&gt; - Skip benchmark with error message&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Threads(n)&lt;/code&gt; - Run benchmark with specified number of threads&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;Unit(kMicrosecond)&lt;/code&gt; - Set time unit for reporting&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;UseRealTime()&lt;/code&gt; - Measure real time instead of CPU time&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-&amp;gt;UseManualTime()&lt;/code&gt; - To feed custom timings for GPU and IO benchmarks&lt;/li&gt;
&lt;/ul&gt;
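&lt;p&gt;A hedged sketch combining several of these features in a single registration; the benchmark name and input sizes are illustrative:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;#include &amp;lt;benchmark/benchmark.h&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;numeric&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;vector&amp;gt;

// Exercises -&amp;gt;Range, -&amp;gt;Complexity, SetComplexityN, DoNotOptimize,
// Pause/ResumeTiming, and -&amp;gt;Unit from the list above.
static void std_sort(benchmark::State &amp;amp;state) {
    std::vector&amp;lt;int&amp;gt; data(static_cast&amp;lt;std::size_t&amp;gt;(state.range(0)));
    std::iota(data.begin(), data.end(), 0);
    std::mt19937 rng{42};
    for (auto _ : state) {
        state.PauseTiming(); // exclude shuffling from the measurement
        std::shuffle(data.begin(), data.end(), rng);
        state.ResumeTiming();
        std::sort(data.begin(), data.end());
        benchmark::DoNotOptimize(data.data()); // keep the result observable
    }
    state.SetComplexityN(state.range(0));
}
BENCHMARK(std_sort)
    -&amp;gt;Range(1 &amp;lt;&amp;lt; 10, 1 &amp;lt;&amp;lt; 20)       // sweep input sizes geometrically
    -&amp;gt;Complexity(benchmark::oNLogN) // fit and validate O(n log n) scaling
    -&amp;gt;Unit(benchmark::kMicrosecond);
BENCHMARK_MAIN();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Registered this way, the resulting binary responds to the &lt;code&gt;--benchmark_filter=std_sort&lt;/code&gt; flag shown earlier.&lt;/p&gt;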
</description>
        </item>
        <item>
        <title>cuda-python</title>
        <link>https://producthunt.programnotes.cn/en/p/cuda-python/</link>
        <pubDate>Fri, 11 Apr 2025 15:27:47 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/cuda-python/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1538558940285-e76825003c99?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NDQzNTY0MTZ8&amp;ixlib=rb-4.0.3" alt="Featured image of post cuda-python" /&gt;&lt;h1 id=&#34;nvidiacuda-python&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/cuda-python&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA/cuda-python&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;cuda-python&#34;&gt;cuda-python
&lt;/h1&gt;&lt;p&gt;CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. It consists of multiple components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cuda-python/cuda-core/latest&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cuda.core&lt;/a&gt;: Pythonic access to CUDA Runtime and other core functionalities&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cuda-python/cuda-bindings/latest&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cuda.bindings&lt;/a&gt;: Low-level Python bindings to CUDA C APIs&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cccl/cuda_cooperative/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cuda.cooperative&lt;/a&gt;: A Python package providing CCCL&amp;rsquo;s reusable block-wide and warp-wide &lt;em&gt;device&lt;/em&gt; primitives for use within Numba CUDA kernels&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cccl/cuda_parallel/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cuda.parallel&lt;/a&gt;: A Python package for easy access to CCCL&amp;rsquo;s highly efficient and customizable parallel algorithms, like &lt;code&gt;sort&lt;/code&gt;, &lt;code&gt;scan&lt;/code&gt;, &lt;code&gt;reduce&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, etc., that are callable on the &lt;em&gt;host&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/numba-cuda/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;numba.cuda&lt;/a&gt;: Numba&amp;rsquo;s target for CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For access to NVIDIA CPU &amp;amp; GPU Math Libraries, please refer to &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/cuda/nvmath-python/latest&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;nvmath-python&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;CUDA Python is currently undergoing an overhaul to improve existing components and bring up new ones. All of the previously available functionality from the &lt;code&gt;cuda-python&lt;/code&gt; package will continue to be available; please refer to the &lt;a class=&#34;link&#34; href=&#34;https://nvidia.github.io/cuda-python/cuda-bindings/latest&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cuda.bindings&lt;/a&gt; documentation for the installation guide and further details.&lt;/p&gt;
&lt;h2 id=&#34;cuda-python-as-a-metapackage&#34;&gt;cuda-python as a metapackage
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;cuda-python&lt;/code&gt; is being re-structured to become a metapackage that contains a collection of subpackages. Each subpackage is versioned independently, allowing installation of each component as needed.&lt;/p&gt;
&lt;h3 id=&#34;subpackage-cudacore&#34;&gt;Subpackage: &lt;code&gt;cuda.core&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;The &lt;code&gt;cuda.core&lt;/code&gt; package offers idiomatic, pythonic access to CUDA Runtime and other functionalities.&lt;/p&gt;
&lt;p&gt;The goals are to&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Provide &lt;strong&gt;idiomatic (&amp;ldquo;pythonic&amp;rdquo;)&lt;/strong&gt; access to CUDA Driver, Runtime, and JIT compiler toolchain&lt;/li&gt;
&lt;li&gt;Focus on &lt;strong&gt;developer productivity&lt;/strong&gt; by ensuring end-to-end CUDA development can be performed quickly and entirely in Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid homegrown&lt;/strong&gt; Python abstractions for CUDA, so that new Python GPU libraries don&amp;rsquo;t have to start from scratch&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ease&lt;/strong&gt; developer &lt;strong&gt;burden of maintaining&lt;/strong&gt; and catching up with latest CUDA features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flatten the learning curve&lt;/strong&gt; for current and future generations of CUDA developers&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;subpackage-cudabindings&#34;&gt;Subpackage: &lt;code&gt;cuda.bindings&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;The &lt;code&gt;cuda.bindings&lt;/code&gt; package is a standard set of low-level interfaces, providing full coverage of and access to the CUDA host APIs from Python.&lt;/p&gt;
&lt;p&gt;The available interfaces are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CUDA Driver&lt;/li&gt;
&lt;li&gt;CUDA Runtime&lt;/li&gt;
&lt;li&gt;NVRTC&lt;/li&gt;
&lt;li&gt;nvJitLink&lt;/li&gt;
&lt;li&gt;NVVM&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
