ai-dynamo/dynamo
| Roadmap | Support Matrix | Docs | Recipes | Examples | Prebuilt Containers | Design Proposals | Blogs
NVIDIA Dynamo
High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Why Dynamo
Large language models exceed single-GPU capacity. Tensor parallelism shards each layer across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides:
- Disaggregated Prefill & Decode – Maximizes GPU throughput with flexible latency/throughput trade-offs
- Dynamic GPU Scheduling – Optimizes performance based on fluctuating demand
- LLM-Aware Request Routing – Eliminates unnecessary KV cache re-computation
- Accelerated Data Transfer – Reduces inference response time using NIXL
- KV Cache Offloading – Leverages multiple memory hierarchies for higher throughput
Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.
Framework Support Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | ✅ | 🚧 | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Full Feature Matrix → Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
Latest News
- [12/05] Moonshot AI’s Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [12/01] InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference
- [11/20] Dell integrates PowerScale with Dynamo’s NIXL for 19x faster TTFT
- [11/20] WEKA partners with NVIDIA on KV cache storage for Dynamo
- [11/13] Dynamo Office Hours Playlist
- [10/16] How Baseten achieved 2x faster inference with NVIDIA Dynamo
Get Started
| Path | Use Case | Time | Requirements |
|---|---|---|---|
| Local Quick Start | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| Kubernetes Deployment | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
Contributing
Want to help shape the future of distributed LLM inference? We welcome contributors at all levels, from doc fixes to new features.
- Contributing Guide – How to get started
- Report a Bug – Found an issue?
- Feature Request – Have an idea?
Local Quick Start
The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU. See docs/reference/support-matrix.md.
1. Initial Setup
The Dynamo team recommends the uv Python package manager, although any Python package manager works. Install uv:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Install Python Development Headers
Backend engines require Python development headers for JIT compilation. Install them with:
```bash
sudo apt-get install -y python3-dev
```
2. Select an Engine
We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
```bash
uv venv
source .venv/bin/activate
uv pip install "ai-dynamo[sglang]"
```
3. Run Dynamo
Sanity Check (Optional)
Before trying out Dynamo, you can verify your system configuration and dependencies. The sanity check quickly inspects system resources, development tools, LLM frameworks, and Dynamo components; see the docs for the exact command.
Running an LLM API Server
Dynamo provides a simple way to spin up a local set of inference components including:
- OpenAI-Compatible Frontend – High-performance OpenAI-compatible HTTP API server written in Rust.
- Basic and KV-Aware Router – Routes and load-balances traffic to a set of workers.
- Workers – A set of pre-configured LLM serving engines.
```bash
# Start the OpenAI-compatible frontend (HTTP server, pre-processor, router)
python3 -m dynamo.frontend &

# Start an SGLang worker; the model and flags here are examples
# (see python3 -m dynamo.sglang --help for your engine's options)
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```
Note: vLLM workers publish KV cache events by default, which requires NATS. For dependency-free local development with vLLM, add
--kv-events-config '{"enable_kv_cache_events": false}'. This keeps local prefix caching enabled while disabling event publishing. See Service Discovery and Messaging for details.
Send a Request
```bash
# The port must match the frontend's HTTP port
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Tell me a short joke."}],
    "stream": false,
    "max_tokens": 50
  }'
```
Rerun with curl -N and change stream in the request to true to get the responses as soon as the engine issues them.
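When streaming, responses arrive as server-sent events. A minimal sketch of extracting the text deltas, assuming the standard OpenAI chunk format (the sample payload below is illustrative, not captured from Dynamo):

```python
import json

def parse_sse_chunk(line: str):
    """Parse one OpenAI-style SSE line; return the text delta it carries, if any."""
    if not line.startswith("data: "):
        return None  # comments, keep-alives, blank lines
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None  # stream terminator
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Example event in the shape a streaming frontend emits
event = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_chunk(event))  # -> Hello
```

In practice you would feed each line from the HTTP response body (e.g. `iter_lines()` in `requests`) through a parser like this.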
What’s Next?
- Scale up: Deploy on Kubernetes with Recipes
- Add features: Enable KV-aware routing, disaggregated serving
- Benchmark: Use AIPerf to measure performance
- Try other engines: vLLM, SGLang, TensorRT-LLM
Kubernetes Deployment
For production deployments on Kubernetes clusters with multiple GPUs.
Prerequisites
- Kubernetes cluster with GPU nodes
- Dynamo Platform installed
- HuggingFace token for model downloads
Production Recipes
Pre-built deployment configurations for common models and topologies:
| Model | Framework | Mode | GPUs | Recipe |
|---|---|---|---|---|
| Llama-3.1-70B | vLLM | Aggregated | 4x H100 | View |
| DeepSeek-R1 | SGLang | Disaggregated | 8x H200 | View |
| Qwen3-32B | TensorRT-LLM | Disaggregated | 8x GPU | View |
See recipes/README.md for the full list and deployment instructions.
Cloud Deployment Guides
Concepts
Engines
Dynamo is inference engine agnostic. Install the wheel for your chosen engine and run with python3 -m dynamo.<engine> --help.
| Engine | Install | Docs | Best For |
|---|---|---|---|
| vLLM | uv pip install ai-dynamo[vllm] | Guide | Broadest feature coverage |
| SGLang | uv pip install ai-dynamo[sglang] | Guide | High-throughput serving |
| TensorRT-LLM | pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo[trtllm] | Guide | Maximum performance |
Note: TensorRT-LLM requires pip (not uv) due to URL-based dependencies. See the TRT-LLM guide for container setup and prerequisites.
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
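For example, to pin a worker to specific GPUs (the engine module here is illustrative; substitute your own):

```shell
# Restrict the worker process to GPUs 0 and 1; inside the process
# CUDA renumbers them as devices 0 and 1
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.sglang --help
```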
Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
| Local Development | ❌ Not required | ❌ Not required | Pass --store-kv file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| KV-Aware Routing | ❌ Not required | ✅ Required | Prefix caching (enabled by default) requires NATS |
For local development without external dependencies, pass --store-kv file (avoids etcd) to both the frontend and workers. vLLM users should also pass --kv-events-config '{"enable_kv_cache_events": false}' to disable KV event publishing (avoids NATS) while keeping local prefix caching enabled; SGLang and TRT-LLM don’t require this flag.
For distributed non-Kubernetes deployments or KV-aware routing, run etcd and NATS. To quickly set up both: docker compose -f deploy/docker-compose.yml up -d
Advanced Topics
Benchmarking
Dynamo provides comprehensive benchmarking tools:
- Benchmarking Guide – Compare deployment topologies using AIPerf
- SLA-Driven Deployments – Optimize deployments to meet SLA requirements
Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at /openapi.json while running. The spec can also be generated without running the server (see the frontend docs for the command); the output is written to docs/frontends/openapi.json.
Building from Source
For contributors who want to build Dynamo from source rather than installing from PyPI.
1. Install Libraries
Ubuntu:
```bash
# Package list may vary by release; see docs/reference/support-matrix.md
sudo apt-get install -y build-essential libclang-dev protobuf-compiler python3-dev cmake
```
macOS:
```bash
# Install the Xcode command line tools and build dependencies
xcode-select --install
brew install cmake protobuf
```
```bash
# Verify the Metal compiler is available
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like metal: error: no input files, which confirms it is installed correctly.
2. Install Rust
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```
3. Create a Python Virtual Environment
If you don't have uv installed, follow the uv installation guide. Once uv is installed, create a virtual environment and activate it.
- Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- Create a virtual environment
```bash
uv venv
source .venv/bin/activate
```
4. Install Build Tools
```bash
uv pip install maturin
```
Maturin is the Rust<->Python bindings build tool.
5. Build the Rust Bindings
```bash
# Build and install the bindings into the active virtual environment
cd lib/bindings/python
maturin develop --uv
```
6. Install GPU Memory Service
The GPU Memory Service is a Python package with a C++ extension. It requires only Python development headers and a C++ compiler (g++).
7. Install the Wheel
```bash
# From the repository root (editable install; exact package path may differ)
uv pip install -e .
```
You should now be able to run python3 -m dynamo.frontend.
For local development, pass --store-kv file to avoid external dependencies (see Service Discovery and Messaging section).
Set the environment variable DYN_LOG to adjust the logging level; for example, export DYN_LOG=debug. It has the same syntax as RUST_LOG.
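For instance, assuming the same target-filter syntax as RUST_LOG:

```shell
export DYN_LOG=debug                      # everything at debug level
export DYN_LOG=info,dynamo_runtime=trace  # per-target override (target name illustrative)
```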
If you use VS Code or Cursor, we have a .devcontainer folder built on Microsoft's Dev Containers extension. See its README for instructions.