ai-dynamo/dynamo
| Roadmap | Support Matrix | Docs | Recipes | Examples | Prebuilt Containers | Design Proposals | Blogs
NVIDIA Dynamo
High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Why Dynamo
Large language models exceed single-GPU capacity. Tensor parallelism shards each layer across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides:
- Disaggregated Prefill & Decode – Maximizes GPU throughput with flexible latency/throughput trade-offs
- Dynamic GPU Scheduling – Optimizes performance based on fluctuating demand
- LLM-Aware Request Routing – Eliminates unnecessary KV cache re-computation
- Accelerated Data Transfer – Reduces inference response time using NIXL
- KV Cache Offloading – Leverages multiple memory hierarchies for higher throughput
Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.
Framework Support Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | ✅ | 🚧 | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Full Feature Matrix → Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
Latest News
- [12/05] Moonshot AI’s Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [12/01] InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference
- [11/20] Dell integrates PowerScale with Dynamo’s NIXL for 19x faster TTFT
- [11/20] WEKA partners with NVIDIA on KV cache storage for Dynamo
- [11/13] Dynamo Office Hours Playlist
- [10/16] How Baseten achieved 2x faster inference with NVIDIA Dynamo
Get Started
| Path | Use Case | Time | Requirements |
|---|---|---|---|
| Local Quick Start | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| Kubernetes Deployment | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
Contributing
Want to help shape the future of distributed LLM inference? We welcome contributors at all levels, from doc fixes to new features.
- Contributing Guide – How to get started
- Report a Bug – Found an issue?
- Feature Request – Have an idea?
Local Quick Start
The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU. See docs/reference/support-matrix.md.
1. Initial Setup
The Dynamo team recommends the uv Python package manager, although any Python package manager works. Install uv:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Install Python Development Headers
Backend engines require Python development headers for JIT compilation. Install them with:
```bash
sudo apt-get install -y python3-dev
```
2. Select an Engine
We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
```bash
uv venv
source .venv/bin/activate
uv pip install "ai-dynamo[sglang]"
```
3. Run Dynamo
Sanity Check (Optional)
Before trying out Dynamo, you can verify your system configuration and dependencies. The sanity check quickly inspects system resources, development tools, LLM frameworks, and Dynamo components; see the docs for the exact command.
Running an LLM API Server
Dynamo provides a simple way to spin up a local set of inference components including:
- OpenAI-Compatible Frontend – High-performance OpenAI-compatible HTTP API server written in Rust.
- Basic and KV-Aware Router – Routes and load-balances traffic to a set of workers.
- Workers – A set of pre-configured LLM serving engines.
```bash
# Start the OpenAI-compatible frontend (HTTP server, pre-processor, router)
python3 -m dynamo.frontend &

# Start an SGLang worker; the model and flags here are examples
# (see python3 -m dynamo.sglang --help for your engine's options)
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```
Note: vLLM workers publish KV cache events by default, which requires NATS. For dependency-free local development with vLLM, add
--kv-events-config '{"enable_kv_cache_events": false}'. This keeps local prefix caching enabled while disabling event publishing. See Service Discovery and Messaging for details.
Send a Request
```bash
# The port must match the frontend's HTTP port
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Tell me a short joke."}],
    "stream": false,
    "max_tokens": 50
  }'
```
Rerun with curl -N and change stream in the request to true to get the responses as soon as the engine issues them.
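When streaming, responses arrive as server-sent events. A minimal sketch of extracting the text deltas, assuming the standard OpenAI chunk format (the sample payload below is illustrative, not captured from Dynamo):

```python
import json

def parse_sse_chunk(line: str):
    """Parse one OpenAI-style SSE line; return the text delta it carries, if any."""
    if not line.startswith("data: "):
        return None  # comments, keep-alives, blank lines
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None  # stream terminator
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Example event in the shape a streaming frontend emits
event = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_chunk(event))  # -> Hello
```

In practice you would feed each line from the HTTP response body (e.g. `iter_lines()` in `requests`) through a parser like this.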
What’s Next?
- Scale up: Deploy on Kubernetes with Recipes
- Add features: Enable KV-aware routing, disaggregated serving
- Benchmark: Use AIPerf to measure performance
- Try other engines: vLLM, SGLang, TensorRT-LLM
Kubernetes Deployment
For production deployments on Kubernetes clusters with multiple GPUs.
Prerequisites
- Kubernetes cluster with GPU nodes
- Dynamo Platform installed
- HuggingFace token for model downloads
Production Recipes
Pre-built deployment configurations for common models and topologies:
| Model | Framework | Mode | GPUs | Recipe |
|---|---|---|---|---|
| Llama-3.1-70B | vLLM | Aggregated | 4x H100 | View |
| DeepSeek-R1 | SGLang | Disaggregated | 8x H200 | View |
| Qwen3-32B | TensorRT-LLM | Disaggregated | 8x GPU | View |
See recipes/README.md for the full list and deployment instructions.
Cloud Deployment Guides
Concepts
Engines
Dynamo is inference engine agnostic. Install the wheel for your chosen engine and run with python3 -m dynamo.<engine> --help.
| Engine | Install | Docs | Best For |
|---|---|---|---|
| vLLM | uv pip install ai-dynamo[vllm] | Guide | Broadest feature coverage |
| SGLang | uv pip install ai-dynamo[sglang] | Guide | High-throughput serving |
| TensorRT-LLM | pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo[trtllm] | Guide | Maximum performance |
Note: TensorRT-LLM requires pip (not uv) due to URL-based dependencies. See the TRT-LLM guide for container setup and prerequisites.
Use CUDA_VISIBLE_DEVICES to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
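For example, to pin a worker to specific GPUs (the engine module here is illustrative; substitute your own):

```shell
# Restrict the worker process to GPUs 0 and 1; inside the process
# CUDA renumbers them as devices 0 and 1
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.sglang --help
```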
Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
| Local Development | ❌ Not required | ❌ Not required | Pass --store-kv file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| KV-Aware Routing | ❌ Not required | ✅ Required | Prefix caching (enabled by default) requires NATS |
For local development without external dependencies, pass --store-kv file (avoids etcd) to both the frontend and workers. vLLM users should also pass --kv-events-config '{"enable_kv_cache_events": false}' to disable KV event publishing (avoids NATS) while keeping local prefix caching enabled; SGLang and TRT-LLM don’t require this flag.
For distributed non-Kubernetes deployments or KV-aware routing, run etcd and NATS. To quickly set up both: docker compose -f deploy/docker-compose.yml up -d
Advanced Topics
Benchmarking
Dynamo provides comprehensive benchmarking tools:
- Benchmarking Guide – Compare deployment topologies using AIPerf
- SLA-Driven Deployments – Optimize deployments to meet SLA requirements
Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at /openapi.json while running. The spec can also be generated without running the server (see the frontend docs for the command); the output is written to docs/frontends/openapi.json.
Building from Source
For contributors who want to build Dynamo from source rather than installing from PyPI.
1. Install Libraries
Ubuntu:
```bash
# Package list may vary by release; see docs/reference/support-matrix.md
sudo apt-get install -y build-essential libclang-dev protobuf-compiler python3-dev cmake
```
macOS:
```bash
# Install the Xcode command line tools and build dependencies
xcode-select --install
brew install cmake protobuf
```
```bash
# Verify the Metal compiler is available
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like metal: error: no input files, which confirms it is installed correctly.
2. Install Rust
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```
3. Create a Python Virtual Environment
If you don't have uv installed, follow the uv installation guide. Once uv is installed, create a virtual environment and activate it.
- Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- Create a virtual environment
```bash
uv venv
source .venv/bin/activate
```
4. Install Build Tools
```bash
uv pip install maturin
```
Maturin is the Rust<->Python bindings build tool.
5. Build the Rust Bindings
```bash
# Build and install the bindings into the active virtual environment
cd lib/bindings/python
maturin develop --uv
```
6. Install GPU Memory Service
The GPU Memory Service is a Python package with a C++ extension. It requires only Python development headers and a C++ compiler (g++).
7. Install the Wheel
```bash
# From the repository root (editable install; exact package path may differ)
uv pip install -e .
```
You should now be able to run python3 -m dynamo.frontend.
For local development, pass --store-kv file to avoid external dependencies (see Service Discovery and Messaging section).
Set the environment variable DYN_LOG to adjust the logging level; for example, export DYN_LOG=debug. It has the same syntax as RUST_LOG.
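For instance, assuming the same target-filter syntax as RUST_LOG:

```shell
export DYN_LOG=debug                      # everything at debug level
export DYN_LOG=info,dynamo_runtime=trace  # per-target override (target name illustrative)
```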
If you use VS Code or Cursor, we have a .devcontainer folder built on Microsoft's Dev Containers extension. See its README for instructions.