OpenBMB/VoxCPM
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
Join our community for discussion and support: Feishu | Discord
VoxCPM is a tokenizer-free Text-to-Speech system that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization to achieve highly natural and expressive synthesis.
VoxCPM2 is the latest major release: a 2B-parameter model trained on over 2 million hours of multilingual speech data. It supports 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output, and is built on a MiniCPM-4 backbone.
Highlights
- 30-Language Multilingual: Input text in any of the 30 supported languages and synthesize directly, with no language tag needed
- Voice Design: Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace, ...), with no reference audio required
- Controllable Cloning: Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
- Ultimate Cloning: Provide both the reference audio and its transcript, and the model continues seamlessly from the reference, preserving timbre, rhythm, emotion, and style (same as VoxCPM1.5)
- 48kHz High-Quality Audio: Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design with built-in super-resolution, so no external upsampler is needed
- Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from text content
- Real-Time Streaming: Real-time factor (RTF) as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM
- Fully Open-Source & Commercial-Ready: Weights and code released under the Apache-2.0 license, free for commercial use
Supported languages (30): Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan (河南话), Shaanxi (陕西话), Shandong (山东话), Tianjin (天津话), Minnan (闽南话)
News
- [2026.04] We release VoxCPM2: 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! Weights | Docs | Playground
- [2025.12] Open-source VoxCPM1.5 weights with SFT & LoRA fine-tuning. (#1 GitHub Trending)
- [2025.09] Release VoxCPM Technical Report.
- [2025.09] Open-source VoxCPM-0.5B weights (#1 HuggingFace Trending)
Contents
- Quick Start
- Models & Versions
- Performance
- Fine-tuning
- Documentation
- Ecosystem & Community
- Risks and Limitations
- Citation
Quick Start
Installation
Requirements: Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0. See Quick Start Docs for details.
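The Python range in the requirements line can be checked before installing. The following is a minimal standard-library sketch; the helper name `python_version_supported` is illustrative and is not part of VoxCPM.

```python
import sys

# Bounds taken from the requirements line: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if a (major, minor, ...) version tuple is in the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "outside the supported range"
    print(f"Python {current} is {status}")
```

The PyTorch and CUDA requirements are not checked here, since doing so would need those packages to be importable.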
Python API
Text-to-Speech
If you prefer downloading from ModelScope first, you can use:
Voice Design
Create a voice from a natural-language description, with no reference audio needed. Format: put the description in parentheses at the start of the text (e.g. "(your voice description)The text to synthesize.").
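The prompt format described above is plain string construction. A minimal sketch follows; `build_voice_design_text` is a hypothetical helper written for illustration, not a VoxCPM function, and the resulting string is what would be passed as the text to synthesize.

```python
def build_voice_design_text(description: str, text: str) -> str:
    """Prefix the text to synthesize with a parenthesized voice description.

    Hypothetical helper: it only builds the "(description)text" string
    and does not call the model.
    """
    description = description.strip()
    text = text.strip()
    if not description:
        raise ValueError("description must not be empty")
    if not text:
        raise ValueError("text must not be empty")
    # No separator between the closing parenthesis and the text,
    # matching the example format in the README.
    return f"({description}){text}"


if __name__ == "__main__":
    prompt = build_voice_design_text(
        "A calm, middle-aged male voice with a slow pace",
        "Welcome back. Let's pick up where we left off.",
    )
    print(prompt)
```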
Controllable Voice Cloning
Provide a reference audio clip. The model clones the timbre, and you can still use control instructions to adjust speed, emotion, or style.
Ultimate Cloning
Provide both the reference audio and its exact transcript for audio-continuation-based cloning that reproduces every vocal nuance. For maximum cloning similarity, pass the same reference clip to both reference_wav_path and prompt_wav_path.
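The argument wiring described above can be sketched as a small helper that builds keyword arguments without calling the model. `reference_wav_path` and `prompt_wav_path` are the names used in the text; `prompt_text` (for the transcript) and `text` are assumed parameter names, and `ultimate_cloning_kwargs` itself is hypothetical. Check the Usage Guide for the actual signature before relying on them.

```python
from pathlib import Path


def ultimate_cloning_kwargs(reference_clip, transcript, text):
    """Build keyword arguments for continuation-based cloning.

    Hypothetical helper: it only assembles a dict and never imports
    or calls VoxCPM.
    """
    if not transcript.strip():
        raise ValueError("the exact transcript of the reference clip is required")
    clip = str(Path(reference_clip))
    return {
        "text": text,                 # ASSUMED name for the text to synthesize
        "reference_wav_path": clip,   # named in the README text
        "prompt_wav_path": clip,      # same clip, as recommended for maximum similarity
        "prompt_text": transcript,    # ASSUMED name for the reference transcript
    }


if __name__ == "__main__":
    kwargs = ultimate_cloning_kwargs(
        "speaker.wav",
        "This is what the speaker says in the reference clip.",
        "This new sentence should sound like the same speaker.",
    )
    for name, value in kwargs.items():
        print(f"{name} = {value!r}")
```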
Streaming API
CLI Usage
Web Demo
Production Deployment (Nano-vLLM)
For high-throughput serving, use Nano-vLLM-VoxCPM, a dedicated inference engine built on Nano-vLLM with concurrent request support and an async API.
RTF as low as ~0.13 on NVIDIA RTX 4090 (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the Nano-vLLM-VoxCPM repo for deployment details.
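RTF (real-time factor) is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic only; the function names are illustrative, and actual timings depend on hardware, text length, and settings.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if synthesis_seconds < 0:
        raise ValueError("synthesis_seconds must be non-negative")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to synthesize audio of a given length at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # RTF figures quoted in this README for an RTX 4090.
    for label, rtf in [("standard PyTorch", 0.30), ("Nano-vLLM", 0.13)]:
        seconds = estimated_synthesis_seconds(60.0, rtf)
        print(f"{label}: about {seconds:.1f} s to synthesize 60 s of audio")
```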
For the full parameter reference, multi-scenario examples, and voice cloning tips, see: Quick Start Guide | Usage Guide | Cookbook
Models & Versions
| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
|---|---|---|---|
| Status | Latest | Stable | Legacy |
| Backbone Parameters | 2B | 0.6B | 0.5B |
| Audio Sample Rate | 48kHz | 44.1kHz | 16kHz |
| LM Token Rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Languages | 30 | 2 (zh, en) | 2 (zh, en) |
| Cloning Mode | Isolated Reference & Continuation | Continuation only | Continuation only |
| Voice Design | Yes | No | No |
| Controllable Voice Cloning | Yes | No | No |
| SFT / LoRA | β | β | β |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF with Nano-vLLM (RTX 4090) | ~0.13 | ~0.08 | ~0.10 |
| VRAM | ~8 GB | ~6 GB | ~5 GB |
| Weights | HF / MS | HF / MS | HF / MS |
| Technical Report | Coming soon | - | arXiv, ICLR 2026 |
| Demo Page | Audio Samples | - | Audio Samples |
VoxCPM2 is built on a tokenizer-free, diffusion autoregressive paradigm. The model operates entirely in the latent space of AudioVAE V2, following a four-stage pipeline: LocEnc → TSLM → RALM → LocDiT, enabling rich expressiveness and 48kHz native audio output.
For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the Architecture Design.
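The sample-rate and LM-token-rate rows of the comparison table imply how much audio each language-model step covers. The sketch below works through that arithmetic, assuming "LM Token Rate" means autoregressive steps per second of output audio; the dictionary and function names are illustrative.

```python
# Values copied from the Models & Versions table.
MODELS = {
    "VoxCPM2": {"sample_rate_hz": 48_000, "lm_token_rate_hz": 6.25},
    "VoxCPM1.5": {"sample_rate_hz": 44_100, "lm_token_rate_hz": 6.25},
    "VoxCPM-0.5B": {"sample_rate_hz": 16_000, "lm_token_rate_hz": 12.5},
}


def samples_per_lm_step(sample_rate_hz: float, lm_token_rate_hz: float) -> float:
    """Output audio samples covered by one LM step (under the stated assumption)."""
    return sample_rate_hz / lm_token_rate_hz


def lm_steps_for_duration(seconds: float, lm_token_rate_hz: float) -> float:
    """Number of LM steps needed for a given duration of audio."""
    return seconds * lm_token_rate_hz


if __name__ == "__main__":
    for name, spec in MODELS.items():
        per_step = samples_per_lm_step(spec["sample_rate_hz"], spec["lm_token_rate_hz"])
        steps = lm_steps_for_duration(10.0, spec["lm_token_rate_hz"])
        print(f"{name}: {per_step:.0f} samples per LM step, {steps:.1f} steps per 10 s")
```

Under this reading, halving the token rate from 12.5Hz to 6.25Hz halves the number of autoregressive steps needed for the same duration of audio.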
Performance
VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks.
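WER and CER in the benchmark tables are edit-distance error rates: the synthesized audio is transcribed by an ASR system and compared with the input text at word or character level. The sketch below shows the metric definition only. Each benchmark applies its own ASR model and text normalization, which are not reproduced here, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; typical for Chinese and Japanese."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


if __name__ == "__main__":
    print(f"WER: {100 * wer('the cat sat on the mat', 'the cat sit on mat'):.1f}%")
    print(f"CER: {100 * cer('你好世界', '你好视界'):.1f}%")
```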
Seed-TTS-eval
Seed-TTS-eval WER (↓) and SIM (↑) Results
| Model | Parameters | Open-Source | test-EN WER/% ↓ | test-EN SIM/% ↑ | test-ZH CER/% ↓ | test-ZH SIM/% ↑ | test-Hard CER/% ↓ | test-Hard SIM/% ↑ |
|---|---|---|---|---|---|---|---|---|
| MegaTTS3 | 0.5B | β | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | β | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | β | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | β | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | β | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | β | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | β | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | β | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | β | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | β | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| SparkTTS | 0.5B | β | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | β | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | β | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | β | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| Qwen3-Omni | 30B-A3B | β | 1.39 | - | 1.07 | - | - | - |
| OpenAudio-s1-mini | 0.5B | β | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | β | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | β | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | β | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| VoxCPM-0.5B | 0.6B | β | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| VoxCPM1.5 | 0.8B | β | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |
| MOSS-TTS | - | β | 1.85 | 73.4 | 1.20 | 78.8 | - | - |
| Qwen3-TTS | 1.7B | β | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |
| FishAudio S2 | 4B | β | 0.99 | - | 0.54 | - | 5.99 | - |
| LongCat-Audio-DiT | 3.5B | β | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |
| VoxCPM2 | 2B | β | 1.84 | 75.3 | 0.97 | 79.5 | 8.13 | 75.3 |
CV3-eval
CV3-eval Multilingual WER/CER (↓) Results
| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 11.96 | 9.13 | 19.7 | - | - | - | - | - |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |
| VoxCPM2 | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |
MiniMax-Multilingual-Test
Minimax-MLS-test WER (↓) Results
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 1.665 | 1.666 | - | 3.500 | 13.046 |
| Cantonese | 34.111 | 51.513 | - | 30.670 | 38.584 |
| Chinese | 2.252 | 16.026 | 0.928 | 0.730 | 1.136 |
| Czech | 3.875 | 2.108 | - | 2.840 | 24.132 |
| Dutch | 1.143 | 0.803 | - | 0.990 | 0.913 |
| English | 2.164 | 2.339 | 0.934 | 1.620 | 2.289 |
| Finnish | 4.666 | 2.964 | - | 3.330 | 2.632 |
| French | 4.099 | 5.216 | 2.858 | 3.050 | 4.534 |
| German | 1.906 | 0.572 | 1.235 | 0.550 | 0.679 |
| Greek | 2.016 | 0.991 | - | 5.740 | 2.844 |
| Hindi | 6.962 | 5.827 | - | 14.640 | 19.699 |
| Indonesian | 1.237 | 1.059 | - | 1.460 | 1.084 |
| Italian | 1.543 | 1.743 | 0.948 | 1.270 | 1.563 |
| Japanese | 3.519 | 10.646 | 3.823 | 2.760 | 4.628 |
| Korean | 1.747 | 1.865 | 1.755 | 1.180 | 1.962 |
| Polish | 1.415 | 0.766 | - | 1.260 | 1.141 |
| Portuguese | 1.877 | 1.331 | 1.526 | 1.140 | 1.938 |
| Romanian | 2.878 | 1.347 | - | 10.740 | 21.577 |
| Russian | 4.281 | 3.878 | 3.212 | 2.400 | 3.634 |
| Spanish | 1.029 | 1.084 | 1.126 | 0.910 | 1.438 |
| Thai | 2.701 | 73.936 | - | 4.230 | 2.961 |
| Turkish | 1.52 | 0.699 | - | 0.870 | 0.817 |
| Ukrainian | 1.082 | 0.997 | - | 2.300 | 6.316 |
| Vietnamese | 0.88 | 73.415 | - | 7.410 | 3.307 |
Minimax-MLS-test SIM (↑) Results
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 73.6 | 70.6 | - | 75.0 | 79.1 |
| Cantonese | 77.8 | 67.0 | - | 80.5 | 83.5 |
| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | 82.5 |
| Czech | 79.6 | 68.5 | - | 79.8 | 78.3 |
| Dutch | 73.8 | 68.0 | - | 73.0 | 80.8 |
| English | 75.6 | 61.3 | 77.5 | 79.7 | 85.4 |
| Finnish | 83.5 | 75.9 | - | 81.9 | 89.0 |
| French | 62.8 | 53.5 | 62.8 | 69.8 | 73.5 |
| German | 73.3 | 61.4 | 77.5 | 76.7 | 80.3 |
| Greek | 82.6 | 73.3 | - | 79.5 | 86.0 |
| Hindi | 81.8 | 73.0 | - | 82.1 | 85.6 |
| Indonesian | 72.9 | 66.0 | - | 76.3 | 80.0 |
| Italian | 69.9 | 57.9 | 81.7 | 74.7 | 78.0 |
| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | 82.8 |
| Korean | 77.6 | 70.0 | 79.9 | 81.7 | 83.3 |
| Polish | 80.2 | 72.9 | - | 81.9 | 88.4 |
| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | 83.7 |
| Romanian | 80.9 | 69.9 | - | 73.3 | 79.7 |
| Russian | 76.1 | 67.6 | 79.2 | 79.0 | 81.1 |
| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | 83.1 |
| Thai | 80.0 | 58.8 | - | 78.6 | 84.0 |
| Turkish | 77.9 | 59.6 | - | 83.5 | 87.1 |
| Ukrainian | 73.0 | 64.7 | - | 74.7 | 79.8 |
| Vietnamese | 74.3 | 36.9 | - | 74.0 | 80.6 |
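The SIM columns above are not defined in this README. A common convention in zero-shot TTS evaluation, assumed here, is the cosine similarity between speaker embeddings of the reference and the generated audio, scaled to a percentage. The sketch below shows only that similarity step on plain vectors; the speaker-embedding model is out of scope, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def sim_percent(reference_embedding, generated_embedding):
    """Speaker similarity on a 0-100 style scale (ASSUMED reporting convention)."""
    return 100.0 * cosine_similarity(reference_embedding, generated_embedding)


if __name__ == "__main__":
    # Toy 4-dimensional vectors standing in for real speaker embeddings.
    reference = [0.2, 0.7, 0.1, 0.5]
    generated = [0.25, 0.65, 0.05, 0.55]
    print(f"SIM: {sim_percent(reference, generated):.1f}")
```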
Internal 30-Language ASR Benchmark
We additionally run an internal multilingual intelligibility benchmark with 30 languages × 500 samples. ASR transcription is evaluated via the Gemini 3.1 Flash Lite API.
| Language | Metric | VoxCPM2 | Fish S2-Pro |
|---|---|---|---|
| ar (Arabic) | CER | 1.23% | 0.30% |
| da (Danish) | WER | 2.70% | 3.52% |
| de (German) | WER | 0.96% | 0.64% |
| el (Greek) | WER | 3.17% | 4.61% |
| en (English) | WER | 0.42% | 1.03% |
| es (Spanish) | WER | 1.33% | 0.64% |
| fi (Finnish) | WER | 2.24% | 2.80% |
| fr (French) | WER | 2.16% | 2.34% |
| he (Hebrew) | CER | 2.98% | 15.27% |
| hi (Hindi) | CER | 0.79% | 0.91% |
| id (Indonesian) | WER | 1.36% | 1.68% |
| it (Italian) | WER | 1.65% | 1.08% |
| ja (Japanese) | CER | 2.40% | 1.82% |
| km (Khmer) | CER | 2.05% | 75.15% |
| ko (Korean) | CER | 0.95% | 0.29% |
| lo (Lao) | CER | 1.90% | 87.40% |
| ms (Malay) | WER | 1.75% | 1.41% |
| my (Burmese) | CER | 1.42% | 85.27% |
| nl (Dutch) | WER | 1.25% | 1.68% |
| no (Norwegian) | WER | 2.49% | 3.76% |
| pl (Polish) | WER | 1.90% | 1.65% |
| pt (Portuguese) | WER | 1.48% | 1.49% |
| ru (Russian) | WER | 0.90% | 0.86% |
| sv (Swedish) | WER | 2.22% | 2.63% |
| sw (Swahili) | CER | 1.07% | 2.02% |
| th (Thai) | CER | 0.94% | 1.92% |
| tl (Tagalog) | WER | 2.63% | 4.00% |
| tr (Turkish) | WER | 1.65% | 1.65% |
| vi (Vietnamese) | WER | 1.56% | 5.56% |
| zh (Chinese) | CER | 0.92% | 1.02% |
| Average (30 languages) | WER/CER | 1.68% | - |
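The 30-language average for VoxCPM2 can be reproduced as an unweighted mean of the per-language error rates in the table, mixing WER and CER as the table does. A short check, with values copied from the VoxCPM2 column:

```python
# Per-language error rates (%) for VoxCPM2, copied from the table above.
VOXCPM2_ERROR_RATES = {
    "ar": 1.23, "da": 2.70, "de": 0.96, "el": 3.17, "en": 0.42, "es": 1.33,
    "fi": 2.24, "fr": 2.16, "he": 2.98, "hi": 0.79, "id": 1.36, "it": 1.65,
    "ja": 2.40, "km": 2.05, "ko": 0.95, "lo": 1.90, "ms": 1.75, "my": 1.42,
    "nl": 1.25, "no": 2.49, "pl": 1.90, "pt": 1.48, "ru": 0.90, "sv": 2.22,
    "sw": 1.07, "th": 0.94, "tl": 2.63, "tr": 1.65, "vi": 1.56, "zh": 0.92,
}


def macro_average(rates):
    """Unweighted mean over languages; every language counts equally."""
    if not rates:
        raise ValueError("rates must not be empty")
    return sum(rates.values()) / len(rates)


if __name__ == "__main__":
    average = macro_average(VOXCPM2_ERROR_RATES)
    print(f"{len(VOXCPM2_ERROR_RATES)} languages, macro-average {average:.2f}%")
```

Because each language has the same number of samples (500), the unweighted mean over languages weights every sample equally.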
InstructTTSEval
Instruction-Guided Voice Design Results
| Model | InstructTTSEval-ZH APS ↑ | InstructTTSEval-ZH DSD ↑ | InstructTTSEval-ZH RP ↑ | InstructTTSEval-EN APS ↑ | InstructTTSEval-EN DSD ↑ | InstructTTSEval-EN RP ↑ |
|---|---|---|---|---|---|---|
| Hume | - | - | - | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | - | - | - | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | - | - | - | 60.0 | 45.9 | 31.2 |
| PromptTTS | - | - | - | 64.3 | 47.2 | 31.4 |
| PromptStyle | - | - | - | 57.4 | 46.4 | 30.9 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | - | - | - |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| VoxCPM2 | 85.2 | 71.5 | 60.8 | 84.2 | 83.2 | 71.4 |
Fine-tuning
VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. With as little as 5 to 10 minutes of audio, you can adapt the model to a specific speaker, language, or domain.
For the full guide, see the Fine-tuning Guide (data preparation, configuration, training, LoRA hot-swapping, FAQ).
Documentation
Full documentation: voxcpm.readthedocs.io
| Topic | Link |
|---|---|
| Quick Start & Installation | Quick Start |
| Usage Guide & Cookbook | User Guide |
| VoxCPM Series | Models |
| Fine-tuning (SFT & LoRA) | Fine-tuning Guide |
| FAQ & Troubleshooting | FAQ |
Ecosystem & Community
| Project | Description |
|---|---|
| Nano-vLLM | High-throughput, fast GPU serving |
| VoxCPM.cpp | GGML/GGUF: CPU, CUDA, Vulkan inference |
| VoxCPM-ONNX | ONNX export for CPU inference |
| VoxCPMANE | Apple Neural Engine backend |
| voxcpm_rs | Rust re-implementation |
| ComfyUI-VoxCPM | ComfyUI node-based workflows |
| ComfyUI-VoxCPMTTS | ComfyUI TTS extension |
| TTS WebUI | Browser-based TTS extension |
See the full Ecosystem in the docs. Community projects are not officially maintained by OpenBMB. Built something cool? Open an issue or PR to add it!
Risks and Limitations
- Potential for Misuse: VoxCPM’s voice cloning can generate highly realistic synthetic speech. It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content.
- Controllable Generation Stability: Voice Design and Controllable Voice Cloning results can vary between runs, so you may need to generate 1 to 3 times to obtain the desired voice or style. We are actively working on improving controllability consistency.
- Language Coverage: VoxCPM2 officially supports 30 languages. For languages not on the list, you are welcome to test directly or try fine-tuning on your own data. We plan to expand language coverage in future releases.
- Usage: This model is released under the Apache-2.0 license. For production deployments, we recommend conducting thorough testing and safety evaluation tailored to your use case.
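The suggestion above to generate a few times until the voice or style is acceptable can be wrapped in a small retry loop. This is a generic sketch, not VoxCPM code: `generate` and `is_acceptable` are placeholder callables you would supply (for example a synthesis call and a listening check or automatic score).

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_with_retries(
    generate: Callable[[], T],
    is_acceptable: Callable[[T], bool],
    max_attempts: int = 3,
) -> Optional[T]:
    """Call generate() up to max_attempts times and return the first accepted result.

    Returns None if no attempt is accepted. Both callables are hypothetical
    placeholders supplied by the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        candidate = generate()
        if is_acceptable(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Stand-in generator that yields canned results instead of audio.
    canned = iter(["too fast", "wrong tone", "good"])
    result = generate_with_retries(lambda: next(canned), lambda r: r == "good")
    print(result)
```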
Citation
If you find VoxCPM helpful, please consider citing our work and starring the repository!
License
VoxCPM model weights and code are open-sourced under the Apache-2.0 license.
Acknowledgments
- DiTAR for the diffusion autoregressive backbone
- MiniCPM-4 for the language model foundation
- CosyVoice for the Flow Matching-based LocDiT implementation
- DAC for the Audio VAE backbone
- Our community users for trying VoxCPM, reporting issues, sharing ideas, and contributing. Your support helps the project keep getting better
ModelBest
THUHCSI