OpenBMB/VoxCPM
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
VoxCPM1.5 Model Weights
Contact us on WeChat
News
- [2025.12.05] We open-sourced the VoxCPM1.5 weights! The model now supports both full-parameter fine-tuning and efficient LoRA fine-tuning, so you can build your own tailored version. See Release Notes for details.
- [2025.09.30] We released the VoxCPM Technical Report!
- [2025.09.16] We open-sourced the VoxCPM-0.5B weights!
- [2025.09.16] We provide a Gradio Playground for VoxCPM-0.5B. Try it now!
Overview
VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.
Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and finite scalar quantization (FSQ) constraints, greatly enhancing both expressiveness and generation stability.
Key Features
- Context-Aware, Expressive Speech Generation - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. It adapts its speaking style to the content, a capability learned from a 1.8 million-hour bilingual training corpus.
- True-to-Life Voice Cloning - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker’s timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.
- High-Efficiency Synthesis - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, which makes real-time applications feasible.
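RTF is the ratio of the time spent synthesizing to the duration of the audio produced, so any value below 1 means the model generates speech faster than it plays back. The sketch below only illustrates that arithmetic; the function name and the timing numbers are hypothetical and are not measurements of VoxCPM.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = compute time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timing: 1.7 s of compute to produce 10 s of speech.
rtf = real_time_factor(1.7, 10.0)
print(f"RTF = {rtf:.2f}")
print(f"Speed-up over real time: {1 / rtf:.1f}x")
```

In practice you would time the synthesis call yourself (for example with time.perf_counter) and divide by the number of samples generated over the sampling rate.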
Model Versions
See Release Notes for details.
- VoxCPM1.5 (Latest):
  - Model Params: 800M
  - Sampling rate of AudioVAE: 44100 Hz
  - Token rate in LM Backbone: 6.25 Hz (patch-size=4)
  - RTF on a single NVIDIA RTX 4090 GPU: ~0.15
- VoxCPM-0.5B (Original):
  - Model Params: 640M
  - Sampling rate of AudioVAE: 16000 Hz
  - Token rate in LM Backbone: 12.5 Hz (patch-size=2)
  - RTF on a single NVIDIA RTX 4090 GPU: 0.17
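The figures listed above can be related with a little arithmetic. The sketch below assumes that patch-size is the number of AudioVAE latent frames packed into one LM token; that reading is an assumption, not something stated here. Under it, both versions come out at the same latent frame rate, and the lower token rate of VoxCPM1.5 means fewer LM steps per second of speech. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionSpec:
    name: str
    sample_rate_hz: int    # AudioVAE sampling rate
    token_rate_hz: float   # tokens per second in the LM backbone
    patch_size: int        # assumed: latent frames grouped per LM token

    @property
    def latent_frame_rate_hz(self) -> float:
        # Valid only under the patch-size assumption stated above.
        return self.token_rate_hz * self.patch_size

    @property
    def samples_per_token(self) -> float:
        # Output audio samples covered by a single LM token.
        return self.sample_rate_hz / self.token_rate_hz

    def tokens_for(self, seconds: float) -> float:
        # LM tokens needed to cover `seconds` of speech.
        return self.token_rate_hz * seconds


V15 = VersionSpec("VoxCPM1.5", 44100, 6.25, 4)
V05 = VersionSpec("VoxCPM-0.5B", 16000, 12.5, 2)

for spec in (V15, V05):
    print(spec.name, spec.latent_frame_rate_hz,
          spec.samples_per_token, spec.tokens_for(10))
```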
Quick Start
Install from PyPI

    pip install voxcpm
1. Model Download (Optional)
By default, the model is downloaded automatically the first time you run the script, but you can also download it in advance.
- Download VoxCPM1.5

      # Fetch the VoxCPM1.5 weights from the Hugging Face Hub
      from huggingface_hub import snapshot_download
      snapshot_download("openbmb/VoxCPM1.5")

- Or download VoxCPM-0.5B

      # Fetch the original VoxCPM-0.5B weights from the Hugging Face Hub
      from huggingface_hub import snapshot_download
      snapshot_download("openbmb/VoxCPM-0.5B")

- Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo.

      # Fetch the two auxiliary models from ModelScope
      from modelscope import snapshot_download
      snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
      snapshot_download('iic/SenseVoiceSmall')
2. Basic Usage
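As a rough sketch of non-streaming synthesis, the helper below loads a model, generates speech for one sentence, and writes a WAV file. It assumes the voxcpm package exposes VoxCPM.from_pretrained and a generate method that accepts text, and that the sampling rate is available as model.tts_model.sample_rate; check these against the Usage Guide before relying on them. The helper name synthesize_to_wav is hypothetical, and soundfile is a third-party dependency.

```python
def synthesize_to_wav(text: str, out_path: str = "output.wav",
                      model_id: str = "openbmb/VoxCPM1.5") -> str:
    """Synthesize `text`, write it to `out_path`, and return the path."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")

    # Imported lazily so the argument check works without the packages installed.
    import soundfile as sf
    from voxcpm import VoxCPM

    # Assumed API: downloads the weights on first use if they are not cached.
    model = VoxCPM.from_pretrained(model_id)

    # Assumed API: returns the waveform as an array of samples.
    wav = model.generate(text=text)

    # Assumed attribute: sampling rate of the loaded model.
    sf.write(out_path, wav, model.tts_model.sample_rate)
    return out_path
```

Calling synthesize_to_wav("Hello from VoxCPM.") would then produce output.wav in the working directory, provided the assumptions above hold.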
3. CLI Usage
After installation, the entry point is voxcpm (or use python -m voxcpm.cli).
4. Start web demo
You can start the web UI by running python app.py, which lets you perform Voice Cloning and Voice Creation.
5. Fine-tuning
VoxCPM1.5 supports both full fine-tuning (SFT) and LoRA fine-tuning, allowing you to train personalized voice models on your own data. See the Fine-tuning Guide for detailed instructions.
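To give a feel for why LoRA is the more efficient of the two options, the sketch below shows the generic LoRA idea: the original weight W stays frozen and only two small matrices B and A are trained, giving an effective weight W + (alpha / r) * B * A. This illustrates the general technique and is not VoxCPM's training code; the layer dimensions, rank, and function names are hypothetical.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    # Full fine-tuning updates every entry of the d_out x d_in weight.
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    # LoRA trains B (d_out x rank) and A (rank x d_in) only.
    return rank * (d_out + d_in)


def matmul(a, b):
    # Plain nested-list matrix product, kept dependency-free.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w, b, a, alpha: float):
    # W_eff = W + (alpha / rank) * B @ A, with rank taken from A's row count.
    scale = alpha / len(a)
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Illustrative sizes only: a 1024 x 1024 layer adapted with rank 8.
full = full_trainable_params(1024, 1024)
lora = lora_trainable_params(1024, 1024, 8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

See the Fine-tuning Guide for the settings the project actually uses.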
Documentation
- Usage Guide - Detailed guide on how to use VoxCPM effectively, including text input modes, voice cloning tips, and parameter tuning
- Fine-tuning Guide - Complete guide for fine-tuning VoxCPM models with SFT and LoRA
- Release Notes - Version history and updates
- Performance Benchmarks - Detailed performance comparisons on public benchmarks
More Information
Community Projects
We’re excited to see the VoxCPM community growing! Here are some amazing projects and features built by our community:
- ComfyUI-VoxCPM - A VoxCPM extension for ComfyUI.
- ComfyUI-VoxCPMTTS - A VoxCPM extension for ComfyUI.
- WebUI-VoxCPM - A template extension for TTS WebUI.
- PR: Streaming API Support (by AbrahamSanders)
- VoxCPM-NanoVLLM - NanoVLLM integration for VoxCPM, for faster, high-throughput inference on GPU.
- VoxCPM-ONNX - ONNX export of VoxCPM for faster CPU inference.
- VoxCPMANE - VoxCPM TTS with an Apple Neural Engine backend server.
- PR: LoRA finetune web UI (by Ayin1412)
- voxcpm_rs - A re-implementation of VoxCPM-0.5B in Rust.
Note: These projects are not officially maintained by OpenBMB.
Have you built something cool with VoxCPM? We’d love to feature it here! Please open an issue or pull request to add your project.
Performance Highlights
VoxCPM achieves competitive results on public zero-shot TTS benchmarks. See Performance Benchmarks for detailed comparison tables.
Risks and limitations
- General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.
- Potential for Misuse of Voice Cloning: VoxCPM’s powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. This technology could be misused for creating convincing deepfakes for purposes of impersonation, fraud, or spreading disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purposes. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.
- Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. Furthermore, the current version offers limited direct control over specific speech attributes like emotion or speaking style.
- Bilingual Model: VoxCPM is trained primarily on Chinese and English data. Performance on other languages is not guaranteed and may result in unpredictable or low-quality audio.
- This model is released for research and development purposes only. We do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.
TO-DO List
Please stay tuned for updates!
- [x] Release the VoxCPM technical report.
- [x] Support higher sampling rate (44.1kHz in VoxCPM1.5).
- [x] Support SFT and LoRA fine-tuning.
- [ ] Multilingual Support (besides ZH/EN).
- [ ] Controllable Speech Generation by Human Instruction.
License
The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.
Acknowledgments
We extend our sincere gratitude to the following works and resources for their inspiration and contributions:
- DiTAR for the diffusion autoregressive backbone used in speech generation
- MiniCPM-4 for serving as the language model foundation
- CosyVoice for the implementation of Flow Matching-based LocDiT
- DAC for providing the Audio VAE backbone
Institutions
This project is developed by the following institutions:
Citation
If you find our model helpful, please consider citing our project and starring the repository.