MLX-Audio
An audio processing library built on Apple’s MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
Features
- Fast inference optimized for Apple Silicon (M-series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration
Installation
Using uv
Latest release from PyPI:
    uv tool install --force mlx-audio --prerelease=allow
Latest code from GitHub:
    uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
For development or web interface:
    git clone https://github.com/Blaizzy/mlx-audio.git
    cd mlx-audio
    pip install -e ".[dev]"
Quick Start
Command Line
    # Basic TTS generation
    mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello, world!' --lang_code a
    # With voice selection and speed adjustment
    mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --voice af_heart --speed 1.2 --lang_code a
    # Play audio immediately
    mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --play --lang_code a
    # Save to a specific directory
    mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text 'Hello!' --output_path ./my_audio --lang_code a
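When the CLI is driven from a script, the flags shown above can be assembled programmatically. The following is a minimal sketch and not part of MLX-Audio: `build_tts_command` is a hypothetical helper that uses only the flags appearing in the examples above and returns an argument list that could be passed to `subprocess.run`.

```python
import shlex


def build_tts_command(text, model="mlx-community/Kokoro-82M-bf16", lang_code="a",
                      voice=None, speed=None, play=False, output_path=None):
    """Assemble an argv list for mlx_audio.tts.generate (hypothetical helper)."""
    cmd = ["mlx_audio.tts.generate", "--model", model, "--text", text]
    if voice is not None:
        cmd += ["--voice", voice]
    if speed is not None:
        cmd += ["--speed", str(speed)]
    if play:
        cmd.append("--play")
    if output_path is not None:
        cmd += ["--output_path", output_path]
    cmd += ["--lang_code", lang_code]
    return cmd


cmd = build_tts_command("Hello!", voice="af_heart", speed=1.2)
# Print the command in shell-quoted form for inspection.
print(shlex.join(cmd))
# To actually run it (requires MLX-Audio to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when the text contains quotes or other special characters.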
Python API
    from mlx_audio.tts.utils import load_model

    # Load model
    model = load_model("mlx-community/Kokoro-82M-bf16")

    # Generate speech
    for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
        print(f"Generated {result.audio.shape[0]} samples")
        # result.audio contains the waveform as mx.array
Supported Models
Text-to-Speech (TTS)
| Model | Description | Languages | Repo |
|---|---|---|---|
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | mlx-community/Kokoro-82M-bf16 |
| Qwen3-TTS | Alibaba’s multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |
| CSM | Conversational Speech Model with voice cloning | EN | mlx-community/csm-1b |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-bf16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-0.2-500M |
| Spark | SparkTTS model | EN, ZH | mlx-community/SparkTTS-0.5B-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/Chatterbox-bf16 |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-bf16 |
Speech-to-Text (STT)
Speech-to-Speech (STS)
Model Examples
Kokoro TTS
Kokoro is a fast, multilingual TTS model with 54 voice presets.
    from mlx_audio.tts.utils import load_model

    model = load_model("mlx-community/Kokoro-82M-bf16")

    # Generate with different voices
    for result in model.generate(
        text="Welcome to MLX-Audio!",
        voice="af_heart",  # American female
        speed=1.0,
        lang_code="a",  # American English
    ):
        audio = result.audio
Available Voices:
- American English: af_heart, af_bella, af_nova, af_sky, am_adam, am_echo, etc.
- British English: bf_alice, bf_emma, bm_daniel, bm_george, etc.
- Japanese: jf_alpha, jm_kumo, etc.
- Chinese: zf_xiaobei, zm_yunxi, etc.
Language Codes:
| Code | Language | Note |
|---|---|---|
| a | American English | Default |
| b | British English | |
| j | Japanese | Requires pip install misaki[ja] |
| z | Mandarin Chinese | Requires pip install misaki[zh] |
| e | Spanish | |
| f | French | |
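The voice names listed above appear to follow a two-letter prefix pattern in which the first letter matches the language code and the second indicates gender (for example, `af_heart` is described as American female). The sketch below encodes that reading. It is an inference from the listings, not a documented rule, and `describe_voice` is a hypothetical helper rather than an MLX-Audio function.

```python
# Language codes and extra packages copied from the table above.
LANG_BY_CODE = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "e": "Spanish",
    "f": "French",
}
EXTRA_PACKAGE_BY_CODE = {"j": "misaki[ja]", "z": "misaki[zh]"}
# Assumption: the second prefix letter encodes gender (f = female, m = male).
GENDER_BY_LETTER = {"f": "female", "m": "male"}


def describe_voice(voice):
    """Infer language code and gender from a voice name such as 'af_heart'."""
    prefix, sep, name = voice.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    lang_code, gender_letter = prefix
    if lang_code not in LANG_BY_CODE or gender_letter not in GENDER_BY_LETTER:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return {
        "voice": voice,
        "lang_code": lang_code,
        "language": LANG_BY_CODE[lang_code],
        "gender": GENDER_BY_LETTER[gender_letter],
        "extra_package": EXTRA_PACKAGE_BY_CODE.get(lang_code),
    }


for voice in ("af_heart", "bm_george", "jf_alpha"):
    print(describe_voice(voice))
```

A helper like this could be used to pick a matching `lang_code` for a chosen voice, or to warn early when a voice needs an extra package.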
Qwen3-TTS
Alibaba’s multilingual TTS model with voice cloning, emotion control, and voice design capabilities.
    from mlx_audio.tts.utils import load_model

    model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")

    results = list(model.generate(
        text="Hello, welcome to MLX-Audio!",
        voice="Chelsie",
        language="English",
    ))
    audio = results[0].audio  # mx.array
See the Qwen3-TTS README for voice cloning, CustomVoice, VoiceDesign, and all available models.
CSM (Voice Cloning)
Clone any voice using a reference audio sample:
    mlx_audio.tts.generate \
        --model mlx-community/csm-1b \
        --text "Hello from Sesame." \
        --ref_audio ./reference_voice.wav \
        --play
Whisper STT
    from mlx_audio.stt.generate import generate_transcription

    result = generate_transcription(
        model="mlx-community/whisper-large-v3-turbo-asr-fp16",
        audio="audio.wav",
    )
    print(result.text)
VibeVoice-ASR
Microsoft’s 9B-parameter speech-to-text model with speaker diarization and timestamps. It supports long-form audio (up to 60 minutes) and outputs structured JSON.
    from mlx_audio.stt.utils import load

    model = load("mlx-community/VibeVoice-ASR-bf16")

    # Basic transcription
    result = model.generate(audio="meeting.wav", max_tokens=8192, temperature=0.0)
    print(result.text)
    # [{"Start":0,"End":5.2,"Speaker":0,"Content":"Hello everyone, let's begin."},
    #  {"Start":5.5,"End":9.8,"Speaker":1,"Content":"Thanks for joining today."}]

    # Access parsed segments
    for seg in result.segments:
        print(f"[{seg['start_time']:.1f}-{seg['end_time']:.1f}] Speaker {seg['speaker_id']}: {seg['text']}")
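The example above shows two views of the same result: a JSON string with `Start`, `End`, `Speaker`, and `Content` keys, and parsed segments with `start_time`, `end_time`, `speaker_id`, and `text` keys. As a rough illustration of how the two relate, the sketch below parses the sample JSON with the standard library and totals the talk time per speaker. The key mapping is an assumption drawn from the sample output, and `parse_segments` and `talk_time_by_speaker` are hypothetical helpers, not library functions.

```python
import json

# Sample output copied from the comment in the example above.
raw_text = """[{"Start":0,"End":5.2,"Speaker":0,"Content":"Hello everyone, let's begin."},
 {"Start":5.5,"End":9.8,"Speaker":1,"Content":"Thanks for joining today."}]"""


def parse_segments(text):
    """Convert the JSON transcript into dicts keyed like result.segments."""
    return [
        {
            "start_time": float(item["Start"]),
            "end_time": float(item["End"]),
            "speaker_id": int(item["Speaker"]),
            "text": item["Content"],
        }
        for item in json.loads(text)
    ]


def talk_time_by_speaker(segments):
    """Sum segment durations (in seconds) for each speaker id."""
    totals = {}
    for seg in segments:
        duration = seg["end_time"] - seg["start_time"]
        totals[seg["speaker_id"]] = totals.get(seg["speaker_id"], 0.0) + duration
    return totals


segments = parse_segments(raw_text)
for speaker, seconds in sorted(talk_time_by_speaker(segments).items()):
    print(f"Speaker {speaker}: {seconds:.1f} s")
```

In practice `result.segments` already provides the parsed form, so a parser like this is mainly useful when only the saved JSON output is available.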
Streaming transcription:
    # Stream tokens as they are generated
    for text in model.stream_transcribe(audio="speech.wav", max_tokens=4096):
        print(text, end="", flush=True)
With context (hotwords/metadata):
    result = model.generate(
        audio="technical_talk.wav",
        context="MLX, Apple Silicon, PyTorch, Transformer",
        max_tokens=8192,
        temperature=0.0,
    )
CLI usage:
    # Basic transcription
    python -m mlx_audio.stt.generate \
        --model mlx-community/VibeVoice-ASR-bf16 \
        --audio meeting.wav \
        --output-path output \
        --format json \
        --max-tokens 8192 \
        --verbose
    # With context/hotwords
    python -m mlx_audio.stt.generate \
        --model mlx-community/VibeVoice-ASR-bf16 \
        --audio technical_talk.wav \
        --output-path output \
        --format json \
        --max-tokens 8192 \
        --context "MLX, Apple Silicon, PyTorch, Transformer" \
        --verbose
SAM-Audio (Source Separation)
Separate specific sounds from audio using text prompts:
    from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio

    model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
    processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")

    batch = processor(
        descriptions=["A person speaking"],
        audios=["mixed_audio.wav"],
    )

    result = model.separate_long(
        batch.audios,
        descriptions=batch.descriptions,
        anchors=batch.anchor_ids,
        chunk_seconds=10.0,
        overlap_seconds=3.0,
        ode_opt={"method": "midpoint", "step_size": 2/32},
    )

    save_audio(result.target[0], "voice.wav")
    save_audio(result.residual[0], "background.wav")
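The `chunk_seconds` and `overlap_seconds` arguments suggest that long recordings are processed in overlapping windows. The sketch below shows one plausible way such windows could be laid out. It is an illustration of the general technique only: `chunk_spans` is a hypothetical helper, and the way `separate_long` actually splits and blends chunks may differ.

```python
def chunk_spans(total_seconds, chunk_seconds=10.0, overlap_seconds=3.0):
    """Return (start, end) times of overlapping windows covering the audio."""
    if chunk_seconds <= 0 or not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("need chunk_seconds > 0 and 0 <= overlap_seconds < chunk_seconds")
    hop = chunk_seconds - overlap_seconds  # distance between window starts
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the audio
        start += hop
    return spans


# A 25-second recording with the settings from the example above.
for start, end in chunk_spans(25.0):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

With these settings each window starts 7 seconds after the previous one, so neighbouring windows share 3 seconds of audio. A larger overlap gives more shared context at the cost of more windows to process.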
Remove noise from speech recordings with MossFormer2:
    from mlx_audio.sts import MossFormer2SEModel, save_audio

    model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
    enhanced = model.enhance("noisy_speech.wav")
    save_audio(enhanced, "clean.wav", 48000)
Web Interface & API Server
MLX-Audio includes a web interface and an OpenAI-compatible API.
Starting the Server
    # Start API server
    mlx_audio.server --host 0.0.0.0 --port 8000
    # Start web UI (in another terminal)
    cd mlx_audio/ui
    npm install && npm run dev
API Endpoints
Text-to-Speech (OpenAI-compatible):
    curl -X POST http://localhost:8000/v1/audio/speech \
        -H "Content-Type: application/json" \
        -d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
        --output speech.wav
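The same request can be prepared from Python with only the standard library. This is a minimal sketch mirroring the curl call above: `build_speech_request` is a hypothetical helper, the URL and payload fields are taken from that call, and the request is built but not sent, since sending requires a running server.

```python
import json
import urllib.request


def build_speech_request(text, voice="af_heart",
                         model="mlx-community/Kokoro-82M-bf16",
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request("Hello!")
print(request.get_method(), request.full_url)

# With the server running, the audio could be fetched and saved like this:
# with urllib.request.urlopen(request) as response, open("speech.wav", "wb") as f:
#     f.write(response.read())
```

Because the endpoint is described as OpenAI-compatible, an OpenAI client library pointed at this base URL may also work, though that is not shown in the source.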
Speech-to-Text:
    curl -X POST http://localhost:8000/v1/audio/transcriptions \
        -F "file=@audio.wav" \
        -F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
Quantization
Reduce model size and improve performance with quantization using the convert script:
    # Convert and quantize to 4-bit
    python -m mlx_audio.convert \
        --hf-path prince-canuma/Kokoro-82M \
        --mlx-path ./Kokoro-82M-4bit \
        --quantize \
        --q-bits 4 \
        --upload-repo username/Kokoro-82M-4bit  # optional: uploads the model to Hugging Face
    # Convert with specific dtype (bfloat16)
    python -m mlx_audio.convert \
        --hf-path prince-canuma/Kokoro-82M \
        --mlx-path ./Kokoro-82M-bf16 \
        --dtype bfloat16 \
        --upload-repo username/Kokoro-82M-bf16  # optional: uploads the model to Hugging Face
Options:
| Flag | Description |
|---|---|
| --hf-path | Source Hugging Face model or local path |
| --mlx-path | Output directory for converted model |
| -q, --quantize | Enable quantization |
| --q-bits | Bits per weight (for example 4, 6, or 8) |
| --q-group-size | Group size for quantization (default: 64) |
| --dtype | Weight dtype: float16, bfloat16, float32 |
| --upload-repo | Upload converted model to HF Hub |
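To get a feel for how `--q-bits` and `--q-group-size` affect model size, the sketch below gives a back-of-the-envelope estimate. It rests on several assumptions that the source does not state: that every weight is quantized, that each group stores one 16-bit scale and one 16-bit bias (32 extra bits per group), and that "82M" in the model name means 82 million parameters. Real converted models will differ, since some layers are typically left unquantized.

```python
def quantized_size_mb(num_params, q_bits=4, group_size=64, overhead_bits_per_group=32):
    """Estimate size in MB when all weights are group-quantized (rough sketch)."""
    # Each weight costs q_bits plus its share of the per-group scale and bias.
    bits_per_weight = q_bits + overhead_bits_per_group / group_size
    return num_params * bits_per_weight / 8 / 1e6


def dense_size_mb(num_params, bytes_per_weight=2):
    """Size in MB for unquantized 16-bit weights (float16 or bfloat16)."""
    return num_params * bytes_per_weight / 1e6


params = 82_000_000  # assumed from the model name "Kokoro-82M"
print(f"16-bit: {dense_size_mb(params):.1f} MB")
for bits in (4, 6, 8):
    print(f"{bits}-bit, group size 64: {quantized_size_mb(params, q_bits=bits):.1f} MB")
```

Under these assumptions a group size of 64 adds half a bit per weight, so a nominal 4-bit model costs about 4.5 bits per weight. A smaller group size increases that overhead.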
Swift
Looking for Swift/iOS support? Check out mlx-audio-swift for on-device TTS using MLX on macOS and iOS.
Requirements
- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
- ffmpeg (required for MP3/FLAC audio encoding)
Installing ffmpeg
ffmpeg is required for saving audio in MP3 or FLAC format. Install it using:
    # macOS (using Homebrew)
    brew install ffmpeg
    # Ubuntu/Debian
    sudo apt install ffmpeg
WAV format works without ffmpeg.
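For illustration, uncompressed WAV output needs nothing beyond Python's standard library. The sketch below writes mono float samples as 16-bit PCM using the `wave` module. It is not MLX-Audio's own saving code: `write_wav` is a hypothetical helper, the 24000 Hz sample rate is an assumption, and a synthetic tone stands in for a generated waveform.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in waveform: one second of a 440 Hz tone at the assumed sample rate.
# With MLX-Audio, the samples would instead come from a generated result,
# for example a Python list obtained from result.audio.
sample_rate = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav(out_path, tone, sample_rate)
print(f"wrote {out_path}")
```

Compressed formats such as MP3 and FLAC need an encoder, which is why ffmpeg is listed as a requirement for them.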
License
MIT License
Citation
    @misc{mlx-audio,
      author = {Canuma, Prince},
      title = {MLX Audio},
      year = {2025},
      howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
      note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
    }
Acknowledgements