Blaizzy/mlx-audio
MLX-Audio
A text-to-speech (TTS) and speech-to-speech (STS) library built on Apple’s MLX framework, providing efficient speech synthesis on Apple Silicon.
Features
- Fast inference on Apple Silicon (M series chips)
- Multiple language support
- Voice customization options
- Adjustable speech speed control (0.5x to 2.0x)
- Interactive web interface with 3D audio visualization
- REST API for TTS generation
- Quantization support for optimized performance
- Direct access to output files via Finder/Explorer integration
Installation
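The package is published on PyPI; a minimal install (assuming Python 3.8+ and pip are available) is:

```shell
# Install the library from PyPI
pip install mlx-audio
```

The web interface and API server additionally require FastAPI and Uvicorn (see Requirements below).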
Quick Start
To generate audio from the command line, use:
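A minimal sketch of the command-line entry point (the `mlx_audio.tts.generate` command and the `--text`, `--file_prefix`, and `--speed` flags follow the project's published examples; verify against `--help` on your install):

```shell
# Basic usage: synthesize speech from text
mlx_audio.tts.generate --text "Hello, world"

# Choose an output file prefix and adjust speaking speed (0.5x to 2.0x)
mlx_audio.tts.generate --text "Hello, world" --file_prefix hello --speed 1.4
```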
How to call from Python
To generate audio from Python, use:
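A sketch of the Python entry point, following the project's documented `generate_audio` helper; the model path, voice name, and parameter set are taken from the published example and should be verified against your installed version:

```python
from mlx_audio.tts.generate import generate_audio

# Synthesize speech and write it to an audio file in the current directory.
# Model path, voice, and language code below are illustrative defaults.
generate_audio(
    text="Hello from MLX-Audio.",
    model_path="prince-canuma/Kokoro-82M",  # Kokoro weights on the Hugging Face Hub
    voice="af_heart",                        # one of the bundled voice styles
    speed=1.2,                               # speaking speed, 0.5x to 2.0x
    lang_code="a",                           # 'a' = American English
    file_prefix="hello",
    audio_format="wav",
    verbose=True,
)
```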
Web Interface & API Server
MLX-Audio includes a web interface with a 3D visualization that reacts to audio frequencies. The interface allows you to:
- Generate TTS with different voices and speed settings
- Upload and play your own audio files
- Visualize audio with an interactive 3D orb
- Automatically save generated audio files to the `outputs` directory in the current working folder
- Open the output folder directly from the interface (when running locally)
Features
- Multiple Voice Options: Choose from different voice styles (AF Heart, AF Nova, AF Bella, BF Emma)
- Adjustable Speech Speed: Control the speed of speech generation with an interactive slider (0.5x to 2.0x)
- Real-time 3D Visualization: A responsive 3D orb that reacts to audio frequencies
- Audio Upload: Play and visualize your own audio files
- Auto-play Option: Automatically play generated audio
- Output Folder Access: Convenient button to open the output folder in your system’s file explorer
To start the web interface and API server:
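A sketch of starting the bundled server (the `mlx_audio.server` module name follows the project's published examples; the defaults match the arguments documented below):

```shell
# Start the web interface and API server with defaults (127.0.0.1:8000)
mlx_audio.server

# Bind to a different host and port
mlx_audio.server --host 0.0.0.0 --port 9000
```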
Available command line arguments:
- `--host`: Host address to bind the server to (default: `127.0.0.1`)
- `--port`: Port to bind the server to (default: `8000`)
Then open your browser and navigate to:
`http://127.0.0.1:8000` (or the host and port you passed on the command line)
API Endpoints
The server provides the following REST API endpoints:
- `POST /tts`: Generate TTS audio
  - Parameters (form data):
    - `text`: The text to convert to speech (required)
    - `voice`: Voice to use (default: `af_heart`)
    - `speed`: Speech speed from 0.5 to 2.0 (default: 1.0)
  - Returns: JSON with the filename of the generated audio
- `GET /audio/{filename}`: Retrieve a generated audio file
- `POST /play`: Play audio directly from the server
  - Parameters (form data):
    - `filename`: The filename of the audio to play (required)
  - Returns: JSON with status and filename
- `POST /stop`: Stop any currently playing audio
  - Returns: JSON with status
- `POST /open_output_folder`: Open the output folder in the system’s file explorer
  - Returns: JSON with status and path
  - Note: This feature only works when running the server locally

Note: Generated audio files are stored in `~/.mlx_audio/outputs` by default, or in a fallback directory if that location is not writable.
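As a sketch, the `/tts` endpoint can be exercised from Python's standard library by URL-encoding the form fields. The endpoint path and field names come from the list above; the host and port are the server defaults:

```python
import urllib.parse
import urllib.request

# Form fields accepted by POST /tts, per the endpoint list above
fields = {
    "text": "Hello from the API.",  # required
    "voice": "af_heart",            # default voice
    "speed": "1.2",                 # 0.5 to 2.0
}
body = urllib.parse.urlencode(fields).encode("utf-8")

# Build (but do not yet send) the request against the default server address
req = urllib.request.Request(
    "http://127.0.0.1:8000/tts",
    data=body,
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# urllib.request.urlopen(req) would then return JSON with the generated filename
```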
Models
Kokoro
Kokoro is a multilingual TTS model that supports various languages and voice styles.
Example Usage
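A usage sketch following the project's published Kokoro example; the model id, `KokoroPipeline` class, and 24 kHz sample rate are taken from that example and should be verified against your installed version:

```python
import soundfile as sf

from mlx_audio.tts.models.kokoro import KokoroPipeline
from mlx_audio.tts.utils import load_model

# Load Kokoro weights and build a pipeline for American English ('a')
model_id = "prince-canuma/Kokoro-82M"
model = load_model(model_id)
pipeline = KokoroPipeline(lang_code="a", model=model, repo_id=model_id)

# The pipeline yields audio chunks; write each to a wav file at 24 kHz
text = "MLX-Audio runs Kokoro efficiently on Apple Silicon."
for _, _, audio in pipeline(text, voice="af_heart", speed=1.0):
    sf.write("audio.wav", audio[0], 24000)
```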
Language Options
- 🇺🇸 `'a'` - American English
- 🇬🇧 `'b'` - British English
- 🇯🇵 `'j'` - Japanese (requires `pip install misaki[ja]`)
- 🇨🇳 `'z'` - Mandarin Chinese (requires `pip install misaki[zh]`)
CSM (Conversational Speech Model)
CSM is a model from Sesame that supports text-to-speech and lets you customize voices using reference audio samples.
Example Usage
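Voice cloning with CSM is driven by a reference recording; a command-line sketch (the `mlx-community/csm-1b` model id and the `--ref_audio` and `--play` flags follow the project's published example):

```shell
# Generate speech with CSM, cloning the voice from a reference recording
python -m mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --play \
  --ref_audio ./conversational_a.wav
```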
You can pass any audio file to clone the voice from, or download a sample audio file from here.
Advanced Features
Quantization
You can quantize models for improved performance:
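A quantization sketch based on the project's documented `quantize_model` helper; the group size (64) and bit width (8) are illustrative choices, and the output paths are placeholders:

```python
import json

import mlx.core as mx

from mlx_audio.tts.utils import load_model, quantize_model

# Load the full-precision model and quantize its weights to 8-bit
model = load_model(repo_id="prince-canuma/Kokoro-82M")
config = model.config
weights, config = quantize_model(model, config, 64, 8)  # group_size=64, bits=8

# Persist the quantized config and weights in MLX safetensors format
with open("./8bit/config.json", "w") as f:
    json.dump(config, f)
mx.save_safetensors("./8bit/kokoro-v1_0.safetensors", weights, metadata={"format": "mlx"})
```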
Requirements
- MLX
- Python 3.8+
- Apple Silicon Mac (for optimal performance)
- For the web interface and API:
- FastAPI
- Uvicorn
License
Acknowledgements
- Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.
- This project uses the Kokoro model architecture for text-to-speech synthesis.
- The 3D visualization uses Three.js for rendering.