Blaizzy/mlx-vlm
MLX-VLM
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.
Table of Contents
- Model-Specific Documentation
- Installation
- Usage
- Activation Quantization (CUDA)
- Multi-Image Chat Support
- Video Understanding
- Vision Feature Caching
- TurboQuant KV Cache
- Fine-tuning
Model-Specific Documentation
Some models have detailed documentation with prompt formats, examples, and best practices:
| Model | Documentation |
|---|---|
| DeepSeek-OCR | Docs |
| DeepSeek-OCR-2 | Docs |
| DOTS-OCR | Docs |
| DOTS-MOCR | Docs |
| GLM-OCR | Docs |
| Phi-4 Reasoning Vision | Docs |
| MiniCPM-o | Docs |
| Phi-4 Multimodal | Docs |
| MolmoPoint | Docs |
| Moondream3 | Docs |
| Gemma 4 | Docs |
| Falcon-OCR | Docs |
| Granite Vision 3.2 | Docs |
| Granite 4.0 Vision | Docs |
Installation
The easiest way to get started is to install the mlx-vlm package using pip:
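The package is published on PyPI, so a standard pip install is all that is needed:

```shell
pip install mlx-vlm
```
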
Usage
Command Line Interface (CLI)
Generate output from a model using the CLI:
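A typical invocation looks like the sketch below; the model repo is an example from the mlx-community Hub organization, and any supported VLM can be substituted:

```shell
# Generate a description of an image from the command line.
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 \
  --temperature 0.0 \
  --prompt "Describe this image." \
  --image http://images.cocodataset.org/val2017/000000039769.jpg
```
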
Thinking Budget
For thinking models (e.g., Qwen3.5), you can limit the number of tokens spent in the thinking block:
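A sketch of capping the thinking block at 512 tokens, using the flags from the table below (the model repo is an assumed example):

```shell
python -m mlx_vlm.generate \
  --model mlx-community/Qwen3.5-4B-4bit \
  --prompt "Solve: what is 17 * 24?" \
  --enable-thinking \
  --thinking-budget 512
```
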
| Flag | Description |
|---|---|
| `--enable-thinking` | Activate thinking mode in the chat template |
| `--thinking-budget` | Max tokens allowed inside the thinking block |
| `--thinking-start-token` | Token that opens a thinking block (default: `<think>`) |
| `--thinking-end-token` | Token that closes a thinking block (default: `</think>`) |
When the budget is exceeded, the model is forced to emit \n</think> and transition to the answer. If --enable-thinking is passed but the model’s chat template does not support it, the budget is applied only if the model generates the start token on its own.
Chat UI with Gradio
Launch a chat interface using Gradio:
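A minimal launch command; the model repo is an example:

```shell
python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```
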
Python Script
Here’s an example of how to use MLX-VLM in a Python script:
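A minimal sketch of the Python API, assuming an example model repo from mlx-community:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply the model's chat template, accounting for the number of images.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
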
Audio Example
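A sketch of audio-only generation with an omni model. The model repo and the `num_audios` and `audio` parameter names are assumptions mirroring the image case, not confirmed API:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2.5-Omni-3B-4bit"  # assumed example omni model
model, processor = load(model_path)
config = load_config(model_path)

audio = ["speech.wav"]
prompt = "Transcribe this audio clip."

# `num_audios` is assumed to mirror `num_images`.
formatted_prompt = apply_chat_template(processor, config, prompt, num_audios=len(audio))

output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)
```
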
Multi-Modal Example (Image + Audio)
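Combining modalities follows the same pattern; again, the model repo and audio-related keywords are assumptions:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2.5-Omni-3B-4bit"  # assumed example repo
model, processor = load(model_path)
config = load_config(model_path)

images = ["scene.jpg"]
audio = ["narration.wav"]
prompt = "Does the narration match what is shown in the image?"

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images), num_audios=len(audio)
)
output = generate(model, processor, formatted_prompt, images, audio=audio, verbose=False)
print(output)
```
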
Server (FastAPI)
Start the server:
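A typical startup, optionally preloading a model (repo id is an example):

```shell
python -m mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --port 8080
```
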
Server Options
- `--model`: Preload a model at server startup; accepts a Hugging Face repo ID or local path (optional; loads lazily on first request if omitted)
- `--adapter-path`: Path to adapter weights to use with the preloaded model
- `--host`: Host address (default: `0.0.0.0`)
- `--port`: Port number (default: `8080`)
- `--trust-remote-code`: Trust remote code when loading models from Hugging Face Hub
- `--kv-bits`: Number of bits for KV cache quantization (e.g. `3.5` for TurboQuant)
- `--kv-quant-scheme`: KV cache quantization backend (`uniform` or `turboquant`)
You can also set trust remote code via environment variable:
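For example (the exact variable name below is an assumption; check the server docs):

```shell
# Variable name assumed; consult the server documentation for the exact one.
export MLX_VLM_TRUST_REMOTE_CODE=true
python -m mlx_vlm.server
```
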
The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
Available Endpoints
- `/models` and `/v1/models`: List models available locally
- `/chat/completions` and `/v1/chat/completions`: OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text
- `/responses` and `/v1/responses`: OpenAI-compatible responses endpoint
- `/health`: Check server status
- `/unload`: Unload the current model from memory
Usage Examples
List available models
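Assuming the server is running on the default host and port:

```shell
curl http://localhost:8080/v1/models
```
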
Text Input
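A text-only request in the OpenAI chat-completions shape (model id is an example):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What is MLX?"}],
    "max_tokens": 100
  }'
```
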
Image Input
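Images can be passed as OpenAI-style content parts; the `image_url` part shape here assumes the server follows the standard OpenAI schema:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
      ]
    }],
    "max_tokens": 100
  }'
```
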
Audio Support (New)
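A sketch of an audio request. The `input_audio` content-part shape is borrowed from OpenAI's schema and is an assumption here, as is the model repo:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Omni-3B-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Transcribe this audio."},
        {"type": "input_audio", "input_audio": {"data": "<base64-encoded wav>", "format": "wav"}}
      ]
    }]
  }'
```
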
Multi-Modal (Image + Audio)
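Image and audio parts can be combined in one message; as above, the content-part shapes are assumptions modeled on the OpenAI schema:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Omni-3B-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Does the audio match the image?"},
        {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
        {"type": "input_audio", "input_audio": {"data": "<base64-encoded wav>", "format": "wav"}}
      ]
    }]
  }'
```
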
Responses Endpoint
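A sketch of the responses endpoint; the `input` field follows OpenAI's Responses API shape, which is an assumption here:

```shell
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "input": "Write a haiku about Apple Silicon."
  }'
```
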
Request Parameters
- `model`: Model identifier (required)
- `messages`: Chat messages for chat/OpenAI endpoints
- `max_tokens`: Maximum tokens to generate
- `temperature`: Sampling temperature
- `top_p`: Top-p sampling parameter
- `top_k`: Top-k sampling cutoff
- `min_p`: Min-p sampling threshold
- `repetition_penalty`: Penalty applied to repeated tokens
- `stream`: Enable streaming responses
Activation Quantization (CUDA)
When running on NVIDIA GPUs with MLX CUDA, models quantized with mxfp8 or nvfp4 modes require activation quantization to work properly. This converts QuantizedLinear layers to QQLinear layers, which quantize both weights and activations.
Command Line
Use the -qa or --quantize-activations flag:
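For example (the model repo is a placeholder for any mxfp8/nvfp4-quantized model):

```shell
python -m mlx_vlm.generate \
  --model mlx-community/<model-mxfp8> \
  --quantize-activations \
  --prompt "Describe this image." \
  --image photo.jpg
```
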
Python API
Pass quantize_activations=True to the load function:
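A minimal sketch; the repo id is a placeholder for any mxfp8/nvfp4-quantized model:

```python
from mlx_vlm import load

model, processor = load(
    "mlx-community/<model-nvfp4>",
    quantize_activations=True,
)
```
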
Supported Quantization Modes
- `mxfp8`: 8-bit MX floating point
- `nvfp4`: 4-bit NVIDIA floating point
Note: This feature is required for mxfp/nvfp quantized models on CUDA. On Apple Silicon (Metal), these models work without the flag.
Multi-Image Chat Support
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
Usage Examples
Python Script
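A sketch of passing multiple images through the Python API, assuming a multi-image-capable example model:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["photo1.jpg", "photo2.jpg"]
prompt = "Compare these two images and describe the differences."

# `num_images` tells the chat template how many image placeholders to insert.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```
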
Command Line
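The CLI equivalent, passing several paths to `--image` (model repo is an example):

```shell
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Compare these two images." \
  --image photo1.jpg photo2.jpg
```
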
Video Understanding
MLX-VLM also supports video analysis tasks, such as captioning and summarization, with select models.
Supported Models
The following models support video chat:
- Qwen2-VL
- Qwen2.5-VL
- Idefics3
- LLaVA
With more coming soon.
Usage Examples
Command Line
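A sketch of a video invocation; the entry-point name and flags are assumptions, so check the repo for the exact video interface:

```shell
# Module and flag names assumed; model repo is an example.
python -m mlx_vlm.video_generate \
  --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit \
  --prompt "Describe what happens in this video." \
  --video clip.mp4 \
  --max-tokens 200
```
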
These examples demonstrate how to use multiple images with MLX-VLM for more complex visual reasoning tasks.
Vision Feature Caching
In multi-turn conversations about an image, the vision encoder runs on every turn even though the image hasn’t changed. VisionFeatureCache stores projected vision features in an LRU cache keyed by image path, so the expensive vision encoder is only called once per unique image.
How It Works
- First turn (cache miss) – `encode_image()` runs the full vision pipeline (vision tower + projector), stores the result in the cache, and passes it to the language model.
- Subsequent turns (cache hit) – the cached features are passed directly via `cached_image_features`, skipping the vision encoder entirely.
- Image switch – when the image changes, it's a new cache key, so features are computed and cached. Switching back to a previous image is a cache hit.
The cache holds up to 8 entries (configurable) and uses LRU eviction.
CLI
All chat interfaces use VisionFeatureCache automatically:
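For example, the interactive chat CLI (module name assumed from the package layout, model repo an example):

```shell
python -m mlx_vlm.chat --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```
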
Python
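The library's exact API is not reproduced here; as an illustration of the behavior described above, a minimal LRU cache keyed by image path (a hypothetical sketch, not the package's `VisionFeatureCache` class) might look like:

```python
from collections import OrderedDict

class VisionFeatureCacheSketch:
    """Illustrative LRU cache keyed by image path (not the library's actual class)."""
    def __init__(self, max_entries=8):
        self.max_entries = max_entries
        self._cache = OrderedDict()
        self.encoder_calls = 0  # counts expensive vision-encoder invocations

    def encode_image(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)        # cache hit: refresh LRU order
            return self._cache[path]
        self.encoder_calls += 1                  # cache miss: run the encoder
        features = f"features({path})"           # stand-in for projected features
        self._cache[path] = features
        if len(self._cache) > self.max_entries:  # evict least-recently-used entry
            self._cache.popitem(last=False)
        return features

cache = VisionFeatureCacheSketch()
cache.encode_image("cat.jpg")   # miss: encoder runs
cache.encode_image("cat.jpg")   # hit: encoder skipped
cache.encode_image("dog.jpg")   # miss: new key
cache.encode_image("cat.jpg")   # hit again after switching back
print(cache.encoder_calls)      # → 2
```
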
Server
The server caches vision features automatically across requests for the same image. No configuration needed – the cache is created when a model loads and cleared on unload.
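For example, sending the same image in two consecutive requests: the vision encoder runs only for the first, and the second is a cache hit (model repo and content-part schema are assumed examples):

```shell
for i in 1 2; do
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
        ]
      }]
    }'
done
```
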
Multi-turn conversations via /v1/chat/completions (streaming and non-streaming) and /responses all benefit. The same image sent across multiple requests will only be encoded once.
Performance
Tested on google/gemma-4-26b-a4b-it over 10 multi-turn conversation turns:
| Metric | Without Cache | With Cache |
|---|---|---|
| Prompt TPS | ~48 | ~550-825 |
| Speedup | – | 11x+ |
| Peak Memory | 52.66 GB | 52.66 GB (flat) |
Generation speed (~31 tok/s) and memory are unaffected – only prompt processing gets faster.
TurboQuant KV Cache
TurboQuant compresses the KV cache during generation, enabling longer context lengths with less memory while maintaining quality.
Quick Start
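A sketch using the `--kv-bits` and `--kv-quant-scheme` flags documented for the server; whether the generate CLI accepts the same flags is an assumption here, and the model repo is an example:

```shell
# CLI generation with a TurboQuant-compressed KV cache (flags assumed shared
# with the server):
python -m mlx_vlm.generate \
  --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe this image." --image photo.jpg \
  --kv-bits 3.5 --kv-quant-scheme turboquant

# Server with the same settings:
python -m mlx_vlm.server --kv-bits 3.5 --kv-quant-scheme turboquant
```
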
How It Works
TurboQuant uses random rotation + codebook quantization (arXiv:2504.19874) to compress KV cache entries from 16-bit to 2-4 bits per dimension:
- Keys & Values: MSE codebook quantization with Hadamard rotation
- Fractional bits (e.g. 3.5): uses lower bits for keys, higher for values (3-bit K + 4-bit V)
Custom Metal kernels fuse score computation and value aggregation directly on packed quantized data, avoiding full dequantization during decode.
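The fractional-bit arithmetic and the compression ratios in the table below follow directly from the bit widths; a quick check in pure Python (no mlx required, and ignoring scale/codebook overhead):

```python
# Effective bits for the fractional 3.5-bit scheme: keys at 3 bits, values at 4.
key_bits, value_bits = 3, 4
effective_bits = (key_bits + value_bits) / 2
print(effective_bits)                 # → 3.5

# Compression relative to 16-bit KV entries.
for bits in (2, 3, 3.5, 4):
    print(bits, round(16 / bits, 2))  # → 8.0x, 5.33x, 4.57x, 4.0x respectively
```
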
Performance
Tested on Qwen3.5-4B-4bit at 128k context:
| Metric | Baseline | TurboQuant 3.5-bit |
|---|---|---|
| KV Memory | 4.1 GB | 0.97 GB (76% reduction) |
| Peak Memory | 18.3 GB | 17.3 GB (-1.0 GB) |
At 512k+ contexts, TurboQuant’s per-layer attention is faster than FP16 SDPA due to reduced memory bandwidth requirements.
Tested on gemma-4-31b-it at 128k context:
| Metric | Baseline | TurboQuant 3.5-bit |
|---|---|---|
| KV Memory | 13.3 GB | 4.9 GB (63% reduction) |
| Peak Memory | 75.2 GB | 65.8 GB (-9.4 GB) |
Supported Bit Widths
| Bits | Compression | Best For |
|---|---|---|
| 2 | ~8x | Maximum compression, some quality loss |
| 3 | ~5x | Good balance of quality and compression |
| 3.5 | ~4.5x | Recommended default (3-bit keys + 4-bit values) |
| 4 | ~4x | Best quality, moderate compression |
Compatibility
TurboQuant automatically quantizes KVCache layers (global attention). Models with RotatingKVCache (sliding window) or ArraysCache (MLA/absorbed keys) keep their native cache format for those layers since they are already memory-efficient.
Fine-tuning
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
LoRA & QLoRA
To learn more about LoRA, please refer to the LoRA.md file.