<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Multimodal on Producthunt daily</title>
        <link>https://producthunt.programnotes.cn/en/tags/multimodal/</link>
        <description>Recent content in Multimodal on Producthunt daily</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 21 Sep 2025 15:24:57 +0800</lastBuildDate><atom:link href="https://producthunt.programnotes.cn/en/tags/multimodal/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>OM1</title>
        <link>https://producthunt.programnotes.cn/en/p/om1/</link>
        <pubDate>Sun, 21 Sep 2025 15:24:57 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/om1/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1675021278785-adc2f7d92173?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTg0MzkzOTB8&amp;ixlib=rb-4.1.0" alt="Featured image of post OM1" /&gt;&lt;h1 id=&#34;openmindom1&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenMind/OM1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenMind/OM1&lt;/a&gt;
&lt;/h1&gt;&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/853153b7-351a-433d-9e1a-d257b781f93c&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;OM_Banner_X2 (1)&#34;
	
	
&gt;&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;  &lt;a href=&#34;https://arxiv.org/abs/2412.18588&#34;&gt;Technical Paper&lt;/a&gt; |  &lt;a href=&#34;https://docs.openmind.org/&#34;&gt;Documentation&lt;/a&gt; |  &lt;a href=&#34;https://x.com/openmind_agi&#34;&gt;X&lt;/a&gt; | &lt;a href=&#34;https://discord.gg/VUjpg4ef5n&#34;&gt;Discord&lt;/a&gt; &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenMind&amp;rsquo;s OM1 is a modular AI runtime that empowers developers to create and deploy multimodal AI agents across digital environments and physical robots&lt;/strong&gt;, including humanoids, phone apps, websites, quadrupeds, and educational robots such as the TurtleBot 4. OM1 agents can process diverse inputs such as web data, social media, camera feeds, and LIDAR, while enabling physical actions including motion, autonomous navigation, and natural conversation. The goal of OM1 is to make it easy to create highly capable, human-focused robots that are easy to upgrade and (re)configure to accommodate different physical form factors.&lt;/p&gt;
&lt;h2 id=&#34;capabilities-of-om1&#34;&gt;Capabilities of OM1
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modular Architecture&lt;/strong&gt;: Designed with Python for simplicity and seamless integration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Input&lt;/strong&gt;: Easily handles new data and sensors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hardware Support via Plugins&lt;/strong&gt;: Supports new hardware through plugins for API endpoints and specific robot hardware connections to &lt;code&gt;ROS2&lt;/code&gt;, &lt;code&gt;Zenoh&lt;/code&gt;, and &lt;code&gt;CycloneDDS&lt;/code&gt;. (We recommend &lt;code&gt;Zenoh&lt;/code&gt; for all new development).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web-Based Debugging Display&lt;/strong&gt;: Monitor the system in action with WebSim (available at http://localhost:8000/) for easy visual debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-configured Endpoints&lt;/strong&gt;: Supports Voice-to-Speech, OpenAI’s &lt;code&gt;gpt-4o&lt;/code&gt;, DeepSeek, and multiple Visual Language Models (VLMs) with pre-configured endpoints for each service.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;architecture-overview&#34;&gt;Architecture Overview
&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/14e9b916-4df7-4700-9336-2983c85be311&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Artboard 1@4x 1 (1)&#34;
	
	
&gt;&lt;/p&gt;
&lt;h2 id=&#34;getting-started---hello-world&#34;&gt;Getting Started - Hello World
&lt;/h2&gt;&lt;p&gt;To get started with OM1, let&amp;rsquo;s run the Spot agent. Spot uses your webcam to capture and label objects. These text captions are then sent to OpenAI&amp;rsquo;s &lt;code&gt;gpt-4o&lt;/code&gt;, which returns &lt;code&gt;movement&lt;/code&gt;, &lt;code&gt;speech&lt;/code&gt;, and &lt;code&gt;face&lt;/code&gt; action commands. These commands are displayed in WebSim along with basic timing and other debugging information.&lt;/p&gt;
&lt;h3 id=&#34;package-management-and-venv&#34;&gt;Package Management and VENV
&lt;/h3&gt;&lt;p&gt;You will need the &lt;a class=&#34;link&#34; href=&#34;https://docs.astral.sh/uv/getting-started/installation/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;code&gt;uv&lt;/code&gt; package manager&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;clone-the-repo&#34;&gt;Clone the Repo
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/openmind/OM1.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; OM1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git submodule update --init
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv venv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;install-dependencies&#34;&gt;Install Dependencies
&lt;/h3&gt;&lt;p&gt;For macOS&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;brew install portaudio ffmpeg
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For Linux&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get update
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get install portaudio19-dev python3-dev ffmpeg
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;obtain-an-openmind-api-key&#34;&gt;Obtain an OpenMind API Key
&lt;/h3&gt;&lt;p&gt;Obtain your API Key at &lt;a class=&#34;link&#34; href=&#34;https://portal.openmind.org/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenMind Portal&lt;/a&gt;. Copy it into &lt;code&gt;config/spot.json5&lt;/code&gt;, replacing the &lt;code&gt;openmind_free&lt;/code&gt; placeholder. Alternatively, run &lt;code&gt;cp env.example .env&lt;/code&gt; and add your key to the &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;
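&lt;p&gt;For example, a minimal &lt;code&gt;.env&lt;/code&gt; setup might look like this (the exact variable name is defined in &lt;code&gt;env.example&lt;/code&gt;; &lt;code&gt;OM1_API_KEY&lt;/code&gt; below is illustrative):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cp env.example .env
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Illustrative variable name -- check env.example for the exact key OM1 expects&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &amp;#34;OM1_API_KEY=your_key_here&amp;#34; &amp;gt;&amp;gt; .env
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;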
&lt;h3 id=&#34;launching-om1&#34;&gt;Launching OM1
&lt;/h3&gt;&lt;p&gt;Run&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv run src/run.py spot
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After launching OM1, the Spot agent will interact with you and perform (simulated) actions. For more help connecting OM1 to your robot hardware, see &lt;a class=&#34;link&#34; href=&#34;https://docs.openmind.org/getting-started&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;getting started&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;whats-next&#34;&gt;What&amp;rsquo;s Next?
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Try out some &lt;a class=&#34;link&#34; href=&#34;https://docs.openmind.org/examples&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Add new &lt;code&gt;inputs&lt;/code&gt; and &lt;code&gt;actions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Design custom agents and robots by creating your own &lt;code&gt;json5&lt;/code&gt; config files with custom combinations of inputs and actions (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Change the system prompts in the configuration files (located in &lt;code&gt;/config/&lt;/code&gt;) to create new behaviors.&lt;/li&gt;
&lt;/ul&gt;
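&lt;p&gt;A minimal, hypothetical &lt;code&gt;json5&lt;/code&gt; sketch of such a config (field names are illustrative; see the shipped files in &lt;code&gt;/config/&lt;/code&gt;, such as &lt;code&gt;spot.json5&lt;/code&gt;, for the real schema):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json5&#34; data-lang=&#34;json5&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;// config/my_agent.json5 -- illustrative only; copy an existing config as a starting point
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  name: &amp;#34;my_agent&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  system_prompt: &amp;#34;You are a friendly, helpful quadruped robot.&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  inputs: [&amp;#34;camera&amp;#34;, &amp;#34;lidar&amp;#34;],
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  actions: [&amp;#34;move&amp;#34;, &amp;#34;speak&amp;#34;, &amp;#34;face&amp;#34;],
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;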
&lt;h2 id=&#34;interfacing-with-new-robot-hardware&#34;&gt;Interfacing with New Robot Hardware
&lt;/h2&gt;&lt;p&gt;OM1 assumes that robot hardware provides a high-level SDK that accepts elemental movement and action commands such as &lt;code&gt;backflip&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;gently pick up the red apple&lt;/code&gt;, &lt;code&gt;move(0.37, 0, 0)&lt;/code&gt;, and &lt;code&gt;smile&lt;/code&gt;. An example is provided in &lt;code&gt;actions/move_safe/connector/ros2.py&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;elif&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;output_interface&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;action&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;shake paw&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sport_client&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sport_client&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Hello&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If your robot hardware does not yet provide a suitable HAL (hardware abstraction layer), you will need to create one using traditional robotics approaches such as RL (reinforcement learning), in concert with suitable simulation environments (Unity, Gazebo), sensors (such as hand-mounted ZED depth cameras), and custom VLAs. It is further assumed that your HAL accepts motion trajectories, provides battery and thermal management/monitoring, and calibrates and tunes sensors such as IMUs, LIDARs, and magnetometers.&lt;/p&gt;
&lt;p&gt;OM1 can interface with your HAL via USB, serial, ROS2, CycloneDDS, Zenoh, or websockets. For an example of an advanced humanoid HAL, please see &lt;a class=&#34;link&#34; href=&#34;https://github.com/unitreerobotics/unitree_sdk2/blob/adee312b081c656ecd0bb4e936eed96325546296/example/g1/high_level/g1_loco_client_example.cpp#L159&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Unitree&amp;rsquo;s C++ SDK&lt;/a&gt;. Frequently, a HAL, especially ROS2 code, will be dockerized and can then interface with OM1 through DDS middleware or websockets.&lt;/p&gt;
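&lt;p&gt;As a minimal sketch of the Zenoh path (assuming the &lt;code&gt;eclipse-zenoh&lt;/code&gt; Python package; the key expression and payload below are illustrative, not OM1&amp;rsquo;s actual topic schema):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;import zenoh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# Open a Zenoh session and publish one movement command to a HAL subscriber.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;session = zenoh.open(zenoh.Config())
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;publisher = session.declare_publisher(&amp;#34;robot/command/move&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;publisher.put(&amp;#34;move(0.37, 0, 0)&amp;#34;)  # hypothetical payload format
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;session.close()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;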
&lt;h2 id=&#34;recommended-development-platforms&#34;&gt;Recommended Development Platforms
&lt;/h2&gt;&lt;p&gt;OM1 is developed on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jetson AGX Orin 64GB (running Ubuntu 22.04 and JetPack 6.1)&lt;/li&gt;
&lt;li&gt;Mac Studio with Apple M2 Ultra and 48 GB unified memory (running macOS Sequoia)&lt;/li&gt;
&lt;li&gt;Mac Mini with Apple M4 Pro and 48 GB unified memory (running macOS Sequoia)&lt;/li&gt;
&lt;li&gt;Generic Linux machines (running Ubuntu 22.04)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OM1 &lt;em&gt;should&lt;/em&gt; run on other platforms (such as Windows) and single-board computers such as the Raspberry Pi 5 16GB.&lt;/p&gt;
&lt;h2 id=&#34;full-autonomy-guidance&#34;&gt;Full Autonomy Guidance
&lt;/h2&gt;&lt;p&gt;We&amp;rsquo;re excited to introduce &lt;strong&gt;full autonomy mode&lt;/strong&gt;, where three services work together in a loop without manual intervention:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;om1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;unitree_go2_ros2_sdk&lt;/strong&gt; – A ROS 2 package that provides SLAM (Simultaneous Localization and Mapping) capabilities for the Unitree Go2 robot using an RPLiDAR sensor, the SLAM Toolbox, and the Nav2 stack.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;om1-avatar&lt;/strong&gt; – A modern React-based frontend application that provides the user interface and avatar display system for OM1 robotics software.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;intro-to-backpack&#34;&gt;Intro to Backpack
&lt;/h2&gt;&lt;p&gt;From research to real-world autonomy, Backpack is a platform that learns, moves, and builds with you.
We&amp;rsquo;ll shortly be releasing the &lt;strong&gt;BOM&lt;/strong&gt; (bill of materials) and &lt;strong&gt;DIY&lt;/strong&gt; build details for it.
Stay tuned!&lt;/p&gt;
&lt;p&gt;Clone the following repos:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenMind/OM1.git&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/OpenMind/OM1.git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenMind/unitree_go2_ros2_sdk.git&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/OpenMind/unitree_go2_ros2_sdk.git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenMind/OM1-avatar.git&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/OpenMind/OM1-avatar.git&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;starting-the-system&#34;&gt;Starting the System
&lt;/h2&gt;&lt;p&gt;To start all services, run the following commands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For OM1&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; OM1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose up om1 -d --no-build
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;For unitree_go2_ros2_sdk&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; unitree_go2_ros2_sdk
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose up orchestrator -d --no-build
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose up om1_sensor -d --no-build
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose up watchdog -d --no-build
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;For OM1-avatar&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; OM1-avatar
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker-compose up om1_avatar -d --no-build
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;detailed-documentation&#34;&gt;Detailed Documentation
&lt;/h2&gt;&lt;p&gt;More detailed documentation can be accessed at &lt;a class=&#34;link&#34; href=&#34;https://docs.openmind.org/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;docs.openmind.org&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;p&gt;Please make sure to read the &lt;a class=&#34;link&#34; href=&#34;./CONTRIBUTING.md&#34; &gt;Contributing Guide&lt;/a&gt; before making a pull request.&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;p&gt;This project is licensed under the MIT License, a permissive and widely used free software license known for its simplicity and flexibility. It allows anyone to freely use, modify, and distribute the software, encouraging collaboration and redistribution.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>ten-framework</title>
        <link>https://producthunt.programnotes.cn/en/p/ten-framework/</link>
        <pubDate>Fri, 19 Sep 2025 15:27:39 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/ten-framework/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1557547190-89ae9e79327a?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTgyNjY3ODh8&amp;ixlib=rb-4.1.0" alt="Featured image of post ten-framework" /&gt;&lt;h1 id=&#34;ten-frameworkten-framework&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN-framework/ten-framework&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt; &lt;a name=&#34;readme-top&#34;&gt;&lt;/a&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/7c8f72d7-3993-4d01-8504-b71578a22944&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN banner&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/v/release/ten-framework/ten-framework?color=369eff&amp;amp;labelColor=gray&amp;amp;logo=github&amp;amp;style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN Releases&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/release-date/ten-framework/ten-framework?labelColor=gray&amp;amp;style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/discussions/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/discussions/TEN-framework/ten_framework?labelColor=gray&amp;amp;color=%20%23f79009&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Discussion posts&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/graphs/commit-activity&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/commit-activity/m/TEN-framework/ten_framework?labelColor=gray&amp;amp;color=pink&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Commits&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/issues-search?query=repo%3ATEN-framework%2Ften-framework%20is%3Aclosed&amp;amp;label=issues%20closed&amp;amp;labelColor=gray&amp;amp;color=green&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Issues closed&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/graphs/contributors&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/contributors/ten-framework/ten-framework?color=c4f042&amp;amp;labelColor=gray&amp;amp;style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/pulls&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/PRs-welcome!-brightgreen.svg?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PRs Welcome&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten_framework/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/License-Apache_2.0_with_certain_conditions-blue.svg?labelColor=%20%23155EEF&amp;amp;color=%20%23528bff&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub license&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://deepwiki.com/TEN-framework/TEN-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://deepwiki.com/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Ask DeepWiki&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://readmex.com/TEN-framework/ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://raw.githubusercontent.com/CodePhiliaX/resource-trusteeship/main/readmex.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;ReadmeX&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://GitHub.com/TEN-framework/ten_framework/watchers/?WT.mc_id=academic-105485-koreyst&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/watchers/TEN-framework/ten_framework?style=social&amp;amp;label=Watch&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub watchers&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://GitHub.com/TEN-framework/ten_framework/network/?WT.mc_id=academic-105485-koreyst&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/forks/TEN-framework/ten_framework?style=social&amp;amp;label=Fork&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub forks&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://GitHub.com/TEN-framework/ten_framework/stargazers/?WT.mc_id=academic-105485-koreyst&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/stars/TEN-framework/ten_framework?style=social&amp;amp;label=Star&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub stars&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/README.md&#34;&gt;&lt;img alt=&#34;README in English&#34; src=&#34;https://img.shields.io/badge/English-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-CN.md&#34;&gt;&lt;img alt=&#34;简体中文操作指南&#34; src=&#34;https://img.shields.io/badge/简体中文-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-JP.md&#34;&gt;&lt;img alt=&#34;日本語のREADME&#34; src=&#34;https://img.shields.io/badge/日本語-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-KR.md&#34;&gt;&lt;img alt=&#34;README in 한국어&#34; src=&#34;https://img.shields.io/badge/한국어-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-ES.md&#34;&gt;&lt;img alt=&#34;README en Español&#34; src=&#34;https://img.shields.io/badge/Español-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-FR.md&#34;&gt;&lt;img alt=&#34;README en Français&#34; src=&#34;https://img.shields.io/badge/Français-lightgrey&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/TEN-framework/ten-framework/blob/main/docs/README-IT.md&#34;&gt;&lt;img alt=&#34;README in Italiano&#34; src=&#34;https://img.shields.io/badge/Italiano-lightgrey&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://theten.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Official Site&lt;/a&gt;
•
&lt;a class=&#34;link&#34; href=&#34;https://theten.ai/docs/ten_agent/overview&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Documentation&lt;/a&gt;
•
&lt;a class=&#34;link&#34; href=&#34;https://theten.ai/blog&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Blog&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://trendshift.io/repositories/11978&#34; target=&#34;_blank&#34;&gt;&lt;img src=&#34;https://trendshift.io/api/badge/repositories/11978&#34; alt=&#34;TEN-framework%2Ften_framework | Trendshift&#34; style=&#34;width: 250px; height: 55px;&#34; width=&#34;250&#34; height=&#34;55&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;details&gt;
  &lt;summary&gt;&lt;kbd&gt;Table of Contents&lt;/kbd&gt;&lt;/summary&gt;
&lt;h4 id=&#34;table-of-contents&#34;&gt;Table of Contents
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-welcome-to-ten&#34; &gt;👋 Welcome to TEN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-tman-designer&#34; &gt;🎨 TMAN Designer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-features&#34; &gt;✨ Features&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#1%ef%b8%8f%e2%83%a3-real-time-avatar&#34; &gt;1️⃣ Real-time Avatar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#2%ef%b8%8f%e2%83%a3-real-time-voice-with-mcp-servers&#34; &gt;2️⃣ Real-time voice with MCP servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#3%ef%b8%8f%e2%83%a3-real-time-communication-with-hardware&#34; &gt;3️⃣ Real-time communication with hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#4%ef%b8%8f%e2%83%a3-real-time-vision-and-real-time-screenshare-detection&#34; &gt;4️⃣ Real-time vision and real-time screenshare detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#5%ef%b8%8f%e2%83%a3-ten-with-other-llm-platforms&#34; &gt;5️⃣ TEN with other LLM platforms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#6%ef%b8%8f%e2%83%a3-storyteller---ten-image-generation&#34; &gt;6️⃣ StoryTeller - TEN image generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-get-ten-agent-up-and-running&#34; &gt;👩‍💻 Get TEN Agent up and running&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#%f0%9f%85%b0%ef%b8%8f-run-ten-agent-in-localhost&#34; &gt;🅰️ Run TEN Agent in &lt;code&gt;localhost&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#%f0%9f%85%b1%ef%b8%8f-run-ten-agent-in-codespaceno-docker&#34; &gt;🅱️ Run TEN Agent in Codespace (no Docker)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#%ef%b8%8f-ten-agent-self-hosting&#34; &gt;🛳️ TEN Agent Self Hosting&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#%f0%9f%85%b0%ef%b8%8f-deploying-with-docker&#34; &gt;🅰️ Deploying with Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#%f0%9f%85%b1%ef%b8%8f-deploying-with-other-cloud-services&#34; &gt;🅱️ Deploying with other cloud services&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-ten-ecosystem&#34; &gt;🌍 TEN Ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-ask-questions&#34; &gt;❓ Ask Questions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#-contributing&#34; &gt;🥰 Contributing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#code-contributors&#34; &gt;Code Contributors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#contribution-guidelines&#34; &gt;Contribution Guidelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#license&#34; &gt;License&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br/&gt;
&lt;/details&gt;
&lt;h2 id=&#34;-welcome-to-ten&#34;&gt;👋 Welcome to TEN
&lt;/h2&gt;&lt;p&gt;TEN is a comprehensive open-source ecosystem for creating, customizing, and deploying real-time conversational AI agents with multimodal capabilities including voice, vision, and avatar interactions.&lt;/p&gt;
&lt;p&gt;TEN includes &lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN Framework&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten-turn-detection&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN Turn Detection&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten-vad&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN VAD&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/tree/main/ai_agents/demo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN Agent&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/tree/main/core/src/ten_manager/designer_frontend&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TMAN Designer&lt;/a&gt;, and &lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/portal&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN Portal&lt;/a&gt;. Check out &lt;a class=&#34;link&#34; href=&#34;#-ten-ecosystem&#34; &gt;🌍 TEN Ecosystem&lt;/a&gt; for more details.&lt;/p&gt;
&lt;br&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Community Channel&lt;/th&gt;
          &lt;th&gt;Purpose&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://twitter.com/intent/follow?screen_name=TenFramework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/twitter/follow/TenFramework?logo=X&amp;amp;color=%20%23f5f5f5&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Follow on X&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Follow TEN Framework on X for updates and announcements&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.linkedin.com/company/ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://custom-icon-badges.demolab.com/badge/LinkedIn-TEN_Framework-0A66C2?logo=linkedin-white&amp;amp;logoColor=fff&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Follow on LinkedIn&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Follow TEN Framework on LinkedIn for updates and announcements&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://discord.gg/VnPftUzAMJ&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Discord-Join%20TEN%20Community-5865F2?style=flat&amp;amp;logo=discord&amp;amp;logoColor=white&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Discord TEN Community&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Join our Discord community to connect with developers&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/TEN-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Hugging%20Face-TEN%20Framework-yellow?style=flat&amp;amp;logo=huggingface&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Hugging Face Space&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Join our Hugging Face community to explore our spaces and models&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-agent/discussions/170&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/TEN_Framework-WeChat_Group-%2307C160?logo=wechat&amp;amp;labelColor=darkgreen&amp;amp;color=gray&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;WeChat&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;Join our WeChat group for Chinese community discussions&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Star TEN Repositories&lt;/strong&gt; ⭐️&lt;/p&gt;
&lt;p&gt;Get instant notifications for new releases and updates. Your support helps us grow and improve TEN!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/eeebe996-8c14-4bf7-82ae-f1a1f7e30705&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN star us gif&#34;
	
	
&gt;&lt;/p&gt;
&lt;br&gt;
&lt;details&gt;
  &lt;summary&gt;&lt;kbd&gt;Star History&lt;/kbd&gt;&lt;/summary&gt;
  &lt;picture&gt;
    &lt;img width=&#34;100%&#34; src=&#34;https://api.star-history.com/svg?repos=ten-framework/ten-framework&amp;type=Date&#34;&gt;
  &lt;/picture&gt;
&lt;/details&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-tman-designer&#34;&gt;🎨 TMAN Designer
&lt;/h2&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/user-attachments/assets/44c6a087-ec7a-45b0-a084-dab5dac5e36b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/user-attachments/assets/44c6a087-ec7a-45b0-a084-dab5dac5e36b&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;tman-designer&#34;&gt;TMAN Designer
&lt;/h3&gt;&lt;p&gt;TMAN Designer is a low/no-code option to create voice agents with an easy-to-use workflow UI. It can load apps and graphs, and includes an online editor, log viewer, and much more.&lt;/p&gt;
&lt;p&gt;Check out &lt;a class=&#34;link&#34; href=&#34;https://theten.ai/blog/tman-designer-of-ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this blog&lt;/a&gt; for more details.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-features&#34;&gt;✨ Features
&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/c6702995-de94-4d3e-8cae-af097f087ac1&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN Agent with Trulience&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;1-real-time-avatar&#34;&gt;1️⃣ Real-time Avatar
&lt;/h3&gt;&lt;p&gt;Build engaging AI avatars with TEN Agent using &lt;a class=&#34;link&#34; href=&#34;https://trulience.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Trulience&lt;/a&gt;&amp;rsquo;s diverse collection of free avatar options. To get it up and running, you only need two steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Follow the README to finish setting up and running the Playground&lt;/li&gt;
&lt;li&gt;Enter the avatar ID and &lt;a class=&#34;link&#34; href=&#34;https://trulience.com/docs#/authentication/jwt-tokens/jwt-tokens?id=use-your-custom-userid&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;token&lt;/a&gt; you get from &lt;a class=&#34;link&#34; href=&#34;https://trulience.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Trulience&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/afb77ad3-9c23-452f-b870-216687779017&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN with MCP servers&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;2-real-time-voice-with-mcp-servers&#34;&gt;2️⃣ Real-time voice with MCP servers
&lt;/h3&gt;&lt;p&gt;TEN Agent now integrates seamlessly with MCP servers, expanding its LLM capabilities. To get started:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open the Module Picker in Playground&lt;/li&gt;
&lt;li&gt;Add the MCP server tool for LLM integration&lt;/li&gt;
&lt;li&gt;Paste a URL from your MCP server in the extension&lt;/li&gt;
&lt;li&gt;Start a realtime conversation with TEN Agent&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This integration allows you to leverage MCP&amp;rsquo;s diverse server offerings while maintaining TEN Agent&amp;rsquo;s powerful conversational abilities.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/user-attachments/assets/78647eef-2d66-44e6-99a8-1918a940fb9f&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/user-attachments/assets/78647eef-2d66-44e6-99a8-1918a940fb9f&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;3-real-time-communication-with-hardware&#34;&gt;3️⃣ Real-time communication with hardware
&lt;/h3&gt;&lt;p&gt;TEN Agent is now running on the Espressif ESP32-S3 Korvo V3 development board, an excellent way to integrate realtime communication with LLMs on hardware.&lt;/p&gt;
&lt;p&gt;Check out the &lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/tree/main/ai_agents/esp32-client&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;integration guide&lt;/a&gt; for more details.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/a1addb02-a450-47be-8cb2-d25e3b574f53&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Real-time Vision&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;4-real-time-vision-and-real-time-screenshare-detection&#34;&gt;4️⃣ Real-time vision and real-time screenshare detection
&lt;/h3&gt;&lt;p&gt;Try the Google Gemini Multimodal Live API with realtime vision and realtime screenshare detection. It ships as a ready-to-use extension, with powerful tools like Weather Check and Web Search integrated into TEN Agent.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/234ff443-bef8-4cc4-9a10-09d6ec3f5bc1&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN with Dify&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;5-ten-with-other-llm-platforms&#34;&gt;5️⃣ TEN with other LLM platforms
&lt;/h3&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://theten.ai/docs/ten_agent/playground/use-cases/voice-assistant/run_dify#steps&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TEN Agent + Dify&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;TEN also offers great support for making the realtime interactive experience even better on other LLM platforms; check out the docs for more.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;p&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/fe28a549-ddb9-431e-9282-57539fb87371&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN StoryTeller&#34;
	
	
&gt;&lt;/p&gt;
&lt;h3 id=&#34;6-storyteller---ten-image-generation&#34;&gt;6️⃣ StoryTeller - TEN image generation
&lt;/h3&gt;&lt;p&gt;Experience real-time image generation with StoryTeller. It ships as a ready-to-use extension, with powerful tools like Weather Check and Web Search integrated into TEN.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-get-ten-agent-up-and-running&#34;&gt;👩‍💻 Get TEN Agent up and running
&lt;/h2&gt;&lt;h4 id=&#34;-run-ten-agent-in-localhost&#34;&gt;🅰️ Run TEN Agent in localhost
&lt;/h4&gt;&lt;h4 id=&#34;step----prerequisites&#34;&gt;Step ⓵ - Prerequisites
&lt;/h4&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Category&lt;/th&gt;
          &lt;th&gt;Requirements&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Keys&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;• Agora &lt;a class=&#34;link&#34; href=&#34;https://docs.agora.io/en/video-calling/get-started/manage-agora-account?platform=web#create-an-agora-project&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;App ID&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://docs.agora.io/en/video-calling/get-started/manage-agora-account?platform=web#create-an-agora-project&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;App Certificate&lt;/a&gt; (free minutes every month) &lt;br&gt;• &lt;a class=&#34;link&#34; href=&#34;https://openai.com/index/openai-api/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAI&lt;/a&gt; API key (or any LLM compatible with the OpenAI API)&lt;br&gt;• &lt;a class=&#34;link&#34; href=&#34;https://deepgram.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Deepgram&lt;/a&gt; ASR (free credits available with signup)&lt;br&gt;• &lt;a class=&#34;link&#34; href=&#34;https://elevenlabs.io/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Elevenlabs&lt;/a&gt; TTS (free credits available with signup)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;• &lt;a class=&#34;link&#34; href=&#34;https://www.docker.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Docker&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://docs.docker.com/compose/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Docker Compose&lt;/a&gt;&lt;br&gt;• &lt;a class=&#34;link&#34; href=&#34;https://nodejs.org/en&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Node.js (LTS) v18&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Minimum System Requirements&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;• CPU &amp;gt;= 2 cores&lt;br&gt;• RAM &amp;gt;= 4 GB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;macOS: Docker setting on Apple Silicon&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Uncheck &amp;ldquo;Use Rosetta for x86/amd64 emulation&amp;rdquo; in Docker settings. This may result in slower build times on ARM, but performance will be normal when deployed to x64 servers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;h4 id=&#34;step----build-agent-in-vm&#34;&gt;Step ⓶ - Build agent in VM
&lt;/h4&gt;&lt;h5 id=&#34;1-clone-down-the-repocd-to-ai-agents-and-create-env-file-from-envexample&#34;&gt;1. Clone the repo, &lt;code&gt;cd&lt;/code&gt; to &lt;code&gt;ai_agents&lt;/code&gt;, and create a &lt;code&gt;.env&lt;/code&gt; file from &lt;code&gt;.env.example&lt;/code&gt;
&lt;/h5&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/TEN-framework/ten-framework.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ten-framework/ai_agents
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cp ./.env.example ./.env
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;2-setup-agora-app-id-and-app-certificate-in-env&#34;&gt;2. Set up the Agora App ID and App Certificate in &lt;code&gt;.env&lt;/code&gt;
&lt;/h5&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AGORA_APP_ID&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AGORA_APP_CERTIFICATE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;3-start-agent-development-containers&#34;&gt;3. Start agent development containers
&lt;/h5&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose up -d
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;4-enter-container&#34;&gt;4. Enter container
&lt;/h5&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; -it ten_agent_dev bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;5-build-agent-with-the-default-graph--5min---8min&#34;&gt;5. Build the agent with the default &lt;code&gt;graph&lt;/code&gt; (~5-8 min)
&lt;/h5&gt;&lt;p&gt;Check the &lt;code&gt;/examples&lt;/code&gt; folder for more examples.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# use the chained voice assistant&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;task use &lt;span class=&#34;nv&#34;&gt;AGENT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;voice-assistant
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# or use the speech-to-speech voice assistant realtime&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;task use &lt;span class=&#34;nv&#34;&gt;AGENT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;voice-assistant-realtime
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h5 id=&#34;6-start-the-web-server&#34;&gt;6. Start the web server
&lt;/h5&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Run task build if you changed any local source code; this is necessary for languages that require compilation, such as TypeScript or Go.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;task build
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;task run
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;br&gt;
&lt;h4 id=&#34;step----customize-your-agent-with-tman-designer&#34;&gt;Step ⓷ - Customize your agent with TMAN Designer
&lt;/h4&gt;&lt;ol&gt;
&lt;li&gt;Open &lt;a class=&#34;link&#34; href=&#34;http://localhost:49483&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;localhost:49483&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Right-click on the STT, LLM, and TTS extensions.&lt;/li&gt;
&lt;li&gt;Open their properties and enter the API keys for each.&lt;/li&gt;
&lt;li&gt;Right-click the canvas and select &amp;lsquo;Manage Apps&amp;rsquo; to open the Apps Manager.&lt;/li&gt;
&lt;li&gt;Under Actions, click ▶ to run the app.&lt;/li&gt;
&lt;li&gt;Check the &amp;lsquo;Run with TEN Agent&amp;rsquo; option and click the Run button.&lt;/li&gt;
&lt;/ol&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h3 id=&#34;-run-ten-agent-in-codespaceno-docker&#34;&gt;🅱️ Run TEN Agent in Codespace (no Docker)
&lt;/h3&gt;&lt;p&gt;GitHub offers a free Codespace for each repository, so you can run the playground in a Codespace without using Docker. A Codespace is also often much faster than running on localhost.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://codespaces.new/ten-framework/ten-agent&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/codespaces/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Check out &lt;a class=&#34;link&#34; href=&#34;https://theten.ai/docs/ten_agent/setup_development_env/setting_up_development_inside_codespace&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this guide&lt;/a&gt; for more details.&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-ten-agent-self-hosting&#34;&gt;🛳️ TEN Agent Self Hosting
&lt;/h2&gt;&lt;h4 id=&#34;-deploying-with-docker&#34;&gt;🅰️ Deploying with Docker
&lt;/h4&gt;&lt;p&gt;Once you have customized your agent (either by using the TMAN Manager, Playground, or editing &lt;code&gt;property.json&lt;/code&gt; directly), you can deploy it by creating a release Docker image for your service.&lt;/p&gt;
&lt;p&gt;Read the &lt;a class=&#34;link&#34; href=&#34;https://theten.ai/docs/ten_agent/deploy_ten_agent/deploy_agent_service&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Deployment Guide&lt;/a&gt; for detailed information about deployment.&lt;/p&gt;
&lt;br&gt;
&lt;h4 id=&#34;-deploying-with-other-cloud-services&#34;&gt;🅱️ Deploying with other cloud services
&lt;/h4&gt;&lt;p&gt;&lt;em&gt;coming soon&lt;/em&gt;&lt;/p&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-ten-ecosystem&#34;&gt;🌍 TEN Ecosystem
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Project&lt;/th&gt;
          &lt;th&gt;Preview&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten_framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;🏚️ TEN Framework&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;TEN is an open-source framework for real-time, multimodal conversational AI.&lt;br&gt;&lt;br&gt;&lt;img src=&#34;https://img.shields.io/github/stars/ten-framework/ten_framework?color=ffcb47&amp;amp;labelColor=gray&amp;amp;style=flat-square&amp;amp;logo=github&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/7c8f72d7-3993-4d01-8504-b71578a22944&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten-turn-detection&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;️🔂 TEN Turn Detection&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;TEN Turn Detection enables full-duplex dialogue communication.&lt;br&gt;&lt;br&gt;&lt;img src=&#34;https://img.shields.io/github/stars/ten-framework/ten-turn-detection?color=ffcb47&amp;amp;labelColor=gray&amp;amp;style=flat-square&amp;amp;logo=github&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/8d0ec716-5d0e-43e4-ad9a-d97b17305658&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/ten-vad&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;🔉 TEN VAD&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;TEN VAD is a low-latency, lightweight and high-performance streaming voice activity detector (VAD).&lt;br&gt;&lt;br&gt;&lt;img src=&#34;https://img.shields.io/github/stars/ten-framework/ten-vad?color=ffcb47&amp;amp;labelColor=gray&amp;amp;style=flat-square&amp;amp;logo=github&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/d45870e4-9453-4047-8163-08737f82863f&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/tree/main/ai_agents&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;🎙️ TEN Agent&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;TEN Agent is a showcase of the TEN Framework.&lt;br&gt;&lt;br&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/38de2207-939b-4702-a0aa-04491f5b5275&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-framework/tree/main/core/src/ten_manager/designer_frontend&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;🎨 TMAN Designer&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;TMAN Designer is a low/no-code option for building a voice agent with an easy-to-use workflow UI.&lt;br&gt;&lt;br&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/804c3543-0a47-42b7-b40b-ef32b742fb8f&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ten-framework/portal&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;strong&gt;📒 TEN Portal&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;The official site of the TEN framework, with documentation and the blog.&lt;br&gt;&lt;br&gt;&lt;img src=&#34;https://img.shields.io/github/stars/ten-framework/portal?color=ffcb47&amp;amp;labelColor=gray&amp;amp;style=flat-square&amp;amp;logo=github&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td&gt;&lt;img src=&#34;https://github.com/user-attachments/assets/e17d8aaa-5928-45dd-ac71-814928e26a89&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;h2 id=&#34;-ask-questions&#34;&gt;❓ Ask Questions
&lt;/h2&gt;&lt;p&gt;TEN Framework is available on these AI-powered Q&amp;amp;A platforms. They can help you find answers quickly and accurately in multiple languages, covering everything from basic setup to advanced implementation details.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Service&lt;/th&gt;
          &lt;th&gt;Link&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;DeepWiki&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://deepwiki.com/TEN-framework/TEN-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://deepwiki.com/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Ask DeepWiki&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ReadmeX&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://readmex.com/TEN-framework/ten-framework&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://raw.githubusercontent.com/CodePhiliaX/resource-trusteeship/main/readmex.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;ReadmeX&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;br&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;h2 id=&#34;-contributing&#34;&gt;🥰 Contributing
&lt;/h2&gt;&lt;p&gt;We welcome all forms of open-source collaboration! Whether you&amp;rsquo;re fixing bugs, adding features, improving documentation, or sharing ideas - your contributions help advance personalized AI tools. Check out our GitHub Issues and Projects to find ways to contribute and show your skills. Together, we can build something amazing!&lt;/p&gt;
&lt;br&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;All kinds of contributions are welcome&lt;/strong&gt; 🙏&lt;/p&gt;
&lt;p&gt;Join us in building TEN better! Every contribution makes a difference, from code to documentation. Share your TEN Agent projects on social media to inspire others!&lt;/p&gt;
&lt;p&gt;Connect with one of the TEN maintainers &lt;a class=&#34;link&#34; href=&#34;https://x.com/elliotchen100&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@elliotchen100&lt;/a&gt; on 𝕏 or &lt;a class=&#34;link&#34; href=&#34;https://github.com/cyfyifanchen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@cyfyifanchen&lt;/a&gt; on GitHub for project updates, discussions and collaboration opportunities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;br&gt;
&lt;h3 id=&#34;code-contributors&#34;&gt;Code Contributors
&lt;/h3&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/TEN-framework/ten-agent/graphs/contributors&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://contrib.rocks/image?repo=TEN-framework/ten-agent&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TEN&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;contribution-guidelines&#34;&gt;Contribution Guidelines
&lt;/h3&gt;&lt;p&gt;Contributions are welcome! Please read the &lt;a class=&#34;link&#34; href=&#34;./docs/code-of-conduct/contributing.md&#34; &gt;contribution guidelines&lt;/a&gt; first.&lt;/p&gt;
&lt;h3 id=&#34;license&#34;&gt;License
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The entire TEN framework (except for the folders explicitly listed below) is released under the Apache License, Version 2.0, with additional restrictions. For details, please refer to the &lt;a class=&#34;link&#34; href=&#34;./LICENSE&#34; &gt;LICENSE&lt;/a&gt; file located in the root directory of the TEN framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The components within the &lt;code&gt;packages&lt;/code&gt; directory are released under the Apache License, Version 2.0. For details, please refer to the &lt;code&gt;LICENSE&lt;/code&gt; file located in each package&amp;rsquo;s root directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The third-party libraries used by the TEN framework are listed and described in detail. For more information, please refer to the &lt;a class=&#34;link&#34; href=&#34;./third_party/&#34; &gt;third_party&lt;/a&gt; folder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div align=&#34;right&#34;&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;#readme-top&#34; &gt;&lt;img src=&#34;https://img.shields.io/badge/-Back_to_top-gray?style=flat-square&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
        </item>
        <item>
        <title>transformers</title>
        <link>https://producthunt.programnotes.cn/en/p/transformers/</link>
        <pubDate>Sun, 14 Sep 2025 15:25:45 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/transformers/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1700245481730-d375ad70ff2b?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTc4MzQ2NTB8&amp;ixlib=rb-4.1.0" alt="Featured image of post transformers" /&gt;&lt;h1 id=&#34;huggingfacetransformers&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/transformers&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;huggingface/transformers&lt;/a&gt;
&lt;/h1&gt;&lt;!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the &#34;License&#34;);
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an &#34;AS IS&#34; BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;picture&gt;
    &lt;source media=&#34;(prefers-color-scheme: dark)&#34; srcset=&#34;https://huggingface.co/datasets/huggingface/documentation-images/raw/main/transformers-logo-dark.svg&#34;&gt;
    &lt;source media=&#34;(prefers-color-scheme: light)&#34; srcset=&#34;https://huggingface.co/datasets/huggingface/documentation-images/raw/main/transformers-logo-light.svg&#34;&gt;
    &lt;img alt=&#34;Hugging Face Transformers Library&#34; src=&#34;https://huggingface.co/datasets/huggingface/documentation-images/raw/main/transformers-logo-light.svg&#34; width=&#34;352&#34; height=&#34;59&#34; style=&#34;max-width: 100%;&#34;&gt;
  &lt;/picture&gt;
  &lt;br/&gt;
  &lt;br/&gt;
&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;
    &lt;a href=&#34;https://huggingface.com/models&#34;&gt;&lt;img alt=&#34;Checkpoints on Hub&#34; src=&#34;https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&amp;color=brightgreen&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://circleci.com/gh/huggingface/transformers&#34;&gt;&lt;img alt=&#34;Build&#34; src=&#34;https://img.shields.io/circleci/build/github/huggingface/transformers/main&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/LICENSE&#34;&gt;&lt;img alt=&#34;GitHub&#34; src=&#34;https://img.shields.io/github/license/huggingface/transformers.svg?color=blue&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://huggingface.co/docs/transformers/index&#34;&gt;&lt;img alt=&#34;Documentation&#34; src=&#34;https://img.shields.io/website/http/huggingface.co/docs/transformers/index.svg?down_color=red&amp;down_message=offline&amp;up_message=online&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://github.com/huggingface/transformers/releases&#34;&gt;&lt;img alt=&#34;GitHub release&#34; src=&#34;https://img.shields.io/github/release/huggingface/transformers.svg&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/CODE_OF_CONDUCT.md&#34;&gt;&lt;img alt=&#34;Contributor Covenant&#34; src=&#34;https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg&#34;&gt;&lt;/a&gt;
    &lt;a href=&#34;https://zenodo.org/badge/latestdoi/155220641&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/155220641.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;h4 align=&#34;center&#34;&gt;
    &lt;p&gt;
        &lt;b&gt;English&lt;/b&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_zh-hans.md&#34;&gt;简体中文&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_zh-hant.md&#34;&gt;繁體中文&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_ko.md&#34;&gt;한국어&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_es.md&#34;&gt;Español&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_ja.md&#34;&gt;日本語&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_hd.md&#34;&gt;हिन्दी&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_ru.md&#34;&gt;Русский&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md&#34;&gt;Português&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_te.md&#34;&gt;తెలుగు&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md&#34;&gt;Français&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_de.md&#34;&gt;Deutsch&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_vi.md&#34;&gt;Tiếng Việt&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_ar.md&#34;&gt;العربية&lt;/a&gt; |
        &lt;a href=&#34;https://github.com/huggingface/transformers/blob/main/i18n/README_ur.md&#34;&gt;اردو&lt;/a&gt; |
    &lt;/p&gt;
&lt;/h4&gt;
&lt;h3 align=&#34;center&#34;&gt;
    &lt;p&gt;State-of-the-art pretrained models for inference and training&lt;/p&gt;
&lt;/h3&gt;
&lt;h3 align=&#34;center&#34;&gt;
    &lt;img src=&#34;https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png&#34;/&gt;
&lt;/h3&gt;
&lt;p&gt;Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal models, for both inference and training.&lt;/p&gt;
&lt;p&gt;It centralizes the model definition so that this definition is agreed upon across the ecosystem. &lt;code&gt;transformers&lt;/code&gt; is the
pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training
frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, &amp;hellip;), inference engines (vLLM, SGLang, TGI, &amp;hellip;),
and adjacent modeling libraries (llama.cpp, mlx, &amp;hellip;) which leverage the model definition from &lt;code&gt;transformers&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We pledge to help support new state-of-the-art models and democratize their usage by having their model definition be
simple, customizable, and efficient.&lt;/p&gt;
&lt;p&gt;There are over 1M Transformers &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models?library=transformers&amp;amp;sort=trending&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;model checkpoints&lt;/a&gt; on the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.com/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face Hub&lt;/a&gt; you can use.&lt;/p&gt;
&lt;p&gt;Explore the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hub&lt;/a&gt; today to find a model and use Transformers to help you get started right away.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;p&gt;Transformers works with Python 3.9+, &lt;a class=&#34;link&#34; href=&#34;https://pytorch.org/get-started/locally/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PyTorch&lt;/a&gt; 2.1+, &lt;a class=&#34;link&#34; href=&#34;https://www.tensorflow.org/install/pip&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TensorFlow&lt;/a&gt; 2.6+, and &lt;a class=&#34;link&#34; href=&#34;https://flax.readthedocs.io/en/latest/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Flax&lt;/a&gt; 0.4.1+.&lt;/p&gt;
&lt;p&gt;Create and activate a virtual environment with &lt;a class=&#34;link&#34; href=&#34;https://docs.python.org/3/library/venv.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;venv&lt;/a&gt; or &lt;a class=&#34;link&#34; href=&#34;https://docs.astral.sh/uv/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;uv&lt;/a&gt;, a fast Rust-based Python package and project manager.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# venv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;python&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;m&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;venv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;env&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;source&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;env&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;bin&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;activate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# uv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;uv&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;venv&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;env&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;source&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;env&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;bin&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;activate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Install Transformers in your virtual environment.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# pip&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;transformers[torch]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# uv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;uv&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pip&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;install&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;transformers[torch]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Install Transformers from source if you want the latest changes in the library or are interested in contributing. However, the &lt;em&gt;latest&lt;/em&gt; version may not be stable. Feel free to open an &lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/transformers/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;issue&lt;/a&gt; if you encounter an error.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/huggingface/transformers.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; transformers
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# pip&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install .&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;torch&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# uv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv pip install .&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;torch&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;quickstart&#34;&gt;Quickstart
&lt;/h2&gt;&lt;p&gt;Get started with Transformers right away with the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/transformers/pipeline_tutorial&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pipeline&lt;/a&gt; API. The &lt;code&gt;Pipeline&lt;/code&gt; is a high-level inference class that supports text, audio, vision, and multimodal tasks. It handles preprocessing the input and returns the appropriate output.&lt;/p&gt;
&lt;p&gt;Instantiate a pipeline and specify the model to use for text generation. The model is downloaded and cached so you can easily reuse it. Finally, pass some text to prompt the model.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;text-generation&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Qwen/Qwen2.5-1.5B&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;the secret to baking a really good cake is &amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;generated_text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;the secret to baking a really good cake is 1) to use the right ingredients and 2) to follow the recipe exactly. the recipe for the cake is as follows: 1 cup of sugar, 1 cup of flour, 1 cup of milk, 1 cup of butter, 1 cup of eggs, 1 cup of chocolate chips. if you want to make 2 cakes, how much sugar do you need? To make 2 cakes, you will need 2 cups of sugar.&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To chat with a model, the usage pattern is the same. The only difference is that you need to construct a chat history (the input to &lt;code&gt;Pipeline&lt;/code&gt;) between you and the system.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
You can also chat with a model directly from the command line.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;transformers chat Qwen/Qwen2.5-0.5B-Instruct
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Hey, can you tell me any fun things to do in New York?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;text-generation&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;meta-llama/Meta-Llama-3-8B-Instruct&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;device_map&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;auto&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;response&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;512&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;response&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;generated_text&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Expand the examples below to see how &lt;code&gt;Pipeline&lt;/code&gt; works for different modalities and tasks.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Automatic speech recognition&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;automatic-speech-recognition&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;openai/whisper-large-v3&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; I have a dream that one day this nation will rise up and live out the true meaning of its creed.&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Image classification&lt;/summary&gt;
&lt;h3 align=&#34;center&#34;&gt;
    &lt;a&gt;&lt;img src=&#34;https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png&#34;&gt;&lt;/a&gt;
&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;image-classification&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;facebook/dinov2-small-imagenet1k-1-layer&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;macaw&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.997848391532898&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.0016551691805943847&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;lorikeet&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.00018523589824326336&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;African grey, African gray, Psittacus erithacus&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;s1&#34;&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;7.85409429227002e-05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;quail&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;5.502637941390276e-05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Visual question answering&lt;/summary&gt;
&lt;h3 align=&#34;center&#34;&gt;
    &lt;a&gt;&lt;img src=&#34;https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg&#34;&gt;&lt;/a&gt;
&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;visual-question-answering&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;Salesforce/blip-vqa-base&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;What is in the image?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;answer&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;statue of liberty&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h2 id=&#34;why-should-i-use-transformers&#34;&gt;Why should I use Transformers?
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Easy-to-use state-of-the-art models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High performance on natural language understanding &amp;amp; generation, computer vision, audio, video, and multimodal tasks.&lt;/li&gt;
&lt;li&gt;Low barrier to entry for researchers, engineers, and developers.&lt;/li&gt;
&lt;li&gt;Few user-facing abstractions with just three classes to learn.&lt;/li&gt;
&lt;li&gt;A unified API for using all our pretrained models.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lower compute costs, smaller carbon footprint:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Share trained models instead of training from scratch.&lt;/li&gt;
&lt;li&gt;Reduce compute time and production costs.&lt;/li&gt;
&lt;li&gt;Dozens of model architectures with 1M+ pretrained checkpoints across all modalities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Choose the right framework for every part of a model&amp;rsquo;s lifetime:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Train state-of-the-art models in 3 lines of code (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;Move a single model between PyTorch/JAX/TF2.0 frameworks at will.&lt;/li&gt;
&lt;li&gt;Pick the right framework for training, evaluation, and production.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Easily customize a model or an example to your needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We provide examples for each architecture to reproduce the results published by its original authors.&lt;/li&gt;
&lt;li&gt;Model internals are exposed as consistently as possible.&lt;/li&gt;
&lt;li&gt;Model files can be used independently of the library for quick experiments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
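&lt;p&gt;As a minimal sketch of the &amp;lsquo;3 lines&amp;rsquo; claim above, here is a tiny fine-tuning run with the &lt;code&gt;Trainer&lt;/code&gt; API; the checkpoint, dataset slice, and sequence length are illustrative choices, not requirements of the library.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Illustrative checkpoint and dataset; swap in your own.
tokenizer = AutoTokenizer.from_pretrained(&amp;#34;distilbert-base-uncased&amp;#34;)
model = AutoModelForSequenceClassification.from_pretrained(&amp;#34;distilbert-base-uncased&amp;#34;)
dataset = load_dataset(&amp;#34;imdb&amp;#34;, split=&amp;#34;train[:1%]&amp;#34;)
dataset = dataset.map(lambda x: tokenizer(x[&amp;#34;text&amp;#34;], truncation=True, padding=&amp;#34;max_length&amp;#34;, max_length=128), batched=True)

# The three lines: configure, construct, train.
args = TrainingArguments(output_dir=&amp;#34;out&amp;#34;)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;TrainingArguments&lt;/code&gt; carries the knobs (batch size, learning rate, epochs); the defaults are enough for a smoke test like this.&lt;/p&gt;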
&lt;a target=&#34;_blank&#34; href=&#34;https://huggingface.co/enterprise&#34;&gt;
    &lt;img alt=&#34;Hugging Face Enterprise Hub&#34; src=&#34;https://github.com/user-attachments/assets/247fb16d-d251-4583-96c4-d3d76dda4925&#34;&gt;
&lt;/a&gt;&lt;br&gt;
&lt;h2 id=&#34;why-shouldnt-i-use-transformers&#34;&gt;Why shouldn&amp;rsquo;t I use Transformers?
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;This library is not a modular toolbox of building blocks for neural nets. The code in the model files is deliberately not refactored with additional abstractions, so that researchers can quickly iterate on each model without digging through extra layers of abstractions and files.&lt;/li&gt;
&lt;li&gt;The training API is optimized to work with PyTorch models provided by Transformers. For generic machine learning loops, you should use another library like &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/accelerate&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Accelerate&lt;/a&gt; (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;The &lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/transformers/tree/main/examples&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;example scripts&lt;/a&gt; are only &lt;em&gt;examples&lt;/em&gt;. They may not necessarily work out-of-the-box on your specific use case and you&amp;rsquo;ll need to adapt the code for it to work.&lt;/li&gt;
&lt;/ul&gt;
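&lt;p&gt;To make the second point concrete, a generic loop with Accelerate looks roughly like the sketch below; the tiny model and random data are stand-ins so the example is self-contained.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-py&#34; data-lang=&#34;py&#34;&gt;import torch
from accelerate import Accelerator

# Toy stand-ins so the sketch runs end to end.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

accelerator = Accelerator()
# Accelerate handles device placement and distributed wrappers;
# the loop itself stays plain PyTorch.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

loss_fn = torch.nn.MSELoss()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;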
&lt;h2 id=&#34;100-projects-using-transformers&#34;&gt;100 projects using Transformers
&lt;/h2&gt;&lt;p&gt;Transformers is more than a toolkit for using pretrained models; it&amp;rsquo;s a community of projects built around it and the
Hugging Face Hub. We want Transformers to enable developers, researchers, students, professors, engineers, and anyone
else to build their dream projects.&lt;/p&gt;
&lt;p&gt;To celebrate Transformers reaching 100,000 stars, we wanted to put the spotlight on the
community with the &lt;a class=&#34;link&#34; href=&#34;./awesome-transformers.md&#34; &gt;awesome-transformers&lt;/a&gt; page, which lists 100
incredible projects built with Transformers.&lt;/p&gt;
&lt;p&gt;If you own or use a project that you believe should be part of the list, please open a PR to add it!&lt;/p&gt;
&lt;h2 id=&#34;example-models&#34;&gt;Example models
&lt;/h2&gt;&lt;p&gt;You can test most of our models directly on their &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hub model pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Expand each modality below to see a few example models for various use cases (a usage sketch follows the lists).&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Audio&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Audio classification with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openai/whisper-large-v3-turbo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Whisper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Automatic speech recognition with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/UsefulSensors/moonshine&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Moonshine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Keyword spotting with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/superb/wav2vec2-base-superb-ks&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Wav2Vec2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Speech to speech generation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/kyutai/moshiko-pytorch-bf16&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Moshi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Text to audio with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/facebook/musicgen-large&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MusicGen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Text to speech with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/suno/bark&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Bark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Computer vision&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Automatic mask generation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/facebook/sam-vit-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Depth estimation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/apple/DepthPro-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DepthPro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Image classification with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/facebook/dinov2-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DINO v2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Keypoint detection with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/magic-leap-community/superpoint&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SuperPoint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Keypoint matching with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/magic-leap-community/superglue_outdoor&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SuperGlue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Object detection with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/PekingU/rtdetr_v2_r50vd&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RT-DETRv2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Pose Estimation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/usyd-community/vitpose-base-simple&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VitPose&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Universal segmentation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/shi-labs/oneformer_ade20k_swin_large&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OneFormer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Video classification with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/MCG-NJU/videomae-large&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VideoMAE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Multimodal&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Audio or text to text with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2-Audio-7B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2-Audio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Document question answering with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft/layoutlmv3-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LayoutLMv3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Image or text to text with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen-VL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Image captioning with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Salesforce/blip2-opt-2.7b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BLIP-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;OCR-based document understanding with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/stepfun-ai/GOT-OCR-2.0-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GOT-OCR2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Table question answering with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/tapas-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TAPAS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Unified multimodal understanding and generation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/BAAI/Emu3-Gen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Emu3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Vision to text with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llava-OneVision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Visual question answering with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/llava-hf/llava-1.5-7b-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llava&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Visual referring expression segmentation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft/kosmos-2-patch14-224&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kosmos-2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;NLP&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Masked word completion with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/answerdotai/ModernBERT-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ModernBERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Named entity recognition with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-2-2b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Question answering with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai/Mixtral-8x7B-v0.1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mixtral&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Summarization with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/facebook/bart-large-cnn&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BART&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Translation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google-t5/t5-base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;T5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Text generation with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meta-llama/Llama-3.2-1B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Text classification with &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2.5-0.5B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
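&lt;p&gt;All of the checkpoints above are driven through the same high-level entry point. As a minimal sketch (the input text is just an example), here is the summarization checkpoint from the NLP list loaded via the &lt;code&gt;pipeline()&lt;/code&gt; API:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: drive a checkpoint listed above through the unified pipeline() API.
from transformers import pipeline

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
text = (
    'Transformers provides thousands of pretrained models to perform tasks on '
    'text, vision, and audio, all exposed through a single pipeline entry point.'
)
print(summarizer(text, max_length=30, min_length=10)[0]['summary_text'])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;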
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;p&gt;We now have a &lt;a class=&#34;link&#34; href=&#34;https://www.aclweb.org/anthology/2020.emnlp-demos.6/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;paper&lt;/a&gt; you can cite for the 🤗 Transformers library:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@inproceedings&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;wolf-etal-2020-transformers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Transformers: State-of-the-Art Natural Language Processing&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;booktitle&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;month&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;oct&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;2020&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;address&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Online&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;publisher&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Association for Computational Linguistics&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;url&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;https://www.aclweb.org/anthology/2020.emnlp-demos.6&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;pages&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;38--45&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>kotaemon</title>
        <link>https://producthunt.programnotes.cn/en/p/kotaemon/</link>
        <pubDate>Wed, 10 Sep 2025 15:28:12 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/kotaemon/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1676309973406-1153db3131ee?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTc0ODkyMzB8&amp;ixlib=rb-4.1.0" alt="Featured image of post kotaemon" /&gt;&lt;h1 id=&#34;cinnamonkotaemon&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Cinnamon/kotaemon&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt;
&lt;h1 id=&#34;kotaemon&#34;&gt;kotaemon
&lt;/h1&gt;&lt;p&gt;A clean &amp;amp; customizable open-source RAG UI for chatting with your documents. Built with both end users and
developers in mind.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/Cinnamon/kotaemon/main/docs/images/preview-graph.png&#34; loading=&#34;lazy&#34; alt=&#34;Preview&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://trendshift.io/repositories/11607&#34; target=&#34;_blank&#34;&gt;&lt;img src=&#34;https://trendshift.io/api/badge/repositories/11607&#34; alt=&#34;Cinnamon%2Fkotaemon | Trendshift&#34; style=&#34;width: 250px; height: 55px;&#34; width=&#34;250&#34; height=&#34;55&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/cin-model/kotaemon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Live Demo #1&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/cin-model/kotaemon-demo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Live Demo #2&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;https://cinnamon.github.io/kotaemon/online_install/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Online Install&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/drive/1eTfieec_UOowNizTJA1NjawBJH9y_1nn&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Colab Notebook (Local RAG)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://cinnamon.github.io/kotaemon/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;User Guide&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;https://cinnamon.github.io/kotaemon/development/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Developer Guide&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Feedback&lt;/a&gt; |
&lt;a class=&#34;link&#34; href=&#34;mailto:kotaemon.support@cinnamon.is&#34; &gt;Contact&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.python.org/downloads/release/python-31013/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/python-3.10&amp;#43;-blue.svg&#34; loading=&#34;lazy&#34; alt=&#34;Python 3.10&amp;#43;&#34;&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/psf/black&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/code%20style-black-000000.svg&#34; loading=&#34;lazy&#34; alt=&#34;Code style: black&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://github.com/Cinnamon/kotaemon/pkgs/container/kotaemon&#34; target=&#34;_blank&#34;&gt;
&lt;img src=&#34;https://img.shields.io/badge/docker_pull-kotaemon:latest-brightgreen&#34; alt=&#34;docker pull ghcr.io/cinnamon/kotaemon:latest&#34;&gt;&lt;/a&gt;
&lt;img src=&#34;https://img.shields.io/github/downloads/Cinnamon/kotaemon/total.svg?label=downloads&amp;amp;color=blue&#34; loading=&#34;lazy&#34; alt=&#34;download&#34;&gt;
&lt;a href=&#39;https://huggingface.co/spaces/cin-model/kotaemon-demo&#39;&gt;&lt;img src=&#39;https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue&#39;&gt;&lt;/a&gt;
&lt;a href=&#34;https://hellogithub.com/en/repository/d3141471a0244d5798bc654982b263eb&#34; target=&#34;_blank&#34;&gt;&lt;img src=&#34;https://abroad.hellogithub.com/v1/widgets/recommend.svg?rid=d3141471a0244d5798bc654982b263eb&amp;claim_uid=RLiD9UZ1rEHNaMf&amp;theme=small&#34; alt=&#34;Featured｜HelloGitHub&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;!-- start-intro --&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction
&lt;/h2&gt;&lt;p&gt;This project serves as a functional RAG UI for both end users who want to do QA on their
documents and developers who want to build their own RAG pipeline.
&lt;br&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yml&#34; data-lang=&#34;yml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;+----------------------------------------------------------------------------+&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;| End users&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;Those who use apps built with `kotaemon`.                       |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;| (You use an app like the one in the demo above)                            |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     +----------------------------------------------------------------+     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;|     | Developers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;Those who built with `kotaemon`.                   |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     | (You have `import kotaemon` somewhere in your project)         |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     |     +----------------------------------------------------+     |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;|     |     | Contributors&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;Those who make `kotaemon` better.    |     |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     |     | (You make PR to this repo)                         |     |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     |     +----------------------------------------------------+     |     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;|     +----------------------------------------------------------------+     |&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;l&#34;&gt;+----------------------------------------------------------------------------+&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;for-end-users&#34;&gt;For end users
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Clean &amp;amp; Minimalistic UI&lt;/strong&gt;: A user-friendly interface for RAG-based QA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for Various LLMs&lt;/strong&gt;: Compatible with LLM API providers (OpenAI, AzureOpenAI, Cohere, etc.) and local LLMs (via &lt;code&gt;ollama&lt;/code&gt; and &lt;code&gt;llama-cpp-python&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easy Installation&lt;/strong&gt;: Simple scripts to get you started quickly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;for-developers&#34;&gt;For developers
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Framework for RAG Pipelines&lt;/strong&gt;: Tools to build your own RAG-based document QA pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customizable UI&lt;/strong&gt;: See your RAG pipeline in action with the provided UI, built with &lt;a href=&#39;https://github.com/gradio-app/gradio&#39;&gt;Gradio &lt;img src=&#39;https://img.shields.io/github/stars/gradio-app/gradio&#39;&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gradio Theme&lt;/strong&gt;: If you use Gradio for development, check out our theme here: &lt;a class=&#34;link&#34; href=&#34;https://github.com/lone17/kotaemon-gradio-theme&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;kotaemon-gradio-theme&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;key-features&#34;&gt;Key Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Host your own document QA (RAG) web-UI&lt;/strong&gt;: Support multi-user login, organize your files in private/public collections, collaborate and share your favorite chat with others.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Organize your LLM &amp;amp; Embedding models&lt;/strong&gt;: Support both local LLMs &amp;amp; popular API providers (OpenAI, Azure, Ollama, Groq).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hybrid RAG pipeline&lt;/strong&gt;: Sane default RAG pipeline with a hybrid (full-text &amp;amp; vector) retriever and re-ranking to ensure the best retrieval quality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-modal QA support&lt;/strong&gt;: Perform question answering over multiple documents, with support for figures and tables. Support multi-modal document parsing (selectable options in the UI).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Advanced citations with document preview&lt;/strong&gt;: By default the system provides detailed citations to ensure the correctness of LLM answers. View your citations (incl. relevance score) directly in the &lt;em&gt;in-browser PDF viewer&lt;/em&gt; with highlights. A warning is shown when the retrieval pipeline returns low-relevance articles.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Support complex reasoning methods&lt;/strong&gt;: Use question decomposition to answer your complex/multi-hop questions. Support agent-based reasoning with &lt;code&gt;ReAct&lt;/code&gt;, &lt;code&gt;ReWOO&lt;/code&gt;, and other agents.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configurable settings UI&lt;/strong&gt;: You can adjust the most important aspects of the retrieval &amp;amp; generation process in the UI (incl. prompts).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extensible&lt;/strong&gt;: Because it is built on Gradio, you are free to customize or add any UI elements you like. We also aim to support multiple strategies for document indexing &amp;amp; retrieval; the &lt;code&gt;GraphRAG&lt;/code&gt; indexing pipeline is provided as an example.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/Cinnamon/kotaemon/main/docs/images/preview.png&#34; loading=&#34;lazy&#34; alt=&#34;Preview&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;If you are not a developer and just want to use the app, please check out our easy-to-follow &lt;a class=&#34;link&#34; href=&#34;https://cinnamon.github.io/kotaemon/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;User Guide&lt;/a&gt;. Download the &lt;code&gt;.zip&lt;/code&gt; file from the &lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/releases/latest&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;latest release&lt;/a&gt; to get all the newest features and bug fixes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;system-requirements&#34;&gt;System requirements
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.python.org/downloads/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Python&lt;/a&gt; &amp;gt;= 3.10&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.docker.com/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Docker&lt;/a&gt;: optional, if you &lt;a class=&#34;link&#34; href=&#34;#with-docker-recommended&#34; &gt;install with Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.unstructured.io/open-source/installation/full-installation#full-installation&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Unstructured&lt;/a&gt; if you want to process files other than &lt;code&gt;.pdf&lt;/code&gt;, &lt;code&gt;.html&lt;/code&gt;, &lt;code&gt;.mhtml&lt;/code&gt;, and &lt;code&gt;.xlsx&lt;/code&gt; documents. Installation steps differ depending on your operating system. Please visit the link and follow the specific instructions provided there.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;with-docker-recommended&#34;&gt;With Docker (recommended)
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We provide both &lt;code&gt;lite&lt;/code&gt; &amp;amp; &lt;code&gt;full&lt;/code&gt; versions of the Docker image. The &lt;code&gt;full&lt;/code&gt; version additionally installs the extra &lt;code&gt;unstructured&lt;/code&gt; packages, which add support for more file types (&lt;code&gt;.doc&lt;/code&gt;, &lt;code&gt;.docx&lt;/code&gt;, &amp;hellip;) at the cost of a larger image size. The &lt;code&gt;lite&lt;/code&gt; image should work well for most users.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To use the &lt;code&gt;full&lt;/code&gt; version:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-e &lt;span class=&#34;nv&#34;&gt;GRADIO_SERVER_NAME&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0.0.0.0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-e &lt;span class=&#34;nv&#34;&gt;GRADIO_SERVER_PORT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;7860&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-v ./ktem_app_data:/app/ktem_app_data &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-p 7860:7860 -it --rm &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ghcr.io/cinnamon/kotaemon:main-full
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To use the &lt;code&gt;full&lt;/code&gt; version with bundled &lt;strong&gt;Ollama&lt;/strong&gt; for &lt;em&gt;local / private RAG&lt;/em&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# change image name to&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run &amp;lt;...&amp;gt; ghcr.io/cinnamon/kotaemon:main-ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To use the &lt;code&gt;lite&lt;/code&gt; version:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;c1&#34;&gt;# change image name to&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; docker run &amp;lt;...&amp;gt; ghcr.io/cinnamon/kotaemon:main-lite
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We currently support and test two platforms: &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt; (for newer Macs). You can specify the platform by passing &lt;code&gt;--platform&lt;/code&gt; in the &lt;code&gt;docker run&lt;/code&gt; command. For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# To run docker with platform linux/arm64&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-e &lt;span class=&#34;nv&#34;&gt;GRADIO_SERVER_NAME&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0.0.0.0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-e &lt;span class=&#34;nv&#34;&gt;GRADIO_SERVER_PORT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;7860&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-v ./ktem_app_data:/app/ktem_app_data &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-p 7860:7860 -it --rm &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--platform linux/arm64 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ghcr.io/cinnamon/kotaemon:main-lite
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once everything is set up correctly, you can go to &lt;code&gt;http://localhost:7860/&lt;/code&gt; to access the WebUI (see the reachability sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We use &lt;a class=&#34;link&#34; href=&#34;https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GHCR&lt;/a&gt; to store docker images, all images can be found &lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/pkgs/container/kotaemon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here.&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
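&lt;p&gt;Once the container is running, you can sanity-check that the WebUI is reachable before opening a browser. A minimal sketch using only the Python standard library (it assumes the &lt;code&gt;-p 7860:7860&lt;/code&gt; mapping from the commands above):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: poll the Gradio port published by the container (-p 7860:7860).
import urllib.request

try:
    with urllib.request.urlopen('http://localhost:7860/', timeout=5) as resp:
        print('WebUI is up, HTTP status', resp.status)
except OSError as exc:
    print('WebUI not reachable yet:', exc)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;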
&lt;h3 id=&#34;without-docker&#34;&gt;Without Docker
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Clone the repo and install the required packages in a fresh Python environment.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# optional (setup env)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda create -n kotaemon &lt;span class=&#34;nv&#34;&gt;python&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;3.10
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate kotaemon
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# clone this repo&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/Cinnamon/kotaemon
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; kotaemon
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;libs/kotaemon[all]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;libs/ktem&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in the root of this project. Use &lt;code&gt;.env.example&lt;/code&gt; as a template.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file serves use cases where users want to pre-configure the models before starting up the app (e.g. deploying the app on HF Hub). It is only used to populate the database once, on the first run; it is not read again on subsequent runs (see the sanity-check sketch after these steps).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;(Optional) To enable the in-browser &lt;code&gt;PDF_JS&lt;/code&gt; viewer, download &lt;a class=&#34;link&#34; href=&#34;https://github.com/mozilla/pdf.js/releases/download/v4.0.379/pdfjs-4.0.379-dist.zip&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF_JS_DIST&lt;/a&gt;, then extract it to &lt;code&gt;libs/ktem/ktem/assets/prebuilt&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;img src=&#34;https://raw.githubusercontent.com/Cinnamon/kotaemon/main/docs/images/pdf-viewer-setup.png&#34; alt=&#34;pdf-setup&#34; width=&#34;300&#34;&gt;
&lt;ol start=&#34;4&#34;&gt;
&lt;li&gt;
&lt;p&gt;Start the web server:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python app.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;The app will be automatically launched in your browser.&lt;/li&gt;
&lt;li&gt;Default username and password are both &lt;code&gt;admin&lt;/code&gt;. You can set up additional users directly through the UI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/Cinnamon/kotaemon/main/docs/images/chat-tab.png&#34; loading=&#34;lazy&#34; alt=&#34;Chat tab&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check the &lt;code&gt;LLMs and Embeddings&lt;/code&gt; section of the &lt;code&gt;Resources&lt;/code&gt; tab and ensure that your &lt;code&gt;api_key&lt;/code&gt; value has been picked up correctly from your &lt;code&gt;.env&lt;/code&gt; file. If it is not set, you can set it there.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
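&lt;p&gt;Before the very first run, it can be useful to verify that your &lt;code&gt;.env&lt;/code&gt; values are actually visible to the process. A minimal sanity-check sketch, assuming the &lt;code&gt;python-dotenv&lt;/code&gt; package (&lt;code&gt;pip install python-dotenv&lt;/code&gt;) and the OpenAI variable names shown later in this guide:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: confirm .env values are readable before the first startup.
# Assumes python-dotenv; the variable names follow the OpenAI example below.
import os

from dotenv import load_dotenv

load_dotenv('.env')  # the app itself only consumes these values on the first run
for key in ('OPENAI_API_KEY', 'OPENAI_CHAT_MODEL', 'OPENAI_EMBEDDINGS_MODEL'):
    print(key, 'set' if os.getenv(key) else 'MISSING')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;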
&lt;h3 id=&#34;setup-graphrag&#34;&gt;Setup GraphRAG
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
Official MS GraphRAG indexing only works with OpenAI or Ollama API.
We recommend that most users use the NanoGraphRAG implementation for straightforward integration with Kotaemon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;details&gt;
&lt;summary&gt;Setup Nano GRAPHRAG&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Install nano-GraphRAG: &lt;code&gt;pip install nano-graphrag&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nano-graphrag&lt;/code&gt; install might introduce version conflicts, see &lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/issues/440&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this issue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;To quickly fix: &lt;code&gt;pip uninstall hnswlib chroma-hnswlib &amp;amp;&amp;amp; pip install chroma-hnswlib&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Launch Kotaemon with the &lt;code&gt;USE_NANO_GRAPHRAG=true&lt;/code&gt; environment variable.&lt;/li&gt;
&lt;li&gt;Set your default LLM &amp;amp; embedding models in the Resources settings; they will be recognized automatically by NanoGraphRAG.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Setup LIGHTRAG&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;Install LightRAG: &lt;code&gt;pip install git+https://github.com/HKUDS/LightRAG.git&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LightRAG&lt;/code&gt; install might introduce version conflicts, see &lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/issues/440&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this issue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;To quickly fix: &lt;code&gt;pip uninstall hnswlib chroma-hnswlib &amp;amp;&amp;amp; pip install chroma-hnswlib&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Launch Kotaemon with the &lt;code&gt;USE_LIGHTRAG=true&lt;/code&gt; environment variable.&lt;/li&gt;
&lt;li&gt;Set your default LLM &amp;amp; embedding models in the Resources settings; they will be recognized automatically by LightRAG.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Setup MS GRAPHRAG&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-Docker Installation&lt;/strong&gt;: If you are not using Docker, install GraphRAG with the following command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install &lt;span class=&#34;s2&#34;&gt;&amp;#34;graphrag&amp;lt;=0.3.6&amp;#34;&lt;/span&gt; future
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Setting Up API KEY&lt;/strong&gt;: To use the GraphRAG retriever feature, ensure you set the &lt;code&gt;GRAPHRAG_API_KEY&lt;/code&gt; environment variable. You can do this directly in your environment or by adding it to a &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Local Models and Custom Settings&lt;/strong&gt;: If you want to use GraphRAG with local models (like &lt;code&gt;Ollama&lt;/code&gt;) or customize the default LLM and other configurations, set the &lt;code&gt;USE_CUSTOMIZED_GRAPHRAG_SETTING&lt;/code&gt; environment variable to true. Then, adjust your settings in the &lt;code&gt;settings.yaml.example&lt;/code&gt; file.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;h3 id=&#34;setup-local-models-for-localprivate-rag&#34;&gt;Setup Local Models (for local/private RAG)
&lt;/h3&gt;&lt;p&gt;See &lt;a class=&#34;link&#34; href=&#34;docs/local_model.md&#34; &gt;Local model setup&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;setup-multimodal-document-parsing-ocr-table-parsing-figure-extraction&#34;&gt;Setup multimodal document parsing (OCR, table parsing, figure extraction)
&lt;/h3&gt;&lt;p&gt;These options are available:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Azure Document Intelligence (API)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Adobe PDF Extract (API)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/DS4SD/docling&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Docling (local, open-source)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;To use Docling, first install the required dependencies: &lt;code&gt;pip install docling&lt;/code&gt; (see the standalone sketch at the end of this section)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Select the corresponding loaders in &lt;code&gt;Settings -&amp;gt; Retrieval Settings -&amp;gt; File loader&lt;/code&gt;.&lt;/p&gt;
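&lt;p&gt;For reference, Docling can also be exercised on its own, outside of Kotaemon&amp;rsquo;s loader wiring. A minimal sketch following the upstream quickstart (the input file name is hypothetical):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch of standalone Docling usage; 'my_report.pdf' is a hypothetical input.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('my_report.pdf')
print(result.document.export_to_markdown())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;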
&lt;h3 id=&#34;customize-your-application&#34;&gt;Customize your application
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;By default, all application data is stored in the &lt;code&gt;./ktem_app_data&lt;/code&gt; folder. You can back up or copy this folder to transfer your installation to a new machine (see the backup sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For advanced users or specific use cases, you can customize these files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;flowsettings.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
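&lt;p&gt;Backing the folder up is an ordinary file copy. A minimal sketch using the Python standard library (the destination name is just an example):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: back up the app-data folder; the destination name is an example.
import shutil
from datetime import date

backup_dir = f'./ktem_app_data_backup_{date.today().isoformat()}'
shutil.copytree('./ktem_app_data', backup_dir)
print('Backed up to', backup_dir)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;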
&lt;h4 id=&#34;flowsettingspy&#34;&gt;&lt;code&gt;flowsettings.py&lt;/code&gt;
&lt;/h4&gt;&lt;p&gt;This file contains the configuration of your application. You can use the example
&lt;a class=&#34;link&#34; href=&#34;flowsettings.py&#34; &gt;here&lt;/a&gt; as a starting point.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Notable settings&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# setup your preferred document store (with full-text search capabilities)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;KH_DOCSTORE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Elasticsearch&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LanceDB&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SimpleFileDocumentStore&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# setup your preferred vectorstore (for vector-based search)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;KH_VECTORSTORE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ChromaDB&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LanceDB&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;InMemory&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Milvus&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Qdrant&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Enable / disable multimodal QA&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;KH_REASONINGS_USE_MULTIMODAL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Setup your new reasoning pipeline or modify existing one.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;KH_REASONINGS&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;ktem.reasoning.simple.FullQAPipeline&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;ktem.reasoning.simple.FullDecomposeQAPipeline&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;ktem.reasoning.react.ReactAgentPipeline&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;ktem.reasoning.rewoo.RewooAgentPipeline&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h4 id=&#34;env&#34;&gt;&lt;code&gt;.env&lt;/code&gt;
&lt;/h4&gt;&lt;p&gt;This file provides another way to configure your models and credentials.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;Configure model via the .env file&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Alternatively, you can configure the models via the &lt;code&gt;.env&lt;/code&gt; file with the information needed to connect to the LLMs. This file is located in the application folder. If you don&amp;rsquo;t see it, you can create one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Currently, the following providers are supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;.env&lt;/code&gt; file, set the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; variable to your OpenAI API key in order to enable access to OpenAI&amp;rsquo;s models. Other variables can also be modified; feel free to edit them to fit your case. Otherwise, the default parameters should work for most people.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OPENAI_API_BASE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;https://api.openai.com/v1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&amp;lt;your OpenAI API key here&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OPENAI_CHAT_MODEL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;gpt-3.5-turbo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OPENAI_EMBEDDINGS_MODEL&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;text-embedding-ada-002
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Azure OpenAI&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For OpenAI models via the Azure platform, you need to provide your Azure endpoint and API key. You might also need to provide the deployment names for the chat model and the embedding model, depending on how you set up your Azure deployment.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AZURE_OPENAI_ENDPOINT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AZURE_OPENAI_API_KEY&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OPENAI_API_VERSION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;2024-02-15-preview
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AZURE_OPENAI_CHAT_DEPLOYMENT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;gpt-35-turbo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;text-embedding-ada-002
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local Models&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Using the &lt;code&gt;ollama&lt;/code&gt; OpenAI-compatible server:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Install &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ollama&lt;/a&gt; and start the application.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pull your model, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama pull llama3.1:8b
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama pull nomic-embed-text
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the model names in the web UI and make them the default:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/Cinnamon/kotaemon/main/docs/images/models.png&#34; loading=&#34;lazy&#34; alt=&#34;Models&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Using &lt;code&gt;GGUF&lt;/code&gt; with &lt;code&gt;llama-cpp-python&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You can search for and download an LLM to run locally from the &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face Hub&lt;/a&gt;. Currently, these model formats are supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;GGUF&lt;/p&gt;
&lt;p&gt;You should choose a model whose size is smaller than your device&amp;rsquo;s available memory, leaving about 2 GB of headroom. For example, if you have 16 GB of RAM in total, of which 12 GB is available, then you should choose a model that takes up at most 10 GB of RAM (see the sketch at the end of this section). Bigger models tend to give better generations but also take more processing time.&lt;/p&gt;
&lt;p&gt;Here are some recommendations and their size in memory:&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q8_0.gguf?download=true&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen1.5-1.8B-Chat-GGUF&lt;/a&gt;: around 2 GB&lt;/p&gt;
&lt;p&gt;Add a new LlamaCpp model with the provided model name in the web UI.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;/li&gt;
&lt;/ul&gt;
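&lt;p&gt;As a back-of-the-envelope check, the GGUF sizing rule above reduces to a single subtraction. A minimal sketch using the example figures from that note:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: largest GGUF size that still leaves ~2 GB of headroom.
available_ram_gb = 12   # e.g. 12 GB free out of 16 GB total, as in the example
headroom_gb = 2         # keep roughly 2 GB for the OS and the app itself
max_model_gb = available_ram_gb - headroom_gb
print(f'Choose a GGUF file of at most {max_model_gb} GB')  # 10 GB here
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;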
&lt;h3 id=&#34;adding-your-own-rag-pipeline&#34;&gt;Adding your own RAG pipeline
&lt;/h3&gt;&lt;h4 id=&#34;custom-reasoning-pipeline&#34;&gt;Custom Reasoning Pipeline
&lt;/h4&gt;&lt;ol&gt;
&lt;li&gt;Check the default pipeline implementation &lt;a class=&#34;link&#34; href=&#34;libs/ktem/ktem/reasoning/simple.py&#34; &gt;here&lt;/a&gt;. You can make quick adjustments to how the default QA pipeline works.&lt;/li&gt;
&lt;li&gt;Add a new &lt;code&gt;.py&lt;/code&gt; implementation in &lt;code&gt;libs/ktem/ktem/reasoning/&lt;/code&gt; and later include it in &lt;code&gt;flowsettings&lt;/code&gt; to enable it in the UI (see the registration sketch below).&lt;/li&gt;
&lt;/ol&gt;
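&lt;p&gt;Registration then amounts to appending your module path to the &lt;code&gt;KH_REASONINGS&lt;/code&gt; list shown in the &lt;code&gt;flowsettings.py&lt;/code&gt; notes above. A minimal sketch (&lt;code&gt;my_pipeline&lt;/code&gt; and &lt;code&gt;MyQAPipeline&lt;/code&gt; are hypothetical names):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# flowsettings.py -- minimal sketch; 'my_pipeline' / 'MyQAPipeline' are hypothetical.
KH_REASONINGS = [
    'ktem.reasoning.simple.FullQAPipeline',     # keep the default pipeline
    'ktem.reasoning.my_pipeline.MyQAPipeline',  # your new .py under libs/ktem/ktem/reasoning/
]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;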
&lt;h4 id=&#34;custom-indexing-pipeline&#34;&gt;Custom Indexing Pipeline
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;Check sample implementation in &lt;code&gt;libs/ktem/ktem/index/file/graph&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;(More instructions WIP.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;!-- end-intro --&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;p&gt;Please cite this project as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-BibTeX&#34; data-lang=&#34;BibTeX&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@misc&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;kotaemon2024&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{Kotaemon - An open-source RAG-based tool for chatting with any content.}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{The Kotaemon Team}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{2024}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;na&#34;&gt;howpublished&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;{\url{https://github.com/Cinnamon/kotaemon}}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;star-history&#34;&gt;Star History
&lt;/h2&gt;&lt;a href=&#34;https://star-history.com/#Cinnamon/kotaemon&amp;Date&#34;&gt;
 &lt;picture&gt;
   &lt;source media=&#34;(prefers-color-scheme: dark)&#34; srcset=&#34;https://api.star-history.com/svg?repos=Cinnamon/kotaemon&amp;type=Date&amp;theme=dark&#34; /&gt;
   &lt;source media=&#34;(prefers-color-scheme: light)&#34; srcset=&#34;https://api.star-history.com/svg?repos=Cinnamon/kotaemon&amp;type=Date&#34; /&gt;
   &lt;img alt=&#34;Star History Chart&#34; src=&#34;https://api.star-history.com/svg?repos=Cinnamon/kotaemon&amp;type=Date&#34; /&gt;
 &lt;/picture&gt;
&lt;/a&gt;
&lt;h2 id=&#34;contribution&#34;&gt;Contribution
&lt;/h2&gt;&lt;p&gt;Since our project is actively being developed, we greatly value your feedback and contributions. Please see our &lt;a class=&#34;link&#34; href=&#34;https://github.com/Cinnamon/kotaemon/blob/main/CONTRIBUTING.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Contributing Guide&lt;/a&gt; to get started. Thank you to all our contributors!&lt;/p&gt;
&lt;a href=&#34;https://github.com/Cinnamon/kotaemon/graphs/contributors&#34;&gt;
  &lt;img src=&#34;https://contrib.rocks/image?repo=Cinnamon/kotaemon&#34; /&gt;
&lt;/a&gt;
</description>
        </item>
        <item>
        <title>MiniCPM-V</title>
        <link>https://producthunt.programnotes.cn/en/p/minicpm-v/</link>
        <pubDate>Tue, 02 Sep 2025 15:29:41 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/minicpm-v/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1638382620941-f5c0628d21bd?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTY3OTgwOTd8&amp;ixlib=rb-4.1.0" alt="Featured image of post MiniCPM-V" /&gt;&lt;h1 id=&#34;openbmbminicpm-v&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenBMB/MiniCPM-V&lt;/a&gt;
&lt;/h1&gt;&lt;div align=&#34;center&#34;&gt;
&lt;p&gt;&lt;img src=&#34;./assets/minicpm_v_and_minicpm_o_title.png&#34; width=&#34;500em&#34; &gt;&lt;/img&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;./README_zh.md&#34; &gt;中文&lt;/a&gt; |
English&lt;/strong&gt;&lt;/p&gt;
&lt;span style=&#34;display: inline-flex; align-items: center; margin-right: 2px;&#34;&gt;
  &lt;img src=&#34;./assets/wechat.png&#34; alt=&#34;WeChat&#34; style=&#34;margin-right: 4px;&#34;&gt;
  &lt;a href=&#34;docs/wechat.md&#34; target=&#34;_blank&#34;&gt; WeChat&lt;/a&gt; &amp;nbsp;|
&lt;/span&gt;
&amp;nbsp;
&lt;span style=&#34;display: inline-flex; align-items: center; margin-left: -8px;&#34;&gt;
&lt;img src=&#34;./assets/discord.png&#34; alt=&#34;Discord&#34; style=&#34;margin-right: 4px;&#34;&gt;
  &lt;a href=&#34;https://discord.gg/rftuRMbqzf&#34; target=&#34;_blank&#34;&gt; Discord&lt;/a&gt; &amp;nbsp;
&lt;/span&gt;
&lt;p align=&#34;center&#34;&gt;
   MiniCPM-V 4.5 &lt;a href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5&#34;&gt;🤗&lt;/a&gt; &lt;a href=&#34;http://101.126.42.235:30910/&#34;&gt;🤖&lt;/a&gt; | MiniCPM-o 2.6 &lt;a href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6&#34;&gt;🤗&lt;/a&gt;  &lt;a href=&#34;https://minicpm-omni-webdemo-us.modelbest.cn/&#34;&gt; 🤖&lt;/a&gt; | &lt;a href=&#34;https://github.com/OpenSQZ/MiniCPM-V-Cookbook&#34;&gt;🍳 Cookbook&lt;/a&gt; | 
  📄 Technical Report (Coming Soon)
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;MiniCPM-V&lt;/strong&gt; is a series of efficient end-side multimodal LLMs (MLLMs), which accept images, videos and text as inputs and deliver high-quality text outputs. &lt;strong&gt;MiniCPM-o&lt;/strong&gt; additionally takes audio as inputs and provides high-quality speech outputs in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming to achieve &lt;strong&gt;strong performance and efficient deployment&lt;/strong&gt;. The most notable models in the series currently include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MiniCPM-V 4.5&lt;/strong&gt;: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model &lt;strong&gt;outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B&lt;/strong&gt; in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings &lt;strong&gt;new features including efficient high-FPS and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing&lt;/strong&gt;. It also advances MiniCPM-V&amp;rsquo;s popular features such as trustworthy behavior, multilingual support and end-side deployability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MiniCPM-o 2.6&lt;/strong&gt;: ⭐️⭐️⭐️ The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model &lt;strong&gt;achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming&lt;/strong&gt;, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 &lt;strong&gt;supports bilingual real-time speech conversation with configurable voices&lt;/strong&gt;, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. Due to its superior token density, MiniCPM-o 2.6 can for the first time &lt;strong&gt;support multimodal live streaming on end-side devices&lt;/strong&gt; such as iPad.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;news&#34;&gt;News &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;h4 id=&#34;-pinned&#34;&gt;📌 Pinned
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;[2025.09.01] ⭐️⭐️⭐️ MiniCPM-V 4.5 has been officially supported by &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/pull/15575&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/vllm-project/vllm/pull/23586&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM&lt;/a&gt;, and &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/pull/9022&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA-Factory&lt;/a&gt;. You are welcome to use it directly through these official channels! Support for additional frameworks such as &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/pull/12078&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/sgl-project/sglang/pull/9610&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SGLang&lt;/a&gt; is actively in progress.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.08.26] 🔥🔥🔥 We open-source MiniCPM-V 4.5, which outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B. It advances popular capabilities of MiniCPM-V, and brings useful new features. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.08.01] ⭐️⭐️⭐️ We open-sourced the &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V &amp;amp; o Cookbook&lt;/a&gt;! It provides comprehensive guides for diverse user scenarios, paired with our new &lt;a class=&#34;link&#34; href=&#34;https://minicpm-o.readthedocs.io/en/latest/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Docs Site&lt;/a&gt; for smoother onboarding.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.06.20] ⭐️⭐️⭐️ Our official &lt;a class=&#34;link&#34; href=&#34;https://ollama.com/openbmb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama repository&lt;/a&gt; is released. Try our latest models with &lt;a class=&#34;link&#34; href=&#34;https://ollama.com/openbmb/minicpm-o2.6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;one click&lt;/a&gt;!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, has been accepted as a CVPR 2025 Highlight! The &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLAIF-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;code&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;dataset&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.17220&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;paper&lt;/a&gt; are open-sourced!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.24] 📢📢📢 MiniCPM-o 2.6 technical report is released! See &lt;a class=&#34;link&#34; href=&#34;https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.19] 📢 &lt;strong&gt;ATTENTION!&lt;/strong&gt; We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama&lt;/a&gt;, and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vllm&lt;/a&gt;. &lt;strong&gt;Using the official repositories before the merge may lead to unexpected issues&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.17] We have updated the usage of the int4 quantized version of MiniCPM-o 2.6 and resolved the model initialization error. Click &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt; and try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.13] 🔥🔥🔥 We open-source MiniCPM-o 2.6, which matches GPT-4o-202405 on vision, speech and multimodal live streaming. It advances popular capabilities of MiniCPM-V 2.6, and supports various new fun features. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.17] 🚀🚀🚀 MiniCPM-V 2.6 is now fully supported by &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggerganov/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;official&lt;/a&gt; llama.cpp! GGUF models of various sizes are available &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.06] 🔥🔥🔥 We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2408.01800&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;. Come and try it out!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;details&gt; 
&lt;summary&gt;Click to view more news.&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;[2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances popular features of MiniCPM-V 2.6, and largely improves the efficiency. We also open-source the iOS App on iPhone and iPad. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by &lt;a class=&#34;link&#34; href=&#34;https://github.com/PKU-Alignment/align-anything&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Align-Anything&lt;/a&gt;, a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.15] We now also support multi-image SFT. For more details, please refer to the &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;document&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.14] MiniCPM-V 2.6 now also supports &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/ms-swift/issues/1613&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;fine-tuning&lt;/a&gt; with the SWIFT framework!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.08.10] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggerganov/llama.cpp&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;official&lt;/a&gt; llama.cpp! GGUF models of various sizes are available &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See &lt;a class=&#34;link&#34; href=&#34;#inference-with-vllm&#34; &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model&amp;rsquo;s layers across them. For more details, check this &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;link&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 now fully supports its feature in llama.cpp and Ollama! Please pull the latest code &lt;strong&gt;of our provided forks&lt;/strong&gt; (&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama&lt;/a&gt;). GGUF models in various sizes are available &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;. MiniCPM-Llama3-V 2.5 series is &lt;strong&gt;not supported by the official repositories yet&lt;/strong&gt;, and we are working hard to merge PRs. Please stay tuned!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.24] We release the MiniCPM-Llama3-V 2.5 &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;gguf&lt;/a&gt;, which supports &lt;a class=&#34;link&#34; href=&#34;#inference-with-llamacpp&#34; &gt;llama.cpp&lt;/a&gt; inference and provides smooth decoding at 6&amp;ndash;8 tokens/s on mobile phones. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.23] 🔍 We&amp;rsquo;ve released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click &lt;a class=&#34;link&#34; href=&#34;./docs/compare_with_phi-3_vision.md&#34; &gt;here&lt;/a&gt; to view more details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.05.20] We open-source MiniCPM-Llama3-V 2.5, which has improved OCR capability and supports 30+ languages, representing the first end-side MLLM to achieve GPT-4V-level performance! We provide &lt;a class=&#34;link&#34; href=&#34;#deployment-on-mobile-phone&#34; &gt;efficient inference&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;./finetune/readme.md&#34; &gt;simple fine-tuning&lt;/a&gt;. Try it now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click &lt;a class=&#34;link&#34; href=&#34;#inference-with-vllm&#34; &gt;here&lt;/a&gt; to view more details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/openbmb/MiniCPM-V-2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.04.17] MiniCPM-V-2.0 supports deploying &lt;a class=&#34;link&#34; href=&#34;#webui-demo&#34; &gt;WebUI Demo&lt;/a&gt; now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.04.15] MiniCPM-V-2.0 now also supports &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2%e6%9c%80%e4%bd%b3%e5%ae%9e%e8%b7%b5.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;fine-tuning&lt;/a&gt; with the SWIFT framework!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on &lt;a href=&#34;https://rank.opencompass.org.cn/leaderboard-multimodal&#34;&gt;OpenCompass&lt;/a&gt;, a comprehensive evaluation over 11 popular benchmarks. Click &lt;a href=&#34;https://openbmb.vercel.app/minicpm-v-2&#34;&gt;here&lt;/a&gt; to view the MiniCPM-V 2.0 technical blog.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.03.14] MiniCPM-V now supports &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v%e6%9c%80%e4%bd%b3%e5%ae%9e%e8%b7%b5.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;fine-tuning&lt;/a&gt; with the SWIFT framework. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/Jintao-Huang&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Jintao&lt;/a&gt; for the contribution!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.03.01] MiniCPM-V can now be deployed on Mac!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;[2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt; 
&lt;h2 id=&#34;contents&#34;&gt;Contents &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#minicpm-v-45&#34; &gt;MiniCPM-V 4.5&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#inference-efficiency&#34; &gt;Inference Efficiency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#minicpm-o-26&#34; &gt;MiniCPM-o 2.6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#minicpm-v--o-cookbook&#34; &gt;MiniCPM-V &amp;amp; o Cookbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#chat-with-our-demo-on-gradio-&#34; &gt;Chat with Our Demo on Gradio 🤗&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#inference&#34; &gt;Inference&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#model-zoo&#34; &gt;Model Zoo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#multi-turn-conversation&#34; &gt;Multi-turn Conversation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#chat-with-multiple-images&#34; &gt;Chat with Multiple Images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#in-context-few-shot-learning&#34; &gt;In-context Few-shot Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#chat-with-video&#34; &gt;Chat with Video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#speech-and-audio-mode&#34; &gt;Speech and Audio Mode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#multimodal-live-streaming&#34; &gt;Multimodal Live Streaming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#inference-on-multiple-gpus&#34; &gt;Inference on Multiple GPUs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#inference-on-mac&#34; &gt;Inference on Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#efficient-inference-with-llamacpp-ollama-vllm&#34; &gt;Efficient Inference with llama.cpp, Ollama, vLLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#fine-tuning&#34; &gt;Fine-tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#awesome-work-using-minicpm-v--minicpm-o&#34; &gt;Awesome work using MiniCPM-V &amp;amp; MiniCPM-o&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#faqs&#34; &gt;FAQs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#limitations&#34; &gt;Limitations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;minicpm-v-45&#34;&gt;MiniCPM-V 4.5
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;MiniCPM-V 4.5&lt;/strong&gt; is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;🔥 &lt;strong&gt;State-of-the-art Vision-Language Capability.&lt;/strong&gt;
MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. &lt;strong&gt;With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B&lt;/strong&gt; for vision-language capabilities, making it the most performant MLLM under 30B parameters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;🎬 &lt;strong&gt;Efficient High-FPS and Long Video Understanding.&lt;/strong&gt; Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means the model can perceive significantly more video frames without increasing LLM inference cost, bringing efficient, state-of-the-art high-FPS (up to 10 FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚙️ &lt;strong&gt;Controllable Hybrid Fast/Deep Thinking.&lt;/strong&gt; MiniCPM-V 4.5 supports both fast thinking, for efficient everyday use with competitive performance, and deep thinking, for more complex problem solving. To cover the efficiency/performance trade-offs of different user scenarios, the fast/deep thinking mode can be switched in a highly controllable fashion (a usage sketch follows this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;💪 &lt;strong&gt;Strong OCR, Document Parsing and Others.&lt;/strong&gt;
Based on &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/pdf/2403.11703&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA-UHD&lt;/a&gt; architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves &lt;strong&gt;leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5&lt;/strong&gt;. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the latest &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLAIF-V/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLAIF-V&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/VisCPM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VisCPM&lt;/a&gt; techniques, it features &lt;strong&gt;trustworthy behaviors&lt;/strong&gt;, outperforming GPT-4o-latest on MMHal-Bench, and supports &lt;strong&gt;multilingual capabilities&lt;/strong&gt; in more than 30 languages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;💫  &lt;strong&gt;Easy Usage.&lt;/strong&gt;
MiniCPM-V 4.5 can be easily used in various ways: (1) &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/ollama/tree/MIniCPM-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ollama&lt;/a&gt; support for efficient CPU inference on local devices, (2) &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;int4&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/AutoAWQ&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AWQ&lt;/a&gt; format quantized models in 16 sizes, (3) &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/sglang/tree/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SGLang&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;#efficient-inference-with-llamacpp-ollama-vllm&#34; &gt;vLLM&lt;/a&gt; support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/transformers/tree/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Transformers&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;./docs/llamafactory_train_and_infer.md&#34; &gt;LLaMA-Factory&lt;/a&gt;, (5) quick &lt;a class=&#34;link&#34; href=&#34;#chat-with-our-demo-on-gradio&#34; &gt;local WebUI demo&lt;/a&gt;, (6) optimized &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/MiniCPM-o-demo-iOS&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;local iOS app&lt;/a&gt; on iPhone and iPad, and (7) online web demo on &lt;a class=&#34;link&#34; href=&#34;http://101.126.42.235:30910/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;server&lt;/a&gt;. See our &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Cookbook&lt;/a&gt; for full usage!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
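&lt;p&gt;For path (4), inference boils down to a single &lt;code&gt;trust_remote_code&lt;/code&gt; chat call. Below is a minimal sketch following the pattern of the Hugging Face model card; the &lt;code&gt;enable_thinking&lt;/code&gt; flag for the fast/deep switch is taken from that pattern and may change between releases, so treat the exact signature as an assumption and check the card:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-V-4_5'
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]

# enable_thinking toggles the hybrid fast/deep thinking mode described above.
answer = model.chat(msgs=msgs, tokenizer=tokenizer, enable_thinking=False)
print(answer)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;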
&lt;h3 id=&#34;key-techniques&#34;&gt;Key Techniques &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;div align=&#34;center&#34;&gt;
&lt;img src=&#34;./assets/minicpm-v-4dot5-framework.png&#34; width=&#34;100%&#34;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Architecture: Unified 3D-Resampler for High-density Video Compression.&lt;/strong&gt; MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens (the arithmetic is sketched after this list). This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pre-training: Unified Learning for OCR and Knowledge from Documents.&lt;/strong&gt; Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.&lt;/strong&gt; MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/RLPR&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLPR&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLAIF-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLAIF-V&lt;/a&gt;, it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
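&lt;p&gt;The quoted 96x ratio can be reproduced with simple patch arithmetic. In the sketch below, the 6-frames-to-64-tokens grouping comes from the description above, while the per-frame baseline of 1,024 raw patch tokens (a 14x14-patch ViT on a 448x448 frame) is our assumption for illustration:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Illustrative arithmetic for the 3D-Resampler's video-token compression.
frame_side = 448
patch_side = 14                                      # assumed ViT patch size
tokens_per_frame = (frame_side // patch_side) ** 2   # 32 * 32 = 1024

frames_per_group = 6      # consecutive frames, jointly compressed
compressed_tokens = 64    # same token budget as a single image

uncompressed = frames_per_group * tokens_per_frame   # 6144 tokens
print(uncompressed / compressed_tokens)              # 96.0, i.e. the 96x rate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;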
&lt;h3 id=&#34;evaluation&#34;&gt;Evaluation  &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;div align=&#34;center&#34;&gt;
  &lt;img src=&#34;./assets/radar_minicpm_v45.png&#34; width=&#34;60%&#34;&gt;
&lt;/div&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;img src=&#34;./assets/minicpmv_4_5_evaluation_result.png&#34; width=&#34;80%&#34;&gt;
&lt;/div&gt;
&lt;h3 id=&#34;inference-efficiency&#34;&gt;Inference Efficiency
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;OpenCompass&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;left&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
            &lt;tr&gt;
              &lt;th align=&#34;left&#34;&gt;Model&lt;/th&gt;
              &lt;th&gt;Size&lt;/th&gt;
              &lt;th&gt;Avg Score ↑&lt;/th&gt;
              &lt;th&gt;Total Inference Time ↓&lt;/th&gt;
            &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GLM-4.1V-9B-Thinking&lt;/td&gt;
            &lt;td&gt;10.3B&lt;/td&gt;
            &lt;td&gt;76.6&lt;/td&gt;
            &lt;td&gt;17.5h&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiMo-VL-7B-RL&lt;/td&gt;
            &lt;td&gt;8.3B&lt;/td&gt;
            &lt;td&gt;76.4&lt;/td&gt;
            &lt;td&gt;11h&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-V 4.5&lt;/td&gt;
            &lt;td&gt;8.7B&lt;/td&gt;
            &lt;td&gt;&lt;b&gt;77.0&lt;/b&gt;&lt;/td&gt;
            &lt;td&gt;&lt;b&gt;7.5h&lt;/b&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Video-MME&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;left&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
          &lt;tr&gt;
              &lt;th align=&#34;left&#34;&gt;Model&lt;/th&gt;
              &lt;th&gt;Size&lt;/th&gt;
              &lt;th&gt;Avg Score ↑&lt;/th&gt;
              &lt;th&gt;Total Inference Time ↓&lt;/th&gt;
              &lt;th&gt;GPU Mem ↓&lt;/th&gt;
          &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
          &lt;tr&gt;
              &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2.5-VL-7B-Instruct&lt;/td&gt;
              &lt;td&gt;8.3B&lt;/td&gt;
              &lt;td&gt;71.6&lt;/td&gt;
              &lt;td&gt;3h&lt;/td&gt;
              &lt;td&gt;60G&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GLM-4.1V-9B-Thinking&lt;/td&gt;
              &lt;td&gt;10.3B&lt;/td&gt;
              &lt;td&gt;&lt;b&gt;73.6&lt;/b&gt;&lt;/td&gt;
              &lt;td&gt;2.63h&lt;/td&gt;
              &lt;td&gt;32G&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-V 4.5&lt;/td&gt;
              &lt;td&gt;8.7B&lt;/td&gt;
              &lt;td&gt;73.5&lt;/td&gt;
              &lt;td&gt;&lt;b&gt;0.26h&lt;/b&gt;&lt;/td&gt;
              &lt;td&gt;&lt;b&gt;28G&lt;/b&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.&lt;/p&gt;
&lt;h3 id=&#34;examples&#34;&gt;Examples  &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;div align=&#34;center&#34;&gt;
  &lt;a href=&#34;https://www.youtube.com/watch?v=Cn23FujYMMU&#34;&gt;&lt;img src=&#34;./assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg&#34; width=&#34;70%&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div style=&#34;display: flex; flex-direction: column; align-items: center;&#34;&gt;
  &lt;img src=&#34;assets/minicpmv4_5/en_case1.png&#34; alt=&#34;en_case1&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
  &lt;img src=&#34;assets/minicpmv4_5/en_case2.png&#34; alt=&#34;en_case2&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
  &lt;img src=&#34;assets/minicpmv4_5/en_case3.jpeg&#34; alt=&#34;en_case3&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
&lt;/div&gt;
&lt;details&gt;
&lt;summary&gt;Click to view more cases.&lt;/summary&gt;
&lt;div style=&#34;display: flex; flex-direction: column; align-items: center;&#34;&gt;
  &lt;img src=&#34;assets/minicpmv4_5/zh_extra.jpeg&#34; alt=&#34;zh_extra&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
&lt;/div&gt;
&lt;/details&gt;
&lt;p&gt;We deploy MiniCPM-V 4.5 on an iPad M4 with our &lt;a class=&#34;link&#34; href=&#34;https://github.com/tc-mb/MiniCPM-o-demo-iOS&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;iOS demo&lt;/a&gt;. The demo video is a raw screen recording without any editing.&lt;/p&gt;
&lt;table align=&#34;center&#34;&gt; 
    &lt;p align=&#34;center&#34;&gt;
      &lt;img src=&#34;assets/minicpmv4_5/v45_en_handwriting.gif&#34; width=45%/&gt;
      &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
      &lt;img src=&#34;assets/minicpmv4_5/v45_en_cot.gif&#34; width=45%/&gt;
    &lt;/p&gt;
    &lt;p align=&#34;center&#34;&gt;
      &lt;img src=&#34;assets/minicpmv4_5/v45_cn_handwriting.gif&#34; width=45%/&gt;
      &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
      &lt;img src=&#34;assets/minicpmv4_5/v45_cn_travel.gif&#34; width=45%/&gt;
    &lt;/p&gt;
&lt;/table&gt;
&lt;h2 id=&#34;minicpm-o-26&#34;&gt;MiniCPM-o 2.6
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;MiniCPM-o 2.6&lt;/strong&gt; is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;🔥 &lt;strong&gt;Leading Visual Capability.&lt;/strong&gt;
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. &lt;strong&gt;With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet&lt;/strong&gt; for single image understanding. It also &lt;strong&gt;outperforms GPT-4V and Claude 3.5 Sonnet&lt;/strong&gt; in multi-image and video understanding, and shows promising in-context learning capability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;🎙 &lt;strong&gt;State-of-the-art Speech Capability.&lt;/strong&gt; MiniCPM-o 2.6 supports &lt;strong&gt;bilingual real-time speech conversation with configurable voices&lt;/strong&gt; in English and Chinese. It &lt;strong&gt;outperforms GPT-4o-realtime on audio understanding tasks&lt;/strong&gt; such as ASR and STT translation, and shows &lt;strong&gt;state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community&lt;/strong&gt;. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;🎬 &lt;strong&gt;Strong Multimodal Live Streaming Capability.&lt;/strong&gt; As a new feature, MiniCPM-o 2.6 can &lt;strong&gt;accept continuous video and audio streams independent of user queries, and support real-time speech interaction&lt;/strong&gt;. It &lt;strong&gt;outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench&lt;/strong&gt;, a comprehensive benchmark for real-time video understanding, omni-source (video &amp;amp; audio) understanding, and multimodal contextual understanding.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;💪 &lt;strong&gt;Strong OCR Capability and Others.&lt;/strong&gt;
Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves &lt;strong&gt;state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405&lt;/strong&gt;.
Based on the latest &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLAIF-V/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLAIF-V&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/VisCPM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VisCPM&lt;/a&gt; techniques, it features &lt;strong&gt;trustworthy behaviors&lt;/strong&gt;, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports &lt;strong&gt;multilingual capabilities&lt;/strong&gt; on more than 30 languages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;🚀 &lt;strong&gt;Superior Efficiency.&lt;/strong&gt;
In addition to its friendly size, MiniCPM-o 2.6 also shows &lt;strong&gt;state-of-the-art token density&lt;/strong&gt; (i.e., the number of pixels encoded into each visual token). &lt;strong&gt;It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models&lt;/strong&gt;. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support &lt;strong&gt;multimodal live streaming&lt;/strong&gt; on end-side devices such as iPads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;💫  &lt;strong&gt;Easy Usage.&lt;/strong&gt;
MiniCPM-o 2.6 can be easily used in various ways: (1) &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp&lt;/a&gt; support for efficient CPU inference on local devices, (2) &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;int4&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF&lt;/a&gt; format quantized models in 16 sizes, (3) &lt;a class=&#34;link&#34; href=&#34;#efficient-inference-with-llamacpp-ollama-vllm&#34; &gt;vLLM&lt;/a&gt; support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with &lt;a class=&#34;link&#34; href=&#34;./docs/llamafactory_train_and_infer.md&#34; &gt;LLaMA-Factory&lt;/a&gt;, (5) quick &lt;a class=&#34;link&#34; href=&#34;#chat-with-our-demo-on-gradio&#34; &gt;local WebUI demo&lt;/a&gt;, and (6) online web demo on &lt;a class=&#34;link&#34; href=&#34;https://minicpm-omni-webdemo-us.modelbest.cn/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;server&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model Architecture.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;End-to-end Omni-modal Architecture.&lt;/strong&gt; Different modality encoders/decoders are connected and trained in an &lt;strong&gt;end-to-end&lt;/strong&gt; fashion to fully exploit rich multimodal knowledge. The model is trained in a fully end-to-end manner with only CE loss.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Omni-modal Live Streaming Mechanism.&lt;/strong&gt; (1) We change the offline modality encoders/decoders into online ones for &lt;strong&gt;streaming inputs/outputs.&lt;/strong&gt; (2) We devise a &lt;strong&gt;time-division multiplexing (TDM) mechanism&lt;/strong&gt; for omni-modality streaming processing in the LLM backbone. It divides the parallel omni-modality streams into a sequential stream within small periodic time slices (a toy sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configurable Speech Modeling Design.&lt;/strong&gt; We devise a multimodal system prompt, including traditional text system prompt, and &lt;strong&gt;a new audio system prompt to determine the assistant voice&lt;/strong&gt;. This enables flexible voice configurations in inference time, and also facilitates end-to-end voice cloning and description-based voice creation.&lt;/li&gt;
&lt;/ul&gt;
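&lt;p&gt;The TDM mechanism can be pictured as cutting the parallel modality streams into short periodic windows and feeding the backbone one window at a time. The toy sketch below is our illustration of that scheduling idea, not the model&amp;rsquo;s actual implementation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;from itertools import zip_longest

def tdm_interleave(video_chunks, audio_chunks):
    '''Toy time-division multiplexing: fold parallel per-slice chunks
    from each modality into one sequential stream for the backbone.'''
    slices = zip_longest(video_chunks, audio_chunks)
    for t, (video, audio) in enumerate(slices):
        if video is not None:
            yield (t, 'video', video)
        if audio is not None:
            yield (t, 'audio', audio)

# One entry per periodic time slice (e.g. one second of frames/audio).
video = ['v0', 'v1', 'v2']
audio = ['a0', 'a1', 'a2', 'a3']
for slice_id, modality, chunk in tdm_interleave(video, audio):
    print(slice_id, modality, chunk)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;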
&lt;div align=&#34;center&#34;&gt;
&lt;img src=&#34;./assets/minicpm-o-26-framework-v2.png&#34; width=&#34;80%&#34;&gt;
&lt;/div&gt;
&lt;h3 id=&#34;evaluation-1&#34;&gt;Evaluation  &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;div align=&#34;center&#34;&gt;
  &lt;img src=&#34;./assets/radar.jpg&#34; width=&#34;80%&#34;&gt;
&lt;/div&gt;
&lt;details&gt;
&lt;summary&gt;Click to view visual understanding results.&lt;/summary&gt;
&lt;p&gt;&lt;strong&gt;Image Understanding&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Model&lt;/th&gt;
            &lt;th&gt;Size&lt;/th&gt;
            &lt;th&gt;Token Density&lt;sup&gt;+&lt;/sup&gt;&lt;/th&gt;
            &lt;th&gt;OpenCompass&lt;/th&gt;
            &lt;th&gt;OCRBench&lt;/th&gt;
            &lt;th&gt;MathVista mini&lt;/th&gt;
            &lt;th&gt;ChartQA&lt;/th&gt;
            &lt;th&gt;MMVet&lt;/th&gt;
            &lt;th&gt;MMStar&lt;/th&gt;
            &lt;th&gt;MME&lt;/th&gt;
            &lt;th&gt;MMB1.1 test&lt;/th&gt;
            &lt;th&gt;AI2D&lt;/th&gt;
            &lt;th&gt;MMMU val&lt;/th&gt;
            &lt;th&gt;HallusionBench&lt;/th&gt;
            &lt;th&gt;TextVQA val&lt;/th&gt;
            &lt;th&gt;DocVQA test&lt;/th&gt;
            &lt;th&gt;MathVerse mini&lt;/th&gt;
            &lt;th&gt;MathVision&lt;/th&gt;
            &lt;th&gt;MMHal Score&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;19&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Proprietary&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-20240513&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;1088&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;69.9&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;736&lt;/td&gt;
            &lt;td&gt;61.3&lt;/td&gt;
            &lt;td&gt;85.7&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;69.1&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;63.9&lt;/td&gt;
            &lt;td&gt;2328.7&lt;/td&gt;
            &lt;td&gt;82.2&lt;/td&gt;
            &lt;td&gt;84.6&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;69.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;55.0&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;92.8&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;50.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;30.4&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;3.6&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Claude3.5-Sonnet&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;750&lt;/td&gt;
            &lt;td&gt;67.9&lt;/td&gt;
            &lt;td&gt;788&lt;/td&gt;
            &lt;td&gt;61.6&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;90.8&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;66.0&lt;/td&gt;
            &lt;td&gt;62.2&lt;/td&gt;
            &lt;td&gt;1920.0&lt;/td&gt;
            &lt;td&gt;78.5&lt;/td&gt;
            &lt;td&gt;80.2&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;65.9&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;49.9&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;95.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;3.4&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Gemini 1.5 Pro&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;64.4&lt;/td&gt;
            &lt;td&gt;754&lt;/td&gt;
            &lt;td&gt;57.7&lt;/td&gt;
            &lt;td&gt;81.3&lt;/td&gt;
            &lt;td&gt;64.0&lt;/td&gt;
            &lt;td&gt;59.1&lt;/td&gt;
            &lt;td&gt;2110.6&lt;/td&gt;
            &lt;td&gt;73.9&lt;/td&gt;
            &lt;td&gt;79.1&lt;/td&gt;
            &lt;td&gt;60.6&lt;/td&gt;
            &lt;td&gt;45.6&lt;/td&gt;
            &lt;td&gt;73.5&lt;/td&gt;
            &lt;td&gt;86.5&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;19.2&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-mini-20240718&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;1088&lt;/td&gt;
            &lt;td&gt;64.1&lt;/td&gt;
            &lt;td&gt;785&lt;/td&gt;
            &lt;td&gt;52.4&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;66.9&lt;/td&gt;
            &lt;td&gt;54.8&lt;/td&gt;
            &lt;td&gt;2003.4&lt;/td&gt;
            &lt;td&gt;76.0&lt;/td&gt;
            &lt;td&gt;77.8&lt;/td&gt;
            &lt;td&gt;60.0&lt;/td&gt;
            &lt;td&gt;46.1&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;3.3&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;19&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Cambrian-34B&lt;/td&gt;
            &lt;td&gt;34B&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;1820&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;58.3&lt;/td&gt;
            &lt;td&gt;591&lt;/td&gt;
            &lt;td&gt;50.3&lt;/td&gt;
            &lt;td&gt;75.6&lt;/td&gt;
            &lt;td&gt;53.2&lt;/td&gt;
            &lt;td&gt;54.2&lt;/td&gt;
            &lt;td&gt;2049.9&lt;/td&gt;
            &lt;td&gt;77.8&lt;/td&gt;
            &lt;td&gt;79.5&lt;/td&gt;
            &lt;td&gt;50.4&lt;/td&gt;
            &lt;td&gt;41.6&lt;/td&gt;
            &lt;td&gt;76.7&lt;/td&gt;
            &lt;td&gt;75.5&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GLM-4V-9B&lt;/td&gt;
            &lt;td&gt;13B&lt;/td&gt;
            &lt;td&gt;784&lt;/td&gt;
            &lt;td&gt;59.1&lt;/td&gt;
            &lt;td&gt;776&lt;/td&gt;
            &lt;td&gt;51.1&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;58.0&lt;/td&gt;
            &lt;td&gt;54.8&lt;/td&gt;
            &lt;td&gt;2018.8&lt;/td&gt;
            &lt;td&gt;67.9&lt;/td&gt;
            &lt;td&gt;71.2&lt;/td&gt;
            &lt;td&gt;46.9&lt;/td&gt;
            &lt;td&gt;45.0&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Pixtral-12B&lt;/td&gt;
            &lt;td&gt;12B&lt;/td&gt;
            &lt;td&gt;256&lt;/td&gt;
            &lt;td&gt;61.0&lt;/td&gt;
            &lt;td&gt;685&lt;/td&gt;
            &lt;td&gt;56.9&lt;/td&gt;
            &lt;td&gt;81.8&lt;/td&gt;
            &lt;td&gt;58.5&lt;/td&gt;
            &lt;td&gt;54.5&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;72.7&lt;/td&gt;
            &lt;td&gt;79.0&lt;/td&gt;
            &lt;td&gt;51.1&lt;/td&gt;
            &lt;td&gt;47.0&lt;/td&gt;
            &lt;td&gt;75.7&lt;/td&gt;
            &lt;td&gt;90.7&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VITA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;784&lt;/td&gt;
            &lt;td&gt;63.3&lt;/td&gt;
            &lt;td&gt;741&lt;/td&gt;
            &lt;td&gt;66.2&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;52.7&lt;/td&gt;
            &lt;td&gt;60.2&lt;/td&gt;
            &lt;td&gt;2328.1&lt;/td&gt;
            &lt;td&gt;76.8&lt;/td&gt;
            &lt;td&gt;79.2&lt;/td&gt;
            &lt;td&gt;52.6&lt;/td&gt;
            &lt;td&gt;44.6&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;DeepSeek-VL2-27B (4B)&lt;/td&gt;
            &lt;td&gt;27B&lt;/td&gt;
            &lt;td&gt;672&lt;/td&gt;
            &lt;td&gt;66.4&lt;/td&gt;
            &lt;td&gt;809&lt;/td&gt;
            &lt;td&gt;63.9&lt;/td&gt;
            &lt;td&gt;86.0&lt;/td&gt;
            &lt;td&gt;60.0&lt;/td&gt;
            &lt;td&gt;61.9&lt;/td&gt;
            &lt;td&gt;2253.0&lt;/td&gt;
            &lt;td&gt;81.2&lt;/td&gt;
            &lt;td&gt;83.8&lt;/td&gt;
            &lt;td&gt;54.0&lt;/td&gt;
            &lt;td&gt;45.3&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;84.2&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;93.3&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;3.0&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2-VL-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;784&lt;/td&gt;
            &lt;td&gt;67.1&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;866&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;58.2&lt;/td&gt;
            &lt;td&gt;83.0&lt;/td&gt;
            &lt;td&gt;62.0&lt;/td&gt;
            &lt;td&gt;60.7&lt;/td&gt;
            &lt;td&gt;2326.0&lt;/td&gt;
            &lt;td&gt;81.8&lt;/td&gt;
            &lt;td&gt;83.0&lt;/td&gt;
            &lt;td&gt;54.1&lt;/td&gt;
            &lt;td&gt;50.6&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;84.3&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;94.5&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;31.9&lt;/td&gt;
            &lt;td&gt;16.3&lt;/td&gt;
            &lt;td&gt;3.2&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LLaVA-OneVision-72B&lt;/td&gt;
            &lt;td&gt;72B&lt;/td&gt;
            &lt;td&gt;182&lt;/td&gt;
            &lt;td&gt;68.1&lt;/td&gt;
            &lt;td&gt;741&lt;/td&gt;
            &lt;td&gt;67.5&lt;/td&gt;
            &lt;td&gt;83.7&lt;/td&gt;
            &lt;td&gt;60.6&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;65.8&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;2261.0&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;85.0&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;85.6&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;56.8&lt;/td&gt;
            &lt;td&gt;49.0&lt;/td&gt;
            &lt;td&gt;80.5&lt;/td&gt;
            &lt;td&gt;91.3&lt;/td&gt;
            &lt;td&gt;39.1&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;3.5&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;InternVL2.5-8B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;706&lt;/td&gt;
            &lt;td&gt;68.3&lt;/td&gt;
            &lt;td&gt;822&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;64.4&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;84.8&lt;/td&gt;
            &lt;td&gt;62.8&lt;/td&gt;
            &lt;td&gt;62.8&lt;/td&gt;
            &lt;td&gt;2344.0&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;83.6&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;84.5&lt;/td&gt;
            &lt;td&gt;56.0&lt;/td&gt;
            &lt;td&gt;50.1&lt;/td&gt;
            &lt;td&gt;79.1&lt;/td&gt;
            &lt;td&gt;93.0&lt;/td&gt;
            &lt;td&gt;39.5&lt;/td&gt;
            &lt;td&gt;19.7&lt;/td&gt;
            &lt;td&gt;3.4&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-V 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;2822&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;65.2&lt;/td&gt;
            &lt;td&gt;852*&lt;/td&gt;
            &lt;td&gt;60.6&lt;/td&gt;
            &lt;td&gt;79.4&lt;/td&gt;
            &lt;td&gt;60.0&lt;/td&gt;
            &lt;td&gt;57.5&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;2348.4*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;78.0&lt;/td&gt;
            &lt;td&gt;82.1&lt;/td&gt;
            &lt;td&gt;49.8*&lt;/td&gt;
            &lt;td&gt;48.1*&lt;/td&gt;
            &lt;td&gt;80.1&lt;/td&gt;
            &lt;td&gt;90.8&lt;/td&gt;
            &lt;td&gt;25.7&lt;/td&gt;
            &lt;td&gt;18.3&lt;/td&gt;
            &lt;td&gt;3.6&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;2822&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;70.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;897*&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;71.9*&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;86.9*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;67.5&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;64.0&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;2372.0*&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;80.5&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;85.8&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;50.4*&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;51.9&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;82.0&lt;/td&gt;
            &lt;td&gt;93.5&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;41.4*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;23.1*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;3.8&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
* We evaluate this benchmark using chain-of-thought prompting; for MME, we apply it only to the Cognition set.
&lt;p&gt;&lt;sup&gt;+&lt;/sup&gt; Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.&lt;/p&gt;
&lt;p&gt;Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.&lt;/p&gt;
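&lt;p&gt;As a worked example of the formula above (a minimal sketch; the 1344×1344 maximum resolution and 640 visual tokens are illustrative assumptions used only to reproduce the 2822 figure in the table):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Token density = (# pixels at maximum resolution) / (# visual tokens).
# Illustrative assumptions: a 1344x1344 maximum-resolution image encoded into 640 tokens.
max_width, max_height = 1344, 1344
num_visual_tokens = 640

token_density = (max_width * max_height) / num_visual_tokens
print(round(token_density))  # 2822, matching the token-density column above
&lt;/code&gt;&lt;/pre&gt;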
&lt;p&gt;&lt;strong&gt;Multi-image and Video Understanding&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Model&lt;/th&gt;
            &lt;th&gt;Size&lt;/th&gt;
            &lt;th&gt;BLINK val&lt;/th&gt;
            &lt;th&gt;Mantis Eval&lt;/th&gt;
            &lt;th&gt;MIRB&lt;/th&gt;
            &lt;th&gt;Video-MME (wo / w subs)&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;6&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Proprietary&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-20240513&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;68.0&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;71.9/77.2&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT4V&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;54.6&lt;/td&gt;
            &lt;td&gt;62.7&lt;/td&gt;
            &lt;td&gt;53.1&lt;/td&gt;
            &lt;td&gt;59.9/63.3&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;6&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Open-source&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VITA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;45.0&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;56.1/58.7&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LLaVA-NeXT-Interleave 14B&lt;/td&gt;
            &lt;td&gt;14B&lt;/td&gt;
            &lt;td&gt;52.6&lt;/td&gt;
            &lt;td&gt;66.4&lt;/td&gt;
            &lt;td&gt;30.2&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LLaVA-OneVision-72B&lt;/td&gt;
            &lt;td&gt;72B&lt;/td&gt;
            &lt;td&gt;55.4&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;77.6&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;66.2/69.5&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MANTIS 8B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;49.1&lt;/td&gt;
            &lt;td&gt;59.5&lt;/td&gt;
            &lt;td&gt;34.8&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2-VL-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;53.2&lt;/td&gt;
            &lt;td&gt;69.6*&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;67.6*&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;63.3/69.0&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;InternVL2.5-8B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;54.8&lt;/td&gt;
            &lt;td&gt;67.7&lt;/td&gt;
            &lt;td&gt;52.5&lt;/td&gt;
            &lt;td&gt;64.2/66.9&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-V 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;53.0&lt;/td&gt;
            &lt;td&gt;69.1&lt;/td&gt;
            &lt;td&gt;53.8&lt;/td&gt;
            &lt;td&gt;60.9/63.6&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;56.7&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;71.9&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;58.6&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;63.9/67.9&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
* We evaluated the officially released checkpoints ourselves.
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Click to view audio understanding and speech conversation results.&lt;/summary&gt;
&lt;p&gt;&lt;strong&gt;Audio Understanding&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Task&lt;/th&gt;
            &lt;th&gt;Size&lt;/th&gt;
            &lt;th colspan=&#34;3&#34;&gt;ASR (zh)&lt;/th&gt;
            &lt;th colspan=&#34;3&#34;&gt;ASR (en)&lt;/th&gt;
            &lt;th colspan=&#34;2&#34;&gt;AST&lt;/th&gt;
            &lt;th&gt;Emotion&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Metric&lt;/th&gt;
            &lt;td&gt;&lt;/td&gt;
            &lt;th colspan=&#34;3&#34;&gt;CER↓&lt;/th&gt;
            &lt;th colspan=&#34;3&#34;&gt;WER↓&lt;/th&gt;
            &lt;th colspan=&#34;2&#34;&gt;BLEU↑&lt;/th&gt;
            &lt;th&gt;ACC↑&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Dataset&lt;/th&gt;
            &lt;td&gt;&lt;/td&gt;
            &lt;th&gt;AISHELL-1&lt;/th&gt;
            &lt;th&gt;Fleurs zh&lt;/th&gt;
            &lt;th&gt;WenetSpeech test-net&lt;/th&gt;
            &lt;th&gt;LibriSpeech test-clean&lt;/th&gt;
            &lt;th&gt;GigaSpeech&lt;/th&gt;
            &lt;th&gt;TED-LIUM&lt;/th&gt;
            &lt;th&gt;CoVoST en2zh&lt;/th&gt;
            &lt;th&gt;CoVoST zh2en&lt;/th&gt;
            &lt;th&gt;MELD emotion&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;11&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Proprietary&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-Realtime&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;7.3*&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;5.4*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;28.9*&lt;/td&gt;
            &lt;td&gt;2.6*&lt;/td&gt;
            &lt;td&gt;12.9*&lt;/td&gt;
            &lt;td&gt;4.8*&lt;/td&gt;
            &lt;td&gt;37.1*&lt;/td&gt;
            &lt;td&gt;15.7*&lt;/td&gt;
            &lt;td&gt;33.2*&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Gemini 1.5 Pro&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;4.5*&lt;/td&gt;
            &lt;td&gt;5.9*&lt;/td&gt;
            &lt;td&gt;14.3*&lt;/td&gt;
            &lt;td&gt;2.9*&lt;/td&gt;
            &lt;td&gt;10.6*&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;3.0*&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;47.3*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;22.6*&lt;/td&gt;
            &lt;td&gt;48.4*&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;11&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Open-Source&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2-Audio-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;7.5&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;1.6&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;45.2&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;24.4&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;55.3&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2-Audio-7B-Instruct&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;2.6*&lt;/td&gt;
            &lt;td&gt;6.9*&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;10.3*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;3.1*&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;9.7*&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;5.9*&lt;/td&gt;
            &lt;td&gt;39.5*&lt;/td&gt;
            &lt;td&gt;22.9*&lt;/td&gt;
            &lt;td&gt;17.4*&lt;/td&gt;
        &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VITA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;2.16&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;8.4&lt;/td&gt;
            &lt;td&gt;3.4&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GLM-4-Voice-Base&lt;/td&gt;
            &lt;td&gt;9B&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;2.5&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;2.8&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;1.6&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;4.4&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;6.9&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;1.7&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;8.7&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;3.0&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;48.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;27.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;52.4&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
* We evaluated the officially released checkpoints ourselves.&lt;br&gt;&lt;br&gt;
&lt;p&gt;&lt;strong&gt;Speech Generation&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Task&lt;/th&gt;
            &lt;th&gt;Size&lt;/th&gt;
            &lt;th colspan=&#34;9&#34;&gt;SpeechQA&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Metric&lt;/th&gt;
            &lt;th&gt;&lt;/th&gt;
            &lt;th colspan=&#34;3&#34;&gt;ACC↑&lt;/th&gt;
            &lt;th&gt;G-Eval (10 point)↑&lt;/th&gt;
            &lt;th&gt;Semantic ELO score↑&lt;/th&gt;
            &lt;th&gt;Acoustic ELO score↑&lt;/th&gt;
            &lt;th&gt;Overall ELO score↑&lt;/th&gt;
            &lt;th&gt;UTMOS↑&lt;/th&gt;
            &lt;th&gt;ASR-WER↓&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Dataset&lt;/th&gt;
            &lt;th&gt;&lt;/th&gt;
            &lt;th&gt;Speech Llama Q.&lt;/th&gt;
            &lt;th&gt;Speech Web Q.&lt;/th&gt;
            &lt;th&gt;Speech Trivia QA&lt;/th&gt;
            &lt;th&gt;Speech AlpacaEval&lt;/th&gt;
            &lt;th colspan=&#34;5&#34;&gt;AudioArena&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;11&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Proprietary&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-Realtime&lt;/td&gt;
            &lt;td&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;71.7&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;51.6&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;69.7&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;7.4&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;1157&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;1203&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;1200&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;4.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;2.3&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;11&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Open-Source&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GLM-4-Voice&lt;/td&gt;
            &lt;td&gt;9B&lt;/td&gt;
            &lt;td&gt;50.0&lt;/td&gt;
            &lt;td&gt;32.0&lt;/td&gt;
            &lt;td&gt;36.4&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;5.1&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;999&lt;/td&gt;
            &lt;td&gt;1147&lt;/td&gt;
            &lt;td&gt;1035&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;4.1&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;11.7&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Llama-Omni&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;45.3&lt;/td&gt;
            &lt;td&gt;22.9&lt;/td&gt;
            &lt;td&gt;10.7&lt;/td&gt;
            &lt;td&gt;3.9&lt;/td&gt;
            &lt;td&gt;960&lt;/td&gt;
            &lt;td&gt;878&lt;/td&gt;
            &lt;td&gt;897&lt;/td&gt;
            &lt;td&gt;3.2&lt;/td&gt;
            &lt;td&gt;24.3&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VITA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;46.7&lt;/td&gt;
            &lt;td&gt;28.1&lt;/td&gt;
            &lt;td&gt;23.3&lt;/td&gt;
            &lt;td&gt;2.0&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Moshi&lt;/td&gt;
            &lt;td&gt;7B&lt;/td&gt;
            &lt;td&gt;43.7&lt;/td&gt;
            &lt;td&gt;23.8&lt;/td&gt;
            &lt;td&gt;16.7&lt;/td&gt;
            &lt;td&gt;2.4&lt;/td&gt;
            &lt;td&gt;871&lt;/td&gt;
            &lt;td&gt;808&lt;/td&gt;
            &lt;td&gt;875&lt;/td&gt;
            &lt;td&gt;2.8&lt;/td&gt;
            &lt;td&gt;8.2&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Mini-Omni&lt;/td&gt;
            &lt;td&gt;1B&lt;/td&gt;
            &lt;td&gt;22.0&lt;/td&gt;
            &lt;td&gt;12.8&lt;/td&gt;
            &lt;td&gt;6.9&lt;/td&gt;
            &lt;td&gt;2.5&lt;/td&gt;
            &lt;td&gt;926&lt;/td&gt;
            &lt;td&gt;803&lt;/td&gt;
            &lt;td&gt;865&lt;/td&gt;
            &lt;td&gt;3.4&lt;/td&gt;
            &lt;td&gt;10.0&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;61.0&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;40.0&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;40.2&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;5.1&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;1088&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;1163&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;1131&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;4.2&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;9.8&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
All results are from AudioEvals; the evaluation methods and further details can be found in the &lt;a href=&#34;https://github.com/OpenBMB/UltraEval-Audio&#34; target=&#34;_blank&#34;&gt;AudioEvals&lt;/a&gt; repository.&lt;br&gt;&lt;br&gt;
&lt;p&gt;&lt;strong&gt;End-to-end Voice Cloning&lt;/strong&gt;&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Task&lt;/th&gt;
            &lt;th colspan=&#34;2&#34;&gt;Voice cloning&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Metric&lt;/th&gt;
            &lt;th&gt;SIMO↑&lt;/th&gt;
            &lt;th&gt;SIMO↑&lt;/th&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Dataset&lt;/th&gt;
            &lt;th&gt;Seed-TTS test-zh&lt;/th&gt;
            &lt;th&gt;Seed-TTS test-en&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;F5-TTS&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;76&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;67&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;CosyVoice&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;75&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;64&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;FireRedTTS&lt;/td&gt;
            &lt;td&gt;63&lt;/td&gt;
            &lt;td&gt;46&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;57&lt;/td&gt;
            &lt;td&gt;47&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt;Click to view multimodal live streaming results.&lt;/summary&gt;
&lt;p&gt;&lt;strong&gt;Multimodal Live Streaming&lt;/strong&gt;: results on StreamingBench&lt;/p&gt;
&lt;table style=&#34;margin: 0px auto;&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th align=&#34;left&#34;&gt;Model&lt;/th&gt;
            &lt;th&gt;Size&lt;/th&gt;
            &lt;th&gt;Real-Time Video Understanding&lt;/th&gt;
            &lt;th&gt;Omni-Source Understanding&lt;/th&gt;
            &lt;th&gt;Contextual Understanding&lt;/th&gt;
            &lt;th&gt;Overall&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody align=&#34;center&#34;&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;6&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Proprietary&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Gemini 1.5 Pro&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;77.4&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;67.8&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;51.1&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;70.3&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;GPT-4o-202408&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;74.5&lt;/td&gt;
            &lt;td&gt;51.0&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;48.0&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;64.1&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Claude-3.5-Sonnet&lt;/td&gt;
            &lt;td&gt;-&lt;/td&gt;
            &lt;td&gt;74.0&lt;/td&gt;
            &lt;td&gt;41.4&lt;/td&gt;
            &lt;td&gt;37.8&lt;/td&gt;
            &lt;td&gt;59.7&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td colspan=&#34;6&#34; align=&#34;left&#34;&gt;&lt;strong&gt;Open-source&lt;/strong&gt;&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VILA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;61.5&lt;/td&gt;
            &lt;td&gt;37.5&lt;/td&gt;
            &lt;td&gt;26.7&lt;/td&gt;
            &lt;td&gt;49.5&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LongVA&lt;/td&gt;
            &lt;td&gt;7B&lt;/td&gt;
            &lt;td&gt;63.1&lt;/td&gt;
            &lt;td&gt;35.9&lt;/td&gt;
            &lt;td&gt;30.2&lt;/td&gt;
            &lt;td&gt;50.7&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LLaVA-Next-Video-34B&lt;/td&gt;
            &lt;td&gt;34B&lt;/td&gt;
            &lt;td&gt;69.8&lt;/td&gt;
            &lt;td&gt;41.7&lt;/td&gt;
            &lt;td&gt;34.3&lt;/td&gt;
            &lt;td&gt;56.7&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;Qwen2-VL-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;71.2&lt;/td&gt;
            &lt;td&gt;40.7&lt;/td&gt;
            &lt;td&gt;33.1&lt;/td&gt;
            &lt;td&gt;57.0&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;InternVL2-8B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;70.1&lt;/td&gt;
            &lt;td&gt;42.7&lt;/td&gt;
            &lt;td&gt;34.1&lt;/td&gt;
            &lt;td&gt;57.0&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;VITA-1.5&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;70.9&lt;/td&gt;
            &lt;td&gt;40.8&lt;/td&gt;
            &lt;td&gt;35.8&lt;/td&gt;
            &lt;td&gt;57.4&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;LLaVA-OneVision-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;74.3&lt;/td&gt;
            &lt;td&gt;40.8&lt;/td&gt;
            &lt;td&gt;31.0&lt;/td&gt;
            &lt;td&gt;58.4&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;InternLM-XC2.5-OL-7B&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;75.4&lt;/td&gt;
            &lt;td&gt;46.2&lt;/td&gt;
            &lt;td&gt;33.6&lt;/td&gt;
            &lt;td&gt;60.8&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-V 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;72.4&lt;/td&gt;
            &lt;td&gt;40.2&lt;/td&gt;
            &lt;td&gt;33.4&lt;/td&gt;
            &lt;td&gt;57.7&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td nowrap=&#34;nowrap&#34; align=&#34;left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
            &lt;td&gt;8B&lt;/td&gt;
            &lt;td&gt;&lt;strong&gt;79.9&lt;/strong&gt;&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;53.4&lt;/u&gt;&lt;/td&gt;
            &lt;td&gt;38.5&lt;/td&gt;
            &lt;td&gt;&lt;u&gt;66.0&lt;/u&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;/details&gt;
&lt;h3 id=&#34;examples-1&#34;&gt;Examples &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;We deploy MiniCPM-o 2.6 on end devices. The demo videos are raw-speed recordings on an iPad Pro and in a web demo.&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
  &lt;a href=&#34;https://www.youtube.com/watch?v=vRIMbxJzStY&amp;t=2s&#34;&gt;&lt;img src=&#34;./assets/minicpmo2_6/2dot6_o_demo_video_img.png&#34; width=&#34;70%&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div style=&#34;display: flex; flex-direction: column; align-items: center;&#34;&gt;
  &lt;img src=&#34;assets/minicpmo2_6/minicpmo2_6_math_intersect.png&#34; alt=&#34;math&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
  &lt;img src=&#34;assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png&#34; alt=&#34;diagram&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
  &lt;img src=&#34;assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png&#34; alt=&#34;bike&#34; style=&#34;margin-bottom: 5px;&#34;&gt;
&lt;/div&gt;
&lt;h2 id=&#34;legacy-models&#34;&gt;Legacy Models &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th style=&#34;text-align: left&#34;&gt;Model&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Introduction and Guidance&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 4.0&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/minicpm_v4_en.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 2.6&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/minicpm_v2dot6_en.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-Llama3-V 2.5&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/minicpm_llama3_v2dot5.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 2.0&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/minicpm_v2.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 1.0&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/minicpm_v1.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;OmniLMM-12B&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;./docs/omnilmm_en.md&#34; &gt;Document&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;minicpm-v--o-cookbook&#34;&gt;MiniCPM-V &amp;amp; o Cookbook
&lt;/h2&gt;&lt;p&gt;Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;cookbook&lt;/a&gt;, which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easy Usage Documentation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our comprehensive &lt;a class=&#34;link&#34; href=&#34;https://minicpm-o.readthedocs.io/en/latest/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;documentation website&lt;/a&gt; presents every recipe in a clear, well-organized manner.
All features are displayed at a glance, making it easy for you to quickly find exactly what you need.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Broad User Spectrum&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We support a wide range of users, from individuals to enterprises and researchers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Individuals&lt;/strong&gt;: Enjoy effortless inference using &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama.cpp&lt;/a&gt; with minimal setup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enterprises&lt;/strong&gt;: Achieve high-throughput, scalable performance with &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SGLang&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Researchers&lt;/strong&gt;: Leverage advanced frameworks including &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Transformers&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA-Factory&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SWIFT&lt;/a&gt;, and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Align-anything&lt;/a&gt; to enable flexible model development and cutting-edge experimentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Versatile Deployment Scenarios&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Web demo&lt;/strong&gt;: Launch an interactive multimodal AI web demo with &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FastAPI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantized deployment&lt;/strong&gt;: Maximize efficiency and minimize resource consumption using &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GGUF&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BNB&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;End devices&lt;/strong&gt;: Bring powerful AI experiences to &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;iPhone and iPad&lt;/a&gt;, supporting offline and privacy-sensitive applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;chat-with-our-demo-on-gradio-&#34;&gt;Chat with Our Demo on Gradio 🤗
&lt;/h2&gt;&lt;p&gt;We provide online and local demos powered by Hugging Face Gradio &lt;a href=&#39;https://github.com/gradio-app/gradio&#39;&gt;&lt;img src=&#39;https://img.shields.io/github/stars/gradio-app/gradio&#39;&gt;&lt;/a&gt;, one of the most popular model deployment frameworks. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.&lt;/p&gt;
&lt;h3 id=&#34;online-demo&#34;&gt;Online Demo &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;Click here to try out the online demo of &lt;a class=&#34;link&#34; href=&#34;https://minicpm-omni-webdemo-us.modelbest.cn/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-o 2.6&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;http://120.92.209.146:8887/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V 2.6&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-Llama3-V 2.5&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/openbmb/MiniCPM-V-2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V 2.0&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;local-webui-demo&#34;&gt;Local WebUI Demo &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;You can easily build your own local WebUI demo using the following commands.&lt;/p&gt;
&lt;p&gt;Please ensure that &lt;code&gt;transformers==4.44.2&lt;/code&gt; is installed, as other versions may have compatibility issues.&lt;/p&gt;
&lt;p&gt;If you are using an older version of PyTorch, you might encounter the error &lt;code&gt;&amp;quot;weight_norm_fwd_first_dim_kernel&amp;quot; not implemented for &#39;BFloat16&#39;&lt;/code&gt;. In that case, add &lt;code&gt;self.minicpmo_model.tts.float()&lt;/code&gt; during model initialization.&lt;/p&gt;
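&lt;p&gt;A minimal standalone sketch of that workaround, assuming the model exposes its TTS module as &lt;code&gt;model.tts&lt;/code&gt; (as the attribute name above suggests):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(&#39;openbmb/MiniCPM-o-2_6&#39;, trust_remote_code=True,
                                  attn_implementation=&#39;sdpa&#39;, torch_dtype=torch.bfloat16)
model = model.eval().cuda()

# Older PyTorch builds lack a BFloat16 kernel for weight_norm; keeping the TTS
# module in float32 avoids the error while the rest of the model stays in bfloat16.
model.tts.float()
&lt;/code&gt;&lt;/pre&gt;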
&lt;p&gt;&lt;strong&gt;For real-time voice/video call demo:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Launch the model server:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements_o2.6.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python web_demos/minicpm-o_2.6/model_server.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;Launch the web server:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Make sure Node and PNPM are installed.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get update
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt-get install nodejs npm
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm install -g pnpm
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; web_demos/minicpm-o_2.6/web_server
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# create ssl cert for https, https is required to request camera and microphone permissions.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash ./make_ssl_cert.sh  &lt;span class=&#34;c1&#34;&gt;# output key.pem and cert.pem&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pnpm install  &lt;span class=&#34;c1&#34;&gt;# install requirements&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pnpm run dev  &lt;span class=&#34;c1&#34;&gt;# start server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Open &lt;code&gt;https://localhost:8088/&lt;/code&gt; in your browser and enjoy the real-time voice/video call.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For chatbot demo:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements_o2.6.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Open &lt;code&gt;http://localhost:8000/&lt;/code&gt; in your browser and enjoy the vision-mode chatbot.&lt;/p&gt;
&lt;h2 id=&#34;inference&#34;&gt;Inference
&lt;/h2&gt;&lt;h3 id=&#34;model-zoo&#34;&gt;Model Zoo
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th style=&#34;text-align: left&#34;&gt;Model&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Device&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Memory&lt;/th&gt;
          &lt;th style=&#34;text-align: left&#34;&gt;         Description&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;Download&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 4.5&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;GPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;18 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The latest version, with strong end-side multimodal performance for single-image, multi-image, and video understanding.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 4.5 gguf&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;CPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;8 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The gguf version, lower memory usage and faster inference.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 4.5 int4&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;GPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;9 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The int4 quantized version, lower GPU memory usage.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-V 4.5 AWQ&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;GPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;9 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The AWQ quantized version, lower GPU memory usage.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-4_5-AWQ&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-AWQ&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-o 2.6&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;GPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;18 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The latest omni-modal version, achieving GPT-4o-level performance for vision, speech, and multimodal live streaming on end-side devices.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-o 2.6 gguf&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;CPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;8 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The gguf version, lower memory usage and faster inference.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;MiniCPM-o 2.6 int4&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;GPU&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;9 GB&lt;/td&gt;
          &lt;td style=&#34;text-align: left&#34;&gt;The int4 quantized version, lower GPU memory usage.&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;🤗&lt;/a&gt;    &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;./assets/modelscope_logo.png&#34; width=&#34;20px&#34;&gt;&lt;/img&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
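&lt;p&gt;The quantized checkpoints load through the same &lt;code&gt;AutoModel&lt;/code&gt; interface as the full-precision models. A minimal sketch for the int4 variant (assuming the pre-quantized weights need no &lt;code&gt;torch_dtype&lt;/code&gt; override):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from transformers import AutoModel, AutoTokenizer

# The int4 checkpoint ships pre-quantized, so no torch_dtype is passed here.
model = AutoModel.from_pretrained(&#39;openbmb/MiniCPM-o-2_6-int4&#39;, trust_remote_code=True)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(&#39;openbmb/MiniCPM-o-2_6-int4&#39;, trust_remote_code=True)
&lt;/code&gt;&lt;/pre&gt;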
&lt;h3 id=&#34;multi-turn-conversation&#34;&gt;Multi-turn Conversation
&lt;/h3&gt;&lt;p&gt;If you wish to enable long-thinking mode, provide the argument &lt;code&gt;enable_thinking=True&lt;/code&gt; to the chat function.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements_o2.6.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Please refer to the following code to run.&lt;/p&gt;
&lt;div align=&#34;center&#34;&gt;
&lt;img src=&#34;assets/minicpmo2_6/show_demo.jpg&#34; width=&#34;500px&#34;&gt;
&lt;/div&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;35
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;manual_seed&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sdpa or flash_attention_2, no eager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/minicpmo2_6/show_demo.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;enable_thinking&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# If `enable_thinking=True`, the long-thinking mode is enabled.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# First round chat &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;What is the landform in the picture?&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;enable_thinking&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;enable_thinking&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Second round chat: pass the history of the multi-turn conversation&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;assistant&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;What should I pay attention to when traveling here?&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You will get the following output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years of water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising above serene rivers and lush green forests, creating a breathtaking, iconic natural landscape that attracts millions of visitors each year with its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you&amp;#39;ll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id=&#34;chat-with-multiple-images&#34;&gt;Chat with Multiple Images
&lt;/h4&gt;&lt;details&gt;
&lt;summary&gt; Click to view Python code running MiniCPM-V-4_5 with multiple images as input. &lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sdpa or flash_attention_2, no eager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;image1.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;image2.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Compare image 1 and image 2, tell me about the differences between image 1 and image 2.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;image2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h4 id=&#34;in-context-few-shot-learning&#34;&gt;In-context Few-shot Learning
&lt;/h4&gt;&lt;details&gt;
&lt;summary&gt; Click to view Python code running MiniCPM-V-4_5 with few-shot input. &lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sdpa or flash_attention_2, no eager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;production date&amp;#34;&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;example1.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;2023.08.04&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;example2.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;2007.04.24&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image_test&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;test.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;assistant&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;assistant&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image_test&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h4 id=&#34;chat-with-video&#34;&gt;Chat with Video
&lt;/h4&gt;&lt;details&gt;
&lt;summary&gt; Click to view Python code running MiniCPM-V-4_5 with video input and the 3D-Resampler. &lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;## The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# To achieve this, you need to organize your video data into two corresponding sequences: &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#   frames: List[Image]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#   temporal_ids: List[List[Int]].&lt;/span&gt;
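&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#   (illustrative: each inner list holds the temporal ids of the frames that are packed into one group)&lt;/span&gt;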
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;decord&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;VideoReader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cpu&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;# pip install decord&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scipy.spatial&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cKDTree&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;math&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sdpa or flash_attention_2, no eager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-V-4_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# or openbmb/MiniCPM-o-2_6&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;MAX_NUM_FRAMES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;180&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# Maximum number of frames after packing; the actual maximum number of raw frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;MAX_NUM_PACKING&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# Maximum number of frames packed into one group; valid range: 1-6.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;TIME_SCALE&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.1&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;map_to_nearest_scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
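&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# Snaps each value to the nearest entry of scale, e.g. map_to_nearest_scale([0.24, 0.91], [0.0, 0.5, 1.0]) -&amp;gt; array([0.0, 1.0])&lt;/span&gt;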
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tree&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cKDTree&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[:,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;indices&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tree&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;query&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[:,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;indices&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;group_array&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
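&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# Splits arr into consecutive chunks of at most size, e.g. group_array([1, 2, 3, 4, 5], 2) -&amp;gt; [[1, 2], [3, 4], [5]]&lt;/span&gt;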
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;encode_video&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;choose_fps&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;uniform_sample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
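&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;c1&#34;&gt;# Evenly samples n items from l, e.g. for len(l)=10, n=3 this picks indices [1, 5, 8]&lt;/span&gt;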
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;gap&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;idxs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gap&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gap&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;idxs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;vr&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;VideoReader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ctx&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cpu&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_avg_fps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;choose_fps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_FRAMES&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;choose_frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;choose_fps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MAX_NUM_FRAMES&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ceil&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;choose_fps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_FRAMES&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_PACKING&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;choose_frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;choose_fps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;choose_frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;round&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;MAX_NUM_FRAMES&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_PACKING&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_PACKING&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;vr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))]&lt;/span&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;array&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;uniform_sample&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;choose_frames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;min&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;MAX_NUM_PACKING&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39; duration:&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;get video frames=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;, packing_nums=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_batch&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;asnumpy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_idx_ts&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frame_idx&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;scale&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arange&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;video_duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TIME_SCALE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_ts_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;map_to_nearest_scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_idx_ts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;scale&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TIME_SCALE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_ts_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frame_ts_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;int32&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;assert&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;len&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_ts_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fromarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;v&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;uint8&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;v&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;frame_ts_id_group&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;group_array&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_ts_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;packing_nums&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frame_ts_id_group&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;video_test.mp4&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;5&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sampling fps used to select frames from the video&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# Set force_packing to force 3D packing on; otherwise encode_video chooses the packing number dynamically from the video duration.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frame_ts_id_group&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;encode_video&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;force_packing&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Describe the video&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;frames&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]},&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_image_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_slice_nums&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temporal_ids&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame_ts_id_group&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
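&lt;p&gt;In the video chat call above, &lt;code&gt;temporal_ids=frame_ts_id_group&lt;/code&gt; supplies the grouped frame timestamps produced by &lt;code&gt;encode_video&lt;/code&gt;, so the model can associate each packed frame with its point in time; &lt;code&gt;use_image_id=False&lt;/code&gt; and &lt;code&gt;max_slice_nums=1&lt;/code&gt; disable per-image IDs and high-resolution slicing for the frames, which helps keep token usage manageable for long frame sequences.&lt;/p&gt;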
&lt;h4 id=&#34;speech-and-audio-mode&#34;&gt;Speech and Audio Mode
&lt;/h4&gt;&lt;p&gt;Model initialization&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;librosa&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-o-2_6&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# sdpa or flash_attention_2, no eager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-o-2_6&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;init_tts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# initialize the TTS module for speech output&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tts&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;float&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# keep TTS in float32; bfloat16 TTS kernels can be unsupported on some PyTorch versions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
&lt;h5 id=&#34;mimick&#34;&gt;Mimick &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;The &lt;code&gt;Mimick&lt;/code&gt; task reflects a model&amp;rsquo;s end-to-end speech modeling capability: the model takes audio input, outputs an ASR transcription, and then reconstructs the original audio. The higher the similarity between the reconstructed and the original audio, the stronger the model&amp;rsquo;s foundational end-to-end speech modeling capability.&lt;/p&gt;
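&lt;p&gt;The snippet below loads a 10-second example clip, prompts the model to repeat it, and, with &lt;code&gt;generate_audio=True&lt;/code&gt;, writes the reconstructed speech to &lt;code&gt;output_mimick.wav&lt;/code&gt; while returning the transcription text in &lt;code&gt;res&lt;/code&gt;.&lt;/p&gt;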
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;mimick_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Please repeat each user&amp;#39;s speech, including voice style and speech content.&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;audio_input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/input_examples/Trump_WEF_2018_10s.mp3&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# load the audio to be mimicked&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Other example inputs that exercise different speech-centric features:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# `./assets/input_examples/fast-pace.wav`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# `./assets/input_examples/chi-english-1.wav`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# `./assets/input_examples/exciting-emotion.wav`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mimick_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;audio_input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;output_mimick.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# save the tts result to output_audio_path&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
&lt;h5 id=&#34;general-speech-conversation-with-configurable-voices&#34;&gt;General Speech Conversation with Configurable Voices &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;A common usage scenario for &lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; is role-playing a specific character based on an audio prompt: the model mimics the character&amp;rsquo;s voice to some extent and adopts the character&amp;rsquo;s persona in text, including language style. In this mode, &lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; sounds &lt;strong&gt;more natural and human-like&lt;/strong&gt;. Self-defined audio prompts can be used to customize the character&amp;rsquo;s voice in an end-to-end manner.&lt;/p&gt;
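&lt;p&gt;The example below builds the system prompt from a reference voice with &lt;code&gt;get_sys_prompt(mode=&amp;#39;audio_roleplay&amp;#39;)&lt;/code&gt; and then carries the character&amp;rsquo;s voice and persona across two conversation rounds.&lt;/p&gt;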
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/input_examples/icl_20.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# load the reference audio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mode&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;audio_roleplay&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;language&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;en&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# round one&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_roleplay_round_1.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# round two&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;assistant&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# append the assistant reply to the history in place&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_roleplay_round_2.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
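&lt;p&gt;For multi-turn conversations such as the one above, append the assistant reply and the next user turn to &lt;code&gt;msgs&lt;/code&gt; in place before calling &lt;code&gt;model.chat&lt;/code&gt; again; Python&amp;rsquo;s &lt;code&gt;list.append&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt;, so its return value must not be assigned and reused as the history. A minimal sketch of this loop follows, assuming the &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;tokenizer&lt;/code&gt;, and &lt;code&gt;sys_prompt&lt;/code&gt; objects defined above; the file names &lt;code&gt;q1.wav&lt;/code&gt; and &lt;code&gt;q2.wav&lt;/code&gt; are placeholders.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal multi-turn sketch: the history list is mutated in place each round.
msgs = [sys_prompt]
for turn, wav in enumerate([&amp;#39;q1.wav&amp;#39;, &amp;#39;q2.wav&amp;#39;], start=1):  # placeholder user clips
    audio, _ = librosa.load(wav, sr=16000, mono=True)
    msgs.append({&amp;#39;role&amp;#39;: &amp;#39;user&amp;#39;, &amp;#39;content&amp;#39;: [audio]})
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f&amp;#39;result_round_{turn}.wav&amp;#39;,
    )
    msgs.append({&amp;#39;role&amp;#39;: &amp;#39;assistant&amp;#39;, &amp;#39;content&amp;#39;: res})  # list.append returns None
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;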
&lt;h5 id=&#34;speech-conversation-as-an-ai-assistant&#34;&gt;Speech Conversation as an AI Assistant &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;&lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; can also act as an AI assistant, though with a limited choice of voices. In this mode, the model is &lt;strong&gt;less human-like and more like a voice assistant&lt;/strong&gt;, but it follows instructions more reliably. For the demo, we suggest &lt;code&gt;assistant_female_voice&lt;/code&gt;, &lt;code&gt;assistant_male_voice&lt;/code&gt;, or &lt;code&gt;assistant_default_female_voice&lt;/code&gt;; other voices may work but are not as stable as these defaults.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Please note that &lt;code&gt;assistant_female_voice&lt;/code&gt; and &lt;code&gt;assistant_male_voice&lt;/code&gt; are more stable but sound robotic, while &lt;code&gt;assistant_default_female_voice&lt;/code&gt; is more human-like but less stable; its voice often changes across turns. We therefore suggest the stable voices &lt;code&gt;assistant_female_voice&lt;/code&gt; and &lt;code&gt;assistant_male_voice&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
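&lt;p&gt;The flow below mirrors the role-play example, except that the system prompt is built with &lt;code&gt;mode=&amp;#39;audio_assistant&amp;#39;&lt;/code&gt; and one of the assistant reference voices is loaded as &lt;code&gt;ref_audio&lt;/code&gt;.&lt;/p&gt;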
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/input_examples/assistant_female_voice.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# or use `./assets/input_examples/assistant_male_voice.wav`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mode&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;audio_assistant&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;language&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;en&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]}&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# load the user&amp;#39;s audio question&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# round one&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_assistant_round_1.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# round two&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;({&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;assistant&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;})&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# keep history in msgs; list.append mutates in place and returns None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_assistant_round_2.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
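&lt;p&gt;The same pattern extends to any number of rounds: append each user turn and each assistant reply to &lt;code&gt;msgs&lt;/code&gt; in place (&lt;code&gt;list.append&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt;, so don&#39;t rebind the list to its result). A minimal sketch, reusing &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;tokenizer&lt;/code&gt;, and &lt;code&gt;sys_prompt&lt;/code&gt; from above; the audio file names are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import librosa

# Multi-turn audio chat: grow msgs round by round.
msgs = [sys_prompt]
for i, wav in enumerate(['question_1.wav', 'question_2.wav'], start=1):  # placeholder files
    audio, _ = librosa.load(wav, sr=16000, mono=True)
    msgs.append({'role': 'user', 'content': [audio]})
    res = model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_assistant_round_{i}.wav',
    )
    msgs.append({'role': 'assistant', 'content': res})  # keep the reply in history for the next round
&lt;/code&gt;&lt;/pre&gt;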
&lt;h5 id=&#34;instruction-to-speech&#34;&gt;Instruction-to-Speech &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;&lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; can also do Instruction-to-Speech, aka &lt;strong&gt;Voice Creation&lt;/strong&gt;: describe a voice in detail, and the model generates speech in a voice matching that description. For more sample instructions, see &lt;a class=&#34;link&#34; href=&#34;https://voxinstruct.github.io/VoxInstruct/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://voxinstruct.github.io/VoxInstruct/&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;instruction&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Speak like a male charming superstar, radiating confidence and style in every word.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;instruction&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_voice_creation.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
&lt;h5 id=&#34;voice-cloning&#34;&gt;Voice Cloning &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;&lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; can also do zero-shot text-to-speech, aka &lt;strong&gt;Voice Cloning&lt;/strong&gt;. In this mode, the model acts as a TTS model: given a short reference audio, it reads new text in that speaker&#39;s voice.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/input_examples/icl_20.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# load the reference audio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ref_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mode&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;voice_cloning&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;language&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;en&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;text_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Please read the text below.&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content that you want to read&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;user_question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_voice_cloning.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;hr/&gt;
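&lt;p&gt;The voice-cloning system prompt is independent of the text to be read, so it can be reused to render several strings in the same cloned voice. A minimal sketch, reusing &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;tokenizer&lt;/code&gt;, and the &lt;code&gt;sys_prompt&lt;/code&gt; built above; the texts are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Batch zero-shot TTS in the cloned voice.
texts = ['First sentence to read.', 'Second sentence to read.']  # placeholders
for i, text in enumerate(texts, start=1):
    msgs = [sys_prompt, {'role': 'user', 'content': ['Please read the text below.', text]}]
    model.chat(
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        max_new_tokens=128,
        use_tts_template=True,
        generate_audio=True,
        temperature=0.3,
        output_audio_path=f'result_voice_cloning_{i}.wav',
    )
&lt;/code&gt;&lt;/pre&gt;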
&lt;h5 id=&#34;addressing-various-audio-understanding-tasks&#34;&gt;Addressing Various Audio Understanding Tasks &lt;!-- omit in toc --&gt;
&lt;/h5&gt;&lt;p&gt;&lt;code&gt;MiniCPM-o-2.6&lt;/code&gt; can also be used to address various audio understanding tasks, such as automatic speech recognition (ASR), speaker analysis, general audio captioning, and sound scene tagging.&lt;/p&gt;
&lt;p&gt;For audio-to-text tasks, you can use the following prompts; a small lookup-table sketch follows the list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ASR with ZH (same as AST en2zh): &lt;code&gt;请仔细听这段音频片段，并将其内容逐字记录。&lt;/code&gt; (i.e., &#34;Please listen to this audio clip carefully and transcribe its content verbatim.&#34;)&lt;/li&gt;
&lt;li&gt;ASR with EN (same as AST zh2en): &lt;code&gt;Please listen to the audio snippet carefully and transcribe the content.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Speaker Analysis: &lt;code&gt;Based on the speaker&#39;s content, speculate on their gender, condition, age range, and health status.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;General Audio Caption: &lt;code&gt;Summarize the main content of the audio.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;General Sound Scene Tagging: &lt;code&gt;Utilize one keyword to convey the audio&#39;s content or the associated scene.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
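&lt;p&gt;For convenience, these prompts can be kept in a small lookup table so a task name selects the prompt. A minimal sketch; the task keys are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Task-name -&gt; prompt lookup for the audio-to-text tasks listed above.
AUDIO_TASK_PROMPTS = {
    'asr_zh': '请仔细听这段音频片段，并将其内容逐字记录。',
    'asr_en': 'Please listen to the audio snippet carefully and transcribe the content.',
    'speaker_analysis': &#34;Based on the speaker's content, speculate on their gender, condition, age range, and health status.&#34;,
    'audio_caption': 'Summarize the main content of the audio.',
    'scene_tagging': &#34;Utilize one keyword to convey the audio's content or the associated scene.&#34;,
}

task_prompt = AUDIO_TASK_PROMPTS['asr_en'] + '\n'  # equivalent to the assignment in the example below
&lt;/code&gt;&lt;/pre&gt;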
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;task_prompt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Please listen to the audio snippet carefully and transcribe the content.&amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\n&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# swap in any of the task prompts listed above&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;audio_input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/input_examples/audio_understanding.mp3&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# load the audio to be captioned&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;task_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;audio_input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;128&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;result_audio_understanding.wav&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id=&#34;multimodal-live-streaming&#34;&gt;Multimodal Live Streaming
&lt;/h4&gt;&lt;details&gt;
&lt;summary&gt; Click to view Python code running MiniCPM-o 2.6 with omni chat inference (non-streaming). &lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;35
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;36
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;37
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;38
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;39
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;40
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;41
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;42
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;43
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;44
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;45
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;46
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;47
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;48
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;49
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;50
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;51
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;52
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;53
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;54
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;55
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;56
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;57
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;58
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;59
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;60
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;61
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;62
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;63
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;64
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;65
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;66
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;67
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;68
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;69
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;70
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;71
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;72
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;73
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;74
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;75
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;math&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;moviepy.editor&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;VideoFileClip&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;tempfile&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;librosa&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;soundfile&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;get_video_chunk_content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;flatten&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;video&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;VideoFileClip&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;video_duration:&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;video&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tempfile&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;suffix&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;.wav&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;delete&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;temp_audio_file&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;temp_audio_file_path&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;temp_audio_file&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;video&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_audiofile&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;temp_audio_file_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;codec&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;pcm_s16le&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;fps&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;audio_np&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;librosa&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;temp_audio_file_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;16000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mono&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;num_units&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;math&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ceil&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;duration&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# 1 frame + 1s audio chunk&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;range&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;num_units&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;frame&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;video&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_frame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;image&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fromarray&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;((&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;frame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;uint8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;audio&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;audio_np&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sr&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;flatten&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;extend&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;lt;unit&amp;gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# &amp;#34;&amp;lt;unit&amp;gt;&amp;#34; marks each 1-second frame+audio chunk&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;lt;unit&amp;gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-o-2_6&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;attn_implementation&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sdpa&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;torch_dtype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;torch&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;bfloat16&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;cuda&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-o-2_6&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;init_tts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# On older PyTorch versions you might hit the error &amp;#34;weight_norm_fwd_first_dim_kernel&amp;#34; not implemented for &amp;#39;BFloat16&amp;#39;; if so, convert the TTS module to float32:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# model.tts.float()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/main/assets/Skiing.mp4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;assets/Skiing.mp4&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sys_msg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;get_sys_prompt&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mode&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;omni&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;language&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;en&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# if use voice clone prompt, please set ref_audio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# ref_audio_path = &amp;#39;/path/to/ref_audio&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode=&amp;#39;omni&amp;#39;, language=&amp;#39;en&amp;#39;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;get_video_chunk_content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys_msg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;msg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# please set generate_audio=True and output_audio_path to save the tts result&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;output.wav&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4096&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;omni_input&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# set omni_input=True for omni (interleaved image/audio) inference&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_tts_template&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;output_audio_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;max_slice_nums&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_image_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;return_dict&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;details&gt;
&lt;summary&gt; Click to view Python code running MiniCPM-o 2.6 with streaming inference. &lt;/summary&gt;
&lt;p&gt;Note: streaming inference incurs a slight performance degradation because the audio encoding is not global.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;35
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;36
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;37
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;38
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;39
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;40
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;41
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;42
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;43
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;44
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;45
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;46
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;47
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;48
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;49
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;50
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;51
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# a new conversation needs a session reset first; this clears the KV cache&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reset_session&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;get_video_chunk_content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;flatten&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;123&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 1. prefill system prompt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;streaming_prefill&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys_msg&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 2. prefill video/audio chunks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;content&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;contents&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;content&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;streaming_prefill&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 3. generate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;streaming_generate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;session_id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;temperature&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.5&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;audios&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;generate_audio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;audio_wav&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;audio_wav&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;sampling_rate&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sampling_rate&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;audios&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;append&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;audio_wav&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;concatenate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;audios&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output.wav&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;samplerate&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sampling_rate&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;text:&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;audio saved to output.wav&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;else&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;res&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;text&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;text:&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h3 id=&#34;inference-on-multiple-gpus&#34;&gt;Inference on Multiple GPUs
&lt;/h3&gt;&lt;p&gt;You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model&amp;rsquo;s layers across them. Please refer to this &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;tutorial&lt;/a&gt; for detailed instructions on how to load the model and run inference across multiple low-VRAM GPUs.&lt;/p&gt;
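&lt;p&gt;For quick orientation, the sketch below shows the general idea using automatic device mapping from Hugging Face accelerate; the memory caps are illustrative assumptions, and the tutorial above remains the authoritative, finer-grained recipe.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# A hedged sketch, not the official tutorial: spread the model's layers
# over two low-VRAM GPUs with accelerate's automatic device mapping
# (assumes accelerate is installed). The max_memory caps are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-Llama3-V-2_5',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',                    # let accelerate place the layers
    max_memory={0: '11GiB', 1: '11GiB'},  # leave headroom on each 12 GB card
)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()  # model.chat(...) then works as in the single-GPU examples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;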
&lt;h3 id=&#34;inference-on-mac&#34;&gt;Inference on Mac
&lt;/h3&gt;&lt;details&gt;
&lt;summary&gt;Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs).&lt;/summary&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# test.py (needs more than 16 GB of memory)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;PIL&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;transformers&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoModel&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-Llama3-V-2_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;low_cpu_mem_usage&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;to&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;device&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;mps&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;from_pretrained&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;openbmb/MiniCPM-Llama3-V-2_5&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;trust_remote_code&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;eval&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;image&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;./assets/hk_OCR.jpg&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;RGB&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;question&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;Where is this photo taken?&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[{&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;role&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;content&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;question&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;context&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;_&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;image&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;msgs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;context&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;None&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tokenizer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sampling&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;answer&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run with command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;PYTORCH_ENABLE_MPS_FALLBACK&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; python test.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;h3 id=&#34;efficient-inference-with-llamacpp-ollama-vllm&#34;&gt;Efficient Inference with llama.cpp, Ollama, vLLM
&lt;/h3&gt;&lt;p&gt;See &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;our fork of llama.cpp&lt;/a&gt; for more detail. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro + M4).&lt;/p&gt;
&lt;p&gt;See &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;our fork of Ollama&lt;/a&gt; for more detail. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro + M4).&lt;/p&gt;
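&lt;p&gt;As a hedged illustration of both CLI workflows (binary names, model files, and flags follow the forks&amp;rsquo; READMEs and may differ by version; the file paths are placeholders):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;# llama.cpp fork: multimodal CLI with a quantized model and vision projector
./llama-minicpmv-cli -m ggml-model-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image example.jpg -p &#34;What is in the image?&#34;

# Ollama fork: an image path inside the prompt is picked up automatically
ollama run minicpm-v &#34;Describe this image: ./example.jpg&#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;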
&lt;details&gt;
&lt;summary&gt; vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0; for now, use our fork to run MiniCPM-o 2.6. Click to expand. &lt;/summary&gt;
&lt;ol&gt;
&lt;li&gt;Install vLLM (&amp;gt;=0.7.1):&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install vllm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;Run Example:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.vllm.ai/en/latest/getting_started/examples/vision_language.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Vision Language&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.vllm.ai/en/latest/getting_started/examples/audio_language.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Audio Language&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
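&lt;p&gt;As a hedged sketch of the offline API (the model name, image placeholder, and sampling values follow the linked examples and are assumptions rather than the only supported form):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal offline-inference sketch (assumes vLLM &amp;gt;= 0.7); the
# (&amp;lt;image&amp;gt;./&amp;lt;/image&amp;gt;) placeholder follows the MiniCPM-V examples.
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = 'openbmb/MiniCPM-V-2_6'
llm = LLM(model=MODEL, trust_remote_code=True, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

# Build a chat prompt that contains the image placeholder.
messages = [{'role': 'user', 'content': '(&amp;lt;image&amp;gt;./&amp;lt;/image&amp;gt;)\nWhat is in this image?'}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open('example.jpg').convert('RGB')  # illustrative input path
outputs = llm.generate(
    {'prompt': prompt, 'multi_modal_data': {'image': image}},
    sampling_params=SamplingParams(temperature=0.5, max_tokens=256),
)
print(outputs[0].outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;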
&lt;h2 id=&#34;fine-tuning&#34;&gt;Fine-tuning
&lt;/h2&gt;&lt;h3 id=&#34;simple-fine-tuning&#34;&gt;Simple Fine-tuning &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;./finetune/readme.md&#34; &gt;Reference Document&lt;/a&gt;&lt;/p&gt;
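&lt;p&gt;For orientation only, here is a generic LoRA sketch with Hugging Face PEFT; it is a hypothetical illustration (the target module names are assumptions), and the reference document above remains the authoritative recipe.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Generic PEFT/LoRA sketch, not the repo's finetune script; the
# target_modules names are assumptions, not the officially tuned set.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # assumed names
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights should train
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;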
&lt;h3 id=&#34;with-align-anything&#34;&gt;With Align-Anything &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;The PKU-Alignment Team supports fine-tuning MiniCPM-o 2.6 (both vision and audio, with SFT and DPO) through the &lt;a class=&#34;link&#34; href=&#34;https://github.com/PKU-Alignment/align-anything&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Align-Anything framework&lt;/a&gt;. Align-Anything is a scalable framework that aims to align any-modality large models with human intentions, open-sourcing its &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/PKU-Alignment/align-anything&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;datasets, models and benchmarks&lt;/a&gt;. Benefiting from its concise and modular design, it supports 30+ open-source benchmarks, 40+ models, and algorithms including SFT, SimPO, RLHF, &lt;em&gt;etc&lt;/em&gt;. It also provides 30+ directly runnable scripts, making it easy for beginners to get started quickly.&lt;/p&gt;
&lt;p&gt;Best Practices: &lt;a class=&#34;link&#34; href=&#34;https://github.com/PKU-Alignment/align-anything/tree/main/scripts&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-o 2.6&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;with-llama-factory&#34;&gt;With LLaMA-Factory &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory lets you flexibly customize fine-tuning (LoRA/Full/QLoRA) of 200+ LLMs without writing code, through its built-in web UI, LLaMA Board. It supports training methods such as SFT/PPO/DPO/KTO and advanced algorithms such as GaLore/BAdam/LLaMA-Pro/PiSSA/LongLoRA.&lt;/p&gt;
&lt;p&gt;Best Practices: &lt;a class=&#34;link&#34; href=&#34;./docs/llamafactory_train_and_infer.md&#34; &gt;MiniCPM-o 2.6 | MiniCPM-V 2.6&lt;/a&gt;.&lt;/p&gt;
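&lt;p&gt;As a hedged pointer (assuming LLaMA-Factory is installed per its README), the web UI from which these fine-tuning runs are configured is launched with a single command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;# assumes LLaMA-Factory is installed per its README
llamafactory-cli webui   # opens LLaMA Board for no-code LoRA/Full/QLoRA setup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;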
&lt;h3 id=&#34;with-the-swift-framework&#34;&gt;With the SWIFT Framework &lt;!-- omit in toc --&gt;
&lt;/h3&gt;&lt;p&gt;We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapter library, including techniques such as NEFTune, LoRA+ and LLaMA-Pro.&lt;/p&gt;
&lt;p&gt;Best Practices: &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v%e6%9c%80%e4%bd%b3%e5%ae%9e%e8%b7%b5.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V 1.0&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2%e6%9c%80%e4%bd%b3%e5%ae%9e%e8%b7%b5.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V 2.0&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/modelscope/ms-swift/issues/1613&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V 2.6&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;awesome-work-using-minicpm-v--minicpm-o&#34;&gt;Awesome work using MiniCPM-V &amp;amp; MiniCPM-o
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/CatchTheTornado/text-extract-api&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;text-extract-api&lt;/a&gt;: Document extraction API using OCR and Ollama-supported models &lt;img src=&#34;https://img.shields.io/github/stars/CatchTheTornado/text-extract-api&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/heshengtao/comfyui_LLM_party&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;comfyui_LLM_party&lt;/a&gt;: Build LLM workflows and integrate them into existing image workflows &lt;img src=&#34;https://img.shields.io/github/stars/heshengtao/comfyui_LLM_party&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/imanoop7/Ollama-OCR&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama-OCR&lt;/a&gt;: An OCR package that uses VLMs through Ollama to extract text from images and PDFs &lt;img src=&#34;https://img.shields.io/github/stars/imanoop7/Ollama-OCR&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/MixLabPro/comfyui-mixlab-nodes&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;comfyui-mixlab-nodes&lt;/a&gt;: A ComfyUI node suite supporting Workflow-to-App, GPT&amp;amp;3D and more &lt;img src=&#34;https://img.shields.io/github/stars/MixLabPro/comfyui-mixlab-nodes&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/HumanAIGC-Engineering/OpenAvatarChat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAvatarChat&lt;/a&gt;: An interactive digital-human conversation implementation that runs on a single PC &lt;img src=&#34;https://img.shields.io/github/stars/HumanAIGC-Engineering/OpenAvatarChat&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/arkohut/pensieve&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;pensieve&lt;/a&gt;: A privacy-focused passive recording project that captures screen content &lt;img src=&#34;https://img.shields.io/github/stars/arkohut/pensieve&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/icereed/paperless-gpt&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;paperless-gpt&lt;/a&gt;: Use LLMs to handle paperless-ngx with AI-powered titles, tags and OCR &lt;img src=&#34;https://img.shields.io/github/stars/icereed/paperless-gpt&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/kimjammer/Neuro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Neuro&lt;/a&gt;: A recreation of Neuro-Sama, but running on local models on consumer hardware &lt;img src=&#34;https://img.shields.io/github/stars/kimjammer/Neuro&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;faqs&#34;&gt;FAQs
&lt;/h2&gt;&lt;p&gt;Click here to view the &lt;a class=&#34;link&#34; href=&#34;./docs/faqs.md&#34; &gt;FAQs&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;limitations&#34;&gt;Limitations
&lt;/h2&gt;&lt;p&gt;As an experimental release, MiniCPM-o 2.6 has notable limitations that merit further investigation and improvement.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unstable speech output.&lt;/strong&gt; Speech generation can be flawed, producing noisy backgrounds and meaningless sounds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeated responses.&lt;/strong&gt; The model tends to repeat its response when it encounters similar consecutive user queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High latency on the web demo.&lt;/strong&gt; Users may experience unusually high latency when using the web demo hosted on overseas servers. We recommend deploying the demo locally or using it over a good network connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;model-license&#34;&gt;Model License &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;This repository is released under the &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Apache-2.0&lt;/a&gt; License.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The usage of MiniCPM-o/V model weights must strictly follow &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM Model License.md&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The models and weights of MiniCPM are completely free for academic research. After filling out a &lt;a class=&#34;link&#34; href=&#34;https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&amp;ldquo;questionnaire&amp;rdquo;&lt;/a&gt; for registration, they are also available for free commercial use.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;statement&#34;&gt;Statement &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;p&gt;As MLLMs, MiniCPM-o/V models generate content by learning from a large number of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o/V models does not represent the views and positions of the model developers.&lt;/p&gt;
&lt;p&gt;We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or improper use of the models.&lt;/p&gt;
&lt;h2 id=&#34;institutions&#34;&gt;Institutions  &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;p&gt;This project is developed by the following institutions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;img src=&#34;assets/thunlp.png&#34; width=&#34;28px&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://nlp.csai.tsinghua.edu.cn/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;THUNLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;img src=&#34;assets/modelbest.png&#34; width=&#34;28px&#34;&gt; &lt;a class=&#34;link&#34; href=&#34;https://modelbest.cn/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ModelBest&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;-star-history&#34;&gt;🌟 Star History &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;p align=&#34;center&#34;&gt;
  &lt;img src=&#34;assets/star-history-25-09-02.png&#34;/&gt;
&lt;/p&gt;
&lt;h2 id=&#34;key-techniques-and-other-multimodal-projects&#34;&gt;Key Techniques and Other Multimodal Projects &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;p&gt;👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/VisCPM/tree/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VisCPM&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/RLPR&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLPR&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLHF-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLHF-V&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/thunlp/LLaVA-UHD&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA-UHD&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/RLHF-V/RLAIF-V&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLAIF-V&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation &lt;!-- omit in toc --&gt;
&lt;/h2&gt;&lt;p&gt;If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bib&#34; data-lang=&#34;bib&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@article&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;yao2024minicpm&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{MiniCPM-V: A GPT-4V Level MLLM on Your Phone}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;journal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{arXiv preprint arXiv:2408.01800}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{2024}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>Hands-On-Large-Language-Models</title>
        <link>https://producthunt.programnotes.cn/en/p/hands-on-large-language-models/</link>
        <pubDate>Wed, 27 Aug 2025 15:29:45 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/hands-on-large-language-models/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1733939910552-7752db0c03d0?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTYyNzk2MTd8&amp;ixlib=rb-4.1.0" alt="Featured image of post Hands-On-Large-Language-Models" /&gt;&lt;h1 id=&#34;handsonllmhands-on-large-language-models&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/HandsOnLLM/Hands-On-Large-Language-Models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HandsOnLLM/Hands-On-Large-Language-Models&lt;/a&gt;
&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/jalammar/&#34;&gt;&lt;img src=&#34;https://img.shields.io/badge/Follow%20Jay-blue.svg?logo=linkedin&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://www.linkedin.com/in/mgrootendorst/&#34;&gt;&lt;img src=&#34;https://img.shields.io/badge/Follow%20Maarten-blue.svg?logo=linkedin&#34;&gt;&lt;/a&gt;
&lt;a href=&#34;https://www.deeplearning.ai/short-courses/how-transformer-llms-work/?utm_campaign=handsonllm-launch&amp;utm_medium=partner&#34;&gt;&lt;img src=&#34;https://img.shields.io/badge/DeepLearning.AI%20Course-NEW!-&amp;labelColor=black&amp;color=red.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAuMDAwMzY1MjgxIC0wLjAwMDE0MDE0MiAzMy4yOSAzMy4xNSI+Cgk8cGF0aCBkPSJNMTYuNjQzIDMzLjE0NWMtMy4yOTIgMC02LjUxLS45NzItOS4yNDYtMi43OTNhMTYuNTg4IDE2LjU4OCAwIDAxLTYuMTMtNy40MzhBMTYuNTA3IDE2LjUwNyAwIDAxLjMyIDEzLjM0YTE2LjU1IDE2LjU1IDAgMDE0LjU1NS04LjQ4NUExNi42NjUgMTYuNjY1IDAgMDExMy4zOTYuMzE4YTE2LjcxIDE2LjcxIDAgMDE5LjYxNi45NDQgMTYuNjI4IDE2LjYyOCAwIDAxNy40NyA2LjEwMyAxNi41MjIgMTYuNTIyIDAgMDEyLjgwNCA5LjIwN2MwIDQuMzk2LTEuNzUzIDguNjEtNC44NzQgMTEuNzE5YTE2LjY4IDE2LjY4IDAgMDEtMTEuNzY5IDQuODU0em0uMTI1LTYuNjI4YzYuOTA2IDAgMTIuNTE3LTUuNjk4IDEyLjUxNy0xMi43MyAwLTcuMDMtNS42MS0xMi43MjUtMTIuNTE3LTEyLjcyNS02LjkwNiAwLTEyLjUxNyA1LjY5OC0xMi41MTcgMTIuNzI1IDAgNy4wMjcgNS42MTEgMTIuNzMgMTIuNTE3IDEyLjczem0tLjEyNS0yLjkxOGMtNi4yODkgMC0xMS4zODYtNC45MjUtMTEuMzg2LTExLjAwMkM1LjI1NyA2LjUyIDEwLjM2IDEuNTkgMTYuNjQzIDEuNTljNi4yODQgMCAxMS4zODYgNC45MyAxMS4zODYgMTEuMDA3cy01LjA5NyAxMS4wMDItMTEuMzg2IDExLjAwMnptLS4yNDItNC41MDhjNC43NyAwIDguNjMzLTMuNjc5IDguNjMzLTguMjE4IDAtNC41MzgtMy44ODUtOC4yMjEtOC42MzMtOC4yMjEtNC43NDcgMC04LjYzMiAzLjY3OS04LjYzMiA4LjIyMSAwIDQuNTQzIDMuODg1IDguMjE4IDguNjMyIDguMjE4eiIgZmlsbD0iI0ZENEE2MSIvPgo8L3N2Zz4=&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Welcome! In this repository you will find the code for all examples throughout the book &lt;a class=&#34;link&#34; href=&#34;https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hands-On Large Language Models&lt;/a&gt; written by &lt;a class=&#34;link&#34; href=&#34;https://www.linkedin.com/in/jalammar/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Jay Alammar&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://www.linkedin.com/in/mgrootendorst/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Maarten Grootendorst&lt;/a&gt; which we playfully dubbed: &lt;br&gt;&lt;/p&gt;
&lt;p align=&#34;center&#34;&gt;&lt;b&gt;&lt;i&gt;&#34;The Illustrated LLM Book&#34;&lt;/i&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Through the visually educational nature of this book and with &lt;strong&gt;almost 300 custom-made figures&lt;/strong&gt;, learn the practical tools and concepts you need to use Large Language Models today!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961&#34;&gt;&lt;img src=&#34;images/book_cover.png&#34; width=&#34;50%&#34; &gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;The book is available on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Amazon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.shroffpublishers.com/books/computer-science/large-language-models/9789355425522/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Shroff Publishers (India)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;O&amp;rsquo;Reilly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.amazon.com/Hands-Large-Language-Models-Alammar-ebook/dp/B0DGZ46G88/ref=tmm_kin_swatch_0?_encoding=UTF8&amp;amp;qid=&amp;amp;sr=&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kindle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.barnesandnoble.com/w/hands-on-large-language-models-jay-alammar/1145185960&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Barnes and Noble&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.goodreads.com/book/show/210408850-hands-on-large-language-models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Goodreads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;table-of-contents&#34;&gt;Table of Contents
&lt;/h2&gt;&lt;p&gt;We advise running all examples through Google Colab for the easiest setup. Google Colab lets you use a T4 GPU with 16 GB of VRAM for free. All examples were mainly built and tested using Google Colab, so it should be the most stable platform. However, any other cloud provider should work as well.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chapter&lt;/th&gt;
          &lt;th&gt;Notebook&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 1: Introduction to Language Models&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter01/Chapter%201%20-%20Introduction%20to%20Language%20Models.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 2: Tokens and Embeddings&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 3: Looking Inside Transformer LLMs&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 4: Text Classification&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 5: Text Clustering and Topic Modeling&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter05/Chapter%205%20-%20Text%20Clustering%20and%20Topic%20Modeling.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 6: Prompt Engineering&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter06/Chapter%206%20-%20Prompt%20Engineering.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 7: Advanced Text Generation Techniques and Tools&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter07/Chapter%207%20-%20Advanced%20Text%20Generation%20Techniques%20and%20Tools.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 8: Semantic Search and Retrieval-Augmented Generation&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter08/Chapter%208%20-%20Semantic%20Search.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 9: Multimodal Large Language Models&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter09/Chapter%209%20-%20Multimodal%20Large%20Language%20Models.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 10: Creating Text Embedding Models&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter10/Chapter%2010%20-%20Creating%20Text%20Embedding%20Models.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 11: Fine-tuning Representation Models for Classification&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter11/Chapter%2011%20-%20Fine-Tuning%20BERT.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chapter 12: Fine-tuning Generation Models&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter12/Chapter%2012%20-%20Fine-tuning%20Generation%20Models.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open In Colab&#34;
	
	
&gt;&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
You can check the &lt;a class=&#34;link&#34; href=&#34;.setup/&#34; &gt;setup&lt;/a&gt; folder for a quick-start guide to installing all packages locally, and the &lt;a class=&#34;link&#34; href=&#34;.setup/conda/&#34; &gt;conda&lt;/a&gt; folder for a complete guide on setting up your environment, including conda and PyTorch installation.
Note that depending on your OS, Python version, and dependencies, your results might differ slightly; however, they
should be similar to the examples in the book. A quick environment sanity check is sketched below.&lt;/p&gt;
&lt;/blockquote&gt;
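&lt;p&gt;For example, a minimal check along these lines (a sketch; it assumes &lt;code&gt;torch&lt;/code&gt; and &lt;code&gt;transformers&lt;/code&gt; were installed following the setup guide) confirms the environment and shows whether a GPU is visible:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal environment sanity check (a sketch; assumes torch and
# transformers were installed per the setup guide).
import torch
import transformers

print(&#34;torch:&#34;, torch.__version__)
print(&#34;transformers:&#34;, transformers.__version__)
print(&#34;CUDA available:&#34;, torch.cuda.is_available())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;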
&lt;h2 id=&#34;reviews&#34;&gt;Reviews
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;&lt;em&gt;Jay and Maarten have continued their tradition of providing beautifully illustrated and insightful descriptions of complex topics in their new book. Bolstered with working code, timelines, and references to key papers, their book is a valuable resource for anyone looking to understand the main techniques behind how Large Language Models are built.&lt;/em&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Andrew Ng&lt;/strong&gt; - founder of &lt;a class=&#34;link&#34; href=&#34;https://www.deeplearning.ai/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepLearning.AI&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;&lt;em&gt;This is an exceptional guide to the world of language models and their practical applications in industry. Its highly-visual coverage of generative, representational, and retrieval applications of language models empowers readers to quickly understand, use, and refine LLMs. Highly recommended!&lt;/em&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Nils Reimers&lt;/strong&gt; - Director of Machine Learning at Cohere | creator of &lt;a class=&#34;link&#34; href=&#34;https://github.com/UKPLab/sentence-transformers&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;sentence-transformers&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;&lt;em&gt;I can’t think of another book that is more important to read right now. On every single page, I learned something that is critical to success in this era of language models.&lt;/em&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Josh Starmer&lt;/strong&gt; - &lt;a class=&#34;link&#34; href=&#34;https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StatQuest&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;&lt;em&gt;If you’re looking to get up to speed in everything regarding LLMs, look no further! In this wonderful book, Jay and Maarten will take you from zero to expert in the history and latest advances in large language models. With very intuitive explanations, great real-life examples, clear illustrations, and comprehensive code labs, this book lifts the curtain on the complexities of transformer models, tokenizers, semantic search, RAG, and many other cutting-edge technologies. A must read for anyone interested in the latest AI technology!&lt;/em&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Luis Serrano, PhD&lt;/strong&gt; - Founder and CEO of &lt;a class=&#34;link&#34; href=&#34;https://www.youtube.com/@SerranoAcademy&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Serrano Academy&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;&lt;em&gt;Hands-On Large Language Models brings clarity and practical examples to cut through the hype of AI. It provides a wealth of great diagrams and visual aids to supplement the clear explanations. The worked examples and code make concrete what other books leave abstract. The book starts with simple introductory beginnings, and steadily builds in scope. By the final chapters, you will be fine-tuning and building your own large language models with confidence.&lt;/em&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Leland McInnes&lt;/strong&gt; - Researcher at the Tutte Institute for Mathematics and Computing | creator of &lt;a class=&#34;link&#34; href=&#34;https://github.com/lmcinnes/umap&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;UMAP&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/scikit-learn-contrib/hdbscan&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HDBSCAN&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id=&#34;bonus-content&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;bonus/&#34; &gt;Bonus content!&lt;/a&gt;
&lt;/h2&gt;&lt;p&gt;We attempted to pack as much information into the book as possible without it being overwhelming. However, even with a 400-page book there is still much to discover!&lt;/p&gt;
&lt;p&gt;We continue to create more guides that complement the book and go more in-depth into new and &lt;a class=&#34;link&#34; href=&#34;bonus/&#34; &gt;exciting topics&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;A Visual Guide to Mamba&lt;/a&gt;&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;A Visual Guide to Quantization&lt;/a&gt;&lt;/th&gt;
          &lt;th style=&#34;text-align: center&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://jalammar.github.io/illustrated-stable-diffusion/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;The Illustrated Stable Diffusion&lt;/a&gt;&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/mamba.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/quant.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/diffusion.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;A Visual Guide to Mixture of Experts&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;A Visual Guide to Reasoning LLMs&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;The Illustrated DeepSeek-R1&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/moe.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/reasoning.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: center&#34;&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/images/deepseek.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
	
&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;p&gt;Please consider citing the book if you find it useful for your research:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-fallback&#34; data-lang=&#34;fallback&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;@book{hands-on-llms-book,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  author       = {Jay Alammar and Maarten Grootendorst},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  title        = {Hands-On Large Language Models},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  publisher    = {O&amp;#39;Reilly},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  year         = {2024},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  isbn         = {978-1098150969},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  url          = {https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/},
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  github       = {https://github.com/HandsOnLLM/Hands-On-Large-Language-Models}
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
        </item>
        <item>
        <title>anthropic-cookbook</title>
        <link>https://producthunt.programnotes.cn/en/p/anthropic-cookbook/</link>
        <pubDate>Sat, 21 Jun 2025 15:28:31 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/anthropic-cookbook/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1681055543029-8398bcd49519?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NTA0OTA4MDd8&amp;ixlib=rb-4.1.0" alt="Featured image of post anthropic-cookbook" /&gt;&lt;h1 id=&#34;anthropicsanthropic-cookbook&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;anthropics/anthropic-cookbook&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;anthropic-cookbook&#34;&gt;Anthropic Cookbook
&lt;/h1&gt;&lt;p&gt;The Anthropic Cookbook provides code and guides designed to help developers build with Claude, offering copy-able code snippets that you can easily integrate into your own projects.&lt;/p&gt;
&lt;h2 id=&#34;prerequisites&#34;&gt;Prerequisites
&lt;/h2&gt;&lt;p&gt;To make the most of the examples in this cookbook, you&amp;rsquo;ll need an Anthropic API key (sign up for free &lt;a class=&#34;link&#34; href=&#34;https://www.anthropic.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;While the code examples are primarily written in Python, the concepts can be adapted to any programming language that supports interaction with the Anthropic API.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re new to working with the Anthropic API, we recommend starting with our &lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/courses/tree/master/anthropic_api_fundamentals&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Anthropic API Fundamentals course&lt;/a&gt; to get a solid foundation.&lt;/p&gt;
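&lt;p&gt;To give a sense of the style of the recipes, a first call to the Messages API might look like the minimal sketch below (not a cookbook recipe verbatim; the model name is an assumption, and the key is read from the &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; environment variable):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# A minimal sketch of calling the Anthropic Messages API.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
# The model name is an assumption; use any model you have access to.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
message = client.messages.create(
    model=&#34;claude-3-5-sonnet-latest&#34;,
    max_tokens=256,
    messages=[{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Summarize RAG in one sentence.&#34;}],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;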
&lt;h2 id=&#34;explore-further&#34;&gt;Explore Further
&lt;/h2&gt;&lt;p&gt;Looking for more resources to enhance your experience with Claude and AI assistants? Check out these helpful links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.anthropic.com/claude/docs/guide-to-anthropics-prompt-engineering-resources&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Anthropic developer documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://support.anthropic.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Anthropic support docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.anthropic.com/discord&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Anthropic Discord community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;p&gt;The Anthropic Cookbook thrives on the contributions of the developer community. We value your input, whether it&amp;rsquo;s submitting an idea, fixing a typo, adding a new guide, or improving an existing one. By contributing, you help make this resource even more valuable for everyone.&lt;/p&gt;
&lt;p&gt;To avoid duplication of efforts, please review the existing issues and pull requests before contributing.&lt;/p&gt;
&lt;p&gt;If you have ideas for new examples or guides, share them on the &lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/issues&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;issues page&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;table-of-recipes&#34;&gt;Table of recipes
&lt;/h2&gt;&lt;h3 id=&#34;skills&#34;&gt;Skills
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/skills/classification&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Classification&lt;/a&gt;: Explore techniques for text and data classification using Claude.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/skills/retrieval_augmented_generation&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Retrieval Augmented Generation&lt;/a&gt;: Learn how to enhance Claude&amp;rsquo;s responses with external knowledge (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/skills/summarization&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Summarization&lt;/a&gt;: Discover techniques for effective text summarization with Claude.&lt;/li&gt;
&lt;/ul&gt;
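&lt;p&gt;As promised above, a deliberately simple retrieval-augmented sketch: pick the most relevant passage with a toy scoring function, then pass it to Claude as context. The word-overlap scorer is a hypothetical stand-in for the embedding-based search the actual recipes use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# A toy RAG loop (a sketch). Word overlap stands in for a real
# embedding-based retriever; the recipes use proper vector search.
import anthropic

docs = [
    &#34;Claude supports tool use via the Messages API.&#34;,
    &#34;Retrieval-augmented generation grounds answers in external documents.&#34;,
]

def score(query: str, doc: str) -&gt; int:
    # Hypothetical stand-in for cosine similarity over embeddings.
    return len(set(query.lower().split()) &amp; set(doc.lower().split()))

query = &#34;How does retrieval-augmented generation work?&#34;
context = max(docs, key=lambda d: score(query, d))

client = anthropic.Anthropic()
reply = client.messages.create(
    model=&#34;claude-3-5-sonnet-latest&#34;,  # assumed model name
    max_tokens=256,
    messages=[{&#34;role&#34;: &#34;user&#34;,
               &#34;content&#34;: f&#34;Context: {context}\n\nQuestion: {query}&#34;}],
)
print(reply.content[0].text)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;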
&lt;h3 id=&#34;tool-use-and-integration&#34;&gt;Tool Use and Integration
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/tool_use&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Tool use&lt;/a&gt;: Learn how to integrate Claude with external tools and functions to extend its capabilities (see the sketch after this list).
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/customer_service_agent.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Customer service agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/calculator_tool.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Calculator integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/how_to_make_sql_queries.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SQL queries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
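&lt;p&gt;The shape of a tool definition is worth seeing once: you describe a tool with a JSON schema, and Claude may answer with a &lt;code&gt;tool_use&lt;/code&gt; block whose input you then execute yourself. A minimal sketch (the tool name and schema are illustrative, not the notebooks&amp;rsquo; exact code):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# A minimal tool-use sketch (illustrative; see the notebooks for full loops).
import anthropic

client = anthropic.Anthropic()
tools = [{
    &#34;name&#34;: &#34;calculator&#34;,  # illustrative tool name
    &#34;description&#34;: &#34;Evaluate a basic arithmetic expression.&#34;,
    &#34;input_schema&#34;: {
        &#34;type&#34;: &#34;object&#34;,
        &#34;properties&#34;: {&#34;expression&#34;: {&#34;type&#34;: &#34;string&#34;}},
        &#34;required&#34;: [&#34;expression&#34;],
    },
}]

response = client.messages.create(
    model=&#34;claude-3-5-sonnet-latest&#34;,  # assumed model name
    max_tokens=256,
    tools=tools,
    messages=[{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;What is 17 * 23?&#34;}],
)
# If Claude chose to call the tool, a tool_use block appears in the content.
for block in response.content:
    if block.type == &#34;tool_use&#34;:
        print(block.name, block.input)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;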
&lt;h3 id=&#34;third-party-integrations&#34;&gt;Third-Party Integrations
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/third_party&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Retrieval augmented generation&lt;/a&gt;: Supplement Claude&amp;rsquo;s knowledge with external data sources.
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/Pinecone/rag_using_pinecone.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Vector databases (Pinecone)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/Wikipedia/wikipedia-search-cookbook.ipynb/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/read_web_pages_with_haiku.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Web pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/Brave/web_search_using_brave.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Internet search (Brave)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/VoyageAI/how_to_create_embeddings.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Embeddings with Voyage AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;multimodal-capabilities&#34;&gt;Multimodal Capabilities
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/tree/main/multimodal&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Vision with Claude&lt;/a&gt; (a sketch follows this list):
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/getting_started_with_vision.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Getting started with images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/best_practices_for_vision.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Best practices for vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/reading_charts_graphs_powerpoints.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Interpreting charts and graphs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/how_to_transcribe_text.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Extracting content from forms&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/illustrated_responses.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Generate images with Claude&lt;/a&gt;: Use Claude with Stable Diffusion for image generation.&lt;/li&gt;
&lt;/ul&gt;
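&lt;p&gt;Images are passed to Claude as base64-encoded content blocks alongside text. A minimal sketch (the file path is a placeholder, and &lt;code&gt;media_type&lt;/code&gt; must match the actual file):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# A minimal vision sketch: one image plus a question.
# &#34;photo.jpg&#34; is a placeholder path.
import base64
import anthropic

with open(&#34;photo.jpg&#34;, &#34;rb&#34;) as f:
    image_b64 = base64.b64encode(f.read()).decode(&#34;utf-8&#34;)

client = anthropic.Anthropic()
message = client.messages.create(
    model=&#34;claude-3-5-sonnet-latest&#34;,  # assumed model name
    max_tokens=256,
    messages=[{
        &#34;role&#34;: &#34;user&#34;,
        &#34;content&#34;: [
            {&#34;type&#34;: &#34;image&#34;,
             &#34;source&#34;: {&#34;type&#34;: &#34;base64&#34;,
                        &#34;media_type&#34;: &#34;image/jpeg&#34;,
                        &#34;data&#34;: image_b64}},
            {&#34;type&#34;: &#34;text&#34;, &#34;text&#34;: &#34;Describe this image in one sentence.&#34;},
        ],
    }],
)
print(message.content[0].text)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;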
&lt;h3 id=&#34;advanced-techniques&#34;&gt;Advanced Techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/using_sub_agents.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Sub-agents&lt;/a&gt;: Learn how to use Haiku as a sub-agent in combination with Opus.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/pdf_upload_summarization.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Upload PDFs to Claude&lt;/a&gt;: Parse and pass PDFs as text to Claude.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_evals.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Automated evaluations&lt;/a&gt;: Use Claude to automate the prompt evaluation process.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/how_to_enable_json_mode.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Enable JSON mode&lt;/a&gt;: Ensure consistent JSON output from Claude (see the prefill sketch after this list).&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_moderation_filter.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Create a moderation filter&lt;/a&gt;: Use Claude to create a content moderation filter for your application.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Prompt caching&lt;/a&gt;: Learn techniques for efficient prompt caching with Claude.&lt;/li&gt;
&lt;/ul&gt;
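&lt;p&gt;The idea behind the JSON-mode recipe is prefilling: start the assistant turn with an opening brace so the model must continue a JSON object. A compact sketch of that trick:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Sketch of the prefill trick behind consistent JSON output:
# the assistant turn is prefilled with &#34;{&#34;, so the reply continues the object.
import json
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model=&#34;claude-3-5-sonnet-latest&#34;,  # assumed model name
    max_tokens=256,
    messages=[
        {&#34;role&#34;: &#34;user&#34;,
         &#34;content&#34;: &#34;Give the capital and population of France as JSON &#34;
                    &#34;with keys &#39;capital&#39; and &#39;population&#39;.&#34;},
        {&#34;role&#34;: &#34;assistant&#34;, &#34;content&#34;: &#34;{&#34;},  # prefill
    ],
)
data = json.loads(&#34;{&#34; + response.content[0].text)
print(data[&#34;capital&#34;])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;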
&lt;h2 id=&#34;additional-resources&#34;&gt;Additional Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/aws-samples/anthropic-on-aws&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Anthropic on AWS&lt;/a&gt;: Explore examples and solutions for using Claude on AWS infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/aws-samples/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AWS Samples&lt;/a&gt;: A collection of code samples from AWS which can be adapted for use with Claude. Note that some samples may require modification to work optimally with Claude.&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>LLaMA-Factory</title>
        <link>https://producthunt.programnotes.cn/en/p/llama-factory/</link>
        <pubDate>Tue, 27 May 2025 15:31:11 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/llama-factory/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1680153527310-1a70b47af6e9?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NDgzMzA5MjJ8&amp;ixlib=rb-4.1.0" alt="Featured image of post LLaMA-Factory" /&gt;&lt;h1 id=&#34;hiyougallama-factory&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;hiyouga/LLaMA-Factory&lt;/a&gt;
&lt;/h1&gt;&lt;p&gt;&lt;img src=&#34;https://producthunt.programnotes.cn/assets/logo.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;# LLaMA Factory&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/stargazers&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/stars/hiyouga/LLaMA-Factory?style=social&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Repo stars&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/commits/main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/last-commit/hiyouga/LLaMA-Factory&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub last commit&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/graphs/contributors&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/github/contributors/hiyouga/LLaMA-Factory?color=orange&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub contributors&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/hiyouga/LLaMA-Factory/actions/workflows/tests.yml/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub workflow&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://pypi.org/project/llamafactory/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/pypi/v/llamafactory&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PyPI&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://scholar.google.com/scholar?cites=12620864006390196564&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/citation-476-green&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Citation&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/pulls&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/PRs-welcome-blue&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub pull request&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://twitter.com/llamafactory_ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/twitter/follow/llamafactory_ai&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Twitter&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://discord.gg/rKfvV9r9FK&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://dcbadge.vercel.app/api/server/rKfvV9r9FK?compact=true&amp;amp;style=flat&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Discord&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://gitcode.com/zhengyaowei/LLaMA-Factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://gitcode.com/zhengyaowei/LLaMA-Factory/star/badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitCode&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://colab.research.google.com/assets/colab-badge.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open in Colab&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://gallery.pai-ml.com/assets/open-in-dsw.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Open in DSW&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/hiyouga/LLaMA-Board&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/%f0%9f%a4%97-Open%20in%20Spaces-blue&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Spaces&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/studios/hiyouga/LLaMA-Board&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/ModelScope-Open%20in%20Studios-blue&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Studios&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://aws.amazon.com/cn/blogs/machine-learning/how-apoidea-group-enhances-visual-information-extraction-from-banking-documents-with-multimodal-models-using-llama-factory-on-amazon-sagemaker-hyperpod/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/SageMaker-Open%20in%20AWS-blue&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;SageMaker&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;used-by-amazon-nvidia-aliyun-etc&#34;&gt;Used by &lt;a class=&#34;link&#34; href=&#34;https://aws.amazon.com/cn/blogs/machine-learning/how-apoidea-group-enhances-visual-information-extraction-from-banking-documents-with-multimodal-models-using-llama-factory-on-amazon-sagemaker-hyperpod/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Amazon&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://developer.nvidia.com/rtx/ai-toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://help.aliyun.com/zh/pai/use-cases/fine-tune-a-llama-3-model-with-llama-factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Aliyun&lt;/a&gt;, etc.
&lt;/h3&gt;&lt;div align=&#34;center&#34; markdown=&#34;1&#34;&gt;
&lt;h3 id=&#34;supporters-&#34;&gt;Supporters ❤️
&lt;/h3&gt;&lt;a href=&#34;https://warp.dev/llama-factory&#34;&gt;
    &lt;img alt=&#34;Warp sponsorship&#34; width=&#34;400&#34; src=&#34;https://github.com/user-attachments/assets/ab8dd143-b0fd-4904-bdc5-dd7ecac94eae&#34;&gt;
&lt;/a&gt;
&lt;h4 id=&#34;warp-the-agentic-terminal-for-developers&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://warp.dev/llama-factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Warp, the agentic terminal for developers&lt;/a&gt;
&lt;/h4&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://warp.dev/llama-factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Available for macOS, Linux, &amp;amp; Windows&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;easily-fine-tune-100-large-language-models-with-zero-code-cli-and-web-ui&#34;&gt;Easily fine-tune 100+ large language models with zero-code &lt;a class=&#34;link&#34; href=&#34;#quickstart&#34; &gt;CLI&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;#fine-tuning-with-llama-board-gui-powered-by-gradio&#34; &gt;Web UI&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;&lt;img src=&#34;https://trendshift.io/api/badge/repositories/4535&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;GitHub Trend&#34;
	
	
&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;👋 Join our &lt;a class=&#34;link&#34; href=&#34;assets/wechat.jpg&#34; &gt;WeChat&lt;/a&gt; or &lt;a class=&#34;link&#34; href=&#34;assets/wechat_npu.jpg&#34; &gt;NPU user group&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;[ English | &lt;a class=&#34;link&#34; href=&#34;README_zh.md&#34; &gt;Chinese (中文)&lt;/a&gt; ]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Fine-tuning a large language model can be as easy as&amp;hellip;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Choose your path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;https://llamafactory.readthedocs.io/en/latest/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://llamafactory.readthedocs.io/en/latest/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Colab (free)&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Local machine&lt;/strong&gt;: Please refer to &lt;a class=&#34;link&#34; href=&#34;#getting-started&#34; &gt;usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PAI-DSW (free trial)&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
Apart from the links above, all other websites are unauthorized third-party websites; please use them with caution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;table-of-contents&#34;&gt;Table of Contents
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#features&#34; &gt;Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#blogs&#34; &gt;Blogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#changelog&#34; &gt;Changelog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#supported-models&#34; &gt;Supported Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#supported-training-approaches&#34; &gt;Supported Training Approaches&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#provided-datasets&#34; &gt;Provided Datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#requirement&#34; &gt;Requirement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#getting-started&#34; &gt;Getting Started&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#installation&#34; &gt;Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#data-preparation&#34; &gt;Data Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#quickstart&#34; &gt;Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#fine-tuning-with-llama-board-gui-powered-by-gradio&#34; &gt;Fine-Tuning with LLaMA Board GUI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#build-docker&#34; &gt;Build Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#deploy-with-openai-style-api-and-vllm&#34; &gt;Deploy with OpenAI-style API and vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#download-from-modelscope-hub&#34; &gt;Download from ModelScope Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#download-from-modelers-hub&#34; &gt;Download from Modelers Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#use-wb-logger&#34; &gt;Use W&amp;amp;B Logger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#use-swanlab-logger&#34; &gt;Use SwanLab Logger&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#projects-using-llama-factory&#34; &gt;Projects using LLaMA Factory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#license&#34; &gt;License&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#citation&#34; &gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;#acknowledgement&#34; &gt;Acknowledgement&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;features&#34;&gt;Features
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Various models&lt;/strong&gt;: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Qwen2-VL, DeepSeek, Yi, Gemma, ChatGLM, Phi, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrated methods&lt;/strong&gt;: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable resources&lt;/strong&gt;: 16-bit full-tuning, freeze-tuning, LoRA, and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ (a generic LoRA sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced algorithms&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;https://github.com/jiaweizzhao/GaLore&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GaLore&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/Ledzy/BAdam&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BAdam&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/zhuhanqing/APOLLO&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;APOLLO&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/zyushun/Adam-mini&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Adam-mini&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/KellerJordan/Muon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Muon&lt;/a&gt;, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and PiSSA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Practical tricks&lt;/strong&gt;: &lt;a class=&#34;link&#34; href=&#34;https://github.com/Dao-AILab/flash-attention&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FlashAttention-2&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/unslothai/unsloth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Unsloth&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/linkedin/Liger-Kernel&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Liger Kernel&lt;/a&gt;, RoPE scaling, NEFTune and rsLoRA.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wide tasks&lt;/strong&gt;: Multi-turn dialogue, tool use, image understanding, visual grounding, video recognition, audio understanding, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experiment monitors&lt;/strong&gt;: LlamaBoard, TensorBoard, Wandb, MLflow, &lt;a class=&#34;link&#34; href=&#34;https://github.com/SwanHubX/SwanLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SwanLab&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster inference&lt;/strong&gt;: OpenAI-style API, Gradio UI and CLI with &lt;a class=&#34;link&#34; href=&#34;https://github.com/vllm-project/vllm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM worker&lt;/a&gt; or &lt;a class=&#34;link&#34; href=&#34;https://github.com/sgl-project/sglang&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SGLang worker&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
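&lt;p&gt;For readers curious what the LoRA entries above boil down to, here is a minimal generic sketch using the Hugging Face &lt;code&gt;peft&lt;/code&gt; library directly. This is &lt;em&gt;not&lt;/em&gt; LLaMA-Factory&amp;rsquo;s zero-code CLI/YAML workflow, just the underlying adapter idea; the model name and hyperparameters are illustrative:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Generic LoRA sketch with Hugging Face peft (illustrative only;
# LLaMA-Factory drives this through its CLI/Web UI and YAML configs).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(&#34;meta-llama/Llama-3.2-1B&#34;)  # illustrative
lora = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling factor
    target_modules=[&#34;q_proj&#34;, &#34;v_proj&#34;],   # attention projections
    lora_dropout=0.05,
    task_type=&#34;CAUSAL_LM&#34;,
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a tiny fraction is trainable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;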
&lt;h3 id=&#34;day-n-support-for-fine-tuning-cutting-edge-models&#34;&gt;Day-N Support for Fine-Tuning Cutting-Edge Models
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Support Date&lt;/th&gt;
          &lt;th&gt;Model Name&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Day 0&lt;/td&gt;
          &lt;td&gt;Qwen3 / Qwen2.5-VL / Gemma 3 / InternLM 3 / MiniCPM-o-2.6&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Day 1&lt;/td&gt;
          &lt;td&gt;Llama 3 / GLM-4 / Mistral Small / PaliGemma2 / Llama 4&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;blogs&#34;&gt;Blogs
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://aws.amazon.com/cn/blogs/machine-learning/how-apoidea-group-enhances-visual-information-extraction-from-banking-documents-with-multimodal-models-using-llama-factory-on-amazon-sagemaker-hyperpod/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod&lt;/a&gt; (English)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://buaa-act.feishu.cn/wiki/GVzlwYcRFiR8OLkHbL6cQpYin7g&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge&lt;/a&gt; (English)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory_deepseek_r1_distill_7b&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA Factory: Fine-tuning the DeepSeek-R1-Distill-Qwen-7B Model for News Classifier&lt;/a&gt; (Chinese)&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;&lt;summary&gt;All Blogs&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://aws.amazon.com/cn/blogs/china/a-one-stop-code-free-model-fine-tuning-deployment-platform-based-on-sagemaker-and-llama-factory/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;A One-Stop Code-Free Model Fine-Tuning &amp;amp; Deployment Platform based on SageMaker and LLaMA-Factory&lt;/a&gt; (Chinese)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory_qwen2vl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA Factory Multi-Modal Fine-Tuning Practice: Fine-Tuning Qwen2-VL for Personal Tourist Guide&lt;/a&gt; (Chinese)&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA Factory: Fine-tuning the LLaMA3 Model for Role-Playing&lt;/a&gt; (Chinese)&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;h2 id=&#34;changelog&#34;&gt;Changelog
&lt;/h2&gt;&lt;p&gt;[25/04/28] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen3/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen3&lt;/a&gt;&lt;/strong&gt; model family.&lt;/p&gt;
&lt;p&gt;[25/04/21] We supported the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/KellerJordan/Muon&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Muon&lt;/a&gt;&lt;/strong&gt; optimizer. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/tianshijing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@tianshijing&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[25/04/16] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/OpenGVLab/InternVL3-8B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternVL3&lt;/a&gt;&lt;/strong&gt; model. See &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/pull/7258&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PR #7258&lt;/a&gt; to get started.&lt;/p&gt;
&lt;p&gt;[25/04/14] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/GLM-Z1-9B-0414&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-Z1&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kimi-VL&lt;/a&gt;&lt;/strong&gt; models.&lt;/p&gt;
&lt;p&gt;[25/04/06] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.meta.com/blog/llama-4-multimodal-intelligence/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 4&lt;/a&gt;&lt;/strong&gt; model. See &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/pull/7611&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PR #7611&lt;/a&gt; to get started.&lt;/p&gt;
&lt;details&gt;&lt;summary&gt;Full Changelog&lt;/summary&gt;
&lt;p&gt;[25/03/31] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen2.5-omni/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5 Omni&lt;/a&gt;&lt;/strong&gt; model. See &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/pull/7537&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PR #7537&lt;/a&gt; to get started.&lt;/p&gt;
&lt;p&gt;[25/03/15] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/sgl-project/sglang&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SGLang&lt;/a&gt;&lt;/strong&gt; as an inference backend. Try &lt;code&gt;infer_backend: sglang&lt;/code&gt; to accelerate inference.&lt;/p&gt;
&lt;p&gt;[25/03/12] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/blog/gemma3&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 3&lt;/a&gt;&lt;/strong&gt; model.&lt;/p&gt;
&lt;p&gt;[25/02/24] Announcing &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/EasyR1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;EasyR1&lt;/a&gt;&lt;/strong&gt;, an efficient, scalable, multi-modality RL training framework for GRPO training.&lt;/p&gt;
&lt;p&gt;[25/02/11] We supported saving the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ollama&lt;/a&gt;&lt;/strong&gt; modelfile when exporting the model checkpoints. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[25/02/05] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34; &gt;Qwen2-Audio&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-o-2.6&lt;/a&gt;&lt;/strong&gt; on audio understanding tasks.&lt;/p&gt;
&lt;p&gt;[25/01/31] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-R1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-R1&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5-VL&lt;/a&gt;&lt;/strong&gt; models.&lt;/p&gt;
&lt;p&gt;[25/01/15] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2412.05270&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;APOLLO&lt;/a&gt;&lt;/strong&gt; optimizer. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[25/01/14] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-o-2_6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-o-2.6&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb/MiniCPM-V-2_6&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-V-2.6&lt;/a&gt;&lt;/strong&gt; models. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/BUAADreamer&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@BUAADreamer&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[25/01/14] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/internlm/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternLM 3&lt;/a&gt;&lt;/strong&gt; models. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/hhaAndroid&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@hhaAndroid&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[25/01/10] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft/phi-4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-4&lt;/a&gt;&lt;/strong&gt; model.&lt;/p&gt;
&lt;p&gt;[24/12/21] We supported using &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/SwanHubX/SwanLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SwanLab&lt;/a&gt;&lt;/strong&gt; for experiment tracking and visualization. See &lt;a class=&#34;link&#34; href=&#34;#use-swanlab-logger&#34; &gt;this section&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;[24/11/27] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Skywork/Skywork-o1-Open-Llama-3.1-8B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Skywork-o1&lt;/a&gt;&lt;/strong&gt; model and the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenO1&lt;/a&gt;&lt;/strong&gt; dataset.&lt;/p&gt;
&lt;p&gt;[24/10/09] We supported downloading pre-trained models and datasets from the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://modelers.cn/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Modelers Hub&lt;/a&gt;&lt;/strong&gt;. See &lt;a class=&#34;link&#34; href=&#34;#download-from-modelers-hub&#34; &gt;this tutorial&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/09/19] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen2.5/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5&lt;/a&gt;&lt;/strong&gt; models.&lt;/p&gt;
&lt;p&gt;[24/08/30] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen2-vl/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2-VL&lt;/a&gt;&lt;/strong&gt; models. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/simonJJJ&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@simonJJJ&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[24/08/27] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/linkedin/Liger-Kernel&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Liger Kernel&lt;/a&gt;&lt;/strong&gt;. Try &lt;code&gt;enable_liger_kernel: true&lt;/code&gt; for efficient training.&lt;/p&gt;
&lt;p&gt;[24/08/09] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/zyushun/Adam-mini&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Adam-mini&lt;/a&gt;&lt;/strong&gt; optimizer. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/relic-yuexi&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@relic-yuexi&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[24/07/04] We supported &lt;a class=&#34;link&#34; href=&#34;https://github.com/MeetKai/functionary/tree/main/functionary/train/packing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;contamination-free packed training&lt;/a&gt;. Use &lt;code&gt;neat_packing: true&lt;/code&gt; to activate it. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/chuan298&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@chuan298&lt;/a&gt;&amp;rsquo;s PR.&lt;/p&gt;
&lt;p&gt;[24/06/16] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.02948&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PiSSA&lt;/a&gt;&lt;/strong&gt; algorithm. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/06/07] We supported fine-tuning the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen2/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/THUDM/GLM-4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-4&lt;/a&gt;&lt;/strong&gt; models.&lt;/p&gt;
&lt;p&gt;[24/05/26] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.14734&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SimPO&lt;/a&gt;&lt;/strong&gt; algorithm for preference learning. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/05/20] We supported fine-tuning the &lt;strong&gt;PaliGemma&lt;/strong&gt; series models. Note that the PaliGemma models are pre-trained models; you need to fine-tune them with the &lt;code&gt;paligemma&lt;/code&gt; template for chat completion.&lt;/p&gt;
&lt;p&gt;[24/05/18] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.01306&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;KTO&lt;/a&gt;&lt;/strong&gt; algorithm for preference learning. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/05/14] We supported training and inference on Ascend NPU devices. Check the &lt;a class=&#34;link&#34; href=&#34;#installation&#34; &gt;installation&lt;/a&gt; section for details.&lt;/p&gt;
&lt;p&gt;[24/04/26] We supported fine-tuning the &lt;strong&gt;LLaVA-1.5&lt;/strong&gt; multimodal LLMs. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/04/22] We provided a &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Colab notebook&lt;/a&gt;&lt;/strong&gt; for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available on Hugging Face; check &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama3-8B-Chinese-Chat&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/zhichen/Llama3-Chinese&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama3-Chinese&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;[24/04/21] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.02258&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mixture-of-Depths&lt;/a&gt;&lt;/strong&gt; according to &lt;a class=&#34;link&#34; href=&#34;https://github.com/astramind-ai/Mixture-of-depths&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AstraMindAI&amp;rsquo;s implementation&lt;/a&gt;. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/04/16] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.02827&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BAdam&lt;/a&gt;&lt;/strong&gt; optimizer. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/04/16] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/unslothai/unsloth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth&lt;/a&gt;&lt;/strong&gt;&amp;rsquo;s long-sequence training (Llama-2-7B-56k within 24GB). It achieves &lt;strong&gt;117%&lt;/strong&gt; of the speed and &lt;strong&gt;50%&lt;/strong&gt; of the memory usage compared with FlashAttention-2; more benchmarks can be found on &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;[24/03/31] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.07691&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ORPO&lt;/a&gt;&lt;/strong&gt;. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/03/21] Our paper &amp;ldquo;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.13372&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models&lt;/a&gt;&amp;rdquo; is available at arXiv!&lt;/p&gt;
&lt;p&gt;[24/03/20] We supported &lt;strong&gt;FSDP+QLoRA&lt;/strong&gt; that fine-tunes a 70B model on 2x24GB GPUs. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/03/13] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.12354&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LoRA+&lt;/a&gt;&lt;/strong&gt;. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/03/07] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.03507&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GaLore&lt;/a&gt;&lt;/strong&gt; optimizer. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/03/07] We integrated &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/vllm-project/vllm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;vLLM&lt;/a&gt;&lt;/strong&gt; for faster and concurrent inference. Try &lt;code&gt;infer_backend: vllm&lt;/code&gt; to enjoy &lt;strong&gt;270%&lt;/strong&gt; inference speed.&lt;/p&gt;
&lt;p&gt;[24/02/28] We supported weight-decomposed LoRA (&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.09353&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DoRA&lt;/a&gt;&lt;/strong&gt;). Try &lt;code&gt;use_dora: true&lt;/code&gt; to activate DoRA training.&lt;/p&gt;
&lt;p&gt;[24/02/15] We supported &lt;strong&gt;block expansion&lt;/strong&gt; proposed by &lt;a class=&#34;link&#34; href=&#34;https://github.com/TencentARC/LLaMA-Pro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA Pro&lt;/a&gt;. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this &lt;a class=&#34;link&#34; href=&#34;https://qwenlm.github.io/blog/qwen1.5/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;blog post&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;[24/01/18] We supported &lt;strong&gt;agent tuning&lt;/strong&gt; for most models, equipping models with tool-using abilities by fine-tuning with &lt;code&gt;dataset: glaive_toolcall_en&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;[23/12/23] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/unslothai/unsloth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth&lt;/a&gt;&lt;/strong&gt;&amp;rsquo;s implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try the &lt;code&gt;use_unsloth: true&lt;/code&gt; argument to activate the unsloth patch. It achieves &lt;strong&gt;170%&lt;/strong&gt; speed in our benchmark; check &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this page&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;[23/12/12] We supported fine-tuning the latest MoE model &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai/Mixtral-8x7B-v0.1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mixtral 8x7B&lt;/a&gt;&lt;/strong&gt; in our framework. See hardware requirement &lt;a class=&#34;link&#34; href=&#34;#hardware-requirement&#34; &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;[23/12/01] We supported downloading pre-trained models and datasets from the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ModelScope Hub&lt;/a&gt;&lt;/strong&gt;. See &lt;a class=&#34;link&#34; href=&#34;#download-from-modelscope-hub&#34; &gt;this tutorial&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[23/10/21] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2310.05914&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NEFTune&lt;/a&gt;&lt;/strong&gt; trick for fine-tuning. Try the &lt;code&gt;neftune_noise_alpha: 5&lt;/code&gt; argument to activate NEFTune.&lt;/p&gt;
&lt;p&gt;[23/09/27] We supported &lt;strong&gt;$S^2$-Attn&lt;/strong&gt; proposed by &lt;a class=&#34;link&#34; href=&#34;https://github.com/dvlab-research/LongLoRA&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LongLoRA&lt;/a&gt; for the LLaMA models. Try the &lt;code&gt;shift_attn: true&lt;/code&gt; argument to enable shift short attention.&lt;/p&gt;
&lt;p&gt;[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[23/09/10] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Dao-AILab/flash-attention&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FlashAttention-2&lt;/a&gt;&lt;/strong&gt;. Try the &lt;code&gt;flash_attn: fa2&lt;/code&gt; argument to enable FlashAttention-2 if you are using RTX 4090, A100, or H100 GPUs.&lt;/p&gt;
&lt;p&gt;[23/08/12] We supported &lt;strong&gt;RoPE scaling&lt;/strong&gt; to extend the context length of the LLaMA models. Try the &lt;code&gt;rope_scaling: linear&lt;/code&gt; argument in training and the &lt;code&gt;rope_scaling: dynamic&lt;/code&gt; argument at inference to extrapolate the position embeddings.&lt;/p&gt;
&lt;p&gt;[23/08/11] We supported &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2305.18290&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DPO training&lt;/a&gt;&lt;/strong&gt; for instruction-tuned models. See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;p&gt;[23/07/31] We supported &lt;strong&gt;dataset streaming&lt;/strong&gt;. Try the &lt;code&gt;streaming: true&lt;/code&gt; and &lt;code&gt;max_steps: 10000&lt;/code&gt; arguments to load your dataset in streaming mode.&lt;/p&gt;
&lt;p&gt;[23/07/29] We released two instruction-tuned 13B models on Hugging Face. See these Hugging Face repos (&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaMA-2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hiyouga/Baichuan-13B-sft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Baichuan&lt;/a&gt;) for details.&lt;/p&gt;
&lt;p&gt;[23/07/18] We developed an &lt;strong&gt;all-in-one Web UI&lt;/strong&gt; for training, evaluation and inference. Try &lt;code&gt;train_web.py&lt;/code&gt; to fine-tune models in your web browser. Thanks to &lt;a class=&#34;link&#34; href=&#34;https://github.com/KanadeSiina&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@KanadeSiina&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/codemayq&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;@codemayq&lt;/a&gt; for their efforts in the development.&lt;/p&gt;
&lt;p&gt;[23/07/09] We released &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/FastEdit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FastEdit&lt;/a&gt;&lt;/strong&gt; ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/FastEdit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FastEdit&lt;/a&gt; if you are interested.&lt;/p&gt;
&lt;p&gt;[23/06/29] We provided a &lt;strong&gt;reproducible example&lt;/strong&gt; of training a chat model using instruction-following datasets, see &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/hiyouga/Baichuan-7B-sft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Baichuan-7B-sft&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;[23/06/22] We aligned the &lt;a class=&#34;link&#34; href=&#34;src/api_demo.py&#34; &gt;demo API&lt;/a&gt; with &lt;a class=&#34;link&#34; href=&#34;https://platform.openai.com/docs/api-reference/chat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenAI&amp;rsquo;s&lt;/a&gt; format, so you can plug the fine-tuned model into &lt;strong&gt;arbitrary ChatGPT-based applications&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;[23/06/03] We supported quantized training and inference (aka &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/artidoro/qlora&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;QLoRA&lt;/a&gt;&lt;/strong&gt;). See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples&lt;/a&gt; for usage.&lt;/p&gt;
&lt;/details&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
If you cannot use the latest feature, please pull the latest code and install LLaMA-Factory again.&lt;/p&gt;
&lt;/blockquote&gt;
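&lt;p&gt;For example, from an existing checkout (a minimal sketch; adjust the extras to your setup):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git pull origin main
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &amp;#34;.[torch,metrics]&amp;#34; --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;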
&lt;h2 id=&#34;supported-models&#34;&gt;Supported Models
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Model size&lt;/th&gt;
          &lt;th&gt;Template&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/baichuan-inc&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Baichuan 2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/13B&lt;/td&gt;
          &lt;td&gt;baichuan2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/bigscience&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BLOOM/BLOOMZ&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;560M/1.1B/1.7B/3B/7.1B/176B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ChatGLM3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;6B&lt;/td&gt;
          &lt;td&gt;chatglm3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/CohereForAI&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Command R&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;35B/104B&lt;/td&gt;
          &lt;td&gt;cohere&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek (Code/MoE)&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/16B/67B/236B&lt;/td&gt;
          &lt;td&gt;deepseek&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek 2.5/3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;236B/671B&lt;/td&gt;
          &lt;td&gt;deepseek3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek R1 (Distill)&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1.5B/7B/8B/14B/32B/70B/671B&lt;/td&gt;
          &lt;td&gt;deepseekr1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/tiiuae&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Falcon&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/11B/40B/180B&lt;/td&gt;
          &lt;td&gt;falcon&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma/Gemma 2/CodeGemma&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;2B/7B/9B/27B&lt;/td&gt;
          &lt;td&gt;gemma&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma 3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/4B/12B/27B&lt;/td&gt;
          &lt;td&gt;gemma3/gemma (1B)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-4/GLM-4-0414/GLM-Z1&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;9B/32B&lt;/td&gt;
          &lt;td&gt;glm4/glmz1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openai-community&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPT-2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;0.1B/0.4B/0.8B/1.5B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ibm-granite&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Granite 3.0-3.3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/2B/3B/8B&lt;/td&gt;
          &lt;td&gt;granite3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/tencent/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hunyuan&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B&lt;/td&gt;
          &lt;td&gt;hunyuan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/IndexTeam&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Index&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1.9B&lt;/td&gt;
          &lt;td&gt;index&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/internlm&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternLM 2-3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/8B/20B&lt;/td&gt;
          &lt;td&gt;intern2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/OpenGVLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternVL 2.5-3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/2B/8B/14B/38B/78B&lt;/td&gt;
          &lt;td&gt;intern_vl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/moonshotai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Kimi-VL&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;16B&lt;/td&gt;
          &lt;td&gt;kimi_vl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/13B/33B/65B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meta-llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/13B/70B&lt;/td&gt;
          &lt;td&gt;llama2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meta-llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 3-3.3&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/3B/8B/70B&lt;/td&gt;
          &lt;td&gt;llama3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meta-llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 4&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;109B/402B&lt;/td&gt;
          &lt;td&gt;llama4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/meta-llama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 3.2 Vision&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;11B/90B&lt;/td&gt;
          &lt;td&gt;mllama&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/llava-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA-1.5&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/13B&lt;/td&gt;
          &lt;td&gt;llava&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/llava-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA-NeXT&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/8B/13B/34B/72B/110B&lt;/td&gt;
          &lt;td&gt;llava_next&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/llava-hf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA-NeXT-Video&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/34B&lt;/td&gt;
          &lt;td&gt;llava_next_video&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/XiaomiMiMo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiMo&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B&lt;/td&gt;
          &lt;td&gt;mimo&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/2B/4B&lt;/td&gt;
          &lt;td&gt;cpm/cpm3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/openbmb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM-o-2.6/MiniCPM-V-2.6&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;8B&lt;/td&gt;
          &lt;td&gt;minicpm_o/minicpm_v&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ministral/Mistral-Nemo&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;8B/12B&lt;/td&gt;
          &lt;td&gt;ministral&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mistral/Mixtral&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/8x7B/8x22B&lt;/td&gt;
          &lt;td&gt;mistral&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mistral Small&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;24B&lt;/td&gt;
          &lt;td&gt;mistral_small&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/allenai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OLMo&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1B/7B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PaliGemma/PaliGemma2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;3B/10B/28B&lt;/td&gt;
          &lt;td&gt;paligemma&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-1.5/Phi-2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1.3B/2.7B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-3/Phi-3.5&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;4B/14B&lt;/td&gt;
          &lt;td&gt;phi&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-3-small&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B&lt;/td&gt;
          &lt;td&gt;phi_small&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-4&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;14B&lt;/td&gt;
          &lt;td&gt;phi4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/mistralai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pixtral&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;12B&lt;/td&gt;
          &lt;td&gt;pixtral&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen (1-2.5) (Code/Math/MoE/QwQ)&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;0.5B/1.5B/3B/7B/14B/32B/72B/110B&lt;/td&gt;
          &lt;td&gt;qwen&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen3 (MoE)&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;0.6B/1.7B/4B/8B/14B/32B/235B&lt;/td&gt;
          &lt;td&gt;qwen3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2-Audio&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B&lt;/td&gt;
          &lt;td&gt;qwen2_audio&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2.5-Omni&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;3B/7B&lt;/td&gt;
          &lt;td&gt;qwen2_omni&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen2-VL/Qwen2.5-VL/QVQ&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;2B/3B/7B/32B/72B&lt;/td&gt;
          &lt;td&gt;qwen2_vl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ByteDance-Seed&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Seed Coder&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;8B&lt;/td&gt;
          &lt;td&gt;seed_coder&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Skywork&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Skywork o1&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;8B&lt;/td&gt;
          &lt;td&gt;skywork_o1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/bigcode&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StarCoder 2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;3B/7B/15B&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Tele-AI&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TeleChat2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;3B/7B/35B/115B&lt;/td&gt;
          &lt;td&gt;telechat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/xverse&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;XVERSE&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;7B/13B/65B&lt;/td&gt;
          &lt;td&gt;xverse&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/01-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yi/Yi-1.5 (Code)&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;1.5B/6B/9B/34B&lt;/td&gt;
          &lt;td&gt;yi&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/01-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yi-VL&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;6B/34B&lt;/td&gt;
          &lt;td&gt;yi_vl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/IEITYuan&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yuan 2&lt;/a&gt;&lt;/td&gt;
          &lt;td&gt;2B/51B/102B&lt;/td&gt;
          &lt;td&gt;yuan&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
For the &amp;ldquo;base&amp;rdquo; models, the &lt;code&gt;template&lt;/code&gt; argument can be chosen from &lt;code&gt;default&lt;/code&gt;, &lt;code&gt;alpaca&lt;/code&gt;, &lt;code&gt;vicuna&lt;/code&gt; etc. But make sure to use the &lt;strong&gt;corresponding template&lt;/strong&gt; for the &amp;ldquo;instruct/chat&amp;rdquo; models.&lt;/p&gt;
&lt;p&gt;Remember to use the &lt;strong&gt;SAME&lt;/strong&gt; template in training and inference.&lt;/p&gt;
&lt;p&gt;*: You should install the &lt;code&gt;transformers&lt;/code&gt; from main branch and use &lt;code&gt;DISABLE_VERSION_CHECK=1&lt;/code&gt; to skip version check.&lt;/p&gt;
&lt;p&gt;**: You need to install a specific version of &lt;code&gt;transformers&lt;/code&gt; to use the corresponding model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Please refer to &lt;a class=&#34;link&#34; href=&#34;src/llamafactory/extras/constants.py&#34; &gt;constants.py&lt;/a&gt; for a full list of the models we support.&lt;/p&gt;
&lt;p&gt;You can also add a custom chat template to &lt;a class=&#34;link&#34; href=&#34;src/llamafactory/data/template.py&#34; &gt;template.py&lt;/a&gt;.&lt;/p&gt;
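&lt;p&gt;As an illustration of the template rule above, a minimal sketch (the two YAML paths mirror the repository&amp;rsquo;s examples layout and are assumptions here):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# both example configs set template: llama3, keeping training and inference consistent
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;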
&lt;h2 id=&#34;supported-training-approaches&#34;&gt;Supported Training Approaches
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Approach&lt;/th&gt;
          &lt;th&gt;Full-tuning&lt;/th&gt;
          &lt;th&gt;Freeze-tuning&lt;/th&gt;
          &lt;th&gt;LoRA&lt;/th&gt;
          &lt;th&gt;QLoRA&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Pre-Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Supervised Fine-Tuning&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Reward Modeling&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;PPO Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;DPO Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;KTO Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ORPO Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;SimPO Training&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
          &lt;td&gt;:white_check_mark:&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
The implementation details of PPO can be found in &lt;a class=&#34;link&#34; href=&#34;https://newfacade.github.io/notes-on-reinforcement-learning/17-ppo-trl.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
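&lt;p&gt;To make the table concrete, the sketch below pairs one approach (DPO) with one tuning method (LoRA). The field names mirror the repository&amp;rsquo;s example configs, and the dataset name and hyperparameters are placeholders rather than a tuned recipe:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# hypothetical minimal recipe: DPO training with LoRA adapters
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cat &amp;gt; dpo_lora_demo.yaml &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;stage: dpo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;do_train: true
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;finetuning_type: lora
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;lora_target: all
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;dataset: dpo_en_demo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;template: llama3
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;output_dir: saves/llama3-8b-lora-dpo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;per_device_train_batch_size: 1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;learning_rate: 5.0e-6
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;num_train_epochs: 1.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;EOF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli train dpo_lora_demo.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;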
&lt;h2 id=&#34;provided-datasets&#34;&gt;Provided Datasets
&lt;/h2&gt;&lt;details&gt;&lt;summary&gt;Pre-training datasets&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;data/wiki_demo.txt&#34; &gt;Wiki Demo (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/tiiuae/falcon-refinedweb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RefinedWeb (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RedPajama V2 (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/olm/olm-wikipedia-20221220&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Wikipedia (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Wikipedia (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/EleutherAI/pile&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pile (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Skywork/SkyPile-150B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SkyPile (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HuggingFaceFW/fineweb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FineWeb (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FineWeb-Edu (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/bigcode/the-stack&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;The Stack (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/bigcode/starcoderdata&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StarCoder (en)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;Supervised fine-tuning datasets&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;data/identity.json&#34; &gt;Identity (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/tatsu-lab/stanford_alpaca&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Stanford Alpaca (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ymcui/Chinese-LLaMA-Alpaca-3&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Stanford Alpaca (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Alpaca GPT4 (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Glaive Function Calling V2 (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/GAIR/lima&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LIMA (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/JosephusCheung/GuanacoDataset&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Guanaco Dataset (multilingual)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/train_2M_CN&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE 2M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/train_1M_CN&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE 1M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/train_0.5M_CN&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE 0.5M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE Dialogue 0.4M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/school_math_0.25M&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE School Math 0.25M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BELLE Multiturn Chat 0.8M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/thunlp/UltraChat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;UltraChat (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/garage-bAInd/Open-Platypus&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenPlatypus (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CodeAlpaca 20k (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/QingyiSi/Alpaca-CoT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Alpaca CoT (multilingual)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Open-Orca/OpenOrca&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenOrca (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Open-Orca/SlimOrca&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SlimOrca (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/TIGER-Lab/MathInstruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MathInstruct (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Firefly 1.1M (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/wiki_qa&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Wiki QA (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/suolyer/webqa&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Web QA (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/zxbsmk/webnovel_cn&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebNovel (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/berkeley-nest/Nectar&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Nectar (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepctrl (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HasturOfficial/adgen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Advertise Generating (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/totally-not-an-llm/sharegpt-hyperfiltered-3k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ShareGPT Hyperfiltered (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/shibing624/sharegpt_gpt4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ShareGPT4 (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;UltraChat 200k (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/THUDM/AgentInstruct&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AgentInstruct (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/lmsys/lmsys-chat-1m&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LMSYS Chat 1M (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Evol Instruct V2 (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HuggingFaceTB/cosmopedia&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Cosmopedia (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/hfl/stem_zh_instruction&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;STEM (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/hfl/ruozhiba_gpt4_turbo&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ruozhiba (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/m-a-p/neo_sft_phase2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Neo-sft (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Magpie-Pro-300K-Filtered (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/argilla/magpie-ultra-v0.1&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Magpie-ultra-v0.1 (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/TIGER-Lab/WebInstructSub&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WebInstructSub (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenO1-SFT (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Open-Thoughts (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/open-r1/OpenR1-Math-220k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Open-R1-Math (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Chinese-DeepSeek-R1-Distill (zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/BUAADreamer/llava-en-zh-300k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LLaVA mixed (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/jugg1024/pokemon-gpt4o-captions&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pokemon-gpt4o-captions (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/oasst_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Open Assistant (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/dolly-15k_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dolly 15k (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Alpaca GPT4 (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/openschnabeltier_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;OpenSchnabeltier (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/evol-instruct_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Evol Instruct (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/dolphin_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dolphin (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/booksum_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Booksum (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/airoboros-3.0_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Airoboros (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/ultra-chat_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ultrachat (de)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;Preference datasets&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DPO mixed (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;UltraFeedback (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/m-a-p/COIG-P&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;COIG-P (en&amp;amp;zh)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/openbmb/RLHF-V-Dataset&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLHF-V (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Zhihui/VLFeedback&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;VLFeedback (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RLAIF-V (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Intel/orca_dpo_pairs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Orca DPO Pairs (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/Anthropic/hh-rlhf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;HH-RLHF (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/berkeley-nest/Nectar&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Nectar (en)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Orca DPO (de)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/datasets/argilla/kto-mix-15k&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;KTO mixed (en)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
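&lt;p&gt;Each entry above is registered under a short dataset name, and training configs reference those names rather than URLs. A minimal sketch, assuming the identifiers below match the registry (&lt;code&gt;glaive_toolcall_en&lt;/code&gt; appears in the changelog above; &lt;code&gt;alpaca_gpt4_en&lt;/code&gt; is assumed):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;# append a dataset selection to a hypothetical SFT config; names are comma-separated
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cat &amp;gt;&amp;gt; my_sft.yaml &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;dataset: alpaca_gpt4_en,glaive_toolcall_en
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;dataset_dir: data
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;EOF
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;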
&lt;p&gt;Some datasets require confirmation before they can be used, so we recommend logging in to your Hugging Face account with the following commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install --upgrade huggingface_hub
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;huggingface-cli login
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;requirement&#34;&gt;Requirement
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Mandatory&lt;/th&gt;
          &lt;th&gt;Minimum&lt;/th&gt;
          &lt;th&gt;Recommend&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;python&lt;/td&gt;
          &lt;td&gt;3.9&lt;/td&gt;
          &lt;td&gt;3.10&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;torch&lt;/td&gt;
          &lt;td&gt;2.0.0&lt;/td&gt;
          &lt;td&gt;2.6.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;torchvision&lt;/td&gt;
          &lt;td&gt;0.15.0&lt;/td&gt;
          &lt;td&gt;0.21.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;transformers&lt;/td&gt;
          &lt;td&gt;4.45.0&lt;/td&gt;
          &lt;td&gt;4.50.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;datasets&lt;/td&gt;
          &lt;td&gt;2.16.0&lt;/td&gt;
          &lt;td&gt;3.2.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;accelerate&lt;/td&gt;
          &lt;td&gt;0.34.0&lt;/td&gt;
          &lt;td&gt;1.2.1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;peft&lt;/td&gt;
          &lt;td&gt;0.14.0&lt;/td&gt;
          &lt;td&gt;0.15.1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;trl&lt;/td&gt;
          &lt;td&gt;0.8.6&lt;/td&gt;
          &lt;td&gt;0.9.6&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Optional&lt;/th&gt;
          &lt;th&gt;Minimum&lt;/th&gt;
          &lt;th&gt;Recommend&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;CUDA&lt;/td&gt;
          &lt;td&gt;11.6&lt;/td&gt;
          &lt;td&gt;12.2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;deepspeed&lt;/td&gt;
          &lt;td&gt;0.10.0&lt;/td&gt;
          &lt;td&gt;0.16.4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;bitsandbytes&lt;/td&gt;
          &lt;td&gt;0.39.0&lt;/td&gt;
          &lt;td&gt;0.43.1&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;vllm&lt;/td&gt;
          &lt;td&gt;0.4.3&lt;/td&gt;
          &lt;td&gt;0.8.2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;flash-attn&lt;/td&gt;
          &lt;td&gt;2.5.6&lt;/td&gt;
          &lt;td&gt;2.7.2&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;hardware-requirement&#34;&gt;Hardware Requirement
&lt;/h3&gt;&lt;p&gt;* &lt;em&gt;estimated&lt;/em&gt;&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Method&lt;/th&gt;
          &lt;th&gt;Bits&lt;/th&gt;
          &lt;th&gt;7B&lt;/th&gt;
          &lt;th&gt;14B&lt;/th&gt;
          &lt;th&gt;30B&lt;/th&gt;
          &lt;th&gt;70B&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;x&lt;/code&gt;B&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Full (&lt;code&gt;bf16&lt;/code&gt; or &lt;code&gt;fp16&lt;/code&gt;)&lt;/td&gt;
          &lt;td&gt;32&lt;/td&gt;
          &lt;td&gt;120GB&lt;/td&gt;
          &lt;td&gt;240GB&lt;/td&gt;
          &lt;td&gt;600GB&lt;/td&gt;
          &lt;td&gt;1200GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;18x&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Full (&lt;code&gt;pure_bf16&lt;/code&gt;)&lt;/td&gt;
          &lt;td&gt;16&lt;/td&gt;
          &lt;td&gt;60GB&lt;/td&gt;
          &lt;td&gt;120GB&lt;/td&gt;
          &lt;td&gt;300GB&lt;/td&gt;
          &lt;td&gt;600GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;8x&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Freeze/LoRA/GaLore/APOLLO/BAdam&lt;/td&gt;
          &lt;td&gt;16&lt;/td&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;64GB&lt;/td&gt;
          &lt;td&gt;160GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;2x&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;QLoRA&lt;/td&gt;
          &lt;td&gt;8&lt;/td&gt;
          &lt;td&gt;10GB&lt;/td&gt;
          &lt;td&gt;20GB&lt;/td&gt;
          &lt;td&gt;40GB&lt;/td&gt;
          &lt;td&gt;80GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;x&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;QLoRA&lt;/td&gt;
          &lt;td&gt;4&lt;/td&gt;
          &lt;td&gt;6GB&lt;/td&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;x/2&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;QLoRA&lt;/td&gt;
          &lt;td&gt;2&lt;/td&gt;
          &lt;td&gt;4GB&lt;/td&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;x/4&lt;/code&gt;GB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
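&lt;p&gt;The last column gives a rough scaling rule in the model size &lt;code&gt;x&lt;/code&gt; (in billions of parameters): for example, full fine-tuning in &lt;code&gt;bf16&lt;/code&gt; or &lt;code&gt;fp16&lt;/code&gt; needs about &lt;code&gt;18x&lt;/code&gt;GB, i.e. 18 × 7 ≈ 126GB for a 7B model, in line with the 120GB estimate above, while Freeze/LoRA-style tuning needs about &lt;code&gt;2x&lt;/code&gt;GB, roughly 14GB for a 7B model, consistent with the 16GB estimate.&lt;/p&gt;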
&lt;h2 id=&#34;getting-started&#34;&gt;Getting Started
&lt;/h2&gt;&lt;h3 id=&#34;installation&#34;&gt;Installation
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;[!IMPORTANT]
Installation is mandatory.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone --depth &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; https://github.com/hiyouga/LLaMA-Factory.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; LLaMA-Factory
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &lt;span class=&#34;s2&#34;&gt;&amp;#34;.[torch,metrics]&amp;#34;&lt;/span&gt; --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Extra dependencies available: torch, torch-npu, metrics, deepspeed, liger-kernel, bitsandbytes, hqq, eetq, gptq, aqlm, vllm, sglang, galore, apollo, badam, adam-mini, qwen, minicpm_v, modelscope, openmind, swanlab, quality&lt;/p&gt;
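&lt;p&gt;For example, to pull in several of these extras in one install (a sketch; choose the extras that match your hardware and backends):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e &amp;#34;.[torch,metrics,deepspeed,vllm]&amp;#34; --no-build-isolation
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;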
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
Use &lt;code&gt;pip install -e . --no-deps --no-build-isolation&lt;/code&gt; to resolve package conflicts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;details&gt;&lt;summary&gt;Setting up a virtual environment with &lt;b&gt;uv&lt;/b&gt;&lt;/summary&gt;
&lt;p&gt;Create an isolated Python environment with &lt;a class=&#34;link&#34; href=&#34;https://github.com/astral-sh/uv&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;uv&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv sync --extra torch --extra metrics --prerelease&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;allow
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run LLaMA-Factory in the isolated environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv run --prerelease&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;allow llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;For Windows users&lt;/summary&gt;
&lt;h4 id=&#34;install-pytorch&#34;&gt;Install PyTorch
&lt;/h4&gt;&lt;p&gt;You need to manually install the GPU version of PyTorch on the Windows platform. Please refer to the &lt;a class=&#34;link&#34; href=&#34;https://pytorch.org/get-started/locally/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;official website&lt;/a&gt; and use the following commands to install PyTorch with CUDA support:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip uninstall torch torchvision torchaudio
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -c &lt;span class=&#34;s2&#34;&gt;&amp;#34;import torch; print(torch.cuda.is_available())&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you see &lt;code&gt;True&lt;/code&gt;, you have successfully installed PyTorch with CUDA support.&lt;/p&gt;
&lt;p&gt;Try &lt;code&gt;dataloader_num_workers: 0&lt;/code&gt; if you encounter a &lt;code&gt;Can&#39;t pickle local object&lt;/code&gt; error.&lt;/p&gt;
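&lt;p&gt;For example, assuming the &lt;code&gt;key=value&lt;/code&gt; command-line override form used elsewhere in this README, the worker count can be set without editing the YAML file (a minimal sketch):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# disable multiprocess data loading to avoid the pickling error on Windows
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml dataloader_num_workers=0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;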
&lt;h4 id=&#34;install-bitsandbytes&#34;&gt;Install BitsAndBytes
&lt;/h4&gt;&lt;p&gt;If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of the &lt;code&gt;bitsandbytes&lt;/code&gt; library, which supports CUDA 11.1 to 12.2. Please select the appropriate &lt;a class=&#34;link&#34; href=&#34;https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;release version&lt;/a&gt; based on your CUDA version.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id=&#34;install-flash-attention-2&#34;&gt;Install Flash Attention-2
&lt;/h4&gt;&lt;p&gt;To enable FlashAttention-2 on the Windows platform, please use the script from &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/lldacing/flash-attention-windows-wheel&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;flash-attention-windows-wheel&lt;/a&gt; to compile and install it yourself.&lt;/p&gt;
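&lt;p&gt;After compilation, a quick sanity check (a minimal sketch) confirms the wheel is importable:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# should print the installed flash-attn version without raising an ImportError
python -c &amp;#34;import flash_attn; print(flash_attn.__version__)&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;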
&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;For Ascend NPU users&lt;/summary&gt;
&lt;p&gt;To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher and specify extra dependencies: &lt;code&gt;pip install -e &amp;quot;.[torch-npu,metrics]&amp;quot;&lt;/code&gt;. Additionally, you need to install the &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.hiascend.com/developer/download/community/result?module=cann&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Ascend CANN Toolkit and Kernels&lt;/a&gt;&lt;/strong&gt;. Please follow the &lt;a class=&#34;link&#34; href=&#34;https://www.hiascend.com/document/detail/en/CANNCommunityEdition/600alphaX/softwareinstall/instg/atlasdeploy_03_0031.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;installation tutorial&lt;/a&gt; or use the following commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# replace the url according to your CANN version and devices&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install CANN Toolkit&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-toolkit_8.0.0.alpha002_linux-&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;uname -i&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;.run
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash Ascend-cann-toolkit_8.0.0.alpha002_linux-&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;uname -i&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;.run --install
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install CANN Kernels&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-kernels-910b_8.0.0.alpha002_linux-&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;uname -i&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;.run
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash Ascend-cann-kernels-910b_8.0.0.alpha002_linux-&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;$(&lt;/span&gt;uname -i&lt;span class=&#34;k&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;.run --install
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# set env variables&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; /usr/local/Ascend/ascend-toolkit/set_env.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
          &lt;th&gt;Minimum&lt;/th&gt;
          &lt;th&gt;Recommend&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;CANN&lt;/td&gt;
          &lt;td&gt;8.0.RC1&lt;/td&gt;
          &lt;td&gt;8.0.0.alpha002&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;torch&lt;/td&gt;
          &lt;td&gt;2.1.0&lt;/td&gt;
          &lt;td&gt;2.4.0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;torch-npu&lt;/td&gt;
          &lt;td&gt;2.1.0&lt;/td&gt;
          &lt;td&gt;2.4.0.post2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;deepspeed&lt;/td&gt;
          &lt;td&gt;0.13.2&lt;/td&gt;
          &lt;td&gt;0.13.2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;vllm-ascend&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
          &lt;td&gt;0.7.3&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Remember to use &lt;code&gt;ASCEND_RT_VISIBLE_DEVICES&lt;/code&gt; instead of &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; to specify the device to use.&lt;/p&gt;
&lt;p&gt;If you cannot run model inference on NPU devices, try setting &lt;code&gt;do_sample: false&lt;/code&gt; in the configuration.&lt;/p&gt;
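&lt;p&gt;For example, combining both tips above (a minimal sketch; &lt;code&gt;do_sample&lt;/code&gt; is passed here as a &lt;code&gt;key=value&lt;/code&gt; command-line override):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# select NPU 0 and disable sampling during inference
ASCEND_RT_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml do_sample=false
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;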
&lt;p&gt;Download the pre-built Docker images: &lt;a class=&#34;link&#34; href=&#34;http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;32GB&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;http://mirrors.cn-central-221.ovaijisuan.com/detail/131.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;64GB&lt;/a&gt;&lt;/p&gt;
&lt;h4 id=&#34;install-bitsandbytes-1&#34;&gt;Install BitsAndBytes
&lt;/h4&gt;&lt;p&gt;To use QLoRA based on bitsandbytes on Ascend NPU, please follow these 3 steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Manually compile bitsandbytes: Refer to &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/bitsandbytes/installation?backend=Ascend&amp;#43;NPU&amp;amp;platform=Ascend&amp;#43;NPU&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;the installation documentation&lt;/a&gt; for the NPU version of bitsandbytes to complete the compilation and installation. The compilation requires a cmake version of at least 3.22.1 and a g++ version of at least 12.x.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Install bitsandbytes from source&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Clone bitsandbytes repo, Ascend NPU backend is currently enabled on multi-backend-refactor branch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; bitsandbytes/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Install dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements-dev.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Install the dependencies for the compilation tools. Note that the commands for this step may vary depending on the operating system. The following are provided for reference&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;apt-get install -y build-essential cmake
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Compile &amp;amp; install  &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -DCOMPUTE_BACKEND&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;npu -S .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;Install transformers from the main branch.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone -b main https://github.com/huggingface/transformers.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; transformers
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;Set &lt;code&gt;double_quantization: false&lt;/code&gt; in the configuration. You can refer to the &lt;a class=&#34;link&#34; href=&#34;examples/train_qlora/llama3_lora_sft_bnb_npu.yaml&#34; &gt;example&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
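&lt;p&gt;With the three steps completed, the referenced example can be launched directly (a sketch, assuming a single NPU):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# run QLoRA fine-tuning on NPU 0 with double quantization disabled in the config
ASCEND_RT_VISIBLE_DEVICES=0 llamafactory-cli train examples/train_qlora/llama3_lora_sft_bnb_npu.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;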
&lt;/details&gt;
&lt;h3 id=&#34;data-preparation&#34;&gt;Data Preparation
&lt;/h3&gt;&lt;p&gt;Please refer to &lt;a class=&#34;link&#34; href=&#34;data/README.md&#34; &gt;data/README.md&lt;/a&gt; for details about the dataset file format. You can use datasets from the HuggingFace / ModelScope / Modelers hubs, load datasets from local disk, or specify a path to S3/GCS cloud storage.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
Please update &lt;code&gt;data/dataset_info.json&lt;/code&gt; to use your custom dataset.&lt;/p&gt;
&lt;/blockquote&gt;
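&lt;p&gt;As a minimal sketch, a custom alpaca-format file placed under &lt;code&gt;data/&lt;/code&gt; is registered with an entry like the following (the name &lt;code&gt;my_dataset&lt;/code&gt; and its file are hypothetical; see &lt;code&gt;data/README.md&lt;/code&gt; for all supported keys):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# hypothetical entry -- merge it into the existing JSON object in data/dataset_info.json
cat &amp;lt;&amp;lt;&#39;EOF&#39;
{
  &amp;#34;my_dataset&amp;#34;: {
    &amp;#34;file_name&amp;#34;: &amp;#34;my_dataset.json&amp;#34;
  }
}
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;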
&lt;p&gt;You can also use &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ConardLi/easy-dataset&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Easy Dataset&lt;/a&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/open-sciencelab/GraphGen&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GraphGen&lt;/a&gt;&lt;/strong&gt; to create synthetic data for fine-tuning.&lt;/p&gt;
&lt;h3 id=&#34;quickstart&#34;&gt;Quickstart
&lt;/h3&gt;&lt;p&gt;Use the following 3 commands to run LoRA &lt;strong&gt;fine-tuning&lt;/strong&gt;, &lt;strong&gt;inference&lt;/strong&gt; and &lt;strong&gt;merging&lt;/strong&gt; of the Llama3-8B-Instruct model, respectively.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli &lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; examples/merge_lora/llama3_lora_sft.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;See &lt;a class=&#34;link&#34; href=&#34;examples/README.md&#34; &gt;examples/README.md&lt;/a&gt; for advanced usage (including distributed training).&lt;/p&gt;
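&lt;p&gt;For instance, multi-GPU training can be launched from the same configs (a sketch, assuming the &lt;code&gt;FORCE_TORCHRUN&lt;/code&gt; switch available in recent LLaMA-Factory releases):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# launch distributed fine-tuning across all visible GPUs via torchrun
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;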
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
Use &lt;code&gt;llamafactory-cli help&lt;/code&gt; to show help information.&lt;/p&gt;
&lt;p&gt;Read &lt;a class=&#34;link&#34; href=&#34;https://github.com/hiyouga/LLaMA-Factory/issues/4614&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FAQs&lt;/a&gt; first if you encounter any problems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;fine-tuning-with-llama-board-gui-powered-by-gradio&#34;&gt;Fine-Tuning with LLaMA Board GUI (powered by &lt;a class=&#34;link&#34; href=&#34;https://github.com/gradio-app/gradio&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gradio&lt;/a&gt;)
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llamafactory-cli webui
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;build-docker&#34;&gt;Build Docker
&lt;/h3&gt;&lt;p&gt;For CUDA users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; docker/docker-cuda/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose up -d
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For Ascend NPU users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; docker/docker-npu/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose up -d
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For AMD ROCm users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; docker/docker-rocm/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose up -d
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker compose &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;details&gt;&lt;summary&gt;Build without Docker Compose&lt;/summary&gt;
&lt;p&gt;For CUDA users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker build -f ./docker/docker-cuda/Dockerfile &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_BNB&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_VLLM&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_DEEPSPEED&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_FLASHATTN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;PIP_INDEX&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;https://pypi.org/simple &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -t llamafactory:latest .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -dit --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./hf_cache:/root/.cache/huggingface &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./ms_cache:/root/.cache/modelscope &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./om_cache:/root/.cache/openmind &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./data:/app/data &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./output:/app/output &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 7860:7860 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 8000:8000 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --shm-size 16G &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --name llamafactory &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    llamafactory:latest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; -it llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For Ascend NPU users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Choose the docker image based on your environment&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker build -f ./docker/docker-npu/Dockerfile &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_DEEPSPEED&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;PIP_INDEX&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;https://pypi.org/simple &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -t llamafactory:latest .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Change `device` based on your resources&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -dit &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./hf_cache:/root/.cache/huggingface &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./ms_cache:/root/.cache/modelscope &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./om_cache:/root/.cache/openmind &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./data:/app/data &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./output:/app/output &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v /usr/local/dcmi:/usr/local/dcmi &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v /etc/ascend_install.info:/etc/ascend_install.info &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 7860:7860 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 8000:8000 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/davinci0 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/davinci_manager &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/devmm_svm &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/hisi_hdc &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --shm-size 16G &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --name llamafactory &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    llamafactory:latest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; -it llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For AMD ROCm users:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker build -f ./docker/docker-rocm/Dockerfile &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_BNB&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_VLLM&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_DEEPSPEED&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;INSTALL_FLASHATTN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;false&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --build-arg &lt;span class=&#34;nv&#34;&gt;PIP_INDEX&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;https://pypi.org/simple &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -t llamafactory:latest .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -dit &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./hf_cache:/root/.cache/huggingface &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./ms_cache:/root/.cache/modelscope &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./om_cache:/root/.cache/openmind &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./data:/app/data &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./output:/app/output &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -v ./saves:/app/saves &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 7860:7860 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    -p 8000:8000 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/kfd &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --device /dev/dri &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --shm-size 16G &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    --name llamafactory &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    llamafactory:latest
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; -it llamafactory bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/details&gt;
&lt;details&gt;&lt;summary&gt;Details about volume&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hf_cache&lt;/code&gt;: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ms_cache&lt;/code&gt;: Similar to Hugging Face cache but for ModelScope users.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;om_cache&lt;/code&gt;: Similar to Hugging Face cache but for Modelers users.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;data&lt;/code&gt;: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output&lt;/code&gt;: Set export dir to this location so that the merged result can be accessed directly on the host machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;h3 id=&#34;deploy-with-openai-style-api-and-vllm&#34;&gt;Deploy with OpenAI-style API and vLLM
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;API_PORT&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8000&lt;/span&gt; llamafactory-cli api examples/inference/llama3.yaml &lt;span class=&#34;nv&#34;&gt;infer_backend&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;vllm &lt;span class=&#34;nv&#34;&gt;vllm_enforce_eager&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;blockquote&gt;
&lt;p&gt;[!TIP]
Visit &lt;a class=&#34;link&#34; href=&#34;https://platform.openai.com/docs/api-reference/chat/create&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;this page&lt;/a&gt; for API document.&lt;/p&gt;
&lt;p&gt;Examples: &lt;a class=&#34;link&#34; href=&#34;scripts/api_example/test_image.py&#34; &gt;Image understanding&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;scripts/api_example/test_toolcall.py&#34; &gt;Function calling&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
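&lt;p&gt;Once the server is up, it can be queried with any OpenAI-compatible client; for example, with curl (a minimal sketch; the model name and prompt are placeholders):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# send a chat completion request to the OpenAI-style endpoint on port 8000
curl http://localhost:8000/v1/chat/completions \
  -H &amp;#34;Content-Type: application/json&amp;#34; \
  -d &#39;{&amp;#34;model&amp;#34;: &amp;#34;test&amp;#34;, &amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;}]}&#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;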
&lt;h3 id=&#34;download-from-modelscope-hub&#34;&gt;Download from ModelScope Hub
&lt;/h3&gt;&lt;p&gt;If you have trouble downloading models and datasets from Hugging Face, you can use ModelScope instead.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;USE_MODELSCOPE_HUB&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# `set USE_MODELSCOPE_HUB=1` for Windows&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Train the model by specifying a model ID of the ModelScope Hub as the &lt;code&gt;model_name_or_path&lt;/code&gt;. You can find a full list of model IDs at &lt;a class=&#34;link&#34; href=&#34;https://modelscope.cn/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ModelScope Hub&lt;/a&gt;, e.g., &lt;code&gt;LLM-Research/Meta-Llama-3-8B-Instruct&lt;/code&gt;.&lt;/p&gt;
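&lt;p&gt;For example, combining the environment variable with a command-line override of the model ID mentioned above (a minimal sketch):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# download Meta-Llama-3-8B-Instruct from ModelScope instead of Hugging Face
USE_MODELSCOPE_HUB=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml \
    model_name_or_path=LLM-Research/Meta-Llama-3-8B-Instruct
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;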
&lt;h3 id=&#34;download-from-modelers-hub&#34;&gt;Download from Modelers Hub
&lt;/h3&gt;&lt;p&gt;You can also use Modelers Hub to download models and datasets.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;export&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;USE_OPENMIND_HUB&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# `set USE_OPENMIND_HUB=1` for Windows&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Train the model by specifying a model ID of the Modelers Hub as the &lt;code&gt;model_name_or_path&lt;/code&gt;. You can find a full list of model IDs at &lt;a class=&#34;link&#34; href=&#34;https://modelers.cn/models&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Modelers Hub&lt;/a&gt;, e.g., &lt;code&gt;TeleAI/TeleChat-7B-pt&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;use-wb-logger&#34;&gt;Use W&amp;amp;B Logger
&lt;/h3&gt;&lt;p&gt;To use &lt;a class=&#34;link&#34; href=&#34;https://wandb.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Weights &amp;amp; Biases&lt;/a&gt; for logging experimental results, you need to add the following arguments to your YAML files.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;report_to&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;wandb&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;run_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;test_run&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;c&#34;&gt;# optional&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Set &lt;code&gt;WANDB_API_KEY&lt;/code&gt; to &lt;a class=&#34;link&#34; href=&#34;https://wandb.ai/authorize&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;your key&lt;/a&gt; when launching training tasks to log in with your W&amp;amp;B account.&lt;/p&gt;
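&lt;p&gt;Putting it together, a sketch of a launch command that logs to W&amp;amp;B (the key value is a placeholder; &lt;code&gt;report_to&lt;/code&gt; and &lt;code&gt;run_name&lt;/code&gt; are passed as command-line overrides):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# log this training run to your W&amp;amp;B account
WANDB_API_KEY=your_api_key llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml \
    report_to=wandb run_name=test_run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;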
&lt;h3 id=&#34;use-swanlab-logger&#34;&gt;Use SwanLab Logger
&lt;/h3&gt;&lt;p&gt;To use &lt;a class=&#34;link&#34; href=&#34;https://github.com/SwanHubX/SwanLab&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;SwanLab&lt;/a&gt; for logging experimental results, you need to add the following arguments to your YAML files.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;use_swanlab&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;true&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;swanlab_run_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;l&#34;&gt;test_run&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;c&#34;&gt;# optional&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;When launching training tasks, you can log in to SwanLab in three ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add &lt;code&gt;swanlab_api_key=&amp;lt;your_api_key&amp;gt;&lt;/code&gt; to the yaml file, and set it to your &lt;a class=&#34;link&#34; href=&#34;https://swanlab.cn/settings&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Set the environment variable &lt;code&gt;SWANLAB_API_KEY&lt;/code&gt; to your &lt;a class=&#34;link&#34; href=&#34;https://swanlab.cn/settings&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;swanlab login&lt;/code&gt; command to complete the login.&lt;/li&gt;
&lt;/ol&gt;
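&lt;p&gt;For instance, the second option keeps the key out of the config file (a minimal sketch; the key value is a placeholder):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# authenticate via the environment and log this run to SwanLab
SWANLAB_API_KEY=your_api_key llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml \
    use_swanlab=true swanlab_run_name=test_run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;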
&lt;h2 id=&#34;projects-using-llama-factory&#34;&gt;Projects using LLaMA Factory
&lt;/h2&gt;&lt;p&gt;If you have a project that should be incorporated, please contact us via email or create a pull request.&lt;/p&gt;
&lt;details&gt;&lt;summary&gt;Click to show&lt;/summary&gt;
&lt;ol&gt;
&lt;li&gt;Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2308.02223&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2308.10092&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2308.10526&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2311.07816&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2312.15710&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. KDD 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2401.04319&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. ACL 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2401.07286&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.05904&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.07625&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11176&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yang et al. LaCo: Large Language Model Pruning via Layer Collapse. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11187&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11746&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11801&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. ACL 2024 Findings. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11809&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.11819&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.12204&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.14714&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. ACL 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2402.15043&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.02333&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.03419&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.08228&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.09073&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. EDT: Improving Large Language Models&amp;rsquo; Generation by Entropy-based Dynamic Temperature Sampling. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.14541&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.15246&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. COLING 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.16008&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2403.16443&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.00604&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.02827&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.04167&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. ICML 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.04316&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.07084&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.09836&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.11581&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.14215&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.16621&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. ACL 2024 Findings. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.17140&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. NAACL 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2404.18585&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xu et al. Large Language Models for Cyber Security: A Systematic Literature Review. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.04760&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Dammu et al. &amp;ldquo;They are uncultured&amp;rdquo;: Unveiling Covert Harms and Social Threats in LLM Generated Conversations. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.05378&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yi et al. A safety realignment framework via subspace-oriented model fusion for large language models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.09055&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lou et al. SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.12739&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.13816&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang et al. TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2405.20215&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zihong Chen. Sentence Segmentation and Sentence Punctuation Based on XunziALLM. 2024. &lt;a class=&#34;link&#34; href=&#34;https://aclanthology.org/2024.lt4hala-1.30&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gao et al. The Best of Both Worlds: Toward an Honest and Helpful Large Language Model. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.00380&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wang and Song. MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.02106&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hu et al. Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.03136&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ge et al. Time Sensitive Knowledge Editing through Efficient Finetuning. ACL 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.04496&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tan et al. Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.05688&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Song et al. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.05955&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gu et al. RWKV-CLIP: A Robust Vision-Language Representation Learner. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.06973&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Chen et al. Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.07115&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhu et al. Are Large Language Models Good Statisticians? 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.07815&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Li et al. Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.10099&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ding et al. IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.10173&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;He et al. COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.12074&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lin et al. FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.14408&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Treutlein et al. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.14546&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Feng et al. SS-Bench: A Benchmark for Social Story Generation and Evaluation. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.15695&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Feng et al. Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.17233&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu et al. Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.18069&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Iyer et al. Exploring Very Low-Resource Translation with LLMs: The University of Edinburgh&amp;rsquo;s Submission to AmericasNLP 2024 Translation Task. AmericasNLP 2024. &lt;a class=&#34;link&#34; href=&#34;https://aclanthology.org/2024.americasnlp-1.25&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Li et al. Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2406.19949&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yang et al. Financial Knowledge Large Language Model. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.00365&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lin et al. DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.01470&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bako et al. Evaluating the Semantic Profiling Abilities of LLMs for Natural Language Utterances in Data Visualization. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.06129&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Huang et al. RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.08044&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jiang et al. LLM-Collaboration on Automatic Science Journalism for the General Audience. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.09756&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Inouye et al. Applied Auto-tuning on LoRA Hyperparameters. 2024. &lt;a class=&#34;link&#34; href=&#34;https://scholarcommons.scu.edu/cseng_senior/272/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Qi et al. Research on Tibetan Tourism Viewpoints information generation system based on LLM. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.13561&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xu et al. Course-Correction: Safety Alignment Using Synthetic Preferences. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.16637&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sun et al. LAMBDA: A Large Model Based Data Agent. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.17535&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhu et al. CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2407.19705&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yu et al. Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2408.00137&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xie et al. The Power of Personalized Datasets: Advancing Chinese Composition Writing for Elementary School through Targeted Model Fine-Tuning. IALP 2024. &lt;a class=&#34;link&#34; href=&#34;https://www.asianlp.sg/conferences/ialp2024/proceedings/papers/IALP2024_P055.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu et al. Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback. ICIC 2024. &lt;a class=&#34;link&#34; href=&#34;https://link.springer.com/chapter/10.1007/978-981-97-5669-8_11&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wang et al. Cybernetic Sentinels: Unveiling the Impact of Safety Data Selection on Model Security in Supervised Fine-Tuning. ICIC 2024. &lt;a class=&#34;link&#34; href=&#34;https://link.springer.com/chapter/10.1007/978-981-97-5669-8_23&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xia et al. Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2408.04693&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zeng et al. Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2408.04168&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Xia et al. Using Pre-trained Language Model for Accurate ESG Prediction. FinNLP 2024. &lt;a class=&#34;link&#34; href=&#34;https://aclanthology.org/2024.finnlp-2.1/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liang et al. I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm. 2024. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2408.08072&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[arxiv]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bai et al. Aligning Large Language Model with Direct Multi-Preference Optimization for Recommendation. CIKM 2024. &lt;a class=&#34;link&#34; href=&#34;https://dl.acm.org/doi/10.1145/3627673.3679611&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[paper]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Yu-Yang-Li/StarWhisper&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StarWhisper&lt;/a&gt;&lt;/strong&gt;: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/FudanDISC/DISC-LawLLM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DISC-LawLLM&lt;/a&gt;&lt;/strong&gt;: A large language model specialized in the Chinese legal domain, based on Baichuan-13B, capable of retrieving and reasoning over legal knowledge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/X-D-Lab/Sunsimiao&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Sunsimiao&lt;/a&gt;&lt;/strong&gt;: A large language model specialized in the Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/WangRongsheng/CareGPT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CareGPT&lt;/a&gt;&lt;/strong&gt;: A series of large language models for the Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/PKU-YuanGroup/Machine-Mindset/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MachineMindset&lt;/a&gt;&lt;/strong&gt;: A series of MBTI-personality large language models, capable of giving any LLM one of 16 personality types through different datasets and training methods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Nekochu/Luminia-13B-v3&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Luminia-13B-v3&lt;/a&gt;&lt;/strong&gt;: A large language model specialized in generating metadata for Stable Diffusion. &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/Nekochu/Luminia-13B_SD_Prompt&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[demo]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/BUAADreamer/Chinese-LLaVA-Med&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Chinese-LLaVA-Med&lt;/a&gt;&lt;/strong&gt;: A multimodal large language model specialized in the Chinese medical domain, based on LLaVA-1.5-7B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/THUDM/AutoRE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AutoRE&lt;/a&gt;&lt;/strong&gt;: A document-level relation extraction system based on large language models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/RTX-AI-Toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA RTX AI Toolkit&lt;/a&gt;&lt;/strong&gt;: SDKs for fine-tuning LLMs on Windows PCs with NVIDIA RTX GPUs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/LazyAGI/LazyLLM&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LazyLLM&lt;/a&gt;&lt;/strong&gt;: An easy and lazy way to build multi-agent LLM applications; supports model fine-tuning via LLaMA Factory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/NLPJCL/RAG-Retrieval&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;RAG-Retrieval&lt;/a&gt;&lt;/strong&gt;: A full pipeline for RAG retrieval model fine-tuning, inference, and distillation. &lt;a class=&#34;link&#34; href=&#34;https://zhuanlan.zhihu.com/p/987727357&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;[blog]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/Qihoo360/360-LLaMA-Factory&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;360-LLaMA-Factory&lt;/a&gt;&lt;/strong&gt;: A modified library that supports long-sequence SFT &amp;amp; DPO using ring attention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://novasky-ai.github.io/posts/sky-t1/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Sky-T1&lt;/a&gt;&lt;/strong&gt;: An o1-like model fine-tuned by NovaSky AI at very low cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/xming521/WeClone&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;WeClone&lt;/a&gt;&lt;/strong&gt;: One-stop solution for creating your digital avatar from chat logs.&lt;/li&gt;
&lt;/ol&gt;
&lt;/details&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;p&gt;This repository is licensed under the &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Apache-2.0 License&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Please follow the model licenses to use the corresponding model weights: &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Baichuan 2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/bigscience/license&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;BLOOM&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ChatGLM3&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://cohere.com/c4ai-cc-by-nc-license&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Command R&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Falcon&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://ai.google.dev/gemma/terms&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Gemma&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/THUDM/glm-4-9b/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GLM-4&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/openai/gpt-2/blob/master/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GPT-2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Granite&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/IndexTeam/Index-1.9B/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Index&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/InternLM/InternLM#license&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;InternLM&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://ai.meta.com/llama/license/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://llama.meta.com/llama3/license/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 3&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama 4&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;MiniCPM&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Mistral/Mixtral/Pixtral&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;OLMo&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-1.5/Phi-2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Phi-3/Phi-4&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Skywork/Skywork-13B-base/blob/main/Skywork%20Community%20License.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Skywork&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;StarCoder 2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Tele-AI/telechat-7B/blob/main/TeleChat%E6%A8%A1%E5%9E%8B%E7%A4%BE%E5%8C%BA%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TeleChat2&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;XVERSE&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yi&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Yi-1.5&lt;/a&gt; / &lt;a class=&#34;link&#34; href=&#34;https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Yuan 2&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;citation&#34;&gt;Citation
&lt;/h2&gt;&lt;p&gt;If this work is helpful, please cite it as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;@inproceedings{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  address={Bangkok, Thailand},
  publisher={Association for Computational Linguistics},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;h2 id=&#34;acknowledgement&#34;&gt;Acknowledgement
&lt;/h2&gt;&lt;p&gt;This repo benefits from &lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/peft&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PEFT&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/huggingface/trl&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;TRL&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/artidoro/qlora&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;QLoRA&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/lm-sys/FastChat&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;FastChat&lt;/a&gt;. Thanks for their wonderful work.&lt;/p&gt;
&lt;h2 id=&#34;star-history&#34;&gt;Star History
&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://api.star-history.com/svg?repos=hiyouga/LLaMA-Factory&amp;amp;type=Date&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Star History Chart&#34;
	
	
&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>llama-cookbook</title>
        <link>https://producthunt.programnotes.cn/en/p/llama-cookbook/</link>
        <pubDate>Wed, 09 Apr 2025 15:29:20 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/llama-cookbook/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1516503424803-708327384b90?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NDQxODM2NTd8&amp;ixlib=rb-4.0.3" alt="Featured image of post llama-cookbook" /&gt;&lt;h1 id=&#34;meta-llamallama-cookbook&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;meta-llama/llama-cookbook&lt;/a&gt;
&lt;/h1&gt;&lt;h1 id=&#34;llama-cookbook-the-official-guide-to-building-with-llama-models&#34;&gt;Llama Cookbook: The Official Guide to building with Llama Models
&lt;/h1&gt;&lt;p&gt;Check out our latest model tutorial here: &lt;a class=&#34;link&#34; href=&#34;./getting-started/build_with_llama_4.ipynb&#34; &gt;Build with Llama 4 Scout&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Welcome to the official repository for helping you get started with &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/inference/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;inference&lt;/a&gt;, &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;fine-tuning&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;end-to-end use-cases&lt;/a&gt; of building with the Llama Model family.&lt;/p&gt;
&lt;p&gt;This repository covers the most popular community approaches, use-cases and the latest recipes for Llama Text and Vision models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
Popular getting started links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/build_with_llama_4.ipynb&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Build with Llama 4 Scout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/inference/local_inference/README.md#multimodal-inference&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Multimodal Inference with Llama 3.2 Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/responsible_ai/llama_guard/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Inferencing using Llama Guard (Safety Model)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!TIP]
Popular end to end recipes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/email_agent/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Email Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/NotebookLlama/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NotebookLlama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/coding/text2sql/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Text to SQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: We recently refactored the repo; &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/archive-main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;archive-main&lt;/a&gt; is a snapshot branch from before the refactor.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;repository-structure&#34;&gt;Repository Structure:
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/3p-integrations&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;3P Integrations&lt;/a&gt;: Getting Started Recipes and End to End Use-Cases from various Llama providers&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;End to End Use Cases&lt;/a&gt;: As the name suggests, spanning various domains and applications&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Getting Started&lt;/a&gt;: Reference for inferencing, fine-tuning and RAG examples&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/src/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;src&lt;/a&gt;: Contains the source of the original llama-recipes library, along with some FAQs for fine-tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;faq&#34;&gt;FAQ:
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What happened to llama-recipes?
&lt;strong&gt;A:&lt;/strong&gt; We recently renamed llama-recipes to llama-cookbook.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Are there prompt template changes for multimodality?
&lt;strong&gt;A:&lt;/strong&gt; Llama 3.2 follows the same prompt template as Llama 3.1, with a new special token &lt;code&gt;&amp;lt;|image|&amp;gt;&lt;/code&gt; representing the input image for the multimodal models. More details on the prompt templates for image reasoning, tool-calling, and code interpreter can be found &lt;a class=&#34;link&#34; href=&#34;https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_2&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;on the documentation website&lt;/a&gt;. A minimal sketch of the resulting prompt string appears after this list.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; I have some questions about fine-tuning; is there a section that addresses them?
&lt;strong&gt;A:&lt;/strong&gt; Check out the Fine-Tuning FAQ &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/main/src/docs/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Some links are broken or folders are missing.
&lt;strong&gt;A:&lt;/strong&gt; We recently refactored the repo; &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-cookbook/tree/archive-main&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;archive-main&lt;/a&gt; is a snapshot branch from before the refactor.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Where can we find details about the latest models?
&lt;strong&gt;A:&lt;/strong&gt; Official &lt;a class=&#34;link&#34; href=&#34;https://www.llama.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Llama models website&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
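&lt;p&gt;To make the multimodal template above concrete, here is a minimal sketch (not an official snippet from this repository) of the raw prompt string it implies, assuming the Llama 3.1 header format with the &lt;code&gt;&amp;lt;|image|&amp;gt;&lt;/code&gt; token prepended to the user turn; consult the documentation website linked above for the authoritative format.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Minimal sketch: hand-building a Llama 3.2 multimodal prompt string.
# The header tokens follow the Llama 3.1 chat format described in the FAQ;
# the &amp;lt;|image|&amp;gt; token marks where the input image is injected.
prompt = (
    &#34;&amp;lt;|begin_of_text|&amp;gt;&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;\n\n&#34;
    &#34;&amp;lt;|image|&amp;gt;Describe this image in two sentences.&amp;lt;|eot_id|&amp;gt;&#34;
    &#34;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;\n\n&#34;
)
print(prompt)
&lt;/code&gt;&lt;/pre&gt;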
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;p&gt;Please read &lt;a class=&#34;link&#34; href=&#34;CONTRIBUTING.md&#34; &gt;CONTRIBUTING.md&lt;/a&gt; for details on our code of conduct, and the process for submitting pull requests to us.&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;!-- markdown-link-check-disable --&gt;
&lt;p&gt;See the License file for Meta Llama 3.2 &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt; and Acceptable Use Policy &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See the License file for Meta Llama 3.1 &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt; and Acceptable Use Policy &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See the License file for Meta Llama 3 &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt; and Acceptable Use Policy &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama3/USE_POLICY.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See the License file for Meta Llama 2 &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama2/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt; and Acceptable Use Policy &lt;a class=&#34;link&#34; href=&#34;https://github.com/meta-llama/llama-models/blob/main/models/llama2/USE_POLICY.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;here&lt;/a&gt;&lt;/p&gt;
&lt;!-- markdown-link-check-enable --&gt;
</description>
        </item>
        
    </channel>
</rss>
