Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.
Overview
Document image parsing is challenging because a page interleaves elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses this with a two-stage approach:
- Stage 1: Comprehensive page-level layout analysis that generates the element sequence in natural reading order
- Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
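
The two stages can be pictured with a small sketch. This is not Dolphin's implementation: the prompt strings, the `Element` fields, and the functions `analyze_layout`, `parse_element`, and `parse_page` are all hypothetical, the model calls are replaced by stubs, and a thread pool stands in for whatever parallel or batched decoding the real model uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prompts keyed by element type; the real prompts may differ.
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Read the formula in this region.",
}


@dataclass
class Element:
    order: int   # position in natural reading order (produced by stage 1)
    kind: str    # element type: "text", "table", or "formula"
    bbox: tuple  # (x1, y1, x2, y2) region that serves as the anchor


def analyze_layout(page):
    """Stage 1 stand-in: return the page's elements in reading order."""
    return [Element(i, kind, bbox) for i, (kind, bbox) in enumerate(page)]


def parse_element(element):
    """Stage 2 stand-in: pair the anchor region with its type-specific prompt."""
    prompt = PROMPTS.get(element.kind, PROMPTS["text"])
    # A real model call would go here; this stub only echoes its inputs.
    return {"order": element.order, "kind": element.kind, "prompt": prompt}


def parse_page(page, max_workers=4):
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so reading order is preserved.
        return list(pool.map(parse_element, elements))


page = [
    ("text", (0, 0, 100, 20)),
    ("table", (0, 30, 100, 80)),
    ("formula", (0, 90, 100, 110)),
]
results = parse_page(page)
```

The point of the sketch is the shape of the pipeline: one pass decides what is on the page and in which order, after which each element can be parsed independently with a prompt chosen by its type.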
Demo
Try our demo on Demo-Dolphin.
Changelog
- 2025.07.10 Released the Fox-Page Benchmark, a manually refined subset of the original Fox dataset. Download via: Baidu Yun | Google Drive.
- 2025.06.30 Added TensorRT-LLM support for accelerated inference!
- 2025.06.27 Added vLLM support for accelerated inference!
- 2025.06.13 Added multi-page PDF document parsing capability.
- 2025.05.21 Our demo is released at link. Check it out!
- 2025.05.20 The pretrained model and inference code of Dolphin are released.
- 2025.05.16 Our paper has been accepted by ACL 2025. Paper link: arXiv.
Installation
- Clone the repository:

      git clone https://github.com/ByteDance/Dolphin.git
      cd Dolphin

- Install the dependencies:

      pip install -r requirements.txt

- Download the pre-trained models using one of the following options:

  Option A: Original Model Format (config-based)
  Download from Baidu Yun or Google Drive and put them in the ./checkpoints folder.

  Option B: Hugging Face Model Format
  Visit our Hugging Face model card, or download the model with:

      # Download the model from Hugging Face Hub
      git lfs install
      git clone https://huggingface.co/ByteDance/Dolphin ./hf_model

      # Or use the Hugging Face CLI
      pip install huggingface_hub
      huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
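
The Hugging Face download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the wrapper name `download_dolphin` is hypothetical, while `snapshot_download` is the library's own function.

```python
def download_dolphin(local_dir="./hf_model"):
    """Fetch the ByteDance/Dolphin repository from the Hugging Face Hub."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the path of the local folder holding the files.
    return snapshot_download(repo_id="ByteDance/Dolphin", local_dir=local_dir)
```

Calling `download_dolphin()` places the files in `./hf_model`, the same target directory used by the commands above.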
Inference
Dolphin provides two inference frameworks (the original config-based one and Hugging Face), each supporting two parsing granularities:
- Page-level Parsing: Parse an entire document page into structured JSON and Markdown output
- Element-level Parsing: Parse individual document elements (text, table, formula)
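
To make the page-level output concrete, here is a rough sketch of how a list of parsed elements could be serialized to JSON and rendered as Markdown. The field names `reading_order`, `label`, and `text`, and the helpers `to_json` and `to_markdown`, are assumptions for illustration, not Dolphin's actual output schema.

```python
import json

# Illustrative parsed elements, deliberately listed out of reading order.
elements = [
    {"reading_order": 1, "label": "formula", "text": "E = mc^2"},
    {"reading_order": 0, "label": "text", "text": "Mass and energy are related."},
    {"reading_order": 2, "label": "table", "text": "| a | b |\n|---|---|\n| 1 | 2 |"},
]


def to_json(elements):
    """Serialize elements as a JSON array sorted by reading order."""
    ordered = sorted(elements, key=lambda e: e["reading_order"])
    return json.dumps(ordered, ensure_ascii=False, indent=2)


def to_markdown(elements):
    """Render elements as Markdown blocks separated by blank lines."""
    blocks = []
    for el in sorted(elements, key=lambda e: e["reading_order"]):
        if el["label"] == "formula":
            # Wrap formulas in display-math delimiters.
            blocks.append("$$" + el["text"] + "$$")
        else:
            # Text and table content is passed through unchanged.
            blocks.append(el["text"])
    return "\n\n".join(blocks)
```

Element-level parsing corresponds to handling a single entry of such a list in isolation, for example one cropped table or formula image.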
Page-level Parsing
Using Original Framework (config-based)
Using Hugging Face Framework
Element-level Parsing
Using Original Framework (config-based)
Using Hugging Face Framework
Key Features
- Two-stage analyze-then-parse approach based on a single VLM
- Promising performance on document parsing tasks
- Natural reading order element sequence generation
- Heterogeneous anchor prompting for different document elements
- Efficient parallel parsing mechanism
- Support for Hugging Face Transformers for easier integration
Notice
Call for Bad Cases: If you encounter cases where the model performs poorly, we would greatly appreciate it if you shared them by opening an issue. We are continuously working to optimize and improve the model.
Acknowledgement
We would like to acknowledge the open-source projects that provided inspiration and reference for this work.
Citation
If you find this code useful for your research, please cite the Dolphin paper (ACL 2025).