<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Tagged PDF on Producthunt daily</title>
        <link>https://producthunt.programnotes.cn/en/tags/tagged-pdf/</link>
        <description>Recent content in Tagged PDF on Producthunt daily</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 10 Apr 2026 16:25:53 +0800</lastBuildDate><atom:link href="https://producthunt.programnotes.cn/en/tags/tagged-pdf/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>opendataloader-pdf</title>
        <link>https://producthunt.programnotes.cn/en/p/opendataloader-pdf/</link>
        <pubDate>Fri, 10 Apr 2026 16:25:53 +0800</pubDate>
        
        <guid>https://producthunt.programnotes.cn/en/p/opendataloader-pdf/</guid>
        <description>&lt;img src="https://images.unsplash.com/photo-1661793853662-bb363cbff451?ixid=M3w0NjAwMjJ8MHwxfHJhbmRvbXx8fHx8fHx8fDE3NzU4MDk0ODd8&amp;ixlib=rb-4.1.0" alt="Featured image of post opendataloader-pdf" /&gt;&lt;h1 id=&#34;opendataloader-projectopendataloader-pdf&#34;&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;opendataloader-project/opendataloader-pdf&lt;/a&gt;
&lt;/h1&gt;&lt;!-- AI-AGENT-SUMMARY
name: opendataloader-pdf
category: PDF data extraction, PDF accessibility automation
license: Apache-2.0
solves: [PDF to structured data for RAG/LLM pipelines, automate PDF accessibility compliance — layout analysis + auto-tagging to Tagged PDF (first open-source end-to-end)]
input: PDF files (digital, scanned, tagged)
output: Markdown, JSON (with bounding boxes), HTML, Tagged PDF, PDF/UA (enterprise)
sdk: Python, Node.js, Java
requirements: Java 11+
pricing: open-source core (data extraction, layout analysis, auto-tagging to Tagged PDF), enterprise add-on (PDF/UA export, accessibility studio)
extraction-benchmark: #1 overall extraction accuracy (0.907) in hybrid mode, 0.928 table extraction accuracy, 0.015s/page local mode
accessibility-validation: PDF Association collaboration, Well-Tagged PDF specification, veraPDF automated validation
key-differentiators: [benchmark #1 PDF parser, deterministic output, bounding boxes for every element, XY-Cut++ reading order, AI safety filters, hybrid AI mode, first open-source PDF auto-tagging to Tagged PDF, PDF Association + Dual Lab (veraPDF) collaboration, Well-Tagged PDF spec compliance]
--&gt;
&lt;h1 id=&#34;opendataloader-pdf&#34;&gt;OpenDataLoader PDF
&lt;/h1&gt;&lt;p&gt;&lt;strong&gt;PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/License-Apache_2.0-blue.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;License&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://pypi.org/project/opendataloader-pdf/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/pypi/v/opendataloader-pdf.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;PyPI version&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://www.npmjs.com/package/@opendataloader/pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/npm/v/@opendataloader/pdf.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;npm version&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Maven Central&#34;
	
	
&gt;&lt;/a&gt;
&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-pdf#java&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://img.shields.io/badge/Java-11%2B-blue.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Java&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://trendshift.io/repositories/21917&#34; target=&#34;_blank&#34;&gt;&lt;img src=&#34;https://trendshift.io/api/badge/repositories/21917&#34; alt=&#34;opendataloader-project%2Fopendataloader-pdf | Trendshift&#34; style=&#34;width: 250px; height: 55px;&#34; width=&#34;250&#34; height=&#34;55&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;🔍 &lt;strong&gt;PDF parser for AI data extraction&lt;/strong&gt; — Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.907 overall). Deterministic local mode + AI hybrid mode for complex pages.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How accurate is it?&lt;/strong&gt; — #1 in benchmarks: 0.907 overall, 0.928 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages (&lt;a class=&#34;link&#34; href=&#34;#extraction-benchmarks&#34; &gt;benchmarks&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scanned PDFs and OCR?&lt;/strong&gt; — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ (&lt;a class=&#34;link&#34; href=&#34;#hybrid-mode-1-accuracy-for-complex-pdfs&#34; &gt;hybrid mode&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tables, formulas, images, charts?&lt;/strong&gt; — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode (&lt;a class=&#34;link&#34; href=&#34;#hybrid-mode-1-accuracy-for-complex-pdfs&#34; &gt;hybrid mode&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do I use this for RAG?&lt;/strong&gt; — &lt;code&gt;pip install opendataloader-pdf&lt;/code&gt;, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs (&lt;a class=&#34;link&#34; href=&#34;#get-started-in-30-seconds&#34; &gt;quick start&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;#langchain-integration&#34; &gt;LangChain&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;♿ &lt;strong&gt;PDF accessibility automation&lt;/strong&gt; — The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What&amp;rsquo;s the problem?&lt;/strong&gt; — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn&amp;rsquo;t scale (&lt;a class=&#34;link&#34; href=&#34;#pdf-accessibility--pdfua-conversion&#34; &gt;regulations&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What&amp;rsquo;s free?&lt;/strong&gt; — Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency (&lt;a class=&#34;link&#34; href=&#34;#auto-tagging-preview-coming-q2-2026&#34; &gt;auto-tagging preview&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What about PDF/UA compliance?&lt;/strong&gt; — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step (&lt;a class=&#34;link&#34; href=&#34;#accessibility-pipeline&#34; &gt;pipeline&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why trust this?&lt;/strong&gt; — Built in collaboration with &lt;a class=&#34;link&#34; href=&#34;https://duallab.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dual Lab&lt;/a&gt; (&lt;a class=&#34;link&#34; href=&#34;https://verapdf.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;veraPDF&lt;/a&gt; developers) based on &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Association&lt;/a&gt; specifications, best practice guides and expertise of the &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org/community/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Community&lt;/a&gt;. Auto-tagging follows the &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org/wtpdf/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Well-Tagged PDF specification&lt;/a&gt;, validated with veraPDF (&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/tagged-pdf-collaboration&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;collaboration&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-started-in-30-seconds&#34;&gt;Get Started in 30 Seconds
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Requires&lt;/strong&gt;: Java 11+ and Python 3.10+ (&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-nodejs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Node.js&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-java&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Java&lt;/a&gt; also available)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Before you start: run &lt;code&gt;java -version&lt;/code&gt;. If not found, install JDK 11+ from &lt;a class=&#34;link&#34; href=&#34;https://adoptium.net/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Adoptium&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -U opendataloader-pdf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;opendataloader_pdf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;markdown,json&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/samples/image/example_annotated_pdf.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;OpenDataLoader PDF layout analysis — headings, tables, images detected with bounding boxes&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Annotated PDF output — each element (heading, paragraph, table, image) detected with bounding boxes and semantic type.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;what-problems-does-this-solve&#34;&gt;What Problems Does This Solve?
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Problem&lt;/th&gt;
          &lt;th&gt;Solution&lt;/th&gt;
          &lt;th&gt;Status&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;PDF structure lost during parsing&lt;/strong&gt; — wrong reading order, broken tables, no element coordinates&lt;/td&gt;
          &lt;td&gt;Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order&lt;/td&gt;
          &lt;td&gt;Shipped&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Complex tables, scanned PDFs, formulas, charts&lt;/strong&gt; need AI-level understanding&lt;/td&gt;
          &lt;td&gt;Hybrid mode routes complex pages to AI backend (#1 in benchmarks)&lt;/td&gt;
          &lt;td&gt;Shipped&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;PDF accessibility compliance&lt;/strong&gt; — EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc&lt;/td&gt;
          &lt;td&gt;Auto-tagging: layout analysis → Tagged PDF (free, Q2 2026). Built with PDF Association &amp;amp; veraPDF validation. PDF/UA export (enterprise add-on)&lt;/td&gt;
          &lt;td&gt;Auto-tag: Q2 2026&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;capability-matrix&#34;&gt;Capability Matrix
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Capability&lt;/th&gt;
          &lt;th&gt;Supported&lt;/th&gt;
          &lt;th&gt;Tier&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Data extraction&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Extract text with correct reading order&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Bounding boxes for every element&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Table extraction (simple borders)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Table extraction (complex/borderless)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free (Hybrid)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Heading hierarchy detection&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;List detection (numbered, bulleted, nested)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Image extraction with coordinates&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI chart/image description&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free (Hybrid)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;OCR for scanned PDFs&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free (Hybrid)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Formula extraction (LaTeX)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free (Hybrid)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tagged PDF structure extraction&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI safety (prompt injection filtering)&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Header/footer/watermark filtering&lt;/td&gt;
          &lt;td&gt;Yes&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Auto-tagging → Tagged PDF for untagged PDFs&lt;/td&gt;
          &lt;td&gt;Coming Q2 2026&lt;/td&gt;
          &lt;td&gt;Free (Apache 2.0)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;PDF/UA-1, PDF/UA-2 export&lt;/td&gt;
          &lt;td&gt;💼 Available&lt;/td&gt;
          &lt;td&gt;Enterprise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Accessibility studio (visual editor)&lt;/td&gt;
          &lt;td&gt;💼 Available&lt;/td&gt;
          &lt;td&gt;Enterprise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Process Word/Excel/PPT&lt;/td&gt;
          &lt;td&gt;No&lt;/td&gt;
          &lt;td&gt;—&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GPU required&lt;/td&gt;
          &lt;td&gt;No&lt;/td&gt;
          &lt;td&gt;—&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;extraction-benchmarks&#34;&gt;Extraction Benchmarks
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;opendataloader-pdf [hybrid] ranks #1 overall (0.907)&lt;/strong&gt; across reading order, table, and heading extraction accuracy.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Engine&lt;/th&gt;
          &lt;th&gt;Overall&lt;/th&gt;
          &lt;th&gt;Reading Order&lt;/th&gt;
          &lt;th&gt;Table&lt;/th&gt;
          &lt;th&gt;Heading&lt;/th&gt;
          &lt;th&gt;Speed (s/page)&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;opendataloader [hybrid]&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.907&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.934&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.928&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;0.821&lt;/td&gt;
          &lt;td&gt;0.463&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;docling&lt;/td&gt;
          &lt;td&gt;0.882&lt;/td&gt;
          &lt;td&gt;0.898&lt;/td&gt;
          &lt;td&gt;0.887&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.824&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;0.762&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;nutrient&lt;/td&gt;
          &lt;td&gt;0.880&lt;/td&gt;
          &lt;td&gt;0.924&lt;/td&gt;
          &lt;td&gt;0.662&lt;/td&gt;
          &lt;td&gt;0.811&lt;/td&gt;
          &lt;td&gt;0.230&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;marker&lt;/td&gt;
          &lt;td&gt;0.861&lt;/td&gt;
          &lt;td&gt;0.890&lt;/td&gt;
          &lt;td&gt;0.808&lt;/td&gt;
          &lt;td&gt;0.796&lt;/td&gt;
          &lt;td&gt;53.932&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;unstructured [hi_res]&lt;/td&gt;
          &lt;td&gt;0.841&lt;/td&gt;
          &lt;td&gt;0.904&lt;/td&gt;
          &lt;td&gt;0.588&lt;/td&gt;
          &lt;td&gt;0.749&lt;/td&gt;
          &lt;td&gt;3.008&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;edgeparse&lt;/td&gt;
          &lt;td&gt;0.837&lt;/td&gt;
          &lt;td&gt;0.894&lt;/td&gt;
          &lt;td&gt;0.717&lt;/td&gt;
          &lt;td&gt;0.706&lt;/td&gt;
          &lt;td&gt;0.036&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;opendataloader&lt;/td&gt;
          &lt;td&gt;0.831&lt;/td&gt;
          &lt;td&gt;0.902&lt;/td&gt;
          &lt;td&gt;0.489&lt;/td&gt;
          &lt;td&gt;0.739&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.015&lt;/strong&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;mineru&lt;/td&gt;
          &lt;td&gt;0.831&lt;/td&gt;
          &lt;td&gt;0.857&lt;/td&gt;
          &lt;td&gt;0.873&lt;/td&gt;
          &lt;td&gt;0.743&lt;/td&gt;
          &lt;td&gt;5.962&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;pymupdf4llm&lt;/td&gt;
          &lt;td&gt;0.732&lt;/td&gt;
          &lt;td&gt;0.885&lt;/td&gt;
          &lt;td&gt;0.401&lt;/td&gt;
          &lt;td&gt;0.412&lt;/td&gt;
          &lt;td&gt;0.091&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;unstructured&lt;/td&gt;
          &lt;td&gt;0.686&lt;/td&gt;
          &lt;td&gt;0.882&lt;/td&gt;
          &lt;td&gt;0.000&lt;/td&gt;
          &lt;td&gt;0.388&lt;/td&gt;
          &lt;td&gt;0.077&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;markitdown&lt;/td&gt;
          &lt;td&gt;0.589&lt;/td&gt;
          &lt;td&gt;0.844&lt;/td&gt;
          &lt;td&gt;0.273&lt;/td&gt;
          &lt;td&gt;0.000&lt;/td&gt;
          &lt;td&gt;0.114&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;liteparse&lt;/td&gt;
          &lt;td&gt;0.576&lt;/td&gt;
          &lt;td&gt;0.866&lt;/td&gt;
          &lt;td&gt;0.000&lt;/td&gt;
          &lt;td&gt;0.000&lt;/td&gt;
          &lt;td&gt;1.061&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Scores normalized to [0, 1]. Higher is better for accuracy; lower is better for speed. &lt;strong&gt;Bold&lt;/strong&gt; = best. &lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-bench&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Full benchmark details&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-bench&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Benchmark&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-bench&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;img src=&#34;https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark_quality.png&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Quality Breakdown&#34;
	
	
&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;which-mode-should-i-use&#34;&gt;Which Mode Should I Use?
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Your Document&lt;/th&gt;
          &lt;th&gt;Mode&lt;/th&gt;
          &lt;th&gt;Install&lt;/th&gt;
          &lt;th&gt;Server Command&lt;/th&gt;
          &lt;th&gt;Client Command&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Standard digital PDF&lt;/td&gt;
          &lt;td&gt;Fast (default)&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install opendataloader-pdf&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;None needed&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Complex or nested tables&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf-hybrid --port 5002&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Scanned / image-based PDF&lt;/td&gt;
          &lt;td&gt;Hybrid + OCR&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf-hybrid --port 5002 --force-ocr&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Non-English scanned PDF&lt;/td&gt;
          &lt;td&gt;Hybrid + OCR&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang &amp;quot;ko,en&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Mathematical formulas&lt;/td&gt;
          &lt;td&gt;Hybrid + formula&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf-hybrid --enrich-formula&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Charts needing description&lt;/td&gt;
          &lt;td&gt;Hybrid + picture&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf-hybrid --enrich-picture-description&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Untagged PDFs needing accessibility&lt;/td&gt;
          &lt;td&gt;Auto-tagging → Tagged PDF&lt;/td&gt;
          &lt;td&gt;Coming Q2 2026&lt;/td&gt;
          &lt;td&gt;—&lt;/td&gt;
          &lt;td&gt;—&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;quick-start&#34;&gt;Quick Start
&lt;/h2&gt;&lt;h3 id=&#34;python&#34;&gt;Python
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -U opendataloader-pdf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;opendataloader_pdf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;markdown,json&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;nodejs&#34;&gt;Node.js
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm install @opendataloader/pdf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-typescript&#34; data-lang=&#34;typescript&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kr&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;convert&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt; &lt;span class=&#34;kr&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;@opendataloader/pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;await&lt;/span&gt; &lt;span class=&#34;nx&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;([&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;file1.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;file2.pdf&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;folder/&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nx&#34;&gt;outputDir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;output/&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nx&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;markdown,json&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;java&#34;&gt;Java
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-xml&#34; data-lang=&#34;xml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.opendataloader&lt;span class=&#34;nt&#34;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;opendataloader-pdf-core&lt;span class=&#34;nt&#34;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-python&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Python Quick Start&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-nodejs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Node.js Quick Start&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-java&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Java Quick Start&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;hybrid-mode-1-accuracy-for-complex-pdfs&#34;&gt;Hybrid Mode: #1 Accuracy for Complex PDFs
&lt;/h2&gt;&lt;p&gt;Hybrid mode combines fast local Java processing with AI backends. Simple pages stay local (0.02s); complex pages route to AI for +90% table accuracy.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -U &lt;span class=&#34;s2&#34;&gt;&amp;#34;opendataloader-pdf[hybrid]&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Terminal 1&lt;/strong&gt; — Start the backend server:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf-hybrid --port &lt;span class=&#34;m&#34;&gt;5002&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Terminal 2&lt;/strong&gt; — Process PDFs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;hybrid&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;docling-fast&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;ocr-for-scanned-pdfs&#34;&gt;OCR for Scanned PDFs
&lt;/h3&gt;&lt;p&gt;Start the backend with &lt;code&gt;--force-ocr&lt;/code&gt; for image-based PDFs with no selectable text:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf-hybrid --port &lt;span class=&#34;m&#34;&gt;5002&lt;/span&gt; --force-ocr
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For non-English documents, specify the language:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf-hybrid --port &lt;span class=&#34;m&#34;&gt;5002&lt;/span&gt; --force-ocr --ocr-lang &lt;span class=&#34;s2&#34;&gt;&amp;#34;ko,en&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Supported languages: &lt;code&gt;en&lt;/code&gt;, &lt;code&gt;ko&lt;/code&gt;, &lt;code&gt;ja&lt;/code&gt;, &lt;code&gt;ch_sim&lt;/code&gt;, &lt;code&gt;ch_tra&lt;/code&gt;, &lt;code&gt;de&lt;/code&gt;, &lt;code&gt;fr&lt;/code&gt;, &lt;code&gt;ar&lt;/code&gt;, and more.&lt;/p&gt;
&lt;h3 id=&#34;formula-extraction-latex&#34;&gt;Formula Extraction (LaTeX)
&lt;/h3&gt;&lt;p&gt;Extract mathematical formulas as LaTeX from scientific PDFs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Server: enable formula enrichment&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf-hybrid --enrich-formula
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Output in JSON:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;formula&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;page number&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;bounding box&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;226.2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;144.7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;377.1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;168.7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;\\frac{f(x+h) - f(x)}{h}&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Formula and picture description enrichments require &lt;code&gt;--hybrid-mode full&lt;/code&gt; on the client side.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;chart--image-description&#34;&gt;Chart &amp;amp; Image Description
&lt;/h3&gt;&lt;p&gt;Generate AI descriptions for charts and images — useful for RAG search and accessibility alt text:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Server&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf-hybrid --enrich-picture-description
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Output in JSON:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;picture&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;page number&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;bounding box&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;72.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;400.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;540.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;650.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;description&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;A bar chart showing waste generation by region from 2016 to 2030...&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;blockquote&gt;
&lt;p&gt;Uses SmolVLM (256M), a lightweight vision model. Custom prompts supported via &lt;code&gt;--picture-description-prompt&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;hancom-data-loader-integration--coming-soon&#34;&gt;Hancom Data Loader Integration — Coming Soon
&lt;/h3&gt;&lt;p&gt;Enterprise-grade AI document analysis via &lt;a class=&#34;link&#34; href=&#34;https://sdk.hancom.com/en/services/1?utm_source=github&amp;amp;utm_medium=readme&amp;amp;utm_campaign=opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hancom Data Loader&lt;/a&gt; — customer-customized models trained on your domain-specific documents. 30+ element types (tables, charts, formulas, captions, footnotes, etc.), VLM-based image/chart understanding, complex table extraction (merged cells, nested tables), SLA-backed OCR for scanned documents, and native HWP/HWPX support. Supports PDF, DOCX, XLSX, PPTX, HWP, PNG, JPG. &lt;a class=&#34;link&#34; href=&#34;https://livedemo.sdk.hancom.com/en/dataloader?utm_source=github&amp;amp;utm_medium=readme&amp;amp;utm_campaign=opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Live demo&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/hybrid-mode&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hybrid Mode Guide&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;output-formats&#34;&gt;Output Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Format&lt;/th&gt;
          &lt;th&gt;Use Case&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Structured data with bounding boxes, semantic types&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Clean text for LLM context, RAG chunks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;HTML&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Web display with styling&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Annotated PDF&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Visual debugging — see detected structures (&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/demo/samples/01030000000000&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;sample&lt;/a&gt;)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Text&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Plain text extraction&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine formats: &lt;code&gt;format=&amp;quot;json,markdown&amp;quot;&lt;/code&gt;&lt;/p&gt;
&lt;h3 id=&#34;json-output-example&#34;&gt;JSON Output Example
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;heading&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;id&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;42&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;level&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Title&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;page number&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;bounding box&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;72.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;700.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;540.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;730.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;heading level&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;font&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Helvetica-Bold&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;font size&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;24.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;text color&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;[0.0]&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nt&#34;&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Introduction&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Field&lt;/th&gt;
          &lt;th&gt;Description&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;type&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Element type: heading, paragraph, table, list, image, caption, formula&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Unique identifier for cross-referencing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;page number&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1-indexed page reference&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;bounding box&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;[left, bottom, right, top]&lt;/code&gt; in PDF points (72pt = 1 inch)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;heading level&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Heading depth (1+)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;content&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Extracted text&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/json-schema&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Full JSON Schema&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;advanced-features&#34;&gt;Advanced Features
&lt;/h2&gt;&lt;h3 id=&#34;tagged-pdf-support&#34;&gt;Tagged PDF Support
&lt;/h3&gt;&lt;p&gt;When a PDF has structure tags, OpenDataLoader extracts the &lt;strong&gt;exact layout&lt;/strong&gt; the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_struct_tree&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# Use native PDF structure tags&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Most PDF parsers ignore structure tags entirely. &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/tagged-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Learn more&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;ai-safety-prompt-injection-protection&#34;&gt;AI Safety: Prompt Injection Protection
&lt;/h3&gt;&lt;p&gt;PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hidden text (transparent, zero-size fonts)&lt;/li&gt;
&lt;li&gt;Off-page content&lt;/li&gt;
&lt;li&gt;Suspicious invisible layers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/ai-safety&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AI Safety Guide&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;langchain-integration&#34;&gt;LangChain Integration
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -U langchain-opendataloader-pdf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;langchain_opendataloader_pdf&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenDataLoaderPDFLoader&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;loader&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OpenDataLoaderPDFLoader&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;file_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;text&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;documents&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;loader&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LangChain Docs&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/langchain-opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;GitHub&lt;/a&gt; | &lt;a class=&#34;link&#34; href=&#34;https://pypi.org/project/langchain-opendataloader-pdf/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PyPI&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;advanced-options&#34;&gt;Advanced Options
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;json,markdown,pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;image_output&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;embedded&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;        &lt;span class=&#34;c1&#34;&gt;# &amp;#34;off&amp;#34;, &amp;#34;embedded&amp;#34; (Base64), or &amp;#34;external&amp;#34; (default)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;image_format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;jpeg&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;            &lt;span class=&#34;c1&#34;&gt;# &amp;#34;png&amp;#34; or &amp;#34;jpeg&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;use_struct_tree&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# Use native PDF structure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/cli-options-reference&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Full CLI Options Reference&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;pdf-accessibility--pdfua-conversion&#34;&gt;PDF Accessibility &amp;amp; PDF/UA Conversion
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Millions of existing PDFs lack structure tags, failing accessibility regulations (EAA, ADA/Section 508, Korea Digital Inclusion Act). Manual remediation costs $50–200 per document and doesn&amp;rsquo;t scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenDataLoader&amp;rsquo;s approach&lt;/strong&gt;: Built in collaboration with &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Association&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://duallab.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dual Lab&lt;/a&gt; (developers of &lt;a class=&#34;link&#34; href=&#34;https://verapdf.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;veraPDF&lt;/a&gt;, the industry-reference open-source PDF/A and PDF/UA validator). Auto-tagging follows the &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org/resource/well-tagged-pdf/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Well-Tagged PDF specification&lt;/a&gt; and is validated programmatically using veraPDF — automated conformance checks against PDF accessibility standards, not manual review. No existing open-source tool generates Tagged PDFs end-to-end — most rely on proprietary SDKs for the tag-writing step. OpenDataLoader does it all under Apache 2.0. (&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/tagged-pdf-collaboration&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;collaboration details&lt;/a&gt;)&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Regulation&lt;/th&gt;
          &lt;th&gt;Deadline&lt;/th&gt;
          &lt;th&gt;Requirement&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;European Accessibility Act (EAA)&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;June 28, 2025&lt;/td&gt;
          &lt;td&gt;Accessible digital products across the EU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;ADA &amp;amp; Section 508&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;In effect&lt;/td&gt;
          &lt;td&gt;U.S. federal agencies and public accommodations&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Digital Inclusion Act&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;In effect&lt;/td&gt;
          &lt;td&gt;South Korea digital service accessibility&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;standards--validation&#34;&gt;Standards &amp;amp; Validation
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Aspect&lt;/th&gt;
          &lt;th&gt;Detail&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Specification&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://pdfa.org/resource/well-tagged-pdf/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Well-Tagged PDF&lt;/a&gt; by PDF Association&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Validation&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;a class=&#34;link&#34; href=&#34;https://verapdf.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;veraPDF&lt;/a&gt; — industry-reference open-source PDF/A &amp;amp; PDF/UA validator&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;PDF Association + &lt;a class=&#34;link&#34; href=&#34;https://duallab.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dual Lab&lt;/a&gt; (veraPDF developers) co-develop tagging and validation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Auto-tagging → Tagged PDF: Apache 2.0 (free). PDF/UA export: Enterprise&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;accessibility-pipeline&#34;&gt;Accessibility Pipeline
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Step&lt;/th&gt;
          &lt;th&gt;Feature&lt;/th&gt;
          &lt;th&gt;Status&lt;/th&gt;
          &lt;th&gt;Tier&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;1. &lt;strong&gt;Audit&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Read existing PDF tags, detect untagged PDFs&lt;/td&gt;
          &lt;td&gt;Shipped&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2. &lt;strong&gt;Auto-tag → Tagged PDF&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Generate structure tags for untagged PDFs&lt;/td&gt;
          &lt;td&gt;Coming Q2 2026&lt;/td&gt;
          &lt;td&gt;Free (Apache 2.0)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;3. &lt;strong&gt;Export PDF/UA&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Convert to PDF/UA-1 or PDF/UA-2 compliant files&lt;/td&gt;
          &lt;td&gt;💼 Available&lt;/td&gt;
          &lt;td&gt;Enterprise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;4. &lt;strong&gt;Visual editing&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Accessibility studio — review and fix tags&lt;/td&gt;
          &lt;td&gt;💼 Available&lt;/td&gt;
          &lt;td&gt;Enterprise&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💼 Enterprise features&lt;/strong&gt; are available on request. &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/contact&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Contact us&lt;/a&gt; to get started.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&#34;auto-tagging-preview-coming-q2-2026&#34;&gt;Auto-Tagging Preview (Coming Q2 2026)
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# API shape preview — available Q2 2026&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;auto_tag&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;                   &lt;span class=&#34;c1&#34;&gt;# Generate structure tags for untagged PDFs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;end-to-end-compliance-workflow&#34;&gt;End-to-End Compliance Workflow
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-gdscript3&#34; data-lang=&#34;gdscript3&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;Existing&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;PDFs&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;untagged&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;err&#34;&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;err&#34;&gt;┌─────────────────┐&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;┌─────────────────┐&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;┌─────────────────┐&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;┌─────────────────┐&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;mf&#34;&gt;1.&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Audit&lt;/span&gt;       &lt;span class=&#34;err&#34;&gt;│───&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;mf&#34;&gt;2.&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Auto&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Tag&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;│───&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;mf&#34;&gt;3.&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Export&lt;/span&gt;       &lt;span class=&#34;err&#34;&gt;│───&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;mf&#34;&gt;4.&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Studio&lt;/span&gt;       &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;check&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tags&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;   &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;→&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Tagged&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;PDF&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;PDF&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;UA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;        &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;  &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;visual&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;editor&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;err&#34;&gt;└─────────────────┘&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;└─────────────────┘&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;└─────────────────┘&lt;/span&gt;    &lt;span class=&#34;err&#34;&gt;└─────────────────┘&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;err&#34;&gt;▼&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;▼&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;▼&lt;/span&gt;                      &lt;span class=&#34;err&#34;&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;use_struct_tree&lt;/span&gt;         &lt;span class=&#34;n&#34;&gt;auto_tag&lt;/span&gt;              &lt;span class=&#34;n&#34;&gt;PDF&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;/&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;UA&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;export&lt;/span&gt;       &lt;span class=&#34;n&#34;&gt;Accessibility&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Studio&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Available&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;now&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Q2&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2026&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Apache&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;2.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enterprise&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;          &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Enterprise&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/accessibility-compliance&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Accessibility Guide&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;roadmap&#34;&gt;Roadmap
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Feature&lt;/th&gt;
          &lt;th&gt;Timeline&lt;/th&gt;
          &lt;th&gt;Tier&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Auto-tagging → Tagged PDF&lt;/strong&gt; — Generate Tagged PDFs from untagged PDFs&lt;/td&gt;
          &lt;td&gt;Q2 2026&lt;/td&gt;
          &lt;td&gt;Free&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;&lt;a class=&#34;link&#34; href=&#34;https://sdk.hancom.com/en/services/1?utm_source=github&amp;amp;utm_medium=readme&amp;amp;utm_campaign=opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hancom Data Loader&lt;/a&gt;&lt;/strong&gt; — Enterprise AI document analysis, customer-customized models, VLM-based chart/image understanding, production-grade OCR&lt;/td&gt;
          &lt;td&gt;Q2-Q3 2026&lt;/td&gt;
          &lt;td&gt;Planned&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Structure validation&lt;/strong&gt; — Verify PDF tag trees&lt;/td&gt;
          &lt;td&gt;Q2 2026&lt;/td&gt;
          &lt;td&gt;Planned&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/upcoming-roadmap&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Full Roadmap&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;frequently-asked-questions&#34;&gt;Frequently Asked Questions
&lt;/h2&gt;&lt;h3 id=&#34;what-is-the-best-pdf-parser-for-rag&#34;&gt;What is the best PDF parser for RAG?
&lt;/h3&gt;&lt;p&gt;For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this — it outputs structured JSON with bounding boxes, handles multi-column layouts with XY-Cut++, and runs locally without GPU. In hybrid mode, it ranks #1 overall (0.907) in benchmarks.&lt;/p&gt;
&lt;h3 id=&#34;what-is-the-best-open-source-pdf-parser&#34;&gt;What is the best open-source PDF parser?
&lt;/h3&gt;&lt;p&gt;OpenDataLoader PDF is the only open-source parser that combines: rule-based deterministic extraction (no GPU), bounding boxes for every element, XY-Cut++ reading order, built-in AI safety filters, native Tagged PDF support, and hybrid AI mode for complex documents. It ranks #1 in overall accuracy (0.907) while running locally on CPU.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-extract-tables-from-pdf-for-llm&#34;&gt;How do I extract tables from PDF for LLM?
&lt;/h3&gt;&lt;p&gt;OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode for +90% accuracy improvement (0.489 to 0.928 TEDS score):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;json&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;hybrid&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;docling-fast&amp;#34;&lt;/span&gt;           &lt;span class=&#34;c1&#34;&gt;# For complex tables&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;how-does-it-compare-to-docling-marker-or-pymupdf4llm&#34;&gt;How does it compare to docling, marker, or pymupdf4llm?
&lt;/h3&gt;&lt;p&gt;OpenDataLoader [hybrid] ranks #1 overall (0.907) across reading order, table, and heading accuracy. Key differences: docling (0.882) is strong but lacks bounding boxes and AI safety filters. marker (0.861) requires GPU and is 1000x slower (53.932s/page). pymupdf4llm (0.732) is fast but has poor table (0.401) and heading (0.412) accuracy. OpenDataLoader is the only parser that combines deterministic local extraction, bounding boxes for every element, and built-in prompt injection protection. See &lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-bench&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;full benchmark&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;can-i-use-this-without-sending-data-to-the-cloud&#34;&gt;Can I use this without sending data to the cloud?
&lt;/h3&gt;&lt;p&gt;Yes. OpenDataLoader runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The hybrid mode backend also runs locally on your machine. Ideal for legal, healthcare, and financial documents.&lt;/p&gt;
&lt;h3 id=&#34;does-it-support-ocr-for-scanned-pdfs&#34;&gt;Does it support OCR for scanned PDFs?
&lt;/h3&gt;&lt;p&gt;Yes, via hybrid mode. Install with &lt;code&gt;pip install &amp;quot;opendataloader-pdf[hybrid]&amp;quot;&lt;/code&gt;, start the backend with &lt;code&gt;--force-ocr&lt;/code&gt;, then process as usual. Supports multiple languages including Korean, Japanese, Chinese, Arabic, and more via &lt;code&gt;--ocr-lang&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;does-it-work-with-korean-japanese-or-chinese-documents&#34;&gt;Does it work with Korean, Japanese, or Chinese documents?
&lt;/h3&gt;&lt;p&gt;Yes. For digital PDFs, text extraction works out of the box. For scanned PDFs, use hybrid mode with &lt;code&gt;--force-ocr --ocr-lang &amp;quot;ko,en&amp;quot;&lt;/code&gt; (or &lt;code&gt;ja&lt;/code&gt;, &lt;code&gt;ch_sim&lt;/code&gt;, &lt;code&gt;ch_tra&lt;/code&gt;). Coming soon: &lt;a class=&#34;link&#34; href=&#34;https://sdk.hancom.com/en/services/1?utm_source=github&amp;amp;utm_medium=readme&amp;amp;utm_campaign=opendataloader-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hancom Data Loader&lt;/a&gt; integration — enterprise-grade AI document analysis with built-in production-grade OCR and customer-customized models optimized for your specific document types and workflows.&lt;/p&gt;
&lt;h3 id=&#34;how-fast-is-it&#34;&gt;How fast is it?
&lt;/h3&gt;&lt;p&gt;Local mode processes 60+ pages per second on CPU (0.02s/page). Hybrid mode processes 2+ pages per second (0.46s/page) with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. &lt;a class=&#34;link&#34; href=&#34;https://github.com/opendataloader-project/opendataloader-bench&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Full benchmark details&lt;/a&gt;. With multi-process batch processing, throughput exceeds 100 pages per second on 8+ core machines.&lt;/p&gt;
&lt;h3 id=&#34;does-it-handle-multi-column-layouts&#34;&gt;Does it handle multi-column layouts?
&lt;/h3&gt;&lt;p&gt;Yes. OpenDataLoader uses XY-Cut++ reading order analysis to correctly sequence text across multi-column pages, sidebars, and mixed layouts. This works in both local and hybrid modes without any configuration.&lt;/p&gt;
&lt;h3 id=&#34;what-is-hybrid-mode&#34;&gt;What is hybrid mode?
&lt;/h3&gt;&lt;p&gt;Hybrid mode combines fast local Java processing with an AI backend. Simple pages are processed locally (0.02s/page); complex pages (tables, scanned content, formulas, charts) are automatically routed to the AI backend for higher accuracy. The backend runs locally on your machine — no cloud required. See &lt;a class=&#34;link&#34; href=&#34;#which-mode-should-i-use&#34; &gt;Which Mode Should I Use?&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/hybrid-mode&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hybrid Mode Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;does-it-work-with-langchain&#34;&gt;Does it work with LangChain?
&lt;/h3&gt;&lt;p&gt;Yes. Install &lt;code&gt;langchain-opendataloader-pdf&lt;/code&gt; for an official LangChain document loader integration. See &lt;a class=&#34;link&#34; href=&#34;https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LangChain docs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-chunk-pdfs-for-rag&#34;&gt;How do I chunk PDFs for RAG?
&lt;/h3&gt;&lt;p&gt;OpenDataLoader outputs structured Markdown with headings, tables, and lists preserved — ideal input for semantic chunking. Each element in JSON output includes &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;heading level&lt;/code&gt;, and &lt;code&gt;page number&lt;/code&gt;, so you can split by section or page boundary. For most RAG pipelines: parse with &lt;code&gt;format=&amp;quot;markdown&amp;quot;&lt;/code&gt; for text chunks, or &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt; when you need element-level control. Pair with LangChain&amp;rsquo;s &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; or your own heading-based splitter for best results.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-cite-pdf-sources-in-rag-answers&#34;&gt;How do I cite PDF sources in RAG answers?
&lt;/h3&gt;&lt;p&gt;Every element in JSON output includes a &lt;code&gt;bounding box&lt;/code&gt; (&lt;code&gt;[left, bottom, right, top]&lt;/code&gt; in PDF points) and &lt;code&gt;page number&lt;/code&gt;. When your RAG pipeline returns an answer, map the source chunk back to its bounding box to highlight the exact location in the original PDF. This enables &amp;ldquo;click to source&amp;rdquo; UX — users see which paragraph, table, or figure the answer came from. No other open-source parser provides bounding boxes for every element by default.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-convert-pdf-to-markdown-for-llm&#34;&gt;How do I convert PDF to Markdown for LLM?
&lt;/h3&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;opendataloader_pdf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;opendataloader_pdf&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;convert&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;input_path&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;file1.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file2.pdf&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;folder/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;output_dir&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;output/&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nb&#34;&gt;format&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;markdown&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;OpenDataLoader preserves heading hierarchy, table structure, and reading order in the Markdown output. For complex documents with borderless tables or scanned pages, use hybrid mode (&lt;code&gt;hybrid=&amp;quot;docling-fast&amp;quot;&lt;/code&gt;) for higher accuracy. The output is clean enough to feed directly into LLM context windows or RAG chunking pipelines.&lt;/p&gt;
&lt;h3 id=&#34;is-there-an-automated-pdf-accessibility-remediation-tool&#34;&gt;Is there an automated PDF accessibility remediation tool?
&lt;/h3&gt;&lt;p&gt;Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with &lt;a class=&#34;link&#34; href=&#34;https://pdfa.org&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Association&lt;/a&gt; and &lt;a class=&#34;link&#34; href=&#34;https://duallab.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Dual Lab&lt;/a&gt; (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.&lt;/p&gt;
&lt;h3 id=&#34;is-this-really-the-first-open-source-pdf-auto-tagging-tool&#34;&gt;Is this really the first open-source PDF auto-tagging tool?
&lt;/h3&gt;&lt;p&gt;Yes. Existing tools either depend on proprietary SDKs for writing structure tags, only output non-PDF formats (e.g., Docling outputs Markdown/JSON but cannot produce Tagged PDFs), or require manual intervention. OpenDataLoader is the first to do layout analysis → tag generation → Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Auto-tagging follows the PDF Association&amp;rsquo;s Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-convert-existing-pdfs-to-pdfua&#34;&gt;How do I convert existing PDFs to PDF/UA?
&lt;/h3&gt;&lt;p&gt;OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (&lt;code&gt;use_struct_tree=True&lt;/code&gt;), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association&amp;rsquo;s Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/contact&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Contact us&lt;/a&gt; for enterprise integration.&lt;/p&gt;
&lt;h3 id=&#34;how-do-i-make-my-pdfs-accessible-for-eaa-compliance&#34;&gt;How do I make my PDFs accessible for EAA compliance?
&lt;/h3&gt;&lt;p&gt;The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association&amp;rsquo;s Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our &lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/accessibility-compliance&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Accessibility Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;is-opendataloader-pdf-free&#34;&gt;Is OpenDataLoader PDF free?
&lt;/h3&gt;&lt;p&gt;The core library is &lt;strong&gt;open-source under Apache 2.0&lt;/strong&gt; — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.&lt;/p&gt;
&lt;h3 id=&#34;why-did-the-license-change-from-mpl-20-to-apache-20&#34;&gt;Why did the license change from MPL 2.0 to Apache 2.0?
&lt;/h3&gt;&lt;p&gt;MPL 2.0 requires file-level copyleft, which often triggers legal review before enterprise adoption. Apache 2.0 is fully permissive — no copyleft obligations, easier to integrate into commercial projects. If you are using a pre-2.0 version, it remains under MPL 2.0 and you can continue using it. Upgrading to 2.0+ means your project follows Apache 2.0 terms, which are strictly more permissive — no additional obligations, no action needed on your side.&lt;/p&gt;
&lt;h2 id=&#34;documentation&#34;&gt;Documentation
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-python&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quick Start (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-nodejs&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quick Start (Node.js)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/quick-start-java&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Quick Start (Java)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/json-schema&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;JSON Schema Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/cli-options-reference&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;CLI Options&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/hybrid-mode&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hybrid Mode Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/tagged-pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Tagged PDF Support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/ai-safety&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;AI Safety Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://opendataloader.org/docs/accessibility-compliance&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;PDF Accessibility&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contributing&#34;&gt;Contributing
&lt;/h2&gt;&lt;p&gt;We welcome contributions! See &lt;a class=&#34;link&#34; href=&#34;CONTRIBUTING.md&#34; &gt;CONTRIBUTING.md&lt;/a&gt; for guidelines.&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License
&lt;/h2&gt;&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;LICENSE&#34; &gt;Apache License 2.0&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Versions prior to 2.0 are licensed under the &lt;a class=&#34;link&#34; href=&#34;https://www.mozilla.org/MPL/2.0/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mozilla Public License 2.0&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Found this useful?&lt;/strong&gt; Give us a star to help others discover OpenDataLoader.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
