document-processing

Here are 708 public repositories matching this topic...

Zipstack / unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

data-extraction document-processing etl-pipelines unstructured-data-extraction api-deployments open-source-data-pipeline

Updated Apr 2, 2026
Python

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Mar 27, 2026
Python

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated Apr 1, 2026
TypeScript

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

Topdu / OpenOCR

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

ocr document-analysis document-processing scene-text-recognition scene-text-detection ocr-pytorch chineseocr document-parsing

Updated Mar 2, 2026
Python

ocrbase-hq / ocrbase

📄 PDF -> .MD/.JSON API & SDK for PaddleOCR-VL with structured data extraction. Self-hostable.

markdown pdf json typescript ocr ai websocket self-hosted drizzle bun document-processing react-hooks paddleocr elysia

Updated Apr 2, 2026
TypeScript

eclaire-labs / eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

open-source automation privacy ocr ai rest-api bookmarks self-hosted data-extraction note-taking web-archiving bookmark-manager task-management document-processing on-device-ai local-first personal-knowledge-management ai-assistant llm

Updated Mar 23, 2026
TypeScript

SylphxAI / pdf-reader-mcp

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

nodejs pdf performance typescript mcp stdio pdf-reader parallel-processing pdf-parser document-processing pdf-parse ai-tools ai-agent model-context-protocol model-content-protocol llm-tool

Updated Mar 30, 2026
TypeScript

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Apr 2, 2026
Rust

ShapeCrawler / ShapeCrawler

PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office

csharp dotnet presentation slides powerpoint openxml pptx document-processing office-open-xml

Updated Mar 31, 2026
C#

dhlab-epfl / dhSegment

Generic framework for historical document processing

tensorflow python3 segmentation historical-data document-processing

Updated Jul 9, 2021
Python

vorojar / Folio-OCR

Open-source batch OCR workbench: a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. Batch OCR workbench that runs entirely locally, a free alternative to ABBYY, suited to digitizing books and documents.

privacy ocr offline book-digitization document-processing document-ocr layout-detection markdown-export pdf-ocr local-ai ollama batch-ocr glm-ocr abbyy-alternative

Updated Apr 2, 2026
Python

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

document-processing document-data-extraction

Updated Nov 26, 2025
Python

awslabs / project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

aws machine-learning natural-language-processing computer-vision serverless hacktoberfest document-processing aws-cdk generative-ai retrieval-augmented-generation

Updated Jan 22, 2026
TypeScript

bzsanti / oxidizePdf

A PDF library for Rust

rust pdf ocr encryption text-extraction data-extraction invoice crates-io rust-library pdf-generation pdf-reader pdf-manipulation pdfa pdf-library table-extraction pdf-parser digital-signatures document-processing

Updated Apr 2, 2026
Rust

formkiq / formkiq-core

Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.

aws ocr serverless headless cloud-storage document-database amazon-web-services dms document-management optical-character-recognition document-processing document-management-system document-api document-apis intelligent-document-processing document-layer

Updated Apr 1, 2026
Java

Tele-AI / doc-ops-mcp

MCP server for seamless document format conversion and processing

document-conversion file-converter pdf-conversion markdown-converter watermark document-processing document-converter docx-to-pdf pdf-processing docx2pdf document-rewriting

Updated Mar 30, 2026
TypeScript

watat83 / document-chat-system

Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).

chatbot embeddings webapp document-management document-processing rag vector-search llms

Updated Jan 31, 2026
TypeScript

MantisAI / sieves

Plug-and-play document AI with zero-shot models.

nlp machine-learning zero-shot-learning document-processing few-shot-learning llm generative-ai structured-generation

Updated Feb 16, 2026
Python

docling-project / docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

ai convert knowledge-graph document-processing docling

Updated Apr 1, 2026
Python
