LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
Updated Apr 2, 2026 - Python
A system for agentic LLM-powered data processing and ETL
A fast, helpful, and open-source document parser
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.
📄 PDF -> .MD/.JSON API & SDK for PaddleOCR-VL with structured data extraction. Self-hostable.
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office
Generic framework for historical document processing
Open-source batch OCR workbench: a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. Batch OCR workbench that runs entirely locally, a free alternative to ABBYY, suited to digitizing books and documents.
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
A PDF library for Rust
Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.
MCP server for seamless document format conversion and processing
Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).
Plug-and-play document AI with zero-shot models.
Transform unstructured documents into validated, rich and queryable knowledge graphs.