Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction Library


🔍 What’s Kreuzberg?

Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.

- Async First: Optimized async execution using anyio and worker processes.
- Minimal Dependencies: Much smaller footprint than the alternatives.
- Serverless and Docker Ready: Well suited to serverless functions and containerized deployments.
- Local Processing: All processing is done locally, with no API calls or cloud services.
- Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
- Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
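
To make the unified async/sync idea concrete, here is a minimal sketch of how one blocking extraction step can be exposed through both an async and a sync entry point. This is not Kreuzberg's code or API: the names `extract_text`, `extract_text_sync`, and `_read_text_blocking` are hypothetical, and it uses the standard library's `asyncio` with a worker thread rather than anyio and worker processes, purely to stay self-contained.

```python
import asyncio
import tempfile
from pathlib import Path


def _read_text_blocking(path: Path) -> str:
    """Hypothetical stand-in for a blocking extraction step (plain file read)."""
    return path.read_text(encoding="utf-8")


async def extract_text(path: Path) -> str:
    """Async entry point: offload the blocking work so the event loop stays free."""
    # asyncio.to_thread is available from Python 3.9 onward.
    return await asyncio.to_thread(_read_text_blocking, path)


def extract_text_sync(path: Path) -> str:
    """Sync entry point: drive the async function to completion."""
    return asyncio.run(extract_text(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "note.txt"
        sample.write_text("hello from a text file", encoding="utf-8")
        print(extract_text_sync(sample))
```

The sync wrapper is deliberately thin, so both call styles share one implementation. A real library would likely also need to handle being called from inside an already running event loop, which `asyncio.run` does not allow.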

Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:

- Sync APIs: Kreuzberg now supports synchronous extraction methods alongside async workflows.
- Batch Processing: Efficiently process multiple files or byte streams in parallel.
- Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.
- Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.
- Excel Multi-Sheet Support: Handle workbooks with multiple sheets, even complex ones.
- Enhanced Performance: Worker processes for faster, resource-efficient extraction.
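
Batch processing and the OCR fallback combine naturally, so here is a rough sketch of the pattern. Again, this is an illustration under stated assumptions, not Kreuzberg's implementation: `extract_with_fallback`, `batch_extract`, `fake_direct`, and `fake_ocr` are hypothetical names, and treating a raised `ValueError` or empty output as a "failed" direct extraction is my assumption for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# An extractor takes raw bytes and returns text asynchronously.
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_with_fallback(data: bytes, direct: Extractor, ocr: Extractor) -> str:
    """Try direct extraction first; fall back to OCR if it fails or yields no text."""
    try:
        text = await direct(data)
    except ValueError:
        # Assumption: the direct extractor signals failure with ValueError.
        text = ""
    if text.strip():
        return text
    return await ocr(data)


async def batch_extract(items: Sequence[bytes], direct: Extractor, ocr: Extractor) -> List[str]:
    """Process all inputs concurrently; results keep the input order."""
    tasks = (extract_with_fallback(item, direct, ocr) for item in items)
    return list(await asyncio.gather(*tasks))


async def fake_direct(data: bytes) -> str:
    """Hypothetical extractor: inputs starting with b'scan' have no text layer."""
    return "" if data.startswith(b"scan") else data.decode()


async def fake_ocr(data: bytes) -> str:
    """Hypothetical OCR step that just tags its output."""
    return "ocr:" + data.decode()


if __name__ == "__main__":
    print(asyncio.run(batch_extract([b"hello", b"scan-1"], fake_direct, fake_ocr)))
```

Passing the extractors in as parameters keeps the fallback logic testable without real PDFs or an OCR engine.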

Check out the v2.0 changelog for more details.

Kreuzberg is ideal for developers building:

If you’re looking for a lightweight, efficient solution for text extraction, Kreuzberg is a great choice.

Here’s how Kreuzberg stacks up against alternatives:

Python OSS Libraries
- Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
- Docling: Another strong alternative, but a heavier one that is better suited for high-volume, GPU-based workloads.
Non-Python OSS Libraries
- Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
- Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
Commercial APIs
Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they cost money to use and depend on a cloud service, whereas Kreuzberg does all of its processing locally.

If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.

Please star the repo ⭐. It helps others discover the project and motivates me to keep improving it!