What's Kreuzberg?
Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.
Key Features
- Async First: Optimized async using anyio and worker processes.
- Minimal Dependencies: Much smaller footprint compared to alternatives.
- Serverless-and-Docker Ready: Perfect for serverless functions and containerized deployments.
- Local Processing: All processing is done locally, with no API calls or cloud services.
- Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
- Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
What's New in Version 2.0?
Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here's what's new:
- Sync APIs: Kreuzberg supports synchronous extraction methods alongside async workflows.
- Batch Processing: Efficiently process multiple files or byte streams in parallel.
- Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.
- Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.
- Excel Multi-Sheet Support: Handle even the most complex spreadsheets.
- Enhanced Performance: Worker processes for faster, resource-efficient extraction.
Check out the v2.0 changelog for more details.
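To make the sync/async, batch, and OCR-fallback ideas above a bit more concrete, here is a minimal sketch of the general pattern using only the standard library. It is not Kreuzberg's API or implementation: `FakePdf`, `ExtractionResult`, `extract`, `batch_extract`, and the `_sync` wrappers are hypothetical names made up for illustration, the "extraction" and "OCR" steps are stand-ins, and the sketch uses worker threads via `asyncio.to_thread` where Kreuzberg itself uses anyio and worker processes.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FakePdf:
    """Hypothetical stand-in for a PDF (not a Kreuzberg type)."""
    text_layer: str    # what direct text extraction would find ('' for a scan)
    scanned_text: str  # what an OCR engine would recognize on the page images


@dataclass
class ExtractionResult:
    """Hypothetical result type for this sketch."""
    content: str
    used_ocr: bool


def _read_text_layer(pdf: FakePdf) -> str:
    # Stand-in for direct (cheap) text extraction.
    return pdf.text_layer.strip()


def _run_ocr(pdf: FakePdf) -> str:
    # Stand-in for a (slow) OCR pass.
    return pdf.scanned_text.strip()


async def extract(pdf: FakePdf) -> ExtractionResult:
    # Run the blocking work in a worker thread so the event loop stays free.
    text = await asyncio.to_thread(_read_text_layer, pdf)
    if text:
        return ExtractionResult(content=text, used_ocr=False)
    # Fall back to OCR only when direct extraction found no text.
    text = await asyncio.to_thread(_run_ocr, pdf)
    return ExtractionResult(content=text, used_ocr=True)


async def batch_extract(pdfs: Sequence[FakePdf]) -> List[ExtractionResult]:
    # Process all inputs concurrently; gather returns results in input order.
    return list(await asyncio.gather(*(extract(pdf) for pdf in pdfs)))


def extract_sync(pdf: FakePdf) -> ExtractionResult:
    # Sync wrapper for callers that are not running an event loop.
    return asyncio.run(extract(pdf))


def batch_extract_sync(pdfs: Sequence[FakePdf]) -> List[ExtractionResult]:
    return asyncio.run(batch_extract(pdfs))


if __name__ == "__main__":
    docs = [
        FakePdf(text_layer="Born-digital report", scanned_text=""),
        FakePdf(text_layer="", scanned_text="Scanned invoice"),
    ]
    for result in batch_extract_sync(docs):
        print(result.used_ocr, result.content)
```

The design point is that the async functions are the primary implementation and the sync functions are thin wrappers around them, so both entry points share one code path.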
Who's It For?
Kreuzberg is ideal for developers building:
- Retrieval-Augmented Generation (RAG) systems
- LLM-powered applications
- Document indexing, analysis, and automation tools
If you're looking for a lightweight, efficient solution for text extraction, Kreuzberg is a great choice.
How Kreuzberg Compares
Here's how Kreuzberg stacks up against alternatives:
- Python OSS Libraries
  - Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
  - Docling: Another strong alternative, but larger and heavier, and better suited for high-volume, GPU-based workloads.
- Non-Python OSS Libraries
  - Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
  - Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
Commercial APIs
Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they come with pricing concerns and cloud dependencies, unlike Kreuzberg.
Starring ⭐ is Caring
If Kreuzberg sounds like the library you've been looking for, check it out on GitHub.
Please star the repo ⭐. It helps others discover the project and motivates me to keep improving it!