Na'aman Hirschfeld

Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction library

🔍 What’s Kreuzberg?

Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.

Key Features

  • Async First: Async APIs built on anyio and backed by worker processes.
  • Minimal Dependencies: A much smaller footprint than heavier alternatives such as Unstructured.io or Docling.
  • Serverless and Docker Ready: Well suited to serverless functions and containerized deployments.
  • Local Processing: All processing is done locally, with no API calls or cloud services.
  • Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
  • Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
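To make the "async first, sync too" idea concrete, here is a minimal sketch of the general pattern. It is not Kreuzberg's actual API: the names `extract_text`, `extract_text_sync`, and `_read_plain_text` are hypothetical, and it uses the standard library's asyncio as a stand-in for anyio so that it runs without extra dependencies.

```python
import asyncio
from pathlib import Path


def _read_plain_text(path: Path) -> str:
    # Hypothetical blocking extractor: a stand-in for real PDF/OCR work.
    return path.read_text(encoding="utf-8")


async def extract_text(path: Path) -> str:
    # Async entry point: push the blocking call onto a worker thread
    # so the event loop stays free for other tasks (Python 3.9+).
    return await asyncio.to_thread(_read_plain_text, path)


def extract_text_sync(path: Path) -> str:
    # Sync facade over the async function, for callers without an event loop.
    return asyncio.run(extract_text(path))
```

The async function is the single source of truth and the sync function is a thin wrapper around it, so both interfaces share one code path.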

🚀 What’s New in Version 2.0?

Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:

  • Sync APIs: Kreuzberg now offers synchronous extraction methods alongside the async ones.
  • Batch Processing: Efficiently process multiple files or byte streams in parallel.
  • Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.
  • Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.
  • Excel Multi-Sheet Support: Extract content from workbooks with multiple sheets.
  • Enhanced Performance: Worker processes for faster, resource-efficient extraction.
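Two of the items above, batch processing and the OCR fallback, can be sketched in a few lines. This is an illustration under assumptions rather than Kreuzberg's implementation: `extract_with_fallback`, `batch_extract`, and the `min_chars` threshold are hypothetical names, the `direct` and `ocr` extractors are supplied by the caller, and concurrency here comes from plain asyncio coroutines rather than worker processes.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# An extractor takes raw bytes and returns text (hypothetical signature).
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_with_fallback(
    data: bytes, direct: Extractor, ocr: Extractor, min_chars: int = 1
) -> str:
    # Try direct text extraction first; treat any failure as "no text found".
    try:
        text = await direct(data)
    except Exception:
        text = ""
    # Keep the direct result only if it contains enough non-whitespace text.
    if len(text.strip()) >= min_chars:
        return text
    # Otherwise fall back to OCR.
    return await ocr(data)


async def batch_extract(
    items: Sequence[bytes], direct: Extractor, ocr: Extractor
) -> List[str]:
    # Run all extractions concurrently; gather returns results in input order.
    tasks = [extract_with_fallback(item, direct, ocr) for item in items]
    return list(await asyncio.gather(*tasks))
```

Because `asyncio.gather` preserves input order, each result can be matched back to the item it came from by position.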

Check out the v2.0 changelog for more details.

🎯 Who’s It For?

Kreuzberg is ideal for developers building:

  • Retrieval-Augmented Generation (RAG) systems
  • LLM-powered applications
  • Document indexing, analysis, and automation tools

If you’re looking for a lightweight, efficient solution for text extraction, Kreuzberg is a great choice.

⚖️ How Kreuzberg Compares

Here’s how Kreuzberg stacks up against alternatives:

  1. Python OSS Libraries

    • Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
    • Docling: Another strong alternative, but larger and heavier; it is better suited to high-volume, GPU-based workloads.
  2. Non-Python OSS Libraries

    • Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
    • Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
  3. Commercial APIs
    Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they come with usage costs and cloud dependencies, whereas Kreuzberg runs entirely locally.

Starring ⭐ is Caring

If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.

Please star the repo ⭐. It helps others discover the project and motivates me to keep improving it!
