DEV Community

Linghua


Open-Source ETL to prepare data for RAG 🦀 🐍


My friend and I have built CocoIndex, an open-source ETL framework for preparing data for RAG.

🔥 Features:

  • Data flow programming
  • Custom logic support: plug in your own choice of chunking, embedding, and vector stores, and compose your own logic like Lego bricks. We have three examples in the repo for now. In the long run, we also want to support deduplication, reconciliation, and more.
  • Incremental updates: we provide state management out of the box to minimize recomputation. Right now, it checks whether a file from a data source has been updated. In the future, it will work at a finer granularity, e.g., at the chunk level.
  • Python SDK (Rust core 🦀 with Python bindings 🐍)
  • 🔗 GitHub Repo: CocoIndex - we'd appreciate your support with a GitHub star ⭐!
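
To make the pluggable-logic and incremental-update ideas above concrete, here is a rough sketch in plain Python. It is not CocoIndex's actual API: `IncrementalIndexer`, `split_by_paragraph`, and `toy_embed` are hypothetical names invented for illustration, and the content-hash check is just one assumed way to detect that a file has changed.

```python
import hashlib
from typing import Callable, Dict, List

# Pluggable pieces: any function with these shapes can be swapped in.
Chunker = Callable[[str], List[str]]
Embedder = Callable[[str], List[float]]


def split_by_paragraph(text: str) -> List[str]:
    """Hypothetical chunker: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def toy_embed(chunk: str) -> List[float]:
    """Hypothetical embedder: a stand-in for a real embedding model."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


class IncrementalIndexer:
    """File-level incremental indexing (illustrative, not CocoIndex code)."""

    def __init__(self, chunker: Chunker, embedder: Embedder) -> None:
        self.chunker = chunker
        self.embedder = embedder
        # State kept between runs: file name -> content hash.
        self.fingerprints: Dict[str, str] = {}
        # In-memory stand-in for a vector store: file name -> chunk vectors.
        self.index: Dict[str, List[List[float]]] = {}

    def update(self, files: Dict[str, str]) -> List[str]:
        """Re-process only new or changed files and return their names."""
        reprocessed: List[str] = []
        for name, content in files.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(name) == digest:
                continue  # unchanged since the last run, skip recomputation
            self.index[name] = [self.embedder(c) for c in self.chunker(content)]
            self.fingerprints[name] = digest
            reprocessed.append(name)
        # Drop state for files that disappeared from the source.
        for name in list(self.fingerprints):
            if name not in files:
                del self.fingerprints[name]
                self.index.pop(name, None)
        return reprocessed
```

Calling `update` twice with the same files re-processes nothing the second time; editing one file re-processes only that file. Moving to chunk-level granularity would mean fingerprinting each chunk instead of each file.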

We're sincerely looking for feedback and would love to learn from your thoughts. We'd love contributors too, if you're interested :) Thank you so much!

Top comments (1)

Linghua

I made a video tutorial too. Thanks a lot for your feedback! 🙏