DEV Community

LJ

[Open Source Project] Fresh Data For AI

Open Sourced - https://github.com/cocoindex-io/cocoindex

We are thrilled to announce the open-source release of CocoIndex, the world's first engine that supports both custom transformation logic and incremental updates specialized for data indexing.

CocoIndex is an ETL framework that prepares data for AI applications such as semantic search and retrieval-augmented generation (RAG). It offers a data-driven programming model that simplifies the creation and maintenance of data indexing pipelines while keeping data fresh and consistent.

CocoIndex is now open source under the Apache License 2.0. This means the core functionality of CocoIndex is freely available for anyone to use, modify, and distribute. We believe that open sourcing CocoIndex will foster innovation, enable broader adoption, and create a vibrant community of contributors who can help shape its future. By choosing the Apache License 2.0, we're ensuring that both individual developers and enterprises can confidently build upon and integrate CocoIndex into their projects while maintaining the flexibility to create proprietary extensions.

🔥 Key Features

  • Data Flow Programming: Build indexing pipelines by composing transformations like Lego blocks, with built-in state management and observability.

  • Support Custom Logic: Plug in your choice of chunking, embedding, and vector stores. Extend with custom transformations like deduplication and reconciliation.

  • Incremental Updates: Smart state management minimizes re-computation by tracking changes at the file level, with future support for chunk-level granularity.

  • Python SDK: Built with a Rust core 🦀 for performance, exposed through an intuitive Python binding 🐍 for ease of use.

We are moving fast, and many more features and improvements are coming soon.
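To make the "Lego blocks" idea a little more concrete, here is a minimal plain-Python sketch of composing a pluggable chunker and embedder into one indexing function. It deliberately does not use the CocoIndex SDK: `split_by_paragraph`, `toy_embedding`, and `build_pipeline` are hypothetical names made up for illustration, and the "embedding" is just a toy two-number vector. See the Quickstart Guide for the real API.

```python
from typing import Callable, List, Tuple

# Hypothetical building blocks for illustration only; these are NOT CocoIndex APIs.
Chunker = Callable[[str], List[str]]
Embedder = Callable[[str], List[float]]


def split_by_paragraph(text: str) -> List[str]:
    """Chunking step: split on blank lines and drop empty chunks."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def toy_embedding(chunk: str) -> List[float]:
    """Stand-in embedding: character count and word count as a 2-d vector."""
    return [float(len(chunk)), float(len(chunk.split()))]


def build_pipeline(chunker: Chunker, embedder: Embedder):
    """Compose a chunker and an embedder into a single indexing function."""
    def run(document: str) -> List[Tuple[str, List[float]]]:
        return [(chunk, embedder(chunk)) for chunk in chunker(document)]
    return run


# Swap in a different chunker or embedder without touching the rest.
index_document = build_pipeline(split_by_paragraph, toy_embedding)
rows = index_document("Hello world.\n\nCocoIndex keeps indexes fresh.")
```

Because each step is just a function with a fixed shape, replacing the chunker or the embedder is a one-argument change, which is the kind of flexibility the "Support Custom Logic" bullet describes.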

🚀 Getting Started
For a detailed walkthrough, refer to our Quickstart Guide.

🤗 Community
We welcome community contributions of all kinds, whether it's code improvements, documentation updates, issue reports and feature requests on GitHub, or discussions in our Discord.

GitHub: Please give our repository a star 🤗.
Documentation: Check out our documentation for detailed guides and API reference.
Discord: Join discussions, seek support, and share your experiences on our Discord server.
Social Media: Follow us on Twitter and LinkedIn for updates.

We are committed to fostering an inclusive, welcoming, and supportive environment. Contributing to CocoIndex should feel collaborative, friendly, and enjoyable for everyone. Together, we can build better AI applications through robust data infrastructure.

Looking forward to seeing what you build with CocoIndex!

Top comments (2)

James Wood

An open-source project like Fresh Data for AI would focus on providing real-time, high-quality datasets for training machine learning models. It could gather fresh data from various domains, such as healthcare, finance, or social media, ensuring continuous learning for AI systems. Collaboration on data preprocessing, cleaning, and annotation would enhance its usability. Contributors could share scripts and tools for automating data collection and curation, ensuring the project stays up to date with the latest trends in AI development. Would you like to dive deeper into project ideas or technical specifics?

LJ

Thanks! We have very detailed documentation here: cocoindex.io/docs/core/basics

An indexing flow, once set up, maintains a long-lived relationship between source data and indexes. This means:

  • The indexes created by the flow remain available for querying at any time.
  • When source data changes, the indexes are automatically updated to reflect those changes.

CocoIndex intelligently manages these updates by:

  • Determining which parts of the index need to be recomputed
  • Reusing existing computations where possible
  • Only reprocessing the minimum necessary data
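As a rough illustration of those three points, here is a small plain-Python sketch of file-level change tracking. It is not how CocoIndex is implemented internally: `IncrementalIndexer` and `fingerprint` are hypothetical names, and the assumption here is that a SHA-256 content hash is a good enough signal that a source file changed.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(content: str) -> str:
    """Content hash used to detect whether a source changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndexer:
    """Hypothetical sketch: re-run the transform only for changed sources."""

    def __init__(self, transform: Callable[[str], List[str]]):
        self.transform = transform
        # path -> (fingerprint of last seen content, cached transform output)
        self.state: Dict[str, Tuple[str, List[str]]] = {}

    def update(self, sources: Dict[str, str]) -> List[str]:
        """Sync with `sources` and return the paths that were reprocessed."""
        reprocessed = []
        for path, content in sources.items():
            digest = fingerprint(content)
            cached = self.state.get(path)
            if cached is not None and cached[0] == digest:
                continue  # unchanged: reuse the existing computation
            self.state[path] = (digest, self.transform(content))
            reprocessed.append(path)
        # Drop index entries whose source no longer exists.
        for path in list(self.state):
            if path not in sources:
                del self.state[path]
        return reprocessed


indexer = IncrementalIndexer(lambda text: text.split())
first = indexer.update({"a.txt": "hello world", "b.txt": "fresh data"})
second = indexer.update({"a.txt": "hello world", "b.txt": "fresh data for AI"})
```

On the first call both files are processed; on the second call only `b.txt` is, because the fingerprint of `a.txt` did not change and its cached output is reused.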

You can think of an indexing flow as similar to formulas in a spreadsheet:

  • In a spreadsheet, you define formulas that transform input cells into output cells
  • When input values change, the spreadsheet automatically recalculates affected outputs
  • You focus on defining the transformation logic, not managing updates
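The spreadsheet analogy can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: `Sheet`, `define`, and `set_input` are hypothetical names (not CocoIndex APIs), formulas are assumed to be defined after their inputs, and the dependency graph is assumed to have no cycles.

```python
from typing import Callable, Dict, List, Tuple


class Sheet:
    """Toy spreadsheet: formulas are re-evaluated only when an input they use changes."""

    def __init__(self) -> None:
        self.values: Dict[str, float] = {}
        # formula name -> (names of the cells it reads, function computing it)
        self.formulas: Dict[str, Tuple[List[str], Callable[..., float]]] = {}

    def define(self, name: str, inputs: List[str], fn: Callable[..., float]) -> None:
        """Declare the transformation logic and compute its initial value."""
        self.formulas[name] = (inputs, fn)
        self.values[name] = fn(*(self.values[i] for i in inputs))

    def set_input(self, name: str, value: float) -> List[str]:
        """Change an input cell; return the formulas that were recalculated."""
        self.values[name] = value
        return self._recalculate(name)

    def _recalculate(self, changed: str) -> List[str]:
        touched: List[str] = []
        for name, (inputs, fn) in self.formulas.items():
            if changed in inputs:
                self.values[name] = fn(*(self.values[i] for i in inputs))
                touched.append(name)
                # Propagate to formulas that depend on this one.
                touched.extend(self._recalculate(name))
        return touched


sheet = Sheet()
sheet.set_input("price", 10.0)
sheet.set_input("qty", 3.0)
sheet.define("subtotal", ["price", "qty"], lambda price, qty: price * qty)
sheet.define("total", ["subtotal"], lambda subtotal: subtotal * 1.5)
sheet.define("half_price", ["price"], lambda price: price / 2)

recalculated = sheet.set_input("qty", 4.0)
```

Changing `qty` recalculates `subtotal` and then `total`, while `half_price` is left alone because it never reads `qty`. You only wrote the formulas; the update logic came for free.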

We are very actively maintaining CocoIndex, with 70 PRs closed per week.
github.com/cocoindex-io/cocoindex

Looking forward to learning more!