
Jesse Williams for KitOps

Originally published at jozu.com

10 Must-Know Open Source Platform Engineering Tools for AI/ML Workflows

Building and shipping solutions faster has become the benchmark for innovation today. However, for artificial intelligence (AI) and machine learning (ML) teams, scaling workflows and delivering value at speed present unique challenges, including complex infrastructure, manual processes, and inefficiencies. Platform Engineering can streamline these workflows and automate repetitive tasks through Internal Developer Platforms (IDPs), thus enabling teams to focus on what matters most: delivering impactful AI/ML solutions.

The 2024 DORA Report emphasizes the significant impact of Platform Engineering: it is associated with a 60% increase in deployment frequency, an 8% increase in developer productivity, and a 10% increase in overall team performance. These 10 open source Platform Engineering tools can help you achieve similar results in your AI/ML projects.

TL;DR: Top open source Platform Engineering tools for AI/ML
Here’s a quick list of Platform Engineering tools I recommend to simplify AI/ML workflows and reduce infrastructure complexity:

  1. KitOps: Centralized versioning for all AI/ML project assets.
  2. Kubeflow: Streamlined ML workflow management on Kubernetes.
  3. DVC (Data Version Control): Ensures reproducibility by tracking datasets, code, and experiments.
  4. Seldon Core: Kubernetes-native tool for deploying and monitoring ML models.
  5. BentoML: Simplifies model packaging and deployment into production.
  6. Apache Airflow: Automates, monitors, and schedules ML pipelines.
  7. Prometheus: Real time infrastructure and ML deployment monitoring.
  8. Comet: Tracks ML experiments and provides insights into model performance.
  9. MLflow: Manages the lifecycle, including tracking, deployment, and model versioning.
  10. Feast: A centralized feature store for managing ML feature data in real time.

1. KitOps

KitOps simplifies ML workflows with reusable components, centralized versioning, and secure ModelKit packaging. It integrates seamlessly with tools like Docker, Terraform, and Kubernetes to automate deployment and accelerate development cycles. KitOps also simplifies the transition of Jupyter Notebooks from development to production, as seen in the image below.

How to securely move a Jupyter Notebook from development to production using open source KitOps.

In addition, KitOps provides engineers with a centralized, tamper-proof record to enhance transparency. It also offers a secure, ready-to-use package called ModelKit for packaging, versioning, and tracking your project's components: code, datasets, models, and metadata (as seen in the image below).

The ModelKit package format in KitOps.
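To make this concrete, here is a minimal sketch of packing and pushing a ModelKit with the kit CLI, driven from Python. The registry address, repository name, and tag are placeholders, and the exact commands and Kitfile fields should be confirmed against the KitOps documentation.

```python
# Minimal sketch: packing and pushing a ModelKit via the kit CLI.
# Assumes the kit CLI is installed and a Kitfile exists in the project root;
# the registry, repository, and tag below are placeholders.
import subprocess

MODELKIT_REF = "registry.example.com/my-team/churn-model:v1.0.0"

def run(cmd: list[str]) -> None:
    """Run a CLI command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Package the code, datasets, model weights, and metadata described in the Kitfile.
run(["kit", "pack", ".", "-t", MODELKIT_REF])

# Push the versioned ModelKit to an OCI-compatible registry.
run(["kit", "push", MODELKIT_REF])
```

A teammate or a CI job can then pull the same bundle (for example with `kit pull`) and get exactly the artifacts that were packaged, which is what gives the team a single, versioned source of truth.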

2. Kubeflow

Kubeflow is a Kubernetes-native, open source platform that simplifies ML workflow management on Kubernetes. It handles the complexities of containerization and supports end-to-end pipeline automation and distributed training on large datasets, making it ideal for production-grade ML systems.

The Kubeflow ecosystem for software development.

With features for experiment tracking and model version management, Kubeflow ensures reproducibility, consistency, and collaboration among ML teams across on-premises, cloud, and hybrid infrastructures.

Kubeflow as an internal developer platform
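As a small illustration, the sketch below defines a two-step pipeline with the Kubeflow Pipelines SDK (kfp v2) and compiles it to a YAML file that can be uploaded to a Kubeflow cluster. The component bodies are stand-ins for real preprocessing and training code.

```python
# Minimal sketch of a two-step Kubeflow Pipelines (kfp v2) pipeline.
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    # Stand-in for a real preprocessing step.
    return rows * 2

@dsl.component
def train(rows: int) -> str:
    # Stand-in for a real training step.
    return f"model trained on {rows} rows"

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 100):
    prep = preprocess(rows=rows)
    train(rows=prep.output)

if __name__ == "__main__":
    # Compile to a YAML definition that can be uploaded to a Kubeflow cluster.
    compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```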

3. Data Version Control (DVC)

Data Version Control is a powerful version control tool tailored for ML workflows. It ensures reproducibility by tracking and sharing data, pipelines, experiments, and models. With its Git-like interface, it integrates seamlessly with existing Git repositories, and it supports cloud storage backends such as AWS S3 and Azure Blob Storage, letting you version large datasets without bloating your Git repositories.

DVC as an internal developer platform

DVC also integrates with CI/CD pipelines, making it easy to automate testing and model deployment throughout the ML lifecycle.
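For example, DVC's Python API can stream a specific version of a dataset straight out of a Git repository and its configured remote storage. The repository URL, file path, and tag below are hypothetical.

```python
# Minimal sketch: reading a DVC-tracked dataset pinned to a Git revision.
# The repo URL, file path, and tag are hypothetical placeholders.
import dvc.api

with dvc.api.open(
    "data/train.csv",                              # DVC-tracked file
    repo="https://github.com/example/ml-project",  # Git repo with DVC metadata
    rev="v1.2.0",                                  # Git tag or commit to pin
) as f:
    header = f.readline()
    print(header)
```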

4. Seldon Core

Seldon Core addresses the complexity of Kubernetes, which over 40% of engineers reported as a challenge in a recent Cloud Native Computing Foundation (CNCF) survey, by enabling ML engineers to deploy models at scale without requiring Kubernetes expertise. It supports advanced deployment strategies, including A/B testing, canary rollouts, request logging, outlier detection, and multi-armed bandits, making it well suited to optimizing workflows and keeping deployments flexible.

Seldon Core as a model deployment platform

In addition to deployment capabilities, Seldon Core offers robust monitoring tools to track model performance in production.
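The sketch below shows the general shape of a Seldon Core (v1) Python model wrapper. The class name, model artifact, and serving command are illustrative and should be checked against the Seldon documentation for your version.

```python
# Minimal sketch of a Seldon Core v1 Python model wrapper.
# The class name and model artifact path are illustrative.
import joblib

class IrisClassifier:
    def __init__(self):
        # Load a pre-trained model artifact packaged into the container image.
        self.model = joblib.load("model.joblib")

    def predict(self, X, features_names=None):
        # Seldon routes each inference request to predict().
        return self.model.predict_proba(X)

# Typically containerized and started with something like:
#   seldon-core-microservice IrisClassifier --service-type MODEL
# then exposed on Kubernetes through a SeldonDeployment custom resource,
# where canary and A/B traffic splitting are configured declaratively.
```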

5. BentoML

BentoML is a Platform Engineering tool designed to deploy machine learning models at scale and build production-grade AI systems using any open source or custom fine-tuned models. It achieves this by utilizing a standardized framework to create a portable bundle called Bento, which encapsulates models, dependencies, and configurations in one place. This approach simplifies model management, enabling seamless deployment and integration across environments while ensuring consistency as you build.

Jamba 1.5 inference deployment on BentoCloud.
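As an illustration, here is roughly what a BentoML service looks like in the 1.x service/runner style. The model tag and service name are placeholders, and the newer @bentoml.service class style differs slightly.

```python
# Minimal sketch of a BentoML 1.x service; model tag and names are placeholders.
import bentoml
from bentoml.io import NumpyNdarray

# Assumes a model was saved to the local model store earlier, e.g.:
#   bentoml.sklearn.save_model("iris_clf", trained_model)
iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(input_array):
    # Delegate inference to the runner, which BentoML can scale independently.
    return await iris_runner.predict.async_run(input_array)
```

Running `bentoml build` then packages the service, its dependencies, and the model into a Bento that can be containerized and deployed.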

6. Apache Airflow

Apache Airflow makes it simple to author, schedule, and monitor ML workflows in Python. The tool's greatest advantage is its compatibility with virtually any system or process you are already running, which eliminates manual intervention and increases team productivity, in line with the principles of Platform Engineering.

Scheduling, authoring, and monitoring ML workflows with Apache Airflow.

In addition to scheduling and managing workflows, Apache Airflow supports multiple tools and services, thus extending the platform's customization capabilities. This feature further allows the team to manage ML infrastructure effectively.
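For instance, a daily retraining workflow can be expressed as a small DAG with Airflow's TaskFlow API. The task bodies below are stand-ins for real extraction and training logic.

```python
# Minimal sketch of a daily retraining DAG using Airflow's TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def retraining_pipeline():
    @task
    def extract() -> int:
        # Stand-in for pulling fresh training data.
        return 1_000

    @task
    def train(rows: int) -> str:
        # Stand-in for the actual training job.
        return f"model trained on {rows} rows"

    train(extract())

retraining_pipeline()
```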

7. Prometheus

Prometheus handles alerting and monitoring for your metrics. As an open source monitoring platform, it allows AI developers and ML engineers to gain insight into their infrastructure, create custom dashboards, and monitor their ML workflows in real time.

Prometheus metrics monitoring and alerting.

It also supports advanced graphing features and integrates with visualization tools such as Grafana to interpret and visualize data. Prometheus’ powerful query language enables slicing, dicing, and manipulating time-series data, with integration options for third-party metrics from sources like Docker and JMX.

Grafana Prometheus query builder
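To expose ML-serving metrics that Prometheus can scrape, the official Python client is usually enough. The metric names and port below are arbitrary placeholders; a PromQL query such as rate(model_predictions_total[5m]) then gives predictions per second, which Grafana can chart directly.

```python
# Minimal sketch: exposing inference metrics for Prometheus to scrape.
# Metric names and the port are arbitrary placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")

@LATENCY.time()
def predict() -> None:
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real inference work
    PREDICTIONS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        predict()
```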

8. Comet

Comet is an end-to-end ML platform developed to evaluate large language models (LLMs), track experiments, and monitor models in production. It supports self-hosted and cloud-based setups, allowing developers to log metrics, outputs, and hyperparameters in real time. With its powerful dashboard and visualization tools, Comet allows teams to present insights visually, making it easier to analyze and understand model performance.

Comet for LLM evaluations, experiment tracking, and production monitoring.

Because Comet integrates with many tools across the ML ecosystem, it adapts easily to existing stacks. This versatility helps developers streamline workflows, optimize AI/ML processes, and maintain efficiency throughout the model lifecycle.

Comet platform, experiments screen
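Logging an experiment to Comet takes only a few lines with the comet_ml SDK. The API key, workspace, and project name below are placeholders.

```python
# Minimal sketch: logging parameters and metrics to Comet.
# The API key, workspace, and project name are placeholders.
from comet_ml import Experiment

experiment = Experiment(
    api_key="YOUR_API_KEY",
    workspace="your-team",
    project_name="churn-model",
)

experiment.log_parameters({"learning_rate": 0.01, "epochs": 10})
for epoch in range(10):
    # Log a (fake) decreasing loss so it shows up on the Comet dashboard.
    experiment.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)

experiment.end()
```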

9. MLflow

MLflow provides developers with comprehensive tools for managing the entire ML lifecycle. Its four primary components—tracking, models, projects, and model registry—facilitate efficient, reproducible, and scalable ML pipeline building.

MLflow Tracking run details page, showing the MLflow Model artifacts.

Each of these components has various features (as listed below) to ease the complexity of the ML lifecycle.

  • Tracking enables developers to log and compare parameters, metrics, and results across runs.
  • Models packages trained models in a standard format for downstream deployment and serving.
  • Projects packages the code used in an ML project so runs are reproducible.
  • Model Registry provides a centralized model store for lifecycle tracking and versioning.

MLflow Tracking Chart view, showing a number of runs and a comparison of their parameters and metrics.
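A typical Tracking snippet looks like this; the experiment name and logged values are illustrative.

```python
# Minimal sketch: tracking a run with MLflow.
# The experiment name and logged values are illustrative.
import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 0.42)
    # A trained model could also be logged as an artifact, e.g.:
    # mlflow.sklearn.log_model(model, "model")
```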

10. Feast

Unlike the other tools, Feast solves a different issue: managing ML feature data. It simplifies feature management by storing and managing the definitions used to generate machine learning features and by making it straightforward to serve those features in production. Typically, it integrates with your existing data sources to streamline that management.

Feast for managing ML features.

Feast is particularly valuable when your models need to be retrained and redeployed frequently, because the same feature definitions are reused across training and serving.
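A small feature definition in Feast's Python SDK looks roughly like this. The entity, source file, and feature names are hypothetical, and the exact API varies slightly between Feast releases.

```python
# Minimal sketch of a Feast feature definition (recent Feast API style).
# Entity, file path, and feature names are hypothetical.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver_id", join_keys=["driver_id"])

stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="avg_trip_distance", dtype=Float32)],
    source=stats_source,
)
```

After registering these definitions (for example with `feast apply`), the same features can feed both offline training datasets and the online store used at inference time.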

Wrapping up

The best time to start working with a Platform Engineering tool was yesterday; the second-best time is right now. Skipping these tools slows down how fast you can build and drags on your team's productivity. Thankfully, as this list shows, there are plenty of open source Platform Engineering tools at your disposal.

Whether you're building scalable pipelines, tracking experiments, or deploying models into production, a tool like KitOps can tackle the complexities of ML projects and model development while keeping your workflow efficient, user-friendly, and robust.

If you have read up to this point, you’re already halfway there. Jumpstart your journey by joining our Discord to connect with fellow ML and platform engineers or explore our get started guide today. Was this list helpful? If so, share it with the people in your network!
