
Vijay Kodam

MLOps 101

What is MLOps?

MLOps is a set of practices that streamline and automate machine learning workflows. It integrates DevOps practices into machine learning workflows so that models can be built, deployed, and operated reliably and repeatably.

Why do we need MLOps?

Most of the time, as part of the machine learning workflow, teams go through EDA, data preparation, model training, tuning, and then model deployment and monitoring, only to find out that the model is not ready for production. You then have to repeat the whole process and retrain the model.

Since machine learning workflows were manual and several teams were involved at different stages of the process, maintaining them took a lot of time and effort.

Streamlining and automating these manual processes speeds up time to production and reduces manual errors and risks. It also makes it feasible to manage and monitor thousands of machine learning models, and it lets data scientists and engineers focus on model development and innovation.

(Figure: MLOps lifecycle diagram)

Key components of MLOps

The machine learning lifecycle has several interconnected stages, and together these key components make up MLOps. I have been going through various MLOps guides from AWS, Google, IBM, and Databricks and realized that they mostly follow the same key components.

Data Management

Data is the new oil. For ML, data makes or breaks a model; it is the backbone of any machine learning model. Fetching the right data, storing it, preprocessing it for model development, and versioning it are all very important.

Primarily, this stage consists of Exploratory Data Analysis (EDA), which involves exploring and understanding the data. Data preparation and feature engineering are also part of this step, which covers collecting and processing the data.

Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
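
To make this concrete, here is a minimal feature-engineering sketch using scikit-learn; the dataset and column names are made up for illustration:

```python
# A minimal feature-engineering sketch using scikit-learn (column names are hypothetical).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

raw = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000, 65000, 82000],
    "segment": ["a", "b", "a"],
})

# Scale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])

features = preprocess.fit_transform(raw)
print(features.shape)  # machine-readable feature matrix ready for model training
```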

Some MLOps implementations separate EDA and Data preparation into two stages.

Here are some of the tools used for data management (a minimal data-versioning sketch follows this list):

  • Data versioning: Data Version Control (DVC), Delta Lake, MLflow.
  • Data storage and management: Amazon S3, Google Cloud Storage, Azure Blob Storage, Google BigQuery, Amazon Redshift, Snowflake
  • Data Preparation: Apache Airflow, Databricks
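
As a small sketch of data versioning, the snippet below reads a specific version of a dataset through DVC's Python API; the repository URL, file path, and revision tag are placeholders, not from the original post:

```python
# A minimal sketch of loading a versioned dataset with DVC's Python API.
# The repository URL, file path, and revision tag are placeholders.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/train.csv",                            # path tracked by DVC (hypothetical)
    repo="https://github.com/example/ml-repo",   # Git repo holding the DVC metadata (hypothetical)
    rev="v1.2",                                  # tag or commit of the data version to fetch (hypothetical)
) as f:
    train = pd.read_csv(f)

print(train.shape)  # the exact dataset version used for a given training run
```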

Model Development

This stage involves the design, training, tuning, and evaluation of machine learning models.

Here are some of the tools and services used as part of model development (an MLflow tracking sketch follows this list):

  • Model development frameworks: TensorFlow / Keras, PyTorch, scikit-learn
  • Experiment tracking and Management: MLflow
  • AutoML: Amazon SageMaker Autopilot, Google AutoML, Azure Machine Learning Studio
  • IDEs: Jupyter Notebooks, RStudio, VS Code, etc.
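
As a minimal sketch of experiment tracking with MLflow, the following trains a toy scikit-learn model and logs its parameters, metric, and model artifact; the experiment name and hyperparameters are illustrative:

```python
# A minimal MLflow experiment-tracking sketch with a toy scikit-learn model.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("mlops-101-demo")  # experiment name is illustrative

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                  # record hyperparameters
    mlflow.log_metric("accuracy", accuracy)    # record evaluation metric
    mlflow.sklearn.log_model(model, "model")   # store the trained model artifact
```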

Model deployment

This stage focuses on packaging models, shipping them, and deploying them to production environments. It ensures the model is accessible via an API, microservice, or application.

Here are the tools and services used for model inferencing, serving, and model deployment (a minimal serving sketch follows this list):

  • Containers and Orchestration: KServe + Kubernetes platforms like Amazon EKS, GKE, and Azure Kubernetes Service (AKS).
  • Managed model deployment services: Amazon SageMaker, Google Vertex AI, Azure Machine Learning
  • Model Serving: Kubeflow, TorchServe, TensorFlow Serving
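
One common (but not the only) pattern is to wrap the trained model in a small REST API and package it into a container image for one of the platforms above. Here is a hedged FastAPI sketch; the model file and input schema are hypothetical:

```python
# A minimal model-serving sketch with FastAPI; the model file and input schema are hypothetical.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # trained model saved earlier (hypothetical path)

class PredictionRequest(BaseModel):
    features: list[float]  # flat numeric feature vector for one sample

@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

# Run locally with: uvicorn main:app --port 8080
# The container image built around this app can then be deployed to EKS, GKE, AKS, or a managed service.
```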

Model inference and serving

Model inference and serving involves making a deployed model available for use by applications and end users. It focuses on querying the deployed model to generate predictions.

Services like Amazon SageMaker Endpoints, Google Vertex AI Endpoints, Azure Machine Learning Endpoints, TensorFlow Serving, KServe and MLflow Models are used.
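
For example, a deployed SageMaker endpoint can be queried with boto3; the endpoint name and payload below are placeholders:

```python
# A minimal sketch of querying a deployed SageMaker endpoint with boto3.
# The endpoint name and payload format are placeholders.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-model-endpoint",   # hypothetical endpoint name
    ContentType="text/csv",             # input format the model container expects
    Body="5.1,3.5,1.4,0.2",             # one sample encoded as CSV
)

print(response["Body"].read().decode("utf-8"))  # prediction returned by the endpoint
```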

Model Monitoring

After deployment, continuous monitoring is essential to ensure that models perform as expected and maintain their accuracy over time.

Prometheus + Grafana is the go-to open-source stack for monitoring and a good way to get started.
Model monitoring services: AWS SageMaker Model Monitor, Evidently. There are also custom monitoring solutions built with tools like Kubeflow Pipelines.
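
To get started with the Prometheus + Grafana stack, the inference service can expose a few basic metrics for Prometheus to scrape; the metric names and port below are illustrative choices:

```python
# A minimal sketch of exposing inference metrics for Prometheus to scrape.
# Metric names and the port are illustrative choices.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def predict(features):
    with LATENCY.time():                   # record how long inference takes
        PREDICTIONS.inc()                  # count every prediction request
        time.sleep(random.random() / 100)  # stand-in for real model inference
        return 0.5

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
```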

Governance and Compliance

This key component ensures ML models are developed and deployed responsibly and ethically.

Model explainability can be done using Local Interpretable Model-agnostic Explanations (LIME) and SHAP (SHapley Additive exPlanations). MLflow supports audit and compliance, and Amazon Macie handles data security. Data and model lineage can be tracked using MLflow, Amazon SageMaker Model Registry, and Google Cloud Vertex AI Model Registry.
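
As a hedged illustration of model explainability, the snippet below computes SHAP values for a toy tree-based model; the dataset and model are stand-ins:

```python
# A minimal SHAP explainability sketch on a toy tree-based model.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)               # explainer suited to tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])   # per-feature contributions for 100 samples

# Summarize which features drive the model's predictions.
shap.summary_plot(shap_values, X.iloc[:100])
```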

Automated model retraining

Automated model retraining involves retraining the ML model when its performance degrades or when new data becomes available. In this stage, retraining is triggered when specific conditions are met; the model is then retrained on the latest data, and the retrained model is evaluated before it replaces the current one.
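
A hedged sketch of such a trigger is shown below: evaluate the current model on the latest data, retrain when performance drops below a threshold, and promote the candidate only if it scores better. The threshold and data-loading step are stand-ins:

```python
# A minimal retraining-trigger sketch; the threshold and data-loading steps are stand-ins.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # illustrative trigger condition

def load_latest_data():
    # Stand-in for fetching the newest labelled data from a feature store or warehouse.
    X, y = load_iris(return_X_y=True)
    return train_test_split(X, y, random_state=42)

def retrain_if_needed(current_model):
    X_train, X_test, y_train, y_test = load_latest_data()
    current_accuracy = accuracy_score(y_test, current_model.predict(X_test))

    if current_accuracy >= ACCURACY_THRESHOLD:
        return current_model  # still healthy, keep serving the existing model

    # Performance degraded: retrain on the latest data and evaluate the candidate.
    candidate = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    candidate_accuracy = accuracy_score(y_test, candidate.predict(X_test))
    return candidate if candidate_accuracy > current_accuracy else current_model
```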

Conclusion

As the adoption of machine learning skyrockets, the importance of MLOps is higher than ever. MLOps helps automate and streamline machine learning operations. I have tried to list some of the tools used in every key component/stage of the machine learning workflow. Which tools or services you choose for MLOps depends on whether you are running on AWS, Google Cloud, Azure, Databricks, bare metal, or open-source tooling.


Hope this provided you with a good MLOps overview! What tools and services do you use in your MLOps workflow?

If you are new to my posts, I regularly write about AWS, AI/ML, EKS, Kubernetes, and cloud computing topics. Do follow me on LinkedIn and check out my dev.to posts. You can find all my previous posts on my blog.

