
Vijay Kodam

MLOps 101

What is MLOps?

MLOps is a set of practices that streamline and automate machine learning workflows. It integrates DevOps practices into machine learning workflows so that models can be built, deployed, and operated reliably and repeatably.

Why do we need MLOps?

Most of the time, as part of the machine learning workflow, teams go through EDA, data preparation, model training, tuning, and then model deployment and monitoring, only to find out that the model is not ready for production. You then have to repeat the whole process and retrain the model.

Since machine learning workflows were manual and several teams were involved at different stages of the process, maintaining them took a lot of time and effort.

Streamlining and automating these manual processes speeds up time to production and reduces manual errors and risks. It also makes it feasible to manage and monitor thousands of machine learning models, and it lets data scientists and engineers focus on model development and innovation.

(Figure: MLOps lifecycle diagram)

Key components of MLOps

The machine learning lifecycle has several interconnected stages, and together these key components make up MLOps. I have been going through various MLOps guides from AWS, Google, IBM, and Databricks and realized that they mostly follow the same key components.

Data Management

Data is the new oil. For ML, data makes or breaks a model; it is the backbone of any machine learning model. Fetching the right data, storing it, preprocessing it for model development, and versioning it are all very important.

Primarily, this stage consists of Exploratory Data Analysis (EDA), which involves exploring and understanding the data. Data preparation and feature engineering are also part of this step, which covers collecting and processing the data.

Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
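
To make this concrete, here is a minimal feature-engineering sketch using scikit-learn; the dataset and column names are made up for illustration:

```python
# A minimal feature-engineering sketch using scikit-learn (column names are hypothetical).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

raw = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000, 65000, 82000],
    "segment": ["a", "b", "a"],
})

# Scale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])

features = preprocess.fit_transform(raw)
print(features.shape)  # machine-readable feature matrix ready for model training
```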

Some MLOps implementations separate EDA and Data preparation into two stages.

Here are some of the tools used for data management (a minimal data-versioning sketch follows this list):

  • Data versioning: Data Version Control (DVC), Delta Lake, MLflow.
  • Data storage and management: Amazon S3, Google Cloud Storage, Azure Blob Storage, Google BigQuery, Amazon Redshift, Snowflake
  • Data Preparation: Apache Airflow, Databricks
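
As a small sketch of data versioning, the snippet below reads a specific version of a dataset through DVC's Python API; the repository URL, file path, and revision tag are placeholders, not from the original post:

```python
# A minimal sketch of loading a versioned dataset with DVC's Python API.
# The repository URL, file path, and revision tag are placeholders.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/train.csv",                            # path tracked by DVC (hypothetical)
    repo="https://github.com/example/ml-repo",   # Git repo holding the DVC metadata (hypothetical)
    rev="v1.2",                                  # tag or commit of the data version to fetch (hypothetical)
) as f:
    train = pd.read_csv(f)

print(train.shape)  # the exact dataset version used for a given training run
```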

Model Development

This stage involves the design, training, tuning, and evaluation of machine learning models.

Here are some of the tools and services used as part of model development (an MLflow tracking sketch follows this list):

  • Model development frameworks: TensorFlow / Keras, PyTorch, scikit-learn
  • Experiment tracking and Management: MLflow
  • AutoML: Amazon SageMaker Autopilot, Google AutoML, Azure Machine Learning Studio
  • IDEs: Jupyter Notebooks, RStudio, VS Code, etc.
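
As a minimal sketch of experiment tracking with MLflow, the following trains a toy scikit-learn model and logs its parameters, metric, and model artifact; the experiment name and hyperparameters are illustrative:

```python
# A minimal MLflow experiment-tracking sketch with a toy scikit-learn model.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("mlops-101-demo")  # experiment name is illustrative

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                  # record hyperparameters
    mlflow.log_metric("accuracy", accuracy)    # record evaluation metric
    mlflow.sklearn.log_model(model, "model")   # store the trained model artifact
```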

Model deployment

This stage focuses on packaging models, shipping them, and deploying them to production environments. It ensures the model is accessible via an API, microservice, or application.

Here are the tools and services used for model inferencing, serving, and model deployment (a minimal serving sketch follows this list):

  • Containers and Orchestration: KServe + Kubernetes platforms like Amazon EKS, GKE, and Azure Kubernetes Service (AKS).
  • Managed model deployment services: Amazon SageMaker, Google Vertex AI, Azure Machine Learning
  • Model Serving: Kubeflow, TorchServe, TensorFlow Serving
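
One common (but not the only) pattern is to wrap the trained model in a small REST API and package it into a container image for one of the platforms above. Here is a hedged FastAPI sketch; the model file and input schema are hypothetical:

```python
# A minimal model-serving sketch with FastAPI; the model file and input schema are hypothetical.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # trained model saved earlier (hypothetical path)

class PredictionRequest(BaseModel):
    features: list[float]  # flat numeric feature vector for one sample

@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

# Run locally with: uvicorn main:app --port 8080
# The container image built around this app can then be deployed to EKS, GKE, AKS, or a managed service.
```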

Model inference and serving

Model inference and serving involves making a deployed model available for use by applications and end users. It focuses on querying the deployed model to generate predictions.

Services like Amazon SageMaker Endpoints, Google Vertex AI Endpoints, Azure Machine Learning Endpoints, TensorFlow Serving, KServe and MLflow Models are used.
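
For example, a deployed SageMaker endpoint can be queried with boto3; the endpoint name and payload below are placeholders:

```python
# A minimal sketch of querying a deployed SageMaker endpoint with boto3.
# The endpoint name and payload format are placeholders.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-model-endpoint",   # hypothetical endpoint name
    ContentType="text/csv",             # input format the model container expects
    Body="5.1,3.5,1.4,0.2",             # one sample encoded as CSV
)

print(response["Body"].read().decode("utf-8"))  # prediction returned by the endpoint
```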

Model Monitoring

After deployment, continuous monitoring is essential to ensure that models perform as expected and maintain their accuracy over time.

Prometheus + Grafana is the go-to open-source stack for monitoring and a good way to get started.
Model monitoring services: AWS SageMaker Model Monitor, Evidently. There are also custom monitoring solutions built with tools like Kubeflow Pipelines.
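
To get started with the Prometheus + Grafana stack, the inference service can expose a few basic metrics for Prometheus to scrape; the metric names and port below are illustrative choices:

```python
# A minimal sketch of exposing inference metrics for Prometheus to scrape.
# Metric names and the port are illustrative choices.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def predict(features):
    with LATENCY.time():                   # record how long inference takes
        PREDICTIONS.inc()                  # count every prediction request
        time.sleep(random.random() / 100)  # stand-in for real model inference
        return 0.5

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
```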

Governance and Compliance

This key component ensures ML models are developed and deployed responsibly and ethically.

Model explainability can be done using Local Interpretable Model-agnostic Explanations (LIME) and SHAP (SHapley Additive exPlanations). MLflow supports audit and compliance, and Amazon Macie handles data security. Data and model lineage can be tracked using MLflow, Amazon SageMaker Model Registry, and Google Cloud Vertex AI Model Registry.
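
As a hedged illustration of model explainability, the snippet below computes SHAP values for a toy tree-based model; the dataset and model are stand-ins:

```python
# A minimal SHAP explainability sketch on a toy tree-based model.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)               # explainer suited to tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])   # per-feature contributions for 100 samples

# Summarize which features drive the model's predictions.
shap.summary_plot(shap_values, X.iloc[:100])
```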

Automated model retraining

Automated model retraining involves retraining the ML model when its performance degrades or when new data becomes available. In this stage, retraining is triggered when specific conditions are met; the model is then retrained on the latest data, and the retrained model is evaluated before it replaces the current one.
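
A hedged sketch of such a trigger is shown below: evaluate the current model on the latest data, retrain when performance drops below a threshold, and promote the candidate only if it scores better. The threshold and data-loading step are stand-ins:

```python
# A minimal retraining-trigger sketch; the threshold and data-loading steps are stand-ins.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # illustrative trigger condition

def load_latest_data():
    # Stand-in for fetching the newest labelled data from a feature store or warehouse.
    X, y = load_iris(return_X_y=True)
    return train_test_split(X, y, random_state=42)

def retrain_if_needed(current_model):
    X_train, X_test, y_train, y_test = load_latest_data()
    current_accuracy = accuracy_score(y_test, current_model.predict(X_test))

    if current_accuracy >= ACCURACY_THRESHOLD:
        return current_model  # still healthy, keep serving the existing model

    # Performance degraded: retrain on the latest data and evaluate the candidate.
    candidate = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    candidate_accuracy = accuracy_score(y_test, candidate.predict(X_test))
    return candidate if candidate_accuracy > current_accuracy else current_model
```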

Conclusion

As the adoption of machine learning skyrockets, the importance of MLOps is higher than ever. MLOps helps automate and streamline machine learning operations. I have tried to list some of the tools used in every key component/stage of the machine learning workflow. Which tools or services you choose for MLOps depends on whether you are running on AWS, Google Cloud, Azure, Databricks, bare metal, or open-source tooling.


Hope this provided you with a good MLOps overview! What tools and services do you use in your MLOps workflow?

If you are new to my posts, I regularly write about AWS, AI/ML, EKS, Kubernetes, and cloud computing topics. Do follow me on LinkedIn and check out my dev.to posts. You can find all my previous posts on my blog.

