
Observations on MLOps: A Fragmented Mosaic of Mismatched Expectations

Author: Jason Corso (Professor of Robotics and EECS at University of Michigan | Co-Founder and Chief Science Officer @ Voxel51)


This blog riffs on key elements of the state of practice in MLOps. Image Generated by DALL-E.

Over the last five years, I’ve spent a lot of time talking with production AI/ML teams — think emphases on unstructured data, machine learning, curating datasets, training models, etc. Great conversations. Lots of learning for me: twenty years of work in the academic world does not a knowledgeable production-AI/ML expert make. And I’ve developed some abstractions over what seem to be the axes of variation in production AI/ML teams.

So, when a colleague mentioned they had worked at Arrikto, an MLOps platform built on top of Kubeflow, my antennae went up, and I headed over to the Arrikto site.

Building on my recent blog about AI/ML tools, I riff on three key points they emphasized on their site (screenshot below) and how they resonate with my observations. These three points address who the users of AI/ML tools are, the state of the MLOps tool space, and end-to-end solutions in AI/ML. I try to be candid in my discussion, but it’s complex out there in the MLOps world. Along those lines, it’s worth noting that Arrikto was acquired by HPE, and their site is a bit dated, but not in a way that detracts from this discussion.


Snapshot from Arrikto that describes a view on the current state of machine learning from an MLOps perspective. Source: https://arrikto.com.

Data scientists aren’t IT experts

Yes, totally agree with this statement: data scientists aren’t IT experts. This gap may be one of the key blockers I’ve observed to maximizing the value of underlying capabilities in open source and enterprise software. If a team cannot implement and maintain effective operating environments for the data work and the model work that happens in production AI/ML, its members tend to work in individualized silos, often resorting to archaic mechanisms for managing and sharing work, such as spreadsheets for unstructured data annotation management. This is probably why tools like Kubeflow are so popular.

AI/ML staff — data scientists, data engineers, machine learning scientists, machine learning engineers, etc. — work with the data, the models, and the results, doing design, development, and analysis. Now, perhaps there are purple squirrels out there who grew up drinking the Debian Kool-Aid and just have a knack for setting up systems. But they are rare.

Of course, the kind of infrastructure setup a team needs depends highly on the type of AI/ML work it is doing. Keeping this general, I have observed that startups often expect more from their AI/ML staff regarding the necessary MLOps work. Specifically, they expect AI/ML staff to also become infrastructure experts. This has good and bad sides. On the good side, there is little to no red tape, so things can move fast, and the staff stay close to the metal. On the bad side, there is often less experience and know-how in getting infrastructure to work; there are frequently fewer actual resources to leverage; and because the infrastructure is usually seen as a distraction from the main work, it stays in a state of constant change.

The picture is often quite different in larger organizations, with dedicated staff — MLOps Engineers — to implement and maintain AI/ML infrastructure. From the perspective of the AI/ML staff, these individuals are a godsend, as they facilitate the core production AI/ML work. Unsurprisingly, MLOps Engineers (more commonly called DevOps Engineers) are among the highest-paid individuals on AI/ML teams. However, in these scenarios where different roles handle the infrastructure work and the core AI/ML work, I’ve also observed some negative aspects: significantly more red tape; rigidity around interoperability with existing infrastructure, even when that infrastructure is dated, slow, and irrelevant to the production AI/ML work at hand; and longer time-to-value curves, despite the dedicated staff, which I find paradoxical.

MLOps tools are a fragmented mess

Each time I open LinkedIn or a blog, I find a new AI landscape image that organizes an assortment of tools in a plausible but unique manner, like this one from segments.ai and this one on generative AI. From data annotation and model training to model compression and closed-loop deployment, it’s clear that there are many ways to accomplish the diverse tasks involved in production AI/ML work.

Yet, the marketing message — MLOps tools are a fragmented mess — has such a negative connotation. It may be fragmented. It may be a mess. But it’s like the mess you’d find in Monet’s studio: a beautiful, flexible mess. From the perspective of the AI/ML staff, the mosaic of options is a necessary component of the development process.

How can this be? The current state of practice in AI/ML work requires adaptivity, which is uncommon in classical computational fields. There are myriad tools that capture the work across the many instances of the AI/ML lifecycle. The idea that any one tool could sufficiently capture this dynamic work is unrealistic. Take, for example, an experiment tracking tool like W&B or MLflow; some form of experiment tracking is necessary in typical model training lifecycles. Such a tool requires some notion of a dataset. However, a tool focused on experiment tracking is orthogonal to the needs of analyzing model performance at the individual data sample level, which is critical to understanding the failure modes of models. How one does this depends on the type of data and the AI/ML task at hand. In other words, MLOps is inherently an intricate mosaic, one that shifts as the capabilities and best practices of AI/ML work evolve.
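To make that orthogonality concrete, here is a minimal sketch, with a hypothetical experiment name, hyperparameters, and metric values, contrasting the run-level records an experiment tracker like MLflow is built to store with the per-sample records that failure-mode analysis actually requires:

```python
import mlflow

# Run-level view: what an experiment tracker is designed to capture.
mlflow.set_experiment("detector-baseline")  # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-4)  # hypothetical hyperparameters
    mlflow.log_param("backbone", "resnet50")
    mlflow.log_metric("val_mAP", 0.612)      # one aggregate number per run

# Sample-level view: what failure-mode analysis needs. The aggregate metric
# hides *which* inputs fail; answering that requires per-sample records like
# these (illustrative values), which fall outside the tracker's run-oriented
# data model.
per_sample = [
    {"sample_id": "img_00017", "iou": 0.21, "confidence": 0.93},
    {"sample_id": "img_00018", "iou": 0.88, "confidence": 0.95},
]
failures = [s for s in per_sample if s["iou"] < 0.5]
print(f"{len(failures)} of {len(per_sample)} samples look like failure modes")
```

The point is not that the tracker is deficient; it answers run-level questions well, while the sample-level questions call for a different kind of tool.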

A related point comes up often in discussions about supporting this space of tools: thinking about deployment and efficiency too early in the lifecycle of a project is a terrible idea, yet I see it happen all the time. Even in the early stages of project planning — long before a plausible model structure or data taxonomy is specified — MLOps individuals are often thinking and asking about compute needs for production. This is a waste of effort. Instead, flexibility and adaptability, which you can best achieve through a minimally structured operating environment, tend to lead to faster development cycles and better system performance. When technical performance reaches a suitable level, then it is time to consider infrastructure optimization.

No vendor has built an end-to-end solution

Damn straight. Oh, wait, some vendors have claimed to build an end-to-end solution. But, meh, that’s marketing talk. Take, for example, a well-known platform like Amazon SageMaker, which describes itself as “a fully managed service that brings together a broad set of tools to enable high-performance, low-cost machine learning (ML) for any use case.” It’s a great platform. My startup has even partnered with them. Lots of features. However, it’s not the only tool you’ll need for a complete AI/ML stack. Its analysis tools are pretty limited unless the questions you want to ask happen to be the ones it supports. It has little to no dataset management capability. It supports certain AI/ML tasks, but cutting-edge innovation requires more control. Furthermore, there is limited to no flexibility in evolving a model from training and initial deployment through the shifting requirements that follow. Generally, most people I talk to integrate it into a broader stack, which seems reasonable.
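As a small illustration of that integration pattern, here is a sketch, assuming a hypothetical training job name, of letting SageMaker handle training and then pulling the resulting model artifact out for analysis in whatever other tools the rest of the stack uses:

```python
import boto3

# Look up a completed SageMaker training job (the job name is hypothetical).
sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-detector-training-job")

# The trained model artifact lives in S3; hand its URI off to the dataset
# management and evaluation tools that sit alongside SageMaker in the stack.
artifact_uri = job["ModelArtifacts"]["S3ModelArtifacts"]
print(f"Model artifact ready for downstream analysis: {artifact_uri}")
```

Nothing here is SageMaker-specific wisdom; it is simply the seam where one tool’s output becomes another tool’s input, which is where most real stacks live.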

I don’t mean to pick on just one platform, as it is a more fundamental challenge. In my opinion and experience, **the very notion of an end-to-end AI/ML solution is fiction**. An end-to-end AI/ML solution does not exist, at least not now. If it did exist, it stopped being useful a few minutes ago. If it will exist at some future time, then it will only be until the next innovation. Catch my drift? AI/ML in both research and production is still pretty much greenfield.

This greenfield nature of AI/ML suggests that flexibility is intrinsically critical to production AI/ML work, both in the selection of an AI/ML software stack and in the ability to customize that software to meet current and future needs. This ability is rare in the space of available software tools, especially due to the services-oriented nature of contemporary software ecosystems.

Not only is there no single best software stack for building a given AI/ML capability, but the research and development process in AI/ML is still quite individualized, shaped by the training and preferences of the individuals along with the specific AI/ML work they are doing. This touches numerous layers, from expectations on where to spend work-time (e.g., training more models versus getting better data) to individual preferences over specific frameworks and tools for various elements of the production AI/ML development lifecycle. This is one reason why vendor lock-in is particularly harmful in AI/ML work.

Closing

Despite the hype around the maturity of the AI/ML space, it remains a challenge to navigate.

No vendor has truly delivered on the promise of an end-to-end solution that seamlessly integrates all aspects of the AI/ML lifecycle. It may be due to the rapid pace of innovation and the inherently individualized nature of AI/ML research and development. Or it may be due to the inherent complexity of the space: an end-to-end solution may simply not exist, and given the conditions, it may not even be ideal.

Data scientists and ML engineers are left to navigate a complex web of tools, often cobbling together bespoke solutions to fit their unique needs and preferences. This lack of standardization and cohesion leads to silos, inefficiencies, and a constant struggle to keep up with the latest advancements. Yet, it is in that mosaic of a landscape that these individuals seem to thrive. The flexibility and extensibility inherent in the highest-performing tools are a blessing, not a curse.

While dedicated MLOps teams can alleviate some of the burden, they are not a panacea. Organizational red tape, security concerns, and the need for customization can still hinder progress and slow time-to-value. Moreover, the high salaries commanded by MLOps engineers underscore the scarcity of talent and the challenges of building and maintaining robust AI/ML infrastructure.

I am a believer in open standards and extensible ecosystems. But until we see true standards and openness in the AI/ML tooling landscape, teams will continue to struggle with the complexities of tooling and infrastructure, diverting precious resources from the core mission of building game-changing AI/ML applications. The path forward is unclear, but one thing is certain: the industry must move beyond empty promises and work towards a more cohesive, flexible, user-centric approach to AI/ML tooling. The best we can do is to stay informed, be adaptable, and keep pushing the boundaries of what’s possible in this exciting and rapidly evolving field.

Only then can we unlock the full potential of this technology and architect the next generation of intelligent systems.

Acknowledgements

Thank you to my colleagues Harpreet Sahota, Jacob Marks, Eric Hofesmann, and Michelle Brinich for reading early versions of this essay and providing insightful feedback.

Biography

Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is a recipient of the University of Michigan EECS Outstanding Achievement Award (2018), the Google Faculty Research Award (2015), the Army Research Office Young Investigator Award (2010), the National Science Foundation CAREER Award (2009), the SUNY Buffalo Young Investigator Award (2011), and the Link Foundation Fellowship in Advanced Simulation and Training (2003), and he was a member of the 2009 DARPA Computer Science Study Group. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest, including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, and MAA and a senior member of the IEEE.

Disclaimer

This article is provided for informational purposes only. It is not to be taken as legal or other advice in any way. The views expressed are those of the author only and not his employer or any other institution. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by the content, errors, or omissions, whether such errors or omissions result from accident, negligence, or any other cause.

Copyright 2024 by Jason J. Corso. All Rights Reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at JasonCorso.
