How we understand MLOps at DataSentics (MLOps Part 2)
Data scientists who version, deploy, operate, and monitor models — in other words, who practise MLOps — know it is a complex and laborious process. As the technology has developed, the machine learning lifecycle has evolved with it: the application's logic is no longer captured in code written by a software developer but reproduced by an ML model trained by a data scientist. And because the same problems occur repeatedly, engineers working with AI-powered products have built an array of frameworks and tools to support each new product based on machine learning.
In the first part of the series, we went over the problems (and their causes) that companies face when authoring a data science (machine learning) solution and productionalising it. This "ability to productionalise" is absolutely crucial — without it, even the greatest data science solution will end up in some dark vault (recycle bin) and will never affect anything. When we talk about this ability, we often use the term "MLOps", and in this article, we will unravel how we think about MLOps at DataSentics.
What is MLOps
First, let us reiterate what exactly we mean by MLOps (or AIOps), as this term is slowly becoming a new buzzword. At DataSentics, we say:
MLOps is a set of practices for managing and streamlining machine learning (ML) models' lifecycle — from development all the way to production.
MLOps is not a platform, it is not a tool, it is not a single process — it is an entire system or culture of how to productionalise machine learning models/data science solutions. And the better this system is, the faster and more robust the productionalisation of the machine learning solutions is.
Why DevOps != MLOps?
People who are familiar with software development are probably asking: why yet another Ops? Can we not just go with the existing, already mature DevOps methodology, with all its tools and processes? Why MLOps? The problem is that ML-powered application development != software development. Current DevOps practices work (very well) for standard software development. However, developing ML-powered applications is a different beast. You may find Andrej Karpathy's article interesting: it popularised the terms "Software 1.0" for standard software applications and "Software 2.0" for machine learning applications, differentiating the two approaches. Now let's go over what the two actually represent:
Software 1.0 vs Software 2.0
In software 1.0, the application's logic is captured in the code written by a software developer. When data comes in, and the logic/code is applied, we get the desired outcome. In software 2.0, the application's logic is captured by the machine learning model trained by a data scientist on top of real data. The word "training" is important — it means we use statistical methods (whether it is a neural network or linear regression) to which we feed data, and the statistical methods output a "model" which encapsulates the logic we desire. Basically, the statistical method is writing the code for us :) Oh, we used the word "statistics" a lot, but let's just say that "statistics = machine learning" — we admit that machine learning has a nicer ring to it.
Of course, as time goes by, the underlying data or the behaviour desired from a system may change; therefore, the logic of a model trained some time ago won't reflect the current world and will deteriorate. Similarly to software 1.0, where software developers have to rewrite the code to fix the logic, in software 2.0 we have to retrain the model on the latest data, which will (hopefully) get the logic back on track.
Looking at it this way, MLOps can be seen as DevOps for Software 2.0.
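The contrast between the two can be sketched in a few lines of Python. This is a toy example of our own (not from Karpathy's article): in Software 1.0 a developer writes a pricing rule by hand, while in Software 2.0 the same rule is recovered by fitting a model to data — the statistics "writes the code" for us.

```python
# Software 1.0: the logic is written by hand.
def price_v1(size_m2: float) -> float:
    # A developer encodes the business rule explicitly.
    return 500.0 + 30.0 * size_m2

# Software 2.0: the logic is learned from data.
def train_price_model(sizes, prices):
    """Least-squares fit of price = a + b * size; the fit 'writes the code'."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
        sum((x - mean_x) ** 2 for x in sizes)
    a = mean_y - b * mean_x
    # The returned closure plays the role of the trained "model".
    return lambda size_m2: a + b * size_m2

# Train on observed data; the fitted model recovers the same rule.
sizes = [20.0, 40.0, 60.0, 80.0]
prices = [price_v1(s) for s in sizes]  # pretend these came from the real world
price_v2 = train_price_model(sizes, prices)
print(round(price_v2(50.0), 2))  # → 2000.0, same answer as price_v1(50.0)
```

When the world changes (prices shift), nobody has to rewrite `price_v1` by hand — we just rerun `train_price_model` on fresh data, which is exactly the retraining loop described above.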
Components of the machine learning model lifecycle
Now, let's dive deeper into how the models are developed, turned into an application and kept up to date — we call this the machine learning model lifecycle.
Machine learning model lifecycle components and processes
As can be seen from the picture above, the ML model lifecycle can be very colourful. We can break it down into several steps/components:
- Business comes up with a problem to solve
- Data scientists think about how to solve this problem using data and machine learning
- Then they go out and try to find the right data
- They gather the right data
- They experiment and design new features for the model
- They train and fine-tune the model
- Once they have the model validated and are happy with the result, they register it
- Then the process of deploying the model (and the model features!) to production is initiated. The goal is to make the model available to other business processes, which can then start leveraging it. The deployment strategy can vary depending on the type of deployment (as part of a batch prediction pipeline, as a standalone API, baking the model into an existing application, etc.). This is usually where the machine learning engineers take over.
- When the model runs in production, we need to monitor its performance — which includes the health of the service running the model, the statistical quality of the predictions, and also the statistical quality of the input data.
- When the model starts to deteriorate, it should be revisited: for instance, it can be automatically retrained, or data scientists should take a look and perhaps replace it altogether with a completely new model.
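The last two steps — monitoring the statistical quality of the input data and reacting to deterioration — can be sketched as a minimal loop. All names here are illustrative, not part of any real framework; real monitoring would use richer drift statistics (e.g. PSI or KS tests) rather than a simple mean shift.

```python
import statistics

def drift_score(training_data, live_data):
    """How far the live input mean has drifted, in training standard deviations."""
    mu = statistics.mean(training_data)
    sigma = statistics.stdev(training_data)
    return abs(statistics.mean(live_data) - mu) / sigma

def monitor_and_maybe_retrain(training_data, live_data, retrain, threshold=2.0):
    """If the inputs have drifted too far, retrain the model on the latest data."""
    if drift_score(training_data, live_data) > threshold:
        return retrain(live_data)  # the world changed: refresh the logic
    return None  # model still matches the data it was trained on

train = [10.0, 11.0, 9.0, 10.5, 9.5]
stable = [10.2, 9.8, 10.1]     # live inputs that look like the training data
shifted = [25.0, 26.0, 24.0]   # live inputs after the world changed

print(monitor_and_maybe_retrain(train, stable, retrain=lambda d: "new model"))   # → None
print(monitor_and_maybe_retrain(train, shifted, retrain=lambda d: "new model"))  # → new model
```

In practice the `retrain` callback would kick off the same training pipeline that produced the original model, closing the lifecycle loop.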
This entire ML lifecycle should be supported by a strong MLOps system, as it must be ensured that we:
- have a reliable source of data and the ability to turn the data into inputs for the models - easy access to data, means to turn raw data into model input data, versioning of the data, legal aspects, GDPR, …
- have means to do the training efficiently - enough computational power, support for necessary libraries, unified environment, tests, data availability, …
- know how we trained the model - which version of the training code, which version of data/features, which parameters, etc., produced the particular version of the model, …
- have a solid (re)deployment process - turning the model into an application, testing, turning the training code into an application, …
- have solid operations on top of the model application - we know which version of the model is running in production, monitoring the application, monitoring the input/output data, …
- can react to changing world and subsequent deterioration of the model - model performance monitoring and alerting process, retraining process, …
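The "know how we trained the model" requirement above boils down to keeping a lineage record next to every registered model version. A hypothetical sketch of such a record (all field names are illustrative; real registries such as MLflow store similar metadata):

```python
from dataclasses import asdict, dataclass, field
import json

@dataclass(frozen=True)
class ModelLineage:
    """Everything needed to reproduce one particular model version."""
    model_name: str
    model_version: int
    training_code_version: str    # e.g. a git commit hash
    data_version: str             # e.g. a dataset snapshot id
    feature_set: tuple            # which features (in which order) the model expects
    hyperparameters: dict = field(default_factory=dict)

lineage = ModelLineage(
    model_name="churn-model",
    model_version=3,
    training_code_version="a1b2c3d",
    data_version="2024-01-15-snapshot",
    feature_set=("tenure_months", "avg_monthly_spend"),
    hyperparameters={"max_depth": 5},
)

# Stored alongside the model artefact, this answers:
# "which code + data + parameters produced version 3?"
print(json.dumps(asdict(lineage), indent=2))
```

Making the record immutable (`frozen=True`) reflects the idea that lineage of an already-registered model version should never be edited after the fact.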
Multiple people have to talk to each other during this process: business people with the data scientists, data scientists with the data engineers and ML engineers, ML engineers with the platform engineers and ops people, etc. And they don't always understand each other very well. So it is absolutely essential to have suitable interfaces between all of them. And this is what MLOps means to us.
MLOps subtopics at DataSentics
The problem is indeed extensive. Internally, we split the problem further into several subtopics:
- Feature Store (one central place for storing and managing features (data inputs to machine learning models) across the company)
- Experiment tracking (a place to track data scientific experiment runs)
- Model registry (place to store the "ML model bundles" — model artefacts and other necessary files/metadata about the model)
- Model deployment (the process of building, testing, and deploying both the model training application and model serving application)
- Model operations (monitoring/logging/retraining/optimising the model application)
- Model reproducibility & portability & interpretability (explainability)
- Standardisation & reusability (template / API for data scientists to ease the development and deployment)
- ExplainOps (management of model "explainers" and getting the explanation along with prediction)
- MLaaS (Machine learning as a service = utilising specialised ML services for common ML tasks such as Azure Cognitive Services for face recognition or AWS Polly for text-to-speech)
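To make the first subtopic concrete, a feature store can be reduced to its essence in a few lines. This is a toy in-memory sketch of our own; real feature stores (e.g. Feast, Databricks Feature Store) add versioning, point-in-time correctness, and offline/online synchronisation on top of this idea.

```python
class FeatureStore:
    """Toy in-memory feature store: one central place models read their inputs from."""

    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def write(self, entity_id, feature_name, value):
        """Feature pipelines publish computed values here."""
        self._features[(entity_id, feature_name)] = value

    def read(self, entity_id, feature_names):
        """Assemble a feature vector for one entity, in a fixed order."""
        return [self._features[(entity_id, name)] for name in feature_names]

store = FeatureStore()
store.write("customer-42", "tenure_months", 18)
store.write("customer-42", "avg_monthly_spend", 54.0)

# Training pipelines and the serving API read the same features the same way,
# which is what prevents training/serving skew.
print(store.read("customer-42", ["tenure_months", "avg_monthly_spend"]))  # → [18, 54.0]
```

The point of centralising this is that a feature is defined and computed once, then reused consistently across every model in the company.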
Getting all these aspects right ensures a great and seamless model development and productionalisation experience for all people involved.
Our view on "MLOps" is probably a little broader than how others understand it. It is not only about the deployment of the model artefact as an API. It is really important for us to think about the model lifecycle holistically and cover all the steps and components.
This view stems from our experience running AI-powered products (both our own and our clients'). As we saw the same problems and patterns emerging repeatedly, our engineers built an entire array of tools, frameworks, and best-practice standards (called the "AI Suite"), which addresses the complexities mentioned earlier and which we now use to develop every new ML-powered product.