Case Study: Developing a practical MLOps approach to training and optimizing deep learning models
Creating features, training a model and selecting the best one are only the beginning of MLOps. In the real world, we also need to tackle version control and the connections between data, models and source code. Having a reliable way to share code and models between data scientists, together with a set of model development best practices, eliminates problems with inference and allows companies to safely use any model version of their choice without risking a production outage.
The Seznam.cz portal is visited by almost four million real users daily, and the services that fall under the Czech Internet leader reach up to 95% of the Internet population in the Czech Republic. This makes Seznam.cz the most visited Czech website on the Internet. In 2021, the company had revenues of nearly EUR 240 million, up 13.3% from the previous year. In addition to online media, Seznam.cz's portfolio also includes full-text search, its own browser, an email service, maps, an advertising system and others.
The main goal of this project was to demonstrate how MLOps powered by DataSentics can overcome the challenges of training and maintaining dozens of machine learning models used by multiple teams in the company. The solution needed to include a reliable way to manage the coupling between data, code and models, because retraining the models is costly and time-consuming.
To ensure proper knowledge transfer, DataSentics and Seznam.cz created a joint team (3 DataSentics + 2 Seznam.cz) which cooperated closely on a daily basis to build the solution.
- Maintaining consistency between training and inference pipeline and model. Each model requires different features and pre-processing steps. Keeping track of which source code creates which model and the required steps during inference for that model is crucial.
- Enabling reusable features. In the model development cycle, most users spend a lot of time on feature definitions. Once a feature is developed, other data scientists should be able to use it in their own models to eliminate repetitive work.
- Implementing continuous integration. The code and model must be tested before deploying into production. Automating the entire process avoids unnecessary and repeated workload, but it usually requires multiple environments and development tools that must be connected for fluent workflow.
The first step was to create an ETL pipeline for pre-processing queries and video URLs. It employs the so-called “medallion” architecture, a pattern often used in a Lakehouse for cleaner data management, in which data are refined step by step through bronze, silver and gold layers. In this solution, the gold layer is used as a feature layer. Features are registered in the Databricks feature store, which ensures consistency and reusability across multiple projects.
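As a rough illustration of this layering, the sketch below models the silver and gold steps as plain Python functions. The record fields (`query`, `video_url`), the cleaning rules and the derived features (`query_length`, `url_domain`) are assumptions made for the example, not the project's actual schema. In the real solution the gold output would be registered in the Databricks feature store rather than kept in a list.

```python
from urllib.parse import urlparse


def to_silver(bronze_rows):
    """Clean raw (bronze) rows: drop incomplete records and normalize text."""
    silver = []
    for row in bronze_rows:
        query = (row.get("query") or "").strip().lower()
        url = (row.get("video_url") or "").strip()
        if not query or not url:
            continue  # skip rows that cannot be used downstream
        silver.append({"query": query, "video_url": url})
    return silver


def to_gold(silver_rows):
    """Derive model-ready features (gold layer) keyed by query and URL."""
    gold = []
    for row in silver_rows:
        gold.append({
            "query": row["query"],
            "video_url": row["video_url"],
            # Hypothetical features, chosen only for illustration.
            "query_length": len(row["query"].split()),
            "url_domain": urlparse(row["video_url"]).netloc,
        })
    return gold


# Hypothetical raw input: one usable row and two incomplete ones.
bronze = [
    {"query": "  Funny Cats ", "video_url": "https://example.com/v/1"},
    {"query": "", "video_url": "https://example.com/v/2"},
    {"query": "news", "video_url": None},
]
features = to_gold(to_silver(bronze))
```

Keeping each layer as a separate step means a feature defined once in the gold layer can be reused by other models without repeating the cleaning logic.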
Once data are ready in the feature store, the training pipeline loads them and uses them to fine-tune the ELECTRA model for relevance scoring. In our MLOps solution, we chose to support two model training approaches. The first is a manual, interactive mode, which is ideal for initial model development because data scientists can track metrics in real time during training and search iteratively for the best set of hyperparameters. The second approach is automated: a CI pipeline triggers the training pipeline, logs the model and checks its validity. This automatically keeps source code, model and pipelines consistent, and it saves costs because model training in the testing environment runs on job clusters.
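The text does not say what the validity check in the CI pipeline covers. One plausible shape, shown here purely as an assumption, is a gate that verifies a logged metric and smoke-tests the model on a few sample inputs. The function name, the `val_score` metric, the threshold and the expected score range of 0 to 1 are all hypothetical.

```python
import math


def check_model_validity(predict, sample_inputs, metrics,
                         metric_name="val_score", min_metric=0.5):
    """Return a list of problems; an empty list means the model passes."""
    problems = []

    # 1. The logged metric must exist, be finite and reach the threshold.
    value = metrics.get(metric_name)
    if value is None or not math.isfinite(value):
        problems.append(f"missing or non-finite metric: {metric_name}")
    elif value < min_metric:
        problems.append(f"{metric_name}={value:.3f} is below {min_metric}")

    # 2. Smoke test: every sample must yield a float score in [0, 1].
    for item in sample_inputs:
        score = predict(item)
        if not isinstance(score, float) or not 0.0 <= score <= 1.0:
            problems.append(f"invalid score {score!r} for input {item!r}")

    return problems


# Stand-in for a trained relevance model (hypothetical).
def dummy_predict(pair):
    query, url = pair
    return 0.9 if query in url else 0.1


samples = [("cats", "https://example.com/cats"), ("news", "https://example.com/v/2")]
ok_report = check_model_validity(dummy_predict, samples, {"val_score": 0.8})
bad_report = check_model_validity(lambda pair: 1.5, samples, {"val_score": float("nan")})
```

A CI job could fail the build whenever the returned list is non-empty, so that an invalid model is never logged as a release candidate.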
Regardless of the chosen training approach, once a data scientist is satisfied with a specific model version, the model is registered in the MLflow model registry and its version ID is logged in the corresponding Git source branch. From there, the integration tests for the training and inference pipelines are run in the Databricks testing workspace, and the new model version is compared to the current production model. If it performs better, it is deployed along with its feature branch.
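The promotion logic described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `ModelCandidate` type, the model and branch names, the single "higher is better" metric and the JSON pin format are all assumptions. In practice the version ID would come from the MLflow model registry and the pin would be committed to the Git branch.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelCandidate:
    name: str
    version: int
    metric: float  # assumption: a single score where higher is better


def pin_model_version(candidate: ModelCandidate, git_branch: str) -> str:
    """Serialize the model version together with the branch that produced it."""
    return json.dumps(
        {
            "model_name": candidate.name,
            "model_version": candidate.version,
            "git_branch": git_branch,
        },
        sort_keys=True,
    )


def should_promote(challenger: ModelCandidate,
                   production: Optional[ModelCandidate],
                   min_gain: float = 0.0) -> bool:
    """Promote only if the challenger beats production by more than min_gain."""
    if production is None:
        return True  # nothing deployed yet, so any valid model may go first
    return challenger.metric - production.metric > min_gain


# Hypothetical values for illustration only.
production = ModelCandidate("relevance-scorer", 3, 0.81)
challenger = ModelCandidate("relevance-scorer", 4, 0.84)
pin = pin_model_version(challenger, "feature/new-tokenizer")
decision = should_promote(challenger, production)
```

Storing the pin next to the source code is what ties a model version to the code that produced it, so switching back to an older model also identifies the matching training and inference pipelines.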
- Secure model switching and reproducibility, because each model version is connected to its source code branch
- Faster development and testing of new models thanks to access to a shared feature store
- Greater flexibility, shareability and a collaborative development environment through the use of cloud infrastructure
- Reduced costs and manual work through automated model training during the experimentation phase
- User control over the final decision about which model version should challenge the current production model
- An easily reusable structure for most future ML projects that can be extended with new components
In the future, we plan to add more features to fully leverage other data sources and the feature store, and to try different model architectures, which can then be easily compared and evaluated through MLflow. Since only the development and test environments have been used so far, we will create a full production environment for inference and adjust the test environment accordingly. Finally, we are considering implementing better data historization.
The joint project took the search engine ML Ops platform a big step further. The know-how we have gathered allowed us to clearly structure the responsibilities for the different parts of the ML model development process. Thanks to the newly outlined boundaries, the ML Ops infrastructure in Seznam.cz search is becoming a platform that would be used by a number of other teams besides the search ranking team. Close teamwork with representation from both companies led to a smooth transfer of know-how and immediate benefits from the project.
Martin Kirschner | Product Team Manager of Web search