Reduced costs and manual work through automated model training during the experimentation phase
Allows users to make the final decision about which model version should challenge the current production model
Easily reusable structure for most future ML projects, which can be readily extended with new components.
"The joint project took the search engine ML Ops platform a big step further. The know-how we have gathered allowed us to clearly structure the responsibilities for the different parts of the ML model development process. Thanks to the newly outlined boundaries, the ML Ops infrastructure in Seznam.cz search is becoming a platform that would be used by a number of other teams besides the search ranking team. Close teamwork with representation from both companies led to a smooth transfer of know-how and immediate benefits from the project."
About the client
The Seznam.cz portal is visited by almost four million real users daily, and the services that fall under the Czech Internet leader reach up to 95% of the Internet population in the Czech Republic. This makes Seznam.cz the most visited Czech website. In 2021, the company had revenues of nearly EUR 240 million, up 13.3% from the previous year. In addition to online media, Seznam.cz's portfolio also includes full-text search, its own browser, an email service, maps, an advertising system and others.
Creating features, training a model and selecting the best one is only the beginning of MLOps. In the real world, we need to tackle problems with version control and connections between data, models and source code. Having a reliable way to share code and models between data scientists and a set of model development best practices eliminates problems with inference and allows companies to safely use any model version of their choice without risking a production outage.
Maintaining consistency between the training pipeline, the inference pipeline and the model. Each model requires different features and pre-processing steps. It is crucial to keep track of which source code creates which model, and which steps that model requires during inference.
Enabling reusable features. In the model development cycle, most users spend a lot of time on feature definitions. Once a feature is developed, other data scientists should be able to use it in their own models to eliminate repetitive work.
Implementing continuous integration. The code and model must be tested before they are deployed into production. Automating the entire process avoids unnecessary and repeated work, but it usually requires multiple environments and development tools that must be connected for a smooth workflow.
The first step was to create an ETL pipeline for pre-processing queries and video URLs. It employs a so-called “medallion” architecture, a pattern often used in a Lakehouse for cleaner data management. In this solution, the gold layer is used as a feature layer. Features are registered in the Databricks feature store, which ensures consistency and reusability across multiple projects.
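The layering described above can be illustrated with a minimal, standard-library-only sketch. It is not the project's pipeline, which runs on Databricks and registers its features in the feature store. The `clicked` field, the `ctr` feature and the function names are hypothetical and chosen only to show how raw (bronze) records are cleaned into a silver layer and aggregated into gold-layer features.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Clean raw (bronze) rows: normalize text and drop incomplete records."""
    silver = []
    for row in bronze_rows:
        query = (row.get("query") or "").strip().lower()
        url = (row.get("video_url") or "").strip()
        if not query or not url:
            continue  # skip records missing either key field
        silver.append(
            {"query": query, "video_url": url, "clicked": bool(row.get("clicked"))}
        )
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into one feature row per (query, video_url)."""
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for row in silver_rows:
        key = (row["query"], row["video_url"])
        stats[key]["impressions"] += 1
        stats[key]["clicks"] += int(row["clicked"])
    # "ctr" is a hypothetical feature used only for illustration.
    return {
        key: {**s, "ctr": s["clicks"] / s["impressions"]}
        for key, s in stats.items()
    }


bronze = [
    {"query": "  Funny Cats ", "video_url": "https://example.com/v/1", "clicked": True},
    {"query": "funny cats", "video_url": "https://example.com/v/1", "clicked": False},
    {"query": "", "video_url": "https://example.com/v/2", "clicked": True},
]
gold = to_gold(to_silver(bronze))
```

Here the two records for the same query and URL collapse into a single gold row, and the record with an empty query is dropped at the silver stage. In a Lakehouse, each layer would typically be stored as its own table rather than passed around in memory.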
Once the data are ready in the feature store, the training pipeline loads them and uses them to fine-tune the Electra model for relevance scoring. In our MLOps solution, we chose to support two model training approaches. The first is a manual, interactive mode, which is ideal for initial model development because data scientists can track metrics in real time during training and search iteratively for the best set of hyperparameters. The second approach is automated, with a CI pipeline that triggers the training pipeline, logs the model and checks its validity. This automatically keeps the source code, model and pipelines consistent, and it saves costs because model training in the testing environment runs on job clusters.
Regardless of the chosen training approach, once a data scientist is satisfied with a specific model version, the model is registered in the MLflow Model Registry and the corresponding version ID is logged in its Git source branch. From there, the integration tests for the training and inference pipelines are run in the Databricks testing workspace, and the new model version is compared to the current production model. If it performs better, it is deployed along with its feature branch.
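The comparison step can be sketched as a simple promotion gate. This is an illustration under stated assumptions, not the project's implementation: `ModelVersion`, `select_production`, the single `metric` field and the `min_gain` threshold are hypothetical names, and the real workflow reads versions from the MLflow Model Registry rather than from in-memory objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    version_id: str   # ID assigned when the model is registered
    git_branch: str   # source branch that produced this version
    metric: float     # hypothetical validation metric; higher is better


def select_production(
    challenger: ModelVersion,
    production: Optional[ModelVersion],
    min_gain: float = 0.0,
) -> ModelVersion:
    """Return the version that should serve production traffic."""
    if production is None:
        return challenger  # nothing deployed yet
    if challenger.metric > production.metric + min_gain:
        return challenger  # challenger wins; deploy it with its branch
    return production      # keep the current production model


current = ModelVersion("7", "main", 0.812)
candidate = ModelVersion("8", "feature/new-tokenizer", 0.826)
winner = select_production(candidate, current)
print(winner.version_id, winner.git_branch)  # prints "8 feature/new-tokenizer"
```

Keeping the version ID and the Git branch in the same record reflects the idea that a model can only be promoted together with the code that produced it.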
- Secure model switching and reproducibility, because the model version is connected to its source code branch
- Faster development and testing of new models with access to a shared feature store
- Greater flexibility, easier sharing and a collaborative development environment through cloud infrastructure