Why do Data Scientists Need DevOps for Machine Learning (MLOps)?


In this article, we explain why you need DevOps for machine learning (aka MLOps), how it differs from regular DevOps for software engineering, and how DevOps for ML can be implemented.

We show that, while every step can be set up using purely open-source or available cloud tools, for many businesses this is too resource-intensive in either time or expertise. In such situations, Dotscience can help.

Why DevOps for ML?

Many people may ask: what is DevOps for ML, and do I need it? The data scientist may say, "Isn't that for the engineers and IT?" The engineers may say, "I know DevOps (the combining of software development and its deployment in production via IT operations), but why is it different for ML versus software engineering?" And managers may ask whether this is something they need to worry about now, or merely something that would be nice to have in the future. The answer to all of these is: if you want to use ML in the real world to create value for your business, with reproducibility and accountability, then yes, you need DevOps for ML.

There is a fundamental reason for this: data science is iterative (Figure 1). At each of the major stages of the process (data preparation, model building, production, and monitoring), issues may be created or discovered that necessitate modifying one of the other stages. The most obvious is that the performance of a model in production degrades and the model has to be retrained. But there are many others. For example, a model that worked well on its training, validation, and test data fails completely in production, meaning the user has to go back and retrain the model or re-prepare the data. Or, data has been prepared but a feature passed to the model turns out to be a cheat variable, leaking information from the labels into the features and causing unrealistically good model performance, perhaps because of an overlooked subtlety in the business process. While these particular scenarios may not happen every time, it is generally the rule rather than the exception that some earlier step of the process will have to be revisited from a later step.

Figure 1

Figure 1: Data science is iterative at all stages

The iterative nature of data science means that, when you are adding AI to your business, you cannot try experiments and build models and then "code it properly later" by handing off to an engineering team. The input data and the models are going to change, business requirements and key personnel will change, and reflecting those changes in a workflow that was handed off as a finished, static, end-to-end piece of code will be a lot of work. What is needed is a process in which each component (the code, datasets, models, metrics, and run environment) is automatically tracked and versioned, so that changes can be made quickly and easily, and accountability and auditability are achieved. The fact that most companies, even ones with data science teams, do not use such a process is leading to large amounts of ad hoc work and technical debt, which is costing companies time and money in wasted opportunity.

So how does DevOps for ML help? In the 1990s, software engineering was siloed and inefficient. Releases took months to ship, and involved many manual steps. Now, thanks to DevOps, and processes like continuous integration and continuous delivery (CI/CD), software can be shipped in seconds because the steps involved are automated. At the present time, ML models are in a similar situation to software in the 1990s: their creation is siloed and inefficient, they take months to ship into production, and require many manual steps. At Dotscience, we believe that the same transformation that has taken place as a result of DevOps for software engineering can be achieved for ML, and our tool helps to lower the barriers to this transformation for businesses who want to get more value from AI.

The difference between DevOps and DevOps for ML

DevOps for ML, also known as MLOps, is different from the original DevOps because the data science and machine learning process is intrinsically complex in ways different from software engineering, and contains elements that software DevOps does not. While software engineering is by no means easy or simple either, data science and ML require the user to track several conceptually new parts of their activity that are fundamental to the workflow. These include data provenance, datasets, models, model parameters & hyperparameters, metrics, and the outputs of models in production. This is in addition to code versioning, compute environment, CI/CD, and general production monitoring. Table 1 summarizes this.

Table 1

Table 1: Extra requirements of DevOps for ML versus DevOps for software

Each of these points can be expanded to show more detail about what is needed. In the final section below, we will suggest various tools that can help.

Common to DevOps and DevOps for ML

  • Code versioning: This is well known; it is foundational for a software project to know which version of its code is being run.
  • Compute environment: Similarly, a compute environment (hardware, OS, libraries and their versions) on which the code is known to work is required for a coherent project. It may not be recorded exactly throughout, but when reproducibility is needed, as good a recording as possible is ideal.
  • Continuous integration & delivery (CI/CD): CI/CD tools enable software to be deployed, for example, as microservices in containers via a container orchestration tool. This enables much greater flexibility and speed versus monolithic software deployments.
  • Monitoring in production: The performance of software in production, for example, the request rates, speed and uptime for users, is commonly monitored to ensure that it is satisfactory.
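As a minimal sketch of the compute-environment point above, the key properties of an environment can be snapshotted from within Python and stored alongside a run. The fields recorded here are illustrative assumptions, not a standard; a real setup would also capture library versions (e.g., from a lock file or container image digest):

```python
# Sketch: capturing a snapshot of the compute environment for reproducibility.
# Field names are illustrative; extend with library versions as needed.
import json
import platform
import sys


def environment_snapshot():
    """Return a dict describing the current compute environment."""
    return {
        "python_version": sys.version.split()[0],
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    # Store the snapshot alongside the run so environments can be compared later.
    print(json.dumps(environment_snapshot(), indent=2))
```

Recording even this much makes it possible to answer "what did this run on?" months later, which is often the first question when a result cannot be reproduced.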

Extra items for DevOps for ML

While it is an oversimplification to say that the above is all that DevOps for software ever entails, the additional items below are required if production ML is to be done without ad hoc manual steps that build technical debt.

  • Code versioning for notebooks: Since data scientists perform analyses with many discrete steps, a notebook interface is strongly preferred to flat code files for day-to-day work. These notebooks have to be synced with previous versions, or with those of others on the same project, which is more difficult than syncing plain code files.
  • Data Provenance: For a data flow to be reproducible and accountable, we need to know where the data came from. Ideally this means knowing the source, but at a minimum the data at the starting point of the analysis needs to be recorded so that it forms a fixed point from which a workflow can follow. An example would be a dataset file and its MD5 sum. Then if this file changes the user will know.
  • Datasets: A typical data science workflow can involve many steps of data formatting, cleaning, preparation, feature selection, feature engineering, preparation of training/validation/testing and production data, pre-processing for a deployed model, and post-processing. This can generate a large number of datasets and versions of datasets, all of which need to be tracked correctly with versions and dependencies in the workflow.
  • Models: As with datasets, in a typical real data science project, tens or hundreds of machine learning models are created, each one representing a particular ML algorithm, given dependencies on libraries, etc., and particular input and output data.
  • Parameters & hyperparameters: Each model, furthermore, has its own set of parameters (values learned during training) and hyperparameters (values specified outside of training), each containing potentially many values of various data types. These must be recorded in full so that the model produced is correctly described.
  • Metrics: Models are almost never 100% accurate, so metrics of their performance are monitored during both training and production. These metrics may map directly to business value, so it is crucial to know their values. An example is minimizing missed instances and false positives in fraud detection. When many models are built, metrics are often the method of choosing which model to put into production, so which model produced which metric and of what value, is also vital.
  • Models in production: Similar to tracking metrics, the input and outputs for a model that is in production need to be tracked to ensure that the model is performing correctly. This is because machine learning models are highly complex, nonlinear, and adaptable in some ways, but brittle in others. For example, if the input data was corrupted, or had a format change in its columns, the model could rapidly go from outputting accurate predictions to nonsensical output, with the obvious implications for anything relying upon it like a customer website. This is in addition to the usual expected changes like new trends in input data, seasonality, and so on. Common items to monitor include the distributions of features in the input data, the distributions of model outputs like predictions, and model performance versus later-updated ground truth (e.g., did the customer click on an item recommended by the model).
  • Mistakes: In a large project, most of what is run by the user will be wrong. Either things are still being debugged, or it is not the final version, or different approaches are being experimented with. The upshot is that if everything that has been done is recorded, most of it will be superfluous, and clutter up the project. This means that some mechanism is needed for knowing which of the analyses in the project is the correct one. The most obvious is to allow deletion of incorrect items, but this must not allow deletion of anything relevant to the correct analysis. One way to do this is to warn the user that deleting a given item will result in the deletion of child items (e.g., datasets derived from the dataset under consideration), and to disallow deletion of any part of a dataflow that includes a model that has been deployed in production at any time.
  • Track workflows not just items: Because of the interrelations between code, datasets, models, and so on, it makes sense to track not just the items and their relations, but to do so as given workflow instantiations each time. Known in Dotscience as “runs”, these are themselves tracked and versioned in the same way as the objects within them (notebooks, datasets, models, etc.).
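Several of the items above (data provenance via a checksum, hyperparameters, and metrics) can be tied together in a single record per run. The following is a hedged sketch assuming a simple local JSON record rather than any particular platform's API; all function and field names are illustrative:

```python
# Minimal sketch of a self-contained "run" record, assuming local JSON storage.
# Function and field names are illustrative, not a real platform API.
import hashlib
import json
import time


def md5_of_file(path, chunk_size=8192):
    """Hash a dataset file incrementally, so later changes to it are detectable."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record_run(dataset_path, hyperparameters, metrics):
    """Bundle provenance, hyperparameters, and metrics into one record."""
    return {
        "timestamp": time.time(),
        "dataset": {"path": dataset_path, "md5": md5_of_file(dataset_path)},
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }


if __name__ == "__main__":
    with open("train.csv", "w") as f:  # toy dataset just for this example
        f.write("x,y\n1,2\n")
    run = record_run("train.csv", {"learning_rate": 0.01}, {"accuracy": 0.93})
    print(json.dumps(run, indent=2))
```

Even this toy record answers the key audit questions (which data, which settings, which result), though a real system would also version the record itself and link it into a workflow graph.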

Tracking all of the above correctly will enable a production system to have both reproducibility [1] and accountability. Accountability is needed for both legal reasons (audits, compliance, etc.) and ethical reasons (explaining why the model made a given decision).

[1] Note that there is a subtlety regarding reproducibility, even if everything above is tracked. In some situations, results are not reproducible in the sense that repeating a run produces output identical to a previous run. This is because, for example, for large datasets in a distributed system, the line ordering of a file as it is handled is not necessarily deterministic, meaning the results the model obtains may alter. It may then be appropriate to distinguish between identical reproducibility and statistical reproducibility, the latter meaning that the results are not identical but also not significantly different, where significance is problem-dependent. Of course, tracking everything as described in this article will help with this, and such subtleties do not lessen the importance of the rest of DevOps for ML.
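The distinction between identical and statistical reproducibility can be made concrete with a small comparison helper. This is a sketch only; the tolerance of 0.01 is purely illustrative, since significance is problem-dependent:

```python
# Sketch: classifying how well a rerun reproduces an earlier result.
# The tolerance is illustrative; choose it per problem and per metric.
def reproducibility(metric_a, metric_b, tolerance=0.01):
    """Compare a metric from two runs of the same workflow."""
    if metric_a == metric_b:
        return "identical"
    if abs(metric_a - metric_b) <= tolerance:
        return "statistical"  # not identical, but not significantly different
    return "not reproduced"


# e.g. two training runs of the same model on a distributed system
print(reproducibility(0.912, 0.912))  # identical
print(reproducibility(0.912, 0.909))  # statistical
```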

How can DevOps for ML be implemented?

So let’s say we are convinced of the need for DevOps for ML, and would like to implement it. How can this be done?

Firstly, you need to decide whether to implement it yourself (either coded in-house or using open-source tools), buy a product to help, or outsource to consulting expertise.

If you decide to implement yourself, then the following describes a possible setup. Figure 2 summarizes some of the essential concepts.

Figure 2

Figure 2: Essentials for DevOps for ML, with example tools

  • Compute environment - Docker: Docker containers have become the de facto method of specifying a reproducible compute environment, and work as well with ML as they do with other software. This is because they allow software to “run anywhere” with a fixed environment, which makes continuous integration and delivery easier to implement, for example, with models deployed as microservices.
  • Code versioning (but not data versioning) - Git: Likewise, Git is the standard for code versioning. However, it is not designed to work with files larger than 100MB so it cannot be used to version most real datasets, which may run to petascale, or 10 million times this limit. It is also more difficult to version the notebooks (such as Jupyter) typically used by data scientists, especially if those from asynchronous collaborators are to be merged.
  • Dataset versioning - ZFS: The ZFS filesystem is able to handle petascale data and billions of files, and avoids the need to copy entire datasets each time an analysis is run. This makes it much easier to version each instance of a dataset while avoiding the proliferation of duplicated data from the many versions of a dataset produced by a typical analysis.
  • Data provenance - Manual, or data science platform: In principle, all provenance, data versions, and metadata could be tracked manually, but in practice such a process is incomplete and error prone, even for well-intentioned human users. Use of a data science platform to track this automatically is recommended, ideally with some kind of workflow representation of the project such as a directed acyclic graph (DAG). The question then becomes whether the chosen platform enables all the other tracking, versioning, and steps for DevOps for ML recommended here.
  • Models, hyperparameters, metrics, and workflows: A similar principle applies for models and their metadata, including hyperparameters and metrics, and for tracking workflows, including given executions of them.
  • Continuous integration and delivery (CI/CD) - CI/CD tools & Kubernetes: A system can be set up where calling a deploy function for a model triggers a CI job (e.g., CircleCI) to pull the relevant files from some endpoint, build a container image for a model, and push it to a Docker registry. A CD tool can then deploy it, for example, as a microservice on Kubernetes, which is the de facto container orchestrator.
  • Monitoring in production - Prometheus and Grafana: Typically, deployed models receive data in real time, or operate in some other dynamic situation where new data arrives in sequence. Requests to and responses from a deployed model can be captured and passed to a time series database such as Prometheus. The general querying supported by such a database (in the case of Prometheus, the PromQL query language) means that the full information needed to monitor a deployed ML model can be derived, including distributions of input and output values rather than just simple criteria like thresholds. This information can then be visualized using a monitoring dashboard tool such as Grafana.
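As one concrete example of distribution monitoring, a common statistic for comparing a feature's distribution in production against its training baseline is the population stability index (PSI). The following is a pure-Python sketch independent of any particular monitoring stack; in practice the binned counts would come from a time series database such as Prometheus, and the 0.2 alert threshold is a rule of thumb, not a universal constant:

```python
# Sketch: detecting input drift by comparing binned feature distributions.
# PSI = sum over bins of (actual - expected) * ln(actual / expected).
import math


def psi(expected_fractions, actual_fractions, eps=1e-6):
    """Population stability index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected_fractions, actual_fractions):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


if __name__ == "__main__":
    training_bins = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
    live_bins = [0.40, 0.30, 0.20, 0.10]      # distribution seen in production
    score = psi(training_bins, live_bins)     # ≈ 0.228 for these bins
    if score > 0.2:                           # 0.2 is a common rule of thumb
        print(f"ALERT: input drift detected (PSI={score:.3f})")
```

A statistic like this, computed per feature and per model output, is exactly the kind of derived quantity that can be plotted and alerted on in a dashboard tool such as Grafana.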

As you can see, setting up your own production system is definitely possible, although it might require considerable resources to implement for an enterprise. Alternatively, if you do not have the time, expertise, or resources to build all this yourself, an option is to buy a product to help, or outsource to consultants.

Within the product/consulting alternatives, Dotscience is designed to help enable DevOps for ML / MLOps. We can supply both the tools for doing your data science with DevOps for ML embedded from the start, and also provide consulting for particular engagements. Our framework is general and does not require the user to use particular tools or programming languages, but all of the example tools mentioned in this section are integrated.


Conclusion

We have discussed the following topics:

  • Why you need DevOps for ML (aka. MLOps)
  • The difference between regular DevOps for software engineering and DevOps for ML
  • How DevOps for ML can be implemented

The conclusion is that DevOps for ML is most certainly needed for any real-world data science project that is going to drive business value in production. Regular software engineering DevOps tools cannot be used because several intrinsically new concepts have to be tracked in DevOps for ML. While it is possible to implement it oneself using open source and/or available cloud tools, many businesses will lack the time or expertise to do so on their own. Products such as Dotscience can help such companies bridge the gap and derive greater value from their data via AI and machine learning.

DevOps for ML is also valuable beyond just business problems. For example, in science, academic reproducibility is a big issue in many fields, and a framework such as the one described here could help significantly in improving the situation. Since scientists generally do not have the time or expertise of engineers to set up their own systems like this either, use of a product would make a lot of sense in many projects, even when AI and machine learning are not involved.

Try it out!

You can try out Dotscience for free right now, or for more details about the product, head over to our product page.

Written by:

Dr. Nick Ball, Principal Data Scientist (Product) at Dotscience