Back in the mists of time (OK, 2019), as a data scientist viewing Dotscience’s webpages for the first time, some of the terminology was not familiar. I knew about machine learning, neural networks, and all that, but what about runners, ZFS, CI/CD, or Prometheus?
Like any subject, DevOps has its own jargon, which, while useful for practitioners who want to talk about familiar concepts efficiently, can make the subject harder for others to understand. While Dotscience is a platform primarily for machine learning (ML) and data scientists, its strong focus on DevOps for ML necessitates the use of this terminology.
Here, we cover some of the concepts of DevOps when applied to machine learning, and used within Dotscience. For this entry, we don’t focus on data science terminology but rather stick with DevOps. While many readers will be data scientists, you do not need to be a data scientist or machine learning expert to follow along here. Our tour proceeds roughly in the order in which you might encounter the terms when performing an end-to-end data science analysis. The concepts are divided into general concepts and model building, followed by deployment and production.
General concepts and model building
DevOps for ML / MLOps
DevOps for ML, or MLOps, refers to the philosophy and approach of DevOps for software applied to data science and machine learning. It is most prominent in the production phase of the process, when models are being deployed, but in fact it pervades the whole end-to-end data science process through concepts like reproducibility and auditability, which imply Git-like ways of working, automatic versioning, and so on. MLOps is distinguished from regular DevOps because ML needs extra steps, such as tracking datasets, model hyperparameters, and performance, and accommodates preferred ways of working such as Jupyter notebooks rather than developer IDEs. Other related terms for doing things robustly in the enterprise are also used, such as AIOps, DataOps, and ModelOps. For more on the differences between MLOps and regular DevOps, see our blog entry on why you need DevOps for ML.
Hub

The Dotscience Hub (Figure 1) coordinates the various parts of users' work so that Dotscience functions as one coherent platform. The Hub acts as the master storage for data (or pointers to data in another location), metadata, models, and notebooks. There are two things to note about it: (1) the Hub does not execute any user code; this is done on the user's own compute, i.e., their runner, and (2) data does not have to be recopied from the Hub to your runner every time it is used.
You can also have your own hub: while the main one on our website is set up, running, and available to users, companies who do not want to allow anything off their premises (data, code, etc.), or their private cloud, can have their own hub in a location of their choosing, in the cloud or on-premise. From the user point of view, the experience is the same.
Figure 1: Dotscience architecture, showing major components including the Hub in the center
Runner

The idea of a runner is central to Dotscience. Borrowed from DevOps, this term refers to the machine on which you run your computations. The concept is central because Dotscience is bring-your-own compute: you designate a machine as the runner, and the runner then connects to the Hub to ensure the correct versions of your code, datasets, and so on are being run.
The runner can be your own machine, another machine that you have access to, a virtual machine on-premise or on the cloud, a cloud instance, or any other machine running a compatible operating system. This means that suitable hardware, for example GPUs or TPUs for deep learning with large training sets, can be used as needed. Each user can have as many different runners as needed.
Dotmesh and ZFS
ZFS is another core foundation of Dotscience. Our product is built upon Dotmesh, an open source tool that is in turn built upon the ZFS filesystem. The key difference between ZFS and most filesystems is that it is aware of the storage system at both the filesystem level and the physical block storage level. This might sound esoteric, but it means ZFS can keep files (e.g., datasets) synchronized across any two Linux systems by tracking only their changes, without recopying them each time, because ZFS knows which blocks changed on disk. Data flows with large datasets can thus be handled efficiently: ZFS can support petascale datasets and billions of files.
Dotmesh builds upon ZFS by enabling its users to capture, organize and share application states: a snapshot of your run of code can be taken and treated as an entry in a Git repository. This allows version-controlled execution of projects and other DevOps features such as collaboration between users. Dotscience then uses this system to synchronize between the hub and your runners.
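The efficiency gain from tracking changes at the block level can be sketched with a toy example. This is a deliberate simplification in plain Python, not how ZFS itself is implemented (ZFS does this natively on disk), but it shows why only the changed blocks need to cross the wire:

```python
# Toy illustration of block-level change tracking: instead of copying a
# whole file, only the blocks that differ from the last snapshot are sent.
# This is a simplification; ZFS does this natively at the disk level.

BLOCK_SIZE = 4  # bytes per block (tiny, for illustration)

def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def changed_blocks(old: bytes, new: bytes) -> dict:
    """Return only the blocks of `new` that differ from `old`."""
    old_blocks = split_blocks(old)
    delta = {}
    for i, block in enumerate(split_blocks(new)):
        if i >= len(old_blocks) or old_blocks[i] != block:
            delta[i] = block
    return delta

def apply_delta(old: bytes, delta: dict) -> bytes:
    """Reconstruct the new data on the receiving side from old + delta."""
    blocks = split_blocks(old)
    for i, block in delta.items():
        if i < len(blocks):
            blocks[i] = block
        else:
            blocks.append(block)
    return b"".join(blocks)

snapshot1 = b"aaaabbbbccccdddd"
snapshot2 = b"aaaaXXXXccccdddd"  # only the second block changed

delta = changed_blocks(snapshot1, snapshot2)
print(len(delta))                                  # 1 block to transfer, not 4
print(apply_delta(snapshot1, delta) == snapshot2)  # True
```

With a 16-byte "dataset" and one changed block, only a quarter of the data is transferred; at petascale, the same principle is what makes synchronization between hub and runner practical.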
SaaS, notebooks, and CLI
The Dotscience graphical user interface (Figure 2) is SaaS, or software as a service. This means you can try it out on our hub without having to set anything up on your own machine. If you then want to set up projects using code, scripts, etc., you can use either our built-in JupyterLab, or install Dotscience on your own runner and use the Python library or command line interface.
Figure 2: Typical screen in the Dotscience GUI
(Note the underlying architecture does not, in fact, require you to use Jupyter notebooks or even Python. These are just the most commonly used in data science at present.)
Versioning, forking, and pull requests
While most readers here will know what versioning is, it is important to emphasize that in Dotscience everything is versioned. This includes datasets (small and large), models, notebooks, runs, and projects, so all users' work is automatically version-controlled. Combining this with the ability to collaborate results in a way of working similar to teams using Git.
This means, for example, that to work on an existing project that you didn’t create, you first fork it. If a collaborator has worked on your project and made changes, these can be diffed and merged, including with notebooks (Figure 3). When code is run and results published back to Dotscience, this is like doing a commit for a project. Projects themselves can be tracked as being X number of commits ahead or behind another instance of the project. A pull request occurs when, e.g., user B has worked on a project from user A, and is asking for their work to be merged back in.
At first this may seem slightly non-intuitive to data scientists not used to working this way, but ultimately it is far preferable for projects whose aim is to go into production in the enterprise, because it removes the ad hoc nature of creating notebooks, manually recording and sharing results, and the proliferation of untracked versions of files. (Can you remember what that notebook from last week does? Which run produced the best model? What about all the work from the colleague who just left?)
One place to quickly see some examples of these concepts is to try our demo, where to get started you fork one of our existing shared projects, and the results you see will be versioned, putting you a commit ahead of the master.
Usage of Git for software development of course encompasses more concepts than these, but most do not need to be learned to do data science work with us.
Figure 3: Notebooks with differences highlighted
Dotscience Python library
Dotscience advertises itself as being robust, versioned, and able to record the results of your projects and runs. The Dotscience Python library is how the user accesses this functionality and tells the system what to record and what constitutes a run. It is accessed in your notebook or script with import dotscience as ds; the various methods of ds then provide the functionality. Common ones include ds.run(), ds.input(), ds.model(), ds.parameter(), ds.metric(), and ds.publish().
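The pattern these calls follow can be sketched with a minimal stand-in. The Run class below is a hypothetical mock, not the real dotscience library (whose exact signatures may differ); it only illustrates the shape of the annotation pattern, where calls wrap existing values and attach them to a run record that is serialized on publish:

```python
import json

# Minimal stand-in for the run-annotation pattern: inputs, parameters,
# and metrics are attached to a run object, then serialized on publish.
# This mimics the shape of calls like ds.parameter() and ds.metric();
# it is NOT the real dotscience library.
class Run:
    def __init__(self):
        self.record = {"inputs": [], "parameters": {}, "metrics": {}}

    def input(self, path):
        self.record["inputs"].append(path)
        return path  # pass-through, so it can wrap an existing argument

    def parameter(self, name, value):
        self.record["parameters"][name] = value
        return value

    def metric(self, name, value):
        self.record["metrics"][name] = value
        return value

    def publish(self, description):
        self.record["description"] = description
        return json.dumps(self.record)

ds = Run()
data_path = ds.input("train.csv")          # record the dataset used
lr = ds.parameter("learning_rate", 0.01)   # record a hyperparameter
ds.metric("accuracy", 0.94)                # record a model metric
summary = ds.publish("baseline model")     # emit the run record
print(summary)
```

Because each call passes its value through, the annotations can wrap arguments already present in your code, which is what keeps the instrumentation lightweight.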
These functions can be added to existing code without having to otherwise change it. While no one likes having to change their code just to run the same thing in a new tool, we consider it worth it because it keeps the system robust enough for production in the enterprise but generic enough to solve real data science problems.
Runs

The final major term that will come up a lot when doing analyses in Dotscience is the run. Runs are an important part of versioning because the runs themselves are versioned. This means that a given dataflow (datasets, transformations, models, and the relations between everything) is recorded in a reproducible way. The state of the system at the end of a run is like the application state Dotmesh is designed to handle, and Dotmesh and ZFS are used to synchronize it back to the Hub. As mentioned above, this does not require copying all the data each time; only the changes are tracked. Runs are thus the way that data science gets done within Dotscience.
Deployment and Production
Once the data has been prepared, featurized, and the models made, they need to be deployed into production. This is the part of the process where the DevOps component is the most obvious, and so there are some more terms to become familiar with.
Continuous integration and delivery (CI/CD)
CI/CD is what has made the difference in the DevOps software world between products being released every few months back in the 1990s, and as often as needed today, down to minutes or seconds. Similar to software back then, ML is currently in the stage where it is common for deploying a model into production to take months. CI/CD is therefore needed to enable trained ML models to likewise be deployed in minutes and not months. The CI part, continuous integration, is what makes sure the set of items needed for a deployment is the correct one. CD is then the process for doing the deployment to, for example, a Kubernetes cluster (see below). In Dotscience, the CI and CD steps are done together when the user requests a deployment, via Docker and Kubernetes (see below).
Dotscience has its own built-in CI/CD tool, but we have recently integrated with GitLab, which enables more flexibility for advanced users in how the CI/CD pipeline from trained model to deployed model is set up.
Docker containers, images, and registry
Docker has emerged as the most common way of containerizing a compute environment, guaranteeing that something like a data science analysis can be run on different machines and produce the same results, because the versions of all the software used have been fixed. Containers are somewhat like virtual machines but much more lightweight because they use the operating system of the machine they are on rather than supplying their own.
When containers are not running, they are stored as Docker images, and in turn the images are part of a Docker registry. Since each image is versioned, in Dotscience this enables images containing models to form a model registry.
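As a hypothetical sketch (image contents and version numbers are illustrative), a Dockerfile for a model-serving container pins the environment so the same image behaves identically everywhere:

```dockerfile
# Hypothetical example: pin the base image and library versions so the
# environment is reproducible wherever the image runs.
FROM python:3.8-slim
RUN pip install scikit-learn==0.23.2 pandas==1.1.3
COPY model.pkl serve.py /app/
WORKDIR /app
CMD ["python", "serve.py"]
```

Building this produces a versioned image that can be pushed to a registry and pulled onto any machine with Docker installed.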
Microservices and container orchestration
A model that has been deployed by Dotscience in a Docker container can be thought of as a microservice: a swappable component of a larger system that provides a useful output. Typical production deployments will have many models. Microservices have become a common method of deploying modern applications when using DevOps and hence they are a good approach when using DevOps for ML.
Because there are usually many models, and each one has its own container, this means that there will be many containers. Thus, they need to be organized with respect to available resources, and scheduled. This is container orchestration.
Google’s Kubernetes has emerged as the go-to container orchestration system, and is thus now a common place for data science models to be deployed into production. However, the combination of setting up a Kubernetes cluster, making it work with existing company processes, having it approved by IT, and deploying your model into it via CI/CD and Docker is often more than the non-software-engineer data scientist signed up for (just as non-data-scientists didn't sign up to do that job in addition to their own). This means many data science teams in companies still need help, either by hiring specialists such as ML engineers, or by using a tool like Dotscience that makes it easier.
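To make the orchestration step concrete, here is a hypothetical Kubernetes Deployment manifest (names, image, and port are illustrative) that runs two replicas of a model-serving container pulled from a registry:

```yaml
# Hypothetical Deployment manifest: run two replicas of a model-serving
# container image pulled from a registry.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: churn-model
  template:
    metadata:
      labels:
        app: churn-model
    spec:
      containers:
      - name: model
        image: registry.example.com/churn-model:v1
        ports:
        - containerPort: 8501
```

Kubernetes then takes care of scheduling those replicas onto available nodes and restarting them if they fail, which is exactly the organization and scheduling described above.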
Endpoint

An endpoint is the point where an API, such as a deployed model's, interacts with another system. A Dotscience example is the CI stage, where the CI job might pull model files from a Dotscience endpoint that is compatible with the Amazon S3 storage system. A deployed model is also served on an endpoint accessible via a REST API, which is one method of sending data to it.
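Sending data to a deployed model's REST endpoint typically means POSTing JSON. As an illustration (the URL, model name, and feature values are hypothetical), a TensorFlow-Serving-style prediction request can be built like this; TensorFlow Serving's REST API accepts a {"instances": [...]} body at /v1/models/<name>:predict:

```python
import json
import urllib.request

# Hypothetical example of calling a deployed model's REST endpoint.
# TensorFlow Serving, for instance, accepts POST requests of the form
# {"instances": [...]} at /v1/models/<name>:predict.
url = "http://model.example.com/v1/models/mymodel:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one row of features

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(request)  # uncomment against a live endpoint
print(request.get_full_url())
print(request.data.decode("utf-8"))
```

The response would be a JSON body containing the model's predictions for each instance sent.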
Dotscience model proxy
The Dotscience model proxy is a way of passing the outputs from a deployed model to software that can do something useful with them, such as a time series database. One integration that we have, for example, passes outputs from TensorFlow Serving to Prometheus. Other tools could be plugged in using the same proxy.
Prometheus

Prometheus is a time series database. Because most models deployed live will be receiving data in real time and producing outputs, the resulting data will be in the form of a time series. In Dotscience, therefore, output is passed from the model proxy to Prometheus.
Prometheus has a second important attribute: arbitrary queries can be executed against it via its query language, PromQL. This means the monitoring needed for ML models deployed in production can be obtained by writing the appropriate query.
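For example, a PromQL query can be sent to Prometheus's HTTP API at /api/v1/query. The sketch below only constructs the request URL (the host and metric name are illustrative); the query asks for the per-second rate of model requests over the last five minutes:

```python
from urllib.parse import urlencode

# Hypothetical example: build a Prometheus HTTP API request for a PromQL
# query. This one asks for the per-second rate of model requests over
# the last 5 minutes (the metric name is illustrative).
base = "http://prometheus.example.com/api/v1/query"
promql = 'rate(model_requests_total{model="mymodel"}[5m])'
url = base + "?" + urlencode({"query": promql})
print(url)
```

Issuing a GET against that URL returns a JSON result vector, which is the same mechanism Grafana uses behind its dashboards.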
Grafana

Deployed models need to be monitored to check that their output continues to be sensible and is not undergoing degradation effects like data drift or model drift. As with Prometheus, we haven't tried to build our own monitoring system, but instead include one that is already widely used for monitoring software: Grafana (Figure 4).
Grafana allows the definition of arbitrary queries to time series databases, and their display in a variety of formats. Prometheus is one of its integrated tools, so it is straightforward to construct queries that allow you to monitor your ML models, as well as the more traditional microservice metrics that you will still want to monitor such as the RED metrics (request rate, errors, and duration).
As with Python, Kubernetes, etc., above, the tools used with the underlying Dotscience ZFS+Dotmesh architecture are not fixed, and similarly here data could be passed out from the Dotscience model proxy to tools other than Prometheus and Grafana.
We plan to show some example monitoring queries running on Dotscience deployed models in a future post.
Figure 4: Dotscience model monitoring in Grafana
Value

OK, so this is not strictly a DevOps term, but it is important to remember the overarching reason why we want the rigor of DevOps infused into data science for production: to better solve whatever problem is at hand and so create value. (Often business value, but it doesn't have to be: non-business motivations like data science for good or reproducible scientific work benefit similarly.) It always comes back to best solving the problem.
We have summarized various important DevOps terminology that a user will encounter when using Dotscience. At the same time this has shown some of the functionality that is available. For more terms, see the glossary that is part of the Dotscience documentation. In addition to DevOps, the glossary includes some common data science and machine learning terms that were not focused upon here.