In this interview, Dotscience Founder & CEO Luke Marsden discusses how “solving the MLOps drivers” can unblock AI in the enterprise with Ganesh Nagarathnam, Director of Analytics and Machine Learning Engineering at S&P Global.
The Innovation pillar is where people are trying new stuff, prototyping and testing. For example, they might try applying a new neural network architecture from a paper to a text classification business problem, and prove some early wins, such as good accuracy on a limited training and test set.
The Incubation pillar, then, is where you’re trying to take ideas like that tested in Innovation, and get them ready to move into production. In this pillar, you face challenges like building robust data pipelines so that you’ll be able to retrain your models in the future, and going from single data scientists doing ad-hoc work on local machines, to attempting to scale out the team, which brings with it the need to be more collaborative – and for their work to be reproducible, in preparation for the next phase.
The Productionization pillar is where these models actually make it into production, and start delivering value to the business. In this mode, the stakes are higher. We need to be able to trace back from a model to exactly how it was created, so that if there’s an issue with it we can forensically debug exactly what data it was trained on, by whom, with what code and in which environment. And we need to be able to cope with a data scientist leaving the team – there must be a sufficient record of what was done that one data scientist leaving doesn’t mean you have to rewrite your model. If a model is in prod, then it has to be resilient in the face of human, data and machine factors!
So to answer your question, what I mean about AI being blocked is that 50% or more of the time I see models fail to get past the Incubation pillar due to the challenges described above. That is a huge waste of time and effort, and it risks AI efforts being branded as a failure. What we need is tooling that enables a smooth flow of models into production, and to do that we need collaboration and handoff between data scientists (often in different timezones) that can happen asynchronously, so that models can move through the pillars as fast as a hurricane!
The interesting thing is that when you and I started talking, we both realized we are very aligned on what we believe are the solutions to these problems. A lot of the things you talk about in your DevOps for ML Manifesto, at S&P we have been calling MLOps Drivers.
Many orgs have highly distributed teams, and time zone differences make collaboration a challenge – one obvious implication of this is that collaboration needs to be asynchronous. That means, I can pick up something you’re working on, make changes to it, and propose them back to you without even needing you to be awake! (Otherwise I have to try and cram all my collaboration with you into the small amount of time zone overlap between, say, NYC and India). There are benefits of asynchronous collaboration for co-located teams as well though, as it reduces the number of interruptions. Like software engineering, data science and ML is a highly focused task, and getting interrupted from being “in the flow” can set you back many times longer than the interruption itself!
Software engineers have had the ability to do asynchronous collaboration for a long time now: with distributed version control systems, it’s easy for one engineer to fork a project, make a branch, do some work, and propose changes back to the owner of the project, even integrating changes that the owner of the project has made while the branch was in development (known to git users as “merging/rebasing from master”). However, in data science, tooling in this space is lacking. It’s a challenge to work with version control systems when it comes to collaborating on ML Models, and tracking hyperparameters and metrics is tricky.
The result? At best, people will email or Slack Jupyter notebooks to each other, track params and datasets in spreadsheets, and try to eyeball integration manually. This is hard though, and so data scientists often just don’t collaborate. That leads to silos and far lower productivity compared to my vision for how it could work: if any data scientist could fork anyone else’s work, including all the data, runs, code, notebooks and metrics, merge and propose changes back while keeping track of all the variables that are specific to data science and ML, then collaboration could be accelerated and be more like a hurricane! That’s why I’m interested in the promising collaboration capabilities in Dotscience.
The default way (in 2019) I’ve seen people doing model development involves individual data scientists manually keeping track of what they’re doing. This works fine in the Innovation pillar, where people are doing a lot of experimentation and prototyping – in this world, it’s totally possible for one data scientist to be quite productive by manually keeping track of their dataset versions, hyperparameters and code changes. But between Innovation and Productionization is Incubation – and that means preparing for models being in production.
When a model is running in production, the bar for model accountability – knowing exactly how it came to be, and on what basis it’s making its decisions – is much higher. This is especially true if the model is making life-changing decisions: imagine an autonomous vehicle model deciding when to hit the brakes, or a credit risk model determining affordability for loans. What’s more, these models will often be operating in regulated industries: that means models have to be accountable to regulators and authorities, as well as users (who may contend decisions), their creators and other stakeholders. This higher bar requires strong model accountability, which requires provenance.
Provenance is the ability to track back from a model that’s making decisions in production to the exact version of the code, data & parameters that trained the model, and back further from that – up the chain of data manipulations that created the training and test set – all the way back to raw data. In order to have strong model accountability, it must be possible to reliably trace back the exact sequence of data engineering steps and model training steps, all the way through to the Docker build that created the deployed artifact and the exact version of the model that’s running in production.
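To make that trace-back concrete, here is a minimal sketch in Python of a provenance graph and a walk from a deployed image back to raw data. The `Artifact` structure, the `trace_back` helper and all artifact names are hypothetical, chosen purely for illustration; they are not Dotscience's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One node in a provenance graph: a dataset, a model or a built image."""
    name: str
    produced_by: str = ""                        # step that created this artifact
    inputs: list = field(default_factory=list)   # names of upstream artifacts


def trace_back(graph, name):
    """Return the lineage of `name`, nearest artifact first, raw data last."""
    lineage, queue, seen = [], [name], set()
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        lineage.append(current)
        queue.extend(graph[current].inputs)
    return lineage


# Hypothetical artifacts, for illustration only.
graph = {a.name: a for a in [
    Artifact("raw_data_v1"),
    Artifact("train_set_v3", "data engineering", ["raw_data_v1"]),
    Artifact("test_set_v3", "data engineering", ["raw_data_v1"]),
    Artifact("model_v7", "model training", ["train_set_v3", "test_set_v3"]),
    Artifact("image_v7", "docker build", ["model_v7"]),
]}

lineage = trace_back(graph, "image_v7")
print(lineage)
```

In a real system each node would also carry the code version, parameters and environment, so that the same walk answers "what exactly produced this model?".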
What’s more, in order to be able to forensically debug and fix models when they go wrong, you need to be able to set up the conditions for it to be retrained: that means you need perfect reproducibility of datasets including exact data versions, code/notebook versions, environment (python libraries, etc) and so forth. And it must still be possible to do this 9 or 18 months after you trained the model! All of this is why I’m curious about the provenance graph and data and environment versioning and reproducibility capabilities in Dotscience.
I’ve seen a great deal of data scientists’ time get chewed up (some say as much as 75%) with infrastructure tasks. Spinning up VMs on a cloud, attaching cloud volumes, downloading data, manually copying data between VMs, configuring GPU drivers, setting up virtualenvs, fiddling with python and library versions, trying to reproduce a colleague’s work based on a half-baked README in the repo – it’s all time-consuming and error-prone when managed manually.
What’s more, how many times have you forgotten to stop a VM on the cloud when you’re done using it for the day? Cloud costs add up when you have one VM per data scientist, especially when they have expensive GPUs attached to them! And if you try to share GPUs between developers, they can step on each other’s toes, in that sharing GPU memory between multiple sessions is fiddly and requires manual coordination (which we want to avoid, see Collaboration).
There are two features of Dotscience I’d like to explore in this context. The first is the ability to “save” a project in the Dotscience hub: every run (the unit of versioning) gets auto-saved, in a sense frozen and put on the shelf to be unfrozen later in exactly the same state it was in before. Later, a project can be spun back up, irrespective of which runner it’s launched on. Individual compute and storage can come and go, but the project (model development environment) with all its context (data, environment & code) persists and can be easily moved around. The second is the ability to auto-scale runners on cloud infrastructure, in other words for data scientists to self-serve exactly the compute they need and switch easily between CPU and GPU runners, getting dedicated resources exactly when they need them (no more stepping on each other’s toes). On top of that, the platform will automatically shut down idle runners within an hour or two, liberating data scientists from having to worry that they left their VM on over the weekend, racking up a huge cloud bill. Similarly, it opens the door for centralized ops teams to define a budget for an ML team, while the data scientists get flexibility and freedom on exactly how and when they spend it on just the resources they need to be productive.
When you have teams of data scientists working on many problems, there are a couple of requirements that emerge: firstly, the ability to join components together into workflows or pipelines, enabling rapid experimentation, and secondly, the ability to share and reuse these componentized algorithms and “recipes” within the organization.
When we started working with Dotscience, neither of these capabilities existed at all in the platform – instead, only the primitive ability to execute a script via CLI or API, known as “ds run” existed. The ability to execute a certain script with certain inputs and parameters in a tracked context (recording its results as a data or model run) is clearly a building block for building a workflow/pipeline feature, but the pipelining feature itself didn’t yet exist.
I was pleasantly surprised, then, when the Dotscience team rapidly prototyped a pipelining feature in a matter of days. Now it’s possible to combine pipeline stages like data engineering, model training, model deployment and statistical monitoring within the event-driven context of the visual flow editor, trigger new model builds based on a schedule or whenever code/data changes, and easily wire up different stages of a model lifecycle. I’m interested to see how this feature, which is currently in beta, matures within the Dotscience product.
The flows created in Node-RED are stored in a simple JSON format which can be easily imported and exported for sharing flows, and an online flow library can be created for sharing flows and components within the organization.
It’s also worth noting that in parallel with this built-in visual pipelining feature, we also support integrating Dotscience with CI/CD systems like GitLab, as this allows an alternative (and more software-development style) approach to creating production data and model pipelines.
Back to you, Ganesh!
Following on from the discussion about Collaboration, both productivity and governance (being able to track and control the changes that are made to models) are enabled by having the ability to holistically view and review experiments that have been attempted.
This means having the ability for a data scientist to see what’s been tried before, both by themselves and others: which kinds of models have been trained against which datasets, with which hyperparameters, and what the results were in terms of accuracy, loss, f-score or other metrics.
It also means the ability for a data science or ML manager to be able to see the current progress on projects, compare the efforts of different data scientists or different teams against solving the same business problem. Enabling both of these things requires the notion of a central “hub”: where different data scientists have their work automatically synchronized to the hub whenever they publish results.
Another valuable aspect of having such a central hub is the ability to enable review of proposed changes to a model. Software engineers have for over a decade enjoyed the ability to review a pull request when another engineer proposes a change, to comment line-by-line on the change and to be able to integrate potentially conflicting changes. Merging the history of the experiments (the metrics, hyperparameters and data provenance of each run, and any changes to data engineering) as well as just the code changes levels up data science teams, improving productivity and collaboration.
Both of these reasons are why I’m curious to explore the Dotscience Metric Explorer for reviewing experiments (and connecting runs with certain metrics to provenance of both data and models), and the Dotscience Collaboration system including Pull Requests.
Once the data has been wrangled into a training and test set and the ML team is developing a model, there comes a time to iterate on what kind of Machine Learning or Deep Learning model to use. Different types of models have different “hyperparameters” – in ML/DL, a parameter is something like the individual weight of a neuron in a neural network, but a hyperparameter is a “meta” parameter, a parameter about how the neural network is trained. One example of such a hyperparameter is the learning rate of a neural network, which affects the size of the jumps the backpropagation algorithm uses when teaching the network based on the labelled training data.
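The parameter/hyperparameter distinction can be seen in a toy example. The sketch below is not a neural network: it assumes a single weight `w` and a made-up loss `(w - 3)^2`, and uses plain gradient descent so that the effect of the learning rate on the size of each jump is easy to see.

```python
def train(learning_rate, steps=50):
    """Minimize the toy loss (w - 3)^2 with plain gradient descent."""
    w = 0.0                            # the parameter: learned during training
    for _ in range(steps):
        gradient = 2 * (w - 3)         # derivative of the loss with respect to w
        w -= learning_rate * gradient  # the hyperparameter scales each jump
    return w


# Same model, same loss, three different learning rates.
for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate={lr}: w={train(lr):.4f}")
```

With a very small learning rate the weight has not yet reached the optimum of 3 after 50 steps, with a moderate one it gets very close, and with one that is too large each jump overshoots and the weight diverges.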
The ability to rapidly explore the space of possible hyperparameters and determine the best ones is a key requirement in MLOps – while it’s possible with Dotscience to use libraries like sklearn’s GridSearchCV, it presently only works on a single runner – and I’d like to see the ability to scale out that search in parallel across many runners.
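As a rough picture of what a scaled-out search could look like, here is a standard-library sketch that expands a grid of hyperparameters and evaluates the candidates in parallel. It makes several assumptions: `evaluate` is a made-up scoring function standing in for a real training run, the grid values are arbitrary, and local threads stand in for separate runners. It is not how Dotscience or GridSearchCV implement search.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Stand-in for one training run: returns (score, params).

    The score is a made-up function that peaks at learning_rate=0.1, depth=4.
    """
    learning_rate, depth = params["learning_rate"], params["depth"]
    score = 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 4)
    return score, params


# Hypothetical grid: every combination becomes one candidate run.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each worker thread stands in for a separate runner.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, candidates))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

Because each candidate is independent of the others, the same pattern extends naturally from threads on one machine to runs dispatched across many machines.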
Once a model is ready to be promoted either from development or a model pipeline into production, it first needs to land in a Model Library. The essential characteristic of a model library, in my opinion, is that it keeps with each versioned model in the library a strong link back from a built model (an artifact ready to be deployed to production) back to the provenance of the given model, both in terms of data the model was trained on (and where that data came from) and also the code version and the set of hyperparameters (e.g. command line arguments, if the model was trained via a command-line instantiation).
A model exists as a set of versioned files of, say, a Tensorflow saved_model – for example, the serialized weights of a neural network along with the NN architecture itself – but additionally, those versioned model files also need to be built into a Docker image along with an appropriate model server, for example Tensorflow Serving, which can be deployed into production, for example on a Kubernetes cluster.
So when tracking multiple experiments, it’s essential to keep track of the generated result: the model, and be able to trace that model back to the experiment which created it. And it must be possible to store those models automatically, without needing a human to manually copy files around, as humans are bad at doing things like that without making mistakes and forgetting to track what went where.
All of the above is why I’m interested in the Dotscience Model Library with its strong link back from a model to that model and data provenance, hyperparameters, metrics and collaboration history (e.g. which pull requests went into the development of the code in the notebooks that created it), and also the link forwards from the model library to the built docker images, the deployment of those images, and a place to monitor the images once deployed into production.
Speaking of deploying models to production, leading on from models that live in the model library (along with their backreferences for provenance), you need the ability to deploy those models to production. Kubernetes has quickly become the de-facto way to run ML models in production, and all the major cloud providers now offer the ability to deploy Kubernetes fairly easily as a service, which moves the burden of operating Kubernetes partially or fully onto the cloud provider. However, even using Kubernetes as a developer is fairly complex, and so an additional abstraction is really needed to make it possible to instantly deploy ML models for testing, integration into business applications and production.
That’s why I’m looking forward to exploring the new Dotscience Deploy capability, which makes deploying a model to Kubernetes as simple as the data scientist running ds.publish(deploy=True) in their notebook or script, or using the web interface, CLI or API.
Another key requirement is the ability to deploy those ML models either as a REST API, for immediate predictions/classifications of production data, or in a batch mode, where they can be applied to transform or process potentially huge datasets in a data warehouse on a schedule, e.g. nightly.
Speaking of Kubernetes, scalability of models is where Kubernetes really shines. Once a model is trained and baked into a Docker image, a simple and automated way for data scientists to instantly deploy a model for testing or production becomes essential. I’m particularly curious about the new Dotscience capabilities to deploy an ML model to a scalable, fault-tolerant Kubernetes cluster, all with a single command.
Once a model is trained and baked into a Docker image along with the model server, it’s stateless: that is, giving it the same input multiple times will result in the same output – the model doesn’t “remember” anything between different API calls. This means it’s amenable to horizontal scaling of a stateless application, which is what Kubernetes excels at!
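A small sketch of why statelessness makes horizontal scaling easy: if the prediction depends only on the request, any replica can serve any request. The `predict` function, its weights and the round-robin "load balancer" below are all hypothetical stand-ins, not a real model server.

```python
from itertools import cycle


def predict(features):
    """Toy stateless model: the output depends only on the input."""
    weights = (0.5, -0.25, 2.0)  # fixed when the model was built
    return sum(w * x for w, x in zip(weights, features))


# Three identical replicas behind a round-robin balancer.
replicas = [predict, predict, predict]
balancer = cycle(replicas)

request = (1.0, 2.0, 3.0)
outputs = [next(balancer)(request) for _ in range(6)]
print(outputs)
```

Since no replica keeps anything between calls, adding or removing replicas changes capacity but never the answers.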
I know you and your team come from a DevOps and Kubernetes background, Luke, so perhaps you could dig a little deeper on Kubernetes for ML?
So it’s possible for ML engineers and data scientists to manually write Kubernetes YAML, and figure out a system for getting that YAML deployed to testing or production clusters. That’s a lot of complexity for data scientists & ML engineers who might not have a DevOps background, though. Being able to do it with a single command from a Jupyter notebook or a Python script is much closer to what these teams are comfortable and productive with, and with Dotscience you can do that with the single Python command ds.publish(deploy=True).
The way this works is by registering a Kubernetes cluster as an attached Kubernetes ‘deployer’, which is an agent that runs in the cluster and waits for commands from the Dotscience Hub to initiate the deployment. It’s also possible to register the model in the model library, and then deploy the model to a Kubernetes deployer via the web interface, CLI or API, in scenarios where you don’t want to couple the model training and deployment so tightly. You can also scale models horizontally by editing a deployment in Dotscience and updating the number of replicas. Kubernetes can also be configured to auto-scale a deployment up and down based on load, further optimizing cloud spend.
OK, so we’ve got a model deployed to a scalable and fault-tolerant Kubernetes cluster, along with a strong link back from the deployment back to the provenance of that model and the data it was trained on – cool – now it’s in production and it starts making automatic decisions on our behalf. How do we know how it’s behaving?
Well, if we knew the right answer to whatever the model was predicting, then we wouldn’t need the model! For example, if an autonomous vehicle model is predicting road signs, but we already know what road sign we’re looking at (somehow) then we wouldn’t need the model. In other words, the production data is inherently unlabelled. Normal approaches to monitoring a software microservice, such as tracking latency and error rate, break down here (of course, measuring latency and error rates for models is essential, but it’s not enough). Instead, there are statistical approaches we can take, such as looking at the distribution of predictions the model is making in production. So to take the road signs example again, if you have a fleet of autonomous vehicles roaming around, and the percentage of stop signs being predicted in production drops to zero, you can be pretty confident that something is wrong, and in that scenario you’d want a statistical monitoring system to automatically page a human! Either you deployed a bad model that wasn’t actually able to predict stop signs in practice (maybe there was an error in the training set), or something has changed in the world. Maybe it snowed, and your model can’t classify stop signs in the snow. Stranger things have happened in ML!
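The stop-sign check can be sketched in a few lines of Python. This is one simple way to compare prediction distributions, assuming a fixed tolerance on each class's share; the class names, counts, the 10% threshold and the helper names are made up for illustration, and a production system would likely use a proper statistical test.

```python
from collections import Counter


def class_shares(predictions):
    """Fraction of predictions falling into each class."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: count / total for label, count in counts.items()}


def drifted_classes(baseline, live, tolerance=0.10):
    """Classes whose share of predictions moved by more than `tolerance`."""
    labels = set(baseline) | set(live)
    return sorted(
        label for label in labels
        if abs(baseline.get(label, 0.0) - live.get(label, 0.0)) > tolerance
    )


# Made-up numbers: predictions seen during validation vs. in production.
baseline = class_shares(["stop"] * 20 + ["yield"] * 30 + ["speed_limit"] * 50)
live = class_shares(["yield"] * 35 + ["speed_limit"] * 65)

alerts = drifted_classes(baseline, live)
if alerts:
    print("page a human, prediction share shifted for:", alerts)
```

Note that no labels are needed: the check only compares what the model is predicting now against what it predicted when it was known to behave well.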
Fundamentally, we need to be able to monitor the health of the model through all of the stages of the lifecycle: that means tracking metrics during development, and also the distribution of predictions it’s making once it’s running in production. We’d use a change in behaviour in production as a trigger to go back and retrain the model on new, fresher data, and then re-deploy the new version of the model. The entire process benefits greatly from automatic tracking of all the variables that go into the model. Without that, it’s a lot of manual tracking, and as the number of models and the versions of those models increases rapidly as we scale our AI efforts, that approach stops working quickly.
That’s why I’m interested to see the way that Dotscience brings both the Metric Tracking in development and the Statistical Monitoring into the same place, the Dotscience Hub, so that data science teams can see everything in one place – rather than having production metrics tracked exclusively by a separate Ops team, and having to consult multiple systems and do a lot of guessing when trying to determine when the models need to be retrained.