The anxiety of manual experiment tracking
The “normal” way of doing creative, exploratory data science is slightly anxiety-inducing. We learn to cope, but it can be improved! Specifically, you might create a model (in Jupyter or by running a script, perhaps in a container) without perfectly recording all of the context: data, code, parameters, and environment. When you (or someone else) come back later to reproduce the same result or update the model, the context may have changed. The data is slightly different, you upgraded Python libraries, you pulled in your colleagues’ library updates… how can you explain the huge drop in model accuracy you see when you try to run it again?
How DevOps solved it first
There is a parallel problem in DevOps, and it relates to application deployment configuration in a cluster. When you deploy apps to your cluster, you need to update the versions of things that are running, say, in production or the staging environment. One way to do that is by manually poking the cluster to change which versions are running. But working that way, you’ll quickly run into versioning problems: you’ll change something, and your colleague will change it back. Nobody notices that the versions in the cluster are out of sync until it causes an outage. There’s no source of truth for what should be running in the cluster, nor a record of how it came to be that way.
As one solution, you might try to version control the cluster configuration, say, in Git, but then you encounter an interesting problem. If everyone always commits their changes to Git and immediately applies them to the cluster, you’re OK. But what if you apply a change manually and forget to version control it? That’s easy enough to do in a crisis! Or what if you commit something to version control that you never applied to the cluster? You end up with a “fork in the road” between applying changes to version control and applying them to the running cluster.
DevOps brought an interesting solution to this problem: drive the cluster configuration from version control. This is known as GitOps (hat-tip to my former colleagues at Weaveworks). GitOps is a really important idea, similar to continuous delivery for software. All changes to the cluster configuration go through version control. Now humans don’t have to remember to update both the running cluster and the version control record. You have a single chain of command for changes to the cluster configuration. It’s auditable, and explaining why things were changed and who changed them is as easy as looking at the revision history.
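To make the idea concrete, here is a minimal sketch of the comparison at the heart of a GitOps-style workflow: the desired state comes from version control, the actual state comes from the cluster, and a tool works out what needs to change. The function name `plan_changes`, the app names, and the version numbers are all hypothetical, and a real GitOps tool does much more than this (reading the repository, talking to the cluster API, applying the changes).

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, str], running: Dict[str, str]) -> List[str]:
    """Return the actions needed to make `running` match `desired`.

    Both arguments map an app name to a version string.
    """
    actions = []
    # Anything in version control that is missing or outdated in the cluster.
    for app, version in sorted(desired.items()):
        if app not in running:
            actions.append(f"deploy {app}@{version}")
        elif running[app] != version:
            actions.append(f"update {app}: {running[app]} -> {version}")
    # Anything running that version control knows nothing about.
    for app in sorted(running):
        if app not in desired:
            actions.append(f"remove {app}")
    return actions


# Desired state, as it would be read from the Git repository (made-up values).
desired = {"web": "1.4.2", "api": "2.0.0"}
# State observed in the cluster, including a manual change nobody committed.
running = {"web": "1.4.1", "api": "2.0.0", "debug-shell": "0.1.0"}

for action in plan_changes(desired, running):
    print(action)
```

Because the repository is the only input for the desired state, a manual change to the cluster shows up as drift to be corrected rather than silently becoming the new normal.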
How can we do this in ML?
Dotscience is doing the same thing for ML. The problem in ML is that starting a training run of a model (or doing some exploratory data wrangling) is a separate action from version controlling your context. There’s that “fork in the road” again! You might train some models, make a Git commit of your code, adjust a hyperparameter in the code, train a model, tweak the data, refactor slightly, and commit again, but there’s no guarantee that you are committing exactly the code changes that led to that exact model! And note that in ML, there are also many other variables to control for, including:
- Parameters, which might be supplied as command-line arguments to a model training script; it’s unlikely that you’re recording every one of those
- Data changing as you’re iterating back and forth between data engineering and model development
- Hardware, such as how much RAM there is in your GPU
All of the above makes it really hard to keep track of everything, and induces the slightly anxious feeling of knowing you won’t be able to get back here in the future, along with all the problems that will bring. What if that’s the model that ends up running in production? How will you know exactly where it came from if it breaks?
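For a sense of what recording that context by hand involves, here is a rough sketch that snapshots the code, data, parameters, and library versions for a single run. It is an illustration only, not how Dotscience works internally: the function names and the shape of the record are assumptions, it hashes a single code file and a single data file, and it does not capture hardware details such as GPU memory.

```python
import hashlib
import platform
import sys
from importlib import metadata


def file_sha256(path):
    """Hash a file's bytes so that any change to it is detectable."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def capture_context(code_path, data_path, params, packages):
    """Collect code, data, parameters, and environment for one run."""
    return {
        "code_sha256": file_sha256(code_path),
        "data_sha256": file_sha256(data_path),
        "params": dict(params),
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": {name: installed_version(name) for name in packages},
    }
```

You would have to call something like `capture_context("train.py", "data.csv", {"lr": 0.01}, ["numpy"])` at the start of every single run and store the result next to the model (the file names and parameter here are placeholders). Remembering to do that every time is exactly the kind of manual step that gets skipped.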
Run tracking takes care of everything
The Dotscience platform relieves the anxiety with run tracking. Now every time you hit run, Dotscience automatically tracks the exact code you ran (don’t stop using Git, but we’ll show you the exact version of the code you actually ran versus the last code you committed), the exact data version that went into training the model, the exact library versions in a container, and any command-line or inline (in code or a notebook) parameters you supplied. The platform can show you the metrics on a nice chart so you can gain insight into which parameters have the biggest impact on performance. It also automatically publishes those runs to a “runnable ML knowledge base” so your colleagues can all learn from your progress asynchronously.
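Once every run is recorded with its parameters and metrics, asking which parameter matters most becomes a simple query. The sketch below is a crude, hypothetical stand-in for that kind of analysis and is not the Dotscience API: the run records, the `parameter_impact` name, and the accuracy figures are all made up for illustration. For each parameter it groups runs by value and measures the gap between the best and worst average metric.

```python
from collections import defaultdict
from statistics import mean


def parameter_impact(runs, metric):
    """For each parameter, return the gap between the highest and lowest
    average value of `metric` across that parameter's settings."""
    by_param = defaultdict(lambda: defaultdict(list))
    for run in runs:
        for name, value in run["params"].items():
            by_param[name][value].append(run["metrics"][metric])
    impact = {}
    for name, groups in by_param.items():
        averages = [mean(values) for values in groups.values()]
        impact[name] = max(averages) - min(averages)
    return impact


# Made-up run records; in practice these would come from tracked runs.
runs = [
    {"params": {"lr": 0.1, "depth": 3}, "metrics": {"accuracy": 0.80}},
    {"params": {"lr": 0.1, "depth": 5}, "metrics": {"accuracy": 0.82}},
    {"params": {"lr": 0.01, "depth": 3}, "metrics": {"accuracy": 0.90}},
    {"params": {"lr": 0.01, "depth": 5}, "metrics": {"accuracy": 0.92}},
]

impact = parameter_impact(runs, "accuracy")
for name in sorted(impact, key=impact.get, reverse=True):
    print(f"{name}: spread of {impact[name]:.2f}")
```

In this toy data the learning rate shows a much larger spread than the depth. The point is that the question is only answerable because every run kept its parameters alongside its metrics.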
Reduce your data science anxiety: try Dotscience today! No credit card is required, and you get free access to cloud compute so you can play with it easily.