Run Tracking Liberates ML Teams
The “normal” way of doing creative, exploratory data science is slightly anxiety-inducing. We learn to cope, but it can be improved! Specifically, you might create a model (in Jupyter or by running a script, perhaps in a container) without perfectly recording all of the context: data, code, parameters, and environment. When you (or someone else) come back later to reproduce the same result or update the model, the context may have changed. The data is slightly different, you upgraded Python libraries, you pulled in your colleagues’ library updates… how can you explain the huge drop in model accuracy you see when you try to run it again?
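To make the four pieces of context concrete, here is a minimal sketch of what recording them for a single run might look like. It is an illustration, not a description of any particular run-tracking tool: the function names (file_sha256, capture_run_context, context_diff) and the record layout are hypothetical, and the code version is assumed to be something the caller supplies, such as a Git commit hash.

```python
import hashlib
import platform
from importlib import metadata


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run_context(data_path, params, code_version, packages=()):
    """Collect data, code, parameters, and environment for one run.

    Hypothetical helper: code_version is supplied by the caller,
    for example a Git commit hash.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "data_sha256": file_sha256(data_path),
        "code_version": code_version,
        "params": dict(params),
        "environment": {
            "python": platform.python_version(),
            "packages": versions,
        },
    }


def context_diff(old, new):
    """Return the top-level keys whose values differ between two records."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

With two such records in hand, context_diff gives a first answer to "what changed between the run that worked and the run that didn't?": if it returns only data_sha256, the data is the place to start looking.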
How DevOps Solved It First
There is a parallel problem in DevOps, and it relates to application deployment configuration in a cluster. If you are deploying apps to your cluster, you can update the versions of the things that are running, say, in production or the staging environment. One way to do that is by manually poking the cluster to change which versions are running. But working that way, you’ll quickly run into versioning problems: you change something, and your colleague changes it back. Nobody notices that the versions in the cluster are out of sync until it causes an outage. There’s no source of truth for what should be running in the cluster, nor a record of how it came to be that way.
As one solution, you might try to version control the cluster configuration, say, in Git, but then you encounter an interesting problem. If everyone always commits their changes to Git and immediately applies them to the cluster, you’re OK. But what if you apply a change manually and forget to version control it? That is easy enough to do in a crisis! Or what if you commit something to version control that you never applied to the cluster? You end up with a “fork in the road”: the history in version control and the state of the running cluster diverge.
DevOps brought an interesting solution to this problem: drive the cluster configuration from version control. This is known as GitOps (hat-tip to my former colleagues at Weaveworks). GitOps is a really important idea, similar to continuous delivery for software. All changes to the cluster configuration go through version control, so humans no longer have to remember to update both the running cluster and the version control record. You have a single chain of command for changes to the cluster configuration. It’s auditable, and explaining why things were changed and who changed them is as easy as looking at the revision history.
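The idea of driving the cluster from version control can be sketched as a small reconciliation step: compare what version control says should be running with what is actually running, and derive the changes from the difference. The code below is a toy model under stated assumptions, not how any specific GitOps tool is implemented. The cluster is represented as a plain dictionary of app name to version, and plan_changes and apply_changes are hypothetical names.

```python
def plan_changes(desired, running):
    """Compare desired versions (from version control) with what is running.

    Both arguments map app name -> version string. Returns a list of
    (action, app, version) tuples that would bring the running state
    in line with the desired state.
    """
    actions = []
    for app, version in sorted(desired.items()):
        if app not in running:
            actions.append(("deploy", app, version))
        elif running[app] != version:
            actions.append(("update", app, version))
    # Anything running that version control does not mention gets removed.
    for app in sorted(running.keys() - desired.keys()):
        actions.append(("remove", app, None))
    return actions


def apply_changes(running, actions):
    """Return the new running state after applying the planned actions."""
    new_state = dict(running)
    for action, app, version in actions:
        if action == "remove":
            new_state.pop(app, None)
        else:
            new_state[app] = version
    return new_state
```

Because the plan is computed only from the version-controlled state, a manual change to the cluster shows up as a difference on the next pass and is reverted, and a commit that was never applied shows up as a pending action. Neither side of the “fork in the road” can drift silently.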