The architecture of the Dotmesh Operator


How we built the Dotmesh Operator.

Although I’ve been writing distributed software for a long time, I’ve only recently started working with Kubernetes. This blog post won’t focus too much on the fine details of the Dotmesh Operator; I’m going to focus more on the topic of how to build a software system that tries to maintain a desired state in a changing world, and how that’s implemented as a Kubernetes Operator.

What’s an Operator?

If you’re familiar with Kubernetes, feel free to skip to the next section.

Kubernetes manages Pods, which are collections of containers running Docker images, connected to external resources such as persistent storage volumes and networks; and it routes network connections into the software running in those Pods. At the lowest level, you can ask Kubernetes to create a Pod of containers by providing a specification that lists the details of the containers and the external resources they should be connected to, and it’ll find a suitable node in the cluster and fire your Pod up for you.

However, people rarely do that directly; higher-level abstractions are provided, each of which is a logical object “in” the Kubernetes cluster. These abstractions don’t exist in any concrete sense on your nodes (they’re just an entry in a database), but the Kubernetes system uses them as descriptions of a desired setup of Pods in your cluster, and attempts to create those Pods to meet your requirements. The main ones are:

  • Deployments, which ask Kubernetes to maintain a given number of Pods all following the same specification. They can’t connect to persistent volumes unless those volumes support attachment on multiple nodes at once (eg, NFS mounts or cluster filesystems), because the Pods will (in general) be started on different nodes. This is normally used for stateless services; one might run a Deployment of ten instances of an API server Pod. Pods that terminate are restarted.
  • StatefulSets are similar to Deployments, but with a twist. Each Pod in a StatefulSet has its own identity (they get numbers, as it happens). Unlike a Deployment, StatefulSets can be attached to arbitrary persistent volumes, through a template - persistent volumes will be created for each Pod, and associated tightly with the identity of that Pod, even if the Pod is destroyed and re-created on another node. If you’re running a distributed database, you’d set it up as a StatefulSet; each Pod (a single node of your distributed database) would then be associated with its own unique persistent volume, and Kubernetes would, in effect, run a virtual cluster for you.
  • DaemonSets are also similar to Deployments, but rather than running a specified number of copies of the Pod template on arbitrary nodes (which could result in several of them running on the same node), they ensure that one instance of the Pod template runs on every node. They’re useful for infrastructure components that need to run on every node, such as monitoring agents.
  • Jobs are a different kettle of fish. While Deployments and StatefulSets are for long-running daemon servers that sit there providing some kind of network service, Jobs are for one-offs. A Job has a template for a Pod; when the Job is created, the Pod is created. If the Pod terminates successfully, then that’s that; if it terminates with an error, then the Job will restart it (subject to policies described in the Job specification) until it succeeds. Jobs are a relatively thin wrapper around raw Pods, adding simple restart logic.
  • CronJobs build on top of Jobs. A CronJob, as you might expect, contains a Job specification and a schedule; according to that schedule, Jobs are created from the specification, which then in turn cause some Pods to run. The CronJob allows a configurable number of finished Jobs to linger, but clears up old Jobs (with their corresponding completed Pods) beyond that threshold, to stop them piling up. They’re used to run things regularly, such as clean up jobs, backups, or periodic report generation.

However, the above list isn’t sufficient for our needs. Dotmesh needs to run on every node (so it can set up mounts on the nodes), so we originally shipped it as a DaemonSet; but upcoming versions will need access to per-Pod persistent volume storage, which DaemonSets don’t provide. We couldn’t run it as a StatefulSet either, because StatefulSets don’t provide the one-Pod-per-node guarantee.

Thankfully, Kubernetes has a mechanism to use your own, arbitrary, logic to manage Pods: You can write an Operator.

How an Operator works.

There’s no special infrastructure in Kubernetes for Operators; they’re just applications that use the Kubernetes API to peer into the world of Pods, nodes, persistent volumes, and other things in your cluster; and to make changes as required.

They can connect from outside the cluster, if given credentials of the kind you’ll find in the ~/.kube/config file that kubectl itself reads to gain API access; but conventionally they’re run inside a Pod of their own, configured to grant suitable access to the API at the right level. Kubernetes injects credentials directly into the Pod, which the Go API client library can find and use automatically (see the source code for the Dotmesh Operator if you want the nuts and bolts of this in Go).

As a client of the Kubernetes API, an operator can get lists of existing Pods, nodes, PVCs, or other objects in the system, and set up “watches” in order to be notified asynchronously of subsequent additions/deletions/updates to those objects; and it can create, update, or delete those objects.
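As a stdlib-only sketch of that sensor pattern (a real operator would use the client-go informer machinery; the types and names here are illustrative, not the actual operator’s), the dispatch loop looks something like this:

```go
package main

import "fmt"

// EventType mirrors the add/update/delete notifications that a
// Kubernetes watch delivers for Pods, nodes, PVCs and other objects.
type EventType int

const (
	Added EventType = iota
	Updated
	Deleted
)

// Event is a simplified stand-in for a watch notification.
type Event struct {
	Type EventType
	Name string // name of the affected object
}

// runWatcher dispatches each event to the matching handler - the same
// shape as client-go's cache.ResourceEventHandlerFuncs.
func runWatcher(events <-chan Event, onAdd, onUpdate, onDelete func(name string)) {
	for ev := range events {
		switch ev.Type {
		case Added:
			onAdd(ev.Name)
		case Updated:
			onUpdate(ev.Name)
		case Deleted:
			onDelete(ev.Name)
		}
	}
}

func main() {
	events := make(chan Event, 3)
	events <- Event{Added, "node-1"}
	events <- Event{Added, "node-2"}
	events <- Event{Deleted, "node-1"}
	close(events)

	runWatcher(events,
		func(n string) { fmt.Println("added:", n) },
		func(n string) { fmt.Println("updated:", n) },
		func(n string) { fmt.Println("deleted:", n) },
	)
}
```

With real informers, the handler functions receive the full API object rather than just a name, but the shape of the code is the same.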

That’s literally all there is to it. Some operators create a “Custom Resource Definition”, so you can create Kubernetes API objects of a custom kind (with configuration data inside) to cause the operator to set up clusters or whatever it does; that’s closest in spirit to the DaemonSets and so on that come with Kubernetes. The Dotmesh Operator, however, only runs a single Dotmesh cluster, as Dotmesh is a cluster infrastructure component. So we decided to just have the Operator look for configuration in a single hardcoded ConfigMap, but we can change this in future if we need to.

Operating in a changing world.

So, in our case, we want to create a Pod for every node in the cluster, and attach a dedicated PVC to it. If a bunch of Pods goes away - either because their nodes exploded, or because an administrator deleted them, or something - we have a load of PVCs lingering with possibly interesting state that hasn’t been replicated yet; so we want them to be re-used when new nodes come back and need new Pods, and if no new nodes are forthcoming, we should go ahead and create additional Pods on the existing nodes to handle the otherwise homeless PVCs.

At first sight, you might be tempted to start writing some code that responds to individual events coming from Kubernetes. After all, a callback gets invoked whenever a node is added, so why not handle that by creating a new Pod to run on that node, with cases to handle when there’s an existing PVC to reuse, or creating a new one? And for the node deletion case, we can see about bringing up a new Pod on an existing node to look after the abandoned PVC left by the Pod that would have been running on that node. In the case of a Pod being deleted, we re-create it so that node isn’t left without a Pod - except that we also get notified of a Pod deletion when a Pod is destroyed by a node shutting down, so we’ll need to work out how to differentiate that case…

What about Pod addition? Well, only we create Pods, so there are no extra actions we need to take in that case. Likewise with PVC addition and deletion. We’ll need to scan the cluster for an initial list of PVCs when we start up, to see if there are any that are orphaned and need Pods creating, but PVC creation and deletion thereafter isn’t something we need to listen for notifications about. While we’re at it, at startup time, we need to assess the initial state of the system we’ve just started up in. It might be a fresh new cluster, ready for us to create a PVC and a Pod on every node; or it might be an existing system in some arbitrary state, and the operator has just restarted for some reason, so we need to assess the situation and do whatever is required to fix it.

However, this is setting off in a bad direction. We already have duplication of logic - the initial “assess the cluster” logic tries to bring about a certain state, and then the “react to changes” logic also tries to maintain that state; and so the same desired state of the system is written into the software in two places. That alone is a recipe for bugs, but our interpretation of object additions/deletions as the cluster growing or shrinking has subtler problems. Complicated events such as a Kubernetes cluster upgrade might involve a stream of Pod/node deletions coupled with node additions as upgraded nodes re-join the cluster. That will keep the operator busy with what might, in a large cluster with lots of complicated resources, become a torrent of little events; the operator can’t see the wood for the trees.

A much better approach is to just implement that initial startup logic: look at the state of the cluster, see what’s wrong with that state, and take steps to improve that state. We need to do that on initial startup in a cluster of completely unknown state, but we can also use exactly the same logic to react to a subsequent change. For the Dotmesh Operator, we’ve taken the approach of just setting a “change is needed” flag when any node, Pod or PVC is created, updated, or destroyed; and once a second, a separate goroutine polls that flag. If it’s set, it does a full check on the nodes, PVCs and Pods in the system, identifies any problems, and fixes them. The “problems” in our case are:

  • Nodes that should be running at least one Dotmesh Pod, but aren’t (the ConfigMap contains a selector matching the nodes that should run Dotmesh). Solution: Fire up Dotmesh on that node, using an existing PVC if one’s free, or creating a new one.
  • Nodes that are running a Dotmesh Pod, but shouldn’t be. Solution: Kill the Pod.
  • Dotmesh Pods that have a problem (they’re running the wrong image, or are broken in some way). Solution: Kill the Pod.
  • PVCs that aren’t attached to any Pod. Solution: Fire up Dotmesh on a random node, using that PVC.
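
These four checks can be written as a pure function over a snapshot of cluster state. This is an illustrative, stdlib-only sketch - the struct fields and names are mine, not the actual operator’s types:

```go
package main

import "fmt"

// Simplified snapshots of cluster state; a real operator would build
// these from the Kubernetes API.
type Pod struct {
	Node    string
	PVC     string
	Healthy bool // running the right image and not broken in some way
}

type State struct {
	EligibleNodes []string       // nodes matched by the ConfigMap's selector
	Pods          map[string]Pod // Dotmesh Pods by name
	PVCs          []string       // all Dotmesh PVCs
}

// Assess compares observed state against desired state and returns the
// corrective actions, mirroring the four problems listed above.
func Assess(s State) (start, kill, orphanPVCs []string) {
	eligible := map[string]bool{}
	for _, n := range s.EligibleNodes {
		eligible[n] = true
	}
	covered := map[string]bool{}  // nodes with a healthy Dotmesh Pod
	boundPVC := map[string]bool{} // PVCs attached to some Pod
	for name, p := range s.Pods {
		boundPVC[p.PVC] = true
		switch {
		case !eligible[p.Node]: // problem 2: node shouldn't run Dotmesh
			kill = append(kill, name)
		case !p.Healthy: // problem 3: broken or wrong-image Pod
			kill = append(kill, name)
		default:
			covered[p.Node] = true
		}
	}
	for _, n := range s.EligibleNodes { // problem 1: uncovered nodes
		if !covered[n] {
			start = append(start, n)
		}
	}
	for _, pvc := range s.PVCs { // problem 4: PVCs bound to no Pod
		if !boundPVC[pvc] {
			orphanPVCs = append(orphanPVCs, pvc)
		}
	}
	return start, kill, orphanPVCs
}

func main() {
	start, kill, orphans := Assess(State{
		EligibleNodes: []string{"a", "b"},
		Pods: map[string]Pod{
			"dm-1": {Node: "a", PVC: "pvc-1", Healthy: true},
			"dm-2": {Node: "c", PVC: "pvc-2", Healthy: true}, // c not eligible
		},
		PVCs: []string{"pvc-1", "pvc-2", "pvc-3"},
	})
	fmt.Println(start, kill, orphans) // [b] [dm-2] [pvc-3]
}
```

Because the function is pure, each convergence pass can call it on a fresh snapshot and act on the returned lists; a PVC freed by a killed Pod simply shows up as an orphan on the next pass.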

We don’t immediately kill all Pods that need killing, however. We keep track of a “running population” of Pods that look like they’re working correctly (excluding Pods that are in the process of starting up, as well as ones that are downright broken), and we won’t kill any Pods if doing so would bring the running population below a high water mark of 75% of the cluster.

This fact, combined with the operator considering a Pod as invalid if it’s running the wrong version of the Dotmesh server, means we get rolling upgrades for free: if a new operator is rolled out that expects a new version of Dotmesh, it will automatically kill Pods running the old version - but as soon as it hits 25% of the cluster being down, it will stop killing them. When the Pod termination process completes, those nodes then register as not running Dotmesh, so new Pods are started on them. When they have finished starting, they count towards the “running population”, and so more of the Pods running the old version can be killed, and the cycle continues.
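The high-water-mark throttle driving this cycle fits in a few lines. This is a simplified sketch - `throttleKills` is a hypothetical helper, not the operator’s actual code, and the real accounting distinguishes starting, running and broken Pods:

```go
package main

import "fmt"

// throttleKills limits how many kill candidates we act on in this pass:
// we refuse to let the running population drop below the high water
// mark (75% of eligible nodes, rounded up). Remaining candidates are
// simply retried on a later convergence pass.
func throttleKills(running, eligibleNodes int, candidates []string) []string {
	highWater := (eligibleNodes*3 + 3) / 4 // ceil(75% of eligible nodes)
	allowed := running - highWater
	if allowed <= 0 {
		return nil // already at or below the high water mark
	}
	if allowed > len(candidates) {
		allowed = len(candidates)
	}
	return candidates[:allowed]
}

func main() {
	// 10 eligible nodes, all 10 Pods healthy but running the old image,
	// so all 10 are kill candidates during a rolling upgrade.
	candidates := []string{"p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"}
	killNow := throttleKills(10, 10, candidates)
	fmt.Println(len(killNow)) // 2: killing more would drop below 8 running
}
```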

Just to make it clear, there is no persistent state stored in the Operator - the convergence loop is triggered on every change to a watched object, and it re-assesses the cluster from scratch each time. This means that if Pods we’ve decided to kill terminate on their own (which is quite common for a Pod with something wrong with it), we’ll handle that without any extra work.

If we’re in the middle of a rolling upgrade when the operator gets replaced with an even newer version, and the cluster itself is being upgraded, and there’s a major outage taking many of the nodes down, and a new shipment of nodes has just come online and joined the cluster, and the administrator has changed the configuration so that only certain nodes should be running Dotmesh - the operator will just trudge on, doing what it can to maintain its invariants: every eligible node is running a Dotmesh Pod, every PVC is bound to a Dotmesh Pod, and invalid Dotmesh Pods get killed - but never so many at once that fewer than 75% of eligible nodes are left running a Pod.

Conclusion.

A Kubernetes Operator is a classical cybernetic control system; this should be no surprise, as that’s how Kubernetes itself works. You have a changing world (your Kubernetes cluster), sensors to observe it (watching changes to Kubernetes API objects), and effectors to cause some change in it (creating/modifying/deleting Kubernetes API objects). With those, you are trying to get your world into a desired state as quickly and reliably as possible, despite external forces also affecting the world in unpredictable ways, and the desired state changing with time. Control system designers have, since before computers, been refining techniques to handle this kind of situation, and we software engineers would do well to pay attention to their approaches!

To summarise, I advise:

  • Keep no state in your operator, other than caches of external information. Base your decisions purely on configuration and the observed state of Kubernetes. If you must store some “historical” state (eg, to do something about services that have restarted repeatedly), store extra information in annotations or labels of your API objects, so they are externally visible and persist across operator restarts.
  • Write a single algorithm that observes the configuration and state and takes some steps towards improving it; use that both for initial startup and for incremental changes thereafter.
  • You don’t need to (and sometimes don’t want to) make ALL the changes at once. In particular, things like killing Pods should be rate-limited, to maintain a quota of Ready Pods and ensure good service during an upgrade. If you introduce too much change too quickly, you might over-compensate. Take inspiration from PID loops in the analogue domain.
  • Rather than reacting instantly to every change, set a flag that change has happened and react once per second or so if the flag is set. That way, a sudden flurry of changes in the underlying platform (eg, an entire datacentre goes down, killing thousands of nodes at once) just causes a single assessment of the situation and reaction to it.
  • And most importantly, have fun! Working on these kinds of things is really satisfying. If you’re not enjoying it, you’re not doing it right.