Dotmesh on cloud Kubernetes clusters: PV-per-Node Mode

/blog/pv-per-node/images/featured_hu9afb4280f607755ca76e909932d0f32a_148674_1595x1148_resize_q75_box.jpg

Dotmesh in the cloud

By default, Dotmesh running under Kubernetes stores data in /var/lib/dotmesh on the host filesystem - exactly like when running Dotmesh under Docker. This is great when running your cluster on physical hardware; it’s a fixed location, not inside the container, on every node.

However, it’s not ideal for deploying into cloud environments, where the nodes are virtualised. In these environments, node storage is usually relatively ephemeral; software running on the nodes are supposed to store their important state in some kind of database accessed over the network, or in a separate filesystem explicitly mounted into the virtual node from a persistent storage service such as Amazon’s Elastic Block Store or Google Cloud Platform’s persistent disks.

These persistent storage volumes work rather differently to host disks, in that they can be reattached to other nodes with an API call; and as such, they enable a different way of thinking about that.

PV-per-node mode

Our first move towards a cloud-native storage mode for Dotmesh was a relatively simple one: Rather than using host filesystem storage on every node, we generate a Kubernetes PersistentVolumeClaim (PVC), which results in the appropriate persistent storage service for the cloud host provisioning a PersistentVolume (PV) for Dotmesh to use as its backing store on that node. This means it will persist when nodes are deleted; the Dotmesh Operator will re-attach that PVC to a new Dotmesh server node, rather than creating a new one, so replacing nodes will just cause the data from the removed node to be carried forward to the new one - and cloud persistent volumes are reliable; they won’t just disappear from under Dotmesh sometimes.

Now, that change sounds simple, but realising it required a whole bunch of groundwork to be laid. Our earlier work on the Dotmesh Operator was necessary to create the resulting hybrid between StatefulSet and DaemonSet behaviour.

However, the fun didn’t end there.

Dotmesh uses ZFS as the underlying storage system for Dots, but ZFS isn’t aware of the filesystem namespacing that’s used to implement containers on Linux. We ask Kubernetes to mount the backing-store PVC at a location in the container, but in order to tell ZFS to use the pool file contained therein, we need to provide the path to the file as seen from the host’s namespace. To make this work correctly, we needed to perform some cunning logic to obtain the path as seen on the host, and then create bind-mounts inside the Dotmesh containers so that the same host-namespace path to our PVC’s mountpoint is also visible inside the Dotmesh server container, so that ZFS commands issued inside the container actually work. But… the fun didn’t end there.

Our acceptance test framework has to simulate multiple nodes, which it installs Kubernetes on to create one or more clusters, installs Dotmesh on the clusters, and then does various things. In order to keep test cycles fast on our laptops, rather than creating handfuls of VMs, we use Docker-in-Docker. On the host, a container is created per node, and Docker is run inside each of those containers, with Kubernetes on top. This makes test nodes relatively lightweight things we can create and destroy as we run through the test suite. And because multiple tests might be running code from different branches at once in CI, all resources used by the test containers must have unique names, in one form or another, to prevent collisions.

These look like Kubernetes clusters on bare metal - there’s no cloud provider offering persistent volume storage. So we had to write our own; a component we called the “DIND Flexvolume Driver” provides persistent volume storage by loop-back mounting ext4 filesystems inside files. These files are kept in the /dotmesh-test-pools directory in the top-level docker-in-docker containers - all of which mount that directory in from the same location on the host, so the volumes are available on every node. The “DIND Dynamic Provisioner” then uses that to create persistent volumes for every suitable persistent volume claim it sees. This dynamic duo, together, provide a service that looks just like Amazon EBS from the perspective of our test Kubernetes clusters; volumes that are detached on one node can be re-attached on another node. But the thing is… the fun still didn’t end there.

Testing our code to find the PVC mount location “on the host” and set up bind mounts to make that mount location also appear in the container’s namespace was, as you might imagine, made difficult by the fact that the “hosts” used in the tests are actually docker-in-docker containers. With all the work we’d done already, making this work was a relatively simple matter of ensuring that the temporary directories Kubernetes uses to assemble persistent volume mountpoints in the docker-in-docker containers were reconfigured to be in the shared /dotmesh-test-pools directory, so that the paths in the real host were therefore the same as the paths inside the docker-in-docker container pretending to be a host, so the logic that recreates the paths from what appears to be the host inside the Dotmesh containers was actually producing paths that matched those on the actual host so that ZFS would still work.

Finally, the fun ended.

Confused yet? I’m sorry. We master Kubernetes, so that you don’t have to. Dotmesh hides all this complexity for you and just serves up dots as persistent volumes on demand :-)

Orphaned PVs

That’s all well and good, but if your cluster shrinks, you’ll end up with some orphaned PVs; every node is running Dotmesh with its own PV attached, but the “spare” PVs will just sit there until a new node joins to claim them. This isn’t good - if the PV was on a node that died while doing something interesting, that PV might have interesting state we’d like back in the cluster. The Dotmesh Operator is designed to support noticing orphaned PVs and bringing up additional Dotmesh servers (on nodes that are already running one) to handle them, but we need to rearrange some plumbing to make that work; Dotmesh current assumes it’s the only instance of the server running on the node, so we need to change that. Also, we need to decide where to draw the line at enabling this logic. A PV might be temporarily orphaned during a cluster upgrade that removes each node in turn then, shortly afterwards, returns it, all shiny and new. We probably shouldn’t consider a PV truly orphaned until it’s been waiting for more than a certain period of time, and we should also migrate existing PVs off of nodes that are hosting more than one if the cluster grows, re-balancing existing PVs rather than creating new ones. Watch our progress on that in this issue in our issue tracker!.

The Next Generation: Decoupling storage from containers

Dotmesh replicates all commits on all dots to every node in the cluster. That’s the right thing to do for small clusters, as it ensures fault tolerance (there are multiple copies) and high performance (every node has an up-to-date copy of every dot’s last snapshot, so only the latest diffs need to be moved around). But in larger clusters of ten or more nodes, it starts to get unweildy - not to mention expensive - to manage so many copies of everything.

Our proposed fix for this is to have a configurable number of storage nodes in the cluster; the Dotmesh Operator will run the Dotmesh Server on those nodes, each with one (or more) attached PVs, as before. However, the Operator will run a separate, and smaller, Dotmesh Agent on every node that will provide the FlexVolume interface to mount storage on that node. This Agent will communicate with the Dotmesh Servers to find the latest state of a dot, and then set up an NFS mount from that Dotmesh Server to the node requiring the mount.

This has several benefits:

  • We don’t need to replicate everything to every node; we create enough replicas to get the performance and fault tolerance we need, traded off against hosting costs.
  • We can handle container migrations even faster than we currently do; we can just move the NFS mount to another container, rather than needing to migrate the latest diffs for the dot.
  • As we’ll use portable service VIPs for the NFS servers, we can migrate the Dotmesh Server instances between nodes (bringing their PVs along with them); the mount IP will remain the same so existing NFS mounts will keep working.
  • As Dotmesh will consume a small number of large PVs from the cloud provider in order to provide an arbitrary number of smaller PVs to application containers, we overcome the fact that cloud provided persistent volumes are really too heavyweight for use at container scale; they are slow to migrate between nodes, and subject to punitive API rate limits that hamper migrating lots of them around the cluster. Dotmesh’s NFS mounts will be very lightweight and easily moved around, with no rate limits.

Conclusion

Building the infrastructure to test software that uses persistent volume claims in lightweight Kubernetes test clusters is hard work, but somebody’s got to do it if we’re to have well-tested storage software! All this infrastructure work, however, means we’re going to have some exciting new features for larger Kubernetes cluster users coming out over the next few weeks…