Finding goroutine leaks


Go and Goroutines

We make good use of the Go programming language at Dotmesh. Go encourages concurrent and parallel programming by providing lightweight threads, which it calls goroutines, and championing synchronous channels as a means of communication and synchronisation between goroutines.

Go has a garbage-collected memory model: the runtime system identifies resources that are no longer in use and reclaims them, so programs written in Go can allocate memory whenever they need it, without having to work out when it's no longer needed and explicitly release it.

However, that didn’t stop the reported memory usage of the core Dotmesh server (written in Go) processes underpinning the Dothub from rising constantly, so we had to find out why…

Goroutine leaks

The problem was that we were slowly accumulating goroutines blocked waiting on synchronous channels that nobody would ever communicate with again. The garbage collector wasn't able to detect that these goroutines were deadlocked - and thus would never do anything ever again - so it couldn't release their memory.

This state of affairs occurs very easily when programming with synchronous channels, because they are a very low-level synchronisation mechanism. Their correct operation depends on two parts of the program cooperating correctly, and it's hard to reason about all the possible interactions between two parts of a program, so mistakes are easy to make. You can't easily look at the text of a program (or have a static analysis tool look at it) and detect communication errors involving synchronous channels. You have to run the code and see whether it misbehaves - which might be as obvious as producing wrong output, or as fuzzy as hanging (has it hung, or is it just taking a long time?), or worse: the program appears to work absolutely fine, but slowly consumes more and more memory in ways that only become noticeable after weeks of usage on a busy system.

Finding the leaks

To find these kinds of leaks, we need to take a running system that’s been leaking for a while, and analyse the list of running goroutines to see which ones we think should be there, and to identify the ones that shouldn’t. As it’s a continuous leak, the longer we leave it, the more of the stuck goroutines there should be - so they’ll be easier to spot.

Go gives us a basic mechanism to enumerate the goroutines in the system, and to get a stack dump showing the current state of each. That's exactly what we'll need.

Getting a goroutine dump

Go provides a facility to extract the list of goroutine stacks via http. The output looks something like this:

[email protected]:/# curl http://localhost:6060/debug/pprof/goroutine?debug=1
goroutine profile: total 18
3 @ 0x42ecaa 0x42ed5e 0x44abb6 0xa20a63 0x45bfa1
#	0x44abb5	time.Sleep+0x165	/usr/lib/go-1.10/src/runtime/time.go:102
#	0xa20a62	main.runForever+0x162	/go/src/github.com/dotmesh-io/dotmesh/cmd/dotmesh-server/pkg/main/utils.go:280

2 @ 0x42ecaa 0x43e9e0 0x6a61fb 0x45bfa1
#	0x6a61fa	net/http.(*persistConn).writeLoop+0x14a	/usr/lib/go-1.10/src/net/http/transport.go:1822

[...]

The lines with an @ symbol in them introduce a stack trace; the number at the start of the line indicates how many times that same stack exists - meaning that a number of goroutines are stopped at the same point in the code. The stack traces are sorted in descending order of frequency, so our leaks are going to be somewhere near the top!

As it happens, I didn’t need to use this interface to get the goroutine dump for the Dotmesh server; we already have an introspection API that returns the goroutine dump, as well as a lot of other useful information about the internal state of the server, so I just used that.

However, a real running Dotmesh server has lots of perfectly legitimate goroutines - the stuck goroutines won't necessarily be sitting right at the top of the list, and going through the 1,623 goroutine stacks I pulled down from a production node by hand didn't look like much fun. Clearly, it would be much easier to get a computer to help me do that…

Analysing the dump

I knocked together a tool to parse the goroutine dumps. The result was a list of goroutines, each of which had a stack, which consisted of a sequence of stack frames, which consisted of a function name, a source path, and a line number.

The core of the tool was classifying each goroutine’s stack, to try and label what it’s doing. I built a list of classifiers, which are functions accepting a stack, and returning either a classification name if it matched, or a false value if not. The main loop takes each stack, applies each classifier to it in order until one matches, then assigns that stack to that classification. At the end, the list of stacks in each classification is written to a file named after that classification, and a summary listing how many stacks are in each classification (sorted by number of stacks in each) is printed out.
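Here's a sketch of that classification loop in Go (the names - Stack, stackMentions, classify - are illustrative, not my tool's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// A Stack is the list of function names in one goroutine's stack trace.
type Stack []string

// A classifier returns a label if it recognises the stack, or "" if not.
type classifier func(Stack) string

// stackMentions builds a classifier matching any stack that contains a
// function whose name starts with the given prefix.
func stackMentions(prefix, label string) classifier {
	return func(s Stack) string {
		for _, fn := range s {
			if strings.HasPrefix(fn, prefix) {
				return label
			}
		}
		return ""
	}
}

// classify applies each classifier in order; the last resort is UNKNOWN.
func classify(classifiers []classifier, s Stack) string {
	for _, c := range classifiers {
		if label := c(s); label != "" {
			return label
		}
	}
	return "UNKNOWN"
}

func main() {
	classifiers := []classifier{
		stackMentions("time.Sleep", "sleeper"),
		stackMentions("net/http.(*persistConn)", "http-transport"),
	}
	stacks := []Stack{
		{"time.Sleep", "main.runForever"},
		{"net/http.(*persistConn).writeLoop"},
		{"main.mystery"},
	}
	// Count stacks per classification, then print a summary sorted by count.
	counts := map[string]int{}
	for _, s := range stacks {
		counts[classify(classifiers, s)]++
	}
	labels := make([]string, 0, len(counts))
	for l := range counts {
		labels = append(labels, l)
	}
	sort.Slice(labels, func(i, j int) bool { return counts[labels[i]] > counts[labels[j]] })
	for _, l := range labels {
		fmt.Printf("%4d %s\n", counts[l], l)
	}
}
```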

Most of the classifiers used a library function, stack-mentions?, which returns true if the stack includes a reference to a function whose name starts with a supplied prefix. The prefix check was useful because Go code like this:

func foo() {
    go func() {
        ...
    }()

    func() {
        wibble.Lock()
        defer wibble.Unlock()
        ...
    }()
}

…leads to stack dumps mentioning function names such as main.foo.func1 and main.foo.func2, where the anonymous functions have been given names; passing main.foo to stack-mentions? will cause it to match stacks that mention anything inside foo.

The final classifier in the list just returns the classification UNKNOWN. With that in place, I ran my tool over goroutine dumps from the live system, looked at what fell through into UNKNOWN, and wrote classifiers for them, until every goroutine was classified.

With THAT done, I could easily ignore large numbers of goroutines that I knew were supposed to be present on a healthy system, and see what was left.

Striking oil

Lo and behold, I found two leaks.

Firstly, I had a lot of instances of a supposedly short-lived worker goroutine sitting around, blocked on sending their final status (success or failure) through a channel. Inspecting the code revealed that the caller created the goroutine, did some other processing, then read the success/failure result from the goroutine through this channel - but a few error-handling cases in the caller returned without ever reading from the result channel. The result was that if the caller failed in certain ways, the worker goroutine would be stuck forever. Starting a goroutine to asynchronously receive and discard the worker's result was an easy fix!
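That fix can be sketched like this (startWorker and caller are hypothetical names for illustration, not Dotmesh's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// startWorker launches a worker and returns the channel it will report on.
func startWorker() chan error {
	result := make(chan error) // unbuffered: the worker's send blocks until read
	go func() {
		// ... do the actual work ...
		result <- nil // report success (or an error)
	}()
	return result
}

// caller has error paths that return before reading the worker's result.
// The fix: before any early return, hand the channel to a goroutine that
// drains and discards it, so the worker can complete its send and exit.
func caller(fail bool) error {
	result := startWorker()
	if fail {
		go func() { <-result }() // discard the result asynchronously
		return errors.New("caller bailed out early")
	}
	return <-result
}

func main() {
	fmt.Println(caller(true))
	fmt.Println(caller(false))
}
```

An alternative design would be to give the result channel a buffer of one, so the worker's single send can always complete; the draining goroutine has the advantage of not needing to change the channel's semantics.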

The second one was more interesting. We have a publish/subscribe library; you can register a channel to subscribe to a topic, and you can publish events to that topic. When you do so, a copy of the event will be sent down every channel subscribed to it. Because they are synchronous channels, writing to them blocks until somebody reads the message; so in order to prevent slow subscribers delaying the delivery of the event to other subscribers, the publish operation sends to each subscriber in its own goroutine. This means they all proceed in parallel, and any one blocking won’t delay the others. We had a lot of goroutines blocked on this channel send.
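Here's a minimal sketch of that pattern (the Topic type and its methods are illustrative names, not the actual library's API):

```go
package main

import (
	"fmt"
	"sync"
)

// Topic is a minimal publish/subscribe topic using synchronous channels.
type Topic struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers and returns a new subscriber channel.
func (t *Topic) Subscribe() chan string {
	t.mu.Lock()
	defer t.mu.Unlock()
	ch := make(chan string) // unbuffered: sends block until the subscriber reads
	t.subs = append(t.subs, ch)
	return ch
}

// Publish sends the event to each subscriber in its own goroutine, so one
// slow subscriber can't delay delivery to the others - but each of those
// goroutines blocks forever if its subscriber never reads.
func (t *Topic) Publish(event string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, ch := range t.subs {
		go func(ch chan string) { ch <- event }(ch)
	}
}

func main() {
	t := &Topic{}
	a, b := t.Subscribe(), t.Subscribe()
	t.Publish("commit")
	fmt.Println(<-a, <-b)
}
```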

It took me a little while to trace the problem, but in the end, it turned out to be quite simple.

Every branch of a dot has a goroutine that’s responsible for listening to a channel for notifications of new commits happening on that node, and then updating the etcd cluster that shares state between the nodes in the cluster. It spends its time blocking on reading from that channel; when it receives a notification, it does the write into etcd, then goes back to the main loop, which checks to see if the branch has been deleted. If it has, it terminates the goroutine; otherwise, it goes back into blocking to wait for a new commit.

The way it checks for branch deletion is to attempt to read from a channel it has subscribed to receive notifications of branch deletion. It does a non-blocking read, so if there’s nothing to read, the branch hasn’t been deleted yet, and it proceeds.
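A non-blocking read in Go is a select with a default case; the deletion check described above looks something like this (with hypothetical names, not the actual Dotmesh code):

```go
package main

import "fmt"

// checkDeleted performs a non-blocking read on the deletion channel: if a
// notification is waiting, the branch is gone; if not, carry on.
func checkDeleted(deleted chan struct{}) bool {
	select {
	case <-deleted:
		return true // a deletion message arrived: the branch is gone
	default:
		return false // nothing to read: the branch still exists
	}
}

func main() {
	deleted := make(chan struct{}, 1)
	fmt.Println(checkDeleted(deleted)) // false: no deletion yet
	deleted <- struct{}{}
	fmt.Println(checkDeleted(deleted)) // true: deletion notified
}
```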

The problem was that the loop spent most of its time waiting for commit notifications, so when a branch was deleted, its goroutine was usually blocked receiving such a notification. A deleted branch will never have commits, so no notification will ever arrive - it'll wait forever. Meanwhile, the main loop still holds its open subscription channel for branch-deletion messages, and sure enough, a message signalling the deletion of that branch would be blocked trying to send on that channel.

The fix was simple - just time out every now and then when waiting for commit notifications and go around the loop again.

With those two fixed, our memory usage looks a bit flatter now - it’s still growing, but merely in line with growing traffic in the cluster!