At 5:53 PM on December 19th, the Puppet Forge hit a patch of turbulence. This resulted in intermittent service failures and extremely slow responses, and lasted until 2:42PM on December 20th - almost 21 hours, all told. We'd like to apologize for the performance issues and provide you with some explanation of what went wrong and what we’re doing about it.
When we launched the new Puppet Forge last month, one of our underlying concerns was that we wanted real data about what was happening, both about user behavior and about application and server behavior. (For a great talk about the importance of gathering good metrics, I will refer you to Coda Hale.) To that end, the new Puppet Forge was set up to report a large number of measurements to Graphite via StatsD. And, somewhat more recently, we started using New Relic to monitor both the application stack and the servers.
These tools have not only provided us with fascinating data but have already helped us diagnose and squash a number of bugs, and we plan to add additional instrumentation as we find gaps in our knowledge.
The New Relic instrumentation and the StatsD library take somewhat different approaches in how they send reports. We use New Relic with SSL enabled, so all communication takes place over an encrypted HTTP connection. To combat the performance cost inherent in this approach, the New Relic communication takes place independently of the main application thread, so response times are unaffected. The StatsD library we've employed takes advantage of the extremely low cost of UDP packet transmission and sends measurements from the application thread (in our case, over a secure tunnel to our StatsD server).
Sending StatsD measurements in the application thread tends to work out just fine, since UDP is basically a "fire-and-forget" operation - the sender puts its message in a properly addressed bottle, drops it in the ocean, and assumes destiny will cause the message to arrive; on the off-chance that this isn't a quick process, the library simply aborts sending the metric after 0.1s. This timeout mechanism does potentially cause some data loss, but not necessarily more than you might expect from UDP itself.
What went wrong?
At 4:08PM on the December 19th, the Puppet Forge began to show increases in the average request time, which continued to worsen as time went on. New Relic recorded the first serious failures about 30 minutes later, and the average response time soon passed 1.5s, which it stayed above for the remainder of the event.
Around the same time, our Ops team deployed some infrastructure changes. Of particular interest was a VPN tunnel from our cloud-based VMs back inside our local network, which our Linode-based Puppet Forge servers also received this update. While we're still a little fuzzy on the details, this change caused some trouble for the StatsD library, which began timing out when trying to send its measurement reports, and a significant number of the packet transmissions began to timeout, meaning that each measurement we made was adding around 100ms to the response time. Individually or in small quantities, that is still a reasonably small delay, but our servers may sometimes send hundreds of such measurements per request, and those delays began to add up quickly.
Average Response Times
While we now know that the StatsD timeouts were the cause of our outage, our search for the failure was frustrated by a lack of data. New Relic let us see clearly what our application was doing, but all reports indicated that we were just spending a lot of time executing our Ruby logic. Except, we were barely utilizing any processor time. We considered that we might have been network bound, but there was almost no traffic there. Database queries hadn't gotten any slower. Our memory usage was around 50% of physical RAM. We weren’t swapping, and we weren’t reading from or writing to disk. And we were not trying to serve an unusual request load.
It was rather as if the Puppet Forge was simply calling sleep() in the middle of requests.
When we began looking for causes outside the Puppet Forge as a system, we stumbled across the recent VPN changes. A quick test showed that disabling metrics gathering did relieve the problem, and an educated guess led us to disable the StatsD reporting which caused an almost immediate return to normalcy.
Where do we go from here?
For the time being, we're leaving our StatsD monitoring disabled. While this is a non-ideal situation for us to be in, we're unwilling to flip that particular switch back on until we can demonstrate that doing so has a negligible effect.
The VPN, intended to be a low-risk win for everyone, ended up interacting poorly with the Puppet Forge's particular configuration, in part because not everyone knows how the application works and nobody knows everything about it. We (the engineers) will be working with our Ops team to build a better shared understanding, an ongoing dialog, and clear documentation around how the Puppet Forge works, so that all of us will be more aware of how changes will affect the system, and we'll all be better able to diagnose and solve the next set of problems we face.
Above all, we failed to communicate well that we were having difficulties. To solve that, we will be launching a site to act as a central source for information about service interruptions and outages on the Puppet Forge, including live monitoring data and diagnostic updates. This will enable us to communicate any problems we have – and their resolutions – much more quickly and easily. We'll follow up on this post when we have a link for you.
We recognize that every failed request is an inconvenience to someone who is trying to get things done, and we're working hard to ensure that our service provides the dependability you need. Events like this are never fun, but we are working deliberately to learn from them when they do happen, so that we can continue to provide a service that the entire community can rely on. If you have any questions or concerns about the recent Puppet Forge outage, please submit feedback to firstname.lastname@example.org.