Puppet Forge Outage: What Went Wrong?

At 5:53 PM on December 19th, the Puppet Forge hit a patch of turbulence, resulting in intermittent service failures and extremely slow responses that lasted until 2:42 PM on December 20th - almost 21 hours, all told. We'd like to apologize for the performance issues and provide some explanation of what went wrong and what we're doing about it.

Background

When we launched the new Puppet Forge last month, one of our underlying goals was to gather real data about what was happening, both in terms of user behavior and application and server behavior. (For a great talk about the importance of gathering good metrics, I will refer you to Coda Hale.) To that end, the new Puppet Forge was set up to report a large number of measurements to Graphite via StatsD. And, somewhat more recently, we started using New Relic to monitor both the application stack and the servers.

These tools have not only provided us with fascinating data but have already helped us diagnose and squash a number of bugs, and we plan to add additional instrumentation as we find gaps in our knowledge.

The New Relic instrumentation and the StatsD library take somewhat different approaches in how they send reports. We use New Relic with SSL enabled, so all communication takes place over an encrypted HTTP connection. To combat the performance cost inherent in this approach, the New Relic communication takes place independently of the main application thread, so response times are unaffected. The StatsD library we've employed takes advantage of the extremely low cost of UDP packet transmission and sends measurements from the application thread (in our case, over a secure tunnel to our StatsD server).

Sending StatsD measurements in the application thread tends to work out just fine, since UDP is basically a "fire-and-forget" operation - the sender puts its message in a properly addressed bottle, drops it in the ocean, and assumes destiny will cause the message to arrive; on the off-chance that this isn't a quick process, the library simply aborts sending the metric after 0.1s.  This timeout mechanism does potentially cause some data loss, but not necessarily more than you might expect from UDP itself.
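As a rough illustration of that pattern, here is a minimal sketch of what such a send looks like; it is not the library's actual code, and the hostname and helper name are hypothetical:

```ruby
require 'socket'
require 'timeout'

# Hypothetical StatsD host; the real library takes this from configuration.
STATSD_HOST = 'stats.example.internal'
STATSD_PORT = 8125

# Fire-and-forget: send one timing metric over UDP from the request thread,
# and give up after 0.1s rather than hold the request up.
def report_timing(name, milliseconds)
  Timeout.timeout(0.1) do
    socket = UDPSocket.new
    socket.send("#{name}:#{milliseconds}|ms", 0, STATSD_HOST, STATSD_PORT)
    socket.close
  end
rescue Timeout::Error
  # The measurement is simply dropped - comparable to UDP losing the packet.
end
```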

What went wrong?

At 4:08 PM on December 19th, the Puppet Forge began to show increases in average request time, which continued to worsen as time went on. New Relic recorded the first serious failures about 30 minutes later, and the average response time soon passed 1.5s, which it stayed above for the remainder of the event.

Around the same time, our Ops team deployed some infrastructure changes. Of particular interest was a VPN tunnel from our cloud-based VMs back into our local network; our Linode-based Puppet Forge servers also received this update. While we're still a little fuzzy on the details, this change caused trouble for the StatsD library: a significant number of its packet transmissions began to time out, meaning that each timed-out measurement added around 100ms to the response time. Individually or in small quantities, that is still a reasonably small delay, but our servers may sometimes send hundreds of such measurements per request, and those delays began to add up quickly.
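To put rough numbers on that (the 200-metric figure below is a hypothetical example, not a measurement from our servers):

```ruby
# Back-of-the-envelope estimate: if most sends hit the timeout, the penalty is
# roughly (metrics per request) x (timeout per send).
timeout_per_send    = 0.1   # seconds, the library's abort threshold
metrics_per_request = 200   # hypothetical; "hundreds" per the text above
added_latency = metrics_per_request * timeout_per_send
puts "~#{added_latency} seconds of extra latency per request"  # => ~20.0 seconds
```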

[Chart: Average Response Times]

While we now know that the StatsD timeouts were the cause of our outage, our search for the failure was frustrated by a lack of data. New Relic let us see clearly what our application was doing, but all reports indicated that we were simply spending a lot of time executing our Ruby logic. And yet we were barely utilizing any processor time. We considered that we might have been network-bound, but there was almost no traffic there. Database queries hadn't gotten any slower. Our memory usage was around 50% of physical RAM. We weren't swapping, and we weren't reading from or writing to disk. And we were not trying to serve an unusual request load.

It was rather as if the Puppet Forge was simply calling sleep() in the middle of requests.

When we began looking for causes outside the Puppet Forge as a system, we stumbled across the recent VPN changes. A quick test showed that disabling metrics gathering relieved the problem, and an educated guess led us to disable the StatsD reporting, which caused an almost immediate return to normalcy.

Where do we go from here?

For the time being, we're leaving our StatsD monitoring disabled. While this is not an ideal situation for us to be in, we're unwilling to flip that particular switch back on until we can demonstrate that doing so has a negligible effect on performance.

The VPN, intended to be a low-risk win for everyone, ended up interacting poorly with the Puppet Forge's particular configuration, in part because not everyone knows how the application works and nobody knows everything about it. We (the engineers) will be working with our Ops team to build a better shared understanding, an ongoing dialog, and clear documentation around how the Puppet Forge works, so that all of us will be more aware of how changes will affect the system, and we'll all be better able to diagnose and solve the next set of problems we face.

Above all, we failed to communicate well that we were having difficulties. To solve that, we will be launching a site to act as a central source for information about service interruptions and outages on the Puppet Forge, including live monitoring data and diagnostic updates. This will enable us to communicate any problems we have – and their resolutions – much more quickly and easily. We'll follow up on this post when we have a link for you.

We recognize that every failed request is an inconvenience to someone who is trying to get things done, and we're working hard to ensure that our service provides the dependability you need. Events like this are never fun, but we are working deliberately to learn from them when they do happen, so that we can continue to provide a service that the entire community can rely on. If you have any questions or concerns about the recent Puppet Forge outage, please submit feedback to forge@puppetlabs.com.

Comments

Christy McCreath

Hi Brian, apologies for the delayed reply. I checked in with the Puppet Forge team regarding your question. Their reply below:

The Graphite server we use is hosted on the "local" end of the VPN connection, because it's used by many other (internal-only) metrics reporters. Also, given the volume of data we're collecting, the disk costs in the cloud would quickly become unreasonable.

Christy McCreath

Hi Michael, I apologize for the delayed response. I shared your comment with our Puppet Forge team, and their response was:

The StatsD library wraps the entire UDP transaction in a Ruby Timeout instance. This meant that factors like DNS were being rolled into that timeout period in addition to the packet transmission itself, and Ruby (not the underlying UDP implementation) would abort the transaction if it took too long. As others have pointed out, DNS seems like the most likely culprit for the timeouts.

As for StatsD, we discussed hosting it alongside the application, but leveraged the StatsD instance we already had running for expedience's sake. Running it on the application server would likely have helped significantly.
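For illustration of the DNS point above, here is a minimal sketch (the hostname and helper name are hypothetical, and this is not the library's code) of one way to keep the lookup out of that 0.1s window: resolve the StatsD host once at startup and hand the socket a numeric address, so only the UDP send itself is on the clock.

```ruby
require 'resolv'
require 'socket'
require 'timeout'

# Hypothetical names: resolve the StatsD host once at startup so the
# per-metric timeout only has to cover the UDP send, not a DNS lookup.
STATSD_IP   = Resolv.getaddress('stats.example.internal')
STATSD_PORT = 8125

def send_metric(payload)
  Timeout.timeout(0.1) do
    socket = UDPSocket.new
    socket.send(payload, 0, STATSD_IP, STATSD_PORT)  # numeric IP: no lookup here
    socket.close
  end
rescue Timeout::Error
  nil  # still fire-and-forget: drop the measurement rather than block
end
```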

Brian Carpio

Is your Graphite server (I assume you are using Graphite) on the other side of your VPN? Is there a reason why you decided not to put it in "The Cloud"?

Michael Hood

I'm not sure what you're referring to when you say you had a timeout set on the UDP messages being sent to statsd.

Any sane UDP implementation is calling sendto() with the MSG_DONTWAIT flag, which should guarantee non-blocking operation. Not using alarms or select.

Aside, statsd is intended to be run locally to the application(s) sending metrics to it. It should have been on the app server(s) with graphite being the concentration point on a centralized server. You're defeating its purpose by sending a flood of UDP over the network. By default statsd would have aggregated 10 seconds worth of data and sent one message to graphite over a persistent TCP connection.

Hope that helps.
