Root Cause Analysis for the 5/24/2012 Outage

05/25/2012 – 02:29 PM PDT

Description

On 5/24/2012 at 2:30 PM PDT, the Yammer hardware responsible for maintaining and delivering feeds experienced a kernel panic and went offline. As a result, incoming feeds began to be delivered through secondary channels.

The feed servers were restarted to clear the proxied connections. While the boxes were coming back up, traffic continued to arrive, and the combination of live and queued traffic made it difficult for the servers to recover. After some quick maintenance, the feed servers came back online and traffic was restored. At that point, the servers worked through the backlog and stabilized.

Impact

The site was fully down for 45-60 minutes. In addition, about 5,000 messages were queued awaiting delivery to the feed servers. Based on our analysis, no data was lost in the process.

Analysis

While we have seen isolated kernel panics from time to time, we are currently investigating why this feed server's kernel panicked the way it did and why it caused such widespread issues across the network.

Mitigation Steps

  • We are looking to relax some of the thread limits set on the feed servers so they handle high traffic more gracefully.
  • We are looking to make our internal monitoring tools more granular so we can detect a kernel panic proactively.
  • We will investigate limiting the percentage of available threads that can be used for proxying, which could also help prevent the deadlocking that occurred in the system (see the sketch below).
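As a rough illustration of that last item, here is a minimal sketch of one way to cap the share of a worker pool that proxy traffic may occupy, so local feed work always has headroom. The class, pool size, and 25% budget are hypothetical examples, not the actual feed server configuration.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.Semaphore;

    // Hypothetical sketch: bound the number of worker threads that proxied
    // connections may hold at once, so proxy work cannot exhaust the pool
    // and deadlock local feed processing.
    public class ProxyThreadLimiter {
        private final ExecutorService workers;
        private final Semaphore proxyPermits;

        public ProxyThreadLimiter(int totalThreads, double proxyFraction) {
            this.workers = Executors.newFixedThreadPool(totalThreads);
            // e.g. totalThreads = 200, proxyFraction = 0.25 -> at most 50 proxy tasks
            this.proxyPermits = new Semaphore((int) (totalThreads * proxyFraction));
        }

        // Proxy work must acquire a permit; if the proxy budget is exhausted
        // the task is rejected immediately instead of tying up the whole pool.
        public void submitProxyTask(Runnable task) {
            if (!proxyPermits.tryAcquire()) {
                throw new RejectedExecutionException("proxy thread budget exhausted");
            }
            workers.submit(() -> {
                try {
                    task.run();
                } finally {
                    proxyPermits.release();
                }
            });
        }

        // Local feed work is never blocked by the proxy budget.
        public void submitFeedTask(Runnable task) {
            workers.submit(task);
        }
    }

The key design point is that the proxy budget fails fast rather than queueing, so a flood of proxied connections degrades proxying alone instead of taking down feed delivery with it.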

 
