ROOT CAUSE ANALYSIS: Service Interruptions, August 7 & 8, 2014

Last Wednesday, August 6, at 4:48PM, PDT, the Yammer on-call team began receiving alerts about an unexpected increase in load on the service that supports Yammer user Inboxes. In response, the team made a series of attempts to rebalance the load and keep the platform stable. After the initial spike, the load leveled off and did not increase further on Wednesday.

On Thursday, August 7, at 6:07AM, PDT, the on-call team again began receiving alerts of increasing traffic across the same service. As the number of requests grew, the Yammer platform became unable to return successful query responses, leaving most Yammer networks unavailable until 8:25AM, PDT, when the on-call team achieved a partial restoration. The team continued to monitor the load graphs until they stabilized again at 8:52AM, PDT, and Yammer posted an all-clear at 10:00AM, PDT, once the on-call team was confident that Yammer network access had been restored. The team then continued to monitor the situation in an effort to establish the source of the elevated requests.

At 4:15PM, PDT, the same day, the on-call team again received alerts that load times were elevated on many Yammer networks. The on-call team was able to stabilize the Yammer platform by introducing additional circuit breakers, resulting in only brief downtime for some customers during this period. Load times across Yammer networks then plateaued again at 6:50PM, PDT. At this point, the cause of the unexpectedly elevated load times was still unclear to the Yammer on-call team, and the investigation continued.
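
For readers unfamiliar with the term, a circuit breaker is a standard resiliency pattern rather than anything Yammer-specific: a wrapper around calls to a struggling dependency that trips open after a run of failures and serves a fallback until a cooldown elapses, so the rest of the platform is not dragged down with it. The sketch below (in Java, with hypothetical class and parameter names; it is not Yammer's implementation) shows the general idea.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    // Minimal circuit-breaker sketch (illustrative only): trips open after a run
    // of consecutive failures, fails fast while open, and allows a trial request
    // once the cooldown has elapsed.
    public final class CircuitBreaker {
        private enum State { CLOSED, OPEN }

        private final int failureThreshold;   // consecutive failures before tripping
        private final Duration cooldown;      // how long to stay open before retrying
        private int consecutiveFailures = 0;
        private State state = State.CLOSED;
        private Instant openedAt = Instant.MIN;

        public CircuitBreaker(int failureThreshold, Duration cooldown) {
            this.failureThreshold = failureThreshold;
            this.cooldown = cooldown;
        }

        public synchronized <T> T call(Supplier<T> request, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (Instant.now().isBefore(openedAt.plus(cooldown))) {
                    return fallback.get();   // still cooling down: fail fast
                }
                state = State.CLOSED;        // simplified half-open: allow one trial
            }
            try {
                T result = request.get();
                consecutiveFailures = 0;     // success resets the failure streak
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    state = State.OPEN;
                    openedAt = Instant.now();
                }
                return fallback.get();
            }
        }
    }

A breaker of roughly this shape would wrap each call to the overloaded service, for example breaker.call(() -> inboxClient.fetch(userId), () -> Inbox.empty()) with hypothetical inboxClient and Inbox.empty() names, so that callers receive a degraded response instead of piling further load onto the struggling service.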

Our service health monitoring system detected elevated load times again at 5:49AM, PDT, on Friday, August 8. This load triggered inaccurate read/unread counts for many Yammer users’ Inboxes. The on-call team was then able to trace the unexpected traffic to a buildup of Inbox-related requests triggered by a subset of users. When these requests were halted, all server traffic returned to expected levels and the incident was closed at 3:30PM, PDT. The on-call team spent the remainder of the day running scripts to update Yammer user Inboxes with accurate read/unread message counts. This work was completed at 6:15PM, PDT, on Friday.
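
Restoring the counters is, in principle, a straightforward backfill: recompute each affected user's unread total from the authoritative message store and overwrite the cached counter. The sketch below is purely illustrative; the InboxStore interface, method names, and storage model are hypothetical and do not describe Yammer's actual schema or tooling.

    // Hypothetical backfill sketch: walk the affected users and replace each
    // cached unread count with a freshly computed value from the source of truth.
    public final class UnreadCountBackfill {
        public interface InboxStore {
            Iterable<Long> allUserIds();
            long countUnreadMessages(long userId);        // authoritative count
            void setCachedUnreadCount(long userId, long count);
        }

        public static void run(InboxStore store) {
            for (long userId : store.allUserIds()) {
                long actual = store.countUnreadMessages(userId);
                store.setCachedUnreadCount(userId, actual);
            }
        }
    }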

To prevent a recurrence of an incident like this, the Yammer Engineering team has consulted with a third-party database vendor on optimizing the database configuration to improve its performance. The team has also taken steps to implement additional safeguards around the affected service to ensure that traffic to it cannot grow quickly enough to threaten platform stability in the future.
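
The post does not say what form those safeguards take; a common choice for this kind of protection is a per-client rate limit, for example a token bucket that sheds excess Inbox-related requests before they reach the service. The sketch below illustrates that general technique; the class, parameters, and thresholds are hypothetical and should not be read as Yammer's actual configuration.

    // Minimal token-bucket sketch (illustrative only): each caller gets up to
    // `capacity` tokens that refill at `refillPerSecond`; a request without an
    // available token is rejected instead of being passed to the backend.
    public final class TokenBucket {
        private final double capacity;
        private final double refillPerSecond;
        private double tokens;
        private long lastRefillNanos;

        public TokenBucket(double capacity, double refillPerSecond) {
            this.capacity = capacity;
            this.refillPerSecond = refillPerSecond;
            this.tokens = capacity;
            this.lastRefillNanos = System.nanoTime();
        }

        public synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;    // allow the request
            }
            return false;       // shed the request before it reaches the service
        }
    }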

There will be no further updates on this incident.
