[ROOT CAUSE ANALYSIS] Service Interruption; 12/16/2013

At 4:15pm PST on December 16th, Yammer on-call engineers were paged and began investigating platform performance issues. At 4:54pm PST, a large batch of servers that handle message feeds was identified as unresponsive.

Further investigation identified running processes that had consumed all resources on the machines due to a buildup of data. Engineers stopped these processes and began slowly restoring service in order to ensure data consistency.

At approximately 7:00pm PST, the services were brought back online, but many networks were still experiencing latency. By 9:18pm PST, the backlog of data was fully cleared and all message deliveries returned to normal.

To prevent similar service degradation in the future, we are currently investigating more redundant levels of issue isolation and operating-system-level enforcement. We are also working on ways to improve the mean time to recovery by streamlining the service restart scenarios.
