[ROOT CAUSE ANALYSIS] Service Interruptions; 1/14 and 1/15/2014

At 1:56pm PST on January 14, the Yammer On-Call team began receiving failure alerts from the machines that serve Yammer network feeds. The team quickly determined that the alerts were caused by a proactive diagnostic process, run manually, that triggered an unlikely chain reaction and ultimately made Yammer unreliable for several hours.

The diagnostic process created resource contention across a distributed data store, which in turn caused temporary website performance issues. A logic bug in a polling client then amplified the performance issue with an influx of requests. Once the bug was identified, the responsible service was disabled, and service returned to normal at 9:30pm PST. Restrictions are being put in place immediately to prevent diagnostic processes from consuming more than a small, fixed amount of resources on this cluster.
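The RCA does not describe the client's logic bug, but a common way a polling client amplifies load is by retrying immediately on failure, so that every error produces another request against an already-degraded server. The sketch below, with hypothetical names (`fetch`, `handle`, `poll_interval`), illustrates the usual mitigation, exponential backoff with jitter; it is not Yammer's actual client code.

```python
import random
import time

def poll_forever(fetch, handle, poll_interval=5.0, base_delay=1.0, max_delay=60.0):
    """Poll a feed endpoint indefinitely, backing off on failure.

    A client that retries immediately on error turns every failure into
    extra traffic, amplifying load on a server that is already degraded.
    Exponential backoff with jitter spreads retries out instead.
    """
    delay = base_delay
    while True:
        try:
            handle(fetch())            # one successful poll
            delay = base_delay         # reset backoff after a success
            time.sleep(poll_interval)  # normal polling cadence
        except Exception:
            # Wait before retrying; double the delay each time (capped) and
            # add jitter so many clients don't retry in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.5))
            delay = min(delay * 2, max_delay)
```

A caller might use it as `poll_forever(lambda: requests.get(feed_url).json(), render_feed)`, where both names are placeholders for whatever the real client does.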

The cluster with the original performance problem is already scheduled for a hardware and software upgrade that will allow it to recover from a degraded state without a slow restart. That slow restart process was a primary contributor to the duration of the outage.

At 1:30am PST on January 15, service was impacted again as a result of the earlier issues. The misbehaving client was fixed, and service stabilized again at 4:30am PST. A performance improvement has been implemented on the server so that it can now handle an influx of requests of the kind the misbehaving client generated. Further performance improvements have been identified and prioritized to prevent a recurrence under similar circumstances, even if the trigger is unrelated.
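The RCA does not say what the server-side improvement was. One common way a server copes with an influx of requests from a misbehaving client is to shed excess load early, for example with a per-client token bucket, rather than letting retries pile up behind slower work. The sketch below is only an illustration of that general technique; the class name and parameters are hypothetical.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter (illustrative only).

    A request is admitted only if a token is available; otherwise the server
    can reject it cheaply (e.g. with HTTP 429) instead of queueing it.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec   # sustained requests allowed per second
        self.capacity = burst      # maximum short-term burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In such a design, the feed endpoint would check `bucket.allow()` before doing any expensive work, so a flood of retries is turned away quickly instead of deepening the degradation.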
