ROOT CAUSE ANALYSIS; Yammer Loading Issues; August 18 & 20, 2014

At 6:00 AM PDT on August 18, some Yammer users experienced issues loading their networks and feeds. Users beginning fresh Yammer sessions were especially affected. During this time, the Yammer on-call team observed that requests to our asset servers paused unexpectedly for a second or more while data was being returned. As a result, our content delivery network (CDN) was returning incomplete assets, and Yammer content could not be loaded consistently. The issue cleared up without direct intervention from the on-call team, who continued to monitor the situation from that point on. The on-call team posted the all-clear at 8:20 AM PDT.

On August 20 at 6:39 AM PDT, elevated server load was detected again, resulting in slow loading times for Yammer feeds and pages. The platform recovered at 8:00 AM PDT.

Further investigation into these two incidents determined that both were caused by asset compression at the load balancer, which led to inconsistent asset caching behavior at the CDN. As a result, the CDN made multiple requests for the same content; these requests then built up and could not be fulfilled.
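
The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a description of Yammer's or the CDN's actual implementation: it assumes a cache keyed by path and content encoding, and an origin that compresses only some of its responses. All names here (ToyCDN, consistent_origin, make_inconsistent_origin, count_origin_requests) and the asset path are hypothetical.

```python
from itertools import cycle


class ToyCDN:
    """Hypothetical edge cache keyed by (path, content encoding)."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}
        self.origin_requests = 0

    def get(self, path, accept_encoding):
        # Look up the variant the client asked for.
        key = (path, accept_encoding)
        if key in self.cache:
            return self.cache[key]
        # Cache miss: go back to the origin (behind the load balancer).
        self.origin_requests += 1
        encoding, body = self.origin(path, accept_encoding)
        # Assumption: the response is stored under the encoding it
        # actually arrived with, which may differ from the one requested.
        self.cache[(path, encoding)] = body
        return body


def consistent_origin(path, accept_encoding):
    # Always returns the encoding that was requested.
    return accept_encoding, f"{path} [{accept_encoding}]"


def make_inconsistent_origin():
    # Assumed behavior for illustration: only every third response
    # is compressed, regardless of what was requested.
    pattern = cycle(["identity", "identity", "gzip"])

    def origin(path, accept_encoding):
        encoding = next(pattern)
        return encoding, f"{path} [{encoding}]"

    return origin


def count_origin_requests(origin, client_requests=6):
    # Send identical client requests and count the trips to the origin.
    cdn = ToyCDN(origin)
    for _ in range(client_requests):
        cdn.get("/assets/app.js", "gzip")
    return cdn.origin_requests


if __name__ == "__main__":
    print("consistent:", count_origin_requests(consistent_origin))
    print("inconsistent:", count_origin_requests(make_inconsistent_origin()))
```

In this toy setup, six identical client requests reach the consistent origin once and the inconsistent origin three times, because the cache only starts serving hits once a response arrives in the requested encoding.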

To prevent a recurrence, the Yammer Engineering team has disabled the asset compression that was identified as the root cause of this behavior. Additionally, some Yammer services have been routed away from the main load balancer to reduce the load on it. New alerting mechanisms have also been implemented to catch load balancer risks sooner.

There will be no further updates on this incident.
