YAMMER OUTAGE 04/02 – Root Cause Analysis
Updated code was released on Friday night. Although it had been tested, it caused performance problems that manifested only during our peak traffic hours. The code in question dramatically increased requests to our primary databases.
Starting at ~5:30am PST, this increased database load caused slow response times for all of Yammer.com. Some requests timed out completely, making Yammer.com unavailable. All key personnel were notified and mobilized shortly after our monitoring caught these performance issues.
At ~7:30am PST, the aberrant code was identified and reverted, which immediately improved response times. Performance did not fully return to normal until 8:29am PST, once the backlog of requests had been processed.
Increased monitoring and changes to how code is deployed are being planned to mitigate this type of widespread performance degradation in the future.