Logging in via SSO has stabilized after a brief issue. All users should now be able to sign in to Yammer successfully again.
Load times have returned to normal across Yammer networks.
Yammer’s SAML service is currently down, causing new logins via SSO to fail. Users already logged in to Yammer should not see any effect on their sessions. More updates to come.
Yammer networks are currently experiencing degraded service in the form of longer page load times. Our team is investigating the issue and working to get service speeds back to normal. We will post updates here as the investigation progresses.
Yammer will be conducting planned server maintenance on Saturday, March 29, 2014. From approximately 8am to 1pm PDT, Yammer networks will be inaccessible. We will post here when the maintenance begins, and again when it ends and all Yammer networks have been brought back online.
Some customers using Yammer in languages other than English experienced issues loading the site, such as a white screen or parts of the page failing to render. A hotfix was deployed and the issue has been resolved.
We will be taking every necessary precaution to mitigate unexpected service disruptions. We appreciate your patience, and apologize for any inconvenience this has caused.
At 1:56pm PST on January 14, the Yammer On-Call team began seeing failure alerts from the machines that serve Yammer network feeds. It was quickly determined that these alerts were the result of a manually run proactive diagnostic process that triggered an unlikely chain reaction, which ultimately made Yammer unreliable for several hours.
The diagnostic process created resource contention across a distributed data store, which in turn caused temporary website performance issues. Due to a logic bug in a polling client, the performance issue was amplified by an influx of requests. Once this was identified, the responsible service was disabled, and performance returned to normal at 9:30pm PST. Restrictions are being put in place immediately to prevent diagnostic processes from consuming more than a small, fixed share of resources across this cluster.
The cluster where the original performance problem occurred is already scheduled for a hardware and software upgrade that will allow it to recover from a degraded state without a slow restart; that slow restart process was a primary contributor to the duration of the outage.
At 1:30am PST on January 15, service was again impacted as a result of the earlier issues. The misbehaving client was fixed, and service stabilized again at 4:30am PST. A performance improvement has been implemented on the server so that it can now handle an influx of requests of the kind the misbehaving client generated. Further performance improvements have been identified and prioritized to prevent a recurrence under similar circumstances.
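The report does not detail the client-side fix, but the failure mode it describes, a polling client amplifying a server slowdown with a flood of requests, is commonly mitigated with exponential backoff and jitter. A minimal sketch of that general pattern follows; all names here are hypothetical and are not Yammer's actual client code:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)]. The jitter spreads retries out
    so a fleet of clients does not hit a recovering server in waves."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def poll_forever(fetch, interval=5.0):
    """Poll `fetch` on a fixed cadence, but back off on consecutive
    failures instead of retrying at full speed (the amplification
    behavior the postmortem describes)."""
    failures = 0
    while True:
        try:
            fetch()
            failures = 0            # healthy again: resume normal cadence
            time.sleep(interval)
        except Exception:
            failures += 1
            time.sleep(backoff_delay(failures))
```

With backoff in place, a server-side slowdown causes clients to retry progressively less often rather than more, which breaks the feedback loop between degraded performance and request volume.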