ROOT CAUSE ANALYSIS – Service Interruption 9/22/12

The service interruption on 9/22 was the result of an ongoing database project that is expanding the number of databases to improve availability and performance. Over the last several weeks we’ve been transitioning tables between databases, first in our test environment, then in pre-production, and finally in production. Until Saturday night our table transition system had worked well for us.

During the transition process there are two moments of truth: the first is when we disable writes and move read load from the old database to the new one, and the second is when we enable writes to the new database. As we moved reads to the new database we saw that the connection rate and load were higher than expected. Not critically high, but curiously so. During this period we’re unable to take writes to the transitioning tables, so while the site generally works, users writing to those tables will see errors. On Saturday those would have been users updating their Profile (bio, phone, etc.).
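For readers curious about the mechanics, here is a minimal sketch of that two-phase cutover. It assumes a shared database/tables map and a coordinator script; the names (ShardMap, begin_transition, and so on) are illustrative, not our actual tooling.

```python
# Hypothetical sketch of the two-phase table transition described above.
import time


class ShardMap:
    """Toy database/tables map consulted by backend services (illustrative)."""

    def __init__(self):
        self.read_host = {"profiles": "db-old"}
        self.write_host = {"profiles": "db-old"}
        self.writes_enabled = {"profiles": True}

    # Moment of truth #1: stop writes and point reads at the new database.
    def begin_transition(self, table, new_host):
        self.writes_enabled[table] = False   # users writing to this table see errors
        self.read_host[table] = new_host     # read load moves to the new database

    # Moment of truth #2: once the new database looks healthy, take writes there.
    def finish_transition(self, table, new_host):
        self.write_host[table] = new_host
        self.writes_enabled[table] = True


def transition(table, new_host, shard_map, healthy, settle_seconds=60):
    """Coordinator: cut reads over, watch the new database, then enable writes."""
    shard_map.begin_transition(table, new_host)
    time.sleep(settle_seconds)               # watch connection rate and load
    if healthy(new_host):
        shard_map.finish_transition(table, new_host)
    else:
        raise RuntimeError("new database unhealthy; consider rolling back")
```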

When we are transitioning tables we tag them as “transitioning,” which tells our backend services to aggressively update their map of databases and tables. Our theory was that some of the backend services were reconnecting more than they needed to and would calm down once we removed the “transitioning” tag. We could have done a better job of testing that theory. It seemed reasonable, and we wanted to restore full site functionality, so we decided to press forward.
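To make that concrete, here is a rough sketch of how a “transitioning” tag could drive more aggressive map refreshes in a backend service. The intervals, jitter, and function names (fetch_map, apply_map) are assumptions for illustration, not our real services.

```python
# Hypothetical refresh loop keyed off a "transitioning" tag in the map.
import random
import time

NORMAL_REFRESH_SECONDS = 300      # steady state: refresh the map occasionally
TRANSITION_REFRESH_SECONDS = 5    # while any table is tagged "transitioning"


def refresh_loop(fetch_map, apply_map):
    """Backend-service loop that reloads the database/tables map."""
    while True:
        shard_map = fetch_map()           # e.g. read from a config store
        apply_map(shard_map)              # may open new database connections
        if any(t.get("transitioning") for t in shard_map["tables"].values()):
            interval = TRANSITION_REFRESH_SECONDS
        else:
            interval = NORMAL_REFRESH_SECONDS
        time.sleep(interval + random.uniform(0, 1))  # jitter to spread out refreshes
```

If apply_map reopens connections on every refresh rather than only when the map actually changes, the shorter interval alone multiplies the connection rate. That is the kind of interaction our theory pointed at, though we had not verified it before pressing on.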

This is where our evening became much more exciting: connection rates and load on the new database continued to rise, eventually to the breaking point. After discussing our options we decided to transition back to the original database server. Because we had disabled writes early and had not yet re-enabled them, all we needed to do was update the database/tables map and force our backend services to refresh. Erring on the side of safety, we verified each backend service before bringing the site back up.
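A rough sketch of that rollback path, building on the hypothetical ShardMap above; the service objects and their force_map_refresh / current_host methods are likewise illustrative, not our production code.

```python
# Hypothetical rollback: point the map back at the original database and make
# every backend service pick it up, verifying each one before re-enabling writes.
def roll_back(table, original_host, shard_map, services):
    # Writes were never re-enabled on the new database, so nothing can diverge;
    # flipping reads (and the write target) back is enough.
    shard_map.read_host[table] = original_host
    shard_map.write_host[table] = original_host

    for service in services:
        service.force_map_refresh()       # reload the database/tables map now
        assert service.current_host(table) == original_host, service.name

    shard_map.writes_enabled[table] = True    # safe to take writes again
```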

No data was lost or corrupted during the service interruption.

These transitions are designed to be largely transparent to our users. Clearly that wasn’t the case last night. We will be working to identify the unique factors that led to last night’s service interruption before performing further transitions.
