This morning’s outage (7:30 AM PST on December 12, 2012) was triggered by an automated system on one of our core database servers. The server detected a potential issue and automatically initiated a repair process. Because these repair processes are very resource-intensive, they are normally run on a schedule outside of peak hours. Unfortunately, this one began just as our US users were coming online early in the morning, resulting in a temporary capacity problem.
Once we identified the source of the problem, we needed to understand what had triggered the repair process and decide on the safest next step. We had several options on the table: pause the repair process, kill it, wait for it to finish, or fail over to a different database server entirely.
We discovered two things. First, the database kicked off the repair simply because too much time had passed since the last one; the trigger was purely a function of time. Second, the system that schedules repairs off-peak was not working as intended for this specific database, and it failed to raise an alarm. Once we understood why the repair process was running, we decided the safest action was to pause it for a few hours and let it resume later in the day under close, all-hands supervision.
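The scheduling failure above can be sketched as a simple guard: since the engine's repair trigger is purely a function of time, the scheduler only has to ensure every database gets an off-peak repair before the engine's own deadline, and alarm if one is at risk of self-triggering. This is a minimal illustration, not our actual scheduler; the interval, margin, and names are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical values: the engine auto-repairs after REPAIR_INTERVAL,
# so we must alarm with enough SAFETY_MARGIN to run one off-peak first.
REPAIR_INTERVAL = timedelta(days=30)
SAFETY_MARGIN = timedelta(days=2)

def repair_is_overdue(last_repair: datetime, now: datetime) -> bool:
    """True if this database is close enough to the engine's internal
    deadline that it could self-trigger a repair during peak hours."""
    return now - last_repair > REPAIR_INTERVAL - SAFETY_MARGIN

def check_databases(last_repairs: dict, now: datetime) -> list:
    """Return names of databases whose off-peak repair is overdue and
    should raise an alarm."""
    return [name for name, ts in last_repairs.items()
            if repair_is_overdue(ts, now)]
```

The key point the sketch captures is that the alarm is part of the check itself: a scheduler that silently skips a database, as ours did here, is exactly the failure mode this guard exists to surface.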
No data was lost or corrupted during the service interruption.
To prevent future outages, we will expand our monitoring to include the metrics the database uses internally to initiate these automatic repairs. We will also fix the error(s) that prevented repairs from running off-peak, along with the related monitoring.
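Monitoring the engine's own trigger metrics means we can alert well before the database decides to repair itself. A minimal sketch of that idea, with assumed threshold ratios and a hypothetical function name:

```python
# Hypothetical check: compare how far a database has progressed toward
# its internal auto-repair threshold, and escalate before it fires.
WARN_RATIO = 0.8   # warn at 80% of the engine's trigger threshold
CRIT_RATIO = 0.95  # page on-call at 95%

def repair_alert_level(elapsed_seconds: float, trigger_seconds: float) -> str:
    """Map proximity to the engine's auto-repair trigger onto an
    alert level: 'ok', 'warn', or 'critical'."""
    ratio = elapsed_seconds / trigger_seconds
    if ratio >= CRIT_RATIO:
        return "critical"
    if ratio >= WARN_RATIO:
        return "warn"
    return "ok"
```

Alerting on the same signal the engine uses, rather than on a separate schedule, removes the gap that let this repair start unannounced.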
As part of an ongoing project, we will also be significantly expanding the capacity of each database server/cluster.
We apologize for any inconvenience this may have caused you, and we appreciate your patience during the disruption.