Friday April 20, 2007
Where were you when the BlackBerries went out? The eight-hour BlackBerry outage this week generated plenty of news because RIM's BlackBerry systems are usually so reliable -- and because it showed that yes, they do have a single point of failure. RIM released a statement today that somewhat obliquely explains what happened.
In English: RIM decided to do a software upgrade on its servers, the upgrade crashed the database, and switching over to the backup system took longer than expected. I don't see this as crippling for RIM, because its service is normally so reliable. All the same, it's been a bit of a PR boost for competitors like Microsoft, who are trumpeting that their push e-mail doesn't all go through one server.
I was just discussing this with former IT guy Joel Santo Domingo, our desktops analyst, and here are the lessons we found for IT guys:
1. Don't run upgrades on your live servers. Switch over to a backup, run the upgrade, switch back.
2. If you're going to run an upgrade, do it on a Saturday night. Don't do it on a Tuesday.
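For the IT guys who like to see it spelled out, here is a minimal sketch of those two lessons in Python. Everything in it is hypothetical: the `Server` class, the `health_check` callback and the "Saturday after 10 p.m." window are assumptions made for illustration, not a description of how RIM's systems actually work.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass
class Server:
    """Hypothetical stand-in for a production server."""
    name: str
    version: str
    live: bool = False


def in_maintenance_window(now: datetime) -> bool:
    """Lesson 2: only upgrade on a Saturday night (assumed: 10 p.m. or later)."""
    return now.weekday() == 5 and now.hour >= 22  # Monday is 0, Saturday is 5


def upgrade_offline(primary: Server, backup: Server, new_version: str,
                    health_check: Callable[[Server], bool]) -> bool:
    """Lesson 1: switch over to the backup, upgrade the idle primary, switch back."""
    old_version = primary.version

    # Switch over: the backup carries traffic while the primary is worked on.
    primary.live, backup.live = False, True

    # Run the upgrade on the server that is no longer live.
    primary.version = new_version

    if not health_check(primary):
        # Upgrade looks bad: roll it back and leave traffic on the backup.
        primary.version = old_version
        return False

    # Upgrade looks good: switch back to the primary.
    primary.live, backup.live = True, False
    return True


if __name__ == "__main__":
    primary = Server("primary", "1.0", live=True)
    backup = Server("backup", "1.0")
    if in_maintenance_window(datetime(2007, 4, 21, 23, 0)):
        ok = upgrade_offline(primary, backup, "1.1", lambda s: True)
        print("upgraded" if ok else "rolled back", primary, backup)
```

The point of the design is that a bad upgrade never touches the server that is carrying traffic: if the health check fails, users stay on the untouched backup while the primary is rolled back.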
RIM says: RIM's in-depth diagnostic analysis of the service interruption that
occurred in North America on Tuesday night is progressing well and RIM
will continue to provide further information as it's available. RIM's
first priority during any service interruption is always to restore
service and then establish, monitor and maintain stability. Proper
analysis can take several days or longer and RIM's commitment is to
provide the most accurate and complete information possible in such
situations.
RIM is pleased to report that normal conditions returned on Wednesday
and the BlackBerry service continues to operate normally today.
RIM has been able to definitively rule out security and capacity issues
as a root cause. Further, RIM has confirmed that the incident was not
caused by any hardware failure or core software infrastructure.
RIM has determined that the incident was triggered by the introduction
of a new, non-critical system routine that was designed to provide
better optimization of the system's cache. The system routine was
expected to be non-impacting with respect to the real-time operation of
the BlackBerry infrastructure, but the pre-testing of the system routine
proved to be insufficient.
The new system routine produced an unexpected impact and triggered a
compounding series of interaction errors between the system's
operational database and cache. After isolating the resulting database
problem and unsuccessfully attempting to correct it, RIM began its
failover process to a backup system.
Although the backup system and failover process had been repeatedly and
successfully tested previously, the failover process did not fully
perform to RIM's expectations in this situation and therefore caused
further delay in restoring service and processing the resulting message
queue.
RIM apologizes to customers for inconvenience resulting from the service
interruption. RIM's root cause analysis and system enhancement process
with respect to this incident is ongoing and RIM has already identified
certain aspects of its testing, monitoring and recovery processes that
will be enhanced as a result of the incident and in order to prevent
recurrence.