Yesterday, at around noon EDT, customers across XM's listening community began reporting service interruptions and signal interference. The problems persisted for approximately 24 hours, with the satellite radio provider returning to its usual quality of service at around the same time today. Blogs began reporting the XM Radio outage early on, noting that the company was not exactly forthcoming when it came to issuing a statement on the subject. Not surprisingly, when we attempted to contact XM, we found ourselves leaving messages on machines.
Now that things have calmed down considerably over at XM headquarters, we managed to get hold of Chance Patterson, one of the company's spokespeople. Patterson answered our questions, while making it clear that there are still a lot of questions that the engineers at XM have yet to answer themselves. You can find the full transcript of my conversation with Patterson after the jump.
So, what exactly happened over there?
You go from working the problem and solving it to understanding what happened and why. That began the moment that we restored the broadcast from the satellite. On the top level, we're always in communication with the satellite. That comes in two forms. One is sending all of the data that becomes the audio content and the displayed information on the navigation screens. The other is commands that we use to control the satellite, test the systems, and deal with all of the major functions. So, as we were sending software up to the satellite, something caused it to stop broadcasting. While we could communicate with it, it wasn't returning the information.
There's an antenna on the bird. It processes the 0s and 1s, and sends them back down, across the country, through the outbound antennas. It stopped doing that. The dynamic is interesting, too, because we have these two satellites, XM 3 and 4. Both have a national footprint, so they both cover the entire US and parts of Canada. In many cases, if you were in an open space, you wouldn't know that there was an issue--that was part of it, too. We were trying to assess the customer impact, and whether we could immediately work the problem. Was it going to be two hours, or would it be--like it ended up--24 hours? Without being able to touch the thing, you're relying on information that comes back, so you send test signals and diagnostics to all the systems, to figure out the extent of the problem, and whether it goes into other parts of the system.
Is that part of the reason that it took so long to issue a statement?
We put out a statement at 5:30 or 6:00 PM that night. That was an initial service interruption announcement. At that time, the timeline we were working through had us resetting and rebroadcasting again that evening. This is the first time this has happened, so I think the decision was made at the time, as to the steps for restarting and rebroadcasting. Instead of being ten minutes to move from step seven to eight, we made it 45 minutes, because we wanted to get feedback from the satellite that it was okay. One thing led to another, and here it was, 11:45 AM today before it was reset. So there were several communications that went out, some of them by e-mail blast, some of them through the Web site, some of them on-air.
But you're still unclear on the specifics of the problem?
We know when the problem occurred, and that's sort of a start. Up until that point, things were operating, as they say in the business, 'nominally.' We identified when the issue occurred, and so we're looking at all of the activities that occurred around that time. That's as much as we can speak to, right now. We have vendors, our own engineers, and others, poring over lots of data, to see what the root cause was.
And solving the problem was a matter of restarting it? Does everything seem to be fine now?
Yeah. It's healthy. The satellite is healthy and the payload is fully operational, so are the subsystems of the satellite.
Everyone must be fairly relieved, over there.
Yeah, but now the next part is going over the information that we've received over the last 24 hours, making sure we understand what happened, and taking it from there.
So, we'll be getting a lot more information, over the next couple of days?
We'll learn more later--a lot of these folks who were on watch when the incident occurred worked all day, all night, through mid-day today to get us back online, so right now, they're taking a break.
May 22, 2007 9:04 PM
With the XM Satellite service outage dominating headlines so soon after the Blackberry outage, it's worth again looking at what these outages have in common, which is why organizations can't diagnose such problems and prevent them happening. In this case, XM says they have "quickly identified the problem," a software glitch, but the damage has been done.
Like the Blackberry outage, this validates the need for companies to be able to predict potential outages before they occur and before users are impacted. It's not enough anymore to wait for expensive monitoring tools to pick up problems after they've already wreaked havoc on the system. If organizations continue to take this approach, you can be sure we haven't seen the last of these damaging outages. Instead, companies need to think proactively and gain insight into the behavior of their IT infrastructure to learn what is normal. That way, when abnormalities are detected, problems can be identified and prevented before anyone is affected. My company, Integrien, is in the business of helping companies achieve this.
May 23, 2007 1:51 PM
"this validates the need for companies to be able to predict potential outages before they occur and before users are impacted"
Dang! I could use that for my stock picks too!
Satellite systems are designed with multiple layers of backup and redundancy, and like stocks, you can't anticipate everything.
Stuff happens. Listen to AM if you have to.
NBD
May 23, 2007 5:23 PM
Reality Check!!!!!
These are radio stations that went off the air for a day, not oxygen to an ER. Life goes on.
May 23, 2007 7:53 PM
I started hearing the problem with the XM signal on my way home from work. When I checked my email, I had a notice from XM that they were experiencing a problem. Within an hour or two I received another email stating it was a software problem.
I feel XM kept me informed in a timely manner.
May 23, 2007 8:15 PM
Why did both satellites have to be restarted? The event must have started with something from the ground (or something that both satellites have in common). Something in the story doesn't fit.
May 23, 2007 9:31 PM
Must have been running Windows!!! LOL
May 23, 2007 9:48 PM
More like probably Linux
May 24, 2007 7:28 AM
I went to the web site, and contrary to what this article said, I found no mention of an outage. I'd like a day credit for the 1+4 subscriptions I have. (And, AM is not an option, my presets for AM/FM have never been set in the 4 years I have had my car.)
May 24, 2007 9:32 AM
The service announcement was up; it was just--not surprisingly--not particularly easy to find.