The Night of the Dead Servers II

It seems there was a large power cut last night, sometime before 18:00, affecting a wide area of Northern Ireland. As I’m in England I probably wouldn’t have noticed; but the internet service provider in Belfast who hosts my servers went out with the electricity and killed my machines.

I heard that their UPSes worked to keep things alive for a while, and I also heard that their generator started quickly, but it seems that their distribution panel couldn’t take the strain: we can forgive them for that, as long as it doesn’t happen again.

My main complaint is that I didn’t hear any of these things from them. We discovered the problem when one of our customers phoned us to complain.

So, I tried to contact our provider, but their phones were dead too.

Their network returned a while later. I could connect to two of my unhappy servers, but my third was still dead.

Their phones rang, but nobody answered.

Then I lost my servers again. This time it seemed to be a routing loop that twisted all through the night until finally, more than 15 hours after the incident, someone answered the phone. The routing problem was quickly solved, and my third dead server was revived.

They still haven’t told me anything about what happened.

A service provider needs to maintain their customers’ confidence, especially when things go wrong. They should be able to identify routing problems in their network before I do. They should make sure everything is working after they think they’ve fixed it. And if they can’t do that, they should at least communicate: they should phone me to explain the problem, and not make it sound like it’s my fault for not having their mobile phone numbers.

This is the third time (that we know of) that we’ve had a service outage. The first time was when they unplugged our servers because they “didn’t know they were there”. The second time…we don’t know. We noticed one morning that our servers had been disconnected for a while; they were reconnected before we complained, so we waited to hear what the problem was. We’re still waiting.