In my last issue of “Stories from the front #1” I told you about annoying physical security measures in customers’ datacenters and how I once got stuck in a customer’s server room.
In this issue I’m going to tell you about a worst-case scenario where a total power failure, combined with a misconfiguration of the UPS software, led to database corruption in a highly available cluster.
The setup is basically a two-node Primepower 1500 cluster, separated by several hundred meters and running the Primecluster software. The nodes share a common database which is distributed over two FibreChannel cabinets full of hard drives. Both cluster nodes can access the two cabinets concurrently (one cabinet per node), and they are connected by two 1000-Base-LX cluster-interconnect lines which are used to synchronize that concurrent access. Every single LAN connection is redundant (through Solaris’ IPMP driver), every single piece of equipment is available twice, and we’re using different power circuits from different power feeds for everything. We’re as resilient as we could possibly be with that hardware.
So far, so good. We also have a local UPS per node, each good for about 45 minutes – if they run out of juice, they alert the node over a serial connection on which a UPS daemon is listening. This daemon does two things: it sends out an SNMP trap to the NMS to alert the operator, and it tells the cluster software to shut down cleanly before the power fails completely.
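In essence, the daemon’s job boils down to a small decision table: status read from the serial port in, actions out. Here is a minimal Python sketch of that logic – the status strings and action names are my own invention for illustration, not the vendor’s actual protocol:

```python
# Hypothetical sketch of a UPS daemon's decision logic.
# The status strings ("ONLINE", "ONBATT", "LOWBATT") and action names
# are assumptions for illustration, not the real vendor protocol.

def actions_for_status(status: str) -> list[str]:
    """Map a UPS status line (as read from the serial port) to actions."""
    if status == "ONBATT":
        # Mains power lost, node running on battery: alert the NOC.
        return ["send_snmp_trap"]
    if status == "LOWBATT":
        # Battery nearly drained: alert, then shut the cluster down
        # cleanly before power dies completely.
        return ["send_snmp_trap", "shutdown_cluster"]
    # "ONLINE" or anything else: nothing to do.
    return []
```

In a real daemon the two actions would invoke something like the `snmptrap` command and the cluster’s shutdown procedure; had this mapping been configured correctly on our site, the operator would have been alerted on `ONBATT` and the database closed cleanly on `LOWBATT`.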
OK. The datacenter is located in West Africa. The power supply is flaky: you get brown-outs every other hour and roughly one black-out lasting several minutes every day. The site was well prepared though – it has a large diesel generator (~40 MVA class) and a huge UPS. All power is converted to DC and back to AC through this UPS. And to be safe, we still have those two small UPSes (20 kVA). All together, a pretty safe solution.
Until that day. They had a total black-out. The Diesel didn’t start. But the big UPS of the site kicked in, so everything should’ve been fine.
About 50 minutes later the cluster went down.
When our emergency staff (from abroad) logged into the system to start up the cluster software again, they realized that they couldn’t power-cycle one node – it wasn’t accessible at all – so they sent someone from the local company to the server room to inspect the switchboard.
They found out that a circuit breaker of the local small UPS had tripped. They reset it and were able to start the remaining node, but Oracle complained loudly about the database – it was corrupted. We were furious, as you can imagine: we had so many safeguards in place, how could that possibly happen?
After repairing the database – which took about five hours – we started an investigation which led to the following conclusion:
When the big UPS kicked in, it partially failed in one server room and sent out a power surge which made the circuit breaker in our UPS trip. The real problem was that the UPS daemon wasn’t configured correctly: the operator in the NOC was never alerted that the cluster node was running on battery, and when the batteries were drained the cluster software wasn’t shut down cleanly. For some reason we never figured out, the database got corrupted – something that wasn’t supposed to happen.
What’s the moral? Two things: First, building highly available systems (as in 99.9999% theoretical availability) does not prevent configuration errors. Second, the acceptance tests after installation and commissioning must have been sloppy, because a test case like pulling the plug – to check whether the database survives – might have revealed the problem. Although I’m not totally sure about the second part.
In conclusion, I can only say that I’ll be even more alert to those little configuration details than I was before.
Hope you enjoyed this little story; if you have a story to share, drop me a note.