Failover!
It had to happen someday, and it happened to me this morning. I was sat on the train when a whole bunch of alerts fired through, first about memory and then about thermal sensors exceeding thresholds. Then came an email to the emergency response team from the infrastructure manager telling us that one server had overheated and failed over. We run an active/active cluster, so the surviving node is now carrying the full load.
After a bit of fettling of memory and disabling of non-critical scheduled jobs (mostly alerts) to reduce pressure, everything is up and running so far. The day is not over, so I’m not planning my weekend just yet, but this has planted a seed in my mind for a future post, or possibly a series, on high availability.
Hopefully HP will come good and resolve our overheating issues, allowing me to fail back tomorrow when I have some downtime.