I’ve been building out our capacity for the past 3 years using almost identical hardware configurations: motherboards with Intel Server chipsets, Xeon CPUs, 3ware RAID controllers and Intel NICs.
Sure, there have been minor variations along the way: disks get bigger, memory gets faster, CPU clock speeds increase gradually. Overall though, the hardware we’re using today is very similar to what we were using 3 years ago. While some complain about the lack of progress, I am quite happy with this slow-but-steady pace of hardware development. It means I can really get to know the servers I work with every day.
One benefit of this is that I’ve found myself starting to develop a sixth sense for trouble. An I/O op which takes ever so slightly longer than it should, an occasional kernel panic on an otherwise bog-standard Xen virtual machine – small signs that something is not quite right with the hardware.
Case in point: a new server commissioned on 05/11/2010 which at first glance appeared to be functioning perfectly. As the days and weeks went by and I loaded it up with production virtual machines, I began to notice the occasional issue which I had not seen on other servers. Not finding any specific problems, I left it to do its job, and it has been perfectly reliable ever since. Until yesterday, when the RAID controller sent me half a dozen emails telling me one of the drives had bad sectors. Then the drive timed out, which degraded the RAID array, before it shut down, powered back on and came back online. The controller did its job and auto-recovered by rebuilding the RAID array, but now I have solid evidence that something is in fact not quite right with the server after all. Back to the supplier it goes for a thorough diagnosis.
I wish I could pinpoint exactly why I suspected something wasn’t right with the hardware. It would help me identify problems more quickly and efficiently in future. So far I have not been able to come up with anything concrete; it’s just an intuition. The one thing I have learned is to have more faith in my intuition and less trust in my hardware. If something feels off, it probably is.