Uptime Monitoring

April 8, 2011 under The 100% Uptime Challenge

At this point, I’m reasonably confident that I have a reliable and fault-tolerant server setup powering this blog. I’ve got 3 servers in 3 countries, each capable of keeping the blog running even if the other 2 should fall off the face of the ‘net.

Each server has its own copy of the PHP files and MySQL database, allowing it to operate independently. MySQL replication is smart enough to resume automatically after a connection failure, and I know how to verify my data integrity to confirm no conflicts have arisen.

I’ve tested a number of operating systems and browsers to see how they behave in theoretical outage scenarios.

With all this in place, it’s time to start the challenge and see if I can reach 100% uptime!

I tested out 3 monitoring services: Site24x7, SiteUptime, and Pingdom. While they are all straightforward to set up and seem at first glance to do their jobs perfectly, there is a problem: none of them behaves exactly like a browser. To be more specific, none of them tried the alternate IPs when I tested shutting down a server. They then reported www.cwik.ch was down, even though all the browsers I tested connected instantly to one of the available servers.
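To make the difference concrete, here is a rough sketch in Python of the browser-like behaviour the monitors are missing: look up every address the hostname resolves to, then try each one in turn before declaring the site down. The function names (resolve_all, try_connect, first_reachable, site_is_up) are my own inventions for illustration, not part of any monitoring service. The check is only a TCP connect on port 80, so it says nothing about whether the blog actually renders, and real browsers differ in their timeouts and retry order.

```python
import socket


def resolve_all(host, port=80):
    """Return every distinct IP address the resolver gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        ip = info[4][0]  # sockaddr is (ip, port, ...) for IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def try_connect(ip, port=80, timeout=5.0):
    """Return True if a TCP connection to ip:port opens within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(addresses, connect=try_connect):
    """Try each address in order; return the first that answers, else None."""
    for ip in addresses:
        if connect(ip):
            return ip
    return None


def site_is_up(host, port=80, resolve=resolve_all, connect=try_connect):
    """Report the site as up if ANY of its addresses accepts a connection."""
    try:
        addresses = resolve(host, port)
    except socket.gaierror:
        return False  # the name did not resolve at all
    return first_reachable(addresses, lambda ip: connect(ip, port)) is not None
```

Calling site_is_up("www.cwik.ch") would report an outage only when every address fails. The monitors I tested behave as if they stop after the first address.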

I have set up all 3 monitoring services anyway. Chances are that even if one of the servers has an outage, at least one of the monitors will still connect. Until I find a better solution, I will consider the test a failure only if all 3 monitors detect an outage.
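Spelled out as code, the rule is a simple "all of them" check. This is only a sketch: the challenge_failed function and the dictionary layout are hypothetical, and in practice I read the results off each service's own reports rather than from a script.

```python
def challenge_failed(monitor_reports):
    """Return True only if every monitor reported the site as down.

    monitor_reports maps a monitor's name to True when that monitor
    saw an outage, and to False when it connected successfully.
    """
    # With no reports at all there is nothing to judge, so not a failure.
    return bool(monitor_reports) and all(monitor_reports.values())


# One monitor hit a dead server, the other two got through: still a pass.
reports = {"Pingdom": True, "SiteUptime": False, "Site24x7": False}
print(challenge_failed(reports))
```

Because the three services poll at different intervals, a short outage could be seen by one monitor and missed by the others, so this rule errs on the lenient side.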

The monitoring intervals I’ve set up are:
Pingdom: 1 minute
SiteUptime: 2 minutes
Site24x7: 3 minutes

All 3 services offer a “badge” which I’ve placed in the left-hand sidebar.

PS. I contacted Pingdom support to enquire about the above-mentioned problem. They answered that they are aware of the limitation and are keeping it in mind for a potential future fix, but at this point there is no way to have Pingdom try additional IPs should the first one fail to respond. Their support person pointed me to the Round-robin DNS article on Wikipedia and brought up the point that there is no agreed standard on how a client should handle this kind of situation, and he is quite right. That doesn’t change the fact that my testing indicates all the major browsers and operating systems I tried fully support this kind of failover, so the lesson is caveat emptor: the result you desire is not guaranteed.