Uptime Monitoring

April 8, 2011 under The 100% Uptime Challenge

At this point, I’m reasonably confident that I have a reliable and fault-tolerant server setup powering this blog. I’ve got 3 servers in 3 countries, each capable of keeping the blog running even if the other 2 should fall off the face of the ’net.

Each server has its own copy of the PHP files and MySQL database, allowing it to operate independently. MySQL replication is smart enough to resume automatically after a connection failure, and I know how to verify my data integrity to confirm that no conflicts have arisen.
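
As a quick reference, this is the sort of one-liner I use to confirm that replication has in fact resumed on a server. It assumes the MySQL client can authenticate without a prompt (e.g. via a local ~/.my.cnf), and it’s only a quick health check, not the full integrity verification:

# confirm both replication threads are running and how far behind we are
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_(IO|SQL)_Running|Seconds_Behind_Master"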

I’ve tested a number of operating systems and browsers to see how they behave in theoretical outage scenarios.

With all this in place, it’s time to start the challenge and see if I can reach 100% uptime!

I tested out 3 monitoring services: Site24x7, SiteUptime, and Pingdom. While they are all straightforward to set up and seem at first glance to do their jobs perfectly, there is a problem: none of them behaves exactly like a browser. To be more specific, none of them tried the alternate IPs when I tested shutting down a server. They then reported www.cwik.ch as down, even though every browser I tested connected instantly to one of the available servers.

I have set up all 3 monitoring services anyway. Chances are, even if there is an outage on one of the servers, at least one of the monitors will still connect. Until I find a better solution, I will consider the test a failure only if all 3 monitors detect an outage.
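
For the record, the behaviour I’d like from a monitor is roughly the following: resolve every A record for the site and only count the check as failed if none of the IPs responds. This is just a sketch of that idea (the 10-second timeout and plain-HTTP check are arbitrary choices of mine, not anything the monitoring services actually offer):

#!/bin/sh
# try every A record for www.cwik.ch; fail only if no server answers
for ip in $(dig +short A www.cwik.ch); do
    if curl -sf -o /dev/null --max-time 10 -H "Host: www.cwik.ch" "http://$ip/"; then
        echo "OK via $ip"
        exit 0
    fi
done
echo "FAIL: no server responded"
exit 1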

The monitoring intervals I’ve set up are:
  • Pingdom: 1 minute
  • SiteUptime: 2 minutes
  • Site24x7: 3 minutes

All 3 services offer a “badge” which I’ve placed in the left-hand sidebar.

PS. I contacted Pingdom support to enquire about the above-mentioned problem. The answer I got back indicated they are aware of the limitation and are keeping it in mind for a potential future fix, but at this point there is no way to have Pingdom try additional IPs should the first one fail to respond. Their support person pointed me to the Round-robin DNS article on Wikipedia and made the point that there is no agreed standard on how a client should handle this kind of situation, and he is quite right. That doesn’t change the fact that my testing indicates all major browsers and OSes fully support this kind of failover, so the lesson is caveat emptor: the result you desire is not guaranteed.

Configuring Apache

March 23, 2011 under The 100% Uptime Challenge

In my scenario, the Apache server doesn’t need to be aware that it is only one server of many. Therefore I’ve customised my configuration file and copied the exact same file to all 3 servers. How much you customise is down to taste and requirements, but at a minimum I would suggest disabling any modules that you don’t need, configuring a ServerAdmin address and setting up logging to suit your needs.
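
To give a flavour of what I mean, here’s the kind of excerpt involved (illustrative only, not my actual file; the ServerAdmin address is a placeholder and the commented-out module is just an example):

# httpd.conf excerpt -- the same file is copied to all 3 servers
ServerAdmin webmaster@example.com

# load only the modules the site needs; comment out the rest
#LoadModule status_module modules/mod_status.so

# logging (the "combined" format comes from the stock LogFormat lines)
ErrorLog logs/error_log
LogLevel warn
CustomLog logs/access_log combined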

I’ve posted my somewhat stripped down httpd.conf should you wish to use it as a starting point (it’s for Apache 2.2 on CentOS 5).

In addition to the customisations to httpd.conf, take a look in /etc/httpd/conf.d/ to see what other files are being loaded. I removed the welcome.conf file but didn’t make any further changes. Most Apache modules install a config file here; for example, mod_ssl creates an ssl.conf.
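
In practice that step amounted to no more than:

# see what extra config gets pulled in alongside httpd.conf
ls /etc/httpd/conf.d/
# drop the default welcome page so requests reach the real site
rm /etc/httpd/conf.d/welcome.conf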

Finally, enable the Apache service:

# start Apache automatically at boot
chkconfig httpd on
# and start it right away
service httpd start
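
A quick local sanity check that Apache is actually answering (the external monitoring comes later in the challenge):

# expect an HTTP status line back from the local server
curl -sI http://localhost/ | head -n 1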

The 100% Uptime Challenge

March 20, 2011 under The 100% Uptime Challenge

A question I get asked on a fairly regular basis goes something like this: “I have a really important website, and it will cause me no end of grief should it ever go offline, even for 5 minutes.” The typical scenarios are e-commerce sites and hosted apps (SaaS), where any downtime at all has a direct financial impact on the operator.

The conversation usually arises after an incident of downtime, in the reflective post-damage-control stage. One of the first questions is inevitably: how valuable is uptime to you, really? Once you research the additional infrastructure and system complexity required to provide an extra 0.1% of uptime, you may well realise that it’s simply not worth it to your business. First, let’s take a quick look at what those uptime figures actually promise:

  • 3-nines or 99.9% uptime: 43m12s downtime per month
  • 4-nines or 99.99% uptime: 4m19s downtime per month
  • 5-nines or 99.999% uptime: 0m26s downtime per month
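
Those figures assume a 30-day month (43,200 minutes); the arithmetic is simply the allowed downtime fraction multiplied out:

# allowed downtime per 30-day month, in minutes
awk 'BEGIN { m = 30*24*60; printf "99.9%%: %.2f   99.99%%: %.2f   99.999%%: %.2f\n", m*0.001, m*0.0001, m*0.00001 }'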

Of course we would all love to achieve 5-nines availability, and do so on a reasonable budget for a small-to-medium-sized business (SMB). I do believe it is possible, and today I have set myself the challenge of proving it!

The challenge: Create a website that will achieve 5-nines uptime.

The budget: $0. I’ll be using my own servers, and 100% free/open source software. If you had to pay for the infrastructure, you could probably do this for $300-$400 per month (+ the cost of implementation if you don’t have access to the necessary skills in-house).

The challenge criteria:

  • The site must achieve 5-nines availability
  • Must be a somewhat realistic use case, i.e. contain user-generated content, not just some static files
  • Be possible to implement without going bankrupt in the process

The requirements:

  • Some geographically disparate servers with independently routable IPs. I just so happen to have a few handy as I run a hosting company, Anu Internet Services
  • Reliable hardware, preferably with SSD or RAID storage (disks fail a lot)
  • Reliable, redundant internet connections, preferably in major datacenters and transit hubs
  • A reliable, secure operating system
  • DNS must not be a single point of failure, so we’ll host this ourselves (see the zone sketch after this list)
  • Robust file storage
  • A Web server
  • A database server which can handle multi-master replication and recover from a split-brain scenario
  • A web site
  • File replication to sync any changed files across the servers
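
To make the DNS point concrete, here’s a minimal sketch of the kind of zone I have in mind: each server runs its own nameserver, and www resolves to all 3 IPs so a client can fail over between them. The IPs below are documentation-range placeholders rather than the real addresses, and the SOA timers are just sensible-looking defaults:

; cwik.ch zone sketch -- placeholder IPs and timers
$TTL 300
@    IN  SOA  ie.cwik.ch. hostmaster.cwik.ch. ( 2011032001 3600 600 604800 300 )
     IN  NS   ie.cwik.ch.
     IN  NS   nl.cwik.ch.
     IN  NS   us.cwik.ch.
ie   IN  A    192.0.2.1
nl   IN  A    192.0.2.2
us   IN  A    192.0.2.3
www  IN  A    192.0.2.1
www  IN  A    192.0.2.2
www  IN  A    192.0.2.3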

The servers: I’ll be creating 3 new virtual machines, one each in Dublin, Ireland; Amsterdam, The Netherlands; and Chicago, USA. I’ll call them ie.cwik.ch, nl.cwik.ch and us.cwik.ch. All of the servers are based on Intel Xeon architecture with hardware virtualization. We use hardware RAID (from 3ware) in all of the servers to ensure a disk failure doesn’t knock us offline. The virtual machines run under Xen, which acts as a bridge between the VM network interface and the physical connection.

The connections: each of the 3 datacenters provides 100+ megabits of redundant connectivity to our servers. The Irish datacenter has a single uplink from the server directly to the core switches. The US and Amsterdam datacenters both have redundant switches with redundant uplinks to the core. In Amsterdam, the connections are active/active, providing 2Gbps access to a 10Gbps core, directly connected to AMS-IX, the largest internet exchange in the world. All combined, I don’t think we could get much better connectivity.

The software:

  • OS: CentOS 5 – it’s reliable, secure and free
  • DNS: we’ll go with the standard, BIND
  • Storage: let’s keep it simple and stick with the journaled ext3 filesystem that CentOS 5 ships with
  • Web server: Apache 2.2 – I know it well and know it’s up to the job. lighttpd also warrants a look.
  • Database server: MySQL 5.1 fits the multi-master requirement, but on its own it cannot recover from a split-brain scenario. Some careful server configuration and primary key conventions should solve this problem (see the sketch after this list).
  • File replication: we’ll use rsync over ssh for this
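
One standard way to implement those primary key conventions is MySQL’s auto-increment offset settings, so that rows inserted on different masters can never collide on an auto-generated key. A minimal my.cnf sketch for the Irish server (the server-ids and offsets are illustrative; the other two servers would use 2 and 3):

# /etc/my.cnf excerpt for ie.cwik.ch -- nl and us get server-ids/offsets 2 and 3
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
auto_increment_increment = 3
auto_increment_offset    = 1

And the file replication really is just rsync over ssh run on a schedule, something along the lines of (paths are illustrative):

rsync -az -e ssh /var/www/html/ nl.cwik.ch:/var/www/html/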

The site: This blog is a prime candidate. It’s a fairly basic, straightforward WordPress install, with some user-generated content and a web-based admin. And, it’s on a domain I don’t use for anything critical, so I can experiment without any collateral damage!

The testing: We’ll configure 2 external monitoring services: Site24x7 and SiteUptime. They will perform checks at 1 and 2 minute intervals respectively and track any failed requests. Combined, they will check the site 64,800 times per month. If we miss a single check in a one-month period, that works out to 0.00154% downtime, meaning we’ve missed our 5-nines target (but we’d be comfortably within 4-nines at 99.9985%).
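
For the record, the check-count arithmetic (again assuming a 30-day month):

# checks per 30-day month: every minute, plus every 2 minutes
echo $(( 43200 + 43200/2 ))                     # 64800
# one missed check as a percentage of all checks
awk 'BEGIN { printf "%.5f%%\n", 100/64800 }'    # ~0.00154%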

I’m excited to see if I can pull this off. I plan to blog about each step in the process of getting the infrastructure set up, configuring the software, and ultimately posting some test results after the setup has been running for a month. I may even leave it running permanently as a long-term project, as there will undoubtedly be further related topics to blog about. The first that comes to mind is how to handle things like upgrading WordPress without causing any downtime (no, I don’t yet know how I would go about doing that…).

Stay tuned for updates, and feel free to contact me with any suggestions or questions.
