The 100% Uptime Challenge

March 20, 2011 under The 100% Uptime Challenge

A question I get asked on a fairly regular basis goes something like this: I have a really important website that will cause me no end of grief should it ever go offline, even for 5 minutes. How do I keep it online? The typical scenarios include e-commerce sites and hosted apps (SaaS), as these types of site can have a direct financial impact on the operator if there is any downtime at all.

The conversation usually arises after an incident of downtime, in the reflective post-damage-control stage. One of the first questions is inevitably: how valuable is uptime to you, really? When you research the additional infrastructure and system complexity required to provide an extra 0.1% uptime, you may realise that it’s simply not worth it to your business. First, let’s take a quick look at how much downtime each of those uptime percentages actually allows:

3-nines or 99.9% uptime: 43m12s downtime per month
4-nines or 99.99% uptime: 4m19s downtime per month
5-nines or 99.999% uptime: 0m26s downtime per month
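
These figures assume a 30-day month. As a quick sanity check, here is a minimal Python sketch of the arithmetic; the function name `downtime_budget` is just something I made up for illustration:

```python
def downtime_budget(uptime_percent, days=30):
    """Return the allowed downtime per month as (minutes, seconds).

    Assumes a month of `days` days (30 by default).
    """
    total_seconds = days * 24 * 60 * 60
    allowed_seconds = round(total_seconds * (100 - uptime_percent) / 100)
    return divmod(allowed_seconds, 60)


for label, pct in [("3-nines", 99.9), ("4-nines", 99.99), ("5-nines", 99.999)]:
    minutes, seconds = downtime_budget(pct)
    print(f"{label} or {pct}% uptime: {minutes}m{seconds:02d}s downtime per month")
```

Passing a different `days` value shows how the budget shifts slightly for 28- or 31-day months.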

Of course we would all love to achieve 5-nines availability, and do so on a reasonable budget for a small to medium-sized business (SMB). I do believe it is possible, and today I have set myself the challenge of proving it!

The challenge: Create a website that will achieve 5-nines uptime.

The budget: $0. I’ll be using my own servers, and 100% free/open source software. If you had to pay for the hosting, you could probably do this for $300-$400 per month (+ the cost of implementation if you don’t have access to the necessary skills in-house).

The challenge criteria:

  • The site must achieve 5-nines availability
  • Must be a somewhat realistic use case, i.e. contain user-generated content, not just some static files
  • Be possible to implement without going bankrupt in the process

The requirements:

  • Some geographically disparate servers with independently routable IPs. I just so happen to have a few handy as I run a hosting company, Anu Internet Services
  • Reliable hardware, preferably with SSD or RAID storage (disks fail a lot)
  • Reliable, redundant internet connections, preferably in major datacenters and transit hubs
  • A reliable, secure operating system
  • DNS must not be a single point of failure, so we’ll host this ourselves
  • Robust file storage
  • A Web server
  • A database server which can handle multi-master replication and recover from a split-brain scenario
  • A web site
  • File replication to sync any changed files across the servers

The servers: I’ll be creating 3 new virtual machines, one each in Dublin, Ireland; Amsterdam, The Netherlands; and Chicago, USA. I’ll call them ie.cwik.ch, nl.cwik.ch and us.cwik.ch. All of the servers are based on Intel Xeon architecture with hardware virtualization. We use hardware RAID (from 3ware) in all of the servers to ensure a disk failure doesn’t knock us offline. The virtual machines run under Xen, which acts as a bridge between the VM network interface and the physical connection.

The connections: Each of the 3 datacenters provides 100+ megabits of redundant connectivity to our servers. The Irish datacenter has a single uplink from the server directly to the core switches. The US and Amsterdam datacenters both have redundant switches with redundant uplinks to the core. In Amsterdam, the connections are active/active, providing 2Gbps access to a 10Gbps core, directly connected to AMS-IX, one of the largest internet exchanges in the world. All combined, I don’t think we could get much better connectivity.

The software:

  • OS: CentOS 5 – it’s reliable, secure and free
  • DNS: we’ll go with the standard, BIND
  • Storage: let’s keep it simple and stick with the journaled ext3 filesystem that CentOS 5 ships with
  • Web server: Apache 2.2 – I know it well and know it’s up to the job. lighttpd also warrants a look.
  • Database server: MySQL 5.1 fits the multi-master requirement, but on its own it cannot recover from a split-brain scenario. Some careful server configuration and primary key conventions should solve this problem.
  • File replication: we’ll use rsync over ssh for this
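
To give an idea of what a primary key convention could look like: I haven’t settled on the details in this post, but one common approach (an assumption on my part, not the final configuration) is to set MySQL’s `auto_increment_increment` to the number of masters and give each master its own `auto_increment_offset`, so no two servers ever hand out the same ID. Here is a small Python simulation of the resulting ID sequences; the helper name and the sample size of 5 are purely illustrative:

```python
from itertools import count, islice


def auto_increment_ids(offset, increment):
    """Yield the IDs one master would generate: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


# The three servers from this post; the offsets 1..3 are an assumed convention.
servers = ["ie.cwik.ch", "nl.cwik.ch", "us.cwik.ch"]
increment = len(servers)  # would correspond to auto_increment_increment

ids = {
    name: list(islice(auto_increment_ids(offset, increment), 5))
    for offset, name in enumerate(servers, start=1)
}

for name, values in ids.items():
    print(name, values)

# No ID is ever generated by more than one server.
all_ids = [i for values in ids.values() for i in values]
assert len(all_ids) == len(set(all_ids))
```

Note that this only prevents colliding auto-increment keys on inserts. Two servers updating the same existing row during a split would still conflict, so it is a partial answer at best.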

The site: This blog is a prime candidate. It’s a fairly basic, straightforward WordPress install, with some user-generated content and a web-based admin. And, it’s on a domain I don’t use for anything critical, so I can experiment without any collateral damage!

The testing: We’ll configure 2 external monitoring services: site24x7 and siteuptime. They will perform checks at 1- and 2-minute intervals respectively and track any failed requests. Combined, they will check the site 64,800 times in a 30-day month. If we miss a single check in a one-month period, that will count as 0.00154% downtime, which is more than the 0.001% that 5-nines allows, meaning we’ve missed our target (but we’d be comfortably within 4-nines at 99.9985%).
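
For anyone who wants to check the monitoring arithmetic, here is a short Python sketch. It assumes a 30-day month and treats every check as an equally weighted sample of availability; the names are my own, for illustration only:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def checks_per_month(interval_minutes):
    """Number of checks a monitor makes in a month at the given interval."""
    return MINUTES_PER_MONTH // interval_minutes


# One monitor checks every minute, the other every 2 minutes.
total_checks = checks_per_month(1) + checks_per_month(2)

# Effect of a single failed check on the measured figures.
missed_fraction = 1 / total_checks
measured_uptime = 100 * (1 - missed_fraction)

print(f"checks per month: {total_checks}")
print(f"one missed check: {100 * missed_fraction:.5f}% downtime")
print(f"measured uptime:  {measured_uptime:.4f}%")
print(f"5-nines achieved: {measured_uptime >= 99.999}")
```

In other words, with this monitoring setup the only way to report 5-nines for the month is to pass every single check.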

I’m excited to see if I can pull this off. I plan to blog about each step in the process of getting the infrastructure set up, configuring the software, and ultimately posting some test results after the setup has been running for a month. I may even leave it running permanently as a long-term project, as there will undoubtedly be further related topics to blog about. The first that comes to mind is how to handle things like upgrading WordPress without causing any downtime (no, I don’t know yet how I would go about doing that…)

Stay tuned for updates, and feel free to contact me with any suggestions or questions.
