Reliability Vs Fault Tolerance

March 21, 2011 under The 100% Uptime Challenge

When it comes to increasing uptime, there are two schools of thought. The first is to consider every single element along the request path and make each step as reliable as possible.

At a high level, we can illustrate the path of an HTTP request through a web app system roughly as follows:

User Agent (browser) makes HTTP request -> DNS resolution -> TCP/IP connection is established with the web server -> web server parses HTTP request -> web server hands off request to CGI of some sort (e.g. PHP) -> CGI reads file from filesystem -> CGI processing -> web server receives content back from CGI -> web server sends reply to UA -> UA renders page

This is a somewhat simplified overview; the idea is just to show broadly what happens each time you click a link in your browser. Along this path we encounter the following potential pitfalls:

  • DNS resolution failure – no DNS servers send a correct reply
  • TCP/IP connection failure – link down
  • Web server failure – server busy, misconfiguration, etc
  • File storage failure
  • CGI failure – no reply received or invalid response
  • Request timeout – web server received content from CGI, but client has gone away (timed out?)
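
To get a feel for why every step matters, here is a rough sketch of how the availability of a serial chain multiplies out. The per-step figures are made-up assumptions for illustration (not measurements), and the calculation assumes the steps fail independently of one another.

```python
from math import prod

# Hypothetical per-step availabilities (fraction of time each step works).
# These numbers are illustrative assumptions, not measurements.
STEP_AVAILABILITY = {
    "DNS resolution": 0.9999,
    "TCP/IP connection": 0.999,
    "Web server": 0.999,
    "File storage": 0.9995,
    "CGI": 0.995,
}


def chain_availability(steps):
    """Availability of a serial chain: every step must succeed."""
    return prod(steps.values())


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    overall = chain_availability(STEP_AVAILABILITY)
    print(f"Overall availability: {overall:.4%}")
    print(f"Expected downtime: {downtime_hours_per_year(overall):.1f} hours/year")
```

Because the figures are multiplied together, the chain as a whole is always less available than its weakest step.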

I’d like to look at how we might decrease the likelihood of encountering any of these pitfalls.

DNS: The domain name system is designed from the ground up to be resilient against a single failure or even multiple failures. All domain names must have multiple authoritative name servers listed, and resolvers know to keep trying other name servers until they get a good reply. With a little care to ensure all your name servers have a copy of the master data for your zone, this one is simple to solve.

TCP/IP failure: This one is more complex, as each TCP request traverses a number of different systems. It will start by coming in to your network via a WAN link. Right there you have potential failure point #1: a single WAN link which, if down, will render your server unavailable. The solution is to house your server in a facility with multiple redundant WAN links, and the proper equipment and configuration to advertise these redundant routes via BGP. BGP is the Border Gateway Protocol, used by Internet routers around the world to advertise the routes by which their networks can be reached. If one link goes down, the BGP router will stop advertising that particular route, and continue receiving traffic via the other available connections.

Once the TCP request is safely within your network, you then have the possibility of a cable or switchport failure disrupting the connection. Again there is a solution at the network level: 802.3ad link aggregation. This is configured at the switch level and instructs the switch that multiple cables/ports may be used to reach a destination MAC address. The same standard can also be used to connect multiple network ports on a server to multiple 802.3ad-aware switches (provided the switches support aggregating links across more than one chassis), thereby creating a fully fault tolerant physical connection path between the network core and the server.

Web server failure: The most common cause of a failure in this instance would be a crashed or stopped web server. This is most often caused by misconfiguration or lack of resources. To mitigate this risk, careful attention should be paid to the selection of server software, the configuration of the software, and the matching of server resources to expected load. In many cases it will make sense to run multiple web servers behind a load balancer. While this adds another potential failure point to the request process, it can also help to mitigate failure by only directing incoming requests to responsive web servers. Think of it like RAID for Web servers.
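
As a minimal sketch of the "only direct requests to responsive servers" idea, the following rotates through a list of backends and skips any that fail a health check. The class name, the backend names and the health check callable are all hypothetical; a real load balancer would probe its servers over the network rather than consult a set.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any that fail a health check."""

    def __init__(self, backends, is_healthy):
        self._backends = list(backends)
        self._is_healthy = is_healthy  # callable: backend -> bool
        self._rotation = cycle(self._backends)

    def pick(self):
        # Try each backend at most once per call, continuing from where
        # the previous call left off in the rotation.
        for _ in range(len(self._backends)):
            backend = next(self._rotation)
            if self._is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    down = {"web2"}  # pretend web2 has crashed
    balancer = RoundRobinBalancer(["web1", "web2", "web3"], lambda b: b not in down)
    print([balancer.pick() for _ in range(4)])
    # prints ['web1', 'web3', 'web1', 'web3']
```

Note that the balancer itself is still a single point of failure in this sketch, which is exactly the trade-off described above.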

Storage failure: Your code has to be stored somewhere, and if that storage becomes unavailable, the web server and/or CGI processes will stop working. While network-based storage can increase flexibility, it also adds another potential point of failure. Local disks are among the most common components to fail, so RAID storage should always be used to minimise this risk. Modern SSDs also offer potential relief from disk failures. With no moving parts, they avoid the mechanical failures of traditional spinning disks, and they are faster too.

CGI failure: This is where by far the most failures occur. Ever seen a 500 Internal Server Error or a 503 Service Unavailable error? This is almost certainly caused by a CGI failure. Careful choice of software and proper configuration can help ensure reliability, but the biggest component to consider here is your code. As with web server failure, one of the best ways to mitigate this risk is to run multiple CGI servers.
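
To illustrate why running multiple servers helps, here is a small back-of-the-envelope calculation. It assumes the servers fail independently and that any one working server is enough to serve requests; the 99% figure is a made-up example, not a measurement.

```python
def redundant_availability(single, copies):
    """Availability of `copies` independent servers where one is enough.

    The group is only down when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical server that is up 99% of the time.
    for copies in (1, 2, 3):
        print(f"{copies} server(s): {redundant_availability(0.99, copies):.6f}")
```

In practice failures are rarely fully independent (a bad deploy can take down every server at once), so treat the result as an upper bound.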

Request timeout: There are two reasons a request might time out. Firstly, the server may be overloaded, and unable to send a reply before the connection is dropped. This problem can be solved by adding more capacity (a faster server, or more servers). The second reason is poor programming and/or design. Any process which you expect will take more than a couple of seconds to complete should probably be run asynchronously, with a message returned to the end user informing them of this (e.g. a nice progress bar). This will prevent the connection from timing out and keep your users informed of what’s happening.
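
One way the asynchronous approach might look, as a minimal sketch: the request handler starts the slow work in a background thread and returns a job ID straight away, and the page then polls for progress. The function names and the in-memory job table are assumptions for illustration; a production setup would more likely use a proper job queue.

```python
import threading
import time
import uuid

_jobs = {}  # job_id -> {"progress": 0..100, "done": bool}
_lock = threading.Lock()


def _run(job_id, steps, work):
    # Do the slow work one step at a time, recording progress as we go.
    for i in range(steps):
        work(i)
        with _lock:
            _jobs[job_id]["progress"] = (i + 1) * 100 // steps
    with _lock:
        _jobs[job_id]["done"] = True


def start_job(steps, work):
    """Start the work in the background and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"progress": 0, "done": False}
    threading.Thread(target=_run, args=(job_id, steps, work), daemon=True).start()
    return job_id


def job_status(job_id):
    """Return a snapshot of the job's progress (what a progress bar would poll)."""
    with _lock:
        return dict(_jobs[job_id])


if __name__ == "__main__":
    job = start_job(5, lambda step: time.sleep(0.2))  # stand-in for slow work
    while not job_status(job)["done"]:
        print(f"progress: {job_status(job)['progress']}%")
        time.sleep(0.1)
    print("finished")
```

The original request returns as soon as start_job does, so nothing is left waiting long enough to time out.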

It is important to look carefully at the request path for your own application, as each site or web app will be a little different. For example, most apps use a database backend, which I did not include in the above example.

So the first school of thought in optimising availability is to carefully consider each and every step in the request process and try to make sure there is as much redundancy and as few potential failures as possible in that chain. The second approach is to resign yourself to the fact that failures do happen, no matter how carefully you try to avoid them, and to acknowledge that you need to handle those failures gracefully whenever they do arise.

DNS has a handy feature built in whereby a name is allowed to have multiple address (A) records, so a single lookup can return multiple IP addresses. This technique can be used to indicate to the User Agent (browser) that multiple servers are available to handle the request. The UA will pick one IP from the list returned (name servers commonly rotate the order between replies, which spreads requests across the servers), and send the HTTP request to it. Should a connection failure occur, the browser will automatically try the next IP in the list, until it finds a server which works. By utilising this technique, we can create fault tolerance all the way up to the very start of the HTTP request process, the User Agent.
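
The client-side behaviour described above can be sketched roughly as follows. This is not how any particular browser is implemented; it is a simplified illustration using Python's standard socket module, and the helper names are my own. The connect parameter exists only so the logic can be exercised without a network.

```python
import socket


def resolve_all(host, port):
    """Return every address the resolver lists for host, in the order returned."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4] for info in infos]  # each entry is a sockaddr tuple


def connect_first_working(addresses, timeout=3.0, connect=socket.create_connection):
    """Try each address in turn and return the first connection that succeeds."""
    last_error = None
    for address in addresses:
        try:
            # address[:2] is (ip, port) for both IPv4 and IPv6 sockaddr tuples.
            return connect(address[:2], timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and move on to the next IP
    raise ConnectionError(f"all addresses failed: {last_error}")


# Example usage (needs network access; the hostname is a placeholder):
#   sock = connect_first_working(resolve_all("example.com", 80))
#   sock.close()
```

As long as at least one of the listed IPs accepts the connection, the failure of the others is invisible to the user apart from a short delay.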

A quick DNS lookup shows, and are all using this technique, and this is the technique I will be deploying in my quest to reach 100% uptime for this blog.