A Few More Notes On Using Multiple ‘A’ Records

March 22, 2011 under The 100% Uptime Challenge

Before I dig into the technical posts about setting up the servers, here are a few more theoretical notes on the multiple A record technique. I’ve done some tests simulating failure scenarios to see how the browser really reacts to an unreachable server.

The connection refused test: This is where your browser would normally display an error along the lines of “Safari can’t open the page “http://1.2.3.4/” because Safari can’t connect to the server “1.2.3.4””. This kind of error is displayed when the connection request to port 80 of the server’s IP address is refused. That can happen if your web server is down due to a crash, maintenance (installing a new SSL certificate, for example), an attack that leaves it too busy to answer, or simply too much legitimate traffic.

To test how the browser reacts to this scenario, I set up a simple PHP page which prints the current server’s IP. My browser picked Ireland (at random), so the IP 80.93.25.175 was displayed. Next, I logged in to the ie.cwik.ch server and modified the firewall ruleset: I commented out the exception for port 80, then changed the action from DROP to REJECT, so my browser would get a connection refused response when trying to open the HTTP connection. I then went back to my browser and hit reload. The same script came back instantly, except with a different IP: 83.96.156.169 – it had picked my Dutch server, nl.cwik.ch. Success! I checked my secure log on ie.cwik.ch and noted that my browser had in fact tried an HTTP connection to ie.cwik.ch prior to hitting the nl.cwik.ch server:

ie kernel: IN=eth0 OUT= MAC=00:16:3e:3a:21:7f:00:1b:0d:e6:10:40:08:00 SRC=<my secret IP>
DST=80.93.25.175 LEN=48 TOS=0x00 PREC=0x00 TTL=54 ID=25893 DF PROTO=TCP SPT=54220
DPT=80 WINDOW=65535 RES=0x00 SYN URGP=0
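For reference, the live-command equivalent of that ruleset change looks roughly like this (the chain name and exact rule form are assumptions, not a copy of my actual ruleset):

# Remove the port 80 exception and actively reject new HTTP connections
# instead of silently dropping them
iptables -D INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset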

The connection timeout test: There are many instances in which a connection will be refused, as outlined above. A less common scenario, however, is a network failure, whereby the connection between the browser and the server goes down. This can be a large-scale peering or transit provider outage, or something localised within the datacenter, for example a broken cable or an ethernet loop. The end result is that the browser no longer gets any response at all from the server, and eventually displays a message along the lines of “Safari can’t open the page “http://1.2.3.4/” because the server where this page is located isn’t responding”.

To test how my multiple A record configuration responds in this situation, I simulated disconnecting the ethernet cable from my virtual machine:

[root@ie1-xen-0 ~]# xm network-detach ie.cwik.ch 0

This time the results were different. Since the browser had no idea if the server would ever respond, it sat for a full 75 seconds waiting. After 75 seconds it gave up and tried the next server, which instantly returned the test page.

Reading up a bit, I discovered the 75 second timeout is not a browser configuration issue but rather an operating system level setting. All the browsers I tested on Mac OS X took 75 seconds to time out, but on Windows XP the timeout was just 25 seconds. On CentOS 5 (desktop, with Firefox) the timeout was just 3 seconds before it tried the next server.
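On Linux, one of the relevant knobs is the kernel’s SYN retry count, which controls how many times an unanswered connection attempt is retransmitted before connect() gives up. A hedged example (the exact default and the resulting timeout vary by kernel version, and browsers may impose their own limits on top):

# Show the current number of SYN retransmissions before connect() gives up
sysctl net.ipv4.tcp_syn_retries
# Lower it so the browser fails over to the next A record sooner
sysctl -w net.ipv4.tcp_syn_retries=3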

While those delays are substantial, particularly on Mac OS X, the page did eventually load. Total network outages like this are relatively rare, and if your users can put up with the initial delay, they will eventually get a result from one of the servers. With some additional work it would even be possible to detect an outage like this at the server level and send an HTTP redirect back to the browser once it finally hits an active server, thereby avoiding sending subsequent requests to the failed server. Another technique which could be deployed to mitigate the effects of this kind of outage is to run a poller on each of the nameservers which checks the availability of each of the web servers. In the event that a web server is unreachable, its IP could be removed from the DNS zone, so no new clients will attempt to connect to it. Once the server regains its feet, the poller would notify the DNS server to add the IP back to the pool.
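A rough sketch of what such a poller could look like, assuming the zone accepts dynamic updates and the nameserver runs locally; the hostname, TTL and key file below are placeholders, not my actual configuration:

#!/bin/bash
# Check one web server over HTTP and add or remove its A record accordingly.
IP=83.96.156.169            # the nl.cwik.ch web server
ZONE=cwik.ch
NAME=www.$ZONE.             # hostname the site is served from (placeholder)
KEY=/etc/named/poller.key   # TSIG key permitted to update the zone (placeholder)

if curl --silent --max-time 5 --output /dev/null http://$IP/; then
    ACTION="update add $NAME 60 A $IP"     # server answered: ensure it is in the pool
else
    ACTION="update delete $NAME A $IP"     # no answer: pull it out of the pool
fi

nsupdate -k $KEY <<EOF
server 127.0.0.1
zone $ZONE
$ACTION
send
EOF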


Reliability Vs Fault Tolerance

March 21, 2011 under The 100% Uptime Challenge

When it comes to increasing uptime, there are two schools of thought. The first is to consider every single element along the request path and make each and every step as reliable as possible.

At a high level, we can illustrate the path of an HTTP request through a web app system roughly like this:

User Agent (browser) makes HTTP request -> DNS resolution -> TCP/IP connection is established with the Web server -> web server parses HTTP request -> Web server hands off request to CGI of some sort (eg. PHP) -> CGI reads file from filesystem -> CGI processing -> Web server receives content back from CGI -> Web server sends reply to UA -> UA renders page

This is a somewhat simplified overview; the idea is just to show broadly what happens each time you click a link in your browser. Along this path we encounter the following potential pitfalls:

  • DNS resolution failure – no DNS servers send a correct reply
  • TCP/IP connection failure – link down
  • Web server failure – server busy, misconfiguration, etc
  • File storage failure
  • CGI failure – no reply received or invalid response
  • Request timeout – web server received content from CGI, but client has gone away (timed out?)

I’d like to look at how we might decrease the likelihood of encountering any of these pitfalls.

DNS: The domain name system is designed from the ground up to be resilient against a single failure, or even multiple failures. Every domain name must have multiple nameservers listed, and resolvers know to keep trying the other nameservers until they get a good reply. With a little care to ensure all your nameservers have a copy of the master data for your zone, this one is simple to solve.
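It is worth verifying this from time to time, for example by asking every listed nameserver for the zone’s SOA serial and checking that they all match (cwik.ch here stands in for your own zone):

# List the nameservers for the zone, then query each one directly
dig +short NS cwik.ch
for ns in $(dig +short NS cwik.ch); do
    echo -n "$ns: "
    dig +short SOA cwik.ch @$ns
done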

TCP/IP failure: This one is more complex, as each TCP connection traverses a number of different systems. It will start by coming in to your network via a WAN link. Right there you have potential failure point #1: a single WAN link which, if down, will render your server unavailable. The solution is to house your server in a facility with multiple redundant WAN links, and the proper equipment and configuration to advertise these redundant routes via BGP. BGP, the Border Gateway Protocol, is used by Internet routers around the world to advertise the routes by which networks can be reached. If one link goes down, the routes advertised over it are withdrawn and traffic continues to arrive via the other available connections.

Once the TCP connection is safely within your network, you then have the possibility of a cable or switchport failure disrupting it. Again there is a network layer solution for this: 802.3ad link aggregation. This is configured at the switch level and tells the switch that multiple cables/ports may be used to reach a destination MAC address. The same standard can also be used to connect multiple network ports on a server to multiple 802.3ad-aware switches, creating a fully fault-tolerant physical path between the network core and the server.
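On the server side, a CentOS-style bonding configuration for two ports in an 802.3ad aggregate might look roughly like this (interface names and addressing are placeholders, and the switch ports must be configured for LACP as well):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100"
IPADDR=10.0.0.10
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes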

Web server failure: The most common cause of a failure in this instance would be a crashed or stopped web server. This is most often caused by misconfiguration or lack of resources. To mitigate this risk, careful attention should be paid to the selection of server software, the configuration of the software, and the matching of server resources to expected load. In many cases it will make sense to run multiple web servers behind a load balancer. While this adds another potential failure point to the request process, it can also help to mitigate failure by only directing incoming requests to responsive web servers. Think of it like RAID for Web servers.
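To illustrate the idea, here is a minimal HAProxy-style sketch which health-checks each backend and only sends requests to servers that respond; the names and addresses are placeholders, not my actual setup:

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend http-in
    bind *:80
    default_backend webservers

backend webservers
    balance roundrobin
    option httpchk GET /
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check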

Storage failure: Your code has to be stored somewhere, and if that storage becomes unavailable, the web server and/or CGI processes will stop working. While network-based storage can increase flexibility, it also adds another potential point of failure. Local disks are among the most common components to fail, so RAID storage should always be used to minimise this risk. Modern SSDs also offer potential relief from disk failures: with no moving parts, they are far more reliable than traditional spinning disks, and faster too.
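For example, a simple two-disk software RAID 1 mirror under Linux looks like this (the device names are assumptions; hardware RAID achieves the same thing at the controller level):

# Mirror two disks with Linux software RAID and watch the initial sync
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
cat /proc/mdstat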

CGI failure: This is where by far the most failures occur. Ever seen a 500 Internal Server Error or a 503 Service Unavailable? It was almost certainly caused by a CGI failure. Careful choice of software and proper configuration can help ensure reliability, but the biggest factor to consider here is your own code. As with web server failure, one of the best ways to mitigate this risk is to run multiple CGI servers.

Request timeout: There are two reasons a request might time out. Firstly, the server may be overloaded and unable to send a reply before the connection is dropped. This problem can be solved by adding more capacity (a faster server, or more servers). The second reason is poor programming and/or design. Any process which you expect to take more than a couple of seconds to complete should probably be run asynchronously, with a message returned to the end user informing them of this (e.g. a nice progress bar). This prevents the connection from timing out and keeps your users informed of what’s happening.

It is important to look carefully at the request path for your own application, as each site or web app will be a little different. For example, most apps use a database backend, which I did not include in the above example.

So the first school of thought in optimising availability is to carefully consider each and every step in the request process, and to build in as much redundancy and as few potential points of failure as possible along that chain. The second approach is to accept that failures do happen, no matter how carefully you try to avoid them, and to acknowledge that you need to handle those failures gracefully whenever they arise.

DNS has a handy built-in feature whereby a single hostname may carry multiple address (A) records, so one lookup returns several IP addresses. This technique can be used to indicate to the User Agent (browser) that multiple servers are available to handle the request. The UA will pick one IP at random from the list returned and send the HTTP request to it. Should a connection failure occur, the browser will automatically try the next IP in the list until it finds a server which works. By utilising this technique, we can create fault tolerance all the way up to the very start of the HTTP request process: the User Agent.
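In the zone file this is as simple as listing the same name once per server. Using the Irish and Dutch IPs from the earlier tests, with an assumed hostname and TTL, it would look something like:

www    60    IN    A    80.93.25.175
www    60    IN    A    83.96.156.169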

A quick DNS lookup shows www.google.com, www.yahoo.com and www.reddit.com are all using this technique, and this is the technique I will be deploying in my quest to reach 100% uptime for this blog.
