Before I dig into the technical posts about setting up the servers, here are a few more theoretical notes about the multiple A record technique. I’ve run some tests simulating failure scenarios to see how the browser really reacts to an unreachable server.
The connection refused test: This is where your browser would normally display an error along the lines of “Safari can’t open the page “http://184.108.40.206/” because Safari can’t connect to the server “220.127.116.11””. This kind of error is displayed when the connection request to port 80 of the server’s IP address is refused. This can happen if your Web server is down due to a crash, maintenance (maybe you’re installing a new SSL cert, for example), an attack (too busy) or simply too much traffic.
To test how the browser reacts to this scenario, I set up a simple PHP page which prints the current server’s IP. My browser picked Ireland (at random), so the IP 18.104.22.168 was displayed. Next, I logged in to the ie.cwik.ch server and modified the firewall ruleset. I commented out the exception for port 80, then changed the action from DROP to REJECT, so my browser would get a connection failed response when trying to open the HTTP connection. I then went back to my browser and hit reload. The same script came back instantly, except with a different IP: 22.214.171.124 – it had picked my Dutch server, nl.cwik.ch. Success! I checked my secure log on ie.cwik.ch and noted that my browser had in fact tried an HTTP connection to ie.cwik.ch prior to hitting the nl.cwik.ch server:
ie kernel: IN=eth0 OUT= MAC=00:16:3e:3a:21:7f:00:1b:0d:e6:10:40:08:00 SRC=<my secret IP> DST=126.96.36.199 LEN=48 TOS=0x00 PREC=0x00 TTL=54 ID=25893 DF PROTO=TCP SPT=54220 DPT=80 WINDOW=65535 RES=0x00 SYN URGP=0
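The fallback seen in this test can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how any particular browser implements it: the function name `connect_first_available`, the `connect` parameter and the 192.0.2.x placeholder addresses are all made up for illustration.

```python
import socket


def connect_first_available(addresses, port=80, timeout=5.0,
                            connect=socket.create_connection):
    """Try each address in order and return (address, connection) for the
    first one that accepts. Hypothetical helper, for illustration only."""
    errors = []
    for addr in addresses:
        try:
            # A REJECT rule makes this fail immediately with a
            # "connection refused" error, so the loop moves on at once.
            conn = connect((addr, port), timeout=timeout)
            return addr, conn
        except OSError as exc:  # refused, unreachable and timed out all land here
            errors.append((addr, exc))
    raise ConnectionError(f"all addresses failed: {errors}")


if __name__ == "__main__":
    # Stub connector so the demo runs without touching the network:
    # the first placeholder address refuses, the second accepts.
    def stub_connect(address, timeout=None):
        host, _port = address
        if host == "192.0.2.1":
            raise ConnectionRefusedError("connection refused")
        return f"connection to {host}"

    print(connect_first_available(["192.0.2.1", "192.0.2.2"],
                                  connect=stub_connect))
```

With the default `connect` argument the function opens real TCP connections, and the caller is responsible for closing the returned socket.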
The connection timeout test: There are many instances in which a connection will be refused, as outlined above. However, another less common scenario is that of a network failure, whereby the connection between the browser and the server goes down. This can be a large scale peering or transit provider outage, or localised within the datacenter, for example a broken cable or an accidental ethernet loop. The end result is that the browser no longer gets any response at all from the server, and eventually displays a message along the lines of “Safari can’t open the page “http://188.8.131.52/” because the server where this page is located isn’t responding”.
To test how my multiple A record configuration responds in this situation, I simulated disconnecting the ethernet cable from my virtual machine:
[root@ie1-xen-0 ~]# xm network-detach ie.cwik.ch 0
This time the results were different. Since the browser had no idea if the server would ever respond, it sat for a full 75 seconds waiting. After 75 seconds it gave up and tried the next server, which instantly returned the test page.
Reading up a bit, I discovered the 75 second timeout is not a browser configuration issue but rather an operating system level setting. All the browsers I tested on Mac OS X took 75 seconds to time out, but on Windows XP the timeout was just 25 seconds. On CentOS 5 (desktop, with Firefox) the timeout was just 3 seconds before it tried the next server.
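To put those numbers in perspective, here is a bit of back-of-the-envelope arithmetic in Python. It rests on an assumption the tests above did not cover: that with more than one dead server in the pool, the client tries them one after another and waits the full timeout on each. The function name `worst_case_delay` is made up for this sketch.

```python
# Connect timeouts observed in the tests above, in seconds.
OBSERVED_TIMEOUTS = {
    "Mac OS X": 75,
    "Windows XP": 25,
    "CentOS 5": 3,
}


def worst_case_delay(dead_servers, timeout_seconds):
    """Seconds spent waiting if every dead server is tried before a live one.

    Assumes sequential attempts, each costing the full timeout.
    """
    return dead_servers * timeout_seconds


if __name__ == "__main__":
    for platform, timeout in OBSERVED_TIMEOUTS.items():
        delay = worst_case_delay(1, timeout)
        print(f"{platform}: up to {delay} s with one dead server")
```

With a single dead server this simply reproduces the measured timeouts; the point of the function is to show how quickly the wait could grow if the sequential assumption holds and several servers are down at once.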
While those delays are substantial, particularly on Mac OS X, the page did eventually load. Total network outages like this are relatively rare, and if your users can put up with the initial delay, they will eventually get a result from one of the servers. With some additional work it would even be possible to detect an outage like this at the server level and send an HTTP redirect back to the browser once it finally hits an active server, thereby avoiding sending subsequent requests to the failed server. Another technique which could be deployed to mitigate the effects of this kind of outage is to run a poller on each of the nameservers which checks the availability of each of the web servers. In the event that a web server is unreachable, the IP of that server could be removed from the DNS zone, so no new clients will attempt to connect to it. Once the server regains its feet, the poller would notify the DNS server to add the IP back to the pool.
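The poller idea could look something like the following Python sketch. It only covers the checking half: how the resulting list gets written into the DNS zone depends on the nameserver in use and is left out. The names `is_reachable` and `select_pool` are hypothetical, and keeping the full pool when every check fails is a design assumption of this sketch, not something tested above.

```python
import socket


def is_reachable(ip, port=80, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable or timed out
        return False


def select_pool(all_ips, probe=is_reachable):
    """Return the IPs that should remain in the DNS zone."""
    healthy = [ip for ip in all_ips if probe(ip)]
    # Assumption: if every probe fails, the poller itself is probably cut
    # off, so keep the full pool rather than publish an empty record set.
    return healthy if healthy else list(all_ips)


if __name__ == "__main__":
    # Stub probe so the demo runs offline; 192.0.2.x are placeholder IPs.
    down = {"192.0.2.1"}
    pool = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    print(select_pool(pool, probe=lambda ip: ip not in down))
```

Run on a schedule, the output of `select_pool` would be compared with the current zone contents, and the zone updated only when the two differ. A recovered server reappears in the list on the next pass without any extra logic.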