Isolating the cause of seemingly random webapp crashes (and identifying who is responsible for fixing it)

July 1, 2012 under Main

As a managed hosting provider, it is sometimes difficult to definitively draw the line between a customer’s problem and our problem. We are paid to provide a reliable infrastructure and platform on which customers deploy their code. What we’re not responsible for is the reliability of that code.

What happens, then, when a customer’s webapp starts crashing occasionally and seemingly at random?
In one recent case, the frequency of such crashes went from zero to as many as half a dozen per day over the course of a few months. The customer was pushing for us to fix the problem, but as far as I could tell the problem did not lie in the infrastructure or platform.

Out-of-memory errors suggested the problem might be a resource shortage, but doubling the RAM allocated to the virtual machine had no impact on the frequency of the errors. I began to suspect buggy code was causing a runaway condition. The problem I now faced was convincing an increasingly unhappy customer that the fault lay in their code.

To do this I needed evidence that the crashes were related to requests to URIs in their app.

This particular app is written in the Lasso programming language. One nice feature of Lasso is define_atbegin and define_atend, two tags which can be used to run code before and after a particular script. These can be defined globally, so that preprocessor and postprocessor scripts run for every request to the server. Using this feature, I installed a debugging system which is relatively simple in principle but surprisingly powerful.

At the start of each request, before any customer code is executed, the preprocessor script creates a record in a MySQL table containing details of the request: date and time, request URI including any GET params, client IP, client browser, and a column called ‘closed’ with a value of 0.

Once the customer code has finished executing, the postprocessor script updates the row to set the ‘closed’ column to 1, record the execution time (which helps identify slow-running scripts) and record the contents of the error stack. If the customer code crashes, the postprocessor never gets a chance to run, so ‘closed’ remains 0.
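
The original table and queries aren’t reproduced here, so here’s a minimal sketch of the idea in SQL – the table and column names are purely illustrative:

CREATE TABLE request_log (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    request_date DATETIME NOT NULL,
    uri VARCHAR(2048) NOT NULL,        -- request URI including GET params
    client_ip VARCHAR(15) NOT NULL,
    user_agent VARCHAR(255),
    exec_time DECIMAL(10,3),           -- seconds; set by the postprocessor
    error_stack TEXT,                  -- set by the postprocessor
    closed TINYINT NOT NULL DEFAULT 0  -- set to 1 by the postprocessor
);

-- Preprocessor: runs before any customer code
INSERT INTO request_log (request_date, uri, client_ip, user_agent)
VALUES (NOW(), '/some/page.lasso?id=5', '192.0.2.10', 'Mozilla/5.0 ...');

-- Postprocessor: only reached if the customer code finished
-- (assumes the pre- and postprocessor share a connection; otherwise
-- pass the row id along in a variable)
UPDATE request_log
SET closed = 1, exec_time = 0.132, error_stack = ''
WHERE id = LAST_INSERT_ID();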

The MySQL database can then be queried to retrieve a list of pages that never finished executing, pages that were slow to execute, or pages which finished but contained an error stack. This is all useful debugging info, but the first condition is the most valuable, as it shows exactly which URIs were requested but never completed – i.e. which pages are crashing the system!
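
Against the sketch table above, those three reports could be as simple as:

-- Pages that never finished executing (the crashers; very recent rows
-- may simply still be running)
SELECT request_date, uri, client_ip FROM request_log WHERE closed = 0;

-- Slow pages (the threshold is arbitrary)
SELECT uri, exec_time FROM request_log
WHERE closed = 1 AND exec_time > 5 ORDER BY exec_time DESC;

-- Pages that finished but logged an error stack
SELECT uri, error_stack FROM request_log
WHERE closed = 1 AND error_stack <> '';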

[Screenshot: the HTML display of the debugging data]

As the aim of this exercise was to highlight to the customer where they should be looking for errors, I quickly created an HTML display of the data from MySQL and put a password on it. The customer could then keep an eye on the various reports to check for slow pages, error stacks and pages that crashed.

Within a day of going live, the customer had identified the culprit code and uploaded a patched version. The webapp hasn’t crashed since.

While I can’t take credit for the original concept (Bil Corry came up with it many years ago) or even for writing the debugger code (I hired an enormously talented programmer to do that), I can claim to have successfully implemented this method to isolate the offending code. It’s a really great tool that not only helped fix a hard-to-find bug, but also clearly established that customer code, not the infrastructure or platform, was at fault. This made the customer happy, and a happy customer makes me happy!

Configuring File Replication

March 23, 2011 under The 100% Uptime Challenge

This is probably one of the easiest tasks. I already installed the rsync program earlier, and I’ve already got key-based SSH logins configured, so all I have to do now is install a cron job on each of the other 2 servers to pull the webroot from my nominated master (ie.cwik.ch):

crontab -e
* * * * * rsync -e ssh -a --delete ie:/var/www/html /var/www/
:wq

And we’re done!  To do a manual sync, just run

rsync -e ssh -a -v ie:/var/www/html /var/www/

I’ve installed this in my own user’s crontab, making sure my user account has the necessary permissions to read and write all the files in the webroot. While it’s probably best not to run the sync from root’s crontab, it can be run by any user who has the necessary privileges.

Multi-way file replication

So far I have only accounted for file writes on one master server, with updates being synchronised one-way to the other servers in my pool. Situations may also arise where file uploads need to be processed and the files stored on the servers. Here are two possible solutions to this problem:

Storing files in blobs: The blob type allows binary data to be stored in a MySQL table. Taking advantage of this, you could store uploaded files in a MySQL table instead of writing them to disk, letting MySQL replication (which is already running) take care of propagating the data across all servers.
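
As a sketch (table and column names are illustrative), such a table might look like:

CREATE TABLE uploads (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    uploaded_at DATETIME NOT NULL,
    contents LONGBLOB NOT NULL  -- the raw file bytes, replicated like any other row
);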

Unique filenames: Taking a lesson from our multi-master MySQL configuration, the same technique we used to ensure key uniqueness across the server pool could equally be applied as a file naming convention. In my example, file uploads could be prefixed with the location of the server. So a file uploaded to the Irish server would have an ie_ prefix, to the Dutch server would have nl_, or to the American server, us_. rsync could then be configured on each server to replicate the uploads directory to all the other servers in the pool.


Setting Up Multi-Master Circular Replication with MySQL

March 22, 2011 under The 100% Uptime Challenge

[Diagram: circular replication between the 3 servers]

The goal of this configuration is to allow each server to operate autonomously for a period of time should it ever lose contact with the other 2 servers, while doing everything I can to make sure there will be no conflicting database updates once connectivity resumes. Each server will act as both a master and a slave, with replication going in a circle. Each event in the binary log, which is written by the master and read by the slave, records the server ID of the MySQL server that originally executed it. Therefore a loop can never happen: once a query comes back around to its original master, that server recognises its own server ID and skips it.

I’ll attempt to prevent key conflicts by using the auto_increment_offset and auto_increment_increment configuration variables, to ensure AUTO_INCREMENT values will be unique to each server. This may also prove handy as I’ll be able to tell which server created a record, just by looking at the key value.
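
Judging by the test output further down, I’m using an increment of 10. The runtime equivalents of the my.cnf settings illustrate the idea:

-- auto_increment_increment is the same on every server;
-- auto_increment_offset is unique per server, so the sequences
-- interleave without ever colliding:
--   ie (offset 1): 1, 11, 21, ...
--   nl (offset 2): 2, 12, 22, ...
--   us (offset 3): 3, 13, 23, ...
SET GLOBAL auto_increment_increment = 10;
SET GLOBAL auto_increment_offset = 1;  -- 1 on ie, 2 on nl, 3 on us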

There is one remaining potential problem which I can’t guarantee won’t happen: the order of updates. Based on my lovely sketch of the replication circle, consider this scenario:

  • nl.cwik.ch loses connectivity to us.cwik.ch, interrupting replication
  • user A updates a post on nl.cwik.ch: UPDATE posts SET title='Some new title' WHERE id=123;
  • the update does not get replicated to us.cwik.ch because the link is down
  • later, user B updates the same post on us.cwik.ch: UPDATE posts SET title='Some other title' WHERE id=123;
  • user B’s update gets replicated around the circle to nl.cwik.ch
  • the connection between nl and us comes back up
  • the original update statement from nl.cwik.ch now gets replicated to the us and ie servers, overwriting user B’s change on both
  • we now have inconsistent data:
    • title on nl: Some other title
    • title on us: Some new title
    • title on ie: Some new title

It’s an unlikely scenario, but plausible. One potential workaround would be to design your application such that records never get updated, but instead create a new record for every change. You can then always select the newest record based on a timestamp column, so in the above scenario all servers would show ‘Some other title’ as the newest blog post title.
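
A sketch of that insert-only design (table and column names are made up):

-- Every change becomes a new row rather than an UPDATE:
CREATE TABLE post_revisions (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    post_id INT NOT NULL,
    title VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL
);

INSERT INTO post_revisions (post_id, title, created_at)
VALUES (123, 'Some other title', NOW());

-- The current title is simply the newest revision:
SELECT title FROM post_revisions
WHERE post_id = 123
ORDER BY created_at DESC LIMIT 1;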

Another possibility, should you consider the likelihood of this scenario limited for your application, would be to store a checksum of each row and compare the checksums across all the servers. Any inconsistencies could be flagged and/or an administrator notified.
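
MySQL gives you the whole-table version of this for free; a per-row check needs the application (or a query) to compute the checksums. For example:

-- Run on each server and compare the results:
CHECKSUM TABLE posts;

-- Or per row, checksumming the columns you care about
-- (the 'body' column here is assumed):
SELECT id, MD5(CONCAT_WS('|', title, body)) AS row_checksum FROM posts;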

I already installed mysql-server earlier, so let’s get down to the configuration!

Step 1: Enable the MySQL service and create a config file. I’ve posted my config as an example. This will need to be done on all 3 servers.

chkconfig mysqld on
vi /etc/my.cnf    (use my config as a starting point, or create your own -
                   not forgetting to set server_id and auto_increment_offset appropriately)
service mysqld start
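
Since the linked config may no longer be handy, here’s a minimal sketch of the replication-related parts of /etc/my.cnf (shown for ie; adjust the values on nl and us):

[mysqld]
server_id = 1                    # unique per server: 1 on ie, 2 on nl, 3 on us
log-bin = mysql-bin-ie           # mysql-bin-nl / mysql-bin-us on the others
log-slave-updates                # essential for circular replication: re-log
                                 # replicated statements so they travel on
                                 # around the circle
auto_increment_increment = 10
auto_increment_offset = 1        # 1 on ie, 2 on nl, 3 on us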

Step 2: Log in to your new MySQL server using ‘mysql -u root’

Step 3: Set your root password and configure your replication user (do this on each server):

USE mysql
DELETE FROM user WHERE user='';
UPDATE user SET password=PASSWORD('your new root password') WHERE user='root';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'80.93.25.175' IDENTIFIED BY 'somethinglongandrandom';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'83.96.156.169' IDENTIFIED BY 'somethinglongandrandom';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'67.202.99.74' IDENTIFIED BY 'somethinglongandrandom';
FLUSH PRIVILEGES;
SHOW MASTER STATUS;
+---------------------+----------+--------------+------------------+
| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------------+----------+--------------+------------------+
| mysql-bin-ie.000003 |      685 |              |                  |
+---------------------+----------+--------------+------------------+

Now on each server, use CHANGE MASTER to start slaving from the next server in the circle, until each server is both a master and a slave. Use the log file name and position shown by SHOW MASTER STATUS on the server you are slaving from.

On nl.cwik.ch: CHANGE MASTER TO MASTER_HOST='ie.cwik.ch', MASTER_USER='repl',
  MASTER_PASSWORD='somethinglongandrandom', MASTER_LOG_FILE='mysql-bin-ie.000003', MASTER_LOG_POS=685;

On us.cwik.ch: CHANGE MASTER TO MASTER_HOST='nl.cwik.ch', MASTER_USER='repl',
  MASTER_PASSWORD='somethinglongandrandom', MASTER_LOG_FILE='mysql-bin-nl.000003', MASTER_LOG_POS=685;

On ie.cwik.ch: CHANGE MASTER TO MASTER_HOST='us.cwik.ch', MASTER_USER='repl',
  MASTER_PASSWORD='somethinglongandrandom', MASTER_LOG_FILE='mysql-bin-us.000003', MASTER_LOG_POS=685;

Finally, run START SLAVE; on each of the 3 servers to begin replicating.

Step 4: Replication should now be up and running in a multi-master, circular configuration. Use ‘SHOW SLAVE STATUS\G’ on each server to check that it’s working properly. Check that Slave_IO_Running and Slave_SQL_Running both report Yes. Once it’s all working properly, create a test table to check that the AUTO_INCREMENT values are being created properly (with unique values per server):

mysql ie> USE test;
mysql ie> CREATE TABLE test (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, val VARCHAR(255) NOT NULL);
mysql ie> INSERT INTO test (val) VALUES ('I am Ireland');
mysql nl> INSERT INTO test (val) VALUES ('I am the Netherlands');
mysql us> INSERT INTO test (val) VALUES ('I am the USA');
mysql ie> INSERT INTO test (val) VALUES ('I am Dublin');
mysql nl> INSERT INTO test (val) VALUES ('I am Amsterdam');
mysql us> INSERT INTO test (val) VALUES ('I am Chicago');

Check the table contents and note the id values:

mysql> SELECT * FROM test;
+----+----------------------+
| id | val                  |
+----+----------------------+
|  1 | I am Ireland         |
|  2 | I am the Netherlands |
|  3 | I am the USA         |
| 11 | I am Dublin          |
| 12 | I am Amsterdam       |
| 13 | I am Chicago         |
+----+----------------------+
6 rows in set (0.00 sec)

Looking good! Note that, with an increment of 10, ids created by the Irish server always end in 1, those from the Dutch server in 2, and those from the American server in 3.


Setting Up the Servers

March 22, 2011 under The 100% Uptime Challenge

Now on to the fun part: setting up the servers!

I’ve set up 3 new Xen virtual machines, one each in Ireland, the Netherlands and the USA, with the following IPs and hostnames:

80.93.25.175 	ie.cwik.ch	Dublin, Ireland
83.96.156.169	nl.cwik.ch	Amsterdam, The Netherlands
67.202.99.74	us.cwik.ch	Chicago, USA

I started with a CentOS 5 install with just the bare minimum installed: cron, ssh, logrotate and syslog.

Step 1: When setting up a new CentOS server, the first thing I always do is run ‘yum -y update’ to install all the latest updates. As usual there was a newer version of kernel-xen available, so I rebooted to make it active.

Step 2: Create a new SSH key pair and install the public key part on the servers. This way I can disable password authentication, making network logins more secure (just change PasswordAuthentication to no in /etc/ssh/sshd_config). To create your own key pair on Linux or Mac OS X, open a Terminal window and run:

cwik:~ cwik$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/cwik/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/cwik/.ssh/id_rsa.
Your public key has been saved in /Users/cwik/.ssh/id_rsa.pub.
The key fingerprint is:
be:8b:40:8c:1b:0e:97:11:bb:13:2d:8d:7e:fa:8c:6a cwik@cwik.local
The key’s randomart image is:

+--[ RSA 2048]----+
|  .              |
|   *             |
|  * o            |
| . O             |
|. O +   S        |
| + B   .         |
|  + .   .        |
| E + . . .       |
|o.. o . o.       |
+-----------------+

cwik:~ cwik$ cat .ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAzT4QusU0oYWr+hgL8DY06Q7sief07/J145dnEcdMvC++/5A8Sl79Y+Ux
X3+SxIOj9in3QyxLAvGyXAbxdv/6/IsLZOuHOsyemw9XeqW18I6Ein+yEwZJICE3volOFMCZPVW2sE5SEc+Dtesx
HWq2c1WUcdWTzpcfKMfVo/bJLhDmeCKbXdko0hghB2GNJGKPpYJSnkSNQotrZfBMuXol8S2GYukSa+DaBfnV2jK
bfgiad8r8V216OFSFoNJ8NkmKXVHg34WqyOsqFFF5VOxv3P+UhQGWr8RoS78CffmjF6LAMet3YB9V2DXO3+08
EqSccmzdvyeAbo8cbY0cOoao2Q== cwik@cwik.local

Now install this on the server (that’s not my real pub key BTW, just one I created for this blog post ;-)

[cwik@us ~]$ mkdir .ssh
[cwik@us ~]$ touch .ssh/authorized_keys
[cwik@us ~]$ chmod 700 .ssh
[cwik@us ~]$ chmod 600 .ssh/authorized_keys
[cwik@us ~]$ echo "ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAzT4QusU0oYWr+hgL8DY06Q7sief07/J145dnEcdMvC++/5A8Sl79Y+Ux
X3+SxIOj9in3QyxLAvGyXAbxdv/6/IsLZOuHOsyemw9XeqW18I6Ein+yEwZJICE3volOFMCZPVW2sE5SEc+Dtesx
HWq2c1WUcdWTzpcfKMfVo/bJLhDmeCKbXdko0hghB2GNJGKPpYJSnkSNQotrZfBMuXol8S2GYukSa+DaBfnV2jK
bfgiad8r8V216OFSFoNJ8NkmKXVHg34WqyOsqFFF5VOxv3P+UhQGWr8RoS78CffmjF6LAMet3YB9V2DXO3+08
EqSccmzdvyeAbo8cbY0cOoao2Q==" >> .ssh/authorized_keys

Note that the public key must end up on a single line in authorized_keys – it’s only wrapped above for display. Also note the permissions: .ssh and authorized_keys must not be writable by anyone but the owner (the 700/600 modes above are the safe choice). If you omit this step, login will fail and sshd will log a warning to the secure log.

Alright, I can now log in to my servers using key-based authentication! Next, repeat this process on each of the 3 servers themselves, so that they can use ssh (and thus scp and rsync over ssh) between one another without a password. This is important for a replication configuration, as I’ll be setting up automated cron jobs to do things like keeping my webroot in sync. When finished, each of my servers has an authorized_keys file listing my own public key and the public key of each of the other servers (one per line).

Step 3: I want relatively new versions of PHP and MySQL. The versions CentOS 5 ships with are starting to get a little dated, so I’ve elected to install a couple of 3rd-party yum repositories which contain newer builds – in particular PHP 5.3 (vs. the stock 5.1) and MySQL 5.1 (vs. the stock 5.0). Grab the latest installer RPMs for remi and EPEL and install them using rpm -i <filename>.

Step 4: Install all the software I need:

yum --enablerepo=remi --enablerepo=epel -y install httpd php php-mysql mysql mysql-server rsync iptables bind sudo

Step 5: Configure sudo so I can run commands as root as and when needed without having to re-type my password each time. Run ‘visudo’ and add a new line around line 76 where you see the default root definition:

cwik    ALL=(ALL)       NOPASSWD: ALL

Step 6: Install an FTP server on the nominated master. I have chosen ie.cwik.ch (at random) as my master file store. I’ll upload any new files and make changes on this server only. Those changes will then be propagated to us.cwik.ch and nl.cwik.ch by means of an automated rsync command that will be run on a regular basis via cron. vsftpd is the standard FTP server for CentOS and works great. It supports SSL, but this is not enabled by default, so I’ll set it up:

[cwik@ie ~]$ yum -y install vsftpd
[cwik@ie ~]$ chkconfig vsftpd on
[cwik@ie ~]$ openssl req -x509 -nodes -days 999 -newkey rsa:1024  -keyout /etc/vsftpd/vsftpd.pem  \
    -out /etc/vsftpd/vsftpd.pem
(answer some questions here to generate your self-signed cert)
[cwik@ie ~]$ vi /etc/vsftpd/vsftpd.conf
(add these lines to the bottom):
# Enable SSL Support
ssl_enable=YES
allow_anon_ssl=NO
force_local_data_ssl=NO
force_local_logins_ssl=NO
ssl_tlsv1=YES
ssl_sslv2=NO
ssl_sslv3=NO
rsa_cert_file=/etc/vsftpd/vsftpd.pem

# ip_conntrack_ftp doesn't work with encrypted connections. Specify range for passive FTP here.
pasv_min_port=5600
pasv_max_port=5700
[cwik@ie ~]$ service vsftpd start

OK, FTP service is running!

Step 7: Set up the firewall. Open up /etc/sysconfig/iptables and set up some good firewall rules. Here’s my ruleset that I’m using for this blog (a sketch follows below). Note that the only ports open to the world are 80 for HTTP and 53 for DNS. Ports for SSH (I’m running this on the non-standard port 1984), FTP and MySQL are restricted to the IPs of my servers only. I’ve also got an exception allowing me to log in via SSH from my laptop. Turn on the firewall with:

chkconfig iptables on
service iptables start
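
The ruleset itself was only linked rather than reproduced, so here’s a minimal sketch of what /etc/sysconfig/iptables might contain to match the description above (the laptop IP is a placeholder):

*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Open to the world: HTTP and DNS
-A INPUT -p tcp --dport 80 -j ACCEPT
-A INPUT -p tcp --dport 53 -j ACCEPT
-A INPUT -p udp --dport 53 -j ACCEPT
# Restricted to the server pool: SSH (1984), FTP + passive range, MySQL
-A INPUT -s 80.93.25.175 -p tcp -m multiport --dports 1984,21,5600:5700,3306 -j ACCEPT
-A INPUT -s 83.96.156.169 -p tcp -m multiport --dports 1984,21,5600:5700,3306 -j ACCEPT
-A INPUT -s 67.202.99.74 -p tcp -m multiport --dports 1984,21,5600:5700,3306 -j ACCEPT
# Exception: SSH from my laptop (placeholder address)
-A INPUT -s 198.51.100.23 -p tcp --dport 1984 -j ACCEPT
COMMIT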

That’s about it for the initial server set up. The next tasks are:

  • Configuring file replication
  • Configuring MySQL and setting up circular replication
  • Editing Apache config to suit
  • Creating a BIND configuration file and DNS zone

Then I’ll finally be ready to copy over my WordPress blog and go live!
