Isolating the cause of seemingly random webapp crashes (and identifying who is responsible for fixing it)

July 1, 2012 under Main

As a managed hosting provider, it is sometimes difficult to definitively draw the line between a customer’s problem and our problem. We are paid to provide reliable infrastructure and platform, on which customers deploy their code. What we’re not responsible for is the reliability of their code.

What happens then when a customer’s webapp starts crashing occasionally and seemingly at random?
In one recent case, the frequency of such crashes went from zero to as much as half a dozen crashes per day over the course of a few months. The customer was pushing for us to fix the problem, but as far as I could tell the problem did not lie in the infrastructure or platform.

Out of memory errors indicated the problem may be a resource shortage, but doubling the RAM allocated to the virtual machine had no impact on the frequency of the errors. At this point I began to suspect buggy code was causing a runaway condition. The problem I faced at this point was convincing an increasingly unhappy customer that the problem was with their code.

To do this I needed evidence that the crashes were related to requests to URIs in their app.

This particular app is written using the Lasso programming language. One nice feature of Lasso is define_atbegin and define_atend, two tags which can be used to insert code to preprocess and postprocess for a particular script. These can be defined globally to run as preprocessor and postprocessor scripts for every request to the server. Using this feature, I installed a debugger system which is relatively simple in principle but surprisingly powerful.

At the start of each request, before any customer code is executed, the preprocessor script creates a record in a MySQL table containing details of the request: date and time, request URI including any GET params, client IP, client browser, and a column called ‘closed’ with a value of 0.

Once the customer code has finished executing, the postprocessor script updates the MySQL table to set the ‘closed’ column to 1, record the execution time (helps to identify slow running scripts) and record the contents of the error stack. If the customer code crashes, the postprocessor code won’t get a chance to run. In this case the status of ‘closed’ will remain 0.

The MySQL database can then be queried to retrieve a list of pages that never finished executing, pages that were slow to execute, or pages which finished but contained an error stack. This is all very useful debugging info, but it is the first condition which is the most useful, as this allows you to see exactly which URIs were requested but never completed – ie which pages are crashing the system!

Screenshot of HTML display

As the aim of this exercise was to highlight to the customer where they should be looking for errors, I quickly created an HTML display of the data from MySQL and put a password on it. The customer could then keep an eye on the various reports to check for slow pages, error stacks and pages that crashed.

Within a day of going live, the customer had identified the culprit code and uploaded a patched version. The webapp hasn’t crashed since.

While I can’t take credit for the original concept (Bil Corry came up with it many years ago) or even writing the debugger code (I hired an enormously talented programmer to do that), I can claim to have successfully implemented this method to isolate the offending code. This is a really great tool that not only helped fix a hard to find bug, but also helped to clearly define that it was customer code and not infrastructure or platform at fault. This made the customer happy, and a happy customer makes me happy!