Isolating the cause of seemingly random webapp crashes (and identifying who is responsible for fixing it)

July 1, 2012 under Main

As a managed hosting provider, it is sometimes difficult to definitively draw the line between a customer’s problem and our problem. We are paid to provide reliable infrastructure and platform, on which customers deploy their code. What we’re not responsible for is the reliability of their code.

What happens then when a customer’s webapp starts crashing occasionally and seemingly at random?
In one recent case, the frequency of such crashes went from zero to as many as half a dozen per day over the course of a few months. The customer was pushing for us to fix the problem, but as far as I could tell the problem did not lie in the infrastructure or platform.

Out-of-memory errors suggested that the problem might be a resource shortage, but doubling the RAM allocated to the virtual machine had no impact on the frequency of the errors. I began to suspect that buggy code was causing a runaway condition. The challenge now was convincing an increasingly unhappy customer that the problem was with their code.

To do this I needed evidence that the crashes were related to requests to URIs in their app.

This particular app is written in the Lasso programming language. One nice feature of Lasso is define_atbegin and define_atend, two tags that register code to run before and after a particular script. They can also be defined globally, so that they run as preprocessor and postprocessor scripts for every request to the server. Using this feature, I installed a debugger system which is relatively simple in principle but surprisingly powerful.

At the start of each request, before any customer code is executed, the preprocessor script creates a record in a MySQL table containing details of the request: date and time, request URI including any GET params, client IP, client browser, and a column called ‘closed’ with a value of 0.

Once the customer code has finished executing, the postprocessor script updates the MySQL table to set the ‘closed’ column to 1, record the execution time (which helps to identify slow-running scripts) and record the contents of the error stack. If the customer code crashes, the postprocessor code never gets a chance to run, so ‘closed’ remains 0.
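The open/close bookkeeping described above can be sketched in a few lines. This is only an illustration in Python, not the Lasso code that ran on the server: it uses SQLite as a stand-in for MySQL, and the table name, column names and function names are all assumptions made up for the example.

```python
import sqlite3
import time

# Hypothetical schema: names are illustrative, not the original MySQL table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS request_log (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at   REAL NOT NULL,
    uri          TEXT NOT NULL,
    client_ip    TEXT,
    user_agent   TEXT,
    closed       INTEGER NOT NULL DEFAULT 0,
    exec_seconds REAL,
    error_stack  TEXT
)
"""


def make_db(path=":memory:"):
    """Open the log database and create the table if it is missing."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def open_request(db, uri, client_ip, user_agent):
    """Preprocessor step: record the request before any app code runs."""
    cur = db.execute(
        "INSERT INTO request_log (started_at, uri, client_ip, user_agent) "
        "VALUES (?, ?, ?, ?)",
        (time.time(), uri, client_ip, user_agent),
    )
    db.commit()  # commit immediately so the row survives a later crash
    return cur.lastrowid


def close_request(db, request_id, error_stack=""):
    """Postprocessor step: mark the request finished, store timing and errors."""
    row = db.execute(
        "SELECT started_at FROM request_log WHERE id = ?", (request_id,)
    ).fetchone()
    elapsed = time.time() - row[0]
    db.execute(
        "UPDATE request_log SET closed = 1, exec_seconds = ?, error_stack = ? "
        "WHERE id = ?",
        (elapsed, error_stack, request_id),
    )
    db.commit()
```

The important detail is that the row is committed before the application code runs. A request that crashes never reaches close_request, so its row is left behind with ‘closed’ still at 0.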

The MySQL database can then be queried to retrieve a list of pages that never finished executing, pages that were slow to execute, or pages that finished but left something on the error stack. All of this is useful debugging information, but the first list is the most valuable: it shows exactly which URIs were requested but never completed, i.e. which pages are crashing the system!
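The three reports boil down to three simple queries. The sketch below again uses SQLite in place of MySQL, with an assumed table layout and a handful of made-up sample rows, so treat the names and values as placeholders rather than the real schema or real data.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table holding a few invented sample rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE request_log "
        "(uri TEXT, closed INTEGER, exec_seconds REAL, error_stack TEXT)"
    )
    db.executemany(
        "INSERT INTO request_log VALUES (?, ?, ?, ?)",
        [
            ("/fast.lasso", 1, 0.05, ""),
            ("/slow.lasso", 1, 7.5, ""),
            ("/broken.lasso", 1, 0.2, "hypothetical error text"),
            ("/crash.lasso", 0, None, None),  # postprocessor never ran
        ],
    )
    return db


def never_finished(db):
    """Requests that were opened but never closed: the crash candidates."""
    rows = db.execute("SELECT uri FROM request_log WHERE closed = 0")
    return [r[0] for r in rows]


def slow_pages(db, threshold_seconds):
    """Requests that completed but took longer than the threshold."""
    rows = db.execute(
        "SELECT uri FROM request_log "
        "WHERE closed = 1 AND exec_seconds > ? ORDER BY exec_seconds DESC",
        (threshold_seconds,),
    )
    return [r[0] for r in rows]


def pages_with_errors(db):
    """Requests that completed but left something on the error stack."""
    rows = db.execute(
        "SELECT uri FROM request_log "
        "WHERE closed = 1 AND error_stack IS NOT NULL AND error_stack <> ''"
    )
    return [r[0] for r in rows]
```

Grouping the first query by URI and counting the rows would be a natural next step, since a page that shows up repeatedly in the never-finished list is the prime suspect.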

Screenshot of HTML display

As the aim of this exercise was to highlight to the customer where they should be looking for errors, I quickly created an HTML display of the data from MySQL and put a password on it. The customer could then keep an eye on the various reports to check for slow pages, error stacks and pages that crashed.

Within a day of going live, the customer had identified the culprit code and uploaded a patched version. The webapp hasn’t crashed since.

While I can’t take credit for the original concept (Bil Corry came up with it many years ago) or even for writing the debugger code (I hired an enormously talented programmer to do that), I can claim to have successfully implemented this method to isolate the offending code. This is a really great tool that not only helped fix a hard-to-find bug, but also helped to establish clearly that customer code, and not the infrastructure or platform, was at fault. This made the customer happy, and a happy customer makes me happy!


Apache 2.2 with Lasso 8.5 issues

May 6, 2010 under Main

Over the past couple of months I have set up 4 new servers based on CentOS 5.4 x86 and Lasso Professional 8.5.6. I used the stock version of Apache 2.2 from the CentOS yum repositories, and the Lasso Connector for Apache 2.2 supplied with Lasso 8.5.6.

While this configuration has worked fine in testing, all 4 servers exhibited performance problems once production load was placed on the machines. httpd threads would sporadically hang and consume 100% CPU. This would usually happen within an hour of starting Apache, and lead to performance degradation, eventually to the point where the server was unusably slow.

After extensive testing, the problem appears to be caused by a bug in the Lasso Connector for Apache 2.2 on CentOS. Downgrading to Apache 2.0 and using the connector for that version has solved the problem and resulted in huge speed improvements on every server where I have done this so far.


Lasso Sites Manager module for Webmin

March 4, 2010 under Main

Screenshot of Create Site window

This week I wrote a module for Webmin called Lasso Sites Manager.

Webmin is a really cool browser based server administration tool. It comes with dozens of modules for managing everything from user accounts to server software, package updates and disk quotas.

Lasso Server 9 is the latest release of the Lasso middleware application from LassoSoft. Lasso 9 lacks a key feature which the last release, 8.5, did have: an easy, graphical way to define multiple Lasso Sites.

Lasso 9 does support multiple sites, but with a very different implementation from Lasso 8. Instead of spawning its own site processes, Lasso 9 integrates into Apache via FastCGI, and FastCGI can be configured to connect to a specific Lasso installation. The sites are therefore much better integrated with Apache and the OS, which yields much more flexibility.

Inspired by discussions on the Lasso Talk mailing list, I decided to have a go at writing a Webmin module to manage multiple Lasso Sites in Lasso Server 9. I’m releasing the module here as free software with no license restrictions.

Feedback, improvements, and a Mac port would all be very welcome. If there is enough interest from other developers in maintaining and improving the module, I’ll probably put it up somewhere like Sourceforge or Google Code.

Lasso Sites

Lasso 9 supports multiple installations of Lasso Server on a single host. Each installation (“Site”) requires a unique installation directory and FastCGI port number. The lassoserver process needs to be told where its home is by being passed the LASSO9_HOME environment variable at startup.
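As a rough sketch of what "passing LASSO9_HOME at startup" means in practice, the Python helper below starts a process with that variable set in its environment. The post does not give the lassoserver command line, so the command is left as a parameter for the caller to supply, and the function names and the paths in the usage note are hypothetical.

```python
import os
import subprocess


def site_environment(site_home, base_env=None):
    """Return a copy of the environment with LASSO9_HOME set for one Site."""
    env = dict(os.environ if base_env is None else base_env)
    env["LASSO9_HOME"] = site_home
    return env


def start_site(command, site_home):
    """Start `command` (a list of arguments) with LASSO9_HOME pointing at
    the Site's installation directory. The command itself is supplied by
    the caller; this sketch does not assume any lassoserver arguments."""
    return subprocess.Popen(command, env=site_environment(site_home))
```

Two Sites would then be two calls with different (hypothetical) homes such as /var/www/site1 and /var/www/site2, each configured with its own FastCGI port.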

Multiple Apache virtual hosts can connect to a single Lasso Site, or you can define 1 Lasso Site per virtual host.

Each Lasso Site can run under a different Unix user, which considerably improves security, especially when implemented together with Apache’s suEXEC module. The Lasso Sites Manager Webmin module supports suEXEC and will even apply the correct Apache configuration and file permissions for you.

Module features in brief:

  • Create, update and delete Lasso Sites
  • Install from local copy of Lasso Server, a pre-configured template, or from SVN
  • Auto-selection of available FastCGI port
  • Configure Lasso Site to auto-start Lasso Server via Apache FastCGIServer
  • Or manually control stop/start of Lasso Server via the module
  • Automatic integration with Apache virtual hosts
  • Ability to run Lasso sites under any Unix user (also without suEXEC, by setting setuid bit)
  • Enable/disable Lasso Server error log (LASSO9_PRINT_FAILURES directed to a file)
  • Integration with suEXEC (off by default, enable via module config)
  • Open Lasso Admin for any Site directly from the module
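To give an idea of how auto-selection of a FastCGI port could work, here is a minimal Python sketch. It is not the module's own code, and the function names and the idea of a reserved list are assumptions: it simply walks a port range, skips ports already assigned to other Sites, and returns the first one that can be bound.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start, end, reserved=(), host="127.0.0.1"):
    """Return the first port in [start, end] that is neither reserved for
    another Site nor already in use, or None if the range is exhausted."""
    for port in range(start, end + 1):
        if port in reserved:
            continue
        if port_is_free(port, host):
            return port
    return None
```

The reserved list matters because a Site that is defined but currently stopped still owns its port, even though a bind test alone would report that port as free.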

Some currently known issues are:

  • Only tested on CentOS Linux 5.4 – anyone interested in contributing Mac OS X support please contact me
  • Auto-start not working in combination with suEXEC
  • May have issues when turning suEXEC support on or off after you have already defined sites. Saving each site config (just click on the name and hit Save) should fix this
  • Install from SVN at your own risk, may require some manual tweaking

An issue to be aware of that isn’t a bug: when using suEXEC on CentOS, the Lasso Site _must_ be located within /var/www. This path is compiled into the suEXEC binary as its doc-root. If you want to change this, you need to recompile suEXEC (which ships with Apache) with a different AP_DOC_ROOT.
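A quick pre-flight check for this restriction might look like the following Python sketch. The function name is hypothetical and suEXEC performs its own, stricter checks; this only catches the obvious case of a Site directory placed outside the doc-root.

```python
import os


def inside_docroot(site_dir, docroot="/var/www"):
    """Return True if site_dir resolves to docroot or a directory below it.
    Symlinks and '..' components are resolved before comparing."""
    site = os.path.realpath(site_dir)
    root = os.path.realpath(docroot)
    return os.path.commonpath([site, root]) == root
```

Comparing resolved paths rather than raw strings avoids being fooled by a path such as /var/wwwfoo, which starts with the same characters but is not inside /var/www.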

On the To Do list:

  • Ability to update one/all sites from updated local installation or from SVN
  • Port to Mac OS X client and server. Anyone who can contribute to this port, or provide access to an OS X Server installation, please contact Chris Wik.

Download and Installation

Pre-requisites: a working installation of CentOS 5 with Webmin, Apache 2.2, Lasso Server 9, and subversion (‘yum -y install subversion’ on CentOS). Built/tested with Webmin 1.500.

Please install Lasso Server 9 and test that it works BEFORE installing this module. E.g.: turn off SELinux, install the Lasso Server RPM, and install FastCGI (using the provided script at /usr/local/lib/lasso/Apache2Conf/install_mod_fastcgi).

You can download the module here. To install, go to Webmin -> Webmin Configuration -> Webmin Modules. Select “From uploaded file”, choose the file, and click Install Module. The module will now appear under the Servers category in Webmin.

This module comes with no guarantees; use it at your own risk.
