Graphing the average http request execution time with SNMP

March 2, 2012 under Main

After implementing our new NAS based shared storage for our virtualization pool, we wanted to see what the real world impact on web server performance would be. We decided that the amount of time Apache takes to process http requests would give us a good indication as to whether performance had improved or degraded after moving the VM from local RAID storage to the NAS.

Apache already knows how long it takes to serve each request, but by default this information is not logged anywhere. Since the metric is already available, adding some logging and statistics collection is a nice way to graph this metric over time with minimal impact on server performance. Unlike measuring a page load time externally, this measurement does not involve any additional requests to the server and takes its data from the whole server, not just one test page.

The first step is to set up a new CustomLog on your server to log the %D variable for each request, see mod_log_config in the Apache docs. %D is “The time taken to serve the request, in microseconds.”

LogFormat "%D" resptime
CustomLog logs/resptime_log resptime

If you have multiple virtual hosts with per-vhost logging (ie. you have a CustomLog inside each virtual host), you will have to add this directive to each virtual host that you want to collect samples from. You can have multiple virtual hosts log to the same file, just beware that you may run into file descriptor limits depending on the number of virtual hosts and your OS. When we tried to add this directive to >500 virtual servers we hit this limit and got an “Unable to open log files” error and Apache refused to start. We solved the issue by sampling a selection of virtual hosts (~200) instead of all of them.

The next step is to get these values into a usable average that you can graph. Since we already use Cacti for graphing things like CPU, load average and network throughput, we decided to try to graph this value the same way. Cacti collects its statistics via SNMP using the net-snmp package on CentOS. This package can be extended to collect additional information, see Extending Net-SNMP on docs.redhat.com

I wrote a small Perl script to average the values in the response time log and clobber the file when done, as the data is of little interest once it has been averaged and logged. I saved the script as /usr/local/bin/avgresptime.pl


#!/usr/bin/perl

# Path to the log
my $log = “/etc/httpd/logs/resptime_log”;

# Open the log file
open(LOGFILE, $log) or die “Could not open log file: $!\n”;

# How many lines in the log? (= number of requests)
my $lines = `wc -l $log | cut -d ” ” -f 1`;
chomp($lines);

# Define $total and $avg here
my $total = 0;
my $avg = 0;

# Add up the total number of microseconds:
while() {
$total += $_;
}

# Calculate the average
$avg = int (($total / $lines) + .5);

# Print (debug)
### human-readable output:
#print sprintf(“Log file has %d lines with an average response time of %d \n”, $lines, $avg);
### lines + avg
#print sprintf(“%d %d”, $lines, $avg);
### only avg resp time:
#print sprintf(“%d”, $avg);

# Close log file
close(LOGFILE);

# Clobber the file
open(LOGFILE, “>”, $log) or die “Could not clobber log file: $! \n”;
close(LOGFILE);

# convert from microseconds to hundredths of seconds and round to nearest integer
my $out = int (($avg / 10000) + .5);

# the reason for this is:
# “Integer exit codes are limited to a range of 0-255. For values that are likely to exceed 256, either use the standard output of the script (which will be typed as a string) or a different method of extending the agent.”
# source: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sect-System_Monitoring_Tools-Net-SNMP-Extending.html
#
# using an exit code is by far the simplest way to graph this data. 255 hundredths of a second = 2.55 seconds.
# hopefully no web server takes more than 2.55 seconds on average to process requests.
# after 12 hours graphing, our average on a busy web server is 19, max 136.
if( $out < 255) { exit $out; } else { exit 255; }

I then added a line to my /etc/snmp/snmpd.conf :

extend httpd_avg_exec_time /usr/local/bin/avgresptime.pl

I then spent a frustrating hour or so trying to figure out how to address the OID to get the value out of SNMP. Turned it to be:

# snmpwalk -v 1 -c -On .1.3.6.1.4.1.8072.1.3.2.3.1.4

Which returned:

.1.3.6.1.4.1.8072.1.3.2.3.1.4.19.104.116.116.112.100.95.97.118.103.95.101.120.101.99.95.116.105.109.101 = INTEGER: 14

That big long string is the OID you want Cacti to graph. To do that, go to your device (assuming you already have it set up in Cacti) and under “Associated Graph Templates” add “SNMP – Generic OID Template”. I gave mine a nice descriptive title and for the vertical axis, put “hundredths of a second” as the label. You then have to go to the Data Source and enter the OID. You’ll also want to set “Maximum Value” to 255 (being the maximum value an exit code can return) and set the data source type to Gauge.

Let it churn for a couple of polls (10-15 minutes) and you should start to see data!

Here’s what our graph looks like after running overnight:

average httpd request processing time

Now we wait. Once we’ve got a good week or two of sample data, we’ll migrate this virtual machine to our NAS storage and see what impact that has on request processing time.

Subscribe