Construction of a NAS

February 24, 2012 under Main

Over at Anu we’ve got a whole lot of machines running Xen virtual machines. We’ve been running the same hardware and software setup for about 5 years with only a few minor updates here and there along the way. It has served us well, but we are starting to run into limitations of our storage setup.

Our existing setup makes use of hardware RAID controllers in each server, with 2 SATA drives in a RAID-1 mirror configuration. The setup is cost effective, fault tolerant and by offloading I/O to the RAID controller the performance achieved is more than adequate for most workloads.

The first problem came when a customer started hitting I/O bottlenecks on his MySQL server. The solution we came up with was to add a couple of 15k RPM SAS drives and set up a 2nd RAID-1 mirror for storing the MySQL databases. This solved the performance problem.

The second problem came when a customer started to outgrow the capacity we could add to a single machine. Frustratingly, our average storage utilization across the Xen pool is just 53.2% – we had plenty of capacity, but no way to allocate the contiguous chunk required.
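To illustrate why plenty of free space in aggregate did not help, here is a minimal sketch. The host names and per-host sizes are made up for the example; only the idea (free space stranded in small per-machine pockets) comes from our situation.

```python
# Hypothetical per-host figures in GB. These are NOT our real numbers;
# they are chosen so the pool-wide utilization lands near 53%.
hosts = {
    "xen1": {"capacity": 1000, "used": 700},
    "xen2": {"capacity": 1000, "used": 450},
    "xen3": {"capacity": 1000, "used": 440},
}


def pool_utilization(hosts):
    """Fraction of total capacity in use across every host."""
    total = sum(h["capacity"] for h in hosts.values())
    used = sum(h["used"] for h in hosts.values())
    return used / total


def total_free(hosts):
    """Free space summed over the pool (what a central NAS could hand out)."""
    return sum(h["capacity"] - h["used"] for h in hosts.values())


def largest_local_chunk(hosts):
    """Biggest allocation possible when storage is local to each host."""
    return max(h["capacity"] - h["used"] for h in hosts.values())


def fits_locally(hosts, request_gb):
    """True if a single host has enough free space for the request."""
    return request_gb <= largest_local_chunk(hosts)


if __name__ == "__main__":
    print(f"pool utilization: {pool_utilization(hosts):.1%}")
    print(f"free in total:    {total_free(hosts)} GB")
    print(f"largest chunk:    {largest_local_chunk(hosts)} GB")
    print(f"800 GB fits on one host? {fits_locally(hosts, 800)}")
```

With these invented numbers the pool has well over 800 GB free in total, yet no single machine can host an 800 GB volume. Pooling the disks behind one storage server removes that constraint.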

The solution is fairly obvious: centralize storage on a NAS or SAN. So we kicked off a project to do exactly that.

Our criteria were: capacious, quicker than local RAID, resilient and redundant. Oh, and affordable.

That last criterion basically ruled out all the major commercial vendors, so we embarked on a construction project to build our own using off-the-shelf hardware and open source software.

Decision #1 was iSCSI vs NFS – or put otherwise, SAN vs NAS. NFS came out on top primarily because of its relative simplicity and familiarity. iSCSI does support multipathing, which is an attractive feature, but we decided our bonded NICs would provide adequate resiliency.

For the underlying storage we opted for RAID-10 using standard 7.2k SATA drives. RAID-5 gave us much better read throughput in our tests, but RAID-10 way outperformed it in write throughput and in IOPS (I/O operations per second).
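The write result matches the usual back-of-the-envelope model, sketched below. This is the textbook rule of thumb (a small random write costs 2 disk operations on RAID-10 and 4 on RAID-5), not a model of our actual array: the drive count and the per-drive IOPS figure are assumptions, and it ignores controller cache and full-stripe writes.

```python
# Rule-of-thumb back-end I/Os generated by one small random write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4}


def usable_drives(level, n_drives):
    """Number of drives' worth of capacity left after redundancy."""
    if level == "raid10":
        if n_drives < 4 or n_drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, at least 4")
        return n_drives // 2
    if level == "raid5":
        if n_drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return n_drives - 1
    raise ValueError(f"unknown RAID level: {level}")


def random_write_iops(level, n_drives, per_drive_iops):
    """Estimated array-wide random write IOPS under the penalty model."""
    return n_drives * per_drive_iops / WRITE_PENALTY[level]


if __name__ == "__main__":
    # Assumed values for illustration: 8 drives, 80 IOPS per 7.2k SATA drive.
    for level in ("raid10", "raid5"):
        print(level,
              "usable drives:", usable_drives(level, 8),
              "est. write IOPS:", random_write_iops(level, 8, 80))
```

Under those assumptions RAID-5 keeps more usable capacity, while RAID-10 gets roughly twice the random write IOPS from the same spindles. That is the trade we chose to make.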

RAID-10 gave us some resiliency to drive failures, but putting all our storage eggs in one NAS basket made me nervous. As an insurance policy, I spec’d up 2 identical storage servers and DRBD for block level replication over the LAN. A complete hardware failure on the primary will trigger heartbeat to fail over to the second unit. In testing the failover process took about 7 seconds, during which running I/O operations simply appeared to pause. As soon as the failover completed, the pending writes went through and normal operations resumed with no errors reported.
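If you want to put a number on that pause yourself, one approach is to write and fsync a small record at a fixed interval from an NFS client, then look for the longest gap between completions. The sketch below is one way to do it and is not the tool we used; the function names, the 512-byte record and the example path are all assumptions.

```python
import os
import time


def longest_stall(timestamps):
    """Largest gap in seconds between consecutive completion times."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0.0)


def probe_writes(path, duration_s=60.0, interval_s=0.1):
    """Append and fsync a small record every interval_s seconds.

    Returns the monotonic time at which each write completed. A write that
    blocks during a failover shows up as a large gap between two entries.
    """
    stamps = []
    deadline = time.monotonic() + duration_s
    with open(path, "ab") as f:
        while time.monotonic() < deadline:
            f.write(b"x" * 512)
            f.flush()
            os.fsync(f.fileno())  # force the write out to the server
            stamps.append(time.monotonic())
            time.sleep(interval_s)
    return stamps


# Example (hypothetical mount point): start the probe, pull the plug on the
# primary, then inspect the result.
#   stamps = probe_writes("/mnt/nas/failover-probe.dat", duration_s=120)
#   print(f"longest stall: {longest_stall(stamps):.1f} s")
```

The longest gap includes the probe interval itself, so a small interval gives a closer estimate of the real pause.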

We purchased a new 48-port Netgear gigabit Ethernet switch to dedicate to the new storage network, which we run physically segregated from the rest of our LAN and WAN traffic. Bumping up the frame size from the default 1500 to 9000 increased performance significantly. Unfortunately this is anecdotal as I forgot to measure before and after transfer rates for comparison purposes.
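In place of the measurement I forgot to take, here is a theoretical sketch of what jumbo frames change on the wire. It assumes NFS over TCP on IPv4 with no header options, and standard Ethernet framing overhead; it is arithmetic, not a benchmark.

```python
# Per-frame bytes on the wire that are not part of the IP packet:
# preamble + start delimiter (8), Ethernet header (14), FCS (4),
# inter-frame gap (12).
ETHERNET_OVERHEAD = 38
# IPv4 header (20) + TCP header (20), assuming no options.
IP_TCP_HEADERS = 40


def payload_efficiency(mtu):
    """Fraction of on-wire bytes that carry TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frames_per_second(mtu, link_bps=1_000_000_000):
    """Full-size frames per second needed to saturate the link."""
    return link_bps / ((mtu + ETHERNET_OVERHEAD) * 8)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: efficiency {payload_efficiency(mtu):.2%}, "
              f"{frames_per_second(mtu):,.0f} frames/s at line rate")
```

The efficiency gain on the wire is only a few percentage points. The larger difference is the frame count: at MTU 9000 the hosts handle roughly one sixth as many frames for the same data, which usually means less per-packet CPU and interrupt work. That is a plausible explanation for the improvement we saw, though without measurements I can't confirm it.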

Each of the NFS boxes got 6 x GbE connections. The Linux bonding driver in mode 6 (balance-alb) offered the best performance and fault tolerance in our tests. We tested almost every mode available using this reference at LiNUX Horizon.

House of Linux have a great how-to on setting up DRBD + Heartbeat + NFS on CentOS 6 x86_64 which got us up and running with an HA NFS cluster in less than a day.

The next step (after lots and lots of testing and trying every way we could think of to break it) was to start converting from local storage across to our shiny new NAS storage. To do this required extra network bandwidth on our Xen machines, so we came up with a plan that would not only give us the extra bandwidth but also cut our energy costs and increase reliability: we pulled out the RAID card from its PCI slot and replaced it with a dual port Intel GbE NIC. We swapped the old SATA drives for an SSD for the OS – capacity was no longer an issue as all the VM storage is going on the NFS NAS.

As of today we have our first 3 physical machines converted with 21 Xen VMs on the new storage system. Some quick tests with hdparm running inside virtual machines show impressive results from our economical home-built system. Before (hardware RAID-1 based):

Timing buffered disk reads: 310 MB in 3.01 seconds = 103.07 MB/sec
Timing buffered disk reads: 370 MB in 3.00 seconds = 123.29 MB/sec
Timing buffered disk reads: 314 MB in 3.02 seconds = 104.13 MB/sec

After, on the new NAS:

Timing buffered disk reads: 702 MB in 3.00 seconds = 233.82 MB/sec
Timing buffered disk reads: 534 MB in 3.00 seconds = 177.84 MB/sec
Timing buffered disk reads: 762 MB in 3.00 seconds = 253.96 MB/sec
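To compare the two sets of runs without eyeballing them, a short script can pull the rates out of the hdparm output and average them. This is a minimal sketch; the regular expression assumes the "= N MB/sec" format shown above.

```python
import re
from statistics import mean

BEFORE = """\
Timing buffered disk reads: 310 MB in 3.01 seconds = 103.07 MB/sec
Timing buffered disk reads: 370 MB in 3.00 seconds = 123.29 MB/sec
Timing buffered disk reads: 314 MB in 3.02 seconds = 104.13 MB/sec
"""

AFTER = """\
Timing buffered disk reads: 702 MB in 3.00 seconds = 233.82 MB/sec
Timing buffered disk reads: 534 MB in 3.00 seconds = 177.84 MB/sec
Timing buffered disk reads: 762 MB in 3.00 seconds = 253.96 MB/sec
"""

# Matches the trailing "= 103.07 MB/sec" part of each result line.
RATE = re.compile(r"=\s*([\d.]+)\s*MB/sec")


def rates(text):
    """Extract every MB/sec figure from hdparm output as floats."""
    return [float(value) for value in RATE.findall(text)]


before_avg = mean(rates(BEFORE))
after_avg = mean(rates(AFTER))

if __name__ == "__main__":
    print(f"before: {before_avg:.1f} MB/sec")
    print(f"after:  {after_avg:.1f} MB/sec")
    print(f"ratio:  {after_avg / before_avg:.2f}x")
```

On these six runs the averages come out at roughly 110 MB/sec before and 222 MB/sec after, about a 2x improvement in buffered reads. Three runs each is a small sample and hdparm only measures sequential reads, so treat it as a quick sanity check.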

The real test will be how it scales. At the moment we’re seeing just under 15 Mbps average throughput to the NAS with peaks around 70 Mbps. Performance feels great and the stats back it up. If we can consolidate a half dozen Xen machines to run off the NAS while maintaining better performance than local storage, I will call the project a success.