% fortune -ae paul murphy

ZFS, HW RAID, and expensive misapprehensions

Last week frequent contributor _dietrich drew everyone's attention to a piece by Timothy Morgan comparing various low- to mid-range Unix and Windows servers from a "bang for the buck" perspective.

It's an interesting piece of work and one I wish I'd done despite Morgan's politically correct assertion that Linux isn't Unix - something that's kind of funny given that he starts the article with the words:

Only 15 years ago, Linus Torvalds was annoyed enough about the expense of the just-commercialized Unix workstations and systems on the market that he started to create his own operating system, one that had a look and feel very much like Unix.

Maybe somebody should remind him that Torvalds described it at the time as "Unix for the 386" and as "a better Minix" - or maybe just that something which walks like a duck, quacks like a duck, looks like a duck, and interbreeds with ducks, is likely to be a duck.

Aside from that, however, the article is extremely courageous: Morgan attempts to guess what the TPC-C result would be for various configurations, and then uses that guess as the denominator in the per-transaction cost ratio that forms the basis of his bang-for-the-buck comparisons.

Here's his rationale:

I have tried to keep the configurations across server architectures and operating system platforms as similar as is practical based on the natures of the product lines.

I am well aware that I am showing the estimated or actual OLTP performance of a given processor complex and comparing the cost of a base configuration. In this way, I am trying to isolate the base cost of a server and show its potential performance on the TPC-C online transaction processing benchmark. Yes, the Transaction Processing Performance Council frowns on this sort of thing. But, someone has to do like-for-like comparisons and the TPPC can't even get its own acronym straight, much less come up with a scheme that not only encourages, but makes vendors adhere to a wider spectrum of tests to gauge the performance of product lines rather than one iteration of a product. (And, while I am thinking about it, don't bother emailing me about how an acronym has to be pronounceable or it is just an abbreviation; the usage of the word "acronym" has changed to mean any abbreviation, and IT Jungle is on the cutting edge of word technology.)

For the comparisons, I have put a RAID 5 disk controller on each machine, two 36 GB disks, and 2 GB of main memory for each processor core in the box (there are some exceptions on the core count, of course). Each server also has a basic tape backup, shown in the table.

I'd say "more power to him" for having the guts to do it - if he hadn't made two big mistakes with respect to the Sun gear he bench-ti-mated.

Here's what he says about his Niagara performance numbers:

I am guessing the performance of the Niagara boxes based on other benchmarks, since Sun is too stubborn to run the TPC-C tests on its Galaxy and Niagara even though it has great numbers. Go figure.

There are two issues here. First, I think Sun is in the right in describing TPC-C as too easily gamed and too simple to represent most modern workloads. Second, and more relevant for Morgan's purposes, it's possible to get a realistic TPC-C estimate, at least for Sybase ASE on the T2000, by looking at the individual transactions in the banking workload component of Sun's published SPECweb2005 result - a process that produces a much higher estimate than his.

The most interesting, and cautionary, mistake in the paper, however, has nothing to do with the minutiae of his estimates - or their consequences for those who follow his recommendations.

The big problem is simply this: every Sun box he bench-ti-mated ran Solaris 10, but he assumed similar PC-style RAID hardware whether the target machine was to run Windows, Linux, Solaris, or even AIX - and that's just plain wrong.

Solaris 10 now ships with ZFS - and ZFS obsoletes both PC-style RAID controllers and the external RAID controllers used with bigger systems.

To get a fair comparison, you use the best technologies the vendors offer: hardware RAID (internal or external) for everybody except Sun, and ZFS on JBODs for Sun.

There are three very important differences:

  1. without ZFS, you get the fastest I/O by backing advanced RAID hardware with as many small, fast-spinning devices as you can cram into the box. With a Sun 3510FC array, for example, you get maximum performance by putting at least 1GB of cache on each of its two controllers and filling the thing with 12 x 36GB/15K RPM disks.

    ZFS doesn't work this way. To get maximum performance with ZFS you fill the box with large, fast disks and eliminate its RAID controllers in favor of more memory, and more "dumb" controllers, in the server.

  2. with hardware RAID you decide in advance whether to optimize for reading or writing, and set up your RAID strategy as a compromise between workload performance and reliability.

    With ZFS you don't. Instead, you mirror by default because space on big disks is cheap and mirroring produces the minimum number of read or write transactions for protected data regardless of page size (there's a short sketch of what this looks like after this list).

  3. with hardware RAID the choke point tends to be the RAID controller itself, but with ZFS it's usually the pre-write buffer on the internal disk controllers.

    As a result, with traditional RAID you spend your dollars putting a separate computer into each disk box and then running a disk management application to control the thing. With ZFS all of this is handled in the Solaris kernel, and you sink your money into simple but fast internal controllers instead. The result is a faster and cheaper system that's more reliable precisely because it's simpler and less power-intensive.
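
To make the contrast concrete, here is a minimal sketch of the ZFS side - assuming a server with four plain controllers and hypothetical Solaris device names (c1t0d0 through c4t1d0); real device paths will differ:

    # build one pool from mirrored pairs, each pair split across two
    # "dumb" controllers; ZFS stripes reads and writes across all the
    # mirrors - no RAID card, battery cache, or management console
    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0 \
        mirror c3t0d0 c4t0d0 \
        mirror c3t1d0 c4t1d0

    # verify the layout and device health
    zpool status tank

That one command is the whole "RAID setup": no stripe sizes to choose in controller firmware, no battery-backed cache to monitor, and no separate management application to buy or learn.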

So how much difference does this make? The list price for a Sun 3510FC array with 12 x 36GB disks is $25,995 - but the same enclosure as a JBOD, minus the on-board controllers and memory, runs about $14K complete with 12 x 73GB drives. Put four controllers in your T2000 (alternating your X/Es) and you do fewer I/Os, your I/O pipelines don't stall, there is no separate disk management application, and so you go faster for much less.

The comparable numbers for the 3520 SCSI array are $23,387 for the RAID version with 12 x 73GB disks and about $16K for the ZFS-compatible JBOD version with 12 x 143GB disks.
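
For a rough sense of the gap, here's the cost-per-raw-gigabyte arithmetic behind those list prices - a back-of-the-envelope sketch that ignores RAID-5 parity and mirroring overhead:

    # 3510FC with RAID controllers: $25,995 for 12 x 36GB
    echo "scale=2; 25995 / (12 * 36)" | bc       # ~60.2 $/GB raw
    # 3510FC-class JBOD:            ~$14,000 for 12 x 73GB
    echo "scale=2; 14000 / (12 * 73)" | bc       # ~16.0 $/GB raw
    # 3520 SCSI with RAID:          $23,387 for 12 x 73GB
    echo "scale=2; 23387 / (12 * 73)" | bc       # ~26.7 $/GB raw
    # 3520 SCSI JBOD:               ~$16,000 for 12 x 143GB
    echo "scale=2; 16000 / (12 * 143)" | bc      # ~9.3 $/GB raw

Even before you count the performance effects, the dumb-disk option cuts the cost per raw gigabyte by roughly a factor of three to four.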

So what's the bottom line? Morgan's ratings ignore important differentiating technology - had he bench-ti-mated his Solaris costs and performance on a ZFS/JBOD basis, Sun would have placed at the top of every grouping.

All of which suggests an important question for readers of this blog: how many of you are making the same mistake? How about your colleagues and bosses?


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.