[rescue] Fun with ZFS (was: Friend needs storage solution)

Jonathan C. Patschke jp at celestrion.net
Tue May 30 12:23:05 CDT 2006


On Tue, 30 May 2006, Patrick Giagnocavo wrote:

>> Anyone have any suggestions or opinions?  Cost-effectiveness is an
>> issue.
>
> I suggest OpenSolaris + ZFS.  The commands are simple and the ability
> to add storage later is a big plus.  Also ZFS has a built-in integrity
> checker option to ensure that blocks written are consistent.

I'd strongly recommend -against- OpenSolaris + ZFS.

I was debating between OpenSolaris and Solaris 10 on my E4000 (ZFS being
the deciding factor for whether I wanted to run OpenSolaris), and
couldn't quite get a straight answer about how ZFS works (or how it
handles failures, or how it distributes data, or a lot of other really
basic questions) from a bunch of people who were singing its praises,
so I built up the E4000 with 20 or so disks (10 on one SOC+, 10 on
another on a separate I/O board, 1 SCSI disk on the boot channel for the
root filesystem, and two SCSI disks on the secondary I/O board in a
zpool mirror) and ran some tests.
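
For the record, the mirror half of that setup is a one-liner; something
like the following, though the device names here are placeholders, not
the real c#t#d# paths on that box:

   # Two SCSI disks on the secondary I/O board, as a mirrored pool:
   zpool create testmirror mirror c2t0d0 c2t1d0
   zpool status testmirror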

The short story is that performance is pretty good, but reliability
isn't so hot.

I could repeatably get OpenSolaris Nevada b36 to panic during certain
types of disk failures.  Apparently if you're swift enough to pull a
member of a ZFS volume set (or whatever the heck they're called) while
an operation -to that disk- is in-flight, the kernel will panic.  Joy.
Just what I always wanted in a RAID.

RAIDZ is just RAID5, and by "just", I mean "just".  Yeah, you can do
mirrors and stripes and mirrors of stripes and stripes of mirrors, but
when you do RAIDZ, it's just RAID5[0].  Presumably there's a hot-spare
feature somewhere, but I didn't mess with it long enough to find it
(panicking such a large E4000 is a very time-consuming event).
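
Building the RAIDZ itself is about as simple as the mirror (again,
hypothetical device names).  The commented-out line is the hot-spare
syntax later builds reportedly use; I never found it, or confirmed it
even exists, in b36:

   # Five FC disks on one SOC+ channel as a single-parity RAIDZ:
   zpool create tank raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0
   # Hot spare, *if* your build supports the spare vdev type:
   # zpool add tank spare c3t5d0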

I failed my disks under very pessimistic scenarios[1]: -very- high load,
high disk activity to the RAIDZ, little room for disk contention
elsewhere.  The test methodology I used was to initiate 40 concurrent
builds of GCC 4.1.0 with -pipe to minimize /var/tmp contention.
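
The load generator was nothing fancy; roughly the following, with the
source and build paths being placeholders:

   # Kick off 40 concurrent GCC builds against the RAIDZ (sketch only):
   i=1
   while [ $i -le 40 ]; do
       mkdir -p /tank/build$i
       ( cd /tank/build$i && /tank/src/gcc-4.1.0/configure CFLAGS=-pipe \
           && make ) > /tank/build$i/log 2>&1 &
       i=`expr $i + 1`
   done
   wait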

I found error-recovery a bit perplexing.  Pull a ZFS member, do some
I/O, and reintroduce it.  I pulled a member of a RAIDZ for quite a long
time, did lots of writes to the filesystem (probably 3GB or so of cp on
an otherwise quiesced system), and reintroduced the member.  The
reintroduced member seemed to sync up nearly immediately (less than a
second).  I do not have faith that 341MB of parity and data was written
in less than a second over 1Gb/s fibre-channel, but the zpool status
command said that the array was healthy and that some strange percentage
was recovered ("resilvered" I think is the term it used).

Performance is far superior to Solaris UFS, though:

   http://magnum.celestrion.net/~jp/zfs.performance.html

Primarily, it seems that node creation within ZFS is at least an order
of magnitude faster than node creation within UFS (compare UFS->ZFS and
ZFS->UFS copies).

My methodology was to quiesce the system (a 10-processor, 10GB memory
E4000), try each operation 5 times, take the wall-clock outputs of
time(1) and average them, and then divide that into my dataset size in
megabytes.  The particularly startling one (copying RAIDZ to a normal
UFS slice) I repeated through several sets and always got answers
within 5%, so I guess that operation is just slow.
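
In other words, each number on that page is just dataset megabytes
divided by the averaged wall-clock seconds; with made-up placeholder
numbers:

   # time(1) wall-clock for 5 runs of one copy, in seconds (placeholders):
   #   212 + 218 + 215 + 220 + 210 = 1075;  1075 / 5 = 215 s average
   # dataset size: 3400 MB
   #   3400 MB / 215 s = ~15.8 MB/s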

The data set was a mix of small files (part of the OpenBSD source tree),
moderately-sized files (some 192kbps MP3s), and some large-ish files
(some > 100MB zip archives).

One thing to note, though: performance when writing to the RAIDZ during
the massively parallel GCC build was far below what I would've expected.
The disks stayed somewhat busy, but it seemed like RAIDZ introduced
serialization to a degree that did not asynchronously interleave disk
access and parity calculation (ie: the disks weren't -slammed-, like
you'd expect if 40 compiles were hitting a filesystem for writing
concurrently).

Beyond that, I found the ZFS administration commands and its whole
methodology confusing.  I've used AIX's LVM, HP's LVM, and all manner of
standalone storage on more Unixes than I can remember, and ZFS is the most
opaque of them all.  AIX, even with its ODM in the way, still uses its
fstab equivalent (/etc/filesystems) to note mountpoints.  ZFS
filesystems just get mounted where they think they should go: they don't
show up in /etc/vfstab, and they don't show up usefully in the mount
list.  The data is just out there...somewhere.
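
If you want to know where a ZFS filesystem actually lives, you have to
ask ZFS itself; nothing shows up in vfstab (pool and dataset names here
are hypothetical):

   grep tank /etc/vfstab            # nothing
   zfs list                         # datasets and their mountpoints
   zfs get mountpoint tank/home     # where one dataset thinks it goes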

There's also a fundamental abstraction violation between the RAIDZ
metadevice and ZFS filesystem.  This is intentional[2], but is pretty
confusing when you come from a world where your storage system and the
filesystems thereon are separate entities.  You can't just carve a UFS
or FAT filesystem out of a RAIDZ or use a mirrored zpool for swap--the
only things that will go on them are ZFS filesystems, and I never quite
understood whether you could have more than one filesystem on a RAIDZ or
whether the filesystem and metadevice abstractions were so intertwined
that you only got one filesystem out of it.  The documentation seemed to
indicate the former, but the tools seemed to indicate the latter.
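
For what it's worth, the documentation's answer appears to be "many
filesystems per pool", along these lines (names hypothetical); whether
that matches what the tools actually let you do is another question:

   zfs create tank/src
   zfs create tank/mp3
   zfs set quota=10g tank/src       # per-filesystem properties
   zfs list -r tank                 # both filesystems, one RAIDZ pool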

So, this means on a large Solaris system with lots of compartmentalized
storage, you can potentially have:
   * Regular disklabel devices with UFS and other filesystems.
   * SVM metadevices with UFS and other filesystems.
   * ZFS devices with ZFS filesystems.
   * zpool metadevices (stripes, mirrors, and permutations thereof) with
     ZFS filesystems.
   * RAIDZ metadevices with ZFS filesystems.

I don't know about you, but I would not want to maintain such a system.

Maybe when ZFS grows up, it'll be a nice storage subsystem, but it's
much too new and too untested to use for anything you care about, IMO.


[0] From a resiliency standpoint, that is. Sun makes noise about how
     it's better/smarter/faster/40% less polyunsaturated fat, but it's
     really just a bit smarter about how the FS operates so it doesn't
     make naive errors.
[1] Because where does Finagle's Law of Dynamic Negatives apply more so
     than on the software RAID in your main server?
[2] Doesn't necessarily mean it's a good idea, but Sun is apparently
     fully aware of what they're trying to accomplish.
-- 
Jonathan Patschke   )  "The telephone is an antiquity...it is an
Elgin, TX          (   outmoded device which constantly disrupts
USA                 )  work."        --Ralf Huetter of Kraftwerk


