[geeks] Solaris 10 / OpenSolaris bits to be in next version of OSX

Scott Howard scott at doc.net.au
Thu Aug 10 06:18:13 CDT 2006


On Wed, Aug 09, 2006 at 06:35:16PM -0500, Jonathan C. Patschke wrote:
> > Ooh.  I like the self-healing-data aspect.  I think that's worth
> 
> Except that there's no reason to break down the barrier between volume
> manager and filesystem to get there.  Set some blocks aside for
> checksums in the volume manager, and you have the same functionality.
> 
> For instance, AIX has been doing things like this for years.  -Every-
> slice lives inside a volume group managed by the LVM.  If you have a
> volume group stating that there will exist more than one copy of each
> block, AIX will -tell you- if/where a bad block comes into play, return

But it has to detect it first. I don't know how AIX does it, but my
guess is that it relies on the disk block CRC, or some other error being
returned by the disk or the disk driver. VxVM does the same. SVM
unfortunately doesn't, although the RFE to do so has been around for a
while now.

Whilst this certainly covers some of the failure modes (exactly how many
depends on whether it does its own checks or not), it doesn't handle many
of them.  If you've got a dodgy disk drive or (more likely) disk array,
then VxVM/SVM, and I suspect AIX, will simply not detect it.  If you've
got an IDE disk (or any other type of disk with write cache enabled by
default) and you lose power during a write, then it will go undetected.
If you've got a Seagate ST39120A drive and it starts corrupting data on
the fly, it will not be detected.

ZFS (mirror or RAID-Z) will detect it. Simple as that.

If you want to see this in action, give the following a go...

Create a mirror using your favourite volume manager/filesystem
combination, mount it, and start some IO.

Then "dd" over one side of the mirror with something like
dd if=/dev/zero of=/dev/rdsk/c2t2d2s2 bs=1048576 count=5

Then start counting to see how long until your machine panics.  Unless
you're running ZFS, that is, in which case it will just correct the data
as it goes.
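
If you want to watch ZFS do the repair as well, something like the
following should work (pool and device names here are only examples -
use spare disks you can afford to lose):

# zpool create testpool mirror c2t1d0 c2t2d0
# cp -r /usr/share/man /testpool

Clobber part of one side, skipping the first few MB so the vdev labels
stay intact:

# dd if=/dev/zero of=/dev/rdsk/c2t2d0s0 bs=1048576 seek=10 count=100

Then scrub the pool and have a look - the damaged blocks get rewritten
from the other half of the mirror, and the CKSUM column in the status
output shows what was found:

# zpool scrub testpool
# zpool status -v testpool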

> The added bonus to the AIX approach is that you get this for -all-
> slices, including paging slices.  It's Nice when you can replace a device

You mean like this?

# zfs create -V 2Gb data01/swap01
# swap -a /dev/zvol/dsk/data01/swap01
# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d31     85,31     16 8135408 8135408
/dev/zvol/dsk/data01/swap01 299,1      16 4194288 4194288
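
If you want that swap device to come back after a reboot, the
/etc/vfstab entry should look something like the standard swap one
(using the zvol name from above):

/dev/zvol/dsk/data01/swap01   -   -   swap   -   no   -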


OK, so ZFS for root isn't there yet (in Solaris 10 at least; it works in
Nevada, you just can't directly install onto it), but it's coming.

> The more sinister side of this is how Sun is selling it.  "Silent data
> corruption"?  What in the world is that supposed to mean?

And now I finally understand why you just don't get why it's such a good
thing.

"Silent" Data Corruption is data corruption that goes undetected, and it
occurs far more than you'd expect.  The "dd" above is an example of
Silent Data Corruption - your data is corrupted, but in a way that the
system was unable to detect (at least until it hit corrupted filesystem
metadata and crashed the system).  Your Hardware RAID box is probably
Silent Data Corruption waiting to happen - unless it's got a mirrored
cache.  What happens when the RAID controller resets itself? Or you
have a power outage and the battery doesn't last as long as it should?
Your data is toast, but you don't realise.

Google has 20,000 hits for "Silent Data Corruption" (with quotes); it's
hardly Sun snake-oil.

> Data corruption does not happen unless there is a bug in the stack of
> code somewhere between fread() and the SCSI transaction or unless the
> hardware is defective.  There's nothing silent about it.  Either

There's nothing silent about it as long as it's detected.  A lot of the
time it isn't.

Time for a real-world example.  A few weeks ago parts of California,
including one of the Sun offices, had a series of power outages.  After
power came back, ZFS on one of Sun's main build servers started reporting
checksum errors.  A "zpool scrub" was run, and a total of over 500,000 bad
blocks were found (and corrected using the mirror on another array).

The cause appears to have been a dodgy RAID array (or more likely, its
battery) that didn't survive the outage and simply threw away any
outstanding (cached) writes.

With any other FS/VM I'm aware of, this would have toasted the filesystem
so badly that it would have needed backups to fix, but in this case ZFS
detected the errors and corrected them.  There were NO disk errors
reported, as the array didn't know anything was wrong.  There were no
disk CRC errors reported, as the data on the disk was fine - it was just
stale.  A perfect example of Silent Data Corruption.
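
If you want to check your own pools for the same sort of thing, it's
just something like (pool name is an example):

# zpool scrub data01
# zpool status -v data01

The READ/WRITE/CKSUM counters and the "errors:" line at the bottom of
the status output will tell you if anything was found.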

> hardware or software has failed; this is the general case for a failure
> in a RAID metadevice.   So Sun's RAID implementation can handle that
> scenario?  Well, uh, great.  That's rather why we have RAIDs in the
> first place, isn't it?

No, we have RAID for non-silent data corruption - where the component
that fails is detected and action can be taken to recover.  RAID doesn't
add _any_ value for silent data corruption.

  Scott.


