[geeks] Solaris 10 / OpenSolaris bits to be in next version of OSX

Scott Howard scott at doc.net.au
Thu Aug 10 06:44:50 CDT 2006


On Wed, Aug 09, 2006 at 08:35:17PM -0400, Charles Shannon Hendrix wrote:
> But what about this required integration of the two abstractions?

The fact that the two-level method was a hack from the start?

There are a number of advantages in having the filesystem not just being
aware of, but also controlling, the actual layout of data on disk.

A few examples that spring directly to mind are:
* Mirroring/remirroring only requires copying data that is in use, rather
than the entire disk
* Removing a disk becomes much easier - the FS just needs to move the
data that is on it to another disk (OK, so ZFS in S10 doesn't do this yet,
but it will real-soon)
* Read-ahead can be done much more intelligently (and ZFS does lots of
this!)

These can only be done when the filesystem knows the exact disk layout,
not some virtualised version of it.
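
To make that first point concrete, here's a rough Python sketch (purely
illustrative - invented names, nothing to do with the real ZFS code). A
block-level volume manager only sees anonymous blocks, so resilvering a
mirror means copying every block; a filesystem that owns the allocation
map can skip the free space entirely:

  def resilver_volume_manager(source_dev, target_dev, total_blocks):
      # A volume manager has no idea which blocks are in use,
      # so it has to copy every single one of them.
      for block in range(total_blocks):
          target_dev[block] = source_dev[block]
      return total_blocks

  def resilver_filesystem(source_dev, target_dev, allocated_blocks):
      # A filesystem that controls the layout knows exactly which
      # blocks hold live data, and copies only those.
      for block in allocated_blocks:
          target_dev[block] = source_dev[block]
      return len(allocated_blocks)

  # e.g. a 1000-block device with only three blocks actually allocated
  source = {b: b"data" for b in range(1000)}
  print(resilver_volume_manager(source, {}, 1000))    # copies 1000 blocks
  print(resilver_filesystem(source, {}, [5, 7, 42]))  # copies 3 blocks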

> > most of the bits are right ... I think ..."
> 
> Well, it sounds like you had bad disks to me.
> 
> I've *never* had a good drive that did that.  If a write fails, it
> knows, and it tells me.

How do you know? Because it didn't tell you otherwise? If you trust
your storage that much, then you probably don't need ZFS for its
availability features. You can probably also do without insurance for
your house/car/etc too... :)


> Or, is there something else that really is new magic in ZFS?

It's not magic, but (to the best of my knowledge) it's new - at the
filesystem level at least.  All it does is store checksums for ALL data
on the disk.  Nothing new there - even the disk does that with its own
CRC check.

What it does differently is that instead of storing the checksum with
the data itself, it stores it one rung higher up the filesystem metadata
tree. i.e., at each level you have the metadata plus the checksum of the
data that it points to.  This means that as the filesystem walks down
the tree, it can check at each step of the way that everything is in
order.  If, for example, a block didn't get written to disk half-way down
the tree (e.g., the disk didn't write it for some reason) then it knows not
to trust anything below that point and it will go to the redundant copy.

If we were just relying on the disk block CRC (or any other form of
checksum stored with the data itself) then we wouldn't detect anything
wrong - the data on that block is probably still valid, it's just stale
from the previous time something wrote to the block.
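
In rough pseudo-Python (a made-up sketch to illustrate the idea - the
names and structures are invented, not the real on-disk format), the
trick is that the block pointer in the parent carries the checksum of
the child it points to:

  import hashlib

  def checksum(data):
      return hashlib.sha256(data).hexdigest()

  class Block:
      def __init__(self, raw, children=()):
          self.raw = raw
          # Each child pointer records the checksum of the child's
          # contents - i.e. the checksum lives one rung up the tree.
          self.ptrs = [(child, checksum(child.raw)) for child in children]

  def walk_and_verify(block):
      # Walking down the tree, check every child against the checksum
      # stored in its parent before trusting anything below it.
      for child, expected in block.ptrs:
          if checksum(child.raw) != expected:
              raise IOError("checksum mismatch - fall back to redundant copy")
          walk_and_verify(child)

  # Simulate a lost write: the parent expects the new contents, but the
  # block on disk still holds the old (perfectly self-consistent) data.
  leaf = Block(b"new data the application wrote")
  root = Block(b"root metadata", children=[leaf])
  leaf.raw = b"stale data from a previous write"   # the write never made it
  try:
      walk_and_verify(root)
  except IOError as e:
      print(e)   # caught one level up, even though the block's own CRC is fine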

> > I think what they're saying is that their RAID-Z can automatically
> > detect, and transparently handle, read errors on the media that do not
> > actually involve complete hardware failure.  In short, anyone's RAID can
> > handle a failed disk, but ZFS can handle it when the disk is just
> > starting to go bad and is dropping bits, and identify exactly which disk
> > block it was that returned bad data even if the disk's ECC didn't catch
> > or flag it.
> 
> If this is true, then I definitely would not trust ZFS, because you just
> described them trying to handle a situation that should not be handled.

Why not?  It handles it by using redundancy - and in particular a
redundant copy that we _know_ we can trust due to the checksum.
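
Something like this (again, a made-up sketch rather than real ZFS code):
on a mirror, the checksum stored in the metadata tells you which copy to
believe, and the bad copy can then be rewritten from the good one:

  import hashlib

  def checksum(data):
      return hashlib.sha256(data).hexdigest()

  def mirrored_read(copies, expected):
      # Try each side of the mirror; the checksum kept one rung up the
      # metadata tree tells us which copy we can actually trust.
      for data in copies:
          if checksum(data) == expected:
              # A real implementation would now rewrite the bad copies
              # from this known-good one ("self-healing") too.
              return data
      raise IOError("all copies failed verification")

  good = b"the data we actually wrote"
  bad  = b"silently corrupted garbage"
  print(mirrored_read([bad, good], checksum(good)))   # returns the good copy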

> When the drive starts having errors, a properly working RAID should kick
> it out immediately.
> 
> Modern drives automatically remap.  When you start having errors, the
> drive is toast.

_IF_ it can detect it. That's what RAID is for. ZFS is for when the disk
can't detect it.

  Scott.


