OpenSolaris

Heads-up: ZFS corruption possible in build 58

Date: Fri, 23 Feb 2007 14:23:30 -0700
From: Mark Maybee <Mark.Maybee at Sun dot COM>
To: on-all at Sun dot COM, onnv-gate at onnv dot eng dot Sun dot COM, zfs-team at Sun dot COM
Subject: Heads-up: ZFS corruption possible in build 58

[ The fix for bug 6526196 was pulled into build 59, so build 59 should be
safe to run, at least from this perspective.  -- gk ]

As most of you are probably aware, jurassic has been experiencing
stability issues since build 58 was installed.  These issues have been
resolved by rolling jurassic back to build 57.  The root cause of
these issues has been isolated to a bug in ZFS that was introducing
corruption into the file systems.  For details about this bug see [*]:

6526196 assertion failed: zp->z_phys->zp_links > zp_is_dir ...

This bug was introduced in the putback of the fix for bug 6512391
in build 58.  A fix for 6526196 will be put back into build 60.
Unfortunately, this means that builds 58 and 59 have this bug.

ZFS should not be used with builds 58 or 59.  If you use (or have
used) ZFS with these builds and the system reboots uncleanly
(e.g., power loss or system panic) while there is ZFS activity, your
file system may become corrupt.  If this happens, your best chance
of recovery is to install the fix for 6526196 (i.e., nightlies after
the putback, or build 60) and put the following line into /etc/system:

set zfs:zfs_errorok = 1

This will cause ZFS to attempt to fix any corruption that it finds
and print a message on the console, rather than panicking.  Note: if
you have not yet experienced corruption, you should roll your system
back to build 57.

The ZFS team apologizes for the recent pain and aggravation we have
caused jurassic users.  However, we are thankful that jurassic has
served its purpose well here.  We discovered a serious bug in ZFS
here within Sun *before* it impacted our customers.  In an effort
to avoid this sort of disruption on jurassic in the future, we are in
the process of adjusting our tests and testing methodologies to
prevent this type of bug from being introduced again.

Mark Maybee
ZFS team co-lead

[*] Some technical details:

This bug is what we call in ZFS-land a "future leak".  ZFS performs
all file system updates in the context of a transaction.  At any
given moment there are many transactions headed out to disk.
When multiple changes are made to the same file you may end up with
a series of transactions on the same data.  When we experience a
future leak, a change in a transaction at time N+1 may show up in
a transaction at time N.  In the absence of a system panic, this
does *not* manifest as any sort of corruption.  It just means that
some delta change will be committed to stable storage a bit earlier
than we intended.  However, if we *do* experience a panic between
committing transaction N to disk and committing transaction N+1, then
we *do* see corruption:  when we bring the file system back on-line,
its state is not temporally consistent.  Changes that should not have
made it to disk have made it to disk.  This corruption can easily
result in system panics (as seen on jurassic).
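
The effect described above can be illustrated with a small toy model.
This is only a sketch under stated assumptions, not ZFS code: the
"disk" is a Python dict, and the names commit, consistent, run,
"dirent" and "links" are hypothetical, chosen for illustration.  The
transaction at time N creates a file and the transaction at time N+1
unlinks it; with a future leak, the link-count change from N+1 is
written out as part of N.

```python
# Toy model only: names and structure are illustrative, not ZFS internals.

def commit(disk, txn):
    """Return a new on-disk state with one transaction's changes applied."""
    new = dict(disk)
    for key, value in txn.items():
        if value is None:
            new.pop(key, None)      # None means "remove this item"
        else:
            new[key] = value
    return new

def consistent(disk):
    """A directory entry must exist exactly when the link count is positive."""
    return ("dirent" in disk) == (disk.get("links", 0) > 0)

def run(leak, crash_after_n):
    """Replay transactions N and N+1; report whether the disk is consistent."""
    disk = {}
    txn_n = {"dirent": "file", "links": 1}    # time N: create the file
    txn_n1 = {"dirent": None, "links": 0}     # time N+1: unlink it
    if leak:
        # Future leak: the link-count change from N+1 goes out with N.
        txn_n = dict(txn_n, links=txn_n1["links"])
    disk = commit(disk, txn_n)
    if not crash_after_n:
        disk = commit(disk, txn_n1)           # N+1 reaches disk only without a crash
    return consistent(disk)
```

In this model, run(leak=True, crash_after_n=False) still ends in a
consistent state, because N+1 eventually lands; only the combination
of a leak and a crash between N and N+1 leaves a directory entry that
points at a file with a link count of zero.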