Heads-up: ZFS corruption possible in build 58

Date: Fri, 23 Feb 2007 14:23:30 -0700
From: Mark Maybee <Mark.Maybee at Sun dot COM>
To: on-all at Sun dot COM, onnv-gate at onnv dot eng dot Sun dot COM, zfs-team at Sun dot COM
Subject: Heads-up: ZFS corruption possible in build 58

[ The fix for 6526196 was pulled into build 59, so build 59 should be safe to run, at least from this perspective. -- gk ]

As most of you are probably aware, jurassic has been experiencing stability issues since build 58 was installed. These issues have been resolved by rolling jurassic back to build 57. The root cause for these issues has been isolated to a bug in ZFS that was introducing corruption into the file systems. For details about this bug see[*]:

    6526196 assertion failed: zp->z_phys->zp_links > zp_is_dir ...

This bug was introduced in the putback of the fix for bug 6512391 in build 58. A fix for 6526196 will be put back into build 60. Unfortunately, this means that builds 58 and 59 have this bug. ZFS should not be used with builds 58 or 59.

If you are using (or have used) ZFS with these builds and the system reboots uncleanly (e.g., power loss or system panic) while there is ZFS activity, your file system may become corrupt. If this happens, your best chance of recovery is to install the fix for 6526196 (i.e., nightlies after the putback, or build 60) and put the following line into /etc/system:

    set zfs:zfs_errorok = 1

This will cause ZFS to attempt to fix any corruption that it finds and print a message on the console, rather than panicking.

Note: if you have not yet experienced corruption, you should roll your system back to build 57.

The ZFS team apologizes for the recent pain and aggravation we have caused jurassic users. However, we are thankful that jurassic has served its purpose well here. We discovered a serious bug in ZFS here within Sun *before* it impacted our customers.
In an effort to avoid this sort of disruption on jurassic in the future, we are in the process of adjusting our tests and testing methodologies to prevent this type of bug from being introduced again.

Mark Maybee
ZFS team co-lead

[*] Some technical details:

This bug is what we call in ZFS-land a "future leak". ZFS performs all file system updates in the context of a transaction. At any given moment there are many, many transactions headed out to disk. When multiple changes are made to the same file you may end up with a series of transactions on the same data. When we experience a future leak, a change in a transaction at time N+1 may show up in a transaction at time N.

In the absence of a system panic, this does *not* manifest as any sort of corruption. It just means that some delta change will be committed to stable storage a bit earlier than we intended. However, if we *do* experience a panic between committing transaction N to disk and committing transaction N+1, then we *do* see corruption: when we bring the file system back on-line, its state is not temporally consistent. Changes that should not have made it to disk have made it to disk. This corruption can easily result in system panics (as seen on jurassic).
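
To make the "future leak" description concrete, here is a toy Python sketch. It is not ZFS code and does not reflect the actual ZFS data structures: the two-field "disk", the `links == entries` invariant, and the names `run`, `consistent`, `leak` and `crash_after` are all assumptions made up for illustration. It only shows why a change written one transaction early is harmless on a clean run but leaves inconsistent state if a panic lands between transaction N and transaction N+1.

```python
# Toy model of a "future leak". All names are illustrative, not ZFS code.

def consistent(disk):
    """Toy on-disk invariant: link count matches the number of directory entries."""
    return disk["links"] == disk["entries"]

def run(txgs, leak=False, crash_after=None):
    """Commit transactions in order and return the resulting on-disk state.

    txgs        -- list of dicts, each mapping a field name to its new value
    leak        -- if True, the 'links' update of transaction n+1 is written
                   out as part of transaction n (the future leak)
    crash_after -- index of the last transaction committed before a panic,
                   or None for a run with no panic
    """
    disk = {"links": 0, "entries": 0}
    for n, txg in enumerate(txgs):
        changes = dict(txg)
        if leak and n + 1 < len(txgs) and "links" in txgs[n + 1]:
            # The change that belongs to the *next* transaction leaks into this one.
            changes["links"] = txgs[n + 1]["links"]
        disk.update(changes)
        if crash_after == n:
            break  # panic: later transactions never reach the disk
    return disk

txgs = [
    {"links": 1, "entries": 1},  # transaction N: create a file
    {"links": 0, "entries": 0},  # transaction N+1: remove it again
]

clean_no_leak = run(txgs)
clean_with_leak = run(txgs, leak=True)
crash_no_leak = run(txgs, crash_after=0)
crash_with_leak = run(txgs, leak=True, crash_after=0)

for name, disk in [
    ("clean, no leak", clean_no_leak),
    ("clean, leak", clean_with_leak),
    ("panic after N, no leak", crash_no_leak),
    ("panic after N, leak", crash_with_leak),
]:
    print(f"{name:24} {disk} consistent={consistent(disk)}")
```

In this toy, only the last case (leak plus a panic after transaction N) ends up inconsistent: the disk holds the link count from N+1 alongside the directory entry from N, which is the "not temporally consistent" state described above.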