Thread: Proposal: ZFS Hot Spare support

Replies: 39 - Last Post: Apr 11, 2006 12:39 PM by: billm
eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 9:03 AM

As mentioned last night, we've been reviewing a proposal for hot spare
support in ZFS. Below you can find a current draft of the proposed
interfaces. This has not yet been submitted for ARC review, but
comments are welcome. Note that this does not include any enhanced FMA
diagnosis to determine when a device is "faulted". This will come in a
follow-on project, of which some preliminary designs have been sketched
out but not enough to draft any coherent proposal.

- Eric


A. DESCRIPTION

ZFS, as an integrated volume manager and filesystem, has the ability to
replace disks within an active pool. This allows administrators to
replace failing or faulted drives to keep the system functioning
with the required level of replication. Most other volume managers also
support the ability to perform this replacement automatically through
the use of "hot spares". This case will add this functionality to ZFS.

This case will increment the on-disk version number in accordance with
PSARC 2006/206, as the resulting labels introduce a new pool state that
older pools will not understand, and exported pools containing hot
spares will not be importable on earlier versions.

B. POOL MANAGEMENT

Hot spares are stored with each pool, although they can be overlapped
between different pools. This allows administrators to reserve
system-wide hot spares, as well as per-pool hot spares according to their
policies.

1. Creating a pool with hot spares

A pool can be created with hot spares by using the new 'spare' vdev:

# zpool create test mirror c0d0 c1d0 spare c2d0 c3d0

This will create a pool with a single mirror and two spares. Only a
single 'spare' vdev can be specified, though it can appear anywhere
within the command line. The resulting pool looks like the following:

# zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0d0      ONLINE       0     0     0
            c1d0      ONLINE       0     0     0
        spares
          c2d0        ONLINE
          c3d0        ONLINE

errors: No known data errors

2. Adding hot spares to a pool

Hot spares can be added to a pool in the same manner by using 'zpool
add':

# zpool add test spare c4d0 c5d0

This will add two disks to the set of available spares in the pool.

3. Removing hot spares from a pool

Hot spares can be removed from a pool with the new 'zpool remove'
subcommand. This subcommand suggests the ability to remove arbitrary
devices, and certainly is a feature that will be supported in a future
release, but currently this will only allow removing hot spares. For
example:

# zpool remove test c2d0

If the hot spare is currently spared in, then the command will print an
error and exit.

4. Activating a hot spare

Hot spares can be used for replacement just like any other device using
'zpool replace'. If ZFS detects that the device is a hot spare within
the same pool, then it will create a 'spare' vdev instead of a
'replacing' vdev:

# zpool replace test c0d0 c2d0
# zpool status
...
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0d0    ONLINE       0     0     0  35.5K resilvered
              c2d0    ONLINE       0     0     0  35.5K resilvered
            c1d0      ONLINE       0     0     0
        spares
          c2d0        SPARED    currently in use
          c3d0        ONLINE


The difference between a 'replacing' and 'spare' vdev is that the former
automatically removes the original drive once the replace completes.
With spares, the vdev remains until the original device is removed from
the system, at which point the hot spare is returned to the pool of
available spares. Note that in this example we have replaced an online
device. Under normal circumstances, the device in question would be
faulted or the administrator would have proactively offlined the device.

5. Deactivating a hot spare

There are 3 ways in which a hot spare can be deactivated: cancelling the
hot spare, replacing the original drive, or permanently swapping in the
hot spare.

To cancel a hot spare attempt, the user can simply 'zpool detach' the
hot spare in question, at which point it will be returned to the set of
available spares, and the original drive will remain in its current
position (faulted or not):

# zpool detach test c2d0
# zpool status
...
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0d0      ONLINE       0     0     0  35.5K resilvered
            c1d0      ONLINE       0     0     0
        spares
          c2d0        ONLINE
          c3d0        ONLINE

If the original device is replaced, then the spare is automatically
removed once the replace completes:

# zpool replace test c0d0 c4d0
# zpool status
...
config:

        NAME            STATE     READ WRITE CKSUM
        test            ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            spare       ONLINE       0     0     0
              replacing ONLINE       0     0     0
                c0d0    ONLINE       0     0     0  38K resilvered
                c4d0    ONLINE       0     0     0  38K resilvered
              c2d0      ONLINE       0     0     0  38K resilvered
            c1d0        ONLINE       0     0     0
        spares
          c2d0          SPARED    currently in use
          c3d0          ONLINE

<wait for replace to complete>

# zpool status
...
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4d0      ONLINE       0     0     0  35.5K resilvered
            c1d0      ONLINE       0     0     0
        spares
          c2d0        ONLINE
          c3d0        ONLINE

If the user instead wants the hot spare to permanently assume the place
of the original device, the original device can be removed with 'zpool
detach'. At this point the hot spare will become a functioning device,
and automatically be removed from the list of available hot spares
(for all pools if it is shared):

# zpool detach test c0d0
# zpool status
...
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c2d0      ONLINE       0     0     0  35K resilvered
            c1d0      ONLINE       0     0     0
        spares
          c3d0        ONLINE
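
The three deactivation paths in this section can be summarized with a small model. The Python sketch below is only an illustration of the bookkeeping described above; the `Pool` class and its method names are hypothetical, it covers a single pool, and it ignores resilvering and spares shared between pools.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pool:
    """Toy model of one pool (hypothetical, not a ZFS data structure)."""
    devices: List[str]                  # devices currently holding data
    spares: List[str]                   # hot spares configured for the pool
    spared: Dict[str, str] = field(default_factory=dict)  # original -> active spare

    def activate(self, original: str, spare: str) -> None:
        """Model 'zpool replace <original> <spare>': the spare is spared in."""
        if original not in self.devices:
            raise ValueError(f"{original} is not a device in this pool")
        if spare not in self.spares or spare in self.spared.values():
            raise ValueError(f"{spare} is not an available spare")
        self.spared[original] = spare

    def cancel(self, spare: str) -> None:
        """Model 'zpool detach <spare>': the original stays, the spare is freed."""
        for original, active in list(self.spared.items()):
            if active == spare:
                del self.spared[original]
                return
        raise ValueError(f"{spare} is not spared in")

    def replace_original(self, original: str, new_device: str) -> None:
        """Model 'zpool replace <original> <new>' after the replace completes:
        the new device takes over and the spare becomes available again."""
        self.devices[self.devices.index(original)] = new_device
        self.spared.pop(original, None)

    def promote(self, original: str) -> None:
        """Model 'zpool detach <original>': the spare permanently takes its place."""
        spare = self.spared.pop(original)
        self.devices[self.devices.index(original)] = spare
        self.spares.remove(spare)


def fresh() -> Pool:
    """The example pool used throughout this section."""
    return Pool(devices=["c0d0", "c1d0"], spares=["c2d0", "c3d0"])


pool = fresh()
pool.activate("c0d0", "c2d0")
pool.promote("c0d0")
print(pool.devices, pool.spares)
```

The final state after `promote` corresponds to the last status listing above: c2d0 sits in the mirror and only c3d0 remains in the spare list.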

6. Determining device usage

A hot spare is considered 'in use' for the purpose of libdiskmgt and
zpool(1M) if it is labelled as a spare and is currently in the list of
active spares of one or more pools. If a spare is part of an exported
pool, it is not considered in use, largely because distinguishing this
case from a recently destroyed pool is difficult and not solvable in the
general case.
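
The rule above can be written as a simple predicate. This Python sketch is an assumption-laden illustration, not the libdiskmgt logic: `PoolInfo`, `spare_in_use`, and the `has_spare_label` flag are hypothetical names, and an exported pool is modelled by a plain boolean.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoolInfo:
    """Hypothetical summary of a pool as seen by the usage check."""
    name: str
    exported: bool = False
    spares: List[str] = field(default_factory=list)


def spare_in_use(device: str, has_spare_label: bool, pools: List[PoolInfo]) -> bool:
    """A device is 'in use' only if it is labelled as a spare and appears in
    the spare list of at least one pool that is not exported."""
    if not has_spare_label:
        return False
    return any(device in pool.spares for pool in pools if not pool.exported)


pools = [
    PoolInfo("test", spares=["c2d0"]),
    PoolInfo("old", exported=True, spares=["c5d0"]),
]
print(spare_in_use("c2d0", True, pools))  # spare of an active pool
print(spare_in_use("c5d0", True, pools))  # spare of an exported pool only
```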

C. AUTOMATED REPLACEMENT

In order to perform automated replacement, a ZFS FMA agent will be added
that subscribes to 'fault.zfs.vdev.*' faults. When a fault is received,
the agent will examine the pool to see if it has any available hot
spares. If so, it will perform a 'zpool replace' with an available
spare. The initial algorithm for this will be 'first come, first
serve', which may not be ideal for all circumstances (such as when not
all spares are the same size). It is anticipated that these
circumstances will be rare, and that the algorithm can be improved in
the future.
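
A minimal sketch of what a 'first come, first serve' selection might look like, written in Python purely for illustration. The `Spare` type and `pick_spare` function are hypothetical, and the size check assumes that a spare must be at least as large as the device it replaces; none of this is taken from the actual agent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spare:
    """Hypothetical description of one hot spare."""
    name: str
    size_gb: int
    in_use: bool = False


def pick_spare(spares: List[Spare], failed_size_gb: int) -> Optional[Spare]:
    """Return the first spare in list order that is free and large enough
    (assumption: a smaller spare cannot stand in for a larger device)."""
    for spare in spares:
        if not spare.in_use and spare.size_gb >= failed_size_gb:
            return spare
    return None


spares = [Spare("c2d0", 72), Spare("c3d0", 36)]
chosen = pick_spare(spares, failed_size_gb=36)
print(chosen.name if chosen else "no spare available")
```

With spares of mixed sizes, list order alone decides the outcome: here the 72G spare is taken for a 36G device, which is the non-ideal case the paragraph mentions.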

This is currently limited by the fact that the ZFS diagnosis engine only
emits faults when a device has disappeared from the system. When the DE
is enhanced to proactively fault drives based on error rates, then the
agent will automatically leverage this feature.

In addition, note that there is no automated response capable of
bringing the original drive back online. The user must explicitly take
one of the actions described above. A future enhancement will
allow ZFS to subscribe to hotplug events and automatically replace the
affected drive when it is replaced on the system.

D. MANPAGE DIFFS

XXX
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Ellis, Mike
Mike.Ellis@fmr.com
RE: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 10:20 AM   in response to: eschrock

I didn't catch a mention of RaidZ in your note.

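How would hot-spares play in a RaidZ type configuration? (Especially
with the "auto-return-home" (or predictive replacement) feature you
mention.) [In traditional RAID arrays, hot-spare rebuilds and
"go-home-transitions" are handled differently to cut down on exposure
windows and resource utilization; not sure if/how that applies here...]

If I read/interpreted the last part of your note correctly, I think it's
OK to use a max-size LUN to hot-spare any (LUN [...] once its job is done,
ready to spare for any other pool/lun. (Obviously not the entire hot-spare
will be used if it's "sparing" for a smaller failed LUN.)

Maybe there is no difference between Mirrored/RaidZ-configurations (ZFS
masking all this?), but even in this case some note regarding this
working for both mirrored and RaidZ configurations might make sense?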



eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 10:28 AM   in response to: Ellis, Mike

On Thu, Mar 30, 2006 at 01:20:20PM -0500, Ellis, Mike wrote:
> I didn't catch a mention of RaidZ in your note.
>
> How would hot-spares play in a RaidZ type configuration? (Especially
> with the "auto-return-home" (or predictive replacement) feature your
> mention.[ In traditional Raid-Arrays hot-spare rebuilds and
> "go-home-transitions" are handled differently to cut down on exposure
> windows, and resource utilization, not sure if/how that applies here...
> ]
>
> If I read/interpreted the last part of your note, I think its OK to use
> a max-size LUN to hot-spare any (LUN [...] once its job is done, ready
> to spare for any other pool/lun. (obviously not the entire hot-spare
> will be used if its "sparing" for a smaller failed LUN).

Yep. The initial concern raised was "what if I have a pool with half
36G disks and half 72G disks?" If you then have both 36G and 72G
spares, then using a 72G spare for a 36G disk could potentially deprive
you of a needed hot spare should a 72G disk fail. In general, this is a
misconfigured system, since it gives you a false sense of security when
examining your available hot spares. Hence not worrying about it in the
initial version.
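
To make the 36G/72G concern concrete, here is a small worked example in Python. It compares taking the first spare that fits with taking the smallest spare that fits. Both policies and all function names are hypothetical illustrations; the smallest-fit policy is not something proposed for the initial version.

```python
from typing import Callable, List, Optional

Policy = Callable[[List[int], int], Optional[int]]


def first_fit(spares: List[int], needed: int) -> Optional[int]:
    """Index of the first spare (sizes in GB) that is large enough."""
    for i, size in enumerate(spares):
        if size >= needed:
            return i
    return None


def best_fit(spares: List[int], needed: int) -> Optional[int]:
    """Index of the smallest spare that is large enough."""
    candidates = [(size, i) for i, size in enumerate(spares) if size >= needed]
    return min(candidates)[1] if candidates else None


def covered(policy: Policy, spares: List[int], failures: List[int]) -> int:
    """Count how many failures, handled in order, receive a spare."""
    remaining = list(spares)
    count = 0
    for needed in failures:
        i = policy(remaining, needed)
        if i is not None:
            remaining.pop(i)
            count += 1
    return count


spares = [72, 36]      # one 72G spare listed first, one 36G spare
failures = [36, 72]    # a 36G disk fails first, then a 72G disk
print(covered(first_fit, spares, failures))
print(covered(best_fit, spares, failures))
```

Under first fit the 36G failure consumes the 72G spare, so the later 72G failure is left without one; smallest fit covers both failures in this particular ordering.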

> Maybe there is no difference between Mirrored/RaidZ-configurations (ZFS
> masking all this?), but even in this case some note regarding this
> working for both mirrored and RaidZ configurations might make sense?

Yes, Mirror and RAID-Z replacements are handled identically, and use the
same resilvering code. There is no need to do any special-casing or
worry about "exposure windows" or anything like that. I can add
statements to that effect. Note that it may also be possible to
hot-spare unreplicated pools with the arrival of predictive analysis and
pro-active replacement. The usefulness of this feature is rather
questionable, however.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



David Magda
dmagda@ee.ryerson.ca
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 5:33 PM   in response to: eschrock

On Mar 30, 2006, at 12:03, Eric Schrock wrote:

> A hot spare is considered 'in use' for the purpose of libdiskmgt and
> zpool(1M) if it is labelled as a spare and is currently in one or more
> pool's list of active spares. If a spare is part of an exported pool,
> it is not considered in use, due largely to the fact that
> distinguishing
> this case from a recently destroyed pool is difficult and not solvable
> in the general case.

Would it be possible (or useful) to have a 'pool' of spares available
to a couple of ZFS pools?

Instead of associating the disks with a particular pool, you would be
able to say "if a disk fails in ZFS pool X, Y, or Z, grab a disk 1,
2, or 3; if a disk fails in ZFS pool A, B, or C, grab disk 4 or 5;
all other ZFS pools should grab disk 6".



Barry Robison
barryr@al.com.au
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 6:31 PM   in response to: David Magda

David Magda wrote:

>
> Would it be possible (or useful) to have a 'pool' of spares available
> to a couple of ZFS pools?
>
> Instead of associating the disks with a particular pool, you would be
> able to say "if a disk fails in ZFS pool X, Y, or Z, grab a disk 1,
> 2, or 3; if a disk fails in ZFS pool A, B, or C, grab disk 4 or 5;
> all other ZFS pools should grab disk 6".

B. POOL MANAGEMENT

Hot spares are stored with each pool, although they can be overlapped
between different pools. This allows administrators to reserve
system-wide hot spares, as well as per-pool hot spares according to their
policies.


So spares can belong to multiple pools, I take it.




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 9:45 PM   in response to: Barry Robison

On Fri, Mar 31, 2006 at 01:31:49PM +1100, Barry Robison wrote:
>
> So spares can belong to multiple pools, I take it.
>

Yep. Here's an example:

# zpool create test mirror c0t0d0 c0t1d0 spare c1t0d0 c1t1d0
# zpool create test2 mirror c4t0d0 c4t1d0 spare c1t0d0 c1t1d0
# zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      ONLINE
          c1t1d0      ONLINE

errors: No known data errors

  pool: test2
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test2         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0    ONLINE       0     0     0
            c4t1d0    ONLINE       0     0     0
        spares
          c1t0d0      ONLINE
          c1t1d0      ONLINE

errors: No known data errors
# zpool replace test c0t0d0 c1t0d0
# zpool status
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Mar 30 21:42:37 2006
config:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0t0d0  ONLINE       0     0     0  35.5K resilvered
              c1t0d0  ONLINE       0     0     0  35.5K resilvered
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      SPARED    currently in use
          c1t1d0      ONLINE

errors: No known data errors

  pool: test2
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        test2         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0    ONLINE       0     0     0
            c4t1d0    ONLINE       0     0     0
        spares
          c1t0d0      SPARED    in use by pool 'test'
          c1t1d0      ONLINE

errors: No known data errors

It's probably a bug that the 'test' pool is reported as ONLINE. By
definition, a 'spare' vdev should probably be treated as DEGRADED. I
can fix that...
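
The way a shared spare shows up differently in each pool's listing can be sketched as follows. This Python snippet only mirrors the status strings in the output above; `spare_status` and the `active_in` mapping are hypothetical names, not the zpool implementation.

```python
from typing import Dict, Optional


def spare_status(spare: str, viewing_pool: str, active_in: Dict[str, str]) -> str:
    """Status text for a spare as listed under `viewing_pool`.

    `active_in` maps a spare to the pool where it is currently spared in
    (hypothetical bookkeeping for this example)."""
    owner: Optional[str] = active_in.get(spare)
    if owner is None:
        return "ONLINE"
    if owner == viewing_pool:
        return "SPARED currently in use"
    return f"SPARED in use by pool '{owner}'"


active_in = {"c1t0d0": "test"}
for pool in ("test", "test2"):
    for spare in ("c1t0d0", "c1t1d0"):
        print(pool, spare, spare_status(spare, pool, active_in))
```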

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



darrenr

Posts: 2,060
From:

Registered: 6/8/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 11:16 PM   in response to: eschrock

Eric Schrock wrote:

> ...
>
> # zpool replace test c0t0d0 c1t0d0
> # zpool status
>   pool: test
>  state: ONLINE
>  scrub: resilver completed with 0 errors on Thu Mar 30 21:42:37 2006
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         test          ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             spare     ONLINE       0     0     0
>               c0t0d0  ONLINE       0     0     0  35.5K resilvered
>               c1t0d0  ONLINE       0     0     0  35.5K resilvered
>             c0t1d0    ONLINE       0     0     0
>         spares
>           c1t0d0      SPARED    currently in use
>           c1t1d0      ONLINE
> ...
> It's probably a bug that the 'test' pool is reported as ONLINE. By
> definition, a 'spare' vdev should probably be treated as DEGRADED. I
> can fix that...

To me the output here is a little confusing. Shouldn't the status
of c0t0d0 in mirror's spare output say something other than "ONLINE"?
Perhaps also that for c1t0d0?

I'd expect c1t0d0 to be ONLINE (in the mirror/spare output) after the
replacement is complete and at some other state in the meantime.

Darren




David Magda
dmagda@ee.ryerson.ca
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 4:48 AM   in response to: eschrock

On Mar 31, 2006, at 00:45, Eric Schrock wrote:

> # zpool create test mirror c0t0d0 c0t1d0 spare c1t0d0 c1t1d0
> # zpool create test2 mirror c4t0d0 c4t1d0 spare c1t0d0 c1t1d0

Yes, I must have read over section B too quickly, since this is more
or less what I meant in my question.

Thanks for clearing things up.

Regards,
David



eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 8:46 PM   in response to: David Magda

On Thu, Mar 30, 2006 at 08:33:32PM -0500, David Magda wrote:
>
> Would it be possible (or useful) to have a 'pool' of spares available
> to a couple of ZFS pools?
>
> Instead of associating the disks with a particular pool, you would be
> able to say "if a disk fails in ZFS pool X, Y, or Z, grab a disk 1,
> 2, or 3; if a disk fails in ZFS pool A, B, or C, grab disk 4 or 5;
> all other ZFS pools should grab disk 6".

We kicked this idea around for a while, but there are two main reasons
for not doing it:

1. You need to invent a new grammar for describing arbitrary relations
between spares and pools. We can't leverage any existing ZFS CLI to
do this for us.

2. The information about which spares are allocated to your pool is no
longer associated with your disks. With ZFS, we've tried very hard to
keep all information about your data, including how to mount it,
share it, and manage redundancy, with the data itself. Having a
separate pool means that 'zpool export' no longer takes the information
about my hot spares with it, which is not too appealing.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Jeff Bonwick
bonwick@zion.eng.sun...
Re: Proposal: ZFS Hot Spare support
Posted: Mar 30, 2006 11:57 PM   in response to: eschrock

> > spares
> > c1t0d0 SPARED currently in use
> > c1t1d0 ONLINE

> To me the output here is a little confusing. Shouldn't the status
> of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> Perhaps also that for c1t0d0?

I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
not in use (and thus available). But that's still somewhat indirect.

How about TAKEN and AVAILABLE?

Jeff




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 9:39 AM   in response to: Jeff Bonwick

On Thu, Mar 30, 2006 at 11:57:35PM -0800, Jeff Bonwick wrote:
> > > spares
> > > c1t0d0 SPARED currently in use
> > > c1t1d0 ONLINE
>
> > To me the output here is a little confusing. Shouldn't the status
> > of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> > Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I'm all for AVAILABLE. It's still possible to have UNAVAIL spares as
well, as the kernel verifies that they can be opened and correspond to a
known device. Of course, this makes me wonder about replacing hot
spares. If we validate the GUID is known, how does one replace a hot
spare? If I swap in a different drive, it'll complain that the disk
doesn't match the known spare. Perhaps 'zpool replace' needs to support
hot spares, and the future hotplug work can replace them automatically.
I'll need to think about that for a bit...

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



darrenr

Posts: 2,060
From:

Registered: 6/8/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 12:51 PM   in response to: Jeff Bonwick

Jeff Bonwick wrote:

>>> spares
>>>   c1t0d0      SPARED    currently in use
>>>   c1t1d0      ONLINE
>>
>> To me the output here is a little confusing. Shouldn't the status
>> of c0t0d0 in mirror's spare output say something other than "ONLINE"?
>> Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I agree with those suggestions.

Darren




darrenr

Posts: 2,060
From:

Registered: 6/8/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 5:23 PM   in response to: Jeff Bonwick

Jeff Bonwick wrote:

>>> spares
>>>   c1t0d0      SPARED    currently in use
>>>   c1t1d0      ONLINE
>>
>> To me the output here is a little confusing. Shouldn't the status
>> of c0t0d0 in mirror's spare output say something other than "ONLINE"?
>> Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I forgot to mention, I think that the "ONLINE" status of the disk being
spared-out should be something different.

I think this is what is meant by the "35.5k resilvering"?

To me this is the only obscure part of the output.
I'd rather see something like:


        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0t0d0  RESYNC       0     0     0  35.5K
              c1t0d0  RESYNC       0     0     0  35.5K
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      TAKEN     currently in use
          c1t1d0      AVAILABLE


I'm tempted to suggest that "RESYNC" should be different for the incoming
disk and the outgoing disk, maybe:


        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0t0d0  OUTSYNC      0     0     0  35.5K
              c1t0d0  INSYNC       0     0     0  35.5K
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      TAKEN     currently in use
          c1t1d0      AVAILABLE


The idea is that the "spares" section under "mirror" is now
self-explanatory. I'm not too enamoured of "OUTSYNC" or "INSYNC" as
useful words here, but hopefully they convey the idea. "SYNCUP" and
"SYNCDOWN" are some other alternatives I can think of right now.

Darren




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 6:36 PM   in response to: darrenr

On Fri, Mar 31, 2006 at 05:23:28PM -0800, Darren Reed wrote:
>
> I forgot to mention, I think that the "ONLINE" status of the disk being
> spared-out should be something different.
>

Well, the example I gave is pretty contrived. Under normal
circumstances, the device you're sparing out is faulted. It's really
important that we show the actual state of that device, not just some
faked-up value. For example, the following all imply very different
capabilities of the pool:

        spare       DEGRADED
          diskA     ONLINE
          diskB     ONLINE

        spare       DEGRADED
          diskA     FAULTED
          diskB     ONLINE

        spare       DEGRADED
          diskA     DEGRADED
          diskB     ONLINE

Note that this is the same as with replacing. If you go to replace an
online device, we don't go change its state. We kicked around the idea
of trying to fake up something to visually represent which was being
replaced, but changing the 'state' definitely didn't work for the above
reasons. Even though a device is being replaced and/or spared, it has
a state that is distinct from its current role in the vdev tree.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



darrenr

Posts: 2,060
From:

Registered: 6/8/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 12:33 AM   in response to: eschrock

Eric Schrock wrote:

> On Fri, Mar 31, 2006 at 05:23:28PM -0800, Darren Reed wrote:
>
>> I forgot to mention, I think that the "ONLINE" status of the disk being
>> spared-out should be something different.
>
> Well, the example I gave is pretty contrived. Under normal
> circumstances, the device you're sparing out is faulted. It's really
> important that we show the actual state of that device, not just some
> faked-up value. For example, the following all imply very different
> capabilities of the pool:
>
>         spare       DEGRADED
>           diskA     ONLINE
>           diskB     ONLINE
>
>         spare       DEGRADED
>           diskA     FAULTED
>           diskB     ONLINE
>
>         spare       DEGRADED
>           diskA     DEGRADED
>           diskB     ONLINE
>
> Note that this is the same as with replacing.

Looking at those three, the "DEGRADED" for the first spare
set seems like a bug to me. My assumption is that:

ONLINE(spare) = ONLINE(diskA) + ONLINE(diskB)

and I think this is the intuitive way to read the above output.
If that isn't the story then something needs to not say "ONLINE".

> If you go to replace an
> online device, we don't go change its state. We kicked around the idea
> of trying to fake up something to visually represent which was being
> replaced, but changing the 'state' definitely didn't work for the above
> reasons. Even though a device is being replaced and/or spared, it has
> a state that is distinct from its current role in the vdev tree.

It would be very worthwhile if something could be faked, visually,
to represent what is going on inside, if only to avoid the first case
of output (above) which seems non-sensical.

Darren




przemolicc@pocz...
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 1:02 AM   in response to: eschrock

On Thu, Mar 30, 2006 at 09:03:30AM -0800, Eric Schrock wrote:
> 3. Removing hot spares from a pool
>
> Hot spares can be removed from a pool with the new 'zpool remove'
> subcommand. This subcommand suggests the ability to remove arbitrary
> devices, and certainly is a feature that will be supported in a future
> release, but currently this will only allow removing hot spares. For
> example:
>
> # zpool remove test c2d0
>
> If the hot spare is currently spared in, then the command will print an
> error and exit.

I am not sure whether shrinking a pool is being considered for the future,
but if it is, wouldn't it be better to use a different syntax?

[SPARES]
# zpool remove test spare c2d0

[SHRINKING]
# zpool remove test c2d0

This way I can distinguish between removing a spare and _shrinking_ the
pool. Without that I could easily make a mistake.

przemol



eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 9:47 AM   in response to: przemolicc@pocz...

On Fri, Mar 31, 2006 at 11:02:59AM +0200, przemolicc at poczta dot fm wrote:
>
> I am not sure whether shrinking of pool is considered in the future but if it is
> wouldn't be better to use another syntax:
>
> [SPARES]
> # zpool remove test spare c2d0
>
> [SHRINKING]
> # zpool remove test c2d0
>
> This way I distinguish between removing spare and _shrinking_ pool. Without
> that I could easily make a mistake.

For future pool removal, I anticipate having labelled mirrors and
RAID-Z vdevs, so that you can identify them by name, such as:

        mirror-1
          c0d0
          c1d0
        mirror-2
          c2d0
          c3d0

Then, you can remove a toplevel vdev by saying 'zpool remove mirror-1'.
The only way that this could become confusing is if you have an
unreplicated pool with hot spares, but I don't see this being a useful
configuration.

Note that another possibility would be:

# zpool remove mirror c0d0

Which means "remove the mirror containing disk c0d0", but that has other
issues (especially if we support mirrors of RAID-Z and more complicated
configurations).

This is definitely a reason not to have 'zpool remove' behave like
'zpool detach' for a single drive case.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Robert Milkowski
rmilkowski@task.gda.pl
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 2:59 AM   in response to: eschrock

Hello Eric,

This is great!

However, it would be really useful if you could specify that some of
the spares are global - so if I create a new pool these spares will be
assigned automatically.


--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 9:41 AM   in response to: Robert Milkowski

On Fri, Mar 31, 2006 at 12:59:47PM +0200, Robert Milkowski wrote:
> Hello Eric,
>
> This is great!
>
> However, it would be really useful if you could specify that some of
> the spares are global - so if I create a new pool these spares will be
> assigned automatically.

I'm hesitant to do this for two reasons:

1. We're creating auxiliary ZFS state that is independent of the pool
data. This means that we need to invent a new syntax for managing
system-wide global spares, as well as how to assign them to pools.

2. Creating pools is not a common operation. Most systems will have
only one or two pools. It's easy enough to simply add the
same spares to both pools, and more configurable.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Robert Milkowski
rmilkowski@task.gda.pl
Re[2]: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 3:59 PM   in response to: eschrock

Hello Eric,

Friday, March 31, 2006, 7:41:57 PM, you wrote:

ES> On Fri, Mar 31, 2006 at 12:59:47PM +0200, Robert Milkowski wrote:
>> Hello Eric,
>>
>> This is great!
>>
>> However, it would be really useful if you could specify that some of
>> the spares are global - so if I create a new pool these spares will be
>> assigned automatically.

ES> I'm hesitant to do this for two reasons:

ES> 1. We're creating auxiliary ZFS state that is independent of the pool
ES> data. This means that we need to invent a new syntax for managing
ES> system-wide global spares, as well as how to assign them to pools.

ES> 2. Creating pools is not a common operation. Most systems will have
ES> only one or two pools. It's easy enough to simply add the
ES> same spares to both pools, and more configurable.

I don't know - ZFS was mainly targeted at large systems (I mean that on
those systems you will see a big difference with ZFS) and, for example,
here we add quite a lot of storage on a regular basis (I won't make just
one large pool, rather many small pools), so creating global hot
spares at the beginning would be a welcome improvement - the same way
we have it on HW arrays.

btw: I guess hot spares in ZFS won't make it into U2...?


--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 4:11 PM   in response to: Robert Milkowski

On Sat, Apr 01, 2006 at 01:59:38AM +0200, Robert Milkowski wrote:
>
> I don't know - ZFS was mainly targeted at large systems (I mean that on
> those systems you will see a big difference with ZFS) and, for example,
> here we add quite a lot of storage on a regular basis (I won't make just
> one large pool, rather many small pools), so creating global hot
> spares at the beginning would be a welcome improvement - the same way
> we have it on HW arrays.

Why won't you make just one large pool, rather than many small pools?
The only reason not to do so is:

a. Different performance characteristics
or
b. Different fault tolerance characteristics

I can see a server with just two or three pools (one for the root disk,
one for customer data, etc), but I don't see why you would create lots
of new pools on a regular basis. Can you explain your use case and
reasons in a little more detail? "Because we can do it on product X"
doesn't really help, especially when a HW array is so fundamentally
different from a ZFS storage pool.

Supposing we were to adopt the idea of "global spares", where would this
information be stored? What would the zpool(1M) interface look like?
Could I still do per-pool spares? What would happen when I exported and
imported a pool? If a spare is swapped in permanently (an asynchronous
event in the kernel), does it then remove it from the global list of
spares for subsequent pools? I'm still having trouble envisioning the
details of how this would actually work...

> btw: I guess hot spares in ZFS won't make it into U2...?

Yes, that is correct.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Robert Milkowski
rmilkowski@task.gda.pl
Re[2]: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 4:41 PM   in response to: eschrock

Hello Eric,

Saturday, April 1, 2006, 2:11:09 AM, you wrote:

ES> On Sat, Apr 01, 2006 at 01:59:38AM +0200, Robert Milkowski wrote:
>>
>> I don't know - ZFS was mainly targeted at large systems (I mean that on
>> those systems you will see a big difference with ZFS) and, for example,
>> here we add quite a lot of storage on a regular basis (I won't make just
>> one large pool, rather many small pools), so creating global hot
>> spares at the beginning would be a welcome improvement - the same way
>> we have it on HW arrays.

ES> Why won't you make just one large pool, rather than many small pools?
ES> The only reason not to do so is:

ES> a. Different performance characteristics
ES> or
ES> b. Different fault tolerance characteristics

ES> I can see a server with just two or three pools (one for the root disk,
ES> one for customer data, etc), but I don't see why you would create lots
ES> of new pools on a regular basis. Can you explain your use case and
ES> reasons in a little more detail? "Because we can do it on product X"
ES> doesn't really help, especially when a HW array is so fundamentally
ES> different from a ZFS storage pool.

Answers to a) and b) are no and no.

In our case, in one solution where we're thinking of putting ZFS, we've got,
let's say, 8x 3511 JBODs connected to two hosts in a cluster. Right now
we have an additional head unit (with HW controllers) and we're doing a
RAID-5 group on every enclosure using 11 disks and leaving the last disk
as a global hot spare. With ZFS I was thinking of doing something
similar - raidz for every JBOD (so in this case I will end up with 8
pools and 8 hot spares).

Now I can make just one large raidz pool (+ some hot spares) but it
could be risky. So I can make one large pool which is actually a
"concatenation/stripe" of many raidz groups, where each raidz group is
built from 11 disks from one enclosure. That way availability is
better than having one large raidz pool and probably performance is
better as Bill pointed out (however I don't understand why). In that
configuration I would end up with a ~40TB logical data pool.
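
A rough back-of-the-envelope sketch of the capacity figures above. The disk size is not stated in the thread; 500GB is an assumption chosen here only because it reproduces the ~40TB total. The function name is illustrative, and raidz metadata overhead is ignored.

```python
def raidz_usable_tb(groups, disks_per_group, disk_tb, parity=1):
    """Approximate usable capacity of `groups` single-parity raidz groups.

    Each group gives up `parity` disks' worth of space to parity.
    Filesystem and metadata overhead is ignored.
    """
    data_disks = disks_per_group - parity
    return groups * data_disks * disk_tb


# Assumed: 8 enclosures, one 11-disk raidz group each, 500GB (0.5TB) disks.
print(raidz_usable_tb(groups=8, disks_per_group=11, disk_tb=0.5))  # whole stripe
print(raidz_usable_tb(groups=1, disks_per_group=11, disk_tb=0.5))  # one group
```

Under these assumptions a single enclosure's raidz group holds about one eighth of the total, which is the amount at stake when pools are kept separate per enclosure.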

Now what happens if two disks in one raidz group fail? I will lose
the whole 40TB of data.

What happens to the entire pool if there's a problem with one disk (very
long IOs but it's still working - it happens)? Instead of having a
problem with one smaller pool, now I've got a performance problem with
the entire 40TB pool.

Now if I want to serve some data from the other cluster node I can
just switch some pools to the other node - something I can't do with
one pool.



ES> Supposing we were to adopt the idea of "global spares", where would this
ES> information be stored? What would the zpool(1M) interface look like?
ES> Could I still do per-pool spares? What would happen when I exported and
ES> imported a pool? If a spare is swapped in permanently (an asynchronous
ES> event in the kernel), does it then remove it from the global list of
ES> spares for subsequent pools? I'm still having trouble envisioning the
ES> details of how this would actually work...

Maybe just another pool with hot spares? Then by default all new pools
would have a variable use_global_hotspares set to on?

Something like:

zpool create global_hotspares hotspare c1t0d0 c2t0d0 c3t0d0

if you don't want to use global_hotspares in a given pool you could do

zfs set use_global_hs=off pool

Now if a pool (normal pool) is exported and then imported it just
looks for a pool with either a specific ID, name or any other tag
which would mean it's a pool with global hotspares (only if
use_global_hs is set to on for the pool being imported). If no such
pool is available it can only use local hotspares directly attached to
it (if there are any). Now if you import a pool with global hotspares,
all currently active pools (or those imported later) which have use_global_hs
set to on will automatically use it.

??
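
The lookup described above could be modelled roughly as follows. This is only a sketch of the suggestion, not anything ZFS implements: the `Pool` class, `usable_spares` function, and the reserved pool name are all hypothetical stand-ins for the proposed behaviour.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical reserved name for the pool that holds the global spares.
GLOBAL_POOL_NAME = "global_hotspares"


@dataclass
class Pool:
    name: str
    spares: List[str] = field(default_factory=list)
    use_global_hs: bool = True  # proposed per-pool setting, on by default


def usable_spares(pool, imported_pools):
    """Return the spares a pool may draw on, local spares first.

    Global spares count only if the pool has use_global_hs on and the
    global spare pool is currently imported.
    """
    spares = list(pool.spares)
    if pool.use_global_hs:
        for other in imported_pools:
            if other.name == GLOBAL_POOL_NAME:
                spares.extend(other.spares)
    return spares


global_pool = Pool(GLOBAL_POOL_NAME, ["c1t0d0", "c2t0d0", "c3t0d0"])
tank = Pool("tank", ["c4t0d0"])
print(usable_spares(tank, [tank, global_pool]))
print(usable_spares(tank, [tank]))  # global spare pool not imported
```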

>> btw: I guess hot spares in ZFS won't make it into U2...?

ES> Yes, that is correct.

Thanks for info.


--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 5:15 PM   in response to: Robert Milkowski

On Sat, Apr 01, 2006 at 02:41:39AM +0200, Robert Milkowski wrote:
>
> Now I can make just one large raidz pool (+ some hot spares) but it
> could be risky. So I can make one large pool which is actually a
> "concatenation/stripe" of many raidz groups, where each raidz group is
> built from 11 disks from one enclosure. That way availability is
> better than having one large raidz pool and probably performance is
> better as Bill pointed out (however I don't understand why). In that
> configuration I would end up with a ~40TB logical data pool.
>
> Now what happens if two disks in one raidz group fail? I will lose
> the whole 40TB of data.

This won't be the case with metadata replication, which should be coming
soon. You will only lose the plain file contents of the objects
contained within that toplevel vdev.

Of course, if you're measuring "time to restore from backup", then it
doesn't matter if we survive the failure, since you'll still have to
restore all your data from backup. Although I could imagine some
creative ways of using zfs send/receive to make this faster.

> What happens to the entire pool if there's a problem with one disk (very
> long IOs but it's still working - it happens)? Instead of having a
> problem with one smaller pool, now I've got a performance problem with
> the entire 40TB pool.

This should be handled by the ZFS I/O scheduler automatically. We have
some work to do in this area, but I wouldn't design a feature around
lack of current performance.

> Now if I want to serve some data from the other cluster node I can
> just switch some pools to the other node - something I can't do with
> one pool.

Yes, this is definitely true.

> Maybe just another pool with hot spares? Then by default all new pools
> would have a variable use_global_hotspares set to on?
>
> Something like:
>
> zpool create global_hotspares hotspare c1t0d0 c2t0d0 c3t0d0
>
> if you don't want to use global_hotspares in a given pool you could do
>
> zfs set use_global_hs=off pool
>
> Now if a pool (normal pool) is exported and then imported it just
> looks for a pool with either a specific ID, name or any other tag
> which would mean it's a pool with global hotspares (only if
> use_global_hs is set to on for the pool being imported). If no such
> pool is available it can only use local hotspares directly attached to
> it (if there are any). Now if you import a pool with global hotspares,
> all currently active pools (or those imported later) which have use_global_hs
> set to on will automatically use it.

OK, so this is just a "magic pool" that behaves differently? This
starts to get very nasty very quickly. The name "global_hotspares" is
reserved, and all of a sudden all the operations I can do on it are
different. You can only add individual disks, you can't remove certain
disks, the output of "zpool status" has to be different, importing a hot
spare pool has to be handled specially, renames (when supported) will
have to be handled carefully, I can't create ZFS filesystems
in it, and the edge conditions continue...

Based on my observations, it seems to me that:

1. This introduces an order of magnitude more edge conditions that alter
normal interaction with the system.
2. It requires work (particularly "zpool set") that we haven't yet done.
3. It does not replace the need for per-pool spares.
4. It is not the common use case.
5. The behavior can be replicated with a small amount of manual work
given the current proposal.

We can implement this as a future RFE, but right now we should implement
the straightforward solution, and deal with the complexities of this
proposal at a later date.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Robert Milkowski
rmilkowski@task.gda.pl
Re[2]: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 2:40 AM   in response to: eschrock

Hello Eric,

Saturday, April 1, 2006, 3:15:10 AM, you wrote:

ES> On Sat, Apr 01, 2006 at 02:41:39AM +0200, Robert Milkowski wrote:
>>
>> Now I can make just one large raidz pool (+ some hot spares) but it
>> could be risky. So I can make one large pool which is actually a
>> "concatenation/stripe" of many raidz groups, where each raidz group is
>> built from 11 disks from one enclosure. That way availability is
>> better than having one large raidz pool and probably performance is
>> better as Bill pointed out (however I don't understand why). In that
>> configuration I would end up with a ~40TB logical data pool.
>>
>> Now what happens if two disks in one raidz group fail? I will lose
>> the whole 40TB of data.

ES> This won't be the case with metadata replication, which should be coming
ES> soon. You will only lose the plain file contents of the objects
ES> contained within that toplevel vdev.

Yeah, that would be better. But there's still the problem of how to correct
that situation - as you mentioned, you will probably be forced to
restore the whole 40TB of data instead of 5TB.



>> What happens to the entire pool if there's a problem with one disk (very
>> long IOs but it's still working - it happens)? Instead of having a
>> problem with one smaller pool, now I've got a performance problem with
>> the entire 40TB pool.

ES> This should be handled by the ZFS I/O scheduler automatically. We have
ES> some work to do in this area, but I wouldn't design a feature around
ES> lack of current performance.

That's good to hear.



ES> Based on my observations, it seems to me that:

ES> 1. This introduces an order of magnitude more edge conditions that alter
ES> normal interaction with the system.
ES> 2. It requires work (particularly "zpool set") that we haven't yet done.
ES> 3. It does not replace the need for per-pool spares.
ES> 4. It is not the common use case.

I can't agree with #4.
IMHO in most RAID environments, especially with a lot of disks, you
just create some global hot spares and don't think about it later when
adding new disks, etc.

ES> We can implement this as a future RFE, but right now we should implement
ES> the straightforward solution, and deal with the complexities of this
ES> proposal at a later date.

That's reasonable.

--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




Robert Milkowski
rmilkowski@task.gda.pl
Re[2]: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 11:24 AM   in response to: eschrock

Hello Eric,

Saturday, April 1, 2006, 3:15:10 AM, you wrote:

ES> This won't be the case with metadata replication, which should be coming
ES> soon. You will only lose the plain file contents of the objects
ES> contained within that toplevel vdev.

It just occurred to me - if there were a zfs command to get a list
of "broken" (data missing) files due to failure of some disks, then
with such a list one could restore only the bad files and not the whole pool
(assuming that you can overwrite these files).

Most backup software lets you restore only files listed in a file.

What do you think?


--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 11:32 AM   in response to: Robert Milkowski

On Mon, Apr 03, 2006 at 08:24:45PM +0200, Robert Milkowski wrote:
>
> It just occurred to me - if there were a zfs command to get a list
> of "broken" (data missing) files due to failure of some disks, then
> with such a list one could restore only the bad files and not the whole pool
> (assuming that you can overwrite these files).
>
> Most backup software lets you restore only files listed in a file.
>
> What do you think?

Starting in build 36, we get 50% of the way there. If you do a scrub of
a pool, and then run 'zpool status -v', you'll get a detailed list of
all the unrecoverable (logical) blocks in the pool found during the
scrub. The problem is that they are currently only identified by
dataset name and object number - not exactly conducive to repair
procedures. There is a future RFE to translate the object number to a
filename (when available), but it's non-trivial when the filesystem is
currently mounted. We can't grok around the internal DMU state without
going through the "front door" of the ZPL. Matt or Mark may be able to
shed some light on how much investigation they've done in this area, if
any.

The result, of course, would be _very_ cool. With background scrubbing
(also coming in the future), you will always have an up-to-date list of
damaged data in your pool, or hopefully lack thereof :-)

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Erik Trimble
Erik.Trimble@Sun.COM
Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 5:17 PM   in response to: Robert Milkowski

The main reason to have different ZFS pools is to implement redundancy
ACROSS JBOD enclosures.

I'm assuming that you can't add new disks to a vdev afterwards -
you can only add new vdevs to a pool. Or is this incorrect?




In Robert's case, the best thing to do is this (assuming he wants
maximum disk space usage, while still retaining some redundancy):

(for simplicity's sake, I'm showing a 3-array (3 drives/array) config)

zpool create tank raidz c0t0d0s2 c1t0d0s2 c2t0d0s2 raidz c0t1d0s2
c1t1d0s2 c2t1d0s2 raidz c0t2d0s2 c1t2d0s2 c2t2d0s2


That is, create a stripe of RAID-Z vdevs.


This insulates you against the loss of any one JBOD.
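
A small sketch of how that layout generalises: each raidz vdev takes one disk from every enclosure. It assumes, as the simplified example does, that the controller number (cN) identifies the enclosure and that every disk is LUN d0 on slice s2; the helper name is made up for illustration, and it only builds the command string rather than running it.

```python
def striped_raidz_command(pool, enclosures, disks_per_enclosure, slice_name="s2"):
    """Build a `zpool create` command with one raidz vdev per target number.

    Every raidz vdev takes target t from each enclosure, so losing a whole
    enclosure removes exactly one disk from every vdev.
    """
    parts = ["zpool", "create", pool]
    for target in range(disks_per_enclosure):
        parts.append("raidz")
        for ctrl in range(enclosures):
            parts.append(f"c{ctrl}t{target}d0{slice_name}")
    return " ".join(parts)


print(striped_raidz_command("tank", enclosures=3, disks_per_enclosure=3))
```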


You can then add the remaining disks as hot spares to the pool.


(of course, using the 3511s, you probably would be best off creating
each RAID-5 subarray using the HW controller, then simply striping them
using ZFS).


--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 6:32 PM   in response to: Erik Trimble

On Fri, Mar 31, 2006 at 05:17:25PM -0800, Erik Trimble wrote:
> The main reason to have different ZFS pools is to implement redundancy
> ACROSS JBOD enclosures.

I'm a little confused. To implement redundancy across anything, doesn't
that mean they have to be in the same pool? How do I get redundancy
across multiple pools?

> zpool create tank raidz c0t0d0s2 c1t0d0s2 c2t0d0s2 raidz c0t1d0s2
> c1t1d0s2 c2t1d0s2 raidz c0t2d0s2 c1t2d0s2 c2t2d0s2

But isn't this just one pool?

> (of course, using the 3511s, you probably would be best off creating
> each RAID-5 subarray using the HW controller, then simply striping them
> using ZFS).

It depends. If you want better performance, this might be true (though
benchmarks would be in order). If you want better fault tolerance, then
it's better to expose them as JBODs and have ZFS deal with them. Then
you get the self-healing capabilities of ZFS that you simply cannot get
from a hardware RAID solution. For sure, you would want to RAID the
subarrays, or else you're putting all your reliability entirely within
the hands of the hardware...

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



Jeff Bonwick
bonwick@zion.eng.sun...
Re: Re: Proposal: ZFS Hot Spare support
Posted: Apr 1, 2006 12:05 AM   in response to: eschrock

> It depends. If you want better performance, this might be true (though
> benchmarks would be in order). If you want better fault tolerance, then
> it's better to expose them as JBODs and have ZFS deal with them. Then
> you get the self-healing capabilities of ZFS that you simply cannot get
> from a hardware RAID solution.

Another option is to get the best of both worlds by letting the
arrays do RAID-5, and then mirroring or RAID-Z-ing the arrays.

A RAID-Z group of RAID-5 arrays can tolerate at least three
whole-disk failures before losing data. It can also tolerate
the failure of an entire array *plus* one whole-disk failure
on each of the remaining arrays. Using RAID-Z (or mirroring)
means that you get self-healing data: if an array returns bad data,
ZFS will detect it and reconstruct good data from the other arrays.
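
The failure counts above can be checked with a small brute-force model. This is a simplified sketch, not ZFS code: it assumes each RAID-5 array is lost once two or more of its disks fail, and that a single-parity RAID-Z group survives the loss of at most one array. The array and disk counts are arbitrary illustration values.

```python
from itertools import combinations


def survives(failed, arrays):
    """True if a single-parity RAID-Z group of RAID-5 arrays keeps its data.

    `failed` is an iterable of (array_index, disk_index) pairs.
    """
    failed_per_array = [0] * arrays
    for array_index, _disk in failed:
        failed_per_array[array_index] += 1
    # A RAID-5 array is lost when two or more of its disks have failed.
    lost_arrays = sum(1 for n in failed_per_array if n >= 2)
    # Single-parity RAID-Z across the arrays tolerates one lost array.
    return lost_arrays <= 1


def max_tolerated_failures(arrays, disks_per_array):
    """Largest k such that every possible set of k failed disks is survivable."""
    disks = [(a, d) for a in range(arrays) for d in range(disks_per_array)]
    for k in range(1, len(disks) + 1):
        if not all(survives(combo, arrays) for combo in combinations(disks, k)):
            return k - 1
    return len(disks)


# Worst-case number of whole-disk failures tolerated by 3 arrays of 4 disks.
print(max_tolerated_failures(arrays=3, disks_per_array=4))

# An entire array failing plus one disk in each remaining array.
whole_array_plus_one_each = [(0, d) for d in range(4)] + [(1, 0), (2, 0)]
print(survives(whole_array_plus_one_each, arrays=3))
```

In this model any three failures are survivable, while four failures lose data only when they fall two-and-two on two different arrays.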

Jeff




Robert Milkowski
rmilkowski@task.gda.pl
Re[2]: Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 2:49 AM   in response to: Jeff Bonwick

Hello Jeff,

Saturday, April 1, 2006, 10:05:29 AM, you wrote:

>> It depends. If you want better performance, this might be true (though
>> benchmarks would be in order). If you want better fault tolerance, then
>> it's better to expose them as JBODs and have ZFS deal with them. Then
>> you get the self-healing capabilities of ZFS that you simply cannot get
>> from a hardware RAID solution.

JB> Another option is to get the best of both worlds by letting the
JB> arrays do RAID-5, and then mirroring or RAID-Z-ing the arrays.

JB> A RAID-Z group of RAID-5 arrays can tolerate at least three
JB> whole-disk failures before losing data. It can also tolerate
JB> the failure of an entire array *plus* one whole-disk failure
JB> on each of the remaining arrays. Using RAID-Z (or mirroring)
JB> means that you get self-healing data: if an array returns bad data,
JB> ZFS will detect it and reconstruct good data from the other arrays.

I hadn't considered this one - sounds interesting.
Less storage will be available, but this could still be
interesting.

Thanks.

--
Best regards,
Robert mailto:rmilkowski at task dot gda dot pl
http://milek.blogspot.com




Jeff Bonwick
bonwick@zion.eng.sun...
Re: Re[2]: Proposal: ZFS Hot Spare support
Posted: Apr 1, 2006 12:35 AM   in response to: Robert Milkowski

> I don't know - ZFS was mainly targeted for large systems

Actually, our goal is to run the gamut. I want ZFS not just on
large disk farms, but also on my laptop. Eventually I'd also
like to get ZFS onto iPods and Compact Flash cards, so that a
power outage doesn't mean losing your music or your pictures.

Jeff




darrenr

Posts: 2,060
From:

Registered: 6/8/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 12:06 AM   in response to: Jeff Bonwick

Jeff Bonwick wrote:

>>I don't know - ZFS was mainly targeted for large systems
>>
>>
>
>Actually, our goal is to run the gamut. I want ZFS not just on
>large disk farms, but also on my laptop. Eventually I'd also
>like to get ZFS onto iPods and Compact Flash cards, so that a
>power outage doesn't mean losing your music or your pictures.
>
>

And where power outage includes spontaneous popping out
of said devices from their "holder" too :) I can't remember
how many Amiga floppies I burnt because they weren't always
consistent on disk.

Darren




nadkarni

Posts: 480
From:

Registered: 3/9/05
Re: Re[2]: Proposal: ZFS Hot Spare support
Posted: Apr 10, 2006 3:13 PM   in response to: Robert Milkowski

At one point there was talk of implementing "hot space" rather than hotspares. Is this a precursor to that step ? Or is hot space a different notion ?

-Sanjay

billm

Posts: 91
From: Menlo Park, CA

Registered: 3/9/05
Re: Re: Re[2]: Proposal: ZFS Hot Spare support
Posted: Apr 11, 2006 12:39 PM   in response to: nadkarni

On Mon, Apr 10, 2006 at 03:13:17PM -0700, Sanjay G. Nadkarni wrote:
> At one point there was talk of implementing "hot space" rather than
> hotspares. Is this a precursor to that step ? Or is hot space a
> different notion ?

They serve similar purposes, but are not 100% replacements for each
other. We will still be working on hot space, but it will not be
a short-term project.


--Bill



Joe Little
jmlittle@gmail.com
Re: Proposal: ZFS Hot Spare support
Posted: Mar 31, 2006 7:39 PM   in response to: eschrock

In our case, we are predominantly using iSCSI with multiple raw LUNs
being exposed for RAIDZ. Each backend unit has mostly uniform disk
sizes, but disk sizes differ between units, as disks are purchased
over time and generally chosen to maximize storage space. Thus, we
will likely be seeing a large heterogeneous disk farm that, according
to ZFS best practices, should be in separate, uniform
raidz zpools. So, a spare pool that may fit multiple zpools can come
in handy there.


On 3/31/06, Eric Schrock <eric dot schrock at sun dot com> wrote:
> On Sat, Apr 01, 2006 at 01:59:38AM +0200, Robert Milkowski wrote:
> >
> > I don't know - ZFS was mainly targeted at large systems (I mean that on
> > those systems you will see a big difference with ZFS) and, for example,
> > here we add quite a lot of storage on a regular basis (I won't make just
> > one large pool, rather many small pools), so creating global hot
> > spares at the beginning would be a welcome improvement - the same way
> > we have it on HW arrays.
>
> Why won't you make just one large pool, rather than many small pools?
> The only reason not to do so is:
>
> a. Different performance characteristics
> or
> b. Different fault tolerance characteristics
>
> I can see a server with just two or three pools (one for the root disk,
> one for customer data, etc), but I don't see why you would create lots
> of new pools on a regular basis. Can you explain your use case and
> reasons in a little more detail? "Because we can do it on product X"
> doesn't really help, especially when a HW array is so fundamentally
> different from a ZFS storage pool.
>
> Supposing we were to adopt the idea of "global spares", where would this
> information be stored? What would the zpool(1M) interface look like?
> Could I still do per-pool spares? What would happen when I exported and
> imported a pool? If a spare is swapped in permanently (an asynchronous
> event in the kernel), does it then remove it from the global list of
> spares for subsequent pools? I'm still having trouble envisioning the
> details of how this would actually work...
>
> > btw: I guess hot spares in ZFS won't make it into U2...?
>
> Yes, that is correct.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



ptribble

Posts: 1,575
From: GB

Registered: 4/27/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 12:10 PM   in response to: eschrock

On Sat, 2006-04-01 at 01:11, Eric Schrock wrote:
> Why won't you make just one large pool, rather than many small pools?
> The only reason not to do so is:
>
> a. Different performance characteristics
> or
> b. Different fault tolerance characteristics

Or:

c. Different administrative boundaries.

By which I mean that pools are the unit that is imported and exported.

If different projects (groups - possibly with separate funding) buy
storage, I would expect to align the pools with what they purchased.
That way I can split the storage up later without breaking up the
data.

Or I allocate storage off a SAN. In that case I would want to import
and export pools to move data around on the SAN - ie. between
machines. Say a machine becomes busy, I would want to be able to
export a pool and import it on another machine attached to the SAN
and run the service there.

I'm not sure what the model for global spares is here. I can see that
for a spare local to a pool then when I export the pool I lose the
spare (the spare is physically associated with the pool and should
remain so).

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/





eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 3, 2006 12:34 PM   in response to: ptribble

On Mon, Apr 03, 2006 at 08:10:01PM +0100, Peter Tribble wrote:
>
> Or:
>
> c. Different administrative boundaries.
>
> By which I mean that pools are the unit that is imported and exported.

Yep. This is the use case that Robert pointed out that I had failed to
consider.

> If different projects (groups - possibly with separate funding) buy
> storage, I would expect to align the pools with what they purchased.
> That way I can split the storage up later without breaking up the
> data.
>
> Or I allocate storage off a SAN. In that case I would want to import
> and export pools to move data around on the SAN - ie. between
> machines. Say a machine becomes busy, I would want to be able to
> export a pool and import it on another machine attached to the SAN
> and run the service there.
>
> I'm not sure what the model for global spares is here. I can see that
> for a spare local to a pool then when I export the pool I lose the
> spare (the spare is physically associated with the pool and should
> remain so).

Yep. Global spares are likely per-system, rather than per-pool. For
example, exporting a pool will not touch any globally configured hot
spares. As a result of Robert's suggestion, we'll be examining how to
expose this in an administratively meaningful way in the future. As
usual, the difficulty is all about the administrative interface. The
actual FMA agent that goes off and does the replacement is trivial, and
can get the suggested replacement from anywhere.
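
To make the "from anywhere" point concrete, here is a minimal sketch of
how an agent might choose a replacement, trying spares stored with the
pool first and falling back to system-wide ones. The Pool class, the
pick_spare and on_vdev_fault names, and the in_use bookkeeping are all
hypothetical and are not part of the proposed interfaces; only the
'zpool replace' command form comes from the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Pool:
    name: str
    # Spares stored with the pool itself (hypothetical representation).
    spares: List[str] = field(default_factory=list)


def pick_spare(pool: Pool, global_spares: List[str],
               in_use: Set[str]) -> Optional[str]:
    """Return the first free spare, preferring pool-local over system-wide."""
    for dev in pool.spares + global_spares:
        if dev not in in_use:
            return dev
    return None


def on_vdev_fault(pool: Pool, faulted_dev: str, global_spares: List[str],
                  in_use: Set[str]) -> Optional[List[str]]:
    """Hypothetical fault handler: build the command the agent would run.

    Returns None when no spare is available, in which case nothing is done.
    """
    spare = pick_spare(pool, global_spares, in_use)
    if spare is None:
        return None
    in_use.add(spare)  # this spare is now committed to a replacement
    return ["zpool", "replace", pool.name, faulted_dev, spare]
```

Because the system-wide list is passed in separately from the pool,
exporting the pool in this sketch would take its local spares with it
and leave the global ones untouched.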

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



sommerfe

Posts: 975
From: US

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 1, 2006 6:25 PM   in response to: eschrock

On Thu, 2006-03-30 at 12:03, Eric Schrock wrote:
> C. AUTOMATED REPLACEMENT
>
> In order to perform automated replacement, a ZFS FMA agent will be added
> that subscribes to 'fault.zfs.vdev.*' faults. When a fault is received,
> the agent will examine the pool to see if it has any available hot
> spares. If so, it will perform a 'zpool replace' with an available
> spare.

I've seen automated replacement go bad...

For a while we had an E420R and its connected A5100 JBOD on a UPS.
The UPS battery went bad. We discovered this the hard way when a series
of brownouts caused the UPS to reach into the battery and find nothing
there..

The E420R sailed right through as if nothing had happened (who knows --
maybe proportionally bigger capacitors in the power supply?), but
the A5100 really didn't like this. I believe all the drives took a
little while to reset and spin back up.

In the meantime, SVM concluded that a bunch of drives in the array had
gone bad, and decided to replace as many as it had hot spares. Once the
array came all the way back on line, mirroring to the replacements
started..

In reality, all the drives were fine; it just took the better part of a
day to unwind all the premature replacements.

Not quite sure what heuristics you'd use to avoid this sort of thing,
though....

- Bill




eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: Proposal: ZFS Hot Spare support
Posted: Apr 1, 2006 6:59 PM   in response to: sommerfe

On Sat, Apr 01, 2006 at 09:25:55PM -0500, Bill Sommerfeld wrote:
>
> I've seen automated replacement go bad...
>

Well, this is certainly what would happen with the current bits. The
good news is that this is all done through FMA by subscribing to the
fault.fs.zfs.vdev.* fault. In the future, as we make the diagnosis
engine smarter, this hotplug support will be able to automatically
leverage whatever we come up with.

I don't know what the "right answer" is in the case you described, but
we'll certainly be gathering lots of data (via FMA error/fault logs) as
well as hooking into SMART and the rest of the I/O subsystem to make
more intelligent diagnosis in the future. I've got some stuff scoped
out for the next phase (SERD on I/O and checksum errors) as well as the
next advancements beyond that (processing SMART data and subscribing to
hotplug events). Expect to see more info soon.
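
For readers unfamiliar with SERD, the idea is to declare a fault only
when N error events arrive within a time window T, rather than on the
first error. The sketch below is an illustration of that idea only: the
SerdEngine class, its parameters, and the reset-after-firing behaviour
are assumptions made for this example and are not the fmd API or the
planned ZFS diagnosis engine.

```python
from collections import deque


class SerdEngine:
    """Fires when at least n events fall within a window of t seconds."""

    def __init__(self, n: int, t: float):
        self.n = n
        self.t = t
        self.events = deque()  # timestamps of recent events, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one error event; return True if the threshold is reached."""
        self.events.append(timestamp)
        # Discard events older than the window ending at this timestamp.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        if len(self.events) >= self.n:
            self.events.clear()  # assumed behaviour: start over after firing
            return True
        return False
```

With one engine per device for I/O errors and another for checksum
errors, an isolated error would not by itself trigger a replacement,
although the right values of N and T remain an open tuning question.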

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock



shawga

Posts: 102
From: Louisville, CO

Registered: 3/12/06
Re: Proposal: ZFS Hot Spare support
Posted: Apr 1, 2006 7:33 PM   in response to: sommerfe

Veritas VM has a flag for this. If you set this on the disk volumes, it
won't try to use them as reallocation targets.

I found this out the hard way. We were mirroring on 9176 (early precursor
to the D178) between datacenters and between two arrays. When one array
went away in a power failure, it 'mirrored' everything to the same
array.

Performance on the box went away in a hurry.

When mirroring, we'll need to have a similar flag, as it gets even more
interesting when you've got high-performance disk (database storage) and
low-performance disk (used for DB exports). Mix up the mirroring there,
and things will get ugly. I would assume that you'd put different tiers
of storage into different pools to reduce the chance of this happening,
but it's still a possibility.
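
As a rough sketch of what such a flag could mean in practice, the
filter below rejects replacement candidates that are flagged off, that
sit in a different performance tier, or that live on the same array as
the surviving half of the mirror. The Disk fields (array, tier,
no_spare) and the eligible_targets function are hypothetical names for
this example; they are not Veritas options or ZFS properties.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    name: str
    array: str              # which physical array the disk lives in
    tier: str               # e.g. "fast" for database storage, "slow" for exports
    no_spare: bool = False  # hypothetical "never use as a target" flag


def eligible_targets(failed: Disk, surviving_mirror: Disk,
                     candidates: List[Disk]) -> List[Disk]:
    """Keep only candidates that are safe to use as a replacement target."""
    return [
        d for d in candidates
        if not d.no_spare                      # administrator opted this disk out
        and d.tier == failed.tier              # do not mix performance tiers
        and d.array != surviving_mirror.array  # do not mirror onto the same array
    ]
```

In the power-failure case above, every candidate on the surviving array
would be filtered out, so nothing would be "mirrored" onto the same
array.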

On Sat, 2006-04-01 at 21:25 -0500, Bill Sommerfeld wrote:
> On Thu, 2006-03-30 at 12:03, Eric Schrock wrote:
> > C. AUTOMATED REPLACEMENT
> >
> > In order to perform automated replacement, a ZFS FMA agent will be added
> > that subscribes to 'fault.zfs.vdev.*' faults. When a fault is received,
> > the agent will examine the pool to see if it has any available hot
> > spares. If so, it will perform a 'zpool replace' with an available
> > spare.
>
> I've seen automated replacement go bad...
>
> For a while we had an E420R and its connected A5100 JBOD on a UPS.
> The UPS battery went bad. We discovered this the hard way when a series
> of brownouts caused the UPS to reach into the battery and find nothing
> there..
>
> The E420R sailed right through as if nothing had happened (who knows --
> maybe proportionally bigger capacitors in the power supply?), but
> the A5100 really didn't like this. I believe all the drives took a
> little while to reset and spin back up.
>
> In the meantime, SVM concluded that a bunch of drives in the array had
> gone bad, and decided to replace as many as it had hot spares. Once the
> array came all the way back on line, mirroring to the replacements
> started..
>
> In reality, all the drives were fine; it just took the better part of a
> day to unwind all the premature replacements.
>
> Not quite sure what heuristics you'd use to avoid this sort of thing,
> though....
>
> - Bill







Copyright © 1995-2005 Sun Microsystems, Inc.