|
Replies:
4
-
Last Post:
Oct 2, 2008 7:23 AM
by: batschul
|
Threads:
[
Previous
|
Next
]
|
|
Posts:
453
From:
US
Registered:
12/21/05
|
|
|
|
[nfs-discuss] Code reviewers wanted for 6751438 mirror mounted
mountpoints panic when umounted
Posted:
Oct 1, 2008 5:56 PM
|
|
See http://blogs.sun.com/tdh/entry/debugging_a_refcnt and http://blogs.sun.com/tdh/entry/more_on_that_refcnt_debugging for more than you want to know about this bug.
Basically, the last couple of paragraphs of the second article explain it all:
> Okay, I took my first advice to add more comments and I have a new > piece of code to question in nfs4_ephemeral_umount > <http://: > > 1763 int is_recursed = FALSE; > 1764 int was_locked = FALSE; > ... > 1770 *pmust_unlock = FALSE; > ... > 1817 if (!is_recursed) { > ... > 1845 was_locked = TRUE; > 1846 } else { > 1847 net->net_refcnt++; > 1848 ASSERT(net->net_refcnt != 0); > ... > 1859 if (was_locked == FALSE) { > ... > 1866 if (mutex_tryenter(&net->net_tree_lock)) { > 1867 *pmust_unlock = TRUE; > 1868 } else { > > > So, if *!is_recursed*, then we can end up on line 1847 to bump the > refcnt. At this point, **pmust_unlock* is FALSE and *was_locked* is > also FALSE. If we grab the *net->net_tree_lock* right away on 1866, > then crud, it is okay. > > But the other case is not. Assume that *is_recursed* is TRUE. That > means we do not bump the refcnt and at 1859, the following values > hold: *was_locked* is FALSE and **pmust_unlock* is FALSE. Now if we > grab the mutex right away, we end up with **pmust_unlock* as TRUE. In > my examination of the code, that then implies we *must* drop the > refcnt at some exit point. But we just saw that we never grabbed it in > this scenario. > > This would account for us trying to rele a refcnt we never held. We > return TRUE to nfs4_free_mount > <http:// and it > sends this into nfs4_ephemeral_umount_activate > <http:// > and thus into nfs4_ephemeral_umount_unlock > <http://. > > I'm going to introduce more state in a *pmust_rele to keep track of > whether I need to rele or not. > > ------------------------------------------------------------------------
Testing has shown that we were dropping the refcnt when we had never held it in the first place.
I solve the problem by passing in a boolean to keep track of whether or not we held the lock and thus we need to rele it.
As shown in the blog entries, I did some other changes as well:
> 1. Introduced a incr function to make sure that all bumps are correct: > 291 static void > 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net) > 293 { > 294 ASSERT(mutex_owned(&net->net_cnt_lock)); > 295 net->net_refcnt++; > 296 ASSERT(net->net_refcnt != 0); > 297 } > 298 > 299 static void > 300 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net) > 301 { > 302 mutex_enter(&net->net_cnt_lock); > 303 nfs4_ephemeral_tree_incr(net); > 304 mutex_exit(&net->net_cnt_lock); > 305 } > ... > 1859 } else { > 1860 nfs4_ephemeral_tree_incr(net); > 1861 } > > 2. Fixed up the one locking issue: > 734 } else { > 735 net = mi->mi_ephemeral_tree; > 736 nfs4_ephemeral_tree_hold(net); > 737 > 738 mutex_exit(&mi->mi_lock); > > 3. And I still have the change which I thought would fix the bug: > 2006 if (was_locked == FALSE) > 2007 mutex_exit(&net->net_tree_lock); > >
The webrev can be found here:
http://cr.opensolaris.org/~tdh/6751438/ _______________________________________________ nfs-discuss mailing list nfs-discuss at opensolaris dot org
|
|
|
Posts:
254
From:
DE
Registered:
3/9/05
|
|
|
|
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted
mountpoints panic when umounted
Posted:
Oct 2, 2008 3:49 AM
in response to: tdh
|
|
On Thu, 02 Oct 2008 02:56:46 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote:
> The webrev can be found here: > http://cr.opensolaris.org/~tdh/6751438/
looks sufficiently sane to me. comments though...
292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net) 293 { 294 ASSERT(mutex_owned(&net->net_cnt_lock)); 295 net->net_refcnt++; 296 ASSERT(net->net_refcnt != 0);
312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net) 313 { 314 ASSERT(mutex_owned(&net->net_cnt_lock)); 315 ASSERT(net->net_refcnt != 0); 316 net->net_refcnt--;
to me, it looks lile we never ever just want to continue if we encounter the "net_refcnt != 0" case, debug or not, thus a VERIFY() seems to be more appropriate here.
PS: fwiw, I usually consider code that drops locks that have been acquired somewhere else 'net_tree_lock' bad practice and not "safely" ;-) even more if the "go/nogo" decision is being parsed thru multiple layers here with 'pmust_unlock'
1706 * Common code to safely release net_cnt_lock and net_tree_lock 1707 */ 1708 void 1709 nfs4_ephemeral_umount_unlock(bool_t *pmust_unlock, 1710 bool_t *pmust_rele, nfs4_ephemeral_tree_t **pnet) 1711 { 1712 nfs4_ephemeral_tree_t *net = *pnet; 1713 1714 if (*pmust_unlock) { 1715 mutex_enter(&net->net_cnt_lock); 1716 net->net_status &= ~NFS4_EPHEMERAL_TREE_UMOUNTING; 1717 if (*pmust_rele) 1718 nfs4_ephemeral_tree_decr(net); 1719 mutex_exit(&net->net_cnt_lock); 1720 1721 mutex_exit(&net->net_tree_lock); 1722 1723 *pmust_unlock = FALSE; 1724 } 1725 }
--- frankB _______________________________________________ nfs-discuss mailing list nfs-discuss at opensolaris dot org
|
|
|
|
Posts:
453
From:
US
Registered:
12/21/05
|
|
|
|
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted
mountpoints panic when umounted
Posted:
Oct 2, 2008 7:11 AM
in response to: batschul
|
|
Frank Batschulat (Home) wrote: > On Thu, 02 Oct 2008 02:56:46 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote: > > >> The webrev can be found here: >> http://cr.opensolaris.org/~tdh/6751438/ >> > > looks sufficiently sane to me. > comments though... > > 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net) > 293 { > 294 ASSERT(mutex_owned(&net->net_cnt_lock)); > 295 net->net_refcnt++; > 296 ASSERT(net->net_refcnt != 0); > > 312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net) > 313 { > 314 ASSERT(mutex_owned(&net->net_cnt_lock)); > 315 ASSERT(net->net_refcnt != 0); > 316 net->net_refcnt--; > > to me, it looks lile we never ever just want to continue if > we encounter the "net_refcnt != 0" case, debug or not, thus > a VERIFY() seems to be more appropriate here. > >
I was taught not to use VERIFYs as it caused customers problems. Yes, I know they still have problems and we are maiking it harder to observe.
And while this follows common practice in using refcnts in the NFS code, perhaps to aid in observability we could add some dtrace probes?
Or is this something we could easily do in a dtrace script? I could envision a script that would capture it.
> PS: fwiw, I usually consider code that drops locks that have > been acquired somewhere else 'net_tree_lock' bad practice and > not "safely" ;-) even more if the "go/nogo" decision is being > parsed thru multiple layers here with 'pmust_unlock' > >
The alternative is to pull code from the v4 unmounting down into this code.
As that code is dependent on whether it is a forced unmount or not, it makes it harder to abstract. But that could be done.
But I didn't want to pull non-mirror mount code down into the mirror mount code. I.e., I didn't want to inadvertently hide it because it was put where not expected.
> 1706 * Common code to safely release net_cnt_lock and net_tree_lock > 1707 */ > 1708 void > 1709 nfs4_ephemeral_umount_unlock(bool_t *pmust_unlock, > 1710 bool_t *pmust_rele, nfs4_ephemeral_tree_t **pnet) > 1711 { > 1712 nfs4_ephemeral_tree_t *net = *pnet; > 1713 > 1714 if (*pmust_unlock) { > 1715 mutex_enter(&net->net_cnt_lock); > 1716 net->net_status &= ~NFS4_EPHEMERAL_TREE_UMOUNTING; > 1717 if (*pmust_rele) > 1718 nfs4_ephemeral_tree_decr(net); > 1719 mutex_exit(&net->net_cnt_lock); > 1720 > 1721 mutex_exit(&net->net_tree_lock); > 1722 > 1723 *pmust_unlock = FALSE; > 1724 } > 1725 } > > --- > frankB >
_______________________________________________ nfs-discuss mailing list nfs-discuss at opensolaris dot org
|
|
|
|
Posts:
254
From:
DE
Registered:
3/9/05
|
|
|
|
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted
mountpoints panic when umounted
Posted:
Oct 2, 2008 7:21 AM
in response to: tdh
|
|
On Thu, 02 Oct 2008 16:11:01 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote:
>> 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net) >> 293 { >> 294 ASSERT(mutex_owned(&net->net_cnt_lock)); >> 295 net->net_refcnt++; >> 296 ASSERT(net->net_refcnt != 0); >> >> 312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net) >> 313 { >> 314 ASSERT(mutex_owned(&net->net_cnt_lock)); >> 315 ASSERT(net->net_refcnt != 0); >> 316 net->net_refcnt--; >> >> to me, it looks lile we never ever just want to continue if >> we encounter the "net_refcnt != 0" case, debug or not, thus >> a VERIFY() seems to be more appropriate here. > > I was taught not to use VERIFYs as it caused customers problems.
Outch ! really ? man, we got a problem already ;-)
http://grok.czech.sun.com:8080/source/search?q=&defs=&refs=VERIFY&path=&hist=&project=%2Fonnv-clone
nah, honestly I'd be interested to know about that, really...
>> PS: fwiw, I usually consider code that drops locks that have >> been acquired somewhere else 'net_tree_lock' bad practice and >> not "safely" ;-) even more if the "go/nogo" decision is being >> parsed thru multiple layers here with 'pmust_unlock' > > The alternative is to pull code from the v4 unmounting down into this code. > > As that code is dependent on whether it is a forced unmount or not, it makes > it harder to abstract. But that could be done. > > But I didn't want to pull non-mirror mount code down into the mirror mount > code. I.e., I didn't want to inadvertently hide it because it was put where > not expected.
quite, wasn't a request for this fix as this was not new code that had been added but already existing one...
cheers frankB _______________________________________________ nfs-discuss mailing list nfs-discuss at opensolaris dot org
|
|
|
|
Posts:
254
From:
DE
Registered:
3/9/05
|
|
|
|
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted
mountpoints panic when umounted
Posted:
Oct 2, 2008 7:23 AM
in response to: batschul
|
|
On Thu, 02 Oct 2008 16:21:04 +0200, Frank Batschulat (Home) <Frank dot Batschulat at Sun dot COM> wrote:
> Outch ! really ? man, we got a problem already ;-) > for the outside observer this should have been of course:
http://src.opensolaris.org/source/search?q=&defs=&refs=VERIFY&path=&hist=&project=%2Fonnv
--- frankB _______________________________________________ nfs-discuss mailing list nfs-discuss at opensolaris dot org
|
|
|
|
|