OpenSolaris

Discussions Communities Projects Download Source Browser

Home » OpenSolaris Forums » nfs » discuss

Thread: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted

Welcome, Guest Help
Login Login
Guest Settings Guest Settings
Reply to this Thread Reply to this Thread Search Forum Search Forum Back to Thread List Back to Thread List

Permlink Replies: 4 - Last Post: Oct 2, 2008 7:23 AM by: batschul Threads: [ Previous | Next ]
tdh

Posts: 453
From: US

Registered: 12/21/05
[nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted
Posted: Oct 1, 2008 5:56 PM

  Click to reply to this thread Reply

See http://blogs.sun.com/tdh/entry/debugging_a_refcnt and
http://blogs.sun.com/tdh/entry/more_on_that_refcnt_debugging
for more than you want to know about this bug.

Basically, the last couple of paragraphs of the second article explain
it all:

> Okay, I took my first advice to add more comments and I have a new
> piece of code to question in nfs4_ephemeral_umount
> <http://:
>
> 1763 int is_recursed = FALSE;
> 1764 int was_locked = FALSE;
> ...
> 1770 *pmust_unlock = FALSE;
> ...
> 1817 if (!is_recursed) {
> ...
> 1845 was_locked = TRUE;
> 1846 } else {
> 1847 net->net_refcnt++;
> 1848 ASSERT(net->net_refcnt != 0);
> ...
> 1859 if (was_locked == FALSE) {
> ...
> 1866 if (mutex_tryenter(&net->net_tree_lock)) {
> 1867 *pmust_unlock = TRUE;
> 1868 } else {
>
>
> So, if *!is_recursed*, then we can end up on line 1847 to bump the
> refcnt. At this point, **pmust_unlock* is FALSE and *was_locked* is
> also FALSE. If we grab the *net->net_tree_lock* right away on 1866,
> then crud, it is okay.
>
> But the other case is not. Assume that *is_recursed* is TRUE. That
> means we do not bump the refcnt and at 1859, the following values
> hold: *was_locked* is FALSE and **pmust_unlock* is FALSE. Now if we
> grab the mutex right away, we end up with **pmust_unlock* as TRUE. In
> my examination of the code, that then implies we *must* drop the
> refcnt at some exit point. But we just saw that we never grabbed it in
> this scenario.
>
> This would account for us trying to rele a refcnt we never held. We
> return TRUE to nfs4_free_mount
> <http:// and it
> sends this into nfs4_ephemeral_umount_activate
> <http://
> and thus into nfs4_ephemeral_umount_unlock
> <http://.
>
> I'm going to introduce more state in a *pmust_rele to keep track of
> whether I need to rele or not.
>
> ------------------------------------------------------------------------

Testing has shown that we were dropping the refcnt when we had never
held it in the first place.

I solve the problem by passing in a boolean to keep track of whether or
not we held the
lock and thus we need to rele it.

As shown in the blog entries, I did some other changes as well:

> 1. Introduced a incr function to make sure that all bumps are correct:
> 291 static void
> 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
> 293 {
> 294 ASSERT(mutex_owned(&net->net_cnt_lock));
> 295 net->net_refcnt++;
> 296 ASSERT(net->net_refcnt != 0);
> 297 }
> 298
> 299 static void
> 300 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
> 301 {
> 302 mutex_enter(&net->net_cnt_lock);
> 303 nfs4_ephemeral_tree_incr(net);
> 304 mutex_exit(&net->net_cnt_lock);
> 305 }
> ...
> 1859 } else {
> 1860 nfs4_ephemeral_tree_incr(net);
> 1861 }
>
> 2. Fixed up the one locking issue:
> 734 } else {
> 735 net = mi->mi_ephemeral_tree;
> 736 nfs4_ephemeral_tree_hold(net);
> 737
> 738 mutex_exit(&mi->mi_lock);
>
> 3. And I still have the change which I thought would fix the bug:
> 2006 if (was_locked == FALSE)
> 2007 mutex_exit(&net->net_tree_lock);
>
>

The webrev can be found here:

http://cr.opensolaris.org/~tdh/6751438/
_______________________________________________
nfs-discuss mailing list
nfs-discuss at opensolaris dot org


batschul

Posts: 254
From: DE

Registered: 3/9/05
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted
Posted: Oct 2, 2008 3:49 AM   in response to: tdh

  Click to reply to this thread Reply

On Thu, 02 Oct 2008 02:56:46 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote:

> The webrev can be found here:
> http://cr.opensolaris.org/~tdh/6751438/

looks sufficiently sane to me.
comments though...

292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
293 {
294 ASSERT(mutex_owned(&net->net_cnt_lock));
295 net->net_refcnt++;
296 ASSERT(net->net_refcnt != 0);

312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
313 {
314 ASSERT(mutex_owned(&net->net_cnt_lock));
315 ASSERT(net->net_refcnt != 0);
316 net->net_refcnt--;

to me, it looks lile we never ever just want to continue if
we encounter the "net_refcnt != 0" case, debug or not, thus
a VERIFY() seems to be more appropriate here.

PS: fwiw, I usually consider code that drops locks that have
been acquired somewhere else 'net_tree_lock' bad practice and
not "safely" ;-) even more if the "go/nogo" decision is being
parsed thru multiple layers here with 'pmust_unlock'

1706 * Common code to safely release net_cnt_lock and net_tree_lock
1707 */
1708 void
1709 nfs4_ephemeral_umount_unlock(bool_t *pmust_unlock,
1710 bool_t *pmust_rele, nfs4_ephemeral_tree_t **pnet)
1711 {
1712 nfs4_ephemeral_tree_t *net = *pnet;
1713
1714 if (*pmust_unlock) {
1715 mutex_enter(&net->net_cnt_lock);
1716 net->net_status &= ~NFS4_EPHEMERAL_TREE_UMOUNTING;
1717 if (*pmust_rele)
1718 nfs4_ephemeral_tree_decr(net);
1719 mutex_exit(&net->net_cnt_lock);
1720
1721 mutex_exit(&net->net_tree_lock);
1722
1723 *pmust_unlock = FALSE;
1724 }
1725 }

---
frankB
_______________________________________________
nfs-discuss mailing list
nfs-discuss at opensolaris dot org


tdh

Posts: 453
From: US

Registered: 12/21/05
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted
Posted: Oct 2, 2008 7:11 AM   in response to: batschul

  Click to reply to this thread Reply

Frank Batschulat (Home) wrote:
> On Thu, 02 Oct 2008 02:56:46 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote:
>
>
>> The webrev can be found here:
>> http://cr.opensolaris.org/~tdh/6751438/
>>
>
> looks sufficiently sane to me.
> comments though...
>
> 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
> 293 {
> 294 ASSERT(mutex_owned(&net->net_cnt_lock));
> 295 net->net_refcnt++;
> 296 ASSERT(net->net_refcnt != 0);
>
> 312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
> 313 {
> 314 ASSERT(mutex_owned(&net->net_cnt_lock));
> 315 ASSERT(net->net_refcnt != 0);
> 316 net->net_refcnt--;
>
> to me, it looks lile we never ever just want to continue if
> we encounter the "net_refcnt != 0" case, debug or not, thus
> a VERIFY() seems to be more appropriate here.
>
>


I was taught not to use VERIFYs as it caused customers problems.
Yes, I know they still have problems and we are maiking it harder
to observe.

And while this follows common practice in using refcnts in the
NFS code, perhaps to aid in observability we could add some
dtrace probes?

Or is this something we could easily do in a dtrace script?
I could envision a script that would capture it.




> PS: fwiw, I usually consider code that drops locks that have
> been acquired somewhere else 'net_tree_lock' bad practice and
> not "safely" ;-) even more if the "go/nogo" decision is being
> parsed thru multiple layers here with 'pmust_unlock'
>
>

The alternative is to pull code from the v4 unmounting down into this code.

As that code is dependent on whether it is a forced unmount or not, it makes
it harder to abstract. But that could be done.

But I didn't want to pull non-mirror mount code down into the mirror mount
code. I.e., I didn't want to inadvertently hide it because it was put where
not expected.



> 1706 * Common code to safely release net_cnt_lock and net_tree_lock
> 1707 */
> 1708 void
> 1709 nfs4_ephemeral_umount_unlock(bool_t *pmust_unlock,
> 1710 bool_t *pmust_rele, nfs4_ephemeral_tree_t **pnet)
> 1711 {
> 1712 nfs4_ephemeral_tree_t *net = *pnet;
> 1713
> 1714 if (*pmust_unlock) {
> 1715 mutex_enter(&net->net_cnt_lock);
> 1716 net->net_status &= ~NFS4_EPHEMERAL_TREE_UMOUNTING;
> 1717 if (*pmust_rele)
> 1718 nfs4_ephemeral_tree_decr(net);
> 1719 mutex_exit(&net->net_cnt_lock);
> 1720
> 1721 mutex_exit(&net->net_tree_lock);
> 1722
> 1723 *pmust_unlock = FALSE;
> 1724 }
> 1725 }
>
> ---
> frankB
>

_______________________________________________
nfs-discuss mailing list
nfs-discuss at opensolaris dot org


batschul

Posts: 254
From: DE

Registered: 3/9/05
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted
Posted: Oct 2, 2008 7:21 AM   in response to: tdh

  Click to reply to this thread Reply

On Thu, 02 Oct 2008 16:11:01 +0200, Tom Haynes <Thomas dot Haynes at Sun dot COM> wrote:

>> 292 nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t *net)
>> 293 {
>> 294 ASSERT(mutex_owned(&net->net_cnt_lock));
>> 295 net->net_refcnt++;
>> 296 ASSERT(net->net_refcnt != 0);
>>
>> 312 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
>> 313 {
>> 314 ASSERT(mutex_owned(&net->net_cnt_lock));
>> 315 ASSERT(net->net_refcnt != 0);
>> 316 net->net_refcnt--;
>>
>> to me, it looks lile we never ever just want to continue if
>> we encounter the "net_refcnt != 0" case, debug or not, thus
>> a VERIFY() seems to be more appropriate here.
>
> I was taught not to use VERIFYs as it caused customers problems.

Outch ! really ? man, we got a problem already ;-)

http://grok.czech.sun.com:8080/source/search?q=&defs=&refs=VERIFY&path=&hist=&project=%2Fonnv-clone

nah, honestly I'd be interested to know about that, really...

>> PS: fwiw, I usually consider code that drops locks that have
>> been acquired somewhere else 'net_tree_lock' bad practice and
>> not "safely" ;-) even more if the "go/nogo" decision is being
>> parsed thru multiple layers here with 'pmust_unlock'
>
> The alternative is to pull code from the v4 unmounting down into this code.
>
> As that code is dependent on whether it is a forced unmount or not, it makes
> it harder to abstract. But that could be done.
>
> But I didn't want to pull non-mirror mount code down into the mirror mount
> code. I.e., I didn't want to inadvertently hide it because it was put where
> not expected.

quite, wasn't a request for this fix as this was not new code that had been
added but already existing one...

cheers
frankB
_______________________________________________
nfs-discuss mailing list
nfs-discuss at opensolaris dot org


batschul

Posts: 254
From: DE

Registered: 3/9/05
Re: [nfs-discuss] Code reviewers wanted for 6751438 mirror mounted mountpoints panic when umounted
Posted: Oct 2, 2008 7:23 AM   in response to: batschul

  Click to reply to this thread Reply

On Thu, 02 Oct 2008 16:21:04 +0200, Frank Batschulat (Home) <Frank dot Batschulat at Sun dot COM> wrote:

> Outch ! really ? man, we got a problem already ;-)
>
for the outside observer this should have been of course:

http://src.opensolaris.org/source/search?q=&defs=&refs=VERIFY&path=&hist=&project=%2Fonnv

---
frankB
_______________________________________________
nfs-discuss mailing list
nfs-discuss at opensolaris dot org





Terms of Use | Privacy | Trademarks | Copyright Policy | Site Guidelines
Your use of this web site or any of its content or software indicates your agreement to be bound by these Terms of Use.
Copyright © 1995-2005 Sun Microsystems, Inc.