OpenSolaris

Discussions Communities Projects Download Source Browser

Home » OpenSolaris Forums » dtrace » discuss

Thread: Race when removing user static providers

Welcome, Guest Help
Login Login
Guest Settings Guest Settings
Reply to this Thread Reply to this Thread Search Forum Search Forum Back to Thread List Back to Thread List

Permlink Replies: 6 - Last Post: Aug 1, 2007 7:06 PM by: James McIlree
James McIlree
jmcilree@apple.com
Race when removing user static providers
Posted: Aug 1, 2007 1:03 AM

  Click to reply to this thread Reply

I think I've found a race when user static providers are
being removed. What I am observing is that under certain stress
tests, fasttrap meta providers are not being cleaned up. The process
exits, all dtrace consumers exit, but a later dtrace -l still
shows probes from that provider.

I've got a theory and a fix that seems to work so far...

What I think is happening is this:

You've got a dtrace consumer matching probes named: "test*:::"

Process A has a "test" user static provider. They've been enabled.
Process A decides to exit. This means that dtrace_helpers_cleanup()
is called, which synchronously marks the provider as "marked" and
"retired", and schedules an asynchronous cleanup for later.

Very shortly afterwards, the fasttrap exit cleanup is called, and
starts disabling probes.

Suppose, though, that you're on a MP machine, and the asynchronous
cleanup was running on another cpu. It could be already active due
to an earlier exit, or because it immediately got an otherwise idle
cpu.

In the fasttrap cleanup code, we've got these checks:


> /*
> * If this provider has consumers actively
> * creating probes (ftp_ccount) or is a USDT
> * provider (ftp_mcount), we can't unregister
> * or even condense.
> */
> if (fp->ftp_ccount != 0 ||
> fp->ftp_mcount != 0) {
> lck_mtx_unlock(&fp->ftp_mtx);
> fp->ftp_marked = 0;
> continue;
> }
>
> if (!fp->ftp_retired || fp->ftp_rcount != 0)
> fp->ftp_marked = 0;


The first thing I notice is that the ftp_marked value is set outside
the lock in one test. Seems a bit risky :-), but I don't think that is
the problem.

If we arrive at this code section while the meta provider has been
"marked", but not all its probes have been disabled, fp->ftp_rcount
will be
non-zero. The "mark" is cleared. However, there is never again a set
of the
"mark", and so this provider becomes a zombie.

I'm thinking the fix is to detect if the provider is a meta provider,
and then simply punt to later if ftp_rcount is non-zero. Something
like this:

> if (fp->ftp_rcount != 0 && fp->ftp_provider_type ==
> DTFTP_PROVIDER_USDT) {
> later++;
> fpp = &fp->ftp_next;
> lck_mtx_unlock(&fp->ftp_mtx);
> continue;
> }

James M


_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


Adam Leventhal
ahl@eng.sun.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 2:18 PM   in response to: James McIlree

  Click to reply to this thread Reply

Hi James,

Can you describe the sequence of events that could lead to this in a little
more detail? I'm not sure I agree with all of your analysis.

> If we arrive at this code section while the meta provider has been
> "marked", but not all its probes have been disabled, fp->ftp_rcount will be
> non-zero. The "mark" is cleared. However, there is never again a set of the
> "mark", and so this provider becomes a zombie.

Since we've set ftp_retired on the provider, disabling probes will cause us
to attempt to remove the provider:

ASSERT(provider->ftp_rcount > 0);
provider->ftp_rcount--;

if (p != NULL) {
/*
* Even though we may not be able to remove it entirely, we
* mark this retired provider to get a chance to remove some
* of the associated probes.
*/
if (provider->ftp_retired && !provider->ftp_marked)
whack = provider->ftp_marked = 1;
mutex_exit(&provider->ftp_mtx);

mutex_enter(&p->p_lock);
sprunlock(p);
} else {
/*
* If the process is dead, we're just waiting for the
* last probe to be disabled to be able to free it.
*/
if (provider->ftp_rcount == 0 && !provider->ftp_marked)
whack = provider->ftp_marked = 1;
mutex_exit(&provider->ftp_mtx);
}

if (whack)
fasttrap_pid_cleanup();

> Suppose, though, that you're on a MP machine, and the asynchronous
> cleanup was running on another cpu. It could be already active due
> to an earlier exit, or because it immediately got an otherwise idle
> cpu.

That's true, but the provider lock -- and the bucket lock -- will protect
against simultaneous access won't they?

It's not a surprise that your fix works since it will just cause the cleanup
pass to repeat until all USDT provider that were ever marked were gone. The
code is meant to wait until frp_rcount drops to zero. Your modification
addresses the issue, but the underlying problem still seems to remain.

In theory at least, it seems that the modification of ftp_marked outside of
the protection of the provider's lock:

thread 1 thread 2
fasttrap_pid_cleanup_cb()
...
ftp_mcount != 0
drop ftp_mtx
fasttrap_provider_retire() .
take ftp_mtx
ftp_retired = 1; .
ftp_marked = 1;
drop ftp_mtx .
fasttrap_pid_disable()
take ftp_mtx .
ftp_rcount = 0;
ftp_marked = 1; .
drop ftp_mtx
ftp_marked = 0;

I think this is possible, but it's pretty unlikely on Solaris (but could
happen inside of VMware or Xen for example).

Can you describe another scenario whereby we could miss the chance to
destroy the provider when the last probe is disabled? Are you operating
in an environment where a thread could effectively be off CPU for a
while? Can you reproduce the problem?

In any case, both unprotected uses of ftp_marked in fasttrap_pid_cleanup_cb()
should probably be corrected.

Adam

On Wed, Aug 01, 2007 at 01:03:54AM -0700, James McIlree wrote:
> I think I've found a race when user static providers are
> being removed. What I am observing is that under certain stress
> tests, fasttrap meta providers are not being cleaned up. The process
> exits, all dtrace consumers exit, but a later dtrace -l still
> shows probes from that provider.
>
> I've got a theory and a fix that seems to work so far...
>
> What I think is happening is this:
>
> You've got a dtrace consumer matching probes named: "test*:::"
>
> Process A has a "test" user static provider. They've been enabled.
> Process A decides to exit. This means that dtrace_helpers_cleanup()
> is called, which synchronously marks the provider as "marked" and
> "retired", and schedules an asynchronous cleanup for later.
>
> Very shortly afterwards, the fasttrap exit cleanup is called, and
> starts disabling probes.
>
> Suppose, though, that you're on a MP machine, and the asynchronous
> cleanup was running on another cpu. It could be already active due
> to an earlier exit, or because it immediately got an otherwise idle
> cpu.
>
> In the fasttrap cleanup code, we've got these checks:
>
>
> > /*
> > * If this provider has consumers actively
> > * creating probes (ftp_ccount) or is a USDT
> > * provider (ftp_mcount), we can't unregister
> > * or even condense.
> > */
> > if (fp->ftp_ccount != 0 ||
> > fp->ftp_mcount != 0) {
> > lck_mtx_unlock(&fp->ftp_mtx);
> > fp->ftp_marked = 0;
> > continue;
> > }
> >
> > if (!fp->ftp_retired || fp->ftp_rcount != 0)
> > fp->ftp_marked = 0;
>
>
> The first thing I notice is that the ftp_marked value is set outside
> the lock in one test. Seems a bit risky :-), but I don't think that is
> the problem.
>
> If we arrive at this code section while the meta provider has been
> "marked", but not all its probes have been disabled, fp->ftp_rcount
> will be
> non-zero. The "mark" is cleared. However, there is never again a set
> of the
> "mark", and so this provider becomes a zombie.
>
> I'm thinking the fix is to detect if the provider is a meta provider,
> and then simply punt to later if ftp_rcount is non-zero. Something
> like this:
>
> > if (fp->ftp_rcount != 0 && fp->ftp_provider_type ==
> > DTFTP_PROVIDER_USDT) {
> > later++;
> > fpp = &fp->ftp_next;
> > lck_mtx_unlock(&fp->ftp_mtx);
> > continue;
> > }
>
> James M
>
>
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris dot org

--
Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


James McIlree
jmcilree@apple.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 5:07 PM   in response to: Adam Leventhal

  Click to reply to this thread Reply


On Aug 1, 2007, at 2:18 PM, Adam Leventhal wrote:

> Hi James,
>
> Can you describe the sequence of events that could lead to this in a
> little
> more detail? I'm not sure I agree with all of your analysis.

Drat, I missed the remark in fasttrap_pid_disable. Kind of shoots a
big hole in my theory :-).

>
> Can you describe another scenario whereby we could miss the chance to
> destroy the provider when the last probe is disabled? Are you
> operating
> in an environment where a thread could effectively be off CPU for a
> while? Can you reproduce the problem?

I'll look at it further, and see what I can come up with.

Let me report my observed data, and maybe you can spot the real
race.

The problem is 100% reproducible. Of course, this is MP OS X, and
I don't have a MP Solaris install to test against. SP Solaris does not
show the problem. I really hope its a shared problem, and not some
subtle
difference in OS behavior :-).

A test program with ~10 user static probes sits in a loop,
doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast
as it can. When it execs, it execs a copy of itself (with probes)
that immediately exits.

In another shell, I run "dtrace -n "test*:::", let it observe
a few probe firings, and then I ctrl-c. Do this ~5 - 10 times.

Stop the test process, and do a dtrace -l. You should observe
USDT probes that belong to processes that have exited. They're zombies.

At one point, I walked the provider hash tables by hand in
the debugger. When I reached a zombie, it's ftp_marked field was unset,
and retired was set. I focused in on the ftp_marked field and have
forgotten
what the rcount was.

As you suspected, my proposed fix did work, but I also agree
I didn't fix the underlying issue.

I'll unwind and try to grab another zombie provider snapshot.

>
> In any case, both unprotected uses of ftp_marked in
> fasttrap_pid_cleanup_cb()
> should probably be corrected.

I only saw one unprotected set of ftp_marked, and fixing it did
not fix the problem.

James M

_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


Adam Leventhal
ahl@eng.sun.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 5:45 PM   in response to: James McIlree

  Click to reply to this thread Reply

On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree wrote:
> >Can you describe another scenario whereby we could miss the chance to
> >destroy the provider when the last probe is disabled? Are you
> >operating
> >in an environment where a thread could effectively be off CPU for a
> >while? Can you reproduce the problem?
>
> I'll look at it further, and see what I can come up with.
>
> Let me report my observed data, and maybe you can spot the real
> race.
>
> The problem is 100% reproducible. Of course, this is MP OS X, and
> I don't have a MP Solaris install to test against. SP Solaris does not
> show the problem. I really hope its a shared problem, and not some
> subtle difference in OS behavior :-).

I can certainly try it out on a big MP Solaris box. That said, DTrace and
USDT have been used on a wide array of MP machines, and not hit this to my
knowledge.

> A test program with ~10 user static probes sits in a loop,
> doing fork - exit, fork - exec, vfork - exit, vfork - exec, as fast
> as it can. When it execs, it execs a copy of itself (with probes)
> that immediately exits.
>
> In another shell, I run "dtrace -n "test*:::", let it observe
> a few probe firings, and then I ctrl-c. Do this ~5 - 10 times.
>
> Stop the test process, and do a dtrace -l. You should observe
> USDT probes that belong to processes that have exited. They're zombies.

No luck. This is on a two-way box. Should I try something bigger? Should I
be waiting for these processes that the test program has forked off? If you
could provide the code for the actual test case that would probably help.
The best way to provide a test case would be in a form that could then be
integrated into the DTrace test suite. You might look at this test since it
sounds similar to what you're trying:
usr/src/cmd/dtrace/test/tst/common/usdt/*forker*

> At one point, I walked the provider hash tables by hand in
> the debugger. When I reached a zombie, it's ftp_marked field was unset,
> and retired was set. I focused in on the ftp_marked field and have
> forgotten what the rcount was.

An aside, next up you guys should port mdb:
> fasttrap_provs::print fth_table | ::array fasttrap_bucket_t 100 | ::print fasttrap_bucket_t ftb_data | ::grep .>0 | ::print fasttrap_provider_t ftp_marked ftp_retired
ftp_marked = 0
ftp_retired = 0
...

> >In any case, both unprotected uses of ftp_marked in
> >fasttrap_pid_cleanup_cb()
> >should probably be corrected.
>
> I only saw one unprotected set of ftp_marked, and fixing it did
> not fix the problem.

Like I said: pretty unlikely given the amount of code that would need to
be executed on one CPU during a very narrow window on another CPU (though
it's possible on a VMware or similar config [1]).

Adam

[1] http://blogs.sun.com/ahl/entry/debugging_cross_calls_on_opensolaris

--
Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


James McIlree
jmcilree@apple.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 5:54 PM   in response to: Adam Leventhal

  Click to reply to this thread Reply


On Aug 1, 2007, at 5:45 PM, Adam Leventhal wrote:

> On Wed, Aug 01, 2007 at 05:07:25PM -0700, James McIlree wrote:
>>
>> The problem is 100% reproducible. Of course, this is MP OS X, and
>> I don't have a MP Solaris install to test against. SP Solaris does
>> not
>> show the problem. I really hope its a shared problem, and not some
>> subtle difference in OS behavior :-).
>
> I can certainly try it out on a big MP Solaris box. That said,
> DTrace and
> USDT have been used on a wide array of MP machines, and not hit this
> to my
> knowledge.

Okay, another significant piece of observed data.

There is one decrement of ftp_rcount that does not reset marked,
in fasttrap_pid_disable():

> if (!probe->ftp_enabled) {
> lck_mtx_lock(&provider->ftp_mtx);
> provider->ftp_rcount--;
> lck_mtx_unlock(&provider->ftp_mtx);
> return;
> }

I instrumented this, and sure enough, all my zombie providers are
decremented to zero inside this block of code.

My original theory may be revivable :-).

Of course, the next question is *why* we're in this snippet of code...

James M

_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


Adam Leventhal
ahl@eng.sun.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 6:06 PM   in response to: James McIlree

  Click to reply to this thread Reply

On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree wrote:
> Okay, another significant piece of observed data.
>
> There is one decrement of ftp_rcount that does not reset marked,
> in fasttrap_pid_disable():
>
> > if (!probe->ftp_enabled) {
> > lck_mtx_lock(&provider->ftp_mtx);
> > provider->ftp_rcount--;
> > lck_mtx_unlock(&provider->ftp_mtx);
> > return;
> > }
>
> I instrumented this, and sure enough, all my zombie providers are
> decremented to zero inside this block of code.

How old is the code you're using? This was removed 4 months ago. (And did
you guys really rip out the assertions?)

> My original theory may be revivable :-).

I'm not sure I follow.

> Of course, the next question is *why* we're in this snippet of
> code...

Right. The only way ftp_enabled won't be set is if we didn't get through
fasttrap_pid_enable() successfully (we'd like to fail, but what we'd do
in the broader DTrace framework is a bit unclear).

Adam

--
Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org


James McIlree
jmcilree@apple.com
Re: Race when removing user static providers
Posted: Aug 1, 2007 7:06 PM   in response to: Adam Leventhal

  Click to reply to this thread Reply


On Aug 1, 2007, at 6:06 PM, Adam Leventhal wrote:

> On Wed, Aug 01, 2007 at 05:54:39PM -0700, James McIlree wrote:
>> Okay, another significant piece of observed data.
>>
>> There is one decrement of ftp_rcount that does not reset marked,
>> in fasttrap_pid_disable():
>>
>>> if (!probe->ftp_enabled) {
>>> lck_mtx_lock(&provider->ftp_mtx);
>>> provider->ftp_rcount--;
>>> lck_mtx_unlock(&provider->ftp_mtx);
>>> return;
>>> }
>>
>> I instrumented this, and sure enough, all my zombie providers are
>> decremented to zero inside this block of code.
>
> How old is the code you're using? This was removed 4 months ago.
> (And did
> you guys really rip out the assertions?)

Older than 4 months ago, obviously :-).

"We apologise for the missing ASSERT. Those responsible have been
sacked."


>
>> My original theory may be revivable :-).
>
> I'm not sure I follow.

The zombie providers were sticking around because ftp_marked was
cleared
in the cleanup loop, but never reset.

>
>
>> Of course, the next question is *why* we're in this snippet of
>> code...
>
> Right. The only way ftp_enabled won't be set is if we didn't get
> through
> fasttrap_pid_enable() successfully (we'd like to fail, but what we'd
> do
> in the broader DTrace framework is a bit unclear).

Some additional digging is needed.

James M

_______________________________________________
dtrace-discuss mailing list
dtrace-discuss at opensolaris dot org





Terms of Use | Privacy | Trademarks | Copyright Policy | Site Guidelines
Your use of this web site or any of its content or software indicates your agreement to be bound by these Terms of Use.
Copyright © 1995-2005 Sun Microsystems, Inc.