OpenSolaris


Thread: RGM Question - Forcing a resource offline after repeated failures


Replies: 10 - Last Post: Nov 4, 2009 3:27 PM by: deepvrce
deepvrce

Posts: 7
From: US

Registered: 5/30/07
RGM Question - Forcing a resource offline after repeated failures
Posted: Oct 6, 2009 9:19 AM
To: Communities » ha-clusters » discuss
Cc: OpenSolaris » help

Hi All,

I have an RGM question. It has been some time and I can't remember exactly how best to do this.

On a single-node cluster, I have a resource. What I want to achieve is this: if the application monitored by the resource cannot stay online (after the designated number of retries), Sun Cluster should take the resource offline rather than leave it online in a faulted state and stop probing it. The problem I face when the resource stays online is that a dependent resource doesn't go offline when this failure happens.

Am I missing an easy way to do that?

Sorry for the trouble and thanks in advance..
- Shubho.

vchennu

Posts: 14
From:

Registered: 6/25/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after repeated failures
Posted: Oct 6, 2009 9:44 AM   in response to: deepvrce


Hi Subhadeep,

I think you can use an offline-restart dependency for this. See
this blog for details.

http://blogs.sun.com/SC/entry/disabling_a_depended_on_resource

thanks
-venkat

_______________________________________________
ha-clusters-discuss mailing list
ha-clusters-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss


deepvrce

Posts: 7
From: US

Registered: 5/30/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 10:02 AM   in response to: vchennu
To: Communities » ha-clusters » discuss

Hi Venkat,

I am using an offline-restart dependency right now. But I guess that because the faulty resource is not put in an offline state after the retries are exhausted (it's left online in a faulted state and probing is stopped on it), the dependent resource doesn't go offline either and stays online.

Thanks,
- Shubho.

vchennu

Posts: 14
From:

Registered: 6/25/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 11:02 AM   in response to: deepvrce


Hi Subhadeep,

Sorry. I did not realize that you were using this already.

thanks
-venkat



ashu

Posts: 75
From:

Registered: 3/9/05
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after repeated failures
Posted: Oct 6, 2009 10:45 AM   in response to: deepvrce


Subhadeep Sinha wrote:
> Hi All,
>
> I have a RGM question. Been some time and I can't exactly remember how best to do this..
>
> On a single node cluster, I have a resource. What I want to achieve effectively is that if the application monitored by the resource is not able to stay online (after the designated retry counts), then Sun Cluster offline the resource, rather than leaving it online in a faulted state and quitting to probe it. The problem that I am facing if the resource stays online is that a dependent resource doesn't go offline when this failure happens.
>
> Am I missing an easy way to do that ?

Hmmm... I thought what you want above is actually the DEFAULT
out-of-the-box behaviour for SC, no tweaking needed. Because that is
the "most sensible thing" (tm) to do. :-)

If a resource fails to start, it gets into the START_FAILED state.
Which is what you want, right? Unless you have tweaked things, that is
what should happen.

Am I missing something obvious?

-ashu



deepvrce

Posts: 7
From: US

Registered: 5/30/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 11:49 AM   in response to: ashu
To: Communities » ha-clusters » discuss

Hey Ashu,

In my case the resource never fails to start. The probe fails because of some rough weather the application runs into after it has started, and SC then restarts the resource retry_count times. Only then does the resource enter a faulted/unmonitored state in which probing is stopped. But the final state of the resource is still "online".

Thanks,
- Shubho.

nsolter

Posts: 267
From: US

Registered: 11/22/06
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 12:37 PM   in response to: deepvrce


Shubho,

What is the value of the Failover_mode property for the resource?

Thanks,
Nick



deepvrce

Posts: 7
From: US

Registered: 5/30/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 1:14 PM   in response to: nsolter
To: Communities » ha-clusters » discuss

Hey Nick,

I have tried SOFT, HARD, and NONE. Is there anything else that I should set it to?

Thanks,
- Shubho.

ashu

Posts: 75
From:

Registered: 3/9/05
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 1:41 PM   in response to: deepvrce


Hi Subhadeep,

I see. Strange situation indeed; it sounds like either a
badly designed probe or a badly behaving app, or both.

The strange part is that starting the application is
always OK, even though SC (or rather, the agents built by Agent
Builder) would typically probe the app at start time as well
to make sure it really did start. It must be related to load,
as you suggested.

Badly designed app/probe or not, it would be good to have
a knob in SC that puts the application into an
errored/offline state if it keeps failing
after successfully starting.

I do not believe that we have such a knob in SC today,
but perhaps someone else on the list will happily prove me wrong.

-ashu




mrattner

Posts: 18
From: US

Registered: 7/23/08
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Oct 6, 2009 2:00 PM   in response to: ashu


Hi Shubho,

You can't just set a knob to make this happen. However, if you are able
to modify the code of the resource monitor, you could execute a call to
scha_control with the RESOURCE_DISABLE tag when this case occurs. This
stops the monitor, stops the resource itself, and
persistently disables the resource (the same as executing the "clrs disable"
command). The cluster administrator can re-enable the resource when the
problem is fixed.

This scha_control RESOURCE_DISABLE tag is currently used in the
ScalDeviceGroup and ScalMountPoint resource monitors, when an
unrecoverable error occurs. Usually these resource types have
offline-restart dependents, so disabling them takes the dependent
offline too. I think this is the sort of thing you're trying to do.

This is documented in the scha_control(1HA) and scha_control(3HA) man pages.

--Marty
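
A minimal sketch of the approach Marty describes, assuming the resource monitor is a shell script that keeps its own count of consecutive probe failures. The function name, the failure counter and limit, and the names my-rg and my-rs are hypothetical; only the scha_control call with the RESOURCE_DISABLE tag comes from the post above and the scha_control(1HA) man page.

```shell
#!/bin/sh
# Sketch only: the counter, the limit, and the resource names are hypothetical.
# SCHA_CONTROL can be overridden, e.g. with a full path to the command.
SCHA_CONTROL=${SCHA_CONTROL:-scha_control}

# disable_if_exhausted FAILURES LIMIT RESOURCE_GROUP RESOURCE
# Disables the resource once FAILURES has reached LIMIT; otherwise does nothing.
disable_if_exhausted() {
    failures=$1
    limit=$2
    rg=$3
    rs=$4

    if [ "$failures" -lt "$limit" ]; then
        return 0    # still within the retry budget; leave the resource alone
    fi

    # Stops the monitor and the resource and persistently disables the
    # resource, so offline-restart dependents go offline as well.
    "$SCHA_CONTROL" -O RESOURCE_DISABLE -G "$rg" -R "$rs"
}
```

A probe loop might call disable_if_exhausted "$failures" "$limit" my-rg my-rs after each failed probe. Once the underlying problem is fixed, the cluster administrator re-enables the resource, for example with "clrs enable my-rs".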




deepvrce

Posts: 7
From: US

Registered: 5/30/07
Re: [ha-clusters-discuss] RGM Question - Forcing a resource offline after
Posted: Nov 4, 2009 3:27 PM   in response to: mrattner
To: Communities » ha-clusters » discuss

Thanks, Marty! That worked perfectly.

- Shubho.



