Thread: Code Review for CR 6705938


Replies: 5 - Last Post: Aug 25, 2008 12:46 AM by: nigoroll
tirth

Posts: 154
From: IN

Registered: 6/30/06
Code Review for CR 6705938
Posted: Aug 20, 2008 9:43 AM
To: Communities » ha-clusters » discuss

Hi,

Please review the fix for
http://bugs.opensolaris.org/view_bug.do?bug_id=6705938

Webrev at
http://cr.opensolaris.org/~tirth/webrev_6705938/
cmm fences off all nodes except the one node that has lost all interconnects

A brief description:
A split brain is being simulated, and the partition with only one node is fencing off all the other nodes. The fix introduces a delay to slow down the smaller partition.
In clusters with up to 4 nodes, a partition that is not delayed will have at least n/2 nodes, where n is the number of configured nodes.

For bigger clusters, we let a smaller partition go ahead if it has enough nodes to tolerate further failures. We do it this way because it speeds up the cmm reconfiguration, which means a shorter service outage, and because the probability of an immediate second or third failure is low. Another assumption is that the administrators will soon notice the split brain, fix it, and bring the other nodes back online.

Please send all your reviews by 21st Aug 2008.

Thanks,
Tirthankar
http://blogs.sun.com/tirthankar

Sambit Nayak
Sambit.Nayak@Sun.COM
Re: Code Review for CR 6705938
Posted: Aug 21, 2008 3:50 AM   in response to: tirth


Hi Tirthankar,

Here are my code review comments.


usr/src/common/cl/cmm/automaton_impl.cc
----------------------------------------------------------
(1) Line 520 : Change "Else" to "Or".
"Else" means something different in an if-else scenario.

(2) Line 3838 : Change "parition" to "partition"

(3) Lines 3837-3843 :
The comment should ideally say "heard from" instead of "talk to".
Not absolutely essential though.

(4) Line 3856 :
You could mention that you skew the "allowed" size to
just less than half if the total number of configured nodes
is from 5 to 64 (instead of 5 to 8)


Otherwise, the changes look good.


Thanks & Regards,
Sambit


_______________________________________________
ha-clusters-discuss mailing list
ha-clusters-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss


tirth

Posts: 154
From: IN

Registered: 6/30/06
Re: Code Review for CR 6705938
Posted: Aug 21, 2008 4:28 AM   in response to: Sambit Nayak
To: Communities » ha-clusters » discuss

Hi Sambit, Ellard,

Thanks for the review.

Sambit,

I have incorporated your feedback other than point number 3.


Thanks,
Tirthankar

http://blogs.sun.com/tirthankar

Nils Goroll
slink@schokola.de
Re: Code Review for CR 6705938
Posted: Aug 22, 2008 4:19 AM   in response to: tirth


Hi,

> Please review the fix for
> http://bugs.opensolaris.org/view_bug.do?bug_id=6705938
>
> Webrev at
> http://cr.opensolaris.org/~tirth/webrev_6705938/
> cmm fences off all nodes except the one node that has lost all interconnects

let_partition_wait is to return true if the node running it is in the "smaller"
partition, right?

Why did you decide to define what is a large partition as sizes 2,3,6 relative
to some total cluster size ranges etc. rather than using something like

// don't wait if we are in a "large" partition,
// which is defined as at least half the nodes
// (for odd num_nodes actually less than half
// the nodes).

if (num_nodes <= 2)
        retval = false;
else if (local_num_nodes > (num_nodes/2))
        retval = false;

?

The larger partition is to fence the smaller partition, right? What happens if
there is no larger partition, for instance if nodes were taken down
administratively or when there are more than two partitions so they will all
wait. Would this do any harm?
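
To make the question concrete, here is a minimal, self-contained sketch of the fixed-threshold rule suggested above. It is not code from the webrev: the function name `fixed_rule_should_wait` and the default of waiting unless a condition clears it are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical stand-alone version of the fixed-threshold rule.
// num_nodes:       number of configured cluster nodes
// local_num_nodes: number of nodes in the calling node's partition
// Returns true if the calling partition should delay.
bool fixed_rule_should_wait(int num_nodes, int local_num_nodes)
{
    assert(num_nodes > 0);
    assert(local_num_nodes > 0 && local_num_nodes <= num_nodes);

    // Assumption: a partition waits unless one of the rules below applies.
    bool retval = true;

    if (num_nodes <= 2)
        retval = false;                      // tiny clusters never wait
    else if (local_num_nodes > (num_nodes / 2))
        retval = false;                      // "large" partition: go ahead

    return retval;
}
```

With this rule, a 4-node cluster split 2/2 has no "large" partition, so `fixed_rule_should_wait(4, 2)` is true on both sides and both halves wait. That is the kind of case the question about a missing larger partition is aimed at.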

I don't have the big picture at the moment, so please ignore my comments if all
this has been thought through.

Nils


tirth

Posts: 154
From: IN

Registered: 6/30/06
Re: Code Review for CR 6705938
Posted: Aug 24, 2008 9:59 AM   in response to: Nils Goroll


Hi Nils,

Thanks for the review.

More replies inline.

On Fri, Aug 22, 2008 at 4:49 PM, Nils Goroll <slink at schokola dot de> wrote:
> Hi,
>
>> Please review the fix for http://bugs.opensolaris.org/view_bug.do?bug_id=6705938
>>
>> Webrev at
>> http://cr.opensolaris.org/~tirth/webrev_6705938/
>> cmm fences off all nodes except the one node that has lost all interconnects
>
> let_partition_wait is to return true if the node running it is in the "smaller" partition, right?

Yes, where the definition of "smaller" changes according to the number of nodes configured. That is, if n is the number of nodes configured, a "smaller" partition may have far fewer than n/2 nodes.
 

> Why did you decide to define what is a large partition as sizes 2,3,6 relative to some total cluster size ranges etc. rather than using something like
>
>        // don't wait if we are in a "large" partition,
>        // which is defined as at least half the nodes
>        // (for odd num_nodes actually less than half
>        //  the nodes).
>
>        if (num_nodes <= 2)
>                retval = false;
>        else if (local_num_nodes > (num_nodes/2))
>                retval = false;
>
> ?

At this point, we decide to form the membership, that is, which nodes will be part of the cluster. The code snippet you are suggesting is almost the same as the one I wrote. Remember, the assumption is that there are 2 partitions. Hence, if one partition is large, the other will be small, and one of them will be waiting. We want the smaller partition to wait.

In your code, the definition of "large" is fixed. In my code, the definition of "small" is variable. If your definition of "large" were variable as well, your code and mine would do the same thing, because for every small partition there is a corresponding large partition.

The above logic assumes that the split brain divides the cluster into exactly 2 partitions. This need not be true, but it is the most probable case. What happens if the cluster is divided into 3 partitions? Now we will have 3 partitions, and we do not want to delay the partition which has the bare minimum acceptable number of nodes, which differs depending on the number of nodes configured.
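
The difference between a fixed "large" and a variable "small" can be sketched as follows. This is a minimal illustration, not the code from the webrev: the function names are hypothetical, and the thresholds in `min_size_without_delay` are placeholders chosen only to show the shape of the rule. The real values are in the webrev.

```cpp
#include <cassert>

// Placeholder thresholds (assumption, NOT the values from the fix):
// the smallest partition size that is allowed to proceed without delay,
// as a function of the number of configured nodes.
int min_size_without_delay(int configured_nodes)
{
    assert(configured_nodes > 0);
    if (configured_nodes <= 4)
        return (configured_nodes + 1) / 2;   // about half, rounded up
    if (configured_nodes <= 8)
        return 3;                            // placeholder value
    return 6;                                // placeholder value
}

// Variable-threshold rule: a partition waits only if it is smaller than
// the minimum acceptable size for this cluster size.
bool variable_rule_should_wait(int configured_nodes, int local_nodes)
{
    assert(local_nodes > 0 && local_nodes <= configured_nodes);
    return local_nodes < min_size_without_delay(configured_nodes);
}
```

Under these placeholder numbers, an 8-node cluster split into partitions of 3, 3 and 2 nodes behaves differently from a strict majority rule: no partition holds more than 4 nodes, so a majority rule delays all three, while `variable_rule_should_wait(8, 3)` is false and only the 2-node partition is delayed.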




> The larger partition is to fence the smaller partition, right?

Yes. But in this case, if we find a smaller partition which can survive a second or third failure, we will let it win.

> What happens if there is no larger partition, for instance if nodes were taken down administratively

A node that is taken down for an administrative function is no longer a part of the cluster. Hence there is no issue.

> or when there are more than two partitions so they will all wait. Would this do any harm?

Yes, the split brain can produce more than 2 partitions, as I have described above. In that case too, we will allow a smaller partition to go ahead if it has the minimum number of nodes. But there may be a case where no such partition exists. This is an unfortunate scenario. In this case we will delay all the smaller partitions, as we know that a smaller partition cannot survive subsequent failures. What is implicit is that most clusters have 4 nodes, hence the logic works for most cases. Since we have very little experience with really large clusters, we may need to redefine "small" in the future.

This is a heuristic algorithm that I am trying to apply. Like any heuristic algorithm, it tries to solve the problem by coming very close to the best possible solution.



> I don't have the big picture at the moment, so please ignore my comments if all this has been thought through.

I very much appreciate that you have taken time to explore and think about this issue.

> Nils



--

Tirthankar
http://insanityrulz.blogspot.com/


nigoroll

Posts: 104
From: DE

Registered: 2/9/06
Re: Code Review for CR 6705938
Posted: Aug 25, 2008 12:46 AM   in response to: tirth
To: Communities » ha-clusters » discuss

Hi Tirthankar,

> let_partition_wait is to return true if the node running it is in
> the "smaller" partition, right?
>
> Yes where the definition of "smaller" changes according to the number of
> nodes configured. i.e. if n is the number of nodes configured, a smaller
> partition may be much less than n/2.
> [...]
> In your code, the definition of "large" is fixed. In my code, the
> definition of "small" is variable.

Agree. But I would prefer a closed form as the definition of what is considered a small/large partition. The numbers you have chosen seem arbitrary to a certain extent and I believe it would be hard to show that exactly those numbers are a good choice. If you could come up with a formula and some good reasoning behind it, this should be much easier to follow.

> Now we will have 3 partitions and we do not want to delay
> the partition which has the bare minimal acceptable number of nodes,
> which differs depending on the number of nodes configured.

What happens if the remaining partition does not have the minimum number of nodes? How long will the delay be? Have you tested the scenario where a cluster has only "small" partitions left?

> What happens if there is no larger partition, for instance if nodes
> were taken down administratively
>
> Node that is taken down for administrative function is no more a part of
> the cluster. Hence there is no issue.

The documented procedure
http://docs.sun.com/app/docs/doc/819-2971/z4000076997776?l=en&a=view
is to evacuate a node for maintenance. Unconfiguring it would be too much of a burden for the admin, IMHO.

So, unless there is new functionality I don't know about, there is no state information available which marks a cluster node as "not available". In the partitioning scenario, we must assume a node is offline if we cannot communicate with it.

So in short, IMHO there is currently no practical way to reliably determine the total cluster size in a partitioning situation.

Am I wrong? I'd be glad if I was and if I am, please help me understand.

If I am right, it could help to add a node property indicating whether or not the node is available. IMHO, this would also help administrators in handling defective hardware, test scenarios, node-local s/w issues etc.
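
As a rough sketch of what such a property could enable, the following computes the cluster size from only the nodes expected to be up. Everything here is hypothetical: the `node_info` record, its `available` flag, and the half-of-expected-nodes rule are assumptions for illustration and do not describe existing cluster functionality.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node record. The "available" flag stands for the
// proposed node property; it is an assumption, not an existing field.
struct node_info {
    int  node_id;
    bool available;   // false: administratively marked as out of service
};

// Count only the nodes that are expected to be up.
int effective_cluster_size(const std::vector<node_info>& nodes)
{
    int count = 0;
    for (const node_info& n : nodes) {
        if (n.available)
            ++count;
    }
    return count;
}

// Example rule (assumption): a partition waits if it holds fewer than
// half of the nodes that are expected to be up.
bool should_wait(const std::vector<node_info>& nodes, int local_num_nodes)
{
    assert(local_num_nodes > 0);
    return 2 * local_num_nodes < effective_cluster_size(nodes);
}
```

In a 4-node cluster with two nodes marked unavailable, a surviving single-node partition holds half of the two expected nodes and would not wait. Measured against all four configured nodes, the same partition would count as "small".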

> What is implicit
> is that most clusters are of 4 node, hence the logic works for most of
> the cases.

I disagree with using such an assumption as the basis for a particular implementation. Might be that Sun has statistics internally about cluster sizes deployed in the field, but as long as the product is supported for other sizes as well, it should work well for all of them.

> This is a heuristic algorithm that I am trying to apply. Hence as any
> heuristic algorithm, it tries to solve the problem by coming very close
> to the best possible solution.

I do agree with this approach, and I understand that your change will improve a particular scenario. I am only worried that it could have negative effects in other scenarios. I have seen enough of those in the past, so I'd be grateful if you could clarify my remaining questions.

Nils



