OpenSolaris

Thread: Clearview IPMP Rearchitecture: high-level design: extended to 9/29

Replies: 24 - Last Post: Sep 29, 2005 2:20 PM by: meem
meem

Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 23, 2005 10:27 AM

In response to several requests, and in light of the size of the document,
I have extended the timer for feedback to Thursday, 9/29. As before, the
document is available here:

http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf

The document has also been slightly updated in response to the feedback
received so far. Note that there are no changes to the original design,
but several ambiguous statements have been clarified, and the interaction
between DHCP and IPMP has been expanded in Section 4.12 (thanks, dme!).
Accordingly, the version number of the document has been bumped to 1.2.

Thanks again for your feedback -- and keep it coming!
--
meem
_______________________________________________
networking-discuss mailing list
networking-discuss at opensolaris dot org



carlsonj

Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 26, 2005 12:13 PM   in response to: meem

Peter Memishian writes:
> http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf

I reviewed 1.0, but I've checked all of my following comments against
1.2. They still seem mostly relevant.

(Comments by page and section number.)

p1, section 1.1:

- A bit of a nit, but it's really not about "sockets." The
applications that benefit from IPMP include those that don't use
sockets at all (such as RPC), and there are sockets-using
applications that don't benefit from it (such as AppleTalk).
Instead, the real term should be just "TCP/IP based."

- The lack of Sun Trunking across the board isn't really a failing
of the technology, but rather of our implementation of it.

p3, first bullet:

- The "failsafe" mechanisms referred to here are probably route flap
dampening and hold-downs.

- Some other things that could be added to this list:

. IPMP's probing mechanism is incompatible with standards-
based VRRP. The former normally requires routers to respond
to ICMP Echo messages, but the latter prohibits the back-up
router from responding to any packets addressed directly to
the virtual address, even simple ping requests. This means
that a fail-over of VRRP triggers fail-over (and eventual
failure) in IPMP.

. Some routers are configured not to reply to 'ping' at all,
as it's sometimes viewed as a "security problem."

. Having large numbers of these probing hosts on a single
network can cause high amounts of ICMP Echo traffic, thus
triggering ICMP rate-limiting in many routers, and resulting
in false failure detection.

. At least in previous releases (not sure about today), the
code assumed that the fast-path M_DATA header was the same
on all member links, even though there's no mechanism that
actually causes this to be true.

. The use of source addresses is confusing to users, and often
results in RFEs filed asking for "same interface" semantics.

. The detailed behavior of general multicast (not the
well-known link-local multicast addresses) is less clear.
In particular, the behavior necessary to accommodate
IGMP-snooping switches is probably missing.

p4, requirement 4:

- Should mention here that packet filtering should be done on a
per-group basis to maintain state tracking.

p5, section 3.1:

- What is the MTU on an IPMP interface? Is it the minimum of all
member links?

p8, section 3.7:

- I'm not sure the described behavior here really represents
FAILBACK=no. (But, then, I'm not sure what behavior would really
represent that.) It would help to have some examples worked out
here.

For example, imagine a group of three active interfaces with
FAILBACK=no. Using "A" for active, "I" for inactive, and "F" for
failed, I can come up with differing end results with only minor
changes in timing. For example:

(AAA) -> (FAA) -> (IAA) -> (AFA) -> (AIA)

occurs when interface 1 fails and is repaired, followed by a
failure and repair of interface 2. But if we have interface 2
fail right before the interface 1 repair, we get:

(AAA) -> (FAA) -> (FFA) -> (IFA) -> (IIA)

Maybe that's actually right, and is just a consequence of an
unusual usage model (FAILBACK=no without a designated STANDBY
interface), but the result seems odd to me. And I'm really
concerned about what happens if *all* the interfaces fail and then
recover.
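
To make the timing sensitivity concrete, here is a minimal Python sketch
that replays the two sequences above. It assumes only two rules, both
inferred from those sequences rather than taken from the design document:
a repaired interface comes back inactive, and when an active interface
fails, the first inactive interface (if any) is activated. The function
names are hypothetical, and the sketch does not try to answer what should
happen when *all* interfaces fail.

```python
def fail(states, i):
    """Return the group state after interface i fails (assumed rule)."""
    new = list(states)
    was_active = new[i] == "A"
    new[i] = "F"
    # Assumption: a failure of an active interface activates the first
    # inactive interface, if there is one.
    if was_active and "I" in new:
        new[new.index("I")] = "A"
    return "".join(new)


def repair(states, i):
    """Return the group state after interface i is repaired (assumed rule)."""
    new = list(states)
    # Assumption: with FAILBACK=no, a repaired interface becomes inactive
    # instead of reclaiming its addresses.
    if new[i] == "F":
        new[i] = "I"
    return "".join(new)


def run(events, start="AAA"):
    """Apply a list of ("fail" | "repair", index) events and keep the trace."""
    trace = [start]
    for kind, i in events:
        step = fail if kind == "fail" else repair
        trace.append(step(trace[-1], i))
    return trace


# Interface 1 fails and is repaired, then interface 2 fails and is repaired.
sequential = run([("fail", 0), ("repair", 0), ("fail", 1), ("repair", 1)])
# Interface 2 fails right before interface 1 is repaired.
overlapping = run([("fail", 0), ("fail", 1), ("repair", 0), ("repair", 1)])

print(" -> ".join(sequential))
print(" -> ".join(overlapping))
```

The same four events, reordered only slightly, leave the group with two
active interfaces in one case and one in the other.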

p9, section 3.9:

- It would help a bit, I think, to segregate out the flags that are
on logical interfaces (address flags) from those that are on the
underlying physical interface. (The break is after the 5th entry
in the first table, and after the second in the second table.)

- DL_NOTE_LINK_DOWN causes the kernel to clear out IFF_RUNNING.
Does it now result in the kernel setting IFF_FAILED as well? Or
does in.mpathd (when and only when it monitors a given interface)
set IFF_FAILED in response to IFF_RUNNING being cleared?

I can perhaps understand having the IFF_RUNNING flag set by the
kernel to be the logical AND of having the hardware report the
interface up, and software reporting the interface not failed, but
I'm not sure I see why the two (IFF_FAILED and IFF_RUNNING) must
be just mirror images of each other.

In fact, I'm not sure why IFF_RUNNING on the member links would be
cleared out by IFF_FAILED. It's not as though ordinary
applications would ever see those interfaces, so they cannot be
confused by the meaning of the extra bits; they need to set
special Solaris-specific flags to see them at all. So why the
interlock?

p10, section 3.10:

- Could note that in.mpathd already has DLPI open, so this isn't a
significant change. (But based on the above, I don't see why
special link up/down handling is actually needed.)

p10, section 3.11:

- It would be possible to move IP addresses from one L2 address (or
member link) to another from within in.mpathd, perhaps using the
existing ARP ioctls. Does this really need to be sent to the
kernel? (No problem if you _want_ it there, just that it doesn't
seem to _need_ to be there.)

p13, section 3.16.4:

- I had a lot of trouble reading this section.

First, I don't see why "usesrc" is specifically senseless with
IPMP. OSPF-MP assumes that the "real" interfaces themselves do
have addresses, but that those addresses just aren't supposed to
be used by normal applications.

I suspect part of the confusion here might be between the source
address selected, and the way IPMP does inbound load spreading.
However, the load spreading is done a different way when OSPF-MP
is in use, and isn't incompatible with IPMP. Peers will see those
IPMP addresses as independent, equal-cost next-hop addresses, and
will establish routes to each one. The return packets will be
load-spread according to those ECMP routes rather than by the
destination address (which almost certainly isn't on-link anyway).

Secondly, I don't quite see why the new model helps or hurts
here, at least in the ways described. It certainly does help by
getting rid of the distracting test addresses, but I see no
particular relationship with "usesrc."

Thirdly, the claim about network utilization doesn't seem to be
substantiated. Why exactly would IPMP improve utilization?
Certainly, all interfaces in an OSPF-MP group are expected to be
able to send packets. (Is that the underlying problem that this
text is referencing -- the lack of ECMP support on Solaris? If
so, then I don't think that recommending IPMP to solve that
problem is the right path.)

Finally, I'm not sure I understand why we would want to make our
implementation of "usesrc" less uniform than it is now. We did
try in the original design to make sure that it didn't have odd
interactions with other technologies, and could be reused as
needed, and it seems odd that we're roping it off in this case.

p14, section 4.1.1:

- Who does the "next available" selection when a ~NOFAILOVER address
is transferred to an ipmp logical interface? Is this done in the
kernel itself, in in.mpathd, or in ifconfig? (If it's the latter,
what happens to existing applications that plumb IP interfaces?)

p15, section 4.1.4:

- There's a new semantic implied here. It's no longer possible to
set the group name first and then set the NOFAILOVER flag. That
is, attempts to do this will produce unexpected results:

# ifconfig ce0 10.0.0.1 netmask + broadcast + up
# ifconfig ce0 group foo
# ifconfig ce0 -failover

After that second command, the interface that's being configured
will (presumably) migrate to the ipmp interface, leaving that last
command to uselessly set the flag on the unused ce0 member link.

- I would recommend keeping the "is ipmp" notion separate from the
group name establishment just for the purpose of clarity. In
other words, I'd prefer something like this instead:

# ifconfig outside0 ipmp group b

Or even this:

# ifconfig outside0 plumb ipmp group b

- It's specifically against the rules to have an option that takes
an optional parameter, as the proposed "[ipmp [groupname]]" syntax
would allow. (The problem is that it makes subsequent keywords
ambiguous, and ifconfig is tortured enough as it is. ;-})

- Is it possible to give an ipmp interface the name of a real
interface on the system? What happens if I do that? (I assume
that the "ipmp" keyword causes an error because the interface
already exists and can't switch types.)

What happens if I give it the name of an interface that _later_ is
established as a real one, as with DR? Does the new interface get
rejected by the system?

p16, section 4.1.5:

- The lack of symmetry between the "ifconfig ipmp1 ipmp b" and
"ifconfig ipmp1 unplumb" operations doesn't look too pretty.

p16, section 4.1.6:

- Who does this address migration?

- Administrative issue to be documented: accidentally setting a
group name (or the wrong group name) on an up interface is now
highly toxic. It means that the address slips out from the
administrator's control, and doesn't come back unless he does a
series of unintuitive commands. (I.e., just clearing out the
group name won't fix the problem.)

- An IPMP group with no member links has IFF_RUNNING cleared, right?
Is it also IFF_FAILED?

p17, section 4.2:

- What's the privilege model for the new command?

- I like the flag-verbs, but I'm nearly certain that xDesign won't.

- I suggest that you create a machine parseable output format _now_,
since Explorer-consumers are growing rampant and are nailing all
the other utilities that don't have parseable forms.

p18, section 4.2.1:

- What shows in "-g" output under the 'fdt' column if the group
doesn't have probe targets or doesn't have test addresses?

- "Degraded" might need a tighter definition. In particular, if I
have a group with one stand-by interface, and one of the main
interfaces has failed over to the stand-by, is that group now in
"degraded" mode? It's not degraded from the bandwidth or
availability point of view, though perhaps it is from a hardware
maintenance point of view.

p18, section 4.2.2:

- Should there be a "-n" option to suppress address-to-name
translation? (And should "names" be the default the way they are
most everywhere else?)

- Suggest using "--" rather than "n/a" for consistency with other
existing tools.

- Can "-a" or some other tool show which interface is the current
multicast/broadcast "lead" interface for duplicate suppression?
This part still isn't observable.

- How does the utility get this information? Via ARP ioctls or some
other mechanism?

p19, section 4.2.3:

- What permissions does "-i" need? Won't it need to be root to look
at the DLPI driver for the "probe" column? (Or is some other
magic afoot?)

- What does the "active" column represent? It's not really
explained. (It doesn't just appear to be the inverse of
IFF_INACTIVE.)

- Why is it impossible to report link up/down status when the link
is offline? Shouldn't the system still monitor link up/down
status, even while the link is administratively offlined? Or is
there some interference here with DR?

p20, section 4.2.5:

- Is it really possible to get probes that march backwards in time,
as shown by the 1438 to 1439 transition? Or is that just a
cut-and-paste issue?

- Should the column header be "seq" instead of "probe?"

- Why does ipmpstat need to query the IPMP subsystem periodically?
Can't it just block awaiting notification from IPMP?

- It might be helpful to have the time displayed in some way that's
aligned with snoop. (Though exactly how, I'm not sure.)

p21, top of page:

- When exactly is a packet declared "lost?" Is it when we go to
send another and the previous hasn't arrived yet? Or is it
related to the "FDT?"

p22, section 4.3.2:

- There seems to be some surprising (and unintended) new
functionality here. If I manage things by group name, then I
don't need to remember which ipmp group is which in order to add a
new address. I can just do something like this:

# ifconfig foobar0 plumb group a 10.0.0.1 up

and since "foobar0" will never exist, this will add the address to
the named group.

p23, 'route' changes:

- If this functionality is implemented in the 'route' command
itself, rather than in the kernel, what does that mean for
existing utilities? It seems like the "add static route" feature
in Zebra and the like will be harmed by this.

(For what it's worth, I think those utilities are probably blown
out of the water by removing "ce0" from the SIOCGIFCONF data, and
will need manual intervention to convert their configurations
over. I hope that there's not much mixed IPMP/Zebra usage.)

p23, section 4.6.1:

- "duplicate address detection will be used to ensure that no other
on-link hosts are currently using it." On *what* link? I assume
this means that one will be chosen arbitrarily and just used.

- Why is the link-local address unreachable if there are no
interfaces in the group? Aren't local addresses always reachable?

- Why would an IPv6 ipmp interface have the BROADCAST flag set?

p24, section 4.6.2:

- NumAddrs: ew. This should really be based on the number of member
links in the IPMP group. You'll want to have at least one data
address per member link in order to get the inbound load-spreading
right. It'd be better still if in.ndpd just did the right thing.

p25, top of page:

- Need to know exactly how statistics (particularly errors) are
handled. At a guess, I think you'll need to keep a record of what
the error counter for a member link was at the time it joined, so
that you can add the delta during membership to the total.
Otherwise, removing a link from a group could cause the counters
to roll backwards, and that's a big No-No.
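
As a rough sketch of the baseline-and-delta bookkeeping guessed at above
(in Python, with a hypothetical GroupCounter class; nothing here is from
the design document): record each member link's counter when it joins,
report only the growth since then, and fold that growth into a running
total when the link leaves.

```python
class GroupCounter:
    """Aggregate one counter (e.g. errors) across the links of a group."""

    def __init__(self):
        self._departed_total = 0  # deltas folded in from links that left
        self._members = {}        # link name -> (baseline, read function)

    def join(self, name, read_counter):
        # Remember the link's counter value at the moment it joins.
        self._members[name] = (read_counter(), read_counter)

    def leave(self, name):
        # Preserve what the link contributed while it was a member.
        baseline, read_counter = self._members.pop(name)
        self._departed_total += read_counter() - baseline

    def total(self):
        live = sum(read() - base for base, read in self._members.values())
        return self._departed_total + live


# Hypothetical per-link error counters, already non-zero before joining.
counters = {"ce0": 100, "ce1": 40}
group = GroupCounter()
group.join("ce0", lambda: counters["ce0"])
group.join("ce1", lambda: counters["ce1"])

counters["ce0"] += 5
counters["ce1"] += 2
before = group.total()

group.leave("ce0")
after = group.total()
print(before, after)
```

Because the departing link's delta is kept, the group total is the same
immediately before and after the removal, so it never rolls backwards.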

p25, section 4.8:

- What needs to change in ARP so that multiple L2 addresses are
accepted as local by the system? It seems to me that, since ARP
is plumbed over each real interface, ARP will need to be in on the
game that IPMP is playing. There need to be signals between IPMP
and ARP to accomplish this.

p26, section 4.10:

- "Sent to and received from" in the context of /dev/ipnet/ipmp*
actually means sent or received on any member link, and not just
filtered based on address. Right?

p28, section 5.1:

- One of the applications affected is SNMP.

p29, section 5.2:

- 1: If an address is up and marked IFF_NOFAILOVER, can I cause the
address to migrate by clearing IFF_NOFAILOVER?

- 2: Is IFF_INACTIVE modified both by the kernel and by
applications?

- 4: Why isn't IFF_COS_ENABLED set? This will likely break IPQoS.
Shouldn't it just be the logical AND of all the IFF_COS_ENABLED
bits on the underlying interfaces?

- 6: When adding a new interface to a group, what happens? Is the
interface's IFF_ROUTER flag changed to match the flag used for the
existing group?

- 7: What does IFF_XRESOLV mean on an ipmp interface?

p30, table 2:

- Might be nice to indicate which ones are physical and which are
logical.

- Why would IFF_MIPRUNNING appear on any underlying interface?

- Why can't IFF_MULTI_BCAST be visible on an ipmp interface?

p31, top of page:

- Is the automatic IFF_DEPRECATED logic done in the kernel or user
space?

- As before, I'm not really sure why clearing IFF_RUNNING on the
member links in response to IFF_FAILED makes sense. I'd much
rather see these two flags retain their original meaning, and have
the ipmp aggregate interface just show ~IFF_RUNNING when all
member interfaces have *either* IFF_FAILED set *or* IFF_RUNNING
cleared. That would better illustrate the layering, as the
"failed" concept is really an artifact of how IPMP works
internally.
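
The alternative suggested above can be written down as a small Python
sketch. The MemberLink type and group_running function are hypothetical
names used only for illustration, and this models the suggestion in this
comment, not the behavior in the design document.

```python
from dataclasses import dataclass


@dataclass
class MemberLink:
    name: str
    running: bool  # stands in for IFF_RUNNING (link state from the driver)
    failed: bool   # stands in for IFF_FAILED (set by in.mpathd)


def group_running(members):
    """The ipmp interface is running if at least one member link is usable.

    Equivalently, it shows ~IFF_RUNNING when every member either has
    IFF_FAILED set or has IFF_RUNNING cleared. The two member flags stay
    independent of each other.
    """
    return any(m.running and not m.failed for m in members)


links = [
    MemberLink("ce0", running=True, failed=True),    # probes failing
    MemberLink("ce1", running=False, failed=False),  # link down
]
print(group_running(links))
```

Under this rule a group with no member links also comes out as not
running, since there is no usable member.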

p31, section 5.4.1:

- There's no such thing as a "routing socket associated with the
IPMP interface." There are just global routing sockets; they're
associated with IP itself, not any particular interface.

p31, section 5.4.2:

- This section is a little confusing, because it talks about the
possibility of seeing the IPMP member links well before it says
how this might happen.

p32, top of page:

- I had trouble reading this. I assume it means just that
RTM_NEWADDR will occur during address transfer, and not that
IFF_NOFAILOVER is the only possible way this message could be
sent.

- Is any message sent to the user when the SO_RTSIPMP flag is set or
cleared? If not, then how does the user know that he's got a
consistent view of the world? (I suspect the answer is that the
user should set the flag before doing SIOCGLIFCONF with LIFC_IPMP,
and then just _never_ clear it.)

p32, section 5.5:

- If we did allow routes to interfaces, then the behavior in section
4.5 would become a bit weird.

p32, section 5.6:

- What happens if an ifindex of a member link is used? (I assume it
results in an error.)

p33, section 5.7:

- It would be nice if SIOCLIFADDIF worked to add the zeroth address
on the interface when the address configured there is 0.0.0.0.
This would remove one bit of asymmetry from the current design.

p33, section 5.9:

- What do SIOC[GS]LIFLNKINFO and SIOC[GS]LIFMUXID mean on ipmp
interfaces? The latter seems senseless, as no real plumbing can
occur there.

- What do SIOC[GS]LIFMETRIC mean on member link interfaces? This
doesn't make sense, as routing (the consumer of interface metrics)
won't use them. Does something in IPMP itself use them?

p34, section 5.10:

- Probably need more detailed behavior for SIOCDARP.

p34, section 5.12:

- What happens if I set the zone ID to a non-global zone first, and
then set the IFF_NOFAILOVER flag?

p35, section 5.14:

- It might be worth mentioning that IGMP operation under IPMP is
really unclear. (Should membership messages be echoed out all
interfaces so that switches know that all interfaces are
participating? Or just over the "lead" interface? If it's the
latter, shouldn't the IGMP messages be repeated if a new "lead" is
chosen?)

p35, section 5.15:

- "Reverse-engineered" might be a little strong. The same interface
exists on other Mentat-derived systems, and (like fast-path) is
probably documented there.

p36, section 5.16:

- If the vni driver goes away, does the vni(7D) man page go too?

p36:

- Any changes for PSARC 2002/137 (IPMP Asynchronous Event
Definitions) due to this project?

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 27, 2005 10:07 PM   in response to: carlsonj

Jim,

Thanks as always for your thorough comments -- the document is quite a bit
improved thanks to them. My replies are inline. In most cases, I have
gone ahead and updated the document, which is now available as version
1.3. I suspect subsequent reads will find some typos in my revisions,
so don't be startled if it's 1.3.1 or 1.3.2 by the time you get a chance
to look at it again ;-)

When replying, please remove anything non-controversial so that we can
converge on the remaining issues.

> p1, section 1.1:
>
> - A bit of a nit, but it's really not about "sockets." The
> applications that benefit from IPMP include those that don't use
> sockets at all (such as RPC), and there are sockets-using
> applications that don't benefit from it (such as AppleTalk).

The intent is to make it clear that the focus is on sockets (hence all of
the changes to the socket-level APIs). We're not putting effort into
ensuring that XTI or TLI applications will benefit, though I believe they
will since AFAIK all of the TLI/XTI interface and local address discovery
operations are implemented in terms of sockets.

> Instead, the real term should be just "TCP/IP based."

So UDP and SCTP don't benefit? :-P I can say "AF_INET[6]-based" if you
prefer. But claiming that IPMP will work with all IP-based applications,
including those using TLI/XTI, seems a bit too bold to me (and I'm not
convinced it's time well spent to guarantee it).

> - The lack of Sun Trunking across the board isn't really a failing
> of the technology, but rather of our implementation of it.

Sure; clarified in the document (also true with 802.3ad).

> p3, first bullet:
>
> - The "failsafe" mechanisms referred to here are probably route flap
> dampening and hold-downs.

Yes; clarified.

> - Some other things that could be added to this list:
>
> . IPMP's probing mechanism is incompatible with standards-
> based VRRP. The former normally requires routers to respond
> to ICMP Echo messages, but the latter prohibits the back-up
> router from responding to any packets addressed directly to
> the virtual address, even simple ping requests. This means
> that a fail-over of VRRP triggers fail-over (and eventual
> failure) in IPMP.
>
> . Some routers are configured not to reply to 'ping' at all,
> as it's sometimes viewed as a "security problem."
>
> . Having large numbers of these probing hosts on a single
> network can cause high amounts of ICMP Echo traffic, thus
> triggering ICMP rate-limiting in many routers, and resulting
> in false failure detection.

Sadly, these three problems are inherent in the probe-based failure
detection mechanism. I'm not sure what we can do about them from a
technical standpoint.

> . The use of source addresses is confusing to users, and often
> results in RFEs filed asking for "same interface" semantics.

Agreed, and incorporated.

> . At least in previous releases (not sure about today), the
> code assumed that the fast-path M_DATA header was the same
> on all member links, even though there's no mechanism that
> actually causes this to be true.

This seems like an implementation flaw that has no impact on the design.

> . The detailed behavior of general multicast (not the
> well-known link-local multicast addresses) is less clear.
> In particular, the behavior necessary to accommodate
> IGMP-snooping switches is probably missing.

I'm not convinced this is a problem that shapes the *high-level* design.

> p4, requirement 4:
>
> - Should mention here that packet filtering should be done on a
> per-group basis to maintain state tracking.

Sure; mentioned.

> p5, section 3.1:
>
> - What is the MTU on an IPMP interface? Is it the minimum of all
> member links?

Yes. I've added a new section, 3.10, which covers how MTU used to be
handled, and how it will be handled in the future. In retrospect, the
omission of this topic is glaring; apologies.
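
As a rough illustration of the rule confirmed above (the group MTU is the
minimum across the member links), here is a sketch in Python. The function
name is hypothetical, and returning None for a group with no member links
is an assumption, since nothing in this thread says what such a group
reports.

```python
def ipmp_mtu(member_mtus):
    """Return the MTU of an IPMP interface given its member links' MTUs."""
    if not member_mtus:
        # Assumption: the empty-group case is left undefined here.
        return None
    return min(member_mtus)


# A jumbo-frame link grouped with two standard Ethernet links.
print(ipmp_mtu([1500, 9000, 1500]))
```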

> p8, section 3.7:
>
> - I'm not sure the described behavior here really represents
> FAILBACK=no. (But, then, I'm not sure what behavior would really
> represent that.) It would help to have some examples worked out
> here.
>
> For example, imagine a group of three active interfaces with
> FAILBACK=no. Using "A" for active, "I" for inactive, and "F" for
> failed, I can come up with differing end results with only minor
> changes in timing. For example:
>
> (AAA) -> (FAA) -> (IAA) -> (AFA) -> (AIA)
>
> occurs when interface 1 fails and is repaired, followed by a
> failure and repair of interface 2. But if we have interface 2
> fail right before the interface 1 repair, we get:
>
> (AAA) -> (FAA) -> (FFA) -> (IFA) -> (IIA)
>
> Maybe that's actually right, and is just a consequence of an
> unusual usage model (FAILBACK=no without a designated STANDBY
> interface), but the result seems odd to me. And I'm really
> concerned about what happens if *all* the interfaces fail and then
> recover.

Yes, it's a bit of an odd bird. Personally, I'd love to get rid of this
feature, but I know there are customers who hate unnecessary rebinding of
addresses to interfaces (because of the effect it has on other hosts) and
thus want to have that happen as little as possible.

I've added a little more rationale behind the feature, but I don't want to
devote too much space to this wart (I'd like to kill it, but I can't).

> p9, section 3.9:
>
> - It would help a bit, I think, to segregate out the flags that are
> on logical interfaces (address flags) from those that are on the
> underlying physical interface. (The break is after the 5th entry
> in the first table, and after the second in the second table.)

Segregate how -- with an extra line in the table? And for what purpose?

> - DL_NOTE_LINK_DOWN causes the kernel to clear out IFF_RUNNING.
> Does it now result in the kernel setting IFF_FAILED as well? Or
> does in.mpathd (when and only when it monitors a given interface)
> set IFF_FAILED in response to IFF_RUNNING being cleared?

I'm uncomfortable with the idea of a flag that is sometimes set by the
kernel, and sometimes by an application. Since there are situations where
IFF_FAILED must be set by in.mpathd, I think I'd rather have in.mpathd always
set it. This also keeps all the policy of setting IFF_FAILED in one place
(even if the policy is rigid, I prefer it centralized).

> I can perhaps understand having the IFF_RUNNING flag set by the
> kernel to be the logical AND of having the hardware report the
> interface up, and software reporting the interface not failed, but
> I'm not sure I see why the two (IFF_FAILED and IFF_RUNNING) must
> be just mirror images of each other.

The mirror-image must be maintained to ensure that naive applications
behave correctly: if there were a situation where IFF_FAILED was set and
IFF_RUNNING was set, then an application would try to use an unusable
interface. (The other case, where IFF_FAILED was clear and IFF_RUNNING
was clear makes no semantic sense: how can the interface not be
IFF_RUNNING, but not be IFF_FAILED?)

> In fact, I'm not sure why IFF_RUNNING on the member links would be
> cleared out by IFF_FAILED. It's not as though ordinary
> applications would ever see those interfaces, so they cannot be
> confused by the meaning of the extra bits; they need to set
> special Solaris-specific flags to see them at all. So why the
> interlock?

To make it clear to someone using ifconfig or other administrative tools.

> p10, section 3.10:
>
> - Could note that in.mpathd already has DLPI open, so this isn't a
> significant change. (But based on the above, I don't see why
> special link up/down handling is actually needed.)

It does?

> p10, section 3.11:
>
> - It would be possible to move IP addresses from one L2 address (or
> member link) to another from within in.mpathd, perhaps using the
> existing ARP ioctls. Does this really need to be sent to the
> kernel? (No problem if you _want_ it there, just that it doesn't
> seem to _need_ to be there.)

I feel it's more natural in the kernel.

> p13, section 3.16.4:
>
> - I had a lot of trouble reading this section.

We talked offline about this. I have updated the text to make it clear
that I was referring to usesrc being too sharp a knife with the current
IPMP administrative model, and to clarify that the network utilization
comments assume the lack of ECMP.

> Finally, I'm not sure I understand why we would want to make our
> implementation of "usesrc" less uniform than it is now. We did
> try in the original design to make sure that it didn't have odd
> interactions with other technologies, and could be reused as
> needed, and it seems odd that we're roping it off in this case.

It's not less uniform -- both now and in the future, usesrc with IPMP is
not supported. I'm roping it off because I can't see a compelling reason
to support the configuration, especially vs. OSPF-MP with ECMP.

> p14, section 4.1.1:
>
> - Who does the "next available" selection when a ~NOFAILOVER address
> is transferred to an ipmp logical interface? Is this done in the
> kernel itself, in in.mpathd, or in ifconfig? (If it's the latter,
> what happens to existing applications that plumb IP interfaces?)

The kernel will do it as part of bringing the interface IFF_UP. This was
intended to be implied by list item 1 on page 30.

> p15, section 4.1.4:
>
> - There's a new semantic implied here. It's no longer possible to
> set the group name first and then set the NOFAILOVER flag. That
> is, attempts to do this will produce unexpected results:
>
> # ifconfig ce0 10.0.0.1 netmask + broadcast + up
> # ifconfig ce0 group foo
> # ifconfig ce0 -failover

Actually, that has always potentially led to unexpected results (e.g., if
ce0 was failed). However, I agree that some systems may have been
misconfigured this way and thus we should have a release note explaining
the situation. As such, I have updated the document to discuss the issue.

> - I would recommend keeping the "is ipmp" notion separate from the
> group name establishment just for the purpose of clarity. In
> other words, I'd prefer something like this instead:
>
> # ifconfig outside0 ipmp group b
>
> Or even this:
>
> # ifconfig outside0 plumb ipmp group b
>
> - It's specifically against the rules to have an option that takes
> an optional parameter, as the proposed "[ipmp [groupname]]" syntax
> would allow. (The problem is that it makes subsequent keywords
> ambiguous, and ifconfig is tortured enough as it is. ;-})

To cover all of the above: based on our offline discussion, I've changed
the syntax to be "ifconfig outside0 ipmp group b" (or, as a shorthand,
"ifconfig outside0 ipmp"). I have also relaxed the constraint on changing
the group name: it is now permitted as long as there are no underlying
interfaces in the group.

The document has been updated.

> - Is it possible to give an ipmp interface the name of a real
> interface on the system? What happens if I do that? (I assume
> that the "ipmp" keyword causes an error because the interface
> already exists and can't switch types.)

Right, this is covered in section 4.1.4.

> What happens if I give it the name of an interface that _later_ is
> established as a real one, as with DR? Does the new interface get
> rejected by the system?

The interface isn't rejected, but it won't be plumbed by IP (and an error
will be logged). C'est la vie.

> p16, section 4.1.5:
>
> - The lack of symmetry between the "ifconfig ipmp1 ipmp b" and
> "ifconfig ipmp1 unplumb" operations doesn't look too pretty.

The asymmetry is annoying, but the alternatives are:

* Use "plumb ipmp" for creation: this is problematic because
leaving out the word "ipmp" would do something *totally*
different, which I found unacceptable.

* Invent a synonym for "unplumb" which must be used with IPMP
interfaces. That seemed gratuitous and a bit user-hostile
(who wants to remember a second command?)

Anyway, I suspect the xDesign guys will have an opinion on this, so let's
see what they have to say.

> p16, section 4.1.6:
>
> - Who does this address migration?

It hasn't been decided -- either the kernel, or ifconfig.

> - Administrative issue to be documented: accidentally setting a
> group name (or the wrong group name) on an up interface is now
> highly toxic. It means that the address slips out from the
> administrator's control, and doesn't come back unless he does a
> series of unintuitive commands. (I.e., just clearing out the
> group name won't fix the problem.)

Yes, that's a risk. Footnote added.

> - An IPMP group with no member links has IFF_RUNNING cleared, right?
> Is it also IFF_FAILED?

Yes -- it's not usable. I've updated section 5.3 to cover this.

> p17, section 4.2:
>
> - What's the privilege model for the new command?

As per the prompt, any user can run it. It may internally require some
privileges to work, but the specifics aren't known yet. I've updated the
document to mention this.

> - I like the flag-verbs, but I'm nearly certain that xDesign won't.

:-) It seems wrong to end up with a command that consists of nothing but
show-* subcommands, so maybe I have an argument. We'll see.

> - I suggest that you create a machine parseable output format _now_,
> since Explorer-consumers are growing rampant and are nailing all
> the other utilities that don't have parseable forms.

Agreed; I have now defined one -- see section 4.2.6. Better ideas are
welcome, but I'd prefer not to have to rope off too many meta-characters
(right now, just "=" and "\n" are roped off).

> p18, section 4.2.1:
>
> - What shows in "-g" output under the 'fdt' column if the group
> doesn't have probe targets or doesn't have test addresses?

"n/a" -- updated.

> - "Degraded" might need a tighter definition. In particular, if I
> have a group with one stand-by interface, and one of the main
> interfaces has failed over to the stand-by, is that group now in
> "degraded" mode? It's not degraded from the bandwidth or
> availability point of view, though perhaps it is from a hardware
> maintenance point of view.

Yes, that group would now be "degraded". I agree a tighter definition is
needed, and I will talk to the FMA guys about this. (Thanks for reminding
me about this!)

> p18, section 4.2.2:
>
> - Should there be a "-n" option to suppress address-to-name
> translation? (And should "names" be the default the way they are
> most everywhere else?)

What name translation?

> - Suggest using "--" rather than "n/a" for consistency with other
> existing tools.

Can't say I have a strong preference here. Let's wait to see what the
xDesign guys have to say.

> - Can "-a" or some other tool show which interface is the current
> multicast/broadcast "lead" interface for duplicate suppression?
> This part still isn't observable.

Good point. I've added it as a "flags" field member to ipmpstat -i.

> - How does the utility get this information? Via ARP ioctls or some
> other mechanism?

The design of ipmpstat will be covered in a separate document, but: it
will be a mix of ioctl's and calls through libipmp into in.mpathd.

> p19, section 4.2.3:
>
> - What permissions does "-i" need? Won't it need to be root to look
> at the DLPI driver for the "probe" column? (Or is some other
> magic afoot?)

As implied by the prompt, none will be required by the user. However, it
may need some subset of privileges to actually work. TBD.

> - What does the "active" column represent? It's not really
> explained. (It doesn't just appear to be the inverse of
> IFF_INACTIVE.)

Check the glossary (the original definition is back in section 3.8).

> - Why is it impossible to report link up/down status when the link
> is offline? Shouldn't the system still monitor link up/down
> status, even while the link is administratively offlined? Or is
> there some interference here with DR?

Once an interface is offlined, it cannot be attached to with DLPI. So,
there is no way to access the link up/down status. Footnote added.

> p20, section 4.2.5:
>
> - Is it really possible to get probes that march backwards in time,
> as shown by the 1438 to 1439 transition? Or is that just a
> cut-and-paste issue?

It depends on some aspects of the output format that I haven't decided on.
If things remain ordered by sequence number (which I prefer), then it's
entirely possible that responses on some interfaces arrived before those
on others. However, that raises the question of whether to delay all
output "waiting" for lost probes -- so I may end up changing things to be
sorted by time.
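
As a rough illustration of the trade-off described above (not the actual
ipmpstat output logic, which is undecided), the sketch below sorts the same
probe results both ways. The ProbeResult record, the interface names, and the
timestamps are all made up for the example, and it assumes one sequence space
shared by all interfaces.

```python
from collections import namedtuple

# Hypothetical record: probe sequence number, interface the probe used,
# and the time (in seconds) at which its response arrived.
ProbeResult = namedtuple("ProbeResult", "seq interface arrival")

results = [
    ProbeResult(seq=2, interface="ce1", arrival=10.40),
    ProbeResult(seq=1, interface="ce0", arrival=10.55),
    ProbeResult(seq=3, interface="ce0", arrival=10.90),
]

# Ordering by sequence number keeps probes in the order they were sent ...
by_seq = sorted(results, key=lambda r: r.seq)
# ... while ordering by arrival keeps the displayed time monotonic.
by_time = sorted(results, key=lambda r: r.arrival)


def time_goes_backwards(ordered):
    """Return True if any line would show an earlier time than the line before it."""
    return any(later.arrival < earlier.arrival
               for earlier, later in zip(ordered, ordered[1:]))
```

With these sample values, the sequence-ordered listing shows time stepping
backwards between the first two lines, while the time-ordered listing does not.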

> - Should the column header be "seq" instead of "probe?"

That seems too geeky.

> - Why does ipmpstat need to query the IPMP subsystem periodically?
> Can't it just block awaiting notification from IPMP?

It could. I've changed the text to be more vague, as the details of that
are really a topic for another document.

> - It might be helpful to have the time displayed in some way that's
> aligned with snoop. (Though exactly how, I'm not sure.)

Yeah, dunno how to do that.

> p21, top of page:
>
> - When exactly is a packet declared "lost?" Is it when we go to
> send another and the previous hasn't arrived yet? Or is it
> related to the "FDT?"

Yes, that's when (and the rate of sending packets is related to the FDT).
I've updated the document (we also need to update the public IPMP
documentation to cover this -- sigh).

> p22, section 4.3.2:
>
> - There seems to be some surprising (and unintended) new
> functionality here. If I manage things by group name, then I
> don't need to remember which ipmp group is which in order to add a
> new address. I can just do something like this:
>
> # ifconfig foobar0 plumb group a 10.0.0.1 up
>
> and since "foobar0" will never exist, this will add the address to
> the named group.

I don't really see how this is new -- I can craft up arbitrary
hostname.<if> files today and achieve similar results. Note that if group
"a" doesn't exist at all by the time the system gets to handling missing
interfaces, the above will be ignored.

> p23, 'route' changes:
>
> - If this functionality is implemented in the 'route' command
> itself, rather than in the kernel, what does that mean for
> existing utilities? It seems like the "add static route" feature
> in Zebra and the like will be harmed by this.

I'd prefer to isolate this to route. Why would zebra be adding routes to
the underlying interfaces?

> (For what it's worth, I think those utilities are probably blown
> out of the water by removing "ce0" from the SIOCGIFCONF data, and
> will need manual intervention to convert their configurations
> over. I hope that there's not much mixed IPMP/Zebra usage.)

Why would they want to know about ce0? Please elaborate.

> p23, section 4.6.1:
>
> - "duplicate address detection will be used to ensure that no other
> on-link hosts are currently using it." On *what* link? I assume
> this means that one will be chosen arbitrarily and just used.

Yes; clarified.

> - Why is the link-local address unreachable if there are no
> interfaces in the group? Aren't local addresses always reachable?

The intent was to state that it's not reachable via another host. Of
course it can be locally used. Updated.

> - Why would an IPv6 ipmp interface have the BROADCAST flag set?

Clearly a mistake; fixed.

> p24, section 4.6.2:
>
> - NumAddrs: ew. This should really be based on the number of member
> links in the IPMP group. You'll want to have at least one data
> address per member link in order to get the inbound load-spreading
> right. It'd be better still if in.ndpd just did the right thing.

I'm fine with having in.ndpd try to initially configure as many global
addresses as there are interfaces, but I'm not sure what to do if an
interface is removed -- is it really okay to blow away a global address at
that point? It *feels* wrong to do that.

Document updated.

> p25, top of page:
>
> - Need to know exactly how statistics (particularly errors) are
> handled. At a guess, I think you'll need to keep a record of what
> the error counter for a member link was at the time it joined, so
> that you can add the delta during membership to the total.
> Otherwise, removing a link from a group could cause the counters
> to roll backwards, and that's a big No-No.

Right. I haven't decided on the implementation yet. I think this is too
low-level for this document.
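
For what it's worth, the baseline-and-delta bookkeeping suggested in the
quoted comment could look roughly like the sketch below. This is only an
illustration of the idea, not a chosen implementation; the GroupCounter class
and its method names are invented for the example.

```python
class GroupCounter:
    """Aggregate of one per-link counter (e.g. input errors) for a group.

    Hypothetical helper: only the growth of a link's counter while it is
    a member is counted, and that growth is retained after it leaves.
    """

    def __init__(self):
        self._retired = 0    # deltas contributed by links that have left
        self._baseline = {}  # link name -> counter value when it joined

    def join(self, link, current):
        # Remember where the link's counter stood at join time.
        self._baseline[link] = current

    def leave(self, link, current):
        # Fold the link's delta into the retired total so that removing
        # the link cannot make the group counter go backwards.
        self._retired += current - self._baseline.pop(link)

    def total(self, current_values):
        # current_values maps each current member link to its counter now.
        live = sum(current_values[link] - base
                   for link, base in self._baseline.items())
        return self._retired + live
```

A link that joins with 100 errors already on its counter and leaves at 110
contributes 10 to the group total, both before and after it leaves.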

> p25, section 4.8:
>
> - What needs to change in ARP so that multiple L2 addresses are
> accepted as local by the system? It seems to me that, since ARP
> is plumbed over each real interface, ARP will need to be in on the
> game that IPMP is playing. There need to be signals between IPMP
> and ARP to accomplish this.

Yep, this will be covered in a low-level document. But first, we need to
get some code running to see what approach makes the most sense.

> p26, section 4.10:
>
> - "Sent to and received from" in the context of /dev/ipnet/ipmp*
> actually means sent or received on any member link, and not just
> filtered based on address. Right?

That depends on whether it's in promiscuous-mode or not. If in
promiscuous-mode: yes. I've updated the document to be clearer.

> p28, section 5.1:
>
> - One of the applications affected is SNMP.

Footnote added.

> p29, section 5.2:
>
> - 1: If an address is up and marked IFF_NOFAILOVER, can I cause the
> address to migrate by clearing IFF_NOFAILOVER?

Yes. In response to other review feedback, this is indirectly covered in
section 4.1.3.

> - 2: Is IFF_INACTIVE modified both by the kernel and by
> applications?

No, only by applications (specifically in.mpathd, though I suppose
anything could have a whack at it if it really wanted).

> - 4: Why isn't IFF_COS_ENABLED set? This will likely break IPQoS.
> Shouldn't it just be the logical AND of all the IFF_COS_ENABLED
> bits on the underlying interfaces?

IPQoS is already broken [rimshot]. "Fixed."
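
The logical-AND rule proposed in the quoted comment can be sketched as
follows. The flag value is a placeholder bit rather than the real definition
from <net/if.h>, and treating an empty group as not CoS-enabled is an
assumption of this sketch, not something the design states.

```python
IFF_COS_ENABLED = 1 << 0  # placeholder bit, not the real <net/if.h> value


def group_cos_enabled(member_flags):
    """Return True only if every underlying interface has IFF_COS_ENABLED set.

    member_flags is a list of flag words, one per underlying interface.
    """
    if not member_flags:
        return False  # assumption: a group with no members is not CoS-enabled
    return all(flags & IFF_COS_ENABLED for flags in member_flags)
```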

> - 6: When adding a new interface to a group, what happens? Is the
> interface's IFF_ROUTER flag changed to match the flag used for the
> existing group?

Yes. Clarified.

> - 7: What does IFF_XRESOLV mean on an ipmp interface?

You're right, it will not be supported. Likewise with IFF_NOARP and
IFF_NONUD. Fixed.

>
> p30, table 2:
>
> - Might be nice to indicate which ones are physical and which are
> logical.

It's not that simple -- there are four levels of hierarchy:

* Per ill_t
* Per ill_t address family (IPv6)
* Per ipif_t
* Per IPMP group

I thought the table was overwhelming as-is, and I couldn't see a clean way
to incorporate this information.

> - Why would IFF_MIPRUNNING appear on any underlying interface?

Probably no reason, but there's nothing that would prevent it and I can't
see a point in stopping it.

> - Why can't IFF_MULTI_BCAST be visible on an ipmp interface?

Are you making the same logical-AND argument as for IFF_COS_ENABLED? If so,
okay.

> p31, top of page:
>
> - Is the automatic IFF_DEPRECATED logic done in the kernel or user
> space?

Kernel; clarified.

> - As before, I'm not really sure why clearing IFF_RUNNING on the
> member links in response to IFF_FAILED makes sense. I'd much
> rather see these two flags retain their original meaning, and have
> the ipmp aggregate interface just show ~IFF_RUNNING when all
> member interfaces have *either* IFF_FAILED set *or* IFF_RUNNING
> cleared. That would better illustrate the layering, as the
> "failed" concept is really an artifact of how IPMP works
> internally.

See my earlier response.
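
For reference, the alternative rule proposed in the quoted comment (leave the
member flags alone and derive the group's IFF_RUNNING from them) could be
sketched like this. The flag values are placeholder bits, not the real
<net/if.h> definitions, and the function names are made up for illustration.

```python
IFF_RUNNING = 1 << 0  # placeholder bits, not the real <net/if.h> values
IFF_FAILED = 1 << 1


def member_usable(flags):
    """A member link is usable if it is running and has not been marked failed."""
    return bool(flags & IFF_RUNNING) and not (flags & IFF_FAILED)


def group_running(member_flags):
    """The group shows IFF_RUNNING only while at least one member is usable.

    Equivalently, IFF_RUNNING is cleared when every member has either
    IFF_FAILED set or IFF_RUNNING cleared (including the empty group).
    """
    return any(member_usable(flags) for flags in member_flags)
```

Under this rule a member can be IFF_RUNNING and IFF_FAILED at the same time;
only the group-level flag combines the two.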

> p31, section 5.4.1:
>
> - There's no such thing as a "routing socket associated with the
> IPMP interface." There are just global routing sockets; they're
> associated with IP itself, not any particular interface.

Wow, I must've been really tired when I wrote that section. Fixed.

> p31, section 5.4.2:
>
> - This section is a little confusing, because it talks about the
> possibility of seeing the IPMP member links well before it says
> how this might happen.

I removed the first sentence; hopefully this makes it clearer.

> p32, top of page:
>
> - I had trouble reading this. I assume it means just that
> RTM_NEWADDR will occur during address transfer, and not that
> IFF_NOFAILOVER is the only possible way this message could be
> sent.

How else could it happen? Any UP data addresses have already migrated to
the IPMP interface.

> - Is any message sent to the user when the SO_RTSIPMP flag is set or
> cleared? If not, then how does the user know that he's got a
> consistent view of the world? (I suspect the answer is that the
> user should set the flag before doing SIOCGLIFCONF with LIFC_IPMP,
> and then just _never_ clear it.)

How does the user know he's got a consistent state of the world with
"normal" routing sockets? I think it's the same -- an application would
open a routing socket, set SO_RTSIPMP, do an SIOCGLIFCONF, build its
state, and listen for routing socket messages to update that state.

> p32, section 5.5:
>
> - If we did allow routes to interfaces, then the behavior in section
> 4.5 would become a bit weird.

It *already* is a bit weird ;-)

> p32, section 5.6:
>
> - What happens if an ifindex of a member link is used? (I assume it
> results in an error.)

Yes; updated (in.mpathd makes use of the undocumented IP_DONTFAILOVER_IF
option to guarantee that probes go out a specific interface).

> p33, section 5.7:
>
> - It would be nice if SIOCLIFADDIF worked to add the zeroth address
> on the interface when the address configured there is 0.0.0.0.
> This would remove one bit of asymmetry from the current design.

Yes, but fixing that is out-of-scope.

> p33, section 5.9:
>
> - What do SIOC[GS]LIFLNKINFO and SIOC[GS]LIFMUXID mean on ipmp
> interfaces? The latter seems senseless, as no real plumbing can
> occur there.

Since in.ndpd runs on IPMP interfaces, my understanding is that it
would use SIOC[GS]LIFLNKINFO. SIOC[GS]LIFMUXID is just provided for
completeness.

> - What do SIOC[GS]LIFMETRIC mean on member link interfaces? This
> doesn't make sense, as routing (the consumer of interface metrics)
> won't use them. Does something in IPMP itself use them?

Could you explain more about how routing currently makes use of LIFMETRIC?
Since routing daemons will run over the IPMP interface, I'd (clearly
incorrectly) assumed that it would use this ioctl.

> p34, section 5.10:
>
> - Probably need more detailed behavior for SIOCDARP.

Please elaborate -- what more would you like to see?

>
> p34, section 5.12:
>
> - What happens if I set the zone ID to a non-global zone first, and
> then set the IFF_NOFAILOVER flag?

It will also fail. Updated.

> p35, section 5.14:
>
> - It might be worth mentioning that IGMP operation under IPMP is
> really unclear. (Should membership messages be echoed out all
> interfaces so that switches know that all interfaces are
> participating? Or just over the "lead" interface? If it's the
> latter, shouldn't the IGMP messages be repeated if a new "lead" is
> chosen?)

I agree it should be discussed. Let me look into the issues involved; I
will update the document once I have answers.

> p35, section 5.15:
>
> - "Reverse-engineered" might be a little strong. The same interface
> exists on other Mentat-derived systems, and (like fast-path) is
> probably documented there.

I was not aware of that; changed to "discovered".

> p36, section 5.16:
>
> - If the vni driver goes away, does the vni(7D) man page go too?

Yes; made explicit.

> p36:
>
> - Any changes for PSARC 2002/137 (IPMP Asynchronous Event
> Definitions) due to this project?

Probably. That's not a documented interface, so it was not discussed.
However, this is of little concern, as it seems that Sun Cluster never
made use of the definitions.

Thanks again,
--
meem



paulj

Posts: 215
From: Scotland

Registered: 9/15/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 2:54 AM   in response to: meem


Hi Peter,

Footnote 9 is confusing, does it refer to new IPMP or old? The footnote
is tacked onto new, but seems to refer to 'old' style IPMP?

On Wed, 28 Sep 2005, Peter Memishian wrote:
> Jim wrote:

> > - What is the MTU on an IPMP interface? Is it the minimum of all
> > member links?

In a similar vein what about baud_rate and metric? (I don't think
baud_rate is set to anything useful at the moment, but I wouldn't mind
seeing it made useful in future. See further below for metric).

> > In fact, I'm not sure why IFF_RUNNING on the member links would be
> > cleared out by IFF_FAILED. It's not as though ordinary
> > applications would ever see those interfaces, so they cannot be
> > confused by the meaning of the extra bits; they need to set
> > special Solaris-specific flags to see them at all. So why the
> > interlock?
>
> To make it clear to someone using ifconfig or other administrative tools.

But the interface may well be functioning fine, it may be the probe
target(s) alone which have failed, while other on-link hosts are still
reachable. An application that specifically asked to see the (otherwise
hidden) underlying interfaces might well be interested in the
difference.

Further, clearing IFF_RUNNING due to IFF_FAILOVER is going to cause
problems for routing socket listeners where:

1. There are both IPMP-member logical IP interfaces and non-IPMP-member
IP logical interfaces bound to the same physical interfaces

2. The 0th logical interface is an IPMP member.

3. The application uses the if_msghdr if_flags field to retrieve
physical interface related flags (rather than GLIFFLAGS)

See further below.

> > p23, 'route' changes:
> >
> > - If this functionality is implemented in the 'route' command
> > itself, rather than in the kernel, what does that mean for
> > existing utilities? It seems like the "add static route" feature
> > in Zebra and the like will be harmed by this.
>
> I'd prefer to isolate this to route. Why would zebra be adding routes to
> the underlying interfaces?

> > (For what it's worth, I think those utilities are probably blown
> > out of the water by removing "ce0" from the SIOCGIFCONF data, and
> > will need manual intervention to convert their configurations
> > over. I hope that there's not much mixed IPMP/Zebra usage.)
>
> Why would they want to know about ce0? Please elaborate.

They wouldn't want to, but they may already have definitions in their
configuration for these interfaces from pre-new-IPMP which would then
need to be migrated over to the new ipmpX interface. However, I don't
think it'd be a problem (for 'zebra' at least, given the
incompatibilities with the older model).

> Could you explain more about how routing currently makes use of
> LIFMETRIC? Since routing daemons will run over the IPMP interface, I'd
> (clearly incorrectly) assumed that it would use this ioctl.

It could also try acquire the metric from the IFINFO message. It's an
administrative metric used to seed any originated routes deriving from
that route and also influence route calculation.

I can't think of any way how you'd reconcile differing metrics of member
links though to arrive at an 'aggregate' metric. Hence I'd suggest that
the only compatible way would be to only use the member interfaces with
the same best metric which are active, and treat any other lower-metric
interfaces as STANDBY.

> > - If the vni driver goes away, does the vni(7D) man page go too?
>
> Yes; made explicit.

Hmm, the VNI driver is useful, eg for hosting addresses on - if you
wanted more than 8192 addresses. ;)

I have a question about routing socket behaviour:

- Is IPMP group membership a per-logical interface thing? Ie is it
possible to have a set of logical interfaces where some addresses are
members of an IPMP group (NOFAILOVER, and hence meant to be hidden)
and where some are not (and hence not meant to be hidden)?

If this is the case could I suggest the following:

- Do *not* suppress RTM_INFO events (listeners very likely key internal
interface creation/deletion events on this event, and/or use it to
'grab' PHYINT flags - IFINFO is only ever sent for 0th interface)

- Suppress/fake *just* the address related events, RTM_{NEW,DEL}ADDR
pertaining to the IPMP address, as appropriate.

- Clear IPMP related flags (IFF_FAILOVER particularly) from IFINFO for
all interfaces bar IFF_IPMP interfaces. However, this is going to have
issues with the proposed IFF_FAILOVER/IFF_RUNNING mirroring scheme.

Otherwise applications will potentially receive RTM_NEWADDR's for
addresses (the normal, not FAILOVER address) on interfaces which they
never received an IFINFO for.

The worst case with this modification is that you send IFINFO for
interfaces with only IFF_FAILOVER addresses and hence you never actually
send any RTM_NEWADDRs for that ifindex, but an application /must/ be
able to handle this already anyway.

Note that the answer might well be to fix routing sockets, rather than
change the IPMP proposal.

regards,

--paulj




paulj

Posts: 215
From: Scotland

Registered: 9/15/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 4:40 AM   in response to: paulj


On Wed, 28 Sep 2005, Paul Jakma wrote:

> But the interface may well be functioning fine, it may be the probe target(s)
> alone which have failed, while other on-link hosts are still reachable. An
> application that specifically asked to see the (otherwise hidden) underlying
> interfaces might well be interested in the difference.
>
> Further, clearing IFF_RUNNING due to IFF_FAILOVER is going to cause problems
> for routing socket listeners where:
>
> 1. There are both IPMP-member logical IP interfaces and non-IPMP-member
> IP logical interfaces bound to the same physical interfaces
>
> 2. The 0th logical interface is an IPMP member.
>
> 3. The application uses the if_msghdr if_flags field to retrieve
> physical interface related flags (rather than GLIFFLAGS)

Ah, and even if the app does go do a GLIFFLAGS, it only has name of the
0th interface anyway (at least via IFINFO), so it won't get the
logical interface flags.

> - Clear IPMP related flags (IFF_FAILOVER particularly) from IFINFO for
> all interfaces bar IFF_IPMP interfaces. However, this is going to have
> issues with the proposed IFF_FAILOVER/IFF_RUNNING mirroring scheme.

One answer to both this and the first paragraph (apps that want to
know link-state of member interfaces for some strange reason) would be
to introduce support for the BSD if_link_state - it would /explicitly/
reflect link-state and link-state only.

That would also solve the routing-socket problems for applications which
are updated to look at link-state instead, rather than the RUNNING.

--paulj




meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 8:55 AM   in response to: paulj



> Footnote 9 is confusing, does it refer to new IPMP or old? The footnote
> is tacked onto new, but seems to refer to 'old' style IPMP?

It refers to the new model. I will update it to be clearer.

> > > - What is the MTU on an IPMP interface? Is it the minimum of all
> > > member links?
>
> In a similar vein what about baud_rate and metric? (I don't think
> baud_rate is set to anything useful at the moment, but I wouldn't mind
> seeing it made useful in future. See further below for metric).

IPMP does not work over IFF_POINTOPOINT -- so I do not see the
relevance of baud_rate.

> > > In fact, I'm not sure why IFF_RUNNING on the member links would be
> > > cleared out by IFF_FAILED. It's not as though ordinary
> > > applications would ever see those interfaces, so they cannot be
> > > confused by the meaning of the extra bits; they need to set
> > > special Solaris-specific flags to see them at all. So why the
> > > interlock?
> >
> > To make it clear to someone using ifconfig or other administrative tools.
>
> But the interface may well be functioning fine, it may be the probe
> target(s) alone which have failed, while other on-link hosts are still
> reachable. An application that specifically asked to see the (otherwise
> hidden) underlying interfaces might well be interested in the
> difference.

Right. In that case, the underlying interface will not be marked
IFF_FAILED. However, if the underlying interface is IFF_FAILED, then it
should also have IFF_RUNNING cleared to make it clear to applications that
it's not usable.

> Further, clearing IFF_RUNNING due to IFF_FAILOVER is going to cause
> problems for routing socket listeners where:

You mean IFF_FAILED, not IFF_FAILOVER, right?

> 1. There are both IPMP-member logical IP interfaces and non-IPMP-member
> IP logical interfaces bound to the same physical interfaces
>
> 2. The 0th logical interface is an IPMP member.
>
> 3. The application uses the if_msghdr if_flags field to retrieve
> physical interface related flags (rather than GLIFFLAGS)

None of this is possible because IPMP membership is an interface property,
not a logical interface property.

> > Why would they want to know about ce0? Please elaborate.
>
> They wouldn't want to, but they may already have definitions in their
> configuration for these interfaces from pre-new-IPMP which would then
> need to be migrated over to the new ipmpX interface. However, I don't
> think it'd be a problem (for 'zebra' at least, given the
> incompatibilities with the older model).

Since IPMP and routing do not currently work together, I don't think we
need to worry about migration. However, we will need to explain to folks
how to take their existing non-IPMP configurations and make them work in
an IPMP environment.

> > Could you explain more about how routing currently makes use of
> > LIFMETRIC? Since routing daemons will run over the IPMP interface, I'd
> > (clearly incorrectly) assumed that it would use this ioctl.
>
> It could also try acquire the metric from the IFINFO message. It's an
> administrative metric used to seed any originated routes deriving from
> that route and also influence route calculation.
>
> I can't think of any way how you'd reconcile differing metrics of member
> links though to arrive at an 'aggregate' metric. Hence I'd suggest that
> the only compatible way would be to only use the member interfaces with
> the same best metric which are active, and treat any other lower-metric
> interfaces as STANDBY.

The applications will not be aware of member interfaces, so I don't see
how that would work.

> > > - If the vni driver goes away, does the vni(7D) man page go too?
> >
> > Yes; made explicit.
>
> Hmm, the VNI driver is useful, eg for hosting addresses on - if you
> wanted more than 8192 addresses. ;)

The VNI IP interface will still exist. We are only talking about the
implementation.

> I have a question about routing socket behaviour:
>
> - Is IPMP group membership a per-logical interface thing?

No.

> Ie is it
> possible to have a set of logical interfaces where some addresses are
> members of an IPMP group (NOFAILOVER, and hence meant to be hidden)
> and where some are not (and hence not meant to be hidden)?

No.

> Note that the answer might well be to fix routing sockets, rather than
> change the IPMP proposal.

If you want to fix routing sockets, by all means go ahead :-) That work is
too large and too tangential to the IPMP rearchitecture to be done as part
of this work. Nothing in this proposal precludes a sane rearchitecture of
routing sockets.

--
meem



paulj

Posts: 215
From: Scotland

Registered: 9/15/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 9:59 AM   in response to: meem


On Wed, 28 Sep 2005, Peter Memishian wrote:

> IPMP does not work over IFF_POINTOPOINT -- so I do not see the
> relevance of baud_rate.

baud_rate isn't specific to PtP though. However, we don't set it at all
- yet. (But if I happen to figure out where/when MII information is
available and get a chance to store it in the phyint, I'd love to do so
;) ).

> Right. In that case, the underlying interface will not be marked
> IFF_FAILED. However, if the underlying interface is IFF_FAILED, then it
> should also have IFF_RUNNING cleared to make it clear to applications that
> it's not usable.

Ok.

> > Further, clearing IFF_RUNNING due to IFF_FAILOVER is going to cause
> > problems for routing socket listeners where:
>
> You mean IFF_FAILED, not IFF_FAILOVER, right?

Yes :).

> None of this is possible because IPMP membership is an interface property,
> not a logical interface property.

Ah ok. Grand so - no problems with route-sock listeners if the member
interfaces will be completely invisible.

> Since IPMP and routing do not currently work together, I don't think
> we need to worry about migration. However, we will need to explain to
> folks how to take their existing non-IPMP configurations and make them
> work in an IPMP environment.

Yep.

> The applications will not be aware of member interfaces, so I don't
> see how that would work.

The application won't be, but if there are differing metrics between the
member interfaces, maybe IPMP should try honour the metrics?

> > - Is IPMP group membership a per-logical interface thing?
>
> No.

Ah, ok.

So logical subnets will not be possible on such physical interfaces? If
you wanted that, you'd create additional IPMP groups, right?

> If you want to fix routing sockets, by all means go ahead :-) That
> work is too large and too tangential to the IPMP rearchitecture to be
> done as part of this work. Nothing in this proposal precludes a sane
> rearchitecture of routing sockets.

:)

I plan to try some experiments later in the year, might be difficult to
do and remain fully backward compatible though :(. We'll see.

--paulj




meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 10:35 AM   in response to: paulj



> > IPMP does not work over IFF_POINTOPOINT -- so I do not see the
> > relevance of baud_rate.
>
> baud_rate isn't specific to PtP though. However, we don't set it at all
> - yet. (But if I happen to figure out where/when MII information is
> available and get a chance to store it in the phyint, I'd love to do so
> ;) ).

Could you say more about what broadcast networks operate over a modem? In
any case, once you have specifics, I'm sure we can work it in.

> > The applications will not be aware of member interfaces, so I don't
> > see how that would work.
>
> The application won't be, but if there are differing metrics between the
> member interfaces, maybe IPMP should try honour the metrics?

I don't see how that could be used properly by an application, as an
application has no idea which interface (in the group) a given packet will
be sent over. So, I think the IPMP group interface's metric needs to be the
maximum metric associated with any interface in the group. It could also be
argued that configuring different metrics on different underlying
interfaces is an administrative error.
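
A minimal sketch of that rule, with made-up function names and a plain dict
standing in for the group's member interfaces: the group reports the largest
member metric, and a separate check flags differing metrics as a probable
misconfiguration.

```python
def group_metric(member_metrics):
    """Metric reported for the IPMP group interface.

    member_metrics maps each underlying interface name to its metric.
    The group takes the maximum, as suggested above.
    """
    if not member_metrics:
        raise ValueError("group has no underlying interfaces")
    return max(member_metrics.values())


def metrics_consistent(member_metrics):
    """Return True if all underlying interfaces share the same metric."""
    return len(set(member_metrics.values())) <= 1
```

For example, a group whose members have metrics 0 and 2 would report 2, and
metrics_consistent() would return False for it.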

> So logical subnets will not be possible on such physical interfaces? If
> you wanted that, you'd create additional IPMP groups, right?

The addresses on each physical interface need not be on the same subnet
(e.g., you could have multiple subnets on the same link).

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 10:54 AM   in response to: meem


Peter Memishian writes:
> > > IPMP does not work over IFF_POINTOPOINT -- so I do not see the
> > > relevance of baud_rate.
> >
> > baud_rate isn't specific to PtP though. However, we don't set it at all
> > - yet. (But if I happen to figure out where/when MII information is
> > available and get a chance to store it in the phyint, I'd love to do so
> > ;) ).
>
> Could you say more about what broadcast networks operate over a modem? In
> any case, once you have specifics, I'm sure we can work it in.

"baud_rate" is a lousy name, as it clearly makes people think
"modems." "Bit rate" is better.

Interfaces do have various important metrics: nominal speed and delay
are two of the more important ones, but there are certainly others,
such as indications of shared facilities for those concerned about
path diversity.

In an aggregate interface, such as ipmp0, you often have to represent
both the aggregate speed (sum of all the member links) as well as the
largest reservable chunk (speed of fastest link).

But that's probably way overkill given the lack of constraint-based
routing. Merely showing the speed of the fastest link (as the
'ifspeed' kstat from "MIB-II KSTATS" [PSARC 1997/198]) might be
sufficient.

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 11:07 AM   in response to: carlsonj



> "baud_rate" is a lousy name, as it clearly makes people think
> "modems." "Bit rate" is better.

Ah, I see what was meant now.

> In an aggregate interface, such as ipmp0, you often have to represent
> both the aggregate speed (sum of all the member links) as well as the
> largest reservable chunk (speed of fastest link).
>
> But that's probably way overkill given the lack of constraint-based
> routing. Merely showing the speed of the fastest link (as the
> 'ifspeed' kstat from "MIB-II KSTATS" [PSARC 1997/198]) might be
> sufficient.

Why not slowest (i.e., the most we can guarantee)?

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 11:10 AM   in response to: meem


Peter Memishian writes:
> > But that's probably way overkill given the lack of constraint-based
> > routing. Merely showing the speed of the fastest link (as the
> > 'ifspeed' kstat from "MIB-II KSTATS" [PSARC 1997/198]) might be
> > sufficient.
>
> Why not slowest (i.e., the most we can guarantee)?

Hence the problem ...

Any of those answers would actually be fine by me.

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 11:15 AM   in response to: carlsonj



> Hence the problem ...
>
> Any of those answers would actually be fine by me.

Under what situation would it be ideal to report the fastest link
speed?

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 12:22 PM   in response to: meem


Peter Memishian writes:
>
> > Hence the problem ...
> >
> > Any of those answers would actually be fine by me.
>
> Under what situation would it be ideal to report the fastest link
> speed?

Not sure about "ideal," but it would help in distinguishing a group
that has all low-speed interfaces from one that has mostly high-speed
interfaces.

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



paulj

Posts: 215
From: Scotland

Registered: 9/15/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 1:20 PM   in response to: meem


On Wed, 28 Sep 2005, Peter Memishian wrote:

> Could you say more about what broadcast networks operate over a modem?

Ethernets have a baud rate too, I'd be hard-pressed to think of a
network type which didn't.

It has become corrupted though to generally mean "bit rate", despite the
fact that often baud-rate != bit-rate, e.g. GigE has a bit-rate of 1Gb/s
(b = bit), but a baud rate of ~125Mb/s (b = baud) per pair, 500Mbaud
in total, or somesuch.

But generally it's taken to be bit/s.

> In any case, once you have specifics, I'm sure we can work it in.

See above. It's the "band-width" of the interface :). And it would be
useful to report it if possible - so that OSPF would not need to have
interface 'bandwidth' administratively defined. (Another corrupted use
of a term, I know ;) ).

> I don't see how that could be used properly by an application, as an
> application has no idea which interface (in the group) a given packet
> will be sent over.

The application wouldn't have a use, no. But IPMP shouldn't allow
different metric interfaces to be joined together, at least - the
non-best metric interfaces should be STANDBY or somesuch.

> So, I think the IPMP group interface needs to be the maximum metric
> associated with any interface in the group.

Hmm, no. That would clash if there were some other interface with a
metric 'in between'. Remember, routing protocols /will/ make use of this
metric if it is present to decide which interfaces to install routes out
of, and possibly with what protocol cost to advertise certain
addresses/routes to others. If a metric is set, it is set by the
administrator and presumably for good reason.

eg a system with:

ipmp0: with members bge0 (metric 100) and bge1 (metric 1000)
bge2: metric X

If X is 1000, a routing application might consider ipmp0 and bge2 to be
wholly equal - which clearly they are not. Lacking ECMP it might decide
(arbitrarily) to use bge2, when clearly ipmp0 is the better interface
(since underlying interface bge1 has the same metric, and bge0 a much
better metric).

Similar problems if you report the best-member-metric instead.
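
The clash can be sketched in a few lines of Python. This is a hypothetical
model, not anything from the design document: it assumes a routing daemon
that simply prefers the lowest interface metric, takes X = 1000, and has
ipmp0 report the maximum of its members' metrics. The function name and the
dictionaries are made up for illustration.

```python
# Hypothetical model of the example above; the selection rule is an assumption.
members = {"bge0": 100, "bge1": 1000}   # metrics of ipmp0's member links


def lowest_metric_interfaces(metrics):
    """Return the interfaces a lowest-metric-wins daemon would rate as best."""
    best = min(metrics.values())
    return sorted(name for name, metric in metrics.items() if metric == best)


# ipmp0 reports the maximum member metric; bge2 is configured with X = 1000.
view = {"ipmp0": max(members.values()), "bge2": 1000}
tied = lowest_metric_interfaces(view)   # both interfaces look equally good
print(tied)
```

Under these assumptions the daemon sees a tie, even though one member of
ipmp0 has a much better metric than bge2.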

> It could also be argued that configuring different metrics on
> different underlying interfaces is an administrative error.

That seems a simple and perfectly fine answer.

> The addresses on each physical interface need not be on the same subnet
> (e.g., you could have multiple subnets on the same link).

Neat :).

--paulj




carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 1:27 PM   in response to: paulj


Paul Jakma writes:
> > In any case, once you have specifics, I'm sure we can work it in.
>
> See above. It's the "band-width" of the interface :). And it would be
> useful to report it if possible - so that OSPF would not need to have
> interface 'bandwidth' administratively defined. (Another corrupted use
> of a term, I know ;) ).

I think the discussion is getting pretty far off the topic of IPMP
redesign ... but we already have an interface speed reporting
mechanism as part of the existing MIB-II family of kstats. It's
called "ifspeed." We won't need another.

> > So, I think the IPMP group interface needs to be the maximum metric
> > associated with any interface in the group.
>
> Hmm, no. That would clash if there were some other interface with a
> metric 'in between'. Remember, routing protocols /will/ make use of this
> metric if it is present to decide which interfaces to install routes out
> of, and possibly with what protocol cost to advertise certain
> addresses/routes to others. If a metric is set, it is set by the
> administrator and presumably for good reason.

I agree. I think the ipmp interface ought to have its own metric,
because this is an administratively-assigned value, not something that
is inherent in the underlying interface. (Thus, it's not really the
same as MTU or speed.)

And, in fact, simply disallowing the user from ever setting or
querying the metric on the underlying links ought to underscore the
issue.

(If you want to copy the metric over from the first physical link that
establishes the group in the case where someone doesn't create ipmp0
explicitly, that might make sense, but I don't think it's really
necessary.)

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



paulj

Posts: 215
From: Scotland

Registered: 9/15/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 1:41 PM   in response to: carlsonj


On Wed, 28 Sep 2005, James Carlson wrote:

> I think the discussion is getting pretty far off the topic of IPMP
> redesign

Sorry :).

> ... but we already have an interface speed reporting mechanism as part
> of the existing MIB-II family of kstats. It's called "ifspeed." We
> won't need another.

Well, ifi_baud_rate exists already, just needs to be updated with this
'ifspeed' - but another matter indeed.

What /is/ of concern to IPMP is what to report for this
ifspeed/ifi_baud_rate if members have differing values. I'd agree it's
fairly arbitrary; the lowest speed would probably be the most conservative
choice though (and would be the best choice for OSPF at least).
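
To make the options concrete, here is a minimal Python sketch of the three
candidate values raised in this thread (sum of all members, fastest member,
slowest member). The function name and the sample speeds are hypothetical;
nothing here is taken from the design document or from the kstat interface.

```python
# Hypothetical helper: names and sample speeds are illustrative only.
def group_speed_candidates(member_speeds_bps):
    """Return the candidate speeds an IPMP group interface could report."""
    if not member_speeds_bps:
        raise ValueError("group has no member links")
    return {
        "aggregate": sum(member_speeds_bps),  # sum of all the member links
        "fastest": max(member_speeds_bps),    # largest reservable chunk
        "slowest": min(member_speeds_bps),    # most conservative figure
    }


# A group with one 1Gb/s member and one 100Mb/s member.
speeds = group_speed_candidates([1_000_000_000, 100_000_000])
print(speeds)
```

Which of the three gets reported is a policy choice; the sketch only shows
that they can differ widely for a mixed-speed group.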

> I agree. I think the ipmp interface ought to have its own metric,
> because this is an administratively-assigned value, not something that
> is inherent in the underlying interface.

> And, in fact, simply disallowing the user from ever setting or
> querying the metric on the underlying links ought to underscore the
> issue.

0 unless set explicitly on the ipmp interface? Even better, yes.

--paulj





meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 2:37 PM   in response to: paulj



> > I agree. I think the ipmp interface ought to have its own metric,
> > because this is an administratively-assigned value, not something that
> > is inherent in the underlying interface.
>
> > And, in fact, simply disallowing the user from ever setting or
> > querying the metric on the underlying links ought to underscore the
> > issue.
>
> 0 unless set explicitly on the ipmp interface? Even better, yes.

Done; see section 5.7 of version 1.3.1 of the document, which I just
posted. The link is the same:

http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf

Thanks guys.
--
meem



Paul Jakma
paul@clubi.ie
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 2:54 PM   in response to: meem


On Wed, 28 Sep 2005, Peter Memishian wrote:

> Done; see section 5.7 of version 1.3.1 of the document, which I
> just posted. The link is the same:
>
> http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf

Ah, could the networking community page link to this? :) Also to
the tunnel doc?

> Thanks guys.

No worries! Night!

regards,
--
Paul Jakma paul at clubi dot ie paul at jakma dot org Key ID: 64A2FF6A
Fortune:
mummy, n.:
An Egyptian who was pressed for time.



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 29, 2005 2:20 PM   in response to: Paul Jakma



> > http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf
>
> Ah, could the networking community page link to this? :) Also to
> the tunnel doc?

Done.

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 10:32 AM   in response to: meem


Peter Memishian writes:
> > Instead, the real term should be just "TCP/IP based."
>
> So UDP and SCTP don't benefit?

UDP, SCTP, ICMP, and many other protocols are typically considered
part of the "TCP/IP protocol suite."

> :-P I can say "AF_INET[6]-based" if you
> prefer. But claiming that IPMP will work with all IP-based applications,
> including those using TLI/XTI, seems a bit too bold to me (and I'm not
> convinced it's time well spent to guarantee it).

It also doesn't work with _all_ sockets-based applications, so I'm not
sure what the point is.

What I'm trying to say is that where this solution works at all, it
works with IPv4 and IPv6, and not anything else, and that there are
non-sockets applications (such as NFS/RPC) that work fine with it, so
"sockets" isn't the right peg for that hat.

[ping problems elided]
> Sadly, these three problems are inherent in the probe-based failure
> detection mechanism. I'm not sure what we can do about them from a
> technical standpoint.

True. As long as it's clear that the list has been trimmed to include
only those things that are fixable by this project (and leaves some
things on the table), I suppose that's ok.

The list just read strangely to me because I was _looking_ to see
those things. (Yes, I realize that one of the issues is the length of
the document. No, I don't think that means that all the issues need
to be included.)

> > . The detailed behavior of general multicast (not the
> > well-known link-local multicast addresses) is less clear.
> > In particular, the behavior necessary to accommodate
> > IGMP-snooping switches is probably missing.
>
> I'm not convinced this is a problem that shapes the *high-level* design.

One of the criticisms leveled against routing is that there's no good
solution for multicast. I'm pointing out that multicast isn't
completely solved here, either.

> > - I'm not sure the described behavior here really represents
> > FAILBACK=no. (But, then, I'm not sure what behavior would really
[...]
> Yes, it's a bit of an odd bird. Personally, I'd love to get rid of this
> feature, but I know there are customers who hate unnecessary rebinding of
> addresses to interfaces (because of the effect it has on other hosts) and
> thus want to have that happen as little as possible.
>
> I've added a little more rationale behind the feature, but I don't want to
> devote too much space to this wart (I'd like to kill it, but I can't).

My point here is that I don't think the new behavior really represents
very well what the old code did. In particular, this case:

(AA) -> (FA) -> (FF) -> (IF) -> (II)

seems to be quite problematic. The old code would have failed over
the addresses to the second interface at that first event, but then
performed no other changes. This would have left the equivalent of
(IA) as the end result, but that's not what this proposed
implementation seems to do. Instead, it leaves the whole group failed
out as "inactive."

Did I read that correctly?

> > p9, section 3.9:
> >
> > - It would help a bit, I think, to segregate out the flags that are
> > on logical interfaces (address flags) from those that are on the
> > underlying physical interface. (The break is after the 5th entry
> > in the first table, and after the second in the second table.)
>
> Segregate how -- with an extra line in the table?

A bold line between the two groups would do it.

> And for what purpose?

The implications of the two sets of flags are very different, and it's
clear that many people looking at the flags (and likely quite a few
reading this document) are just unclear on the difference -- or that
there even is any.

> The mirror-image must be maintained to ensure that naive applications
> behave correctly: if there was a situation where IFF_FAILED was set and
> IFF_RUNNING was set, then an application would try to use an unusable
> interface. (The other case, where IFF_FAILED was clear and IFF_RUNNING
> was clear makes no semantic sense: how can the interface not be
> IFF_RUNNING, but not be IFF_FAILED?)

That latter case does in fact happen -- when in.mpathd isn't running.

I can mostly understand having the kernel clear IFF_RUNNING when
IFF_FAILED is set by the application. But I suspect that you need to
maintain the "real" state underneath so that IP can turn IFF_RUNNING
back on when IFF_FAILED is cleared *AND* the hardware state is
copacetic, and not turn it back on if the hardware state isn't right.
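
A minimal Python sketch of that idea, assuming the hardware link state is
tracked separately from the visible flags. The function name, the link_up
parameter, and the bit values are illustrative placeholders, not copied from
any header or from the design document.

```python
# Illustrative bit values; treat them as placeholders, not as net/if.h.
IFF_RUNNING = 0x40
IFF_FAILED = 0x10000000


def visible_flags(flags, link_up):
    """Derive the flags to expose from IFF_FAILED and the real link state.

    link_up is the "real" hardware state kept underneath. IFF_RUNNING is
    exposed only when the link is up AND IFF_FAILED is clear.
    """
    if link_up and not (flags & IFF_FAILED):
        return flags | IFF_RUNNING
    return flags & ~IFF_RUNNING


# Clearing IFF_FAILED restores IFF_RUNNING only if the hardware agrees.
print(hex(visible_flags(0, link_up=True)))
print(hex(visible_flags(0, link_up=False)))
```

Keeping link_up as separate state is what lets IFF_RUNNING come back
correctly once IFF_FAILED is cleared.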

> > In fact, I'm not sure why IFF_RUNNING on the member links would be
> > cleared out by IFF_FAILED. It's not as though ordinary
> > applications would ever see those interfaces, so they cannot be
> > confused by the meaning of the extra bits; they need to set
> > special Solaris-specific flags to see them at all. So why the
> > interlock?
>
> To make it clear to someone using ifconfig or other administrative tools.

Seeing "FAILED" in the ifconfig output looks pretty clear to me. And
you're proposing changes that make it *certain* that any
administrative tools that can see these underlying interfaces at all
must already be updated to support IPMP.

In addition to that, clearing out RUNNING down at the member link
level means that IPMP-aware applications are *compelled* to use DLPI
to figure out what's going on at the physical layer. That seems
unfortunate, as we previously used IFF_RUNNING for exactly that
purpose.

Moreover, the "FAILED" flag up at the ipmp bundle level doesn't seem
to me to add a lot more value over clearing out RUNNING (which is what
non-IPMP-aware applications will look at).

So, I don't think the FAILED->~RUNNING functionality is actually
needed. (And we've had a bit of a history in getting bits tangled
together, so if it can be avoided, it'd be nice.)

> > p18, section 4.2.2:
> >
> > - Should there be a "-n" option to suppress address-to-name
> > translation? (And should "names" be the default the way they are
> > most everywhere else?)
>
> What name translation?

I'm just asking about parallelism with other *stat commands, such as
netstat. Those tend to print out _names_ rather than raw addresses by
default, and use a "-n" flag to suppress it.

But if you want this one to be different, and always print numeric
addresses, that's fine by me.

> > # ifconfig foobar0 plumb group a 10.0.0.1 up
> >
> > and since "foobar0" will never exist, this will add the address to
> > the named group.
>
> I don't really see how this is new -- I can craft up arbitrary
> hostname.<if> files today and achieve similar results. Note that if group
> "a" doesn't exist at all by the time the system gets to handling missing
> interfaces, the above will be ignored.

The difference is that it fails from ifconfig today:

# ifconfig foobar0 plumb group a 10.0.0.1 up
ifconfig: plumb: foobar0: No such file or directory
#

> > p23, 'route' changes:
> >
> > - If this functionality is implemented in the 'route' command
> > itself, rather than in the kernel, what does that mean for
> > existing utilities? It seems like the "add static route" feature
> > in Zebra and the like will be harmed by this.
>
> I'd prefer to isolate this to route. Why would zebra be adding routes to
> the underlying interfaces?

Because (a) Zebra and other daemons allow you to add static routes and
(b) users' existing configuration files will already mention those
interfaces.

> > (For what it's worth, I think those utilities are probably blown
> > out of the water by removing "ce0" from the SIOCGIFCONF data, and
> > will need manual intervention to convert their configurations
> > over. I hope that there's not much mixed IPMP/Zebra usage.)
>
> Why would they want to know about ce0? Please elaborate.

I don't think they "want" to know about it. Today, they need to
specify "ce0" (or its ifindex) if they want to tie the route to the
interface group. There's no "group" representation, so the interface
name is what they're using.

On upgrade, those configurations will now become unusable because the
interface names have disappeared.

I agree that it's a bit of a corner case -- someone has to be using
the belt-and-suspenders approach of having both IPMP and some routing
daemon on the system. We can just hope this doesn't happen (and
document around it if it does).

> > p24, section 4.6.2:
> >
> > - NumAddrs: ew. This should really be based on the number of member
> > links in the IPMP group. You'll want to have at least one data
> > address per member link in order to get the inbound load-spreading
> > right. It'd be better still if in.ndpd just did the right thing.
>
> I'm fine with having in.ndpd try to initially configure as many global
> addresses as there are interfaces, but I'm not sure what to do if an
> interface is removed -- is it really okay to blow away a global address at
> that point? It *feels* wrong to do that.

Agreed. I think you end up having to support using the maximum number
that were configured at any one time.

In practice, that shouldn't be too bad. It's hard to add an unbounded
amount of hardware to one group.
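
One way to picture "the maximum number that were configured at any one
time" is a high-water mark, sketched below in Python. The class and method
names are hypothetical and this is not how in.ndpd is implemented; it only
illustrates that removing a member never lowers the address target.

```python
class GroupAddressTarget:
    """Track how many data addresses to keep for a group (hypothetical)."""

    def __init__(self):
        self.members = 0      # member links currently in the group
        self.high_water = 0   # most members ever present at one time

    def add_member(self):
        self.members += 1
        self.high_water = max(self.high_water, self.members)

    def remove_member(self):
        if self.members == 0:
            raise ValueError("group has no members")
        self.members -= 1     # high_water is deliberately left unchanged


target = GroupAddressTarget()
for _ in range(3):
    target.add_member()
target.remove_member()
print(target.members, target.high_water)
```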

> > - I had trouble reading this. I assume it means just that
> > RTM_NEWADDR will occur during address transfer, and not that
> > IFF_NOFAILOVER is the only possible way this message could be
> > sent.
>
> How else could it happen? Any UP data addresses have already migrated to
> the IPMP interface.

ifconfig ipmp0:2 10.0.0.1 up?

> > - What do SIOC[GS]LIFMETRIC mean on member link interfaces? This
> > doesn't make sense, as routing (the consumer of interface metrics)
> > won't use them. Does something in IPMP itself use them?
>
> Could you explain more about how routing currently makes use of LIFMETRIC?

Sure. It's taken to be the administrator's reported "expense" of
sending or receiving a packet through that interface, and is used to
adjust the preference of routes within the RIB. (Preferred routes are
selected out and injected into the FIB -- the kernel's forwarding
table.)

For RIP-2, we just add the SIOCLIFMETRIC to the hop count when
figuring the attractiveness of routes we learn over that interface and
when advertising to others.

It provides a way for the administrator to say, "you shouldn't
normally use this interface because it's slow, but if you have no
other choice, go ahead."

It's pretty primitive, but the usage is an ancient BSD-ism.
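
As a rough Python illustration of the RIP-2 case: the sketch below models
only the addition described above, plus the cap at 16, which RIP treats as
unreachable. The function name is hypothetical, and any other increment a
real routing daemon applies per hop is left out.

```python
RIP_INFINITY = 16  # RIP treats a metric of 16 as unreachable


def route_cost(advertised_hops, interface_metric):
    """Add the interface metric to the hop count, capped at RIP infinity."""
    return min(advertised_hops + interface_metric, RIP_INFINITY)


# The same 2-hop route learned over a default and over a "slow" interface.
print(route_cost(2, 0))
print(route_cost(2, 5))
```

A higher metric makes routes through that interface less attractive without
ruling them out, which matches the "use it only if you have no other choice"
intent.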

> Since routing daemons will run over the IPMP interface, I'd (clearly
> incorrectly) assumed that it would use this ioctl.

Right. They will -- on the ipmp interface, not on the underlying
member links.

> > p34, section 5.10:
> >
> > - Probably need more detailed behavior for SIOCDARP.
>
> Please elaborate -- what more would you like to see?

I should have written "SIOCDXARP." That can take an interface name,
which we currently use with ill_lookup_on_name. It's not clear to me
whether that ought to take a member link name or the ipmp interface
name or perhaps both in different contexts.

> > p36:
> >
> > - Any changes for PSARC 2002/137 (IPMP Asynchronous Event
> > Definitions) due to this project?
>
> Probably. That's not a documented interface, so it was not discussed.
> However, this is of little concern, as it seems that Sun Cluster never
> made use of the definitions.

Yoiks! Thanks for the update.

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 28, 2005 6:05 PM   in response to: carlsonj



> > > Instead, the real term should be just "TCP/IP based."
> >
> > So UDP and SCTP don't benefit?
>
> UDP, SCTP, ICMP, and many other protocols are typically considered
> part of the "TCP/IP protocol suite."

Hmm, I find that too easy to misunderstand. I've gone with "IP-based
networking applications". Is that acceptable?

> > Sadly, these three problems are inherent in the probe-based failure
> > detection mechanism. I'm not sure what we can do about them from a
> > technical standpoint.
>
> True. As long as it's clear that the list has been trimmed to include
> only those things that are fixable by this project (and leaves some
> things on the table), I suppose that's ok.
>
> The list just read strangely to me because I was _looking_ to see
> those things. (Yes, I realize that one of the issues is the length of
> the document. No, I don't think that means that all the issues need
> to be included.)

I see. The intent is to focus on technical problems that can be fixed.

> > > . The detailed behavior of general multicast (not the
> > > well-known link-local multicast addresses) is less clear.
> > > In particular, the behavior necessary to accommodate
> > > IGMP-snooping switches is probably missing.
> >
> > I'm not convinced this is a problem that shapes the *high-level* design.
>
> One of the criticisms leveled against routing is that there's no good
> solution for multicast. I'm pointing out that multicast isn't
> completely solved here, either.

Agreed, but it's not flawed from a high-level design standpoint. As we
get into more of the detailed design, I think there will be room to cover
this.

>
> > > - I'm not sure the described behavior here really represents
> > > FAILBACK=no. (But, then, I'm not sure what behavior would really
> [...]
> > Yes, it's a bit of an odd bird. Personally, I'd love to get rid of this
> > feature, but I know there are customers who hate unnecessary rebinding of
> > addresses to interfaces (because of the effect it has on other hosts) and
> > thus want to have that happen as little as possible.
> >
> > I've added a little more rationale behind the feature, but I don't want to
> > devote too much space to this wart (I'd like to kill it, but I can't).
>
> My point here is that I don't think the new behavior really represents
> very well what the old code did. In particular, this case:

The old code doesn't work at all, so I'd hope we don't represent that ;-)

>
> (AA) -> (FA) -> (FF) -> (IF) -> (II)
>
> seems to be quite problematic. The old code would have failed over
> the addresses to the second interface at that first event, but then
> performed no other changes. This would have left the equivalent of
> (IA) as the end result, but that's not what this proposed
> implementation seems to do. Instead, it leaves the whole group failed
> out as "inactive."
>
> Did I read that correctly?

Yes, I see the problem now. It seems that upon repair, in.mpathd should
only set INACTIVE if there is another usable interface in the group.
Otherwise, it should leave the interface active. That then should cause:

(AA) -> (FA) -> (FF) -> (AF) -> (AI)

Let me know if this fully addresses your concern, or whether you have
deeper issues.
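
The revised repair rule can be walked through with a small Python model. It
is a sketch under stated assumptions: each interface is 'A' (active), 'F'
(failed) or 'I' (inactive), "usable" is taken to mean active, and the
function names are hypothetical. Activating an inactive interface when an
active one fails is not modelled.

```python
# Hypothetical model: states are 'A' (active), 'F' (failed), 'I' (inactive).
def fail(group, i):
    """Mark interface i as failed."""
    return group[:i] + "F" + group[i + 1:]


def repair(group, i):
    """Repair interface i under FAILBACK=no using the revised rule."""
    # Assumption: another interface counts as usable only if it is active.
    others_usable = any(s == "A" for j, s in enumerate(group) if j != i)
    new_state = "I" if others_usable else "A"
    return group[:i] + new_state + group[i + 1:]


trace = ["AA"]
for step, index in ((fail, 0), (fail, 1), (repair, 0), (repair, 1)):
    trace.append(step(trace[-1], index))
print(" -> ".join(trace))
```

With both interfaces failed, the first repair leaves the interface active
because nothing else in the group is usable; only the second repair sets
INACTIVE.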

> > > p9, section 3.9:
> > >
> > > - It would help a bit, I think, to segregate out the flags that are
> > > on logical interfaces (address flags) from those that are on the
> > > underlying physical interface. (The break is after the 5th entry
> > > in the first table, and after the second in the second table.)
> >
> > Segregate how -- with an extra line in the table?
>
> A bold line between the two groups would do it.

Sadly, a bold line seems to be more challenging in LaTeX than one would
think. If I stumble on a good way to do it, I'll update.

> > And for what purpose?
>
> The implications of the two sets of flags are very different, and it's
> clear that many people looking at the flags (and likely quite a few
> reading this document) are just unclear on the difference -- or that
> there even is any.

Agreed.

> > The mirror-image must be maintained to ensure that naive applications
> > behave correctly: if there was a situation where IFF_FAILED was set and
> > IFF_RUNNING was set, then an application would try to use an unusable
> > interface. (The other case, where IFF_FAILED was clear and IFF_RUNNING
> > was clear makes no semantic sense: how can the interface not be
> > IFF_RUNNING, but not be IFF_FAILED?)
>
> That latter case does in fact happen -- when in.mpathd isn't running.

in.mpathd should be viewed as a critical system component -- what happens
with IPMP when it's not running is no more relevant than what happens to
DR when rcm_daemon isn't running.

> I can mostly understand having the kernel clear IFF_RUNNING when
> IFF_FAILED is set by the application. But I suspect that you need to
> maintain the "real" state underneath so that IP can turn IFF_RUNNING
> back on when IFF_FAILED is cleared *AND* the hardware state is
> copacetic, and not turn it back on if the hardware state isn't right.
>
> [ ... ]
>
> Seeing "FAILED" in the ifconfig output looks pretty clear to me. And
> you're proposing changes that make it *certain* that any
> administrative tools that can see these underlying interfaces at all
> must already be updated to support IPMP.
>
> In addition to that, clearing out RUNNING down at the member link
> level means that IPMP-aware applications are *compelled* to use DLPI
> to figure out what's going on at the physical layer. That seems
> unfortunate, as we previously used IFF_RUNNING for exactly that
> purpose.
>
> Moreover, the "FAILED" flag up at the ipmp bundle level doesn't seem
> to me to add a lot more value over clearing out RUNNING (which is what
> non-IPMP-aware applications will look at).
>
> So, I don't think the FAILED->~RUNNING functionality is actually
> needed. (And we've had a bit of a history in getting bits tangled
> together, so if it can be avoided, it'd be nice.)

We talked a bit offline about this -- it really comes down to how one
interprets the RUNNING flag. You clearly feel that it represents the link
and hardware state, and I can certainly understand why. My take is that
it represents IP's notion of whether the interface is usable -- and that
is based on both the link/hardware state, *and* the probe state.

I'm leaning towards agreeing with you, but I want some more time to think
about it and talk with some other folks.
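
To make the two readings of RUNNING concrete, here is a minimal sketch in Python. It is not taken from the design document; the function and parameter names are made up for illustration, with `link_up` standing in for the link/hardware state and `failed` standing in for IFF_FAILED as set by in.mpathd.

```python
def running_link_only(link_up, failed):
    """RUNNING mirrors link/hardware state only; FAILED is reported separately."""
    return link_up


def running_usable(link_up, failed):
    """RUNNING means IP considers the interface usable: link state AND probe state."""
    return link_up and not failed


# Link is fine, but probes have failed, so FAILED has been set.
print(running_link_only(link_up=True, failed=True))   # True
print(running_usable(link_up=True, failed=True))      # False

# FAILED is cleared while the link is still down: RUNNING must stay off,
# which is why the "real" link state has to be kept underneath.
print(running_usable(link_up=False, failed=False))    # False
```

Under the second reading, RUNNING has to be recomputed from both inputs whenever either one changes, so neither input can be thrown away once the flag is derived.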

> > > p18, section 4.2.2:
> > >
> > > - Should there be a "-n" option to suppress address-to-name
> > > translation? (And should "names" be the default the way they are
> > > most everywhere else?)
> >
> > What name translation?
>
> I'm just asking about parallelism with other *stat commands, such as
> netstat. Those tend to print out _names_ rather than raw addresses by
> default, and use a "-n" flag to suppress it.
>
> But if you want this one to be different, and always print numeric
> addresses, that's fine by me.

For "ipmpstat -a", using hostnames rather than addresses means that the
table key (the first column) is no longer guaranteed to be unique -- ick.
The only other context in which addresses come up is with regard to probe
targets. I'd be willing to go either way on that one.

> > > # ifconfig foobar0 plumb group a 10.0.0.1 up
> > >
> > > and since "foobar0" will never exist, this will add the address to
> > > the named group.
> >
> > I don't really see how this is new -- I can craft up arbitrary
> > hostname.<if> files today and achieve similar results. Note that if group
> > "a" doesn't exist at all by the time the system gets to handling missing
> > interfaces, the above will be ignored.
>
> The difference is that it fails from ifconfig today:
>
> # ifconfig foobar0 plumb group a 10.0.0.1 up
> ifconfig: plumb: foobar0: No such file or directory
> #

I'm confused why this won't fail after the rearchitecture.

> > > p23, 'route' changes:
> > >
> > > - If this functionality is implemented in the 'route' command
> > > itself, rather than in the kernel, what does that mean for
> > > existing utilities? It seems like the "add static route" feature
> > > in Zebra and the like will be harmed by this.
> >
> > I'd prefer to isolate this to route. Why would zebra be adding routes to
> > the underlying interfaces?
>
> Because (a) Zebra and other daemons allow you to add static routes and
> (b) users' existing configuration files will already mention those
> interfaces.

But this will only affect folks migrating to an IPMP-based configuration.
I'd much rather provide clear documentation that static routes should not
be associated with underlying interfaces than support the remapping in the
kernel.

> > > (For what it's worth, I think those utilities are probably blown
> > > out of the water by removing "ce0" from the SIOCGIFCONF data, and
> > > will need manual intervention to convert their configurations
> > > over. I hope that there's not much mixed IPMP/Zebra usage.)
> >
> > Why would they want to know about ce0? Please elaborate.
>
> I don't think they "want" to know about it. Today, they need to
> specify "ce0" (or its ifindex) if they want to tie the route to the
> interface group. There's no "group" representation, so the interface
> name is what they're using.

If they were using IPMP with any routing daemon today, bless their souls.
(In the case of Quagga, we document that it is not supported.)

> On upgrade, those configurations will now become unusable because the
> interface names have disappeared.
>
> I agree that it's a bit of a corner case -- someone has to be using
> the belt-and-suspenders approach of having both IPMP and some routing
> daemon on the system. We can just hope this doesn't happen (and
> document around it if it does).

In this case, that seems reasonable to me.

> > > - I had trouble reading this. I assume it means just that
> > > RTM_NEWADDR will occur during address transfer, and not that
> > > IFF_NOFAILOVER is the only possible way this message could be
> > > sent.
> >
> > How else could it happen? Any UP data addresses have already migrated to
> > the IPMP interface.
>
> ifconfig ipmp0:2 10.0.0.1 up?

The RTM_NEWADDR discussion you refer to is in the context of removing an
underlying interface from an IPMP group. In this case, there cannot be
any UP data addresses. Of course, adding a data address directly to an
IPMP interface will trigger an RTM_NEWADDR, but I fail to see what that
has to do with the behavior of underlying interfaces.

> > > - What do SIOC[GS]LIFMETRIC mean on member link interfaces? This
> > > doesn't make sense, as routing (the consumer of interface metrics)
> > > won't use them. Does something in IPMP itself use them?
> >
> > Could you explain more about how routing currently makes use of LIFMETRIC?

Sorry, I misread your initial response (I thought you were talking about
the IPMP interface metric). As per our other discussion, I've added
section 5.7, covering our agreed-upon semantics for interface metrics.

> > > p34, section 5.10:
> > >
> > > - Probably need more detailed behavior for SIOCDARP.
> >
> > Please elaborate -- what more would you like to see?
>
> I should have written "SIOCDXARP." That can take an interface name,
> which we currently use with ill_lookup_on_name. It's not clear to me
> whether that ought to take a member link name or the ipmp interface
> name or perhaps both in different contexts.

I see; added (and I believe it should always take an IPMP interface name
-- if I'm wrong, SIOC[GS]XARP will need to be revisited as well).

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 29, 2005 4:13 AM   in response to: meem


Peter Memishian writes:
> Hmm, I find that too easy to misunderstand. I've gone with "IP-based
> networking applications". Is that acceptable?

Yep.

> > One of the criticisms leveled against routing is that there's no good
> > solution for multicast. I'm pointing out that multicast isn't
> > completely solved here, either.
>
> Agreed, but it's not flawed from a high-level design standpoint. As we
> get into more of the detailed design, I think there will be room to cover
> this.

OK.

> Yes, I see the problem now. It seems that upon repair, in.mpathd should
> only set INACTIVE if there is another usable interface in the group.
> Otherwise, it should leave the interface active. That then should cause:
>
> (AA) -> (FA) -> (FF) -> (AF) -> (AI)
>
> Let me know if this fully addresses your concern, or whether you have
> deeper issues.

No, I think that solves the problem.
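
For reference, the repair rule quoted above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not in.mpathd code: the function name and the `group_usable` mapping (interface name to "currently usable") are assumptions, and it makes no attempt to model the state notation in the quoted text.

```python
def should_set_inactive_on_repair(repaired, group_usable):
    """Return True if the repaired interface may be marked INACTIVE.

    group_usable maps each interface name in the group to True when that
    interface is currently usable. INACTIVE is only set when some *other*
    interface in the group is usable; otherwise the repaired interface
    is left active.
    """
    return any(ok for name, ok in group_usable.items() if name != repaired)


# ce1 has just been repaired while ce0 is still failed: ce1 stays active.
print(should_set_inactive_on_repair("ce1", {"ce0": False, "ce1": True}))  # False

# ce1 has just been repaired and ce0 is usable: ce1 may go inactive.
print(should_set_inactive_on_repair("ce1", {"ce0": True, "ce1": True}))   # True
```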

> > A bold line between the two groups would do it.
>
> Sadly, a bold line seems to be more challenging in LaTeX than one would
> think. If I stumble on a good way to do it, I'll update.

A doubled line isn't too hard ...

> > That latter case does in fact happen -- when in.mpathd isn't running.
>
> in.mpathd should be viewed as a critical system component -- what happens
> with IPMP when it's not running is no more relevant than what happens to
> DR when rcm_daemon isn't running.

I disagree with that. in.mpathd isn't running on systems that don't
use IPMP. This means that the resulting interface is non-uniform. On
systems where IPMP is in use, ~RUNNING becomes FAILED. But on systems
where it's not, ~RUNNING is just on its own. So the symmetry of the
two flags exists only in _some_ cases.

> We talked a bit offline about this -- it really comes down to how one
> interprets the RUNNING flag. You clearly feel that it represents the link
> and hardware state, and I can certainly understand why. My take is that
> it represents IP's notion of whether the interface is usable -- and that
> is based on both the link/hardware state, *and* the probe state.
>
> I'm leaning towards agreeing with you, but I want some more time to think
> about it and talk with some other folks.

OK.

> For "ipmpstat -a", using hostnames rather than addresses means that the
> table key (the first column) is no longer guaranteed to be unique -- ick.
> The only other context in which addresses come up is with regard to probe
> targets. I'd be willing to go either way on that one.

I'd say names are even more important for the probe targets. In
general, though, I'm just suggesting that our commands ought to be
consistent from one to another. If the tradition (ping, netstat,
traceroute) is to translate numbers to names unless specifically
disabled, then new commands ought to do the same.

I think the argument that might work here is that this is more like
ifconfig than like any of those other commands, and ifconfig doesn't
print names. I'm not sure I _agree_ with that, but it's at least
plausible.

As for uniqueness, I don't see how that really matters. If someone
has multiple addresses mapping to a single name, then the output of
other commands on that same system (e.g., "netstat -i") is going to
show the same lack of uniqueness unless "-n" is used.

> > The difference is that it fails from ifconfig today:
> >
> > # ifconfig foobar0 plumb group a 10.0.0.1 up
> > ifconfig: plumb: foobar0: No such file or directory
> > #
>
> I'm confused why this won't fail after the rearchitecture.

Maybe I misunderstood. I read this section as implying that ifconfig
itself would be changed to deal with interface plumbing failure by
transferring the address.

Or are we sticking with the existing ifparse-based machinery in
net_include.sh?

> But this will only affect folks migrating to an IPMP-based configuration.
> I'd much rather provide clear documentation that static routes should not
> be associated with underlying interfaces than support the remapping in the
> kernel.

It seems odd that we'd try harder with /sbin/route, but OK.

> > > > - I had trouble reading this. I assume it means just that
> > > > RTM_NEWADDR will occur during address transfer, and not that
[...]
> The RTM_NEWADDR discussion you refer to is in the context of removing an
> underlying interface from an IPMP group. In this case, there cannot be
> any UP data addresses. Of course, adding a data address directly to an
> IPMP interface will trigger an RTM_NEWADDR, but I fail to see what that
> has to do with the behavior of underlying interfaces.

Returning to my original comment: the text wasn't clear. It said that
the "only" way RTM_NEWADDR happens is with transfer, and that's not
the case.

> > I should have written "SIOCDXARP." That can take an interface name,
> > which we currently use with ill_lookup_on_name. It's not clear to me
> > whether that ought to take a member link name or the ipmp interface
> > name or perhaps both in different contexts.
>
> I see; added (and I believe it should always take an IPMP interface name
> -- if I'm wrong, SIOC[GS]XARP will need to be revisited as well).

Not sure. There are certainly ARP entries for the underlying
interfaces as well. If there weren't, then you couldn't actually
probe any of those targets.

And when manipulating ARP entries for the IPMP interface, things get a
bit confusing. Do you select the underlying interface with which a
given entry is associated by specifying its MAC address? How do I
say, "add it to the group and set its MAC address to be the same as
some appropriate member?" What's the MAC address of an IPMP
interface?

(Note that published ARP entries _can_ have MAC addresses that don't
match any local address, so the solution can't restrict the interface
to handle only matching addresses.)

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 29, 2005 9:07 AM   in response to: carlsonj



> A doubled line isn't too hard ...

Indeed, Phil Kirk showed me a way to do it. I've updated the document
accordingly.

> > > That latter case does in fact happen -- when in.mpathd isn't running.
> >
> > in.mpathd should be viewed as a critical system component -- what happens
> > with IPMP when it's not running is no more relevant than what happens to
> > DR when rcm_daemon isn't running.
>
> I disagree with that. in.mpathd isn't running on systems that don't
> use IPMP. This means that the resulting interface is non-uniform. On
> systems where IPMP is in use, ~RUNNING becomes FAILED. But on systems
> where it's not, ~RUNNING is just on its own. So the symmetry of the
> two flags exists only in _some_ cases.

Hence "what happens with IPMP" in my above statement. Since FAILED will
never be set when IPMP is not in use, they will clearly be asymmetric in
that case.

Anyway, as previously discussed, I will think about the FAILED/RUNNING
interplay a bit more and make a decision shortly.

> As for uniqueness, I don't see how that really matters. If someone
> has multiple addresses mapping to a single name, then the output of
> other commands on that same system (e.g., "netstat -i") is going to
> show the same lack of uniqueness unless "-n" is used.

But it will be harder for scripts to parse the machine-parseable format,
because we will not be able to guarantee the uniqueness of each key (see
section 4.2.6 of revision 1.3 or later of the document).
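
A small Python sketch of the parsing concern, under assumed data: the rows, the states, and the `names` mapping below are hypothetical and do not reflect the actual ipmpstat output format. It only shows what happens to a script that keys a table on the first column once two addresses resolve to the same hostname.

```python
def key_rows(rows, names=None):
    """Key each (address, state) row by address, or by hostname when a
    name mapping is supplied. Returns the table and any duplicated keys."""
    table, duplicates = {}, []
    for addr, state in rows:
        key = names.get(addr, addr) if names else addr
        if key in table:
            duplicates.append(key)   # an earlier row is about to be overwritten
        table[key] = state
    return table, duplicates


rows = [("192.0.2.1", "up"), ("192.0.2.2", "down")]
names = {"192.0.2.1": "host-a", "192.0.2.2": "host-a"}  # two addresses, one name

print(key_rows(rows))          # ({'192.0.2.1': 'up', '192.0.2.2': 'down'}, [])
print(key_rows(rows, names))   # ({'host-a': 'down'}, ['host-a'])
```

With numeric keys both rows survive; with names, the second row silently replaces the first unless the script checks for duplicates.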

> > > The difference is that it fails from ifconfig today:
> > >
> > > # ifconfig foobar0 plumb group a 10.0.0.1 up
> > > ifconfig: plumb: foobar0: No such file or directory
> > > #
> >
> > I'm confused why this won't fail after the rearchitecture.
>
> Maybe I misunderstood. I read this section as implying that ifconfig
> itself would be changed to deal with interface plumbing failure by
> transferring the address.

That was not the intended implication.

> Or are we sticking with the existing ifparse-based machinery in
> net_include.sh?

Sort of. What will happen is that the boot scripts will first try to
plumb everything, and collect a list of the plumb operations that failed.
For the set of failed interfaces, it will use ifparse to determine what
IPMP group they were supposed to be part of, and add those addresses to
the relevant IPMP group. This may potentially create the IPMP group
along the way.

Does this address your concern? If so, I will update the document to make
this explicit.
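
As a rough sketch of the sequence described above (in Python rather than the shell the boot scripts actually use): the `configs` dictionary stands in for what ifparse would report for each hostname.<if> file, and `present` stands in for the interfaces that can really be plumbed. Both, along with the function name, are assumptions made for illustration.

```python
def configure_at_boot(configs, present):
    """Plumb what exists, then move addresses of missing interfaces to
    their IPMP group.

    configs: {ifname: {"group": group name or None, "addrs": [addresses]}}
    present: set of interface names that can actually be plumbed
    Returns (plumbed, groups), where groups maps an IPMP group name to the
    addresses added to it on behalf of interfaces that failed to plumb.
    """
    plumbed, failed = [], []
    for ifname in configs:
        # First pass: try to plumb everything and remember the failures.
        (plumbed if ifname in present else failed).append(ifname)

    groups = {}
    for ifname in failed:
        group = configs[ifname]["group"]
        if group is None:
            continue  # not an IPMP interface: nothing to move
        # setdefault creates the group entry if it does not exist yet.
        groups.setdefault(group, []).extend(configs[ifname]["addrs"])
    return plumbed, groups


configs = {
    "ce0": {"group": "a", "addrs": ["10.0.0.1"]},
    "ce1": {"group": "a", "addrs": ["10.0.0.2"]},
    "foobar0": {"group": None, "addrs": ["10.0.0.3"]},
}
print(configure_at_boot(configs, present={"ce0"}))
# (['ce0'], {'a': ['10.0.0.2']})
```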

> > But this will only affect folks migrating to an IPMP-based configuration.
> > I'd much rather provide clear documentation that static routes should not
> > be associated with underlying interfaces than support the remapping in the
> > kernel.
>
> It seems odd that we'd try harder with /sbin/route, but OK.

The difference is that many sites today use /sbin/route with IPMP.

> > > > > - I had trouble reading this. I assume it means just that
> > > > > RTM_NEWADDR will occur during address transfer, and not that
> [...]
> > The RTM_NEWADDR discussion you refer to is in the context of removing an
> > underlying interface from an IPMP group. In this case, there cannot be
> > any UP data addresses. Of course, adding a data address directly to an
> > IPMP interface will trigger an RTM_NEWADDR, but I fail to see what that
> > has to do with the behavior of underlying interfaces.
>
> Returning to my original comment: the text wasn't clear. It said that
> the "only" way RTM_NEWADDR happens is with transfer, and that's not
> the case.

This comment is regarding section 5.4.2, which is explicitly about routing
socket messages associated with the underlying physical interfaces. The
routing socket message you're talking about would be associated with the
IPMP group interface (section 5.4.1). However, 5.4.1 does not explicitly
discuss the behavior of RTM_NEWADDR or RTM_DELADDR on the IPMP group
interface; I will add this explanation.

> Not sure. There are certainly ARP entries for the underlying
> interfaces as well. If there weren't, then you couldn't actually
> probe any of those targets.

I don't see any reason why an application should be mucking with the ARP
entries for test addresses -- thus, the expectation is that those ARP
entries will be maintained by the kernel, and will not be directly
modifiable by applications (they could be indirectly modified by changing
a hardware address).

> And when manipulating ARP entries for the IPMP interface, things get a
> bit confusing. Do you select the underlying interface with which a
> given entry is associated by specifying its MAC address? How do I
> say, "add it to the group and set its MAC address to be the same as
> some appropriate member?"

The only time this should happen is proxy ARP, right? In that case, as we
discussed, I think the most reasonable behavior is to treat the proxied
address as if it belongs to the IPMP interface itself, and migrate it
between interfaces in the group according to failure and repair. Thus, if
an application asks to establish a binding from IP address I to hardware
address H1, but H1 is associated with a failed interface, and H2 (also in
the group) is functioning, we will establish a binding from I to H2.
Likewise, if the interface associated with H2 later fails, but the
interface associated with H1 is working, then the binding will be changed
to be from I to H1.

I will update the document if you agree.
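
A minimal Python sketch of that migration behavior, for illustration only. The class and method names are invented, `usable` is an assumed mapping from each group member's hardware address to whether its interface is currently functioning, and what to do when no member is usable is my own assumption (the binding is simply left alone).

```python
class ProxyArpBinding:
    """Binding from a proxied IP address to one group member's hardware address."""

    def __init__(self, ip, requested_hw, usable):
        self.ip = ip
        self.hw = self._pick(requested_hw, usable)

    @staticmethod
    def _pick(preferred, usable):
        # Keep the preferred hardware address while its interface works.
        if usable.get(preferred, False):
            return preferred
        # Otherwise migrate to any functioning member of the group.
        for hw, ok in usable.items():
            if ok:
                return hw
        return preferred  # assumption: nothing usable, leave the binding as-is

    def update(self, usable):
        """Re-evaluate the binding after a failure or repair."""
        self.hw = self._pick(self.hw, usable)
        return self.hw


# The application asks for I -> H1, but H1's interface has failed.
binding = ProxyArpBinding("192.0.2.10", "H1", {"H1": False, "H2": True})
print(binding.hw)                                   # H2

# Later H2's interface fails while H1's is working again.
print(binding.update({"H1": True, "H2": False}))    # H1
```

Note that this sketch does not move the binding back to H1 merely because H1 is repaired; it only migrates when the currently bound interface fails, which is one possible reading of the description.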

> What's the MAC address of an IPMP interface?

It doesn't have one -- but from the perspective of SIOCG[X]ARP, the IP
addresses it hosts are associated with a hardware address associated with
one of the interfaces in the group.

--
meem



carlsonj

Posts: 6,810
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 29, 2005 9:19 AM   in response to: meem


Peter Memishian writes:
> > As for uniqueness, I don't see how that really matters. If someone
> > has multiple addresses mapping to a single name, then the output of
> > other commands on that same system (e.g., "netstat -i") is going to
> > show the same lack of uniqueness unless "-n" is used.
>
> But it will be harder for scripts to parse the machine-parseable format,
> because we will not be able to guarantee the uniqueness of each key (see
> section 4.2.6 of revision 1.3 or later of the document).

I still don't see a problem here. For those who really care about the
issue, turning off address-to-name mapping is the right answer.

Not all users manage their systems in exactly the same way or
necessarily do the same things in every single script.

> > Or are we sticking with the existing ifparse-based machinery in
> > net_include.sh?
>
> Sort of. What will happen is that the boot scripts will first try to
> plumb everything, and collect a list of the plumb operations that failed.
> For the set of failed interfaces, it will use ifparse to determine what
> IPMP group they were supposed to be part of, and add those addresses to
> the relevant IPMP group. This may potentially create the IPMP group
> along the way.
>
> Does this address your concern? If so, I will update the document to make
> this explicit.

The above sounds like a "yes."

> This comment is regarding section 5.4.2, which is explicitly about routing
> socket messages associated with the underlying physical interfaces. The
> routing socket message you're talking about would be associated with the
> IPMP group interface (section 5.4.1). However, 5.4.1 does not explicitly
> discuss the behavior of RTM_NEWADDR or RTM_DELADDR on the IPMP group
> interface; I will add this explanation.

OK.

> > And when manipulating ARP entries for the IPMP interface, things get a
> > bit confusing. Do you select the underlying interface with which a
> > given entry is associated by specifying its MAC address? How do I
> > say, "add it to the group and set its MAC address to be the same as
> > some appropriate member?"
>
> The only time this should happen is proxy ARP, right? In that case, as we
> discussed, I think the most reasonable behavior is to treat the proxied
> address as if it belongs to the IPMP interface itself, and migrate it
> between interfaces in the group according to failure and repair. Thus, if
> an application asks to establish a binding from IP address I to hardware
> address H1, but H1 is associated with a failed interface, and H2 (also in
> the group) is functioning, we will establish a binding from I to H2.
> Likewise, if the interface associated with H2 later fails, but the
> interface associated with H1 is working, then the binding will be changed
> to be from I to H1.
>
> I will update the document if you agree.

Sounds good.

> > What's the MAC address of an IPMP interface?
>
> It doesn't have one -- but from the perspective of SIOCG[X]ARP, the IP
> addresses it hosts are associated with a hardware address associated with
> one of the interfaces in the group.

OK.

--
James Carlson, KISS Network <james dot d dot carlson at sun dot com>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677



meem

Posts: 3,045
From: US

Registered: 3/9/05
Re: Clearview IPMP Rearchitecture: high-level design: extended to 9/29
Posted: Sep 29, 2005 1:40 PM   in response to: carlsonj



> > But it will be harder for scripts to parse the machine-parseable format,
> > because we will not be able to guarantee the uniqueness of each key (see
> > section 4.2.6 of revision 1.3 or later of the document).
>
> I still don't see a problem here. For those who really care about the
> issue, turning off address-to-name mapping is the right answer.
>
> Not all users manage their systems in exactly the same way or
> necessarily do the same things in every single script.

True. I think I will make this change, but I need to think about it a
little more.

I've updated the document to version 1.4. This includes a rewrite of the
description of the ARP handling to take into account the issues you
brought up, and a host of other smaller clarifications.

At this point, I believe I've addressed all of your feedback, with the
following exceptions:

* Changing the handling of FAILED vs ~RUNNING for underlying
interfaces (still thinking about this).

* Changing the output of ipmpstat to default to hostnames, with
an option to show IP addresses (as per the discussion above).

* Covering IGMP handling -- and, as per Ramesh's earlier
feedback, the behavior associated with all-nodes multicasts
(and some other minor multicast issues).

* Covering "degraded" in more depth (need to talk to the FMA
team about this one).

* Covering the high-level interaction with existing IPMP APIs
(asynchronous events and the query interface).

As always, the latest version is here:

http://opensolaris.org/os/community/networking/ipmp-highlevel-design.pdf

Please let me know if I've overlooked anything else.
--
meem






Copyright © 1995-2005 Sun Microsystems, Inc.