From casey  Wed Jan 29 11:56:56 1992
Received: by darkstar.s1.gov (910711.SGI.UNSUPPORTED.PROTOTYPE/910805.SGI.UNSUPPORTED.PROTOTYPE)
	for werner id AA19907; Wed, 29 Jan 92 11:56:56 -0800
Date: Wed, 29 Jan 92 11:56:56 -0800
From: casey (Casey Leedom)
Message-Id: <9201291956.AA19907@darkstar.s1.gov>
To: project
Subject: Report on multicast group meeting of 1/16/92 and USENIX discussions

[[Wednesday, January 29, 1992: Sorry about getting this report to you late.
I had it nearly completed just before I headed off to this year's winter
USENIX conference but didn't quite make it.  While at the conference I had
a chance to talk with Van Jacobson.  I've also had more time to read some
of the papers on distributed communication.  I'm not quite as optimistic
about a true network level multicast implementation as I was just before
USENIX but I haven't edited that optimism out.  None of the information is
wrong; it's just that a deeper understanding of the complexity of the
problems has shaken my confidence.  Besides, maybe next week I'll be optimistic
again ...]]

Multicast Working Group Meeting
9:30am Thursday, January 16, 1992

Report:

  The forecast for a true network multicast implementation of the multicast
communication harness may be much better than we thought.  I still need to
do more reading and research, but the network layer multicast primitives
offered by SGI seem to be an implementation of a proposed Internet standard
(RFC-1112 ``Host Extensions for IP Multicasting'').  Also, the proposed
Internet standard for VMTP (RFC-1045 ``Versatile Message Transaction
Protocol'') may answer our reliable, single-source ordered transport layer
needs.

  Bob Shectman expressed concern about the portability of depending on
RFC-1112 implementations being available on other vendors' platforms.
It turns out that there are implementations available for SunOS 4.0, Ultrix
3.0/MIPS, 4.3BSD-tahoe/VAX and 4.3BSD-tahoe/Tahoe.  That's not everything
in the universe, but it is a fairly big segment of the Unix market.
Besides, since there is no other standard for accessing multicast
facilities and this one seems to address our needs, I think that it makes
sense to go with it.  And anything that we came up with on our own
definitely wouldn't be available on any other vendor's platform ...  [[At
USENIX, Mike Karels told me that he's trying to get together with Steve
Deering (the author of RFC-1112), Andrew Cherenson and Van Jacobson to
discuss implementing RFC-1112 in 4.4BSD.  The only big problem with
RFC-1112 that Mike mentioned to me is that it forces applications to deal
with network interfaces.]]

  I am worried about VMTP's appropriateness because:

    1.	It may not address our needs.  I need to take the time to read the
	standard.

    2.	It may address our needs but carry too much other baggage with it.
	The standard is an ominous 102 pages long ...

    3.	I don't think that it's been ported to the SGI.  There are VMTP
	ports available for all the operating systems I mentioned above
	that had IP Multicast implementations.  I'll have to grunge around
	to see what the situation is.  If there isn't a kernel level
	implementation available, we may be able to implement it in
	user-land, but I can think of several things (like protocol timers)
	that might be virtually impossible to implement reliably in
	user-land ...

    4.	I have doubts that VMTP will be able to handle multiple multicast
	groups on the same multicast address (see agenda report item 3
	below).  This would be a serious problem ...

Nevertheless, the VMTP protocol may offer us a lot of methods to deal with
our transport layer needs.

  With that preamble out of the way, I'll get to the report on our meeting
agenda items.

    1.	What kind of delivery characteristics do we need?  We definitely
	need reliable, single-source ordered messaging.  Do we need
	multi-source ordered messaging?  [See Garcia-Molina and Spauster
	1991.]  We need single-source ordering, for example, because we
	don't want a LOCK, UNLOCK sequence to be received as UNLOCK, LOCK.

	No one in the meeting was able to come up with any scenario
	requiring multi-sourced ordered delivery.  Please comment on this
	if you can think of a need.
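
	As a rough sketch only, single-source ordering could be enforced on
	the receiving side with a per-source sequence number.  Everything
	here (the names, the fixed-size table, sequence numbers starting at
	0 and never wrapping) is an assumption made for illustration, not
	something decided in the meeting:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOURCES 16   /* assumption: small fixed table for the sketch */

/* Hypothetical per-source receive state. */
struct SourceState {
    unsigned long source_id;
    unsigned long next_seq;   /* next sequence number we expect */
    int in_use;
};

struct OrderTable {
    struct SourceState sources[MAX_SOURCES];
};

enum Verdict { DELIVER, DUPLICATE, OUT_OF_ORDER, TABLE_FULL };

void order_table_init(struct OrderTable *t)
{
    size_t i;
    assert(t != NULL);
    for (i = 0; i < MAX_SOURCES; i++)
        t->sources[i].in_use = 0;
}

/* Find the state for a source, claiming a free slot on first contact. */
static struct SourceState *find_source(struct OrderTable *t, unsigned long id)
{
    struct SourceState *free_slot = NULL;
    size_t i;
    for (i = 0; i < MAX_SOURCES; i++) {
        if (t->sources[i].in_use && t->sources[i].source_id == id)
            return &t->sources[i];
        if (!t->sources[i].in_use && free_slot == NULL)
            free_slot = &t->sources[i];
    }
    if (free_slot != NULL) {
        free_slot->in_use = 1;
        free_slot->source_id = id;
        free_slot->next_seq = 0;   /* assumption: every source starts at 0 */
    }
    return free_slot;
}

/* Decide what to do with message number seq from source_id.  Only the
 * next expected message from a given source is delivered; messages
 * from different sources are never ordered against each other. */
enum Verdict check_order(struct OrderTable *t, unsigned long source_id,
                         unsigned long seq)
{
    struct SourceState *s;
    assert(t != NULL);
    s = find_source(t, source_id);
    if (s == NULL)
        return TABLE_FULL;
    if (seq < s->next_seq)
        return DUPLICATE;
    if (seq > s->next_seq)
        return OUT_OF_ORDER;   /* caller would hold it or ask for a resend */
    s->next_seq++;
    return DELIVER;
}
```

	With this, an UNLOCK numbered 1 that arrives before the LOCK
	numbered 0 from the same source comes back OUT_OF_ORDER and is not
	handed to the application early.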

    2.	What kind of application model do we want?  Datagram or stream?

	Currently the applications that we have are message oriented.
	Also, nothing on the horizon looks like it will do anything except
	messaging.  This leads us to think that the multicast transport
	layer should mirror the higher layer usage and offer inherent
	message framing capabilities.  Thus, our tentative decision on this
	is that we should concentrate on a datagram transport.

    3.	It's almost certain that we'll need to create, use and destroy
	multicast groups very dynamically.  What kinds of mechanisms will
	this require?  Multicast group identifier generation, registration,
	defense and naming are prime issues.  Also, given the nature of the
	universe (see Murphy), it's certain that we'll see conflicts with
	different groups of applications trying to use the same multicast
	addresses.  Some mechanism for disambiguating such overloaded use
	will have to be developed.

	It can be shown that no matter what we do, there will come a time
	when more than one multicast group is sharing the same multicast
	address for at least some small finite period of time.  Basically,
	if autonomous applications on different segments of a temporarily
	partitioned network dynamically pick multicast addresses for their
	multicast groups, there's no way to prevent them from picking the
	same multicast address.  There are schemes to generate unique names
	on a distributed basis, but there would be no way to map those
	uniquely onto multicast addresses because the multicast address
	space is too small.  We could envision some kind of
	``frequency-hopping'' approach to disentangle multiple multicast
	groups from the same multicast address but that leads to even
	larger problems (see agenda report item 5 below).  Besides, that
	still doesn't address the main issue of how to deal with the
	problem for the time period during which more than one multicast
	group is operational on a multicast address.

	Thus, our thought is that we should just accept that there will be
	times when multiple multicast groups are sharing a multicast
	address for an indefinite period -- potentially the lifetimes of
	the applications.  That is, we view multicasting as a performance
	optimization that enables applications to send a single copy of
	their packets out and saves them from having to filter their
	multicast group messages out from dozens or hundreds of other
	multicast groups, but those applications may very well still have
	to filter their messages out from several other multicast groups.

	All of this means that there will have to be a mechanism to
	uniquely identify multicast packets belonging to specific multicast
	groups.  My proposal is to use one of the schemes I alluded to
	above to generate unique multicast group ``names'' on a distributed
	basis and insert that name into every multicast message.  [[When
	I mentioned this to Van Jacobson he said ``Of course.  What else
	would you do?''  Sometimes it's hard talking with Van.  He's too
	damn smart!]]
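
	To make the idea concrete, here is a minimal sketch of tagging and
	filtering.  The 16-byte name, the header layout and the ingredients
	of the name (host identifier, creation time, per-host counter) are
	all assumptions for illustration; the text above doesn't settle on
	a particular naming scheme:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical group name, wide enough to be generated without
 * coordination (unlike the small multicast address space). */
struct MCastGroupName {
    unsigned char bytes[16];
};

/* Hypothetical header carried at the front of every multicast message. */
struct MCastHeader {
    struct MCastGroupName group;
    unsigned long length;        /* payload bytes following the header */
};

/* Store the low 32 bits of v in network (big-endian) byte order. */
static void put32(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xff);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

/* Build a name from values assumed to be unique in combination. */
struct MCastGroupName make_group_name(unsigned long host_id,
                                      unsigned long created,
                                      unsigned long counter)
{
    struct MCastGroupName n;
    memset(n.bytes, 0, sizeof n.bytes);   /* last 4 bytes stay reserved */
    put32(n.bytes, host_id);
    put32(n.bytes + 4, created);
    put32(n.bytes + 8, counter);
    return n;
}

/* Receive-side filter: keep only messages for the group we subscribed
 * to, even though other groups may share the same multicast address. */
int accept_message(const struct MCastHeader *h,
                   const struct MCastGroupName *subscribed)
{
    assert(h != NULL && subscribed != NULL);
    return memcmp(h->group.bytes, subscribed->bytes,
                  sizeof subscribed->bytes) == 0;
}
```

	The filter costs one fixed-size comparison per packet received on a
	shared address, which fits the view of multicast addresses as a
	performance optimization rather than as the group identity.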

	New associated question: Once we consider putting ``destination
	addresses'' in every packet it seems natural to wonder if ``source
	addresses'' should also be put in every packet.  What use would
	they be?  What form would they take?  One possible use would be
	sending ``unicast'' messages back to a specific entity in a
	multicast group.  This could be used for, say, requesting a
	``lock'' from an object generator.  If such a source address were
	instituted, it would almost certainly have to be independent of the
	host address that the entity currently resided on in order to allow
	for backup processes taking over from a failed process and process
	migration.

    4.	What kind of application programming interface (API) do we want?
	Subscribe/unsubscribe?  Synchronous or asynchronous delivery?
	Both?  Separate file descriptor for each subscribed group or one
	for all?

	We envision an interface that looks something like the following:

	    MCastGroup = CreateMCastGroup(void)
	    DestroyMCastGroup(MCastGroup)

	    SubscribeToMCastGroup(MCastGroup, Flags, MCastReceiveCallback)
	    UnsubscribeToMCastGroup(MCastGroup)

	Where MCastReceiveCallback would get called with MCastGroup as a
	parameter when data was available on the multicast group and could
	be specified as NULL to indicate that asynchronous input wasn't
	desired.  Note that we've hidden the single versus multiple file
	descriptor question down underneath the API where it won't confuse
	the applications programmer.
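
	Purely as an illustration of the calling convention, the four calls
	could be stubbed out in memory as below.  The integer handle, the
	int return codes, the fixed table and NotifyDataAvailable (standing
	in for whatever the transport does when data arrives) are
	assumptions, not part of the proposal:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 8             /* assumption: tiny fixed table */

typedef int MCastGroup;          /* assumption: integer handle, -1 on failure */
typedef void (*MCastReceiveCallback)(MCastGroup group);

struct GroupSlot {
    int in_use;
    int subscribed;
    MCastReceiveCallback callback;   /* NULL: no asynchronous input wanted */
};

static struct GroupSlot groups[MAX_GROUPS];

static int valid(MCastGroup g)
{
    return g >= 0 && g < MAX_GROUPS && groups[g].in_use;
}

MCastGroup CreateMCastGroup(void)
{
    int i;
    for (i = 0; i < MAX_GROUPS; i++) {
        if (!groups[i].in_use) {
            groups[i].in_use = 1;
            groups[i].subscribed = 0;
            groups[i].callback = NULL;
            return i;
        }
    }
    return -1;
}

int DestroyMCastGroup(MCastGroup g)
{
    if (!valid(g))
        return -1;
    groups[g].in_use = 0;
    return 0;
}

int SubscribeToMCastGroup(MCastGroup g, int flags, MCastReceiveCallback cb)
{
    (void)flags;                 /* meaning of Flags is still open */
    if (!valid(g))
        return -1;
    groups[g].subscribed = 1;
    groups[g].callback = cb;
    return 0;
}

int UnsubscribeToMCastGroup(MCastGroup g)
{
    if (!valid(g))
        return -1;
    groups[g].subscribed = 0;
    groups[g].callback = NULL;
    return 0;
}

/* Hypothetical hook the transport would call when data is available.
 * Returns 1 if a callback was invoked, 0 otherwise. */
int NotifyDataAvailable(MCastGroup g)
{
    if (!valid(g) || !groups[g].subscribed || groups[g].callback == NULL)
        return 0;
    groups[g].callback(g);
    return 1;
}

/* Example application callback: just counts wakeups. */
static int example_wakeups;
static void CountingCallback(MCastGroup g)
{
    assert(g >= 0);
    example_wakeups++;
}
```

	Nothing in the application-visible signatures mentions a file
	descriptor, so the one-versus-many descriptor choice can change
	underneath without touching callers.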

	Things get uglier when we start looking at actually getting data
	into and out of the multicast group.  The main problems seem to
	come from the [assumed] message nature of our application.  If we
	force the application to send and receive entire messages at once
	that leads to potential inconvenience and inefficiency.  And there
	seem to be complementary problems if we force applications to
	incrementally send and receive messages ...  One solution might be
	to allow both paradigms ...  This is an area that we're going to
	spend more time on.  Here are outlines of the two potential I/O
	paradigms:

	Whole message sent/received at once:

		SendToMCastGroup(MCastGroup, ?Message?, ?Flags?)
		ReceiveFromMCastGroup(MCastGroup, ?&Message?, ?Flags?)

	    For SendToMCastGroup would ?Message? be a character pointer and
	    a length?  If so, then that would force the application to
	    construct a message in one piece and wouldn't allow gathering it
	    together from several places.  This is a real problem because it's
	    very likely that applications will run application layer protocols
	    on top of the multicast transport and those application protocols
	    will consist of headers and separate data.  Forcing the application
	    to construct a message in one piece would probably require the
	    application to suffer weird pointer manipulations or efficiency
	    hits as it copied the header and other portions of its message into
	    a single memory area.  Note that because IRIX doesn't have a real
	    implementation of writev, this ``hand gathering'' will have to be
	    done in any case, but that's SGI's problem.  Or is it?
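
	    For reference, ``hand gathering'' amounts to something like the
	    following sketch.  It borrows the struct iovec layout that
	    writev uses; gather_message is a made-up name and the
	    caller-supplied output buffer is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/* Copy the pieces described by iov[0..iovcnt-1] into one contiguous
 * buffer.  Returns the total number of bytes gathered, or -1 if the
 * pieces don't fit in outlen bytes. */
long gather_message(const struct iovec *iov, int iovcnt,
                    char *out, size_t outlen)
{
    size_t used = 0;
    int i;

    assert(iov != NULL && out != NULL && iovcnt >= 0);
    for (i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len == 0)
            continue;                       /* nothing to copy */
        if (iov[i].iov_len > outlen - used)
            return -1;                      /* would overflow out */
        memcpy(out + used, iov[i].iov_base, iov[i].iov_len);
        used += iov[i].iov_len;
    }
    return (long)used;
}
```

	    This is exactly the extra copy of header and data that a
	    pointer-and-length ?Message? would push onto every application.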

	    For ReceiveFromMCastGroup the problem is even worse because
	    receiving a whole message at once implies that the application
	    knows the length of the message ahead of time ...

	    One real possibility that strikes me to solve these and other
	    problems is to create a Message class that allowed incremental
	    building and parsing.  Care would have to be taken to avoid
	    forcing all I/O to go through an extra copy phase which would
	    impact performance.

	Incremental message transmission (the stdio approach):

		WriteMCastGroup(MCastGroup, Buffer, BufferLength, ?Flags?)
		ReadMCastGroup(MCastGroup, Buffer, BufferLength, ?Flags?)

	    In this approach the application would just write pieces of its
	    messages onto an MCastGroup.  Those pieces would get buffered
	    until the end of the message was written.  Similarly, the
	    application would simply read from the MCastGroup as much or
	    little as it wanted to until it ran up against the end of a
	    message.  The disadvantage of this is that it forces all
	    applications to go through a copying/buffering stage which
	    would impact performance.  It's also unclear how best to
	    indicate the ends of messages on either writing or reading.
	    Finally, this doesn't match the message oriented model that
	    we're offering and that's bound to lead to problems sometime
	    along the way ...

	This is all probably far more detail than you wanted to read, but I
	thought I'd pass our current thoughts about this on since many of
	you will be using whatever API we come up with and are therefore
	interested observers.

	Personally, I'm leaning towards implementing a Message class that
	allows incremental building and parsing, with provisions for direct
	access and reference counting.  This would allow applications to
	use a stdio buffering style approach to construct and disassemble
	messages while still allowing direct memory buffer manipulations.
	I include the reference counting because the current mux daemon
	needs that and I can see other message based applications also
	benefiting from that.
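
	A very rough sketch of what such a Message ``class'' might look
	like in plain C follows.  All of the names, the growth policy and
	the single read cursor are assumptions for illustration; this
	version copies on append and doesn't attempt the zero-copy path
	that the performance concern above calls for:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct Message {
    char *data;        /* contiguous message bytes */
    size_t length;     /* bytes appended so far */
    size_t capacity;   /* bytes allocated */
    size_t read_pos;   /* cursor for incremental parsing */
    int refs;          /* reference count */
};

struct Message *MessageCreate(void)
{
    struct Message *m = calloc(1, sizeof *m);
    if (m != NULL)
        m->refs = 1;
    return m;
}

struct Message *MessageRef(struct Message *m)
{
    assert(m != NULL && m->refs > 0);
    m->refs++;
    return m;
}

void MessageUnref(struct Message *m)
{
    assert(m != NULL && m->refs > 0);
    if (--m->refs == 0) {
        free(m->data);
        free(m);
    }
}

/* Incremental building: append len bytes.  Returns 0, or -1 on failure. */
int MessageAppend(struct Message *m, const void *buf, size_t len)
{
    assert(m != NULL);
    if (len == 0)
        return 0;
    if (len > m->capacity - m->length) {
        size_t newcap = m->capacity ? m->capacity : 64;
        char *p;
        while (newcap - m->length < len)
            newcap *= 2;
        p = realloc(m->data, newcap);
        if (p == NULL)
            return -1;
        m->data = p;
        m->capacity = newcap;
    }
    memcpy(m->data + m->length, buf, len);
    m->length += len;
    return 0;
}

/* Incremental parsing: copy out up to len unread bytes, return the count.
 * A return of 0 means the end of the message has been reached. */
size_t MessageRead(struct Message *m, void *buf, size_t len)
{
    size_t left;
    assert(m != NULL);
    left = m->length - m->read_pos;
    if (len > left)
        len = left;
    if (len > 0) {
        memcpy(buf, m->data + m->read_pos, len);
        m->read_pos += len;
    }
    return len;
}

/* Direct access for callers that want to work on the buffer itself. */
char *MessageData(struct Message *m) { return m->data; }
size_t MessageLength(const struct Message *m) { return m->length; }
```

	Because a Message has a definite length once built, message
	boundaries come for free: the end-of-message question that plagues
	the stdio-style calls turns into a zero return from MessageRead.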

	The question of synchronous/asynchronous error notification hasn't
	even been touched on yet.  Unfortunately this will have to be dealt
	with ...

    5.	How will we handle application faults and transient network
	partitioning?  I.e. what do we do if a multicast group subscriber
	crashes or hangs?  And what do we do when a multicast group is
	temporarily partitioned because of transient network problems causing
	potential multicast group shared state inconsistency?  And, for
	that matter, how will we even detect those conditions?  The big
	issue here is state.  Our main state so far is the concept of
	object locking but there may be more in the future ...

	[[I had a chance to talk with Van Jacobson at USENIX about this.
	He said that locks were bad.  Stay away from locks.  Don't even
	look at them.  They were nothing except trouble ... :-) He said
	that a much better approach in most cases was to instantiate a new
	object that represented the association (lock) that one wants and
	have that object carry an expiration timer.  This would ensure that
	``locks'' were dropped automatically if a client went away.
	There's a hell of a lot more to this though ...]]
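
	One way to read the expiring-association idea is as a lease.  The
	sketch below is only an interpretation of that remark: the names,
	the integer holder ids and passing the current time in explicitly
	are assumptions, and it ignores clock skew between hosts
	altogether:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical association object standing in for a lock. */
struct Lease {
    unsigned long holder;   /* assumption: 0 means nobody holds it */
    time_t expires;         /* the lease is void from this time on */
};

/* A lease counts only while it has a holder and has not yet expired. */
int LeaseHeld(const struct Lease *l, time_t now)
{
    assert(l != NULL);
    return l->holder != 0 && now < l->expires;
}

/* Acquire, or renew if who already holds it.  Returns 1 on success and
 * 0 if someone else holds an unexpired lease. */
int LeaseAcquire(struct Lease *l, unsigned long who, time_t now, time_t ttl)
{
    assert(l != NULL && who != 0 && ttl > 0);
    if (LeaseHeld(l, now) && l->holder != who)
        return 0;
    l->holder = who;
    l->expires = now + ttl;
    return 1;
}
```

	A holder that crashes or hangs simply stops renewing, so the
	``lock'' disappears after at most ttl seconds with no explicit
	failure detection.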

What's next?

  Obviously I need to do more reading and talk to you guys more in
meetings.  But I also need to talk to some people who've thought about this
problem area for a long time.  I'll probably talk to Danny Nessett very
soon and get his thoughts.  I'm also hoping to convince Van Jacobson that
it's worth his time to talk to me.  (Van is very interested in this area
and has an application that he wants to build using this technology.  On
the positive side I'm getting paid full time to do this and am therefore a
good resource for Van.  On the negative side I don't know what I'm doing
and don't have any experience in the area.  We'll have to see ...)

Casey

