banana.xhtml [plain text]

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>The Banana Protocol</title>
<style src="stylesheet-unprocessed.css"></style>
</head>

<body>
<h1>The Banana Protocol</h1>

<em>NOTE! This is all preliminary and is more an exercise in semiconscious
protocol design than anything else. Do not believe this document. This
sentence is lying. So there.</em>

<h2>Banana tokens</h2>

<p>At the lowest layer, the wire transport takes the form of Tokens. These
all take the shape of header/type-byte/body, where the type byte has its
high bit set, to distinguish it from the header bytes that precede it. Some
tokens have bodies, some do not, and the length of all bodies are determined
by data in the header. The bodies are composed of arbitrary bytes. For most
tokens, the header is a base-128 number, but for some it may be a UTF-7
encoded string. (?? maybe just strict ASCII. the high bit of all bytes in
the header must be zero). The maximum legal header length is 64 bytes, so
tokens which treat it as a number have a maximum value of 2**(64*7)-1.</p>

<p>Tokens are described below as [header-TOKEN-body], where either
<q>header</q> or <q>body</q> may be empty. For example, [len-LIST-empty]
indicates that the length is put into the header, <q>LIST</q> is the token
being used, and the body is empty.</p>

<p>The possible Token types are:</p>

<ul>
  <li>
  <code>0x80: LIST (old): [len-LIST-empty]</code>

  <p>This token marks the beginning of a list with LEN elements. It acts as
  the <q>open parenthesis</q>, and the matching <q>close parenthesis</q> is
  implicit, based upon the length of the list. It will be followed by LEN
  things, which may be tokens like INTs or STRINGS, or which may be
  sublists. Banana keeps a list stack to handle nested sublists.</p>

  <p>This token (and the notion of length-prefixed lists in general) is from
  oldbanana. In newbanana it is only used during the initial dialect
  negotiation (so that oldbanana peers can be detected). Newbanana requires
  that LIST(old) tokens be followed exclusively by strings and have a rather
  limited allowable length (say, 640 dialects long).</p>
  </li>

  <li>
  <code>0x81: INT: [value-INT-empty]</code>

  <p>This token defines a single positive integer. The protocol defines its
  range as [0, 2**31), so the largest legal value is 2**31-1. The
  recipient is responsible for choosing an appropriate local type to hold the
  number. For Python, if the value represented by the incoming base-128
  digits grows larger than a regular Python IntType can accomodate, the
  receiving system will use a LongType or a BigNum as necessary.</p>

  <p>Anything larger than this range must be sent with a LONGINT token
  instead.</p>

  <p>(oldbanana compatibility note: a python implementation can accept
  anything in the range [0, 2**448), limited by the 64-byte maximum header
  size).</p>
  
  <p>The range was chosen to allow INT values to always fit in C's s_int32_t
  type, so an implementation that doesn't have a useful bignum type can
  simply reject LONGINT tokens.</p>
  </li>

  <li>
  <code>0x82: STRING [len-STRING-chars]</code>

  <p>This token defines a string. To be precise, it defines a sequence of
  bytes. The length is a base-128-encoded integer. The type byte is followed
  by LEN bytes of data which make up the string. LEN is required to be
  shorter than 640k: this is intended to reduce the amount of memory that
  can be consumed on the receiving end before user code gets to decide
  whether to accept the data or not.</p>
  </li>

  <li>
  <code>0x83: NEG: [value-NEG-empty]</code>

  <p>This token defines a negative integer. It is identical to the
  <code>INT</code> tag except that the results are negated before storage.
  The range is defined as [-2**31, 0), again to make an implementation using
  s_int32_t easier. Any numbers smaller (more negative) than this range must
  be sent with a LONGNEG token.</p>

  <p>Implementations should be tolerant when receiving a <q>negative zero</q>
  and turn it into a 0, even though they should not send such things.</p>

  <p>Note that NEG can represent a number (-2**31) whose absolute value
  (2**31) is one larger than the greatest number that INT can represent
  (2**31-1).</p>
  </li>

  <li>
  <code>0x84: FLOAT [empty-FLOAT-value]</code>

  <p>This token defines a floating-point number. There is no header, and the
  type byte is followed by 8 bytes which are a 64-bit IEEE <q>double</q>, as
  defined by <code class="python">struct.pack("!d", num)</code>.</p>
  </li>

  <li>
  <p><code>0x85: OLDLONGINT: [value-OLDLONGINT-empty]</code></p>
  <p><code>0x86: OLDLONGNEG: [value-OLDLONGNEG-empty]</code></p>

  <p>These were used by oldbanana to represent large numbers. Their size was
  limited by the number of bytes in the header (max 64), so they can
  represent [0, 2**448).</p>
  </li>

  <li>
  <code>0x87: VOCAB: [index-VOCAB-empty]</code>
  
  <p>This defines a tokenized string. Banana keeps a mapping of common
  strings, each one is assigned a small integer. These strings can be sent
  compressed as a two-byte (index, VOCAB) sequence. They are delivered to
  Jelly as plain strings with no indication that they were compressed for
  transit.</p>

  <p>The strings in this mapping are populated by the sender when it sends a
  special <q>vocab</q> OPEN sequence. The intention is that this mapping
  will be sent just once when the connection is first established, but a
  sufficiently ambituous sender could use this to implement adaptive forward
  compression.</p>
  </li>

  <li>
  <p><code>0x88: OPEN: [[num]-OPEN-empty]</code></p>
  <p><code>0x89: CLOSE: [[num]-CLOSE-empty]</code></p>

  <p>These tokens are the newbanana parenthesis markers. They carry an
  optional number in their header: if present, the number counts the
  appearance of OPEN tokens in the stream, starting at 0 for the first OPEN
  used for a given connection and incrementing by 1 for each subsequent
  OPEN. The matching CLOSE token must contain an identical number. These
  numbers are solely for debugging and may be omitted. They may be removed
  from the protocol once development has been completed.</p>

  <p>In contrast to oldbanana (with the LIST token), newbanana does not use
  length-prefixed lists. Instead it relies upon the Banana layer to track
  OPEN/CLOSE tokens.</p>

  <p>The token which follows an OPEN marker must be a string: either a
  STRING token or a VOCAB token. This string indicates what kind of new
  sub-expression is being started.</p>
  </li>

  <li>
  <p><code>0x8A: ABORT: [[num]-ABORT-empty]</code></p>

  <p>This token indicates that something has gone wrong on the sender side,
  and that the resulting object must not be handed upwards in the unslicer
  stack. It may be impossible or inconvenient for the sender to stop sending
  the tokens associated with the unfortunate object, so the receiver must be
  prepared to silently drop all further tokens up to the matching STOP
  marker.</p>

  <p>The number, if present, will be the same one used by the OPEN
  token.</p>
  </li>

  <li>
  <p><code>0x8B: LONGINT: [len-LONGINT-bytes]</code></p>
  <p><code>0x8C: LONGNEG: [len-LONGNEG-bytes]</code></p>

  <p>These are processed like STRING tokens, but the bytes form a base-256
  encoded number, most-significant-byte first (note that this may require
  several passes and some intermediate storage). The size is (barely) limited
  by the length field, so the theoretical range is [0, 2**(2**(64*7)-1)-1),
  but the receiver can impose whatever length limit they wish.</p>

  <p>LONGNEG is handled exactly like LONGINT but the number is negated
  first.</p>
  </li>
  
</ul>

<p>TODO: Add TRUE, FALSE, and NONE tokens.</p>


<h2>Object graph serialization</h2>

<p>When serializing an object, it is useful to view it as a directed graph.
The root object is the one you start with, any objects it refers to are
children of that root. Those children may point back to other objects that
have already been serialized, or which will be serialized later.</p>

<p>Banana, like pickle and other serialization schemes, does a depth-first
traversal of this graph. Serialization is begun on each node before going
down into the child nodes. Banana tracks previously-handled nodes and
replaces them with numbered <code>reference</code> tokens to break loops in
the graph.</p>


<h2>Banana Slices</h2>

<p>A <em>Banana Slicer</em> is responsible for serializing a single user
object: it <q>slices</q> that object into a series of Banana tokens. On the
receiving end, there is a corresponding <em>Banana Unslicer</em> which
accepts the incoming tokens and re-creates the user object. There are
different kinds of Slicers and Unslicers for lists, tuples, dictionaries,
and instances. Classes can provide their own Slicers if they want more
control over the serialization process.</p>

<h2>The Banana Stack</h2>

<p>The serialization context is stored in a <q>Banana</q> object (names are
still being decided). This holds a stack of Banana Slicers, one per object
currently being serialized (i.e. one per node in the path from the root
object to the object currently being serialized).</p>

<p>For example, suppose a class instance is being serialized, and this class
chose to use a dictionary to hold its instance state. That dictionary holds
a list of numbers in one of its values. The Banana Stack would hold the root
slicer, an InstanceSlicer, a DictSlicer, and finally a ListSlicer.</p>

<p>(note: it might be possible to move the functionality of the Banana
object entirely into the <q>root slice</q> (<q>root slicer</q>?)). </p>

<p>Upon unserialization, the <q>Unbanana</q> object holds this context. A
stack of <q>Unslicer</q> objects handle incoming tokens. The <q>Unbanana</q>
is responsible for tracking OPEN and CLOSE tokens, making sure a failure in
an Unslicer doesn't cause a loss of synchronization. Unslicer methods may
raise exceptions: these are caught by the Unbanana and cause the object
currently being unserialized to fail: its parent gets a UnbananaFailure
instead of the dict or list or instance that it would normally have
received.</p>

<p>The stack is used to determine three things:</p>

<ul>
  <li> Whether to allow a given child to be serialized: the Taster
       function</li>
  <li> How to handle a child object: which Slicer should be used</li>
  <li> How to track object references, to break cycles in the object graph</li>
</ul>

<p>The default case puts a <q>Parent</q> slice at the bottom of the stack.
This can also be interpreted as a <q>root object</q>, if you imagine that
any given user object being serialized is somehow a child of the overall
serialization context. In PB, for example, the root object would be related
to the connection.</p>

<p>In addition, the stack can be queried to find out what path leads from
the root object to the one currently being serialized. If something goes
wrong in the serialization process (an exception is thrown), this path can
make it much easier to find out <em>when</em> the trouble happened, as
opposed to merely where. Knowing which method of your FooObject failed
during serialization isn't very useful when you have 500 of them inside your
data structure and you need to know whether it was <code>bar.thisfoo</code>
or <code>bar.thatfoo</code> which caused the problem. To this end, each
Slicer has a <code>.describe</code> method which is supposed to return a
short string that explains how to get to the child node currently being
processed. When an error occurs, these strings are concatenated together and
put into the failure object.</p>

<p>The Parent slice is meant to provide the default behavior for the stack.
The default class currently does the following:</p>

<ul>
  <li> Allow all objects to be serialzed: .taster is empty.</li>
  <li> Use the BananaRegistry mapping from object types to Slicer classes.</li>
  <li> Record object references in a dict inside the parent slice.</li>
</ul>

<p>TODO: The idea is to let other serialization contexts to other things.
The tokens should probably go to the parent slice for handling: turning into
bytes and sending over a wire, saving to a file, etc. Having the whole stack
participate in the Tasting process means that objects can put restrictions
on what is sent on their behalf: objects could refuse to let certain classes
be sent as part of their instance state.</p>


<h2>Bananaing</h2>

<p>Serialization starts with the Parent Slicer being asked to serialize the
given object. The Parent gives the object to Banana. Banana starts by
walking the stack (which, of course, has only the Parent on it at that
point), calling the <code>.taste</code> method for each Slicer there. If any
of them have a problem with the object being serialized, they express it by
raising an exception (TODO: which one? InsecureBanana?).</p>

<p>If the Taster stack passes the object, Banana's next job is to find a new
Slicer to handle the object. It does this by walking the stack, calling
<code>.newSlicer</code> on each slice. The first one that returns an object
ends the search. In most cases, this is the Parent slice, which just looks
up the <code>type()</code> of the object in the <code>SlicerRegistry</code>.
A type which does not have a Slicer registered for it will cause an
exception to be raised here.</p>

<p>The new Slicer is pushed on the the stack. It is then sent three methods
in succession: <code>.start</code>, <code>.slice</code>, and
<code>.finish</code>. <code>start</code> defaults to registering the object
with <code>setRefID</code> and sending a appropriate OPEN token.
<code>slice</code> is defined on a per-Slicer basis to send all the
necessary tokens. <code>finish</code> sends the CLOSE token.</p>

<p>Banana keeps strict track of the nesting level. For safety, each OPEN
gets a sequence number so it can be matched with its CLOSE token. If a
Slicer's .close method fails to send the close token, very bad things will
happen (in general, all further objects will become children of the one that
didn't CLOSE properly). The sequence numbers are an attempt to minimize the
damage.</p>


<h2>Unbananaing</h2>

<p>The Unbanana object has a corresponding stack of <em>Banana Unslicer</em>
objects. Each one receives the tokens emitted by the matching Slicer on the
sending side. The whole stack is used to determine new Unslicer objects,
perform Tasting of incoming tokens, and manage object references.</p>

<p>OPEN tokens have a string (or short list of tokens.. see below) to
indicate what kind of object is being started. This is looked up in the
UnbananaRegistry just like object types are looked up in the BananaRegistry.
The new Unslicer is pushed onto the stack.</p>

<p><q>ABORT</q> tokens indicate that something went wrong on the sending
side and that the current object is to be aborted. It causes the receiver to
ignore all tokens until the CLOSE token which closes the current node. This
is implemented by replacing the top-most slice with a DiscardUnslicer.</p>

<p><q>CLOSE</q> tokens finish the current node. The slice will pass its
completed object up to the <q>childFinished</q> method of its parent.</p>

<h3>Open Index tokens</h3>

<p>To be precise, OPEN tokens are followed by an arbitrary list of other
tokens which are used to determine which UnslicerFactory should be invoked
to create the new Unslicer. Basic Python types are designated with a simple
string, like (OPEN <q>list</q>) or (OPEN <q>dict</q>), but instances are
serialized with two strings (OPEN <q>instance</q> <q>classname</q>), and
various exotic PB objects like method calls may involve a list of strings
and numbers (OPEN <q>call</q> reqID objID methodname). The unbanana code
works with the unslicer stack's designated <q>opener</q> object to apply
constraints to these indexing tokens and finally obtain the new Unslicer
when enough indexing tokens have been received.</p>

<p>The reason for assembling this list before creating the Unslicer (instead
of using a generic InstanceUnslicer which switches behavior depending upon
its first received token) is to support classes or PB methods which wish to
push custom Unslicers to handle their deserialization process. For example,
a class could push a StreamingFileUnslicer that accepts a series of string
tokens and appends their contents to a file on disk. This Unslicer could
reduce memory consumption (by only holding one chunk at a time) and update
some kind of progress indicator as the data arrives. This particular feature
was provided by the old StringPager utility, but custom Unslicers offer more
flexibility and better efficiency (no additional round-trips).</p>

<p>(note: none of this affects the serialization side: those Slicers emit
both their indexing tokens and their state tokens. It is only the receiving
side where the index tokens are handled by a different piece of code than
the content tokens).</p>

<p>In yet greater detail:</p>

<ul>

  <li>Each OPEN sequence is divided into an <q>Index phase</q> and a
  <q>Contents phase</q>. The first one (or two or three) tokens are the
  Index Tokens and the rest are the Content Tokens. The sequence ends with a
  CLOSE token.</li>

  <li>Banana.inOpen is a boolean which indicates that we are in the Index
  Phase. It is set to True when the OPEN token is received and returns to
  False after the new Unslicer has been pushed.</li>

  <li>Banana.opentype is a list of Index Tokens that are being accumulated.
  It is cleared each time .inOpen is set to True. The tuple form of opentype
  is passed to Slicer.doOpen, Constraint.checkOpentype, and used as a key in
  the RootSlicer.openRegistry dictionary. Each Unslicer type is indexed by
  an opentype tuple.</li>

</ul>

<p>If .inOpen is True, each new token type will be passed (through
Banana.getLimit and top.openerCheckToken) to the opener's .openerCheckToken
method, along with the current opentype tuple. The opener gets to decide if
the token is acceptable (possibly raising a BananaError exception), and may
return a length limit (usually for strings). Note that the opener does not
maintain state about what phase the decoding process is in, so it may want
to condition its response upon the length of the opentype.</p>

<p>After each index token is complete, it is appended to .opentype, then the
list is passed (through Banana.handleOpen, top.doOpen, and top.open) to the
opener's .open method. This can either return an Unslicer (which will finish
the index phase: all further tokens will be sent to the new Unslicer),
return None (to continue the index phase), raise a Violation (which causes
an UnbananaFailure to be passed to the current top unslicer), or raise
another exception (which causes the connection to be abandoned).</p>

<h3>Unslicer Lifecycle</h3>

<p>Each Unslicer handles a single <q>OPEN sequence</q>, which starts with an
OPEN token and ends with a CLOSE token (or an ABORT token).</p>


<h4>Creation</h4>

<p>Acceptance of the OPEN token simply sets a flag to indicate that we are
in the Index Phase. (The OPEN token might not be accepted: it is submitted
to checkToken for approval first, as described below). During the Index
Phase, all tokens are appended to the current <code>opentype</code> list and
handed as a tuple to the top-most Unslicer's <code>doOpen</code> method.
This method can do one of the following things:</p>

<ul>
  <li>Return a new Unslicer object. It does this when there are enough index
  tokens to specify a new Unslicer. The new child is pushed on top of the
  Unslicer stack (Banana.receiveStack) and initialized by calling the
  <code>start</code> method described below. This ends the Index Phase.</li>

  <li>Return None. This indicates that more index tokens are required. The
  Banana protocol object simply remains in the Index Phase and continues to
  accumulate index tokens.</li>

  <li>Raise a Violation</li>
  <li>Raise some other exception</li>
</ul>

<p>When a new Unslicer object is pushed on the top of the stack, it has its
<code>.start</code> method called, in which it has an opportunity to create
whatever internal state is necessary to record the incoming content tokens.
Each created object will have a separate Unslicer instance. (TODO: could
optimize with singleton Unslicer objects, using start/finish methods to do
cleanup). The start method can run normally, or raise a Violation exception,
or some other exception.</p>

<p>This Unslicer is responsible for all incoming tokens until either 1: it
pushes a new one on the stack, or 2: it receives a CLOSE token.</p>


<h4>checkToken</h4>

<p>Each token starts with a length sequence, up to 64 bytes which are turned
into an integer. This is followed by a single type byte, distinguished from
the length bytes by having the high bit set (the type byte is always 0x80 or
greater). When the typebyte is received, the topmost Unslicer is asked about
its suitability by calling the <code>.checkToken</code> method. (note that
CLOSE and ABORT tokens are always legal, and are not submitted to
checkToken). This method is expected to do one of the following:</p>

<ul>
  <li>Return an integer called <code>sizelimit</code>. If the token body is
  larger than sizelimit, the token will be rejected. This is used for
  STRING, LONGINT, and LONGNEG tokens.</li>

  <li>Raise a <code>Violation</code> exception to reject the token. This
  will cause the remainder of the current OPEN sequence to be discarded (all
  tokens through the matching CLOSE token). Unslicers should raise this if
  their constraints will not accept the incoming object: for example a
  constraint which is expecting a series of integers can accept
  INT/NEG/LONGINT/LONGNEG tokens and reject OPEN/STRING/VOCAB/FLOAT tokens.
  The topmost Unslicer (the same one which raised Violation) will receive
  (through its <code>.receiveChild</code> method) an UnbananaFailure object
  which encapsulates the reason for the rejection </li>

  <li>Raise some other exception</li>
</ul>

<p>If the token sequence is in the <q>index phase</q> (i.e. it is just after
an OPEN token and a new Unslicer has not yet been pushed), then instead of
<code>.checkToken</code> the top unslicer is sent
<code>.openerCheckToken</code>. This method behaves just like checkToken,
but in addition to the type byte it is also given the opentype list (which
is built out of all the index tokens received during this index phase).</p>

<h4>receiveChild</h4>

<p>If the type byte is accepted, and the size limit is obeyed, then the rest
of the token is read and a finished (primitive) object is created: a string,
number, boolean, or None. This object is handed to the topmost Unslicer's
<code>.receiveChild</code> method, where again it is has a few options:</p>

<ul>
  <li>Run normally: if the object is acceptable, it should append or record
  it somehow.</li>
  <li>Raise Violation, just like checkToken.</li>
  <li>Raise some other exception</li>

  <li>invoke <code>self.abort</code>, which does
  <code>protocol.abandonUnslicer</code></li>
</ul>

<p>If the child is handed an UnbananaFailure object, and it wishes to pass
it upwards to its parent, then <code>self.abort</code> is the appropriate
thing to do. Raising an exception will accomplish the same thing, but with a
new UnbananaFailure that describes the exception raised here instead of the
one raised by a child object. It is bad to both call <code>abort</code> and
raise an exception.</p>

<h4>Finishing</h4>

<p>When the CLOSE token arrives, the Unslicer will have its
<code>.receiveClose</code> method called. This is expected to do:</p>

<ul>
  <li>Return an object: this object is the finished result of the
  deserialization process. It will be passed to <code>.receiveChild</code>
  of the parent Unslicer.</li>

  <li>Return a Deferred: this indicates that the object cannot be created
  yet (tuples that contain references to an enclosing tuple, for example).
  The Deferred will be fired (with the object) when it completes.</li>

  <li>Raise Violation</li>
  <li>Raise some other exception</li>
</ul>

<p>After receiveClose has finished, the child is told to clean up by calling
its <code>.finish</code> method. This can complete normally or raise a
Violation, BananaError, or something else.</p>

<p>Then, the old top-most Unslicer is popped from the stack and discarded.
Its parent is now the new top-most Unslicer, and the newly-unserialized
object is given to it with the <code>.receiveChild</code> method. Note that
this method is used to deliver both primitive objects (from raw tokens)
<em>and</em> composite objects (from other Unslicers).</p>


<h3>Error Handling</h3>

<p>Schemas are enforced by Constraint objects which are given an opportunity
to pass judgement on each incoming token. When they do not like something
they are given, they respond by raising a <code>Violation</code> exception.
The Violation exception is sometimes created with an argument that describes
the reason for the rejection, but frequently it is just a bare exception.
Most Violations are raised by the <code>checkOpentype</code> and
<code>checkObject</code> methods of the various classes in
<code>schema.py</code>.</p>

<p>Exceptions which occur in an Unslicer (whether Violations or something
more serious) can be confined to a single sub-tree of the object graph. The
object being deserialized (and all of its children) is abandoned, and all
remaining tokens for that object are discarded. However, the parent object
(to which the abandoned object would have been given) gets to decide what
happens next: it can either fail itself, or absorb the failure (much like an
exception handler can choose to re-raise the exception or eat it).</p>

<p>Confinable exceptions are wrapped in an <code>UnbananaFailure</code>
object. The UnbananaFailure behaves like a regular
<code>twisted.python.failure.Failure</code> object, except that it has an
attribute named <code>.where</code> which indicate the object-graph pathname
where the problem occurred. The UnbananaFailure is passed up the Unslicer
stack in lieu of an actual object. Most Unslicers have code in their
<code>receiveChild</code> methods to detect an UnbananaFailure and trigger
an abort, which causes all further tokens of the sub-tree to be discarded.
The connection is not dropped. Unslicers which partition their children's
sub-graphs (like the PBRootUnslicer, for which each child is a separate
operation) can simply ignore the UnbananaFailure, or respond to it by
sending an error message to the other end.</p>

<p>Other exceptions may occur during deserialization. Unrecognized opentypes
will probably cause a KeyError when the opener attempts to map them to an
Unserializer class. Coding erors could cause unexpected exceptions. If these
occur in the context of an Unslicer, they are wrapped in UnbananaFailures
and handled just like Violations. The only difference is that non-Violations
are always logged with <code>log.err</code> (including a traceback), whereas
Violations are passed through the Banana.logViolation method which may or
may not log it (and will probably not include a traceback).</p>

<p>Exceptions which occur outside of an Unslicer indicate coding errors or
severe protocol violations and cause the connection to be dropped. The
exception is logged on the local side with <code>log.err</code>, but the
remote end will not be given any reason for the disconnection. Protocol
violations are indicated with a BananaError exception.</p>

<p>The Banana object can also choose to respond to Violations by terminating
the connection. For example, the <code>.hangupOnLengthViolation</code> flag
causes string-too-long violations to be raised directly instead of being
handled, which will cause the connection to be dropped (as it occurs in the
dataReceived method).</p>


<h3>Example</h3>

<p>The serialized form of <code class="python">["foo",(1,2)]</code> is the
following token sequence: OPEN STRING(list) STRING(foo) OPEN STRING(tuple)
INT(1) INT(2) CLOSE CLOSE. In practice, the STRING(list) would really be
something like VOCAB(7), likewise the STRING(tuple) might be VOCAB(8). Here
we walk through how this sequence is processed.</p>

<p>The initial Unslicer stack consists of the single RootUnslicer
<code>rootun</code>.</p>

<pre>
OPEN
  rootun.checkToken(OPEN) : must not raise Violation
  enter index phase

VOCAB(7)  (equivalent to STRING(list))
  rootun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up in .incomingVocabulary, turned into "list"
  rootun.doOpen(("list",)) : looks in UnslicerRegistry, returns ListUnslicer
  exit index phase
  the ListUnslicer is pushed on the stack
  listun.start()

STRING(foo)
  listun.checkToken(STRING) : must return None or 3 or greater
  string is assembled
  listun.receiveChild("foo") : appends to list

OPEN
  listun.checkToken(OPEN) : must not raise Violation
  enter index phase

VOCAB(8)  (equivalent to STRING(tuple))
  listun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up, turned into "tuple"
  listun.doOpen(("tuple",)) : delegates through:
                                 BaseUnslicer.open
                                 self.opener (usually the RootUnslicer)
                                 self.opener.open(("tuple",))
                              returns TupleUnslicer
  exit index phase
  TupleUnslicer is pushed on the stack
  tupleun.start()

INT(1)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(1) : appends to list

INT(2)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(2) : appends to list

CLOSE
  tupleun.receiveClose() : creates and returns the tuple (1,2)
                           (could also return a Deferred)
  TupleUnslicer is popped from the stack and discarded
  listun.receiveChild((1,2))

CLOSE
  listun.receiveClose() : creates and returns the list ["foo", (1,2)]
  ListUnslicer is popped from the stack and discarded
  rootun.receiveChild(["foo", (1,2)])
</pre>


<h2>Other Issues</h2>


<h3>Deferred Object Recreation: The Trouble With Tuples</h3>

<p>Types and classes are roughly classified into containers and
non-containers. The containers are further divided into mutable and
immutable. Some examples of immutable containers are tuples and bound
methods. Lists and dicts are mutable containers. Ints and strings are
non-containers. Non-containers are always leaf nodes in the object
graph.</p>

<p>During unserialization, objects are in one of three states: uncreated,
referenceable (but not complete), and complete. Only mutable containers can
be referenceable but not complete: immutable containers have no intermediate
referenceable state.</p>

<p>Mutable containers (like lists) are referenceable but not complete during
traversal of their child nodes. This means those children can reference the
list without trouble.</p>

<p>Immutable containers (like tuples) present challenges when unserializing.
The object cannot be created until all its components are referenceable.
While it is guaranteed that these component objects will be complete before
the graph traversal exits the current node, the child nodes are allowed to
reference the current node during that traversal. The classic example is the
graph created by the following Python fragment:</p>

<pre class="python">
a = ([],)
a[0].append((a,))
</pre>

<p>To handle these cases, the TupleUnslicer installs a Deferred into the
object table when it begins unserializing (in the .start method). When the
tuple is finally complete, the object table is updated and the Deferred is
fired with the new tuple.</p>

<p>Containers (both mutable and immutable) are required to pay attention to
the types of their incoming children and notice when they receive Deferreds
instead of normal objects. These containers are not complete (in the sense
described above) until those Deferreds have been replaced with referenceable
objects. When the container receives the Deferred, it should attach a
callback to it which will perform the replacement. In addition, immutable
containers should check after each update to see if all the Deferreds have
been cleared, and if so, complete their own object (and fire their own
Deferreds so any containers <em>they</em> are a child of may be updated
and/or completed).</p>


<h3>Security Model</h3>

<p>Having the whole Slicer stack particpate in Tasting on the sending side
seems to make a lot of sense. It might be better to have a way to push new
Taster objects onto a separate stack. This would certainly help with
performance, as the common case (where most Slicers ignore .taste) does a
pointless method call to every Slice for every object sent. The trick is to
make sure that exception cases don't leave a taster stranded on the stack
when the object that put it there has gone away.</p>

<p>On the receiving side, each object has a corresponding .taste method,
which receives tokens instead of complete objects. This makes sense, because
you want to catch the dangerous data before it gets turned into an object,
but tokens are a pretty low-level place to do security checks. It might be
more useful to have some kind of <q>instance taster stack</q>, with tasters
that are asked specifically about (class,state) pairs and whether they
should be turned into objects or not.</p>

<p>Because the Unslicers receive their data one token at a time, things like
InstanceUnslicer can perform security checks one attribute at a time.
<q>traits</q>-style attribute constraints (see the Chaco project or the
PyCon-2003 presentation for details) can be implemented by having a
per-class dictionary of tests that attribute values must pass before they
will be accepted. The instance will only be created if all attributes fit
the constraints. The idea is to catch violations before any code is run on
the receiving side. Typical checks would be things like <q>.foo must be a
number</q>, <q>.bar must not be an instance</q>, <q>.baz must implement the
IBazzer interface</q>.</p>

<p>Using the stack instead of a single Taster object means that the rules
can be changed depending upon the context of the object being processed. A
class that is valid as the first argument to a method call may not be valid
as the second argument, or inside a list provided as the first argument. The
PBMethodArgumentsUnslicer could change the way its .taste method behaves as
its state machine progresses through the argument list.</p>

<p>There are several different ways to implement this Taster stack:</p>

<ul>
  <li> Each object in the Unslicer stack gets to raise an exception if they
  don't like what they see: unanimous consent is required to let the token or
  object pass</li>
  
  <li> The top-most unslicer is asked, and it has the option of asking the
  next slice down. It might not, allowing local <q>I'm sure this is safe</q>
  classes to override higher-level paranoia.</li>

  <li> Unslicer objects may add and remove Taster objects on a separate
  stack. This is undoubtedly faster but must be done carefully to make sure
  Tasters and Unslicers stay in sync.</li>
</ul>

<p>Of course, all this holds true for the sending side as well. A Slicer
could enforce a policy that no objects of type Foo will be sent while it is
on the stack.</p>

<p>It is anticipated that something like the current Jellyable/Unjellyable
classes will be created to offer control over the Slicer/Unslicers used to
handle instance of that class.</p>

<p>One eventual goal is to allow PB to implement E-like argument
constraints.</p>


<h3>Streaming Slices</h3>

<p>The big change from the old Jelly scheme is that now
serialization/unserialization is done in a more streaming format. Individual
tokens are the basic unit of information. The basic tokens are just numbers
and strings: anything more complicated (starting at lists) involves
composites of other tokens.</p>

<p>The serialization side will be reworked to be a bit more
producer-oriented. Objects should be able to defer their serialization
temporarily (TODO: really??) like twisted.web resources can do NOT_DONE_YET
right now. The big goal here is that large objects which can't fit into the
socket buffers should not consume lots of memory, sitting around in a
serialized state with nowhere to go. This must be balanced against the
confusion caused by time-distributed serialization. PB method calls must
retain their current in-order execution, and it must not be possible to
interleave serialized state (big mess).</p>

<p>To actually accomplish this, objects should be able to provide their own
Slicers. It may be convenient to do this entirely with Adapters, so Banana
registers ListSlicer, etc as an ISlicer adapter for the fundamental types;
pb.Copyable implements ISlicer by having a method to return a new Slicer;
etc.</p>

<p>Likewise on the receiving end, the Unslicer is created when the OPEN
token is received, and then receives all the tokens destined for that
object.</p>

<h3>CBanana, CBananaRun, RunBananaRun</h3>

<p>Another goal of the Jelly+Banana-&gt;JustBanana change is the hope of
writing Slicers and Unslicers in C. The CBanana module should have C objects
(structs with function pointers) that can be looked up in a registry table
and run to turn python objects into tokens and vice versa. This ought to be
faster than running python code to implement the slices, at the cost of less
flexibility. It would be nice if the resulting tokens could be sent directly
to the socket at the C level without surfacing into python; barring this it
is probably a good idea to accumulate the tokens into a large buffer so the
code can do a few large writes instead of a gazillion small ones.</p>

<p>It ought to be possible to mix C and Python slices here: if the C code
doesn't find the slice in the table, it can fall back to calling a python
method that does a lookup in an extensible registry.</p>

<h2>Beyond Banana</h2>

<p>Random notes and wild speculations: take everything beyond here with
<em>two</em> grains of salt</p>

<h3>Oldbanana usage</h3>

<p>The oldbanana usage model has the layer above banana written in one of
two ways. The simple form is to use the <code
class="python">banana.encode</code> and <code
class="python">banana.decode</code> functions to turn an object into a
bytestream. This is used by twisted.spread.publish . The more flexible model
is to subclass Banana. The largest example of this technique is, of course,
twisted.spread.pb.Broker, but others which use it are twisted.trial.remote
and twisted.scripts.conch (which appears to use it over unix-domain
sockets).</p>

<p>Banana itself is a Protocol. The Banana subclass would generally override
the <code>expressionReceived</code> method, which receives s-expressions
(lists of lists). These are processed to figure out what method should be
called, etc (processing which only has to deal with strings, numbers, and
lists). Then the serialized arguments are sent through Unjelly to produce
actual objects.</p>

<p>On output, the subclass usually calls <code>self.sendEncoded</code> with
some set of objects. In the case of PB, the arguments to the remote method
are turned into s-expressions with jelly, then combined with the method
meta-data (object ID, method name, etc), then the whole request is sent to
<code>sendEncoded</code>.</p>

<h3>Newbanana</h3>

<p>Newbanana moves the Jelly functionality into a stack of Banana Slices,
and the lowest-level token-to-bytestream conversion into the new Banana
object. Instead of overriding <code>expressionReceived</code>, users could
push a different root Unslicer. to get more control over the receive
process.

Currently, Slicers call Banana.sendOpen/sendToken/sendClose/sendAbort, which
then creates bytes and does transport.write .

To move this into C, the transport should get to call CUnbanana.receiveToken
There should be CBananaUnslicers. Probably a parent.addMe(self) instead of
banana.stack.append(self), maybe addMeC for the C unslicer.

The Banana object is a Protocol, and has a dataReceived method. (maybe in
some C form, data could move directly from a CTransport to a CProtocol). It
parses tokens and hands them to its Unslicer stack. The root Unslicer is
probably created at connectionEstablished time. Subclasses of Banana could
use different RootUnslicer objects, or the users might be responsible for
setting up the root unslicer.

The Banana object is also created with a RootSlicer. Banana.writeToken
serializes the token and does transport.write . (a C form could have CSlicer
objects which hand tokens to a little CBanana which then hands bytes off to
a CTransport).

Doing the bytestream-to-Token conversion in C loses a lot of utility when
the conversion is done token at a time. It made more sense when a whole mess
of s-lists were converted at once.

All Slicers currently have a Banana pointer.. maybe they should have a
transport pointer instead? The Banana pointer is needed to get to top of the
stack.

want to be able to unserialize lists/tuples/dicts/strings/ints (<q>basic
types</q>) without surfacing into python. want to deliver the completed
object to a python function.


</p>

<h3>Streaming Methods</h3>

<p>It would be neat if a method could indicate that it would like to receive
its arguments in a streaming fashion. This would involve calling the method
early (as soon as the objectID and method name were known), then somehow
feeding objects to it as they arrive. The object could return a handler or
consumer sub-object which would be fed as tokens arrive over the wire. This
consumer should have a way to enforce a constraint on its input.</p>

<p>This consumer object sounds a lot like an Unslicer, so maybe the method
schema should indicate that the method will would like to be called right
away so it can return an Unslicer to be pushed on the stack. That Unslicer
could do whatever it wanted with the incoming tokens, and could enforce
constraints with the usual checkToken/doOpen/receiveChild/receiveClose
methods.</p>

<p>On the sending side, it would be neat to let a callRemote() invocation
provide a Producer or a generator that will supply data as the network
buffer becomes available. This could involve pushing a Slicer. Maybe Slicers
should be generators. </p>



<h2>Common token sequences</h2>

<h3>Base Python Types</h3>

<p>The basic python types are considered <q>safe</q>: the code which is
invoked by their receipt is well-understood and there is no way to cause
unsafe behavior during unserialization. Resource consumption attacks are
mitigated by Constraints imposed by the receiving schema.</p>

<p>Note that the OPEN(dict) slicer is implemented with code that sorts the
list of keys before serializing them. It does this to provide deterministic
behavior and make testing easier.</p>

<table border="" width="">
  <TR><TD>IntType, LongType (small+)</TD><TD>INT(value)</TD></TR>
  <TR><TD>IntType, LongType (small-)</TD><TD>NEG(value)</TD></TR>
  <TR><TD>IntType, LongType (large+)</TD><TD>LONGINT(value)</TD></TR>
  <TR><TD>IntType, LongType (large-)</TD><TD>LONGNEG(value)</TD></TR>
  <TR><TD>FloatType</TD><TD>FLOAT(value)</TD></TR>
  <TR><TD>StringType</TD><TD>STRING(value)</TD></TR>
  <TR><TD>StringType (tokenized)</TD><TD>VOCAB(tokennum)</TD></TR>
  <TR><TD>UnicodeType</TD>
      <TD>OPEN(unicode) STRING(str.encode('UTF-8')) CLOSE</TD></TR>
  <TR><TD>ListType</TD><TD>OPEN(list) elem.. CLOSE</TD></TR>
  <TR><TD>TupleType</TD><TD>OPEN(tuple) elem.. CLOSE</TD></TR>
  <TR><TD>DictType, DictionaryType</TD>
      <TD>OPEN(dict) (key,value).. CLOSE</TD></TR>
  <TR><TD>NoneType</TD><TD>OPEN(none) CLOSE</TD></TR>
  <TR><TD>BooleanType</TD><TD>OPEN(boolean) INT(0/1) CLOSE</TD></TR>
</table>

<h3>Extended (unsafe) Python Types</h3>

<p>To serialize arbitrary python object graphs (including instances)
requires that we allow more types in. This begins to get dangerous: with
complex graphs of inter-dependent objects, instances may need to be used (by
referencing objects) before they are fully initialized. A schema can be used
to make assertions about what object types live where, but in general the
contents of those objects are difficult to constrain.</p>

<p>For this reason, these types should only be used in places where you
trust the creator of the serialized stream (the same places where you would
be willing to use the standard Pickle module). Saving application state to
disk and reading it back at startup time is one example.</p>

<table border="" width="">
  <TR><TD colspan="2">Extended (unsafe) Python Types</TD></TR>

  <TR><TD>InstanceType</TD><TD>OPEN(instance) STRING(reflect.qual(class))
    (attr,value).. CLOSE</TD></TR>
  <TR><TD>ModuleType</TD><TD>OPEN(module) STRING(__name__) CLOSE</TD></TR>
  <TR><TD>ClassType</TD>
      <TD>OPEN(class) STRING(reflect.qual(class)) CLOSE</TD></TR>
  <TR><TD>MethodType</TD>
      <TD>OPEN(method) STRING(__name__) im_self im_class CLOSE</TD></TR>
  <TR><TD>FunctionType</TD>
      <TD>OPEN(function) STRING(module.__name__) CLOSE</TD></TR>
</table>

<h3>PB Sequences</h3>

<p>The following sequences are used to support the RPC mechanisms of
Perspective Broker.</p>

<table border="" width="">
  <TR><TD colspan="2">PB (method call) Sequences</TD></TR>

  <TR><TD>pb.Referenceable</TD>
      <TD>OPEN(remote) INT(ref-id)
        [InterfaceList]
        CLOSE</TD></TR>

  <TR><TD>RemoteReference.__del__</TD>
      <TD>OPEN(decref) INT(refID) CLOSE</TD></TR>

  <TR><TD>pb.Copyable</TD><TD>see InstanceType above</TD></TR>

  <TR><TD>method call</TD>
      <TD>OPEN(call) INT(request-id) INT(obj-id) STRING(methodname)
          (STRING(argname),argvalue).. CLOSE</TD></TR>

  <TR><TD>method call++</TD>
      <TD>OPEN(call2)
        INT(reqID) INT(objID) STRING(interfacename) INT(interfaceversion)
        STRING(methodname) (STRING(argname),argvalue)..
        CLOSE
        (this could be folded into 'call' with only 2 bytes of overhead)
      </TD></TR>

  <TR><TD>method response (success)</TD>
      <TD>OPEN(answer) INT(request-id) value CLOSE</TD></TR>
  <TR><TD>method response (exception)</TD>
      <TD>OPEN(error) INT(request-id) value CLOSE</TD></TR>
  
</table>

<p>The first time a <code>pb.Referenceable</code> is sent, the second object
is an InterfaceList, which is a list of tuples of (interfacename,
versionnum), and therefore constrainable by a schema of
ListOf(TupleOf(str,int)) with some appropriate maximum-length restrictions.
This InterfaceList describes all the Interfaces that the corresponding
<code>pb.Referenceable</code> implements. The receiver uses this list to
look up local Interfaces (and therefore Schemas) to attach to the
<code>pb.RemoteReference</code>. This is how method schemas are checked on
the sender side.</p>

<p>This implies that Interfaces must be registered, just as classes are for
<code>pb.Copyable</code>. TODO: what happens if an unknown Interface is
received? TODO: are multiple versions of the same interface allowed? If so,
how is that specified?</p>


<h3>Unhandled types</h3>

<p>The following types are not handled by any slicer, and will raise a
KeyError if one is referenced by an object being sliced. This technically
imposes a limit upon the kinds of objects that can be serialized, even by a
<q>unsafe</q> serializer, but in practice it is not really an issue, as many
of these objects have no meaning outside the program invocation which
created them.</p>

<ul>
  <li>- types that might be nice to have</li>
  <li>ComplexType</li>
  <li>SliceType</li>
  <li>TypeType</li>
  <li>XRangeType</li>

  <li>- types that aren't really that useful</li>
  <li>BufferType</li>
  <li>BuiltinFunctionType</li>
  <li>BuiltinMethodType</li>
  <li>CodeType</li>
  <li>DictProxyType</li>
  <li>EllipsisType</li>
  <li>NotImplementedType</li>
  <li>UnboundMethodType</li>

  <li>- types that are meaningless outside the creator</li>
  <li>TracebackType</li>
  <li>FileType</li>
  <li>FrameType</li>
  <li>GeneratorType</li>
  <li>LambdaType</li>
</ul>

<h3>Unhandled (but don't worry about it) types</h3>

<p><code>ObjectType</code> is the root class of all other types. All objects
are known by some other type in addition to <code>ObjectType</code>, so the
fact that it is not handled explicitly does not matter.</p>

<p><code>StringTypes</code> is simply a list of <code>StringType</code> and
<code>UnicodeType</code>, so it does not need to be explicitly handled
either.</p>

<h3>Internal types</h3>

<p>The following sequences are internal.</p>

<p>The OPEN(vocab) sequence is used to update the forward compression
token-to-string table used by the VOCAB token. It is followed by a series of
number/string pairs. All numbers that appear in VOCAB tokens must be
associated with a string by appearing in the most recent OPEN(vocab)
sequence.</p>

<table border="" width="">
  <TR><TD colspan="2">internal types</TD></TR>
  <TR><TD>vocab dict</TD><TD>OPEN(vocab) (num,string).. CLOSE</TD></TR>
</table>

</body> </html>