At the lowest layer, the wire transport takes the form of Tokens. These all take the shape of header/type-byte/body, where the type byte has its high bit set to distinguish it from the header bytes that precede it. Some tokens have bodies, some do not, and the lengths of all bodies are determined by data in the header. The bodies are composed of arbitrary bytes. For most tokens, the header is a base-128 number, but for some it may be a UTF-7 encoded string. (?? maybe just strict ASCII: the high bit of all bytes in the header must be zero). The maximum legal header length is 64 bytes, so tokens which treat it as a number have a maximum value of 2**(64*7)-1.
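The framing can be sketched as follows. This is an illustration, not the reference implementation: the helper names are made up, and the least-significant-digit-first ordering of the base-128 header is an assumption of this sketch.

```python
def encode_header(value):
    # Base-128 digits with the high bit of every byte clear, so header
    # bytes can never be mistaken for a type byte.
    # (Digit order is an assumption of this sketch.)
    digits = bytearray()
    while True:
        digits.append(value & 0x7F)   # low 7 bits, high bit clear
        value >>= 7
        if value == 0:
            return bytes(digits)

def encode_token(header_value, type_byte, body=b""):
    # A token is header bytes, one type byte (high bit set), then the body.
    assert type_byte & 0x80, "type bytes always have the high bit set"
    header = encode_header(header_value)
    assert len(header) <= 64, "maximum legal header length is 64 bytes"
    return header + bytes([type_byte]) + body

# A hypothetical INT token (0x81) carrying the value 300:
token = encode_token(300, 0x81)   # b'\x2c\x02\x81'
```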
Tokens are described below as [header-TOKEN-body], where either header or body may be empty. For example, [len-LIST-empty] indicates that the length is put into the header, LIST is the token being used, and the body is empty.
The possible Token types are:
0x80: LIST (old): [len-LIST-empty]
This token marks the beginning of a list with LEN elements. It acts as the open parenthesis, and the matching close parenthesis is implicit, based upon the length of the list. It will be followed by LEN things, which may be tokens like INTs or STRINGs, or which may be sublists. Banana keeps a list stack to handle nested sublists.
This token (and the notion of length-prefixed lists in general) is from oldbanana. In newbanana it is only used during the initial dialect negotiation (so that oldbanana peers can be detected). Newbanana requires that LIST(old) tokens be followed exclusively by strings and have a rather limited allowable length (say, 640 dialects long).
0x81: INT: [value-INT-empty]
This token defines a single positive integer. The protocol defines its range as [0, 2**31), so the largest legal value is 2**31-1. The recipient is responsible for choosing an appropriate local type to hold the number. For Python, if the value represented by the incoming base-128 digits grows larger than a regular Python IntType can accommodate, the receiving system will use a LongType or a BigNum as necessary.
Anything larger than this range must be sent with a LONGINT token instead.
(oldbanana compatibility note: a python implementation can accept anything in the range [0, 2**448), limited by the 64-byte maximum header size).
The range was chosen to allow INT values to always fit in C's s_int32_t type, so an implementation that doesn't have a useful bignum type can simply reject LONGINT tokens.
0x82: STRING [len-STRING-chars]
This token defines a string. To be precise, it defines a sequence of bytes. The length is a base-128-encoded integer. The type byte is followed by LEN bytes of data which make up the string. LEN is required to be shorter than 640k: this is intended to reduce the amount of memory that can be consumed on the receiving end before user code gets to decide whether to accept the data or not.
0x83: NEG: [value-NEG-empty]
This token defines a negative integer. It is identical to the INT tag except that the results are negated before storage.
The range is defined as [-2**31, 0), again to make an implementation using
s_int32_t easier. Any numbers smaller (more negative) than this range must
be sent with a LONGNEG token.
Implementations should be tolerant when receiving a negative zero
and turn it into a 0, even though they should not send such things.
Note that NEG can represent a number (-2**31) whose absolute value (2**31) is one larger than the greatest number that INT can represent (2**31-1).
0x84: FLOAT [empty-FLOAT-value]
This token defines a floating-point number. There is no header, and the type byte is followed by 8 bytes which are a 64-bit IEEE double, as defined by struct.pack("!d", num).
0x85: OLDLONGINT: [value-OLDLONGINT-empty]
0x86: OLDLONGNEG: [value-OLDLONGNEG-empty]
These were used by oldbanana to represent large numbers. Their size was limited by the number of bytes in the header (max 64), so they can represent [0, 2**448).
0x87: VOCAB: [index-VOCAB-empty]
This defines a tokenized string. Banana keeps a mapping of common strings; each one is assigned a small integer. These strings can be sent compressed as a two-byte (index, VOCAB) sequence. They are delivered to Jelly as plain strings with no indication that they were compressed for transit.
The strings in this mapping are populated by the sender when it sends a special vocab OPEN sequence. The intention is that this mapping will be sent just once when the connection is first established, but a sufficiently ambitious sender could use this to implement adaptive forward compression.
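A minimal sketch of the receiving side's table (the table contents and the names are hypothetical; only the lookup idea comes from the text above):

```python
# Hypothetical contents of the receiver's vocabulary table, as populated
# by the sender's special 'vocab' OPEN sequence:
incomingVocabulary = {7: "list", 8: "tuple"}

def expand_vocab(index):
    # A two-byte (index, VOCAB) sequence is delivered upward as the plain
    # string it stands for; upper layers never see the compression.
    return incomingVocabulary[index]

expand_vocab(7)   # → 'list'
```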
0x88: OPEN: [[num]-OPEN-empty]
0x89: CLOSE: [[num]-CLOSE-empty]
These tokens are the newbanana parenthesis markers. They carry an optional number in their header: if present, the number counts the appearance of OPEN tokens in the stream, starting at 0 for the first OPEN used for a given connection and incrementing by 1 for each subsequent OPEN. The matching CLOSE token must contain an identical number. These numbers are solely for debugging and may be omitted. They may be removed from the protocol once development has been completed.
In contrast to oldbanana (with the LIST token), newbanana does not use length-prefixed lists. Instead it relies upon the Banana layer to track OPEN/CLOSE tokens.
The token which follows an OPEN marker must be a string: either a STRING token or a VOCAB token. This string indicates what kind of new sub-expression is being started.
0x8A: ABORT: [[num]-ABORT-empty]
This token indicates that something has gone wrong on the sender side, and that the resulting object must not be handed upwards in the unslicer stack. It may be impossible or inconvenient for the sender to stop sending the tokens associated with the unfortunate object, so the receiver must be prepared to silently drop all further tokens up to the matching CLOSE marker.
The number, if present, will be the same one used by the OPEN token.
0x8B: LONGINT: [len-LONGINT-bytes]
0x8C: LONGNEG: [len-LONGNEG-bytes]
These are processed like STRING tokens, but the bytes form a base-256 encoded number, most-significant-byte first (note that this may require several passes and some intermediate storage). The size is (barely) limited by the length field: the maximum length is 2**448-1 bytes, so the theoretical range is [0, 256**(2**448-1)), but the receiver can impose whatever length limit they wish.
LONGNEG is handled exactly like LONGINT but the number is negated first.
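The decoding rule for these bodies can be sketched as a straightforward MSB-first base-256 accumulation (shown here without the streaming and intermediate-storage concerns the note mentions; the function name is invented):

```python
def decode_long(body, negative=False):
    # LONGINT/LONGNEG bodies are base-256 numbers, most-significant-byte
    # first. LONGNEG is handled exactly like LONGINT but negated.
    value = 0
    for byte in body:
        value = (value << 8) | byte
    return -value if negative else value

decode_long(b"\x01\x00")                 # → 256
decode_long(b"\x02\x00", negative=True)  # → -512
```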
TODO: Add TRUE, FALSE, and NONE tokens.
When serializing an object, it is useful to view it as a directed graph. The root object is the one you start with, any objects it refers to are children of that root. Those children may point back to other objects that have already been serialized, or which will be serialized later.
Banana, like pickle and other serialization schemes, does a depth-first traversal of this graph. Serialization is begun on each node before going down into the child nodes. Banana tracks previously-handled nodes and replaces them with numbered reference tokens to break loops in the graph.
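The cycle-breaking traversal can be sketched like this (the marker tuples are invented for illustration; real Banana emits reference tokens instead):

```python
def serialize(obj, refs=None):
    # Depth-first: register each container before descending into its
    # children, so a loop back to an ancestor becomes a numbered reference.
    if refs is None:
        refs = {}
    if isinstance(obj, list):
        if id(obj) in refs:
            return ("reference", refs[id(obj)])
        refs[id(obj)] = len(refs)
        return ("list", [serialize(child, refs) for child in obj])
    return obj   # non-containers are leaf nodes

a = [1, 2]
a.append(a)      # the list now contains itself
serialize(a)     # → ('list', [1, 2, ('reference', 0)])
```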
A Banana Slicer is responsible for serializing a single user object: it slices that object into a series of Banana tokens. On the receiving end, there is a corresponding Banana Unslicer which accepts the incoming tokens and re-creates the user object. There are different kinds of Slicers and Unslicers for lists, tuples, dictionaries, and instances. Classes can provide their own Slicers if they want more control over the serialization process.
The serialization context is stored in a Banana object (names are still being decided). This holds a stack of Banana Slicers, one per object currently being serialized (i.e. one per node in the path from the root object to the object currently being serialized).
For example, suppose a class instance is being serialized, and this class chose to use a dictionary to hold its instance state. That dictionary holds a list of numbers in one of its values. The Banana Stack would hold the root slicer, an InstanceSlicer, a DictSlicer, and finally a ListSlicer.
(note: it might be possible to move the functionality of the Banana object entirely into the root slice (root slicer?)).
Upon unserialization, the Unbanana object holds this context. A stack of Unslicer objects handles incoming tokens. The Unbanana is responsible for tracking OPEN and CLOSE tokens, making sure a failure in an Unslicer doesn't cause a loss of synchronization. Unslicer methods may raise exceptions: these are caught by the Unbanana and cause the object currently being unserialized to fail: its parent gets an UnbananaFailure instead of the dict or list or instance that it would normally have received.
The stack is used to determine three things: which objects are acceptable (Tasting), which Slicer should handle each new object, and the path from the root object to the one currently being serialized. Each is described below.
The default case puts a Parent slice at the bottom of the stack. This can also be interpreted as a root object, if you imagine that any given user object being serialized is somehow a child of the overall serialization context. In PB, for example, the root object would be related to the connection.
In addition, the stack can be queried to find out what path leads from the root object to the one currently being serialized. If something goes wrong in the serialization process (an exception is thrown), this path can make it much easier to find out when the trouble happened, as opposed to merely where. Knowing which method of your FooObject failed during serialization isn't very useful when you have 500 of them inside your data structure and you need to know whether it was bar.thisfoo or bar.thatfoo which caused the problem. To this end, each Slicer has a .describe method which is supposed to return a short string that explains how to get to the child node currently being processed. When an error occurs, these strings are concatenated together and put into the failure object.
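For instance, the path string might be assembled like this (the class and joining convention are invented for illustration; only .describe itself comes from the text):

```python
class StubSlicer:
    # Stand-in for a real Slicer: .describe explains how to reach the
    # child currently being processed.
    def __init__(self, description):
        self.description = description
    def describe(self):
        return self.description

# the Slicer stack from the root down to the object that failed:
stack = [StubSlicer("<RootSlicer>"), StubSlicer("bar"), StubSlicer("thisfoo")]
where = ".".join(s.describe() for s in stack)
# 'where' says *when* the failure happened: '<RootSlicer>.bar.thisfoo'
```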
The Parent slice is meant to provide the default behavior for the stack. The default class currently does the following:
TODO: The idea is to let other serialization contexts do other things. The tokens should probably go to the parent slice for handling: turning into bytes and sending over a wire, saving to a file, etc. Having the whole stack participate in the Tasting process means that objects can put restrictions on what is sent on their behalf: objects could refuse to let certain classes be sent as part of their instance state.
Serialization starts with the Parent Slicer being asked to serialize the given object. The Parent gives the object to Banana. Banana starts by walking the stack (which, of course, has only the Parent on it at that point), calling the .taste method for each Slicer there. If any of them have a problem with the object being serialized, they express it by raising an exception (TODO: which one? InsecureBanana?).
If the Taster stack passes the object, Banana's next job is to find a new Slicer to handle the object. It does this by walking the stack, calling .newSlicer on each slice. The first one that returns an object ends the search. In most cases, this is the Parent slice, which just looks up the type() of the object in the SlicerRegistry. A type which does not have a Slicer registered for it will cause an exception to be raised here.
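The search can be sketched as follows (the class names and registry contents are hypothetical stand-ins):

```python
class ParentSlicer:
    # The Parent slice just looks up type(obj) in the SlicerRegistry.
    registry = {list: "ListSlicer", dict: "DictSlicer"}  # stand-in entries
    def newSlicer(self, obj):
        try:
            return self.registry[type(obj)]
        except KeyError:
            raise TypeError("no Slicer registered for %r" % type(obj))

def find_slicer(stack, obj):
    # Walk the stack top-first; the first slice that returns an object
    # ends the search.
    for sl in reversed(stack):
        found = sl.newSlicer(obj)
        if found is not None:
            return found

find_slicer([ParentSlicer()], [1, 2])   # → 'ListSlicer'
```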
The new Slicer is pushed on the stack. It is then sent three methods in succession: .start, .slice, and .finish. start defaults to registering the object with setRefID and sending an appropriate OPEN token. slice is defined on a per-Slicer basis to send all the necessary tokens. finish sends the CLOSE token.
Banana keeps strict track of the nesting level. For safety, each OPEN gets a sequence number so it can be matched with its CLOSE token. If a Slicer's .close method fails to send the close token, very bad things will happen (in general, all further objects will become children of the one that didn't CLOSE properly). The sequence numbers are an attempt to minimize the damage.
The Unbanana object has a corresponding stack of Banana Unslicer objects. Each one receives the tokens emitted by the matching Slicer on the sending side. The whole stack is used to determine new Unslicer objects, perform Tasting of incoming tokens, and manage object references.
OPEN tokens have a string (or short list of tokens.. see below) to indicate what kind of object is being started. This is looked up in the UnbananaRegistry just like object types are looked up in the BananaRegistry. The new Unslicer is pushed onto the stack.
ABORT tokens indicate that something went wrong on the sending side and that the current object is to be aborted. It causes the receiver to ignore all tokens until the CLOSE token which closes the current node. This is implemented by replacing the top-most slice with a DiscardUnslicer.

CLOSE tokens finish the current node. The slice will pass its completed object up to the childFinished method of its parent.
To be precise, OPEN tokens are followed by an arbitrary list of other tokens which are used to determine which UnslicerFactory should be invoked to create the new Unslicer. Basic Python types are designated with a simple string, like (OPEN list) or (OPEN dict), but instances are serialized with two strings (OPEN instance classname), and various exotic PB objects like method calls may involve a list of strings and numbers (OPEN call reqID objID methodname). The unbanana code works with the unslicer stack's designated opener object to apply constraints to these indexing tokens and finally obtain the new Unslicer when enough indexing tokens have been received.
The reason for assembling this list before creating the Unslicer (instead of using a generic InstanceUnslicer which switches behavior depending upon its first received token) is to support classes or PB methods which wish to push custom Unslicers to handle their deserialization process. For example, a class could push a StreamingFileUnslicer that accepts a series of string tokens and appends their contents to a file on disk. This Unslicer could reduce memory consumption (by only holding one chunk at a time) and update some kind of progress indicator as the data arrives. This particular feature was provided by the old StringPager utility, but custom Unslicers offer more flexibility and better efficiency (no additional round-trips).
(note: none of this affects the serialization side: those Slicers emit both their indexing tokens and their state tokens. It is only the receiving side where the index tokens are handled by a different piece of code than the content tokens).
In yet greater detail:
Each OPEN sequence is divided into an Index phase and a Contents phase. The first one (or two or three) tokens are the Index Tokens and the rest are the Content Tokens. The sequence ends with a CLOSE token.
If .inOpen is True, each new token type will be passed (through Banana.getLimit and top.openerCheckToken) to the opener's .openerCheckToken method, along with the current opentype tuple. The opener gets to decide if the token is acceptable (possibly raising a BananaError exception), and may return a length limit (usually for strings). Note that the opener does not maintain state about what phase the decoding process is in, so it may want to condition its response upon the length of the opentype.
After each index token is complete, it is appended to .opentype, then the list is passed (through Banana.handleOpen, top.doOpen, and top.open) to the opener's .open method. This can either return an Unslicer (which will finish the index phase: all further tokens will be sent to the new Unslicer), return None (to continue the index phase), raise a Violation (which causes an UnbananaFailure to be passed to the current top unslicer), or raise another exception (which causes the connection to be abandoned).
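The opener's .open contract can be sketched like this (the opentypes and return values are illustrative stand-ins, not the real registry):

```python
class StubOpener:
    def open(self, opentype):
        # Return an Unslicer to end the Index Phase, return None to wait
        # for more index tokens, or raise to reject the sequence.
        if opentype == ("list",):
            return "ListUnslicer"          # stand-in for a real Unslicer
        if opentype[0] == "instance":
            if len(opentype) == 1:
                return None                # still waiting for the classname
            return "InstanceUnslicer:" + opentype[1]
        raise KeyError("unknown opentype %r" % (opentype,))

opener = StubOpener()
opener.open(("instance",))         # → None: index phase continues
opener.open(("instance", "Foo"))   # → 'InstanceUnslicer:Foo'
```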
Each Unslicer handles a single OPEN sequence, which starts with an OPEN token and ends with a CLOSE token (or an ABORT token). Acceptance of the OPEN token simply sets a flag to indicate that we are in the Index Phase. (The OPEN token might not be accepted: it is submitted to checkToken for approval first, as described below). During the Index Phase, all tokens are appended to the current opentype list and handed as a tuple to the top-most Unslicer's doOpen method.
This method can do one of the following things: it can return a new Unslicer object, which is pushed on top of the stack and has its .start method called (described below), ending the Index Phase; it can return None to remain in the Index Phase and wait for more index tokens; or it can raise a Violation.

When a new Unslicer object is pushed on the top of the stack, it has its .start method called, in which it has an opportunity to create whatever internal state is necessary to record the incoming content tokens. Each created object will have a separate Unslicer instance. (TODO: could optimize with singleton Unslicer objects, using start/finish methods to do cleanup). The start method can run normally, or raise a Violation exception, or some other exception.
This Unslicer is responsible for all incoming tokens until either 1: it pushes a new one on the stack, or 2: it receives a CLOSE token.
Each token starts with a length sequence, up to 64 bytes which are turned into an integer. This is followed by a single type byte, distinguished from the length bytes by having the high bit set (the type byte is always 0x80 or greater). When the type byte is received, the topmost Unslicer is asked about its suitability by calling the .checkToken method. (note that CLOSE and ABORT tokens are always legal, and are not submitted to checkToken). This method is expected to do one of the following:
- Accept the token, optionally returning a maximum body size sizelimit. If the token body is larger than sizelimit, the token will be rejected. This is used for STRING, LONGINT, and LONGNEG tokens.
- Raise a Violation exception to reject the token. This will cause the remainder of the current OPEN sequence to be discarded (all tokens through the matching CLOSE token). Unslicers should raise this if their constraints will not accept the incoming object: for example a constraint which is expecting a series of integers can accept INT/NEG/LONGINT/LONGNEG tokens and reject OPEN/STRING/VOCAB/FLOAT tokens. The topmost Unslicer (the same one which raised Violation) will receive (through its .receiveChild method) an UnbananaFailure object which encapsulates the reason for the rejection.

If the token sequence is in the index phase (i.e. it is just after an OPEN token and a new Unslicer has not yet been pushed), then instead of .checkToken the top unslicer is sent .openerCheckToken. This method behaves just like checkToken, but in addition to the type byte it is also given the opentype list (which is built out of all the index tokens received during this index phase).
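A checkToken for the integers-only constraint described above might look like this (token types are shown as strings rather than real type bytes, and the size limit is arbitrary):

```python
class Violation(Exception):
    pass

class IntegerConstraintUnslicer:
    # Accepts only integer-bearing tokens; anything else is rejected by
    # raising Violation, discarding the rest of the OPEN sequence.
    ACCEPTED = {"INT", "NEG", "LONGINT", "LONGNEG"}
    def checkToken(self, typebyte):
        if typebyte not in self.ACCEPTED:
            raise Violation("expected an integer token, got %s" % typebyte)
        return 64   # arbitrary size limit for LONGINT/LONGNEG bodies

un = IntegerConstraintUnslicer()
un.checkToken("INT")   # → 64: token accepted, with a body size limit
```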
If the type byte is accepted, and the size limit is obeyed, then the rest of the token is read and a finished (primitive) object is created: a string, number, boolean, or None. This object is handed to the topmost Unslicer's .receiveChild method, where again it has a few options: it can accept the object (usually by adding it to its internal state), or it can reject it by calling self.abort, which does protocol.abandonUnslicer.

If the child is handed an UnbananaFailure object, and it wishes to pass it upwards to its parent, then self.abort is the appropriate thing to do. Raising an exception will accomplish the same thing, but with a new UnbananaFailure that describes the exception raised here instead of the one raised by a child object. It is bad to both call abort and raise an exception.
When the CLOSE token arrives, the Unslicer will have its .receiveClose method called. This is expected to return the object created by the just-completed sequence (or a Deferred); that object will be delivered to the .receiveChild of the parent Unslicer. After receiveClose has finished, the child is told to clean up by calling its .finish method. This can complete normally or raise a Violation, BananaError, or something else.
Then, the old top-most Unslicer is popped from the stack and discarded. Its parent is now the new top-most Unslicer, and the newly-unserialized object is given to it with the .receiveChild method. Note that this method is used to deliver both primitive objects (from raw tokens) and composite objects (from other Unslicers).
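Putting the lifecycle together, a minimal list unslicer could look like this (a simplification: no checkToken, aborts, or Deferreds, and the class name is invented):

```python
class MiniListUnslicer:
    def start(self):
        # create the internal state that will record content tokens
        self.list = []
    def receiveChild(self, obj):
        # called for both primitive children and completed sub-objects
        self.list.append(obj)
    def receiveClose(self):
        # return the finished object, to be given to our parent
        return self.list

un = MiniListUnslicer()
un.start()
un.receiveChild("foo")
un.receiveChild((1, 2))
un.receiveClose()   # → ['foo', (1, 2)]
```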
Schemas are enforced by Constraint objects which are given an opportunity to pass judgement on each incoming token. When they do not like something they are given, they respond by raising a Violation exception. The Violation exception is sometimes created with an argument that describes the reason for the rejection, but frequently it is just a bare exception. Most Violations are raised by the checkOpentype and checkObject methods of the various classes in schema.py.
Exceptions which occur in an Unslicer (whether Violations or something more serious) can be confined to a single sub-tree of the object graph. The object being deserialized (and all of its children) is abandoned, and all remaining tokens for that object are discarded. However, the parent object (to which the abandoned object would have been given) gets to decide what happens next: it can either fail itself, or absorb the failure (much like an exception handler can choose to re-raise the exception or eat it).
Confinable exceptions are wrapped in an UnbananaFailure object. The UnbananaFailure behaves like a regular twisted.python.failure.Failure object, except that it has an attribute named .where which indicates the object-graph pathname where the problem occurred. The UnbananaFailure is passed up the Unslicer stack in lieu of an actual object. Most Unslicers have code in their receiveChild methods to detect an UnbananaFailure and trigger an abort, which causes all further tokens of the sub-tree to be discarded. The connection is not dropped. Unslicers which partition their children's sub-graphs (like the PBRootUnslicer, for which each child is a separate operation) can simply ignore the UnbananaFailure, or respond to it by sending an error message to the other end.
Other exceptions may occur during deserialization. Unrecognized opentypes will probably cause a KeyError when the opener attempts to map them to an Unserializer class. Coding errors could cause unexpected exceptions. If these occur in the context of an Unslicer, they are wrapped in UnbananaFailures and handled just like Violations. The only difference is that non-Violations are always logged with log.err (including a traceback), whereas Violations are passed through the Banana.logViolation method, which may or may not log it (and will probably not include a traceback).
Exceptions which occur outside of an Unslicer indicate coding errors or severe protocol violations and cause the connection to be dropped. The exception is logged on the local side with log.err, but the remote end will not be given any reason for the disconnection. Protocol violations are indicated with a BananaError exception.
The Banana object can also choose to respond to Violations by terminating the connection. For example, the .hangupOnLengthViolation flag causes string-too-long violations to be raised directly instead of being handled, which will cause the connection to be dropped (as it occurs in the dataReceived method).
The serialized form of ["foo",(1,2)] is the following token sequence: OPEN STRING(list) STRING(foo) OPEN STRING(tuple) INT(1) INT(2) CLOSE CLOSE. In practice, the STRING(list) would really be something like VOCAB(7); likewise the STRING(tuple) might be VOCAB(8). Here we walk through how this sequence is processed.
The initial Unslicer stack consists of the single RootUnslicer rootun.
OPEN
  rootun.checkToken(OPEN) : must not raise Violation
  enter index phase
VOCAB(7) (equivalent to STRING(list))
  rootun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up in .incomingVocabulary, turned into "list"
  rootun.doOpen(("list",)) : looks in UnslicerRegistry, returns ListUnslicer
  exit index phase
  the ListUnslicer is pushed on the stack
  listun.start()
STRING(foo)
  listun.checkToken(STRING) : must return None or 3 or greater
  string is assembled
  listun.receiveChild("foo") : appends to list
OPEN
  listun.checkToken(OPEN) : must not raise Violation
  enter index phase
VOCAB(8) (equivalent to STRING(tuple))
  listun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up, turned into "tuple"
  listun.doOpen(("tuple",)) : delegates through BaseUnslicer.open to
    self.opener (usually the RootUnslicer); self.opener.open(("tuple",))
    returns TupleUnslicer
  exit index phase
  TupleUnslicer is pushed on the stack
  tupleun.start()
INT(1)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(1) : appends to list
INT(2)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(2) : appends to list
CLOSE
  tupleun.receiveClose() : creates and returns the tuple (1,2)
    (could also return a Deferred)
  TupleUnslicer is popped from the stack and discarded
  listun.receiveChild((1,2))
CLOSE
  listun.receiveClose() : creates and returns the list ["foo", (1,2)]
  ListUnslicer is popped from the stack and discarded
  rootun.receiveChild(["foo", (1,2)])
Types and classes are roughly classified into containers and non-containers. The containers are further divided into mutable and immutable. Some examples of immutable containers are tuples and bound methods. Lists and dicts are mutable containers. Ints and strings are non-containers. Non-containers are always leaf nodes in the object graph.
During unserialization, objects are in one of three states: uncreated, referenceable (but not complete), and complete. Only mutable containers can be referenceable but not complete: immutable containers have no intermediate referenceable state.
Mutable containers (like lists) are referenceable but not complete during traversal of their child nodes. This means those children can reference the list without trouble.
Immutable containers (like tuples) present challenges when unserializing. The object cannot be created until all its components are referenceable. While it is guaranteed that these component objects will be complete before the graph traversal exits the current node, the child nodes are allowed to reference the current node during that traversal. The classic example is the graph created by the following Python fragment:
a = ([],)
a[0].append((a,))
To handle these cases, the TupleUnslicer installs a Deferred into the object table when it begins unserializing (in the .start method). When the tuple is finally complete, the object table is updated and the Deferred is fired with the new tuple.
Containers (both mutable and immutable) are required to pay attention to the types of their incoming children and notice when they receive Deferreds instead of normal objects. These containers are not complete (in the sense described above) until those Deferreds have been replaced with referenceable objects. When the container receives the Deferred, it should attach a callback to it which will perform the replacement. In addition, immutable containers should check after each update to see if all the Deferreds have been cleared, and if so, complete their own object (and fire their own Deferreds so any containers they are a child of may be updated and/or completed).
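The placeholder dance can be sketched with a toy stand-in for twisted.internet.defer.Deferred (the real TupleUnslicer logic is more involved, and this simplified cycle is the a[0].append(a) variant of the example above):

```python
class ToyDeferred:
    # just enough of a Deferred for this illustration
    def __init__(self):
        self._callbacks = []
    def addCallback(self, cb):
        self._callbacks.append(cb)
    def callback(self, result):
        for cb in self._callbacks:
            cb(result)

# TupleUnslicer.start: install a placeholder for the not-yet-built tuple.
placeholder = ToyDeferred()

# A mutable container that received the placeholder as a child attaches
# a callback to swap in the real object once the tuple is complete.
inner_list = [placeholder]
placeholder.addCallback(lambda t: inner_list.__setitem__(0, t))

# Later, all components are referenceable, so the tuple is created and
# the Deferred fires, updating every container that held the placeholder.
the_tuple = (inner_list,)
placeholder.callback(the_tuple)

inner_list[0] is the_tuple   # → True: the cycle is closed
```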
Having the whole Slicer stack participate in Tasting on the sending side seems to make a lot of sense. It might be better to have a way to push new Taster objects onto a separate stack. This would certainly help with performance, as the common case (where most Slicers ignore .taste) does a pointless method call to every Slicer for every object sent. The trick is to make sure that exception cases don't leave a taster stranded on the stack when the object that put it there has gone away.
On the receiving side, each object has a corresponding .taste method, which receives tokens instead of complete objects. This makes sense, because you want to catch the dangerous data before it gets turned into an object, but tokens are a pretty low-level place to do security checks. It might be more useful to have some kind of instance taster stack, with tasters that are asked specifically about (class, state) pairs and whether they should be turned into objects or not.
Because the Unslicers receive their data one token at a time, things like InstanceUnslicer can perform security checks one attribute at a time. traits-style attribute constraints (see the Chaco project or the PyCon-2003 presentation for details) can be implemented by having a per-class dictionary of tests that attribute values must pass before they will be accepted. The instance will only be created if all attributes fit the constraints. The idea is to catch violations before any code is run on the receiving side. Typical checks would be things like ".foo must be a number", ".bar must not be an instance", ".baz must implement the IBazzer interface".
Using the stack instead of a single Taster object means that the rules can be changed depending upon the context of the object being processed. A class that is valid as the first argument to a method call may not be valid as the second argument, or inside a list provided as the first argument. The PBMethodArgumentsUnslicer could change the way its .taste method behaves as its state machine progresses through the argument list.
There are several different ways to implement this Taster stack:
One approach would be to let classes which claim "I'm sure this is safe" override higher-level paranoia.
Of course, all this holds true for the sending side as well. A Slicer could enforce a policy that no objects of type Foo will be sent while it is on the stack.
It is anticipated that something like the current Jellyable/Unjellyable classes will be created to offer control over the Slicer/Unslicers used to handle instance of that class.
One eventual goal is to allow PB to implement E-like argument constraints.
The big change from the old Jelly scheme is that now serialization/unserialization is done in a more streaming format. Individual tokens are the basic unit of information. The basic tokens are just numbers and strings: anything more complicated (starting at lists) involves composites of other tokens.
The serialization side will be reworked to be a bit more producer-oriented. Objects should be able to defer their serialization temporarily (TODO: really??) like twisted.web resources can do NOT_DONE_YET right now. The big goal here is that large objects which can't fit into the socket buffers should not consume lots of memory, sitting around in a serialized state with nowhere to go. This must be balanced against the confusion caused by time-distributed serialization. PB method calls must retain their current in-order execution, and it must not be possible to interleave serialized state (big mess).
To actually accomplish this, objects should be able to provide their own Slicers. It may be convenient to do this entirely with Adapters, so Banana registers ListSlicer, etc as an ISlicer adapter for the fundamental types; pb.Copyable implements ISlicer by having a method to return a new Slicer; etc.
Likewise on the receiving end, the Unslicer is created when the OPEN token is received, and then receives all the tokens destined for that object.
Another goal of the Jelly+Banana->JustBanana change is the hope of writing Slicers and Unslicers in C. The CBanana module should have C objects (structs with function pointers) that can be looked up in a registry table and run to turn python objects into tokens and vice versa. This ought to be faster than running python code to implement the slices, at the cost of less flexibility. It would be nice if the resulting tokens could be sent directly to the socket at the C level without surfacing into python; barring this it is probably a good idea to accumulate the tokens into a large buffer so the code can do a few large writes instead of a gazillion small ones.
It ought to be possible to mix C and Python slices here: if the C code doesn't find the slice in the table, it can fall back to calling a python method that does a lookup in an extensible registry.
Random notes and wild speculations: take everything beyond here with two grains of salt
The oldbanana usage model has the layer above banana written in one of two ways. The simple form is to use the banana.encode and banana.decode functions to turn an object into a bytestream. This is used by twisted.spread.publish. The more flexible model is to subclass Banana. The largest example of this technique is, of course, twisted.spread.pb.Broker, but others which use it are twisted.trial.remote and twisted.scripts.conch (which appears to use it over unix-domain sockets).
Banana itself is a Protocol. The Banana subclass would generally override the expressionReceived method, which receives s-expressions (lists of lists). These are processed to figure out what method should be called, etc. (processing which only has to deal with strings, numbers, and lists). Then the serialized arguments are sent through Unjelly to produce actual objects.
On output, the subclass usually calls self.sendEncoded with some set of objects. In the case of PB, the arguments to the remote method are turned into s-expressions with jelly, then combined with the method meta-data (object ID, method name, etc), and the whole request is passed to sendEncoded.
Newbanana moves the Jelly functionality into a stack of Banana Slicers, and the lowest-level token-to-bytestream conversion into the new Banana object. Instead of overriding expressionReceived, users can push a different root Unslicer to get more control over the receive process.
Currently, Slicers call Banana.sendOpen/sendToken/sendClose/sendAbort, which then creates bytes and does transport.write. To move this into C, the transport should get to call CUnbanana.receiveToken. There should be CBananaUnslicers. Probably a parent.addMe(self) instead of banana.stack.append(self), maybe addMeC for the C unslicer.
The Banana object is a Protocol, and has a dataReceived method. (maybe in
some C form, data could move directly from a CTransport to a CProtocol). It
parses tokens and hands them to its Unslicer stack. The root Unslicer is
probably created at connectionEstablished time. Subclasses of Banana could
use different RootUnslicer objects, or the users might be responsible for
setting up the root unslicer.
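The token-dispatch loop described above might look something like this. This is an illustrative toy, not the real twisted implementation; all class and method names are assumptions, and only lists are handled:

```python
# Toy sketch of a Banana-like object handing parsed tokens to an
# Unslicer stack. OPEN pushes a child unslicer, CLOSE pops it and
# delivers the finished object to its parent, and leaf tokens go to
# whatever unslicer is on top of the stack.

class RootUnslicer:
    def __init__(self):
        self.objects = []            # completed top-level objects

    def receiveChild(self, obj):
        self.objects.append(obj)

class ListUnslicer:
    def __init__(self):
        self.list = []

    def receiveChild(self, obj):
        self.list.append(obj)

class ToyBanana:
    def __init__(self, root):
        self.stack = [root]          # root unslicer set up at connection time

    def tokenReceived(self, token):
        kind, value = token
        if kind == "OPEN":           # begin a nested sequence
            self.stack.append(ListUnslicer())
        elif kind == "CLOSE":        # finish it, hand result to the parent
            child = self.stack.pop()
            self.stack[-1].receiveChild(child.list)
        else:                        # leaf token (INT, STRING, ...)
            self.stack[-1].receiveChild(value)
```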
The Banana object is also created with a RootSlicer. Banana.writeToken
serializes the token and does transport.write . (a C form could have CSlicer
objects which hand tokens to a little CBanana which then hands bytes off to
a CTransport).
Doing the bytestream-to-Token conversion in C loses a lot of utility when
the conversion is done token at a time. It made more sense when a whole mess
of s-lists were converted at once.
All Slicers currently have a Banana pointer... maybe they should have a transport pointer instead? The Banana pointer is needed to get to the top of the stack.
Want to be able to unserialize lists/tuples/dicts/strings/ints (the "basic types") without surfacing into python; want to deliver the completed object to a python function.
It would be neat if a method could indicate that it would like to receive its arguments in a streaming fashion. This would involve calling the method early (as soon as the objectID and method name were known), then somehow feeding objects to it as they arrive. The object could return a handler or consumer sub-object which would be fed as tokens arrive over the wire. This consumer should have a way to enforce a constraint on its input.
This consumer object sounds a lot like an Unslicer, so maybe the method schema should indicate that the method will would like to be called right away so it can return an Unslicer to be pushed on the stack. That Unslicer could do whatever it wanted with the incoming tokens, and could enforce constraints with the usual checkToken/doOpen/receiveChild/receiveClose methods.
On the sending side, it would be neat to let a callRemote() invocation provide a Producer or a generator that will supply data as the network buffer becomes available. This could involve pushing a Slicer. Maybe Slicers should be generators.
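The "Slicers as generators" idea could look like the sketch below: the slicer yields tokens only as they are pulled, so a large object never sits fully serialized in memory. The pull-driven `pump` loop is a stand-in for the transport asking for data as socket-buffer space opens up:

```python
# A slicer written as a generator: tokens are produced lazily, one per
# pull, rather than serializing the whole object up front.

def list_slicer(items):
    yield ("OPEN", "list")
    for item in items:
        yield ("INT", item)      # a real slicer would recurse for sublists
    yield ("CLOSE", None)

def pump(slicer, room):
    """Pull at most `room` tokens, as if the buffer had that much room."""
    out = []
    for _ in range(room):
        try:
            out.append(next(slicer))
        except StopIteration:
            break
    return out
```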
The basic python types are considered "safe": the code which is invoked by their receipt is well-understood and there is no way to cause unsafe behavior during unserialization. Resource consumption attacks are mitigated by Constraints imposed by the receiving schema.
Note that the OPEN(dict) slicer is implemented with code that sorts the list of keys before serializing them. It does this to provide deterministic behavior and make testing easier.
Basic Python Types | |
IntType, LongType (small+) | INT(value) |
IntType, LongType (small-) | NEG(value) |
IntType, LongType (large+) | LONGINT(value) |
IntType, LongType (large-) | LONGNEG(value) |
FloatType | FLOAT(value) |
StringType | STRING(value) |
StringType (tokenized) | VOCAB(tokennum) |
UnicodeType | OPEN(unicode) STRING(str.encode('UTF-8')) CLOSE |
ListType | OPEN(list) elem.. CLOSE |
TupleType | OPEN(tuple) elem.. CLOSE |
DictType, DictionaryType | OPEN(dict) (key,value).. CLOSE |
NoneType | OPEN(none) CLOSE |
BooleanType | OPEN(boolean) INT(0/1) CLOSE |
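A toy encoder following the table above makes the shapes concrete. It emits OPEN/CLOSE sequences as plain tuples rather than wire bytes, and omits VOCAB, FLOAT, NEG and the long-integer forms for brevity; it is an illustration, not the real slicer code:

```python
# Map basic Python values to the token sequences from the table.

def tokens_for(obj):
    if isinstance(obj, bool):        # must check bool before int
        return [("OPEN", "boolean"), ("INT", int(obj)), ("CLOSE", None)]
    if isinstance(obj, int):
        return [("INT", obj)]
    if isinstance(obj, str):
        return [("STRING", obj)]
    if obj is None:
        return [("OPEN", "none"), ("CLOSE", None)]
    if isinstance(obj, (list, tuple)):
        opentype = "list" if isinstance(obj, list) else "tuple"
        out = [("OPEN", opentype)]
        for elem in obj:
            out.extend(tokens_for(elem))
        out.append(("CLOSE", None))
        return out
    if isinstance(obj, dict):
        out = [("OPEN", "dict")]
        for key in sorted(obj):      # sorted keys, for determinism
            out.extend(tokens_for(key))
            out.extend(tokens_for(obj[key]))
        out.append(("CLOSE", None))
        return out
    raise KeyError("no slicer for %r" % type(obj))
```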
To serialize arbitrary python object graphs (including instances) requires that we allow more types in. This begins to get dangerous: with complex graphs of inter-dependent objects, instances may need to be used (by referencing objects) before they are fully initialized. A schema can be used to make assertions about what object types live where, but in general the contents of those objects are difficult to constrain.
For this reason, these types should only be used in places where you trust the creator of the serialized stream (the same places where you would be willing to use the standard Pickle module). Saving application state to disk and reading it back at startup time is one example.
Extended (unsafe) Python Types | |
InstanceType | OPEN(instance) STRING(reflect.qual(class)) (attr,value).. CLOSE |
ModuleType | OPEN(module) STRING(__name__) CLOSE |
ClassType | OPEN(class) STRING(reflect.qual(class)) CLOSE |
MethodType | OPEN(method) STRING(__name__) im_self im_class CLOSE |
FunctionType | OPEN(function) STRING(module.__name__) CLOSE |
The following sequences are used to support the RPC mechanisms of Perspective Broker.
PB (method call) Sequences | |
pb.Referenceable | OPEN(remote) INT(ref-id) [InterfaceList] CLOSE |
RemoteReference.__del__ | OPEN(decref) INT(refID) CLOSE |
pb.Copyable | see InstanceType above |
method call | OPEN(call) INT(request-id) INT(obj-id) STRING(methodname) (STRING(argname),argvalue).. CLOSE |
method call++ | OPEN(call2) INT(reqID) INT(objID) STRING(interfacename) INT(interfaceversion) STRING(methodname) (STRING(argname),argvalue).. CLOSE (this could be folded into 'call' with only 2 bytes of overhead) |
method response (success) | OPEN(answer) INT(request-id) value CLOSE |
method response (exception) | OPEN(error) INT(request-id) value CLOSE |
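As a worked example of the 'call' and 'answer' rows, a call like `remote.callRemote("add", a=1, b=2)` might produce roughly the following token sequences. The request id (5), object id (12), and return value are made up for illustration, and tokens are shown as tuples rather than wire bytes:

```python
# Token shapes for one PB method call and its success response.

call_tokens = [
    ("OPEN", "call"),
    ("INT", 5),             # request-id, chosen by the caller
    ("INT", 12),            # obj-id of the target pb.Referenceable
    ("STRING", "add"),      # methodname
    ("STRING", "a"), ("INT", 1),   # (argname, argvalue) pairs
    ("STRING", "b"), ("INT", 2),
    ("CLOSE", None),
]

answer_tokens = [
    ("OPEN", "answer"),
    ("INT", 5),             # matches the request-id above
    ("INT", 3),             # the return value
    ("CLOSE", None),
]
```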
The first time a pb.Referenceable is sent, the second object is an InterfaceList, which is a list of tuples of (interfacename, versionnum), and is therefore constrainable by a schema of ListOf(TupleOf(str,int)) with some appropriate maximum-length restrictions. This InterfaceList describes all the Interfaces that the corresponding pb.Referenceable implements. The receiver uses this list to look up local Interfaces (and therefore Schemas) to attach to the pb.RemoteReference. This is how method schemas are checked on the sender side.
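A receiver-side check equivalent to ListOf(TupleOf(str,int)) might be sketched like this. The specific limits (30 interfaces, 200-character names) are invented for the example, and the function name is not part of any real API:

```python
# Validate an incoming InterfaceList against a ListOf(TupleOf(str, int))
# shaped constraint with maximum-length restrictions.

MAX_INTERFACES = 30
MAX_NAME_LEN = 200

def check_interface_list(obj):
    if not isinstance(obj, list) or len(obj) > MAX_INTERFACES:
        raise ValueError("not an acceptable InterfaceList")
    for item in obj:
        if (not isinstance(item, tuple) or len(item) != 2
                or not isinstance(item[0], str)
                or len(item[0]) > MAX_NAME_LEN
                or not isinstance(item[1], int)):
            raise ValueError("bad (interfacename, versionnum) pair: %r" % (item,))
    return obj
```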
This implies that Interfaces must be registered, just as classes are for pb.Copyable. TODO: what happens if an unknown Interface is received? TODO: are multiple versions of the same interface allowed? If so, how is that specified?
The following types are not handled by any slicer, and will raise a KeyError if one is referenced by an object being sliced. This technically imposes a limit upon the kinds of objects that can be serialized, even by an "unsafe" serializer, but in practice it is not really an issue, as many of these objects have no meaning outside the program invocation which created them.
ObjectType is the root class of all other types. All objects are known by some other type in addition to ObjectType, so the fact that it is not handled explicitly does not matter.
StringTypes is simply a list of StringType and UnicodeType, so it does not need to be explicitly handled either.
The following sequences are internal.
The OPEN(vocab) sequence is used to update the forward compression token-to-string table used by the VOCAB token. It is followed by a series of number/string pairs. All numbers that appear in VOCAB tokens must be associated with a string by appearing in the most recent OPEN(vocab) sequence.
internal types | |
vocab dict | OPEN(vocab) (num,string).. CLOSE |
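The receiver side of the VOCAB mechanism might be sketched as follows. The function names and the in-memory table are illustrative only; the point is that each OPEN(vocab) sequence wholesale replaces the number-to-string table that later VOCAB tokens index into:

```python
# Receiver-side sketch of VOCAB forward compression: an OPEN(vocab)
# sequence establishes number->string mappings, after which those
# strings can arrive as single VOCAB(num) tokens.

vocab = {}   # number -> string, from the most recent OPEN(vocab) sequence

def receive_vocab_sequence(pairs):
    # OPEN(vocab) (num,string).. CLOSE replaces the current table
    vocab.clear()
    vocab.update(pairs)

def receive_token(token):
    kind, value = token
    if kind == "VOCAB":
        # the number must have appeared in the latest vocab sequence
        return vocab[value]
    return value

receive_vocab_sequence([(0, "callRemote"), (1, "answer")])
```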