-*- outline -*-

Reasonably independent newpb sub-tasks that need doing.

* implement schema.maxSize()

In newpb, schemas serve two purposes:

 a) make programs safer by reducing the surprises that can appear in their
    arguments (i.e. factoring out argument-checking in a useful way)

 b) remove memory-consumption DoS attacks by putting an upper bound on the
    memory consumed by any particular message.

Each schema has a pair of methods named maxSize() and maxDepth() which
provide this upper bound. While the schema is in effect (say, during the
receipt of a particular named argument to a remotely-invokable method), at
most X bytes and Y slicer frames will be in use before either the object is
accepted and processed or the schema notes the violation and the object is
rejected (whereupon the temporary storage is released and all further bytes
in the rejected object are simply discarded). Strictly speaking, the number
returned by maxSize() is the length of the longest wire string that has not
yet been rejected as violating the constraint, but it is also a reasonable
metric to describe how much internal storage must be used while processing
it. (Achieving greater accuracy would require knowing exactly how large each
Python type is, which is not a sensible thing to attempt.)

The idea is that someone who is worried about an attacker throwing a really
long string or an infinitely-nested list at them can ask the schema just what
exactly their current exposure is. The tradeoff between flexibility ("accept
any object whatsoever here") and exposure to DoS attack is then user-visible
and thus user-selectable.

To implement maxSize() for a basic schema (like a string), you simply need
to look at banana.xhtml and see how basic tokens are encoded (you will also
need to look at banana.py and see how deserialization is actually
implemented). For a schema.StringConstraint(32) (which accepts strings <= 32
characters in length), the largest serialized form that has not yet been
either accepted or rejected is:

  64 bytes (header indicating 0x000000..0020 with lots of leading zeros)
 + 1 byte (STRING token)
 + 32 bytes (string contents)
 = 97

If the header indicates a conforming length (<=32) then just after the 32nd
byte is received, the string object is created and handed up the stack, so
the temporary storage tops out at 97. If someone is trying to spam us with a
million-character string, the serialized form would look like:

  64 bytes (header indicating 1-million in hex, with leading zeros)
+  1 byte (STRING token)
= 65

at which point the receive parser would check the constraint, decide that
1000000 > 32, and reject the remainder of the object.

So (with the exception of pass/fail maxSize values, see below), the following
should hold true:

 schema.StringConstraint(32).maxSize() == 97
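
Something like this might do it (just a sketch, not the real implementation;
the maxLength attribute name is a guess, and UnboundedSchema is defined
inline only to keep the snippet self-contained):

  class UnboundedSchema(Exception):
      pass

  class StringConstraint:
      def __init__(self, maxLength=1000):
          self.maxLength = maxLength

      def maxSize(self, seen=None):
          # 'seen' is only needed by container constraints (see below)
          if self.maxLength is None:
              raise UnboundedSchema   # no length limit means no size bound
          # 64-byte header + 1-byte STRING token + the string contents
          return 64 + 1 + self.maxLength

  assert StringConstraint(32).maxSize() == 97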

Now, schemas which represent containers have size limits that are the sum of
their contents, plus some overhead (and a stack level) for the container
itself. For example, a list of two small integers is represented in newbanana
as:

 OPEN(list)
  INT
  INT
 CLOSE()

which really looks like:

 opencount-OPEN
  len-STRING-"list"
  value-INT
  value-INT
 opencount-CLOSE

This sequence takes at most:

 opencount-OPEN: 64+1
 len-STRING-"list": 64+1+1000  (opentypes are confined to be <= 1k long)
 value-INT: 64+1
 value-INT: 64+1
 opencount-CLOSE: 64+1

or 5*(64+1)+1000 = 1325. More generally, for a list of N integers:

  3*(64+1)+1000 + N*(IntConstraint().maxSize())

So ListConstraint.maxSize is computed by doing some math involving the
.maxSize value of the objects that go into it (the ListConstraint.constraint
attribute). This suggests a recursive algorithm. If any constraint is
unbounded (say a ListConstraint with no limit on the length of the list),
then maxSize() raises UnboundedSchema to indicate that there is no limit on
the size of a conforming string. Clearly, if any constraint is found to
include itself, UnboundedSchema must also be raised.
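
The recursion might look something like this (again just a sketch: the
'seen' list used for cycle detection and the maxLength attribute name are
guesses):

  class ListConstraint:
      def __init__(self, constraint, maxLength=30):
          self.constraint = constraint      # per-item constraint
          self.maxLength = maxLength        # max number of items

      def maxSize(self, seen=None):
          if seen is None:
              seen = []
          if self in seen:
              raise UnboundedSchema         # constraint includes itself
          if self.maxLength is None:
              raise UnboundedSchema         # unbounded list length
          # opencount-OPEN + len-STRING-"list" + opencount-CLOSE
          overhead = 3 * (64 + 1) + 1000
          return (overhead
                  + self.maxLength * self.constraint.maxSize(seen + [self]))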

This is a loose upper bound. For example, one non-conforming input string
would be:

 opencount-OPEN: 64+1
 len-STRING-"x"*1000: 64+1+1000

The entire string would be accepted before checking to see which opentypes
were valid: the ListConstraint only accepts the "list" opentype and would
reject this string immediately after the 1000th "x" was received. So a
tighter upper bound would be 2*65+1000 = 1130.

In general, the bound is computed by walking through the deserialization
process and identifying the largest string that could make it past the
validity checks. There may be later checks that will reject the string, but
if it has not yet been rejected, then it still represents exposure for a
memory consumption DoS.

** pass/fail sizes

I started to think that it was necessary to have each constraint provide two
maxSize numbers: one for the largest sequence that could possibly be accepted
as valid, and a second for the largest sequence that could still be
undecided. This would provide a more accurate upper bound because most
containers will respond to an invalid object by abandoning the rest of the
container: i.e. if the current active constraint is:

 ListConstraint(StringConstraint(32), maxLength=30)

then the first thing that doesn't match the string constraint (say an
instance, or a number, or a 33-character string) will cause the ListUnslicer
to go into discard-everything mode. This makes a significant difference when
the per-item constraint allows opentypes, because the OPEN type (a string) is
constrained to 1k bytes. The item constraint probably imposes a much smaller
limit on the set of actual strings that would be accepted, so no
kilobyte-long opentype will possibly make it past that constraint. That means
there can only be one outstanding invalid object. So the worst case (maximal
length) string that has not yet been rejected would be something like:

  OPEN(list)
   validthing [0]
   validthing [1]
    ...
   validthing [n-1]
   long-invalid-thing

because if the long-invalid thing had been received earlier, the entire list
would have been abandoned.

This suggests that the calculation for ListConstraint.maxSize() really needs
to be like
  overhead
  +(len-1)*itemConstraint.maxSize(valid)
  +(1)*itemConstraint.maxSize(invalid)
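
One way to express that would be a flag on maxSize() (hypothetical, shown
here as a replacement for the ListConstraint.maxSize sketched above, with
cycle detection omitted for brevity):

  def maxSize(self, valid=True):
      overhead = 3 * (64 + 1) + 1000
      if valid:
          # largest fully-conforming list: every item valid
          return (overhead
                  + self.maxLength * self.constraint.maxSize(valid=True))
      # largest still-undecided list: (len-1) valid items followed by one
      # maximal item that has not yet been rejected
      return (overhead
              + (self.maxLength - 1) * self.constraint.maxSize(valid=True)
              + self.constraint.maxSize(valid=False))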

I'm still not sure about this. I think it provides a significantly tighter
upper bound. The deserialization process itself does not try to achieve the
absolute minimal exposure (i.e., the opentype checker could take the set of
all known-valid open types, compute the maximum length, and then impose a
StringConstraint with that length instead of 1000), because it is, in
general, an inefficient hassle. There is a tradeoff between computational
efficiency and removing the slack in the maxSize bound, both in the
deserialization process (where the memory is actually consumed) and in
maxSize (where we estimate how much memory could be consumed).

Anyway, maxSize() and maxDepth() (which is easier: containers add 1 to the
maximum of the maxDepth values of their possible children) need to be
implemented for all the Constraint classes. There are some tests (disabled)
in test_schema.py for this code: those tests assert specific values for
maxSize. Those values are probably wrong, so they must be updated to match
however maxSize actually works.
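
The depth side might look like this (still just a sketch, added to the
constraint classes sketched above):

  class StringConstraint:
      # __init__ and maxSize as sketched earlier
      def maxDepth(self, seen=None):
          return 1                          # a leaf uses one slicer frame

  class ListConstraint:
      # __init__ and maxSize as sketched earlier
      def maxDepth(self, seen=None):
          if seen is None:
              seen = []
          if self in seen:
              raise UnboundedSchema         # constraint includes itself
          # a container adds one level on top of its deepest child
          return 1 + self.constraint.maxDepth(seen + [self])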

* finish testing of LONGINT/LONGNEG

test_banana.InboundByteStream.testConstrainedInt needs implementation

* add constraints to Slicers (outbound constraints)

This is much easier than the inbound side. Each constraint just gets a single
checkObject() call with the candidate object before it is sent.
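
For example, the outbound check for StringConstraint could be as small as
this (a sketch: the Violation exception name is a guess, defined inline
here; the Slicer would simply call constraint.checkObject(obj) before
emitting any tokens for obj):

  class Violation(Exception):
      pass

  class StringConstraint:
      # constructor as sketched earlier
      def checkObject(self, obj):
          if not isinstance(obj, str):
              raise Violation("not a string")
          if self.maxLength is not None and len(obj) > self.maxLength:
              raise Violation("string too long: %d > %d"
                              % (len(obj), self.maxLength))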

* implement CopyableSlicer/CopyableUnslicer

This is pretty straightforward; the main issues are how to determine the
constraint and how to allow the use of Adapters.

* nail down some useful schema syntaxes

This has two parts: parsing something like a __schema__ class attribute (see
the sketches in schema.xhtml) into a tree of FooConstraint objects, and
deciding how to retrieve schemas at runtime from things like the object being
serialized or the object being called from afar. To be most useful, the
syntax needs to mesh nicely with (read: "be identical to") things like
formless and (maybe?) atop, or whatever has replaced the high-density,
highly-structured save-to-disk scheme that twisted.world used to provide.
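
Purely as an illustration of the first part (the attribute name and the
shorthand forms are placeholders for whatever syntax gets chosen):

  class Pet:
      __schema__ = {
          'name': str,                        # becomes StringConstraint()
          'age': int,                         # becomes IntConstraint()
          'toys': [StringConstraint(16)],     # list of short strings
          }

A parser would walk such a structure and turn each shorthand into the
corresponding FooConstraint object.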

Some lingering questions in this area:

 When an object has a remotely-invokable method, where does the appropriate
 MethodConstraint come from? Some possibilities:

  an attribute of the method itself: obj.method.__schema__

  from inside a __schema__ attribute of the object's class

  from inside a __schema__ attribute of an Interface (which?) that the object
  implements

 Likewise, when a caller holding a RemoteReference invokes a method on it, it
 would be nice to enforce a schema on the arguments they are sending to the
 far end ("be conservative in what you send"). Where should this schema come
 from? It is likely that the sender only knows an Interface for their
 RemoteReference.

 When PB determines that an object wants to be copied by value instead of by
 reference (pb.Copyable subclass, Copyable(obj), schema says so), where
 should it find a schema to define what exactly gets copied over? A class
 attribute of the object's class would make sense: most objects would do
 this, some could override jellyFor to get more control, and others could
 override something else to push a new Slicer on the stack and do streaming
 serialization. Whatever the approach, it needs to be paralleled by the
 receiving side's unjellyableRegistry.
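
To make the class-attribute option concrete (names are hypothetical, and the
pb import path is whatever newpb ends up using):

  class Point(pb.Copyable):
      # the copy-schema lives on the class; the receiving side would
      # register a matching entry in its unjellyableRegistry
      __schema__ = {
          'x': IntConstraint(),
          'y': IntConstraint(),
          }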

* decide upon what the "Shared" constraint should mean

The idea of this one was to avoid some vulnerabilities by rejecting arbitrary
object graphs. Fundamentally Banana can represent most anything (just like
pickle), including objects that refer to each other in exciting loops and
whorls. There are two problems with this: it is hard to enforce a schema that
allows cycles in the object graph (indeed it is tricky to even describe one),
and the shared references could be used to temporarily violate a schema.

I think these might be fixable (the sample case is where one tuple is
referenced in two different places, each with a different constraint, but the
tuple is incomplete until some higher-level node in the graph has become
referenceable, so [maybe] the schema can't be enforced until somewhat after
the object has actually finished arriving).

However, Banana is aimed at two different use-cases. One is kind of a
replacement for pickle, where the goal is to allow arbitrary object graphs to
be serialized but have more control over the process (in particular we still
have an unjellyableRegistry to prevent arbitrary constructors from being
executed during deserialization). In this mode, a larger set of Unslicers are
available (for modules, bound methods, etc), and schemas may still be useful
but are not enforced by default.

PB will use the other mode, where the set of conveyable objects is much
smaller, and security is the primary goal (including putting limits on
resource consumption). Schemas are enforced by default, and all constraints
default to sensible size limits (strings to 1k, lists to [currently] 30
items). Because complex object graphs are not commonly transported across
process boundaries, the default is to not allow any Copyable object to be
referenced multiple times in the same serialization stream, and to reject
both cycles and shared references in the object graph. Only strict trees are
allowed, which makes life easier (and safer) for the remote methods which are
handed this object tree.

The "Shared" constraint is intended as a way to turn off this default
strictness and allow the object to be referenced multiple times. The
outstanding question is what this should really mean: must it be marked as
such in all the places where it could be referenced, what is the scope of the
multiple-reference region (per-method-call, per-connection?), and finally
what should be done when the limit is violated. Currently Unslicers see an
Error object which they can respond to any way they please: the default
containers abandon the rest of their contents and hand an Error to their
parent, the MethodCallUnslicer returns an exception to the caller, etc. With
shared references, the first recipient sees a valid object, while the second
and later recipients see an error.
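
A concrete illustration of the default behavior (the method name and the
'rref' RemoteReference are made up):

  # the same list appears twice in one method call; under the default
  # no-sharing rule the second reference is rejected on the receiving
  # side unless the relevant schema marks that value as Shared
  shared = [1, 2, 3]
  d = rref.callRemote("compare", a=shared, b=shared)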


* figure out Deferred errors for immutable containers

Somewhat related to the previous one. The now-classic example of an immutable
container which cannot be created right away is the object created by this
sequence:

        t = ([],)
        t[0].append((t,))

This serializes into (with implicit reference numbers on the left):

[0] OPEN(tuple)
[1]  OPEN(list)
[2]   OPEN(tuple)
[3]    OPEN(reference #0)
      CLOSE
     CLOSE
    CLOSE

In newbanana, the second TupleUnslicer cannot return a fully-formed tuple to
its parent (the ListUnslicer), because that tuple cannot be created until the
contents are all referenceable, and that cannot happen until the first
TupleUnslicer has completed. So the second TupleUnslicer returns a Deferred
instead of a tuple, and the ListUnslicer adds a callback which updates the
list's item when the tuple is complete.
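
The child-handling part of the ListUnslicer might look roughly like this
(method names are illustrative, not necessarily the real Unslicer API):

  from twisted.internet.defer import Deferred

  class ListUnslicer:
      # token-receiving machinery omitted; only the child handling is
      # sketched here
      def __init__(self):
          self.list = []

      def receiveChild(self, obj):
          if isinstance(obj, Deferred):
              # placeholder now; the real value is dropped in later
              index = len(self.list)
              self.list.append(obj)
              obj.addCallback(self.update, index)
              obj.addErrback(self.childAborted)
          else:
              self.list.append(obj)

      def update(self, obj, index):
          # the incomplete tuple has become real: put it into place
          self.list[index] = obj
          return obj

      def childAborted(self, failure):
          # see below: a failure here must flunk the object just as a
          # synchronous violation would
          return failure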

The problem here is that of error handling. In general, if an exception is
raised (perhaps a protocol error, perhaps a schema violation) while an
Unslicer is active, that Unslicer is abandoned (all its remaining tokens are
discarded) and the parent gets an Error object. (The parent may give up too:
the basic Unslicers all behave this way, so any exception will cause
everything up to the RootUnslicer to go boom, and the RootUnslicer has the
option of dropping the connection altogether.) When the error is noticed, the
Unslicer stack is queried to figure out what path was taken from the root of
the object graph to the site that had an error. This is really useful when
trying to figure out which exact object caused a SchemaViolation: rather than
being told a call trace or a description of the *object* which had a problem,
you get a description of the path to that object (the same series of
dereferences you'd use to print the object: obj.children[12].peer.foo.bar).

When references are allowed, these exceptions could occur after the original
object has been received, when that Deferred fires. There are two problems:
one is that the error path is now misleading, the other is that it might not
have been possible to enforce a schema because the object was incomplete.

The most important thing is to make sure that an exception that occurs while
the Deferred is being fired is caught properly and flunks the object just as
if the problem were caught synchronously. This may involve discarding an
otherwise complete object graph and blaming the problem on a node much closer
to the root than the one which really caused the failure.

* constrain ReferenceUnslicer properly

The schema can use a ReferenceConstraint to indicate that the object must be
a RemoteReference, and can also require that the remote object be capable of
handling a particular Interface.

This needs to be implemented. slicer.ReferenceUnslicer must somehow actually
ask the constraint about the incoming tokens.

An outstanding question is "what counts". The general idea is that
RemoteReferences come over the wire as a connection-scoped ID number and an
optional list of Interface names (strings and version numbers). In this case
it is the far end which asserts that its object can implement any given
Interface, and the receiving end just checks to see if the schema-imposed
required Interface is in the list.
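
That receiving-end check might be as simple as this (a sketch: the attribute
and exception names are assumptions):

  # the wire delivers a connection-scoped ID plus an optional list of
  # claimed interface names; the check just looks for the schema's
  # required Interface in that list (version numbers ignored here)
  def checkInterfaces(constraint, claimedNames):
      wanted = constraint.interface            # assumed attribute
      if wanted is not None and wanted.__name__ not in claimedNames:
          raise Violation("remote object does not claim %s"
                          % wanted.__name__)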

This becomes more interesting when applied to local objects, or if a
constraint is created which asserts that its object is *something* (maybe a
RemoteReference, maybe a RemoteCopy) which implements a given Interface. In
this case, the incoming object could be an actual instance, but the class
name must be looked up in the unjellyableRegistry (and the class located, and
the __implements__ list consulted) before any of the object's tokens are
accepted.

* decide on a version negotiation scheme

* write oldbanana compatibility code?

An oldbanana peer can be detected because the server side sends its dialect
list from connectionMade, and oldbanana lists are sent with OLDLIST tokens
(the explicit-length kind).