-*- outline -*-

Reasonably independent newpb sub-tasks that need doing.

* implement schema.maxSize()

In newpb, schemas serve two purposes:

 a) make programs safer by reducing the surprises that can appear in
    their arguments (i.e. factoring out argument-checking in a useful
    way)

 b) remove memory-consumption DoS attacks by putting an upper bound
    on the memory consumed by any particular message

Each schema has a pair of methods named maxSize() and maxDepth() which
provide this upper bound. While the schema is in effect (say, during
the receipt of a particular named argument to a remotely-invokable
method), at most X bytes and Y slicer frames will be in use before
either the object is accepted and processed or the schema notes the
violation and the object is rejected (whereupon the temporary storage
is released and all further bytes in the rejected object are simply
discarded).

Strictly speaking, the number returned by maxSize() is the length of
the largest string on the wire which has not yet been rejected as
violating the constraint, but it is also a reasonable metric for how
much internal storage must be used while processing it. (Achieving
greater accuracy would require knowing exactly how large each Python
type is, which is not a sensible thing to attempt.)

The idea is that someone who is worried about an attacker throwing a
really long string or an infinitely-nested list at them can ask the
schema just what exactly their current exposure is. The tradeoff
between flexibility ("accept any object whatsoever here") and
exposure to DoS attack is then user-visible and thus user-selectable.

To implement maxSize() for a basic schema (like a string), you simply
need to look at banana.xhtml and see how basic tokens are encoded
(you will also need to look at banana.py and see how deserialization
is actually implemented). For a schema.StringConstraint(32) (which
accepts strings <= 32 characters in length), the largest serialized
form that has not yet been either accepted or rejected is:

  64 bytes (header indicating 0x000000..0020 with lots of leading zeros)
 + 1 byte  (STRING token)
 +32 bytes (string contents)
 = 97

If the header indicates a conforming length (<= 32), then just after
the 32nd byte is received, the string object is created and handed up
the stack, so the temporary storage tops out at 97. If someone is
trying to spam us with a million-character string, the serialized
form would look like:

  64 bytes (header indicating 1-million in hex, with leading zeros)
 + 1 byte  (STRING token)
 = 65

at which point the receive parser would check the constraint, decide
that 1000000 > 32, and reject the remainder of the object. So (with
the exception of pass/fail maxSize values, see below), the following
should hold true:

 schema.StringConstraint(32).maxSize() == 97

Now, schemas which represent containers have size limits that are the
sum of their contents, plus some overhead (and a stack level) for the
container itself. For example, a list of two small integers is
represented in newbanana as:

 OPEN(list) INT INT CLOSE()

which really looks like:

 opencount-OPEN
 len-STRING-"list"
 value-INT
 value-INT
 opencount-CLOSE

This sequence takes at most:

 opencount-OPEN:    64+1
 len-STRING-"list": 64+1+1000 (opentypes are confined to be <= 1k long)
 value-INT:         64+1
 value-INT:         64+1
 opencount-CLOSE:   64+1

or 5*(64+1)+1000 = 1325, or rather:

 3*(64+1)+1000 + N*(IntConstraint().maxSize())

So ListConstraint.maxSize is computed by doing some math involving
the .maxSize value of the objects that go into it (the
ListConstraint.constraint attribute), which suggests a recursive
algorithm.
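To make the recursion concrete, here is a minimal sketch. The class
names echo the ones used above, but the bodies and the
HEADER/TOKEN/OPENTYPE constants are just illustrations of the
arithmetic, not the real schema.py implementation:

 class UnboundedSchema(Exception):
     pass

 HEADER = 64      # worst-case length header, with leading zeros
 TOKEN = 1        # one type byte follows each header
 OPENTYPE = 1000  # opentype strings are confined to be <= 1k long

 class IntConstraint:
     def maxSize(self):
         return HEADER + TOKEN  # 65
     def maxDepth(self):
         return 1

 class StringConstraint:
     def __init__(self, maxLength=None):
         self.maxLength = maxLength
     def maxSize(self):
         if self.maxLength is None:
             raise UnboundedSchema  # no limit: unbounded exposure
         return HEADER + TOKEN + self.maxLength
     def maxDepth(self):
         return 1

 class ListConstraint:
     def __init__(self, constraint, maxLength=None):
         self.constraint = constraint
         self.maxLength = maxLength
     def maxSize(self):
         if self.maxLength is None:
             raise UnboundedSchema
         # opencount-OPEN + len-STRING-"list" + opencount-CLOSE,
         # then the items themselves
         overhead = 3 * (HEADER + TOKEN) + OPENTYPE
         return overhead + self.maxLength * self.constraint.maxSize()
     def maxDepth(self):
         # containers add 1 to the maximum depth of their children
         return 1 + self.constraint.maxDepth()

 # matches the arithmetic above:
 #  StringConstraint(32).maxSize() == 97
 #  ListConstraint(IntConstraint(), maxLength=2).maxSize() == 1325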
If any constraint is unbounded (say, a ListConstraint with no limit
on the length of the list), then maxSize() raises UnboundedSchema to
indicate that there is no limit on the size of a conforming string.
Clearly, if any constraint is found to include itself, UnboundedSchema
must also be raised.

This is a loose upper bound. For example, one non-conforming input
string would be:

 opencount-OPEN:      64+1
 len-STRING-"x"*1000: 64+1+1000

The entire string would be accepted before checking to see which
opentypes were valid: the ListConstraint only accepts the "list"
opentype and would reject this string immediately after the 1000th
"x" was received. So a tighter upper bound would be 2*65+1000 = 1130.

In general, the bound is computed by walking through the
deserialization process and identifying the largest string that could
make it past the validity checks. There may be later checks that will
reject the string, but if it has not yet been rejected, then it still
represents exposure to a memory-consumption DoS.

** pass/fail sizes

I started to think that it was necessary to have each constraint
provide two maxSize numbers: one for the largest sequence that could
possibly be accepted as valid, and a second for the largest sequence
that could still be undecided. This would provide a more accurate
upper bound because most containers will respond to an invalid object
by abandoning the rest of the container: i.e., if the current active
constraint is:

 ListConstraint(StringConstraint(32), maxLength=30)

then the first thing that doesn't match the string constraint (say an
instance, or a number, or a 33-character string) will cause the
ListUnslicer to go into discard-everything mode. This makes a
significant difference when the per-item constraint allows opentypes,
because the OPEN type (a string) is constrained to 1k bytes. The item
constraint probably imposes a much smaller limit on the set of actual
strings that would be accepted, so no kilobyte-long opentype could
possibly make it past that constraint. That means there can only be
one outstanding invalid object. So the worst-case (maximal length)
string that has not yet been rejected would be something like:

 OPEN(list)
  validthing [0]
  validthing [1]
  ...
  validthing [n-1]
  long-invalid-thing

because if the long invalid thing had been received earlier, the
entire list would have been abandoned. This suggests that the
calculation for ListConstraint.maxSize() really needs to be like:

   overhead
 + (len-1) * itemConstraint.maxSize(valid)
 +    (1)  * itemConstraint.maxSize(invalid)

I'm still not sure about this, but I think it provides a
significantly tighter upper bound.
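Extending the earlier sketch with an invented valid= flag (the
signature is just an illustration, not a settled API), the two-number
computation might look like:

 class StringConstraint:
     def __init__(self, maxLength=None):
         self.maxLength = maxLength
     def maxSize(self, valid=True):
         if self.maxLength is None:
             raise UnboundedSchema
         # a too-long string is rejected as soon as its length header
         # is complete, so the undecided case is no larger than the
         # valid one for plain strings
         return HEADER + TOKEN + self.maxLength

 class ListConstraint:
     def __init__(self, constraint, maxLength=None):
         self.constraint = constraint
         self.maxLength = maxLength
     def maxSize(self, valid=True):
         if self.maxLength is None:
             raise UnboundedSchema
         overhead = 3 * (HEADER + TOKEN) + OPENTYPE
         if valid:
             # pass bound: every item fully conforming
             return (overhead
                     + self.maxLength * self.constraint.maxSize(True))
         # fail bound: only the last item can be the outstanding
         # invalid one, since an invalid item any earlier would have
         # abandoned the whole list already
         return (overhead
                 + (self.maxLength - 1) * self.constraint.maxSize(True)
                 + self.constraint.maxSize(False))

The two numbers diverge for item constraints that accept opentypes,
where maxSize(False) would have to include the kilobyte-long opentype
string that is accepted before the opentype check rejects it.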
The deserialization process itself does not try to achieve the
absolute minimal exposure (i.e., the opentype checker could take the
set of all known-valid open types, compute the maximum length, and
then impose a StringConstraint with that length instead of 1000),
because that is, in general, an inefficient hassle. There is a
tradeoff between computational efficiency and removing the slack in
the maxSize bound, both in the deserialization process (where the
memory is actually consumed) and in maxSize (where we estimate how
much memory could be consumed).

Anyway, maxSize() and maxDepth() (which is easier: containers add 1
to the maximum of the maxDepth values of their possible children)
need to be implemented for all the Constraint classes. There are some
tests (disabled) in test_schema.py for this code: those tests assert
specific values for maxSize. Those values are probably wrong, so they
must be updated to match however maxSize actually works.

* finish testing of LONGINT/LONGNEG

test_banana.InboundByteStream.testConstrainedInt needs implementation.

* add constraints to Slicers (outbound constraints)

This is much easier than the inbound side. Each constraint just gets
a single checkObject() call with the candidate object before it is
sent.
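A sketch of the shape this could take (the Violation exception and
the specific checks are placeholders, not the finished API):

 class Violation(Exception):
     pass

 class StringConstraint:
     def __init__(self, maxLength=None):
         self.maxLength = maxLength
     def checkObject(self, obj):
         # called once by the Slicer, just before any tokens for obj
         # are emitted; raising Violation aborts the send
         if not isinstance(obj, str):
             raise Violation("not a string")
         if self.maxLength is not None and len(obj) > self.maxLength:
             raise Violation("string too long (%d > %d)"
                             % (len(obj), self.maxLength))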
* implement CopyableSlicer/CopyableUnslicer

This is pretty straightforward; the main issue is how to determine
the constraint, and in what way to allow the use of Adapters.

* nail down some useful schema syntaxes

This has two parts: parsing something like a __schema__ class
attribute (see the sketches in schema.xhtml) into a tree of
FooConstraint objects, and deciding how to retrieve schemas at
runtime from things like the object being serialized or the object
being called from afar. To be most useful, the syntax needs to mesh
nicely with (read: "be identical to") things like formless and
(maybe?) atop, or whatever has replaced the high-density
highly-structured save-to-disk scheme that twisted.world used to do.

Some lingering questions in this area:

 When an object has a remotely-invokable method, where does the
 appropriate MethodConstraint come from? Some possibilities:

  - an attribute of the method itself: obj.method.__schema__
  - from inside a __schema__ attribute of the object's class
  - from inside a __schema__ attribute of an Interface (which?) that
    the object implements

 Likewise, when a caller holding a RemoteReference invokes a method
 on it, it would be nice to enforce a schema on the arguments they
 are sending to the far end ("be conservative in what you send").
 Where should this schema come from? It is likely that the sender
 only knows an Interface for their RemoteReference.

 When PB determines that an object wants to be copied by value
 instead of by reference (pb.Copyable subclass, Copyable(obj), schema
 says so), where should it find a schema to define what exactly gets
 copied over? A class attribute of the object's class would make
 sense: most objects would do this, some could override jellyFor to
 get more control, and others could override something else to push a
 new Slicer on the stack and do streaming serialization. Whatever the
 approach, it needs to be paralleled by the receiving side's
 unjellyableRegistry.

* decide upon what the "Shared" constraint should mean

The idea of this one was to avoid some vulnerabilities by rejecting
arbitrary object graphs. Fundamentally Banana can represent almost
anything (just like pickle), including objects that refer to each
other in exciting loops and whorls. There are two problems with this:
it is hard to enforce a schema that allows cycles in the object graph
(indeed it is tricky even to describe one), and the shared references
could be used to temporarily violate a schema. I think these might be
fixable (the sample case is where one tuple is referenced in two
different places, each with a different constraint, but the tuple is
incomplete until some higher-level node in the graph has become
referenceable, so [maybe] the schema can't be enforced until somewhat
after the object has actually finished arriving).

However, Banana is aimed at two different use-cases. One is kind of a
replacement for pickle, where the goal is to allow arbitrary object
graphs to be serialized but have more control over the process (in
particular we still have an unjellyableRegistry to prevent arbitrary
constructors from being executed during deserialization). In this
mode, a larger set of Unslicers are available (for modules, bound
methods, etc), and schemas may still be useful but are not enforced
by default.

PB will use the other mode, where the set of conveyable objects is
much smaller, and security is the primary goal (including putting
limits on resource consumption). Schemas are enforced by default, and
all constraints default to sensible size limits (strings to 1k, lists
to [currently] 30 items). Because complex object graphs are not
commonly transported across process boundaries, the default is to
reject both cycles and shared references, allowing only strict trees:
no Copyable object may be referenced multiple times in the same
serialization stream. This makes life easier (and safer) for the
remote methods which are being given this object tree.

The "Shared" constraint is intended as a way to turn off this default
strictness and allow the object to be referenced multiple times. The
outstanding questions are what this should really mean: must the
object be marked as Shared in all places where it could be
referenced? what is the scope of the multiple-reference region
(per-method-call? per-connection?)? and what should be done when the
limit is violated? Currently Unslicers see an Error object, which
they can respond to any way they please: the default containers
abandon the rest of their contents and hand an Error to their parent,
the MethodCallUnslicer returns an exception to the caller, etc. With
shared references, the first recipient would see a valid object,
while the second and later recipients would see an error.

* figure out Deferred errors for immutable containers

Somewhat related to the previous one. The now-classic example of an
immutable container which cannot be created right away is the object
created by this sequence:

 t = ([],)
 t[0].append((t,))

This serializes into (with implicit reference numbers on the left):

 [0] OPEN(tuple)
 [1]  OPEN(list)
 [2]   OPEN(tuple)
 [3]    OPEN(reference #0)
       CLOSE
      CLOSE
     CLOSE

In newbanana, the second TupleUnslicer cannot return a fully-formed
tuple to its parent (the ListUnslicer), because that tuple cannot be
created until its contents are all referenceable, and that cannot
happen until the first TupleUnslicer has completed. So the second
TupleUnslicer returns a Deferred instead of a tuple, and the
ListUnslicer adds a callback which updates the list's item when the
tuple is complete.
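In condensed form, the mechanism looks something like this (a sketch
only; the method names are illustrative, and the real Unslicers live
in slicer.py):

 from twisted.internet import defer

 class ListUnslicer:
     def __init__(self):
         self.list = []

     def receiveChild(self, obj):
         if isinstance(obj, defer.Deferred):
             # the child is not complete yet: store the Deferred as
             # a placeholder and fix up the slot when it fires
             index = len(self.list)
             self.list.append(obj)
             obj.addCallback(self.update, index)
             obj.addErrback(self.childFailed)
         else:
             self.list.append(obj)

     def update(self, child, index):
         self.list[index] = child
         return child

     def childFailed(self, failure):
         # this is the open problem: by the time the Deferred fails,
         # the rest of the object graph may already have arrived, and
         # the error path no longer points at the real culprit
         return failure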
The problem here is that of error handling. In general, if an
exception is raised (perhaps a protocol error, perhaps a schema
violation) while an Unslicer is active, that Unslicer is abandoned
(all its remaining tokens are discarded) and the parent gets an Error
object. (The parent may give up too: the basic Unslicers all behave
this way, so any exception will cause everything up to the
RootUnslicer to go boom, and the RootUnslicer has the option of
dropping the connection altogether.) When the error is noticed, the
Unslicer stack is queried to figure out what path was taken from the
root of the object graph to the site that had an error. This is
really useful when trying to figure out which exact object caused a
SchemaViolation: rather than being told a call trace or a description
of the *object* which had a problem, you get a description of the
path to that object (the same series of dereferences you'd use to
print the object: obj.children[12].peer.foo.bar).

When references are allowed, these exceptions could occur after the
original object has been received, when that Deferred fires. There
are two problems: one is that the error path is now misleading, the
other is that it might not have been possible to enforce a schema
because the object was incomplete.

The most important thing is to make sure that an exception which
occurs while the Deferred is being fired is caught properly and
flunks the object just as if the problem had been caught
synchronously. This may involve discarding an otherwise-complete
object graph and blaming the problem on a node much closer to the
root than the one which really caused the failure.

* constrain ReferenceUnslicer properly

The schema can use a ReferenceConstraint to indicate that the object
must be a RemoteReference, and can also require that the remote
object be capable of handling a particular Interface. This needs to
be implemented: slicer.ReferenceUnslicer must somehow actually ask
the constraint about the incoming tokens.

An outstanding question is "what counts". The general idea is that
RemoteReferences come over the wire as a connection-scoped ID number
and an optional list of Interface names (strings and version
numbers). In this case it is the far end which asserts that its
object can implement any given Interface, and the receiving end just
checks to see if the schema-imposed required Interface is in the
list.

This becomes more interesting when applied to local objects, or if a
constraint is created which asserts that its object is *something*
(maybe a RemoteReference, maybe a RemoteCopy) which implements a
given Interface. In this case, the incoming object could be an actual
instance, but the class name must be looked up in the
unjellyableRegistry (and the class located, and the __implements__
list consulted) before any of the object's tokens are accepted.

* decide on a version negotiation scheme

* write oldbanana compatibility code?

An oldbanana peer can be detected because the server side sends its
dialect list from connectionMade, and oldbanana lists are sent with
OLDLIST tokens (the explicit-length kind).
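A very rough sketch of that detection (the OLDLIST value is assumed
to match oldbanana's LIST token; double-check it, and the framing,
against twisted.spread.banana):

 OLDLIST = 0x80  # oldbanana's explicit-length list token (assumed)

 def peerLooksOld(data):
     # In oldbanana, base-128 length digits have the high bit clear
     # and the type byte that follows has it set. A server that
     # opens with an OLDLIST token is sending its dialect list the
     # oldbanana way; anything else is presumed to be newbanana.
     for ch in data:
         b = ord(ch)
         if b & 0x80:
             return b == OLDLIST
     return None  # not enough data yet to decide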