Cooperative Data Flows (via generators)

Background

Handling more than one Request

When creating services to handle simultaneous requests, processing state must be managed. A common approach, threading, applies the method used for single user operations: keeping processing state within local variables and function arguments. In this case, the operating environment handles variable allocation and cleanup. Unfortunately, threading requires much preemption and locking magic to keep each thread's execution independent, yet allowing data to be shared between threads. This magic can lead to subtle programming errors, and, in the case of Python, slowness as every changes to a mutable object must be synchronized with the global interpreter lock.

An alternative is to use a event-driven approach, where each operation is broken down into distinct steps, each step scheduled for operation in an delayed execution queue. In this approach, an object is often used used to maintain state information between each step, allowing other operations to be performed. Often each step is a member function, queued within Twisted using reactor.CallLater.

from twisted.internet import reactor

class Request:
    def __init__(self, name):
        self.name = name
        reactor.callLater(0, self.step_one)
    def step_one(self):
        print self.name, ": Step One"
        reactor.callLater(0, self.step_two)
    def step_two(self):
        print self.name, ": Step Two"
        reactor.callLater(0, self.step_done)
    def step_done(self):
        # cleanup, and don't call reactor again.
        pass

Request("James")
Request("Wendy")

# start and shut down the event loop
reactor.callLater(1, reactor.stop) 
reactor.run()

In the output of this example, one can see that execution alternates between steps for James and Wendy's request. In effect, the requests are done in a parallel, cooperative manner. While this is a good start, it isn't perfect as the event loop is stopped with a hard-coded timeout. It would be better for each Request object to signal when it is completed so that the event loop can be shut down sooner.

Deferred Callbacks

In Twisted, communication between operations is often accomplished with defer.Deferred. The Deferred object decouples an asyncronous operation by providing a temporary storage location for its result. Instead of calling the operation and waiting for a return value, an operation schedules itself to be executed with reactor.callLaterand then returns a defer.Deferred object. At this point, the caller can register a callback function to be executed when the operation has completed. In the example below, defer.DeferredList is used to merge the result of each request into a single notification, which is used to stop the main event loop.

from twisted.internet import reactor, defer

class Request:
    def __init__(self, name):
        self.name = name
        self.d    = defer.Deferred()
        reactor.callLater(0, self.step_one)
    def step_one(self):
        print self.name, ": Step One"
        reactor.callLater(0, self.step_two)
    def step_two(self):
        print self.name, ": Step Two"
        reactor.callLater(0, self.step_done)
    def step_done(self):
        self.d.callback(None) # notify callback that we are done

james = Request("James")
wendy = Request("Wendy")

# start and shut down the event loop
d = defer.DeferredList([james.d, wendy.d])
d.addCallback(lambda _: reactor.stop())
reactor.run()

While this Deferred approach is very good, it can get quickly complicated, especially if the request is not a simple linear sequence of steps, or when results must flow between steps incrementally. The flow module addresses these shortcomings using python generators.

Iterators and generators

An iterator is basically an object which produces a sequence of values. Python's iterators are simply objects with an __iter__() member function which returns an object (usually itself) which has a next() member function. The next() method is then invoked till it raises a StopIteration exception.

from twisted.python.compat import iter, StopIteration

class Counter:
    def __init__(self, count):
        self.count = count
    def __iter__(self):
        return self
    def next(self):
        ret = self.count
        self.count -= 1
        if ret: return ret
        raise StopIteration
        return ret

import sys
if sys.version_info < (2,2):
    def list(it):
        ret = []
        it = iter(it)
        try:
            while 1:
                ret.append(it.next())
        except StopIteration: pass
        return ret

print list(Counter(3))

# prints: [3, 2, 1]

State pattern

Often times it is useful for an iterator to change state during its production of values. This can be done nicely with the 'state' pattern.

class States:
    def __iter__(self):
        self.next = self.state_one
        return self
    def state_one(self):
        self.next = self.state_two
        return "one"
    def state_two(self):
        self.next = self.state_stop
        return "two"
    def state_stop(self):
        raise StopIteration

print list(States())

# prints: ['one', 'two']

Generators

With Python 2.2, there is a wonderful syntax sugar for creating iterators... generators. When a generator is first executed, an iterator is returned. And from there on, each invocation of next() gives the subsequent value produced by the yield statement. With generators, the two iterators above become very easy to express.

from __future__ import generators 

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def States():
    yield "one"
    yield "two"

print list(Counter(3))
print list(States())

# prints:
#    [3, 2, 1]
#    ['one', 'two']

One technical difference between iterators and generators, is that raising an exception from a generator permanently halts the generator, while raising an exception from an iterator's next() method does not invalidate the iterator, that is, one could call the next() method again and possibly get results. From here on, we use the generator syntax for building iterators.

Chaining Generators

It is often useful to view an operation information as a flow between stages, where each stage may have several states or steps. This can be coded where the output of one generator is consumed by another. In this view, the last generator in the chain 'pulls' data from previous stages.

from __future__ import generators 

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def Consumer():
    producer = Counter(3)
    for result in producer:
        if 2 != result:
            yield result

print list(Consumer())

# prints: [3, 1]

While this is a very clean syntax for creating a multi-stage operation, it would block all other operations. Therefore, some mechanism for pausing the generator and resuming it later is required.

Introducing Flow

The flow module provides this ability to cooperate with other tasks. This is accomplished by wrapping iterables in a flow stage object and following an alternating yield pattern. That is, before each value pulled from the stage, the operation must yield the wrapper object. During this yield bookkeeping is done to prepare the next value, or, if the next value is not available, re-scheduling the operation to be executed later.

from __future__ import generators
from twisted.flow import flow

def Counter(count):
    while count > 0:
        yield count
        count -= 1

def Consumer():
    producer = flow.wrap(Counter(3))
    yield producer
    for result in producer:
        if 2 != result:
            yield result
        yield producer

print list(flow.Block(Consumer))

# prints: [3, 1]

Equivalent Forms

In the above code, producer.next() is called implicitly, and thus the generator above is equivalent to...

from __future__ import generators
def Consumer():
    producer = flow.wrap(Counter(3))
    while True:
        yield producer
        result = producer.next()
        if 2 != result:
            yield result

The next() method of the wrapper object does several things. First, it checks to see if there are results ready, if so it returns the next one. If not, it looks for a failure, raising it. And finally, checking to see if the end of the input has been reached. More concretely...

from __future__ import generators
def Consumer():
    producer = flow.wrap(Counter(3))
    while True:
        yield producer
        if producer.results:
            result = producer.results.pop(0)
            yield result
            continue
        if producer.failure:
            producer.stop = 1
            producer.failure.trap()
        if producer.stop:
            break

Handling failures

Another difference between plain old iterables and one wrapped with the flow module is that exceptions caught are wrapped with a twisted.python.failure.Failure object for later delivery. There are two basic ways to recover from exceptions. One way is to list expected exceptions in the call to flow.wrap. Alternatively, a try/except block can be used, catching flow.Failure objects.

from __future__ import generators
from twisted.flow import flow

def Producer(throw):
    yield 1
    yield 2
    raise throw
    yield 3

def Consumer(producer):
    producer = flow.wrap(producer, IOError)
    yield producer
    try:
        for result in producer:
            if result is IOError:
                # handle trapped error
                yield "trapped"
            else:
                yield result
            yield producer
    except AssertionError, err:
        # handle assertion error
        yield str(err)

print list(flow.Block(Consumer(Producer(IOError("trap")))))
print list(flow.Block(Consumer(Producer(AssertionError("notrap")))))

# prints: [1, 2, 'trapped']
# prints: [1, 2, 'notrap']

Cooperate

This seems like quite the effort, wrapping each iterator and then having to alter the calling sequence. Why? The answer is that it allows for a flow.Cooperate object to be returned. When this happens, the entire call chain can be paused so that other flows can use the call stack. For flow.Block, the implementation of Cooperate simply puts the call chain to sleep.

from __future__ import generators
from twisted.flow import flow

def gen():
    yield 'immediate'
    yield flow.Cooperate(2)
    yield 'delayed'

for x in flow.Block(gen):
   print x

# prints:
#  immediate
#  delayed

Merge and Zip

Cooperate can be demonstrated with flow.Merge and flow.Zip components. These two stages join two or more generators into a single stream. The Merge operation does this by rotating between any of its input streams which are available. The Zip operation, on the other hand, waits for a result from each stream before it produces a result.

from __future__ import generators
from twisted.flow import flow

def Right():
    yield "one"
    yield "two"
    yield flow.Cooperate()
    yield "three"

def Left():
    yield 1
    yield 2
    yield 3

print "Zip", list(flow.Block(flow.Zip(Right,Left)))
print "Merge", list(flow.Block(flow.Merge(Right,Left)))

# Zip [('one', 1), ('two', 2), ('three', 3)]
# Merge ['one', 1, 'two', 2, 3, 'three']

Integrating with Twisted

While flow.Block is useful for understanding how flow works, it undermines the whole purpose by sleeping during Cooperate and blocking other operations. Following are numerous examples of how to integrate flow with the Twisted framework in a non-blocking manner.

Deferred Flow

For starters, the long example of Wendy and James, with its numerous calls to reactor.callLater to schedule each step of the operation can be rewritten using flow.Deferred.

from __future__ import generators
from twisted.internet import reactor, defer
from twisted.flow import flow

def request(name):
    print name, ": Step One"
    yield flow.Cooperate()
    print name, ": Step Two"

james = flow.Deferred(request("James"))
wendy = flow.Deferred(request("Wendy"))

# start and shut down the event loop
d = defer.DeferredList([wendy, james])
d.addCallback(lambda _: reactor.stop())
reactor.run()

Under the sheets, when flow.Deferred encounters a flow.Cooperate event, it reschedules itself to be resumed at a later time, allowing other asyncronous operations to proceed. Once again, defer.DeferredList is only used here to stop the reactor after all operations are completed.

Flow Resources

By using flow.Deferred it is easy to make up a web resource which is both long running, but also can serve more than one customer at a time. Run the example below, and with two browsers, view the webpage. Notice that both web pages are being created at the same time.

from __future__ import generators
from twisted.internet import reactor
from twisted.web import server, resource
from twisted.flow import flow

def cooperative(count):
    """ simulate a cooperative resource, that not block """
    from random import random
    idx = 0
    while idx < count:
        val = random()
        yield flow.Cooperate(val)
        yield str(val)[-5:]
        idx += 1

def flowRender(req):
    count = int(req.args.get("count",["30"])[0])
    req.write("<html><body>")
    req.write(" %s Random numbers: <ol>\n" % count)
    source = flow.wrap(cooperative(count))
    yield source
    for itm in source:
        req.write("<li>%s</li>\n" % itm)
        yield source
    req.write("</ol></body></html>\n")

class FlowResource(resource.Resource):
    def __init__(self, gen):
        resource.Resource.__init__(self)
        self.gen = gen
    def isLeaf(self): return true
    def render(self, req):
        self.d = flow.Deferred(self.gen(req))
        self.d.addCallback(lambda _: req.finish())
        return server.NOT_DONE_YET

print "visit http://localhost:8081/ to view the example"
root = FlowResource(flowRender)
site = server.Site(root)
reactor.listenTCP(8081,site)
reactor.run()

Flow Protocols

The flow module can also be used to construct protocols easily, following is an echo client and server. For each protocol, one must yield the connection before reading from it. When the generator finishes, the connection is automatically closed.

from __future__ import generators
from twisted.flow import flow
from twisted.internet import protocol, reactor
PORT = 8392

def echoServer(conn):
    yield conn
    for data in conn:
        conn.write(data)
        yield conn                                   
    reactor.callLater(0,reactor.stop)

server = protocol.ServerFactory()
server.protocol = flow.makeProtocol(echoServer)
reactor.listenTCP(PORT,server)

def echoClient(conn):
    conn.write("Hello World")
    yield conn
    print conn.next()
    conn.write("Another Line")
    yield conn
    print conn.next()

client = protocol.ClientFactory()
client.protocol = flow.makeProtocol(echoClient)
reactor.connectTCP("localhost", PORT, client)
reactor.run()

Threading

While the Flow module allows for multiple cooperative tasks to work in a single thread, sometimes it is necessary to have the output of another thread be consumed within a flow. This can be done with twisted.flow.threads.Threaded, which takes an iterable object and executes it in another thread. Following is a sample iterable, countSleep which simulates a blocking producer which must be put into a thread. To show that twisted.flow.threads.Threaded does not block other operations, a similar, cooperative count is included.

from __future__ import generators
from twisted.internet import reactor, defer
from twisted.flow import flow
from twisted.flow.threads import Threaded

def countSleep(index):
    from time import sleep
    for index in range(index):
        sleep(.3)
        print "sleep", index
        yield index

def countCooperate(index):
    for index in range(index):
        yield flow.Cooperate(.1)
        print "cooperate", index
        yield "coop %s" % index

d = flow.Deferred( flow.Merge(
        Threaded(countSleep(5)),
        countCooperate(5)))

# # alternatively
# d1 = flow.Deferred(Threaded(countSleep(5)))
# d2 = flow.Deferred(countCooperate(10))
# d = defer.DeferredList([d1,d2])

def prn(x):
    print x
    reactor.stop()
d.addCallback(prn)
reactor.run()

Using database connections

Since most standard database drivers are thread based, the flow builds on the twisted.flow.threads.Threaded by providing a QueryIterator, which takes an sql query and a ConnectionPool.

from __future__         import generators
from twisted.enterprise import adbapi
from twisted.internet   import reactor
from twisted.flow import flow
from twisted.flow.threads import QueryIterator, Threaded

dbpool = adbapi.ConnectionPool("SomeDriver",host='localhost', 
             db='Database',user='User',passwd='Password')

# # I test with...
# from pyPgSQL import PgSQL
# dbpool = PgSQL

sql = """
  (SELECT 'one')
UNION ALL
  (SELECT 'two')
UNION ALL
  (SELECT 'three')
"""
def consumer():
    print "executing"
    query = Threaded(QueryIterator(dbpool, sql))
    print "yielding"
    yield query
    print "done yeilding"
    for row in query:
        print "Processed result : ", row
        yield query

from twisted.internet import reactor
def finish(result): 
    print "Deferred Complete : ", result
    reactor.stop()
f = flow.Deferred(consumer())
f.addBoth(finish)
reactor.run()