Postfix Queue Scheduler

-------------------------------------------------------------------------------

What this file is about

This is the beginning of documentation for the clever queue manager scheduling
algorithm by Patrik Rak. For a long time, this code was made available under
the name "nqmgr(8)" (new queue manager), as an optional module. As of Postfix
2.1 this is the default queue manager, which is always called "qmgr(8)". The
old queue manager will for some time remain available under the name
"oqmgr(8)".

Why the old Postfix queue manager was replaced

The old Postfix scheduler had several limitations due to unfortunate choices in
its design.

 1. Round-robin selection by destination for mail that is delivered via the
    same message delivery transport. The round-robin strategy was chosen with
    the intention of preventing a single (destination) site from using up too
    many mail delivery resources. However, that strategy penalized inbound
    mail on bi-directional gateways. The poor suffering inbound destination
    would be selected only 1/number-of-destinations of the time, even when it
    had more mail than other destinations, and thus its mail could be delayed.

    Victor Duchovni found a workaround: use different message delivery
    transports for inbound and outbound mail, and thus avoid the starvation
    problem. The Patrik Rak scheduler solves this problem by using FIFO
    selection (see the sketch after this list).

 2. A second limitation of the old Postfix scheduler was that delivery of bulk
    mail would block all other deliveries, causing large delays. Patrik Rak's
    scheduler allows mail with fewer recipients to slip past bulk mail in an
    elegant manner.
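
The difference between the two selection strategies can be made concrete with
a small, self-contained simulation (illustrative C, not Postfix code; the
destinations and message counts are made up). While every destination still
has mail queued, round-robin selection hands the busy "inbound" destination
only one delivery slot out of every number-of-destinations, whereas FIFO
selection simply serves entries in arrival order:

    #include <stdio.h>

    #define NDEST   4
    #define NMSG    9

    int     main(void)
    {
        /*
         * Arrival order of the queued mail, by destination. Destination 0
         * is the busy "inbound" site; destinations 1..3 each have one
         * queued message.
         */
        int     arrival[NMSG] = { 0, 0, 0, 0, 0, 0, 1, 2, 3 };
        int     pending[NDEST] = { 6, 1, 1, 1 };
        int     i, dest, left;

        printf("FIFO order:       ");
        for (i = 0; i < NMSG; i++)
            printf(" %d", arrival[i]);
        printf("\n");

        printf("round-robin order:");
        for (left = NMSG, dest = 0; left > 0; dest = (dest + 1) % NDEST) {
            if (pending[dest] > 0) {        /* one entry per destination */
                printf(" %d", dest);
                pending[dest]--;
                left--;
            }
        }
        printf("\n");
        return (0);
    }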

How the queue manager scheduler works

The following text is from Patrik Rak and should be read together with the
postconf(5) manual that describes each configuration parameter in detail.

From the user's point of view, oqmgr(8) and qmgr(8) are the same, except for
how the next message is chosen when a delivery agent becomes available. You
already know that oqmgr(8) uses round-robin selection by destination while
qmgr(8) uses simple FIFO selection, except for some preemptive magic. The
postconf(5) manual documents all the knobs the user can use to control this
preemptive magic - there is nothing more to the preemption than the quite
simple conditions described below.

As for programmer-level documentation, this will have to be extracted from all
the emails I have exchanged with Wietse [rats! I hoped that Patrik would do
the work for me -- Wietse]. But I think there are no missing bits that we have
not covered in those conversations.

However, even from the programmer's point of view, there is nothing more to
add to the message scheduling idea itself. There are a few things that make it
look more complicated than it is, but the algorithm is the same as the one the
user perceives. The differences between the programmer's view and the user's
view can be summarized as follows:

 1. Simplification of terms for users: The user knows about messages and
    recipients. The program itself works with jobs (one message is split into
    several jobs, one per transport needed to deliver the message) and queue
    entries (each entry may group several recipients for the same
    destination). Then there is the peer structure introduced by qmgr(8),
    which is simply the per-job analog of the queue structure (see the sketch
    after this list).

 2. Dealing with concurrency limits: The actual implementation is complicated
    by the fact that the messages (resp. jobs) may not be delivered in exactly
    the scheduled order because of the concurrency limits. It is necessary to
    skip some "blocker" jobs when the concurrency limit is reached and get
    back to them again when the limit permits (a sketch appears at the end of
    this section).

 3. Dealing with resource limits: The actual implementation is complicated by
    the fact that not all recipients may be read into core at once. Therefore
    each message has some recipients in-core while others may remain on-file.
    This means that a) the preemptive algorithm needs to work with recipient
    count estimates instead of exact counts, b) there is extra code that
    manipulates the per-transport pool of recipients that may be read in-core
    at the same time, and c) there is extra code that reads recipients into
    core in batches and is triggered at the appropriate moments (also sketched
    at the end of this section).

 4. Doing things efficiently: All important things I am aware of are done in
    the minimum time possible (either directly or at least in amortized
    complexity), but choosing which job is the best candidate for preempting
    the current job requires a linear search of up to all of the transport's
    jobs (the theoretical worst case - in practice it is much better). As this
    is done every time the next queue entry to be delivered is about to be
    chosen, it seemed reasonable to add a cache that minimizes the overhead.
    Maintenance of this candidate cache slightly obfuscates things.
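
To make the terms from point 1 concrete, here is a minimal sketch of how
these entities relate (illustrative declarations only; the real qmgr(8)
structures carry many more fields and the names here are made up):

    /*
     * Illustrative only: one message (queue file) is split into jobs, one
     * job per transport; each job has peers, one peer per destination; each
     * peer holds queue entries that group recipients for that destination.
     */
    struct entry {                          /* one delivery request */
        struct entry *next;                 /* next entry of the same peer */
        int     rcpt_count;                 /* recipients grouped together */
    };

    struct peer {                           /* per-job, per-destination queue */
        struct peer *next;                  /* next destination in this job */
        char   *destination;                /* where this peer delivers to */
        struct entry *entries;              /* entries for this destination */
    };

    struct job {                            /* one message on one transport */
        struct job *next;                   /* next job on the same transport */
        struct message *message;            /* message this job belongs to */
        struct peer *peers;                 /* per-destination peers */
        int     rcpt_estimate;              /* estimated recipients left */
    };

    struct message {                        /* one queue file */
        char   *queue_id;
        struct job *jobs;                   /* one job per transport needed */
    };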

Points 2 and 3 are the ones that made the implementation (look) complicated
and were the real coding work, but I believe that the scheduling algorithm
itself (which was the real thinking work) is fairly easy to understand.
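
A minimal sketch of the blocker-skipping idea from point 2 (illustrative
only, and simplified to a flat list of deliverable entries; the real code
works with jobs and per-destination queues):

    #include <stddef.h>

    struct dest {
        int     busy;                       /* deliveries now in progress */
        int     limit;                      /* destination concurrency limit */
    };

    struct sched_entry {
        struct sched_entry *next;           /* next entry in scheduled order */
        struct dest *dest;                  /* destination of this entry */
    };

    /*
     * Walk the entries in scheduled order and return the first one whose
     * destination still has a free delivery slot. Entries that are skipped
     * ("blockers") stay on the list and are looked at again as soon as a
     * delivery to their destination completes and frees a slot.
     */
    struct sched_entry *select_entry(struct sched_entry *head)
    {
        struct sched_entry *ep;

        for (ep = head; ep != NULL; ep = ep->next)
            if (ep->dest->busy < ep->dest->limit)
                return (ep);                /* deliverable right now */
        return (NULL);                      /* everything is blocked */
    }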
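
And a similarly simplified sketch of the per-transport recipient pool from
point 3; the field names and the batch logic are assumptions made for
illustration, not the actual qmgr(8) code:

    struct transport {
        int     rcpt_slots_used;            /* in-core recipients, all messages */
        int     rcpt_slots_limit;           /* size of the per-transport pool */
    };

    struct msg {
        struct transport *transport;
        int     rcpts_on_file;              /* recipients not yet read in-core */
    };

    /*
     * Try to read up to "batch" more recipients of this message into core,
     * but only as far as the transport's shared recipient pool has room.
     * Slots are returned to the pool elsewhere, as recipients are delivered,
     * which eventually lets the remaining on-file recipients be read in.
     */
    int     read_recipient_batch(struct msg *mp, int batch)
    {
        struct transport *tp = mp->transport;
        int     room = tp->rcpt_slots_limit - tp->rcpt_slots_used;
        int     want = mp->rcpts_on_file;

        if (want > batch)
            want = batch;
        if (want > room)
            want = room;
        if (want <= 0)
            return (0);                     /* pool is full or nothing left */
        /* ... the real code would parse "want" recipients from the queue
           file here and create queue entries for them ... */
        mp->rcpts_on_file -= want;
        tp->rcpt_slots_used += want;
        return (want);
    }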