STRESS_README   [plain text]


PPoossttffiixx SSttrreessss--DDeeppeennddeenntt CCoonnffiigguurraattiioonn

-------------------------------------------------------------------------------

OOvveerrvviieeww

This document describes the symptoms of Postfix SMTP server overload. It
presents permanent main.cf changes to avoid overload during normal operation,
and temporary main.cf changes to cope with an unexpected burst of mail. This
document makes specific suggestions for Postfix 2.5 and later which support
stress-adaptive behavior, and for earlier Postfix versions that don't.

Topics covered in this document:

  * Symptoms of Postfix SMTP server overload
  * Service more SMTP clients at the same time
  * Spend less time per SMTP client
  * Disconnect suspicious SMTP clients
  * Temporary measures for older Postfix releases
  * Automatic stress-adaptive behavior
  * Detecting support for stress-adaptive behavior
  * Forcing stress-adaptive behavior on or off
  * Other measures to off-load zombies
  * Credits

SSyymmppttoommss ooff PPoossttffiixx SSMMTTPP sseerrvveerr oovveerrllooaadd

Under normal conditions, the Postfix SMTP server responds immediately when an
SMTP client connects to it; the time to deliver mail is noticeable only with
large messages. Performance degrades dramatically when the number of SMTP
clients exceeds the number of Postfix SMTP server processes. When an SMTP
client connects while all Postfix SMTP server processes are busy, the client
must wait until a server process becomes available.

SMTP server overload may be caused by a surge of legitimate mail (example: a
DNS registrar opens a new zone for registrations), by mistake (mail explosion
caused by a forwarding loop) or by malice (worm outbreak, botnet, or other
illegitimate activity).

Symptoms of Postfix SMTP server overload are:

  * Remote SMTP clients experience a long delay before Postfix sends the "220
    hostname.example.com ESMTP Postfix" greeting.

      o NOTE: Broken DNS configurations can also cause lengthy delays before
        Postfix sends "220 hostname.example.com ...". These delays also exist
        when Postfix is NOT overloaded.

      o NOTE: To avoid "overload" delays for end-user mail clients, enable the
        "submission" service entry in master.cf (present since Postfix 2.1),
        and tell users to connect to this instead of the public SMTP service.

  * The Postfix SMTP server logs an increased number of "lost connection after
    CONNECT" events. This happens because remote SMTP clients disconnect before
    Postfix answers the connection.

      o NOTE: A portscan for open SMTP ports can also result in "lost
        connection ..." logfile messages.

  * Postfix 2.3 and later logs a warning that all server ports are busy:

    Oct  3 20:39:27 spike postfix/master[28905]: warning: service "smtp"
     (25) has reached its process limit "30": new clients may experience
     noticeable delays
    Oct  3 20:39:27 spike postfix/master[28905]: warning: to avoid this
     condition, increase the process count in master.cf or reduce the
     service time per client

Legitimate mail that doesn't get through during an episode of Postfix SMTP
server overload is not necessarily lost. It should still arrive once the
situation returns to normal, as long as the overload condition is temporary.

SSeerrvviiccee mmoorree SSMMTTPP cclliieennttss aatt tthhee ssaammee ttiimmee

One measure to avoid the "all server processes busy" condition is to service
more SMTP clients simultaneously. For this you need to increase the number of
Postfix SMTP server processes. This will improve the responsiveness for remote
SMTP clients, as long as the server machine has enough hardware and software
resources to run the additional processes, and as long as the file system can
keep up with the additional load.

  * You increase the number of SMTP server processes either by increasing the
    default_process_limit in main.cf (line 3 below), or by increasing the SMTP
    server's "maxproc" field in master.cf (line 10 below). Either way, you need
    to issue a "postfix reload" command to make the change effective.

  * Process limits above 1000 require Postfix version 2.4 or later, and an
    operating system that supports kernel-based event filters (BSD kqueue(2),
    Linux epoll(4), or Solaris /dev/poll).

  * More processes use more memory. You can reduce the Postfix memory footprint
    by using cdb: lookup tables instead of Berkeley DB's hash: or btree:
    tables.

     1 /etc/postfix/main.cf:
     2     # Raise the global process limit, 100 since Postfix 2.0.
     3     default_process_limit = 200
     4
     5 /etc/postfix/master.cf:
     6     # =============================================================
     7     # service type  private unpriv  chroot  wakeup  maxproc command
     8     # =============================================================
     9     # Raise the SMTP service process limit only.
    10     smtp      inet  n       -       n       -       200     smtpd

  * NOTE: older versions of the SMTPD_POLICY_README document contain a mistake:
    they configure a fixed number of policy daemon processes. When you raise
    the SMTP server's "maxproc" field in master.cf, SMTP server processes will
    report problems when connecting to policy server processes, because there
    aren't enough of them. Examples of errors are "connection refused" or
    "operation timed out".

    To fix, edit master.cf and specify a zero "maxproc" field in all policy
    server entries; see line 6 in the example below. Issue a "postfix reload"
    command to make the change effective.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     # Disable the policy service process limit.
    6     policy    unix  -       n       n       -       0       spawn
    7         user=nobody argv=/some/where/policy-server

SSppeenndd lleessss ttiimmee ppeerr SSMMTTPP cclliieenntt

When increasing the number of SMTP server processes is not practical, you can
improve Postfix server responsiveness by eliminating delays. When Postfix
spends less time per SMTP session, the same number of SMTP server processes can
service more clients in a given amount of time.

  * Eliminate non-functional RBL lookups (blocklists that are no longer in
    operation). These lookups can degrade performance. Postfix logs a warning
    when an RBL server does not respond.

  * Eliminate redundant RBL lookups (people often use multiple Spamhaus RBLs
    that include each other). To find out whether RBLs include other RBLs, look
    up the websites that document the RBL's policies.

  * Eliminate header_checks and body_checks, and keep just a few emergency
    patterns to block the latest worm explosion or backscatter mail. See
    BACKSCATTER_README for examples of the latter.

  * Group your header_checks and body_checks patterns to avoid unnecessary
    pattern matching operations:

     1  /etc/postfix/header_checks:
     2      if /^Subject:/
     3      /^Subject: virus found in mail from you/ reject
     4      /^Subject: ..other../ reject
     5      endif
     6
     7      if /^Received:/
     8      /^Received: from (postfix\.org) / reject forged client name in
    received header: $1
     9      /^Received: from ..other../ reject ....
    10      endif

DDiissccoonnnneecctt ssuussppiicciioouuss SSMMTTPP cclliieennttss

Under conditions of overload you can improve Postfix SMTP server responsiveness
by hanging up on suspicious clients, so that other clients get a chance to talk
to Postfix.

  * Use "521" SMTP reply codes (Postfix 2.6 and later) or "421" (Postfix 2.3-
    2.5) to hang up on clients that that match botnet-related RBLs (see next
    bullet) or that match selected non-RBL restrictions such as SMTP access
    maps. The Postfix SMTP server will reject mail and disconnect without
    waiting for the remote SMTP client to send a QUIT command.

  * To hang up connections from blacklisted zombies, you can set specific
    Postfix SMTP server reject codes for specific RBLs, and for individual
    responses from specific RBLs. We'll use zen.spamhaus.org as an example; by
    the time you read this document, details may have changed. Right now, their
    documents say that a response of 127.0.0.10 or 127.0.0.11 indicates a
    dynamic client IP address, which means that the machine is probably running
    a bot of some kind. To give a 521 response instead of the default 554
    response, use something like:

     1  /etc/postfix/main.cf:
     2      smtpd_client_restrictions =
     3         permit_mynetworks
     4         reject_rbl_client zen.spamhaus.org=127.0.0.10
     5         reject_rbl_client zen.spamhaus.org=127.0.0.11
     6         reject_rbl_client zen.spamhaus.org
     7
     8      rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps
     9
    10  /etc/postfix/rbl_reply_maps:
    11      # With Postfix 2.3-2.5 use "421" to hang up connections.
    12      zen.spamhaus.org=127.0.0.10 521 4.7.1 Service unavailable;
    13       $rbl_class [$rbl_what] blocked using
    14       $rbl_domain${rbl_reason?; $rbl_reason}
    15
    16      zen.spamhaus.org=127.0.0.11 521 4.7.1 Service unavailable;
    17       $rbl_class [$rbl_what] blocked using
    18       $rbl_domain${rbl_reason?; $rbl_reason}

    Although the above example shows three RBL lookups (lines 4-6), Postfix
    will only do a single DNS query, so it does not affect the performance.

  * With Postfix 2.3-2.5, use reply code 421 (521 will not cause Postfix to
    disconnect). The down-side of replying with 421 is that it works only for
    zombies and other malware. If the client is running a real MTA, then it may
    connect again several times until the mail expires in its queue. When this
    is a problem, stick with the default 554 reply, and use
    "smtpd_hard_error_limit = 1" as described below.

  * You can automatically turn on the above overload measure with Postfix 2.5
    and later, or with earlier releases that contain the stress-adaptive
    behavior source code patch from the mirrors listed at http://
    www.postfix.org/download.html. Simply replace line above 8 with:

     8      rbl_reply_maps = ${stress?hash:/etc/postfix/rbl_reply_maps}

More information about automatic stress-adaptive behavior is in section
"Automatic stress-adaptive behavior".

TTeemmppoorraarryy mmeeaassuurreess ffoorr oollddeerr PPoossttffiixx rreelleeaasseess

See the next section, "Automatic stress-adaptive behavior", if you are running
Postfix version 2.5 or later, or if you have applied the source code patch for
stress-adaptive behavior from the mirrors listed at http://www.postfix.org/
download.html.

The following measures can be applied temporarily during overload. They still
allow mmoosstt legitimate clients to connect and send mail, but may affect some
legitimate clients.

  * Reduce smtpd_timeout (default: 300s). Experience on the postfix-users list
    from a variety of sysadmins shows that reducing the "normal" smtpd_timeout
    to 60s is unlikely to affect legitimate clients. However, it is unlikely to
    become the Postfix default because it's not RFC compliant. Setting
    smtpd_timeout to 10s (line 2 below) or even 5s under stress will still
    allow mmoosstt legitimate clients to connect and send mail, but may delay mail
    from some clients. No mail should be lost, as long as this measure is used
    only temporarily.

  * Reduce smtpd_hard_error_limit (default: 20). Setting this to 1 under stress
    (line 3 below) helps by disconnecting clients after a single error, giving
    other clients a chance to connect. However, this may cause significant
    delays with legitimate mail, such as a mailing list that contains a few no-
    longer-active user names that didn't bother to unsubscribe. No mail should
    be lost, as long as this measure is used only temporarily.

  * Use an smtpd_junk_command_limit of 1 instead of the default 100. This
    prevents clients from keeping idle connections open by repeatedly sending
    NOOP or RSET commands.

    1  /etc/postfix/main.cf:
    2      smtpd_timeout = 10
    3      smtpd_hard_error_limit = 1
    4      smtpd_junk_command_limit = 1

With these measures, no mail should be lost, as long as these measures are used
only temporarily. The next section of this document introduces a way to
automate this process.

AAuuttoommaattiicc ssttrreessss--aaddaappttiivvee bbeehhaavviioorr

Postfix version 2.5 introduces automatic stress-adaptive behavior. This is also
available as a source code patch for Postfix versions 2.4 and 2.3 from the
mirrors listed at http://www.postfix.org/download.html.

It works as follows. When a "public" network service such as the SMTP server
runs into an "all server ports are busy" condition, the Postfix master(8)
daemon logs a warning, restarts the service (without interrupting existing
network sessions), and runs the service with "-o stress=yes" on the server
process command line:

    80821  ??  S      0:00.24 smtpd -n smtp -t inet -u -c -o stress=yes

Normally, the Postfix master(8) daemon runs such a service with "-o stress=" on
the command line (i.e. with an empty parameter value):

    83326  ??  S      0:00.28 smtpd -n smtp -t inet -u -c -o stress=

Services that have local access only never have "-o stress" parameters on the
command line. This includes services internal to Postfix such as the queue
manager, and services that listen on a loopback interface only, such as after-
filter SMTP services.

The "stress" parameter value is the key to making main.cf parameter settings
stress adaptive. The following settings are the default with Postfix 2.6 and
later. With earlier Postfix versions that have stress-adaptive support, append
the lines below to the main.cf file and issue a "postfix reload" command:

    1 smtpd_timeout = ${stress?10}${stress:300}s
    2 smtpd_hard_error_limit = ${stress?1}${stress:20}
    3 smtpd_junk_command_limit = ${stress?1}${stress:100}

Translation:

  * Line 1: under conditions of stress, use an smtpd_timeout value of 10
    seconds instead of the default 300 seconds. Experience on the postfix-users
    list from a variety of sysadmins shows that reducing the "normal"
    smtpd_timeout to 60s is unlikely to affect legitimate clients. However, it
    is unlikely to become the Postfix default because it's not RFC compliant.
    Setting smtpd_timeout to 10s (line 2 below) or even 5s under stress will
    still allow most legitimate clients to connect and send mail, but may delay
    mail from some clients. No mail should be lost, as long as this measure is
    used only temporarily.

  * Line 2: under conditions of stress, use an smtpd_hard_error_limit of 1
    instead of the default 20. This helps by disconnecting clients after a
    single error, giving other clients a chance to connect. However, this may
    cause significant delays with legitimate mail, such as a mailing list that
    contains a few no-longer-active user names that didn't bother to
    unsubscribe. No mail should be lost, as long as this measure is used only
    temporarily.

  * Line 3: under conditions of stress, use an smtpd_junk_command_limit of 1
    instead of the default 100. This prevents clients from keeping idle
    connections open by repeatedly sending NOOP or RSET commands.

The syntax of ${name?value} and ${name:value} is explained at the beginning of
the postconf(5) manual page.

NOTE: Please keep in mind that the stress-adaptive feature is a fairly
desperate measure to keep ssoommee legitimate mail flowing under overload
conditions. If a site is reaching the SMTP server process limit when there
isn't an attack or bot flood occurring, then either the process limit needs to
be raised or more hardware needs to be added.

DDeetteeccttiinngg ssuuppppoorrtt ffoorr ssttrreessss--aaddaappttiivvee bbeehhaavviioorr

To find out if your Postfix installation supports stress-adaptive behavior, use
the "ps" command, and look for the smtpd processes. Postfix has stress-adaptive
support when you see "-o stress=" or "-o stress=yes" command-line options.
Remember that Postfix never enables stress-adaptive behavior on servers that
listen on local addresses only.

The following example is for FreeBSD or Linux. On Solaris, HP-UX and other
System-V flavors, use "ps -ef" instead of "ps ax".

    $ ps ax|grep smtpd
    83326  ??  S      0:00.28 smtpd -n smtp -t inet -u -c -o stress=
    84345  ??  Ss     0:00.11 /usr/bin/perl /usr/libexec/postfix/smtpd-
    policy.pl

You can't use postconf(1) to detect stress-adaptive support. The postconf(1)
command ignores the existence of the stress parameter in main.cf, because the
parameter has no effect there. Command-line "-o parameter" settings always take
precedence over main.cf parameter settings.

If you configure stress-adaptive behavior in main.cf when it isn't supported,
nothing bad will happen. The processes will run as if the stress parameter
always has an empty value.

FFoorrcciinngg ssttrreessss--aaddaappttiivvee bbeehhaavviioorr oonn oorr ooffff

You can manually force stress-adaptive behavior on, by adding a "-o stress=yes"
command-line option in master.cf. This can be useful for testing overrides on
the SMTP service. Issue "postfix reload" to make the change effective.

Note: setting the stress parameter in main.cf has no effect for services that
accept remote connections.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     #
    6     smtp      inet  n       -       n       -       -       smtpd
    7         -o stress=yes
    8         -o . . .

To permanently force stress-adaptive behavior off with a specific service,
specify "-o stress=" on its master.cf command line. This may be desirable for
the "submission" service. Issue "postfix reload" to make the change effective.

Note: setting the stress parameter in main.cf has no effect for services that
accept remote connections.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     #
    6     submission inet n       -       n       -       -       smtpd
    7         -o stress=
    8         -o . . .

OOtthheerr mmeeaassuurreess ttoo ooffff--llooaadd zzoommbbiieess

OpenBSD spamd implements a daemon that handles all connections from "new"
clients. Only well-behaved mail clients are allowed to talk to the mail server.
Other clients are tarpitted, and will never get a chance to affect mail server
performance.

At some point in the future, Postfix may come with a simple front-end daemon
that does basic greylisting and pipelining detection to keep zombies and other
ratware away from Postfix itself. This would use the "pass" service type which
has been available in stable Postfix releases since Postfix 2.5.

CCrreeddiittss

  * Thanks to the postfix-users mailing list members for sharing early
    experiences with the stress-adaptive feature.
  * The RBL example and several other paragraphs of text were adapted from
    postfix-users postings by Noel Jones.
  * Wietse implemented stress-adaptive behavior as the smallest possible patch
    while he should be working on other things.