

Postfix Stress-Dependent Configuration

-------------------------------------------------------------------------------

Overview

This document describes the symptoms of Postfix SMTP server overload, and how
to avoid overload under normal conditions. When overload is caused by botnets
or other malware, the document suggests configuration settings that help to
minimize the impact on legitimate mail. Finally, the document describes
stress-adaptive behavior, introduced with Postfix 2.5, and how it can be used
to automatically switch configuration settings under overload.

Topics covered in this document:

  * Symptoms of Postfix SMTP server overload
  * Service more SMTP clients at the same time
  * Spend less time per SMTP client
  * Disconnect suspicious SMTP clients
  * Take desperate measures
  * Make Postfix behavior stress-adaptive
  * Detecting support for stress-adaptive behavior
  * Forcing stress-adaptive behavior on or off
  * Credits

Symptoms of Postfix SMTP server overload

Under normal conditions, Postfix responds immediately when a remote SMTP client
connects. The time needed to deliver mail should be noticeable only with very
large messages. Performance degrades dramatically when the number of
remote SMTP clients exceeds the number of Postfix SMTP server processes. When a
client connects while all server processes are busy, the client must wait until
a server process becomes available.

Overload may be caused by legitimate mail (example: a DNS registrar opens a
new zone for registrations), by mistake (mail explosion caused by a forwarding
loop) or by illegitimate mail (worm outbreak, botnet, or other malware
activity). Symptoms of Postfix SMTP mail server overload are:

  * Remote SMTP clients experience a long delay before Postfix sends the "220
    hostname.example.com ESMTP Postfix" greeting. If this affects end-user mail
    clients, enable the "submission" service entry in master.cf (present since
    Postfix 2.1), and tell users to connect to this instead of the public SMTP
    service.

  * The Postfix SMTP server logs an increased number of "lost connection after
    CONNECT" events. This happens because remote SMTP clients disconnect before
    Postfix answers the connection.

  * Postfix 2.3 and later logs a warning that all server ports are busy:

    Oct  3 20:39:27 spike postfix/master[28905]: warning: service "smtp"
     (25) has reached its process limit "30": new clients may experience
     noticeable delays
    Oct  3 20:39:27 spike postfix/master[28905]: warning: to avoid this
     condition, increase the process count in master.cf or reduce the
     service time per client

NOTE: The first two symptoms may also happen without overload, for example:

  * Broken DNS also causes lengthy delays before "220 hostname.example.com ..."
    while the Postfix SMTP server tries to look up the client's hostname.

  * A portscan for open SMTP ports also results in "lost connection ..."
    logfile messages.

Legitimate mail that doesn't get through during an episode of overload is not
necessarily lost. It should still arrive once the situation returns to normal,
as long as the overload condition is temporary.
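The two log-based symptoms above can be counted with a few lines of shell.
This is a minimal sketch, not part of Postfix: the function name
overload_symptoms is hypothetical, and the location of the mail log file is
site-specific, so it is passed as an argument.

```shell
#!/bin/sh
# Count two overload symptoms in a Postfix log file.
# Usage: overload_symptoms /path/to/maillog   (the path is site-specific)
overload_symptoms() {
    logfile=$1
    # SMTP clients that disconnected before Postfix answered.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    lost=$(grep -c 'lost connection after CONNECT' "$logfile" || true)
    # master(8) warnings that a service reached its process limit.
    limit=$(grep -c 'has reached its process limit' "$logfile" || true)
    echo "lost_after_connect=$lost process_limit_warnings=$limit"
}
```

Keep in mind that a high "lost connection" count alone does not prove
overload, because portscans produce the same log message.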

Service more SMTP clients at the same time

To service more SMTP clients simultaneously, you need to increase the number of
SMTP server processes. This will improve the responsiveness for remote SMTP
clients, as long as the server machine has enough hardware and software
resources to run the additional processes, and as long as the file system can
keep up with the additional load.

  * You increase the number of SMTP server processes either by increasing the
    default_process_limit in main.cf (line 3 below), or by increasing the SMTP
    server's "maxproc" field in master.cf (line 10 below). Either way, you need
    to issue a "postfix reload" command to make the change effective.

  * Process limits above 1000 require Postfix version 2.4 or later, and an
    operating system that supports kernel-based event filters (BSD kqueue(2),
    Linux epoll(4), or Solaris /dev/poll).

  * You can reduce the Postfix memory footprint by using cdb: lookup tables
    instead of Berkeley DB.

     1 /etc/postfix/main.cf:
     2     # Raise the global process limit (default: 100 since Postfix 2.0).
     3     default_process_limit = 200
     4
     5 /etc/postfix/master.cf:
     6     # =============================================================
     7     # service type  private unpriv  chroot  wakeup  maxproc command
     8     # =============================================================
     9     # Raise the SMTP service process limit only.
    10     smtp      inet  n       -       n       -       200     smtpd

  * NOTE: older versions of the SMTPD_POLICY_README document contain a mistake:
    they configure a fixed number of policy daemon processes. When you raise
    the SMTP server's "maxproc" field in master.cf, SMTP server processes will
    report problems when connecting to policy server processes, because there
    aren't enough of them. Examples of errors are "connection refused" or
    "operation timed out". To fix, edit master.cf and specify a zero "maxproc"
    field in all policy server entries; see line 6 in the example below. Issue
    a "postfix reload" command to make the change effective.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     # Disable the policy service process limit.
    6     policy    unix  -       n       n       -       0       spawn
    7         user=nobody argv=/some/where/policy-server

Spend less time per SMTP client

When increasing the number of SMTP server processes is not practical, you can
improve Postfix server responsiveness by eliminating unnecessary work. When
Postfix spends less time per SMTP session, the same number of SMTP server
processes can service more clients in the same amount of time.

  * Eliminate non-functional RBL lookups (blocklists that are no longer in
    operation). These lookups can degrade performance. Postfix logs a warning
    when an RBL server does not respond.

  * Eliminate redundant RBL lookups (people often use multiple Spamhaus RBLs
    that include each other). To find out whether RBLs include other RBLs, look
    up the websites that document the RBL's policies.

  * Eliminate header_checks and body_checks, and keep just a few emergency
    patterns to block the latest worm explosion or backscatter mail. See
    BACKSCATTER_README for examples of the latter.

  * Group your header_checks and body_checks patterns to avoid unnecessary
    pattern matching operations.

     1  /etc/postfix/header_checks:
     2      if /^Subject:/
     3      /^Subject: virus found in mail from you/ reject
     4      /^Subject: ..../ ....
     5      endif
     6
     7      if /^Received:/
     8      /^Received: from (postfix\.org) / reject forged client name in received header: $1
     9      /^Received: from .../ ....
    10      endif

Disconnect suspicious SMTP clients

Under conditions of overload you can improve Postfix SMTP server responsiveness
by hanging up on suspicious clients, so that other clients get a chance to talk
to Postfix.

  * Use "421" reply codes for botnet-related RBLs or for selected non-RBL
    restrictions. This causes Postfix 2.3 and later to disconnect immediately
    without waiting for the remote SMTP client to send a QUIT command.

    You can set individual reject codes for RBLs, and for individual responses
    from a specific RBL. We'll use zen.spamhaus.org as an example; by the time
    you read this document, details may have changed. Right now, their
    documents say that a response of 127.0.0.10 or 127.0.0.11 indicates a
    dynamic client IP address, which means that the machine is probably running
    a bot of some kind. To give a 421 response instead of the default 554
    response, use something like:

     1  /etc/postfix/main.cf:
     2      smtpd_client_restrictions =
     3         permit_mynetworks
     4         reject_rbl_client zen.spamhaus.org=127.0.0.10
     5         reject_rbl_client zen.spamhaus.org=127.0.0.11
     6         reject_rbl_client zen.spamhaus.org
     7
     8      rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps
     9
    10  /etc/postfix/rbl_reply_maps:
    11      zen.spamhaus.org=127.0.0.10 421 4.7.1 Service unavailable;
    12       $rbl_class [$rbl_what] blocked using
    13       $rbl_domain${rbl_reason?; $rbl_reason}
    14
    15      zen.spamhaus.org=127.0.0.11 421 4.7.1 Service unavailable;
    16       $rbl_class [$rbl_what] blocked using
    17       $rbl_domain${rbl_reason?; $rbl_reason}

    Although the above shows three RBL lookups (lines 4-6), Postfix will still
    only do a single DNS query, so the performance difference is negligible.

    The down-side of sending 421 instead of the default 554 is that it works
    only for zombies and other malware. If the client is running a real MTA,
    then it may connect again several times until the mail expires in its
    queue. When this is a problem, stick with the default 554 reply, and use
    "smtpd_hard_error_limit = 1" as described below.

    With Postfix 2.5, or with earlier releases that contain the stress-adaptive
    behavior patch, you can turn on the above under overload by replacing line
    8 with:

     8      rbl_reply_maps = ${stress?hash:/etc/postfix/rbl_reply_maps}

    More information about automatic stress-adaptive behavior is at the end of
    this document.

Take desperate measures

The following measures will still allow most legitimate clients to connect and
send mail, but may affect some legitimate clients.

  * Reduce smtpd_timeout (default: 300s). Experience on the postfix-users list
    from a variety of sysadmins shows that reducing the "normal" smtpd_timeout
    to 60s is unlikely to affect legitimate clients. However, it is unlikely to
    become the Postfix default because it's not RFC compliant. Setting
    smtpd_timeout to 10s (line 2 below) or even 5s under stress will still
    allow most legitimate clients to connect and send mail, but may delay mail
    from some clients. No mail should be lost, as long as this measure is used
    only temporarily.

  * Reduce smtpd_hard_error_limit (default: 20). Setting this to 1 under stress
    (line 3 below) helps by disconnecting clients after a single error, giving
    other clients a chance to connect. However, this may cause significant
    delays with legitimate mail, such as a mailing list that contains a few
    no-longer-active addresses whose owners didn't bother to unsubscribe. No
    mail should be lost, as long as this measure is used only temporarily.

  * Disable remote SMTP client hostname lookups, so that all SMTP client
    hostnames become "unknown" (line 5 below). This feature was introduced with
    Postfix 2.3. Unfortunately, this measure is more problematic than the other
    ones proposed so far. First, this will result in loss of mail when you use
    hostname-based access rules that reject mail from "unknown" SMTP clients
    (examples: reject_unknown_client_hostname,
    reject_unknown_reverse_client_hostname). Second, this may result in loss of
    mail when you subject "unknown" SMTP clients to additional restrictions
    such as reject_unverified_sender.

    1  /etc/postfix/main.cf:
    2      smtpd_timeout = 10
    3      smtpd_hard_error_limit = 1
    4      # Caution: line 5 may trigger REJECTs by hostname-based access rules
    5      smtpd_peername_lookup = no

Except with the last measure, no mail should be lost, as long as these measures
are used only temporarily. The next section of this document introduces a way
to automate this process.

Make Postfix behavior stress-adaptive

Postfix version 2.5 introduces automatic stress-adaptive behavior. This is also
available as an add-on patch for Postfix versions 2.4 and 2.3 from the mirrors
listed at http://www.postfix.org/download.html.

It works as follows. When a "public" network service runs into an "all server
ports are busy" condition, the master(8) daemon logs a warning, restarts the
service (without interrupting existing network sessions), and runs the service
with "-o stress=yes" on the command line. Normally, it runs a stress-adaptive
service with "-o stress=" on the command line (i.e. with an empty parameter
value). Other services never have "-o stress" parameters on the command line,
including services that listen on a loopback interface only.

The stress pseudo-parameter value is the key to making main.cf parameter
settings stress adaptive:

    1  /etc/postfix/main.cf:
    2      smtpd_timeout = ${stress?10}${stress:300}
    3      smtpd_hard_error_limit = ${stress?1}${stress:20}

Translation:

  * Line 2: under conditions of stress, use an smtpd_timeout value of 10
    seconds instead of the default 300 seconds,

  * Line 3: under conditions of stress, use an smtpd_hard_error_limit of 1
    instead of the default 20.

The syntax of ${name?value} and ${name:value} is explained at the beginning of
the postconf(5) manual page.
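The effect of combining ${stress?...} with ${stress:...} can be imitated in
plain shell. This is only an illustration of the selection logic, not how
Postfix implements it; the function name stress_value is hypothetical.

```shell
#!/bin/sh
# Imitate "${stress?A}${stress:B}":
#   - ${stress?A} expands to A when stress is non-empty, else to nothing
#   - ${stress:B} expands to B when stress is empty, else to nothing
# Usage: stress_value "<stress>" "<value under stress>" "<normal value>"
stress_value() {
    if [ -n "$1" ]; then
        printf '%s\n' "$2"    # stress=yes: use the stress setting
    else
        printf '%s\n' "$3"    # stress= (empty): use the normal setting
    fi
}

# The smtpd_timeout line from the example above:
stress_value "yes" 10 300    # under stress
stress_value ""    10 300    # normal operation
```

Exactly one of the two expansions produces text, so the parameter always
ends up with a single value.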

NOTE: Please keep in mind that the stress-adaptive feature is a fairly
desperate measure to keep some legitimate mail flowing under overload
conditions. If a site is reaching the SMTP server process limit when there
isn't an attack or bot flood occurring, then either the process limit needs to
be raised or more hardware needs to be added.

Detecting support for stress-adaptive behavior

To find out if your Postfix installation supports stress-adaptive behavior, use
the "ps" command, and look for the smtpd processes. Postfix has stress-adaptive
support when you see "-o stress=" or "-o stress=yes" command-line options.
Remember that Postfix never enables stress-adaptive behavior on servers that
listen on local addresses only.

The following example is for FreeBSD or Linux. On Solaris, HP-UX and other
System-V flavors, use "ps -ef" instead of "ps ax".

    $ ps ax|grep smtpd
    83326  ??  S      0:00.28 smtpd -n smtp -t inet -u -c -o stress=
    84345  ??  Ss     0:00.11 /usr/bin/perl /usr/libexec/postfix/smtpd-policy.pl

You can't use postconf(1) to detect stress-adaptive support. The postconf(1)
command ignores the existence of the stress parameter in main.cf, because the
parameter has no effect there. Command-line "-o parameter" settings always take
precedence over main.cf parameter settings.

If you configure stress-adaptive behavior in main.cf when it isn't supported,
nothing bad will happen. The processes will run as if the stress parameter
always has an empty value.
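The "ps" check described above can be summarized with a small filter. This is
a sketch under the assumption that smtpd command lines look like the example
output above; the function name stress_support is hypothetical. It reads "ps"
output on standard input, for example "ps ax | stress_support".

```shell
#!/bin/sh
# Summarize how smtpd processes were started, based on "ps" output on stdin.
stress_support() {
    awk '
        BEGIN { re = "(^|[ /])smtpd " }   # the smtpd command, not smtpd-policy.pl
        $0 ~ re && /-o stress=yes/   { yes++;   next }   # running under stress
        $0 ~ re && /-o stress=( |$)/ { empty++; next }   # stress-adaptive, no stress
        $0 ~ re                      { none++ }          # no "-o stress" option
        END {
            printf "stress=yes:%d stress=(empty):%d no-stress-option:%d\n",
                   yes, empty, none
        }
    '
}
```

A non-zero count in either of the first two fields indicates stress-adaptive
support. Processes counted in the last field may belong to services that
listen on local addresses only.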

Forcing stress-adaptive behavior on or off

You can manually force stress-adaptive behavior on, by adding a "-o stress=yes"
command-line option in master.cf. This can be useful for testing overrides on
the SMTP service. Issue "postfix reload" to make the change effective.

Note: setting the stress parameter in main.cf has no effect for services that
accept remote connections.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     #
    6     smtp      inet  n       -       n       -       -       smtpd
    7         -o stress=yes
    8         -o . . .

To permanently force stress-adaptive behavior off with a specific service,
specify "-o stress=" on its master.cf command line. This may be desirable for
the "submission" service. Issue "postfix reload" to make the change effective.

Note: setting the stress parameter in main.cf has no effect for services that
accept remote connections.

    1 /etc/postfix/master.cf:
    2     # =============================================================
    3     # service type  private unpriv  chroot  wakeup  maxproc command
    4     # =============================================================
    5     #
    6     submission inet n       -       n       -       -       smtpd
    7         -o stress=
    8         -o . . .
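To verify which services carry a manual override, the master.cf entries can
be scanned for "-o stress=..." options. The sketch below assumes the plain
"-o name=value" form used in the examples above; the function name
stress_override is hypothetical.

```shell
#!/bin/sh
# Print the "-o stress=..." override, if any, for one master.cf service.
# Usage: stress_override /etc/postfix/master.cf submission inet
stress_override() {
    awk -v svc="$2" -v typ="$3" '
        /^[ \t]*#/ || NF == 0 { next }    # skip comments and blank lines
        # A line that starts in column 1 begins a new service entry.
        /^[^ \t]/ { in_entry = ($1 == svc && $2 == typ) }
        in_entry {
            for (i = 1; i < NF; i++)
                if ($i == "-o" && $(i + 1) ~ /^stress=/)
                    print $(i + 1)
        }
    ' "$1"
}
```

No output means that the entry has no manual override, so master(8) decides
whether to run the service with "-o stress=" or "-o stress=yes".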

Credits

  * Thanks to the postfix-users mailing list members for sharing early
    experiences with the stress-adaptive feature.
  * The RBL example and several other paragraphs of text were adapted from
    postfix-users postings by Noel Jones.
  * Wietse implemented stress-adaptive behavior as the smallest possible patch
    while he should be working on other things.