mailbox format

intro

This is an attempt to document the Cyrus mailbox format. It should not be considered authoritative and is subject to change.

No external tools should make use of this information. The only supported method of access to the mail store is through the standard interfaces: IMAP, POP, NNTP, LMTP, etc.

A Cyrus mailbox is a directory in the filesystem. It contains the following files:

message files

The message files are named by their UID, followed by a ".", so UID 423 would be named "423.". They are stored in wire-format: lines are terminated by CRLF and binary data is not allowed.
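
As a small illustration of the naming rule above, the following sketch maps a UID to its file name and back. The function names are hypothetical and are not part of Cyrus.

```python
def message_filename(uid: int) -> str:
    """Return the message file name for a UID, e.g. 423 -> '423.'."""
    if uid <= 0:
        raise ValueError("IMAP UIDs are positive integers")
    return f"{uid}."


def uid_from_filename(name: str):
    """Return the UID encoded in a message file name, or None for other files."""
    stem = name[:-1]
    if name.endswith(".") and stem.isascii() and stem.isdigit():
        return int(stem)
    return None
```

Files such as cyrus.header do not match the pattern, so uid_from_filename returns None for them.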

cyrus.header

This file contains mailbox-wide information that changes infrequently. Its format:

<Mailbox Header Magic String>
<Quota Root>\t<Mailbox Unique ID String>\n
<Space-separated list of user flags>\n
<Mailbox ACL>\n
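
The layout above can be read with a few lines of code. This is a minimal sketch for illustration only: the magic string is not given in this document, so the caller passes it in, and the function name and the UTF-8 decoding are assumptions.

```python
def parse_cyrus_header(data: bytes, magic: bytes) -> dict:
    """Split a cyrus.header image into its documented fields.

    `magic` is the Mailbox Header Magic String, supplied by the caller
    because its value is not listed in this document.
    """
    if not data.startswith(magic):
        raise ValueError("missing mailbox header magic string")
    lines = data[len(magic):].split(b"\n")
    if len(lines) < 3:
        raise ValueError("truncated cyrus.header")
    quota_root, tab, unique_id = lines[0].partition(b"\t")
    if not tab:
        raise ValueError("first line lacks the tab separator")
    return {
        "quota_root": quota_root.decode("utf-8", "replace"),
        "unique_id": unique_id.decode("utf-8", "replace"),
        # User flags are a space-separated list on one line.
        "user_flags": lines[1].decode("utf-8", "replace").split(),
        # The ACL is kept as one opaque line; its inner syntax is not described here.
        "acl": lines[2].decode("utf-8", "replace"),
    }
```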

cyrus.index, cyrus.expunge and cyrus.cache

Note: these files are not just caches; the index file also stores information that is not present in the message file.

These files cache frequently accessed information on a per-message basis. The index file holds one fixed-length record per message (plus a header of mailbox-wide metadata), while the cache file holds variable-length information.

Any binary data in these files is stored in network byte order. All of the binary data is also 4-byte aligned. Strings in the cyrus.cache are stored NUL-terminated (this only applies to cyrus.cache). To ensure alignment of following data, the end of strings may be NUL-padded by up to 4 bytes.
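
The byte-order and alignment rules can be sketched as follows. The helper names are hypothetical. The padding function shows one reading of the rule (terminate with a NUL, then pad to the next 4-byte boundary), which yields between one and four trailing NUL bytes.

```python
import struct


def read_u32(buf: bytes, offset: int) -> int:
    """Read a 32-bit unsigned integer stored in network (big-endian) byte order."""
    return struct.unpack_from(">I", buf, offset)[0]


def align4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def pad_string(s: bytes) -> bytes:
    """NUL-terminate s, then NUL-pad so the following data stays 4-byte aligned."""
    raw = s + b"\0"
    return raw.ljust(align4(len(raw)), b"\0")
```

For example, a 3-byte string occupies 4 bytes after padding, and a 4-byte string occupies 8.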

The cyrus.expunge file has exactly the same format as cyrus.index, and holds the records of expunged messages whose corresponding cache records and message files have not yet been deleted.

The overall format of these files is roughly as follows:

cyrus.index:
+----------------+
| Mailbox Header |
+----------------+
| Msg: Seq Num 1 |
+----------------+
| Msg: Seq Num 2 |
+----------------+
|     ...        |
+----------------+

The basic idea is that there is one header, followed by message records evenly spaced throughout the file. All of the message records are at well-known offsets, making any part of the file accessible at roughly equal speed.
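
Because the records are fixed-size, the position of any record follows from simple arithmetic. The sketch below assumes the header size and record size have already been read from the index header (the Start Offset and Record Size fields); the function names are hypothetical.

```python
def record_offset(seq_num: int, start_offset: int, record_size: int) -> int:
    """Byte offset of the record for a 1-based message sequence number."""
    if seq_num < 1:
        raise ValueError("sequence numbers start at 1")
    return start_offset + (seq_num - 1) * record_size


def read_record(index: bytes, seq_num: int, start_offset: int, record_size: int) -> bytes:
    """Slice one fixed-size record out of an in-memory copy of cyrus.index."""
    off = record_offset(seq_num, start_offset, record_size)
    rec = index[off:off + record_size]
    if len(rec) != record_size:
        raise IndexError("record lies beyond the end of the index file")
    return rec
```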

cyrus.cache:

+------------------------------------------------------------------------+
|Gen # (32bits)|Size 1 (32bits)|Data 1                                   |
+------------------------------------------------------------------------+
|           |Size 2 (32bits)|Data 2            |Size 3 (32bits)| Data 3  |
+------------------------------------------------------------------------+
| .....                                                                  |
+------------------------------------------------------------------------+

The cache file is different from the index file. It starts with a 4-byte header (the generation number, described below), followed by a series of entries in (size)(data) format. The entries for each message are always consecutive and in the same order (i.e. for any given message, the envelope is always the first piece of data), but without an offset from the index file there is no way to tell where a given message's entries start.
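
Walking the (size)(data) entries might look like the sketch below. It assumes that each entry's data is padded so that the next size field starts on a 4-byte boundary, which follows from the alignment rule above but is not spelled out for this case. The function names are hypothetical.

```python
import struct


def cache_generation(cache: bytes) -> int:
    """Return the generation number stored in the first 4 bytes of cyrus.cache."""
    return struct.unpack_from(">I", cache, 0)[0]


def read_entries(cache: bytes, offset: int, count: int):
    """Read `count` consecutive (size)(data) entries starting at `offset`.

    Returns (entries, next_offset). The starting offset has to come from
    the index file; the cache file itself does not mark message boundaries.
    """
    entries = []
    for _ in range(count):
        (size,) = struct.unpack_from(">I", cache, offset)
        start = offset + 4
        end = start + size
        if end > len(cache):
            raise ValueError("cache entry runs past end of file")
        entries.append(cache[start:end])
        # Assumption: skip padding up to the next 4-byte boundary.
        offset = start + ((size + 3) & ~3)
    return entries, offset
```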

detail of cyrus.index header

The index header contains the following information, in order:

Generation Number (4 bytes)
A number that is basically the "revision number" of the mailbox. It must match between the cache and index files. This is to ensure that if we fail to sync both the cache and index files and a crash happens (so that only one is synced), we do not provide bad data to the user.
Format (4 bytes)
Basically obsolete (indicates netnews or regular).
Minor Version (4 bytes)
Indicates the version number of the index file. This can be used for on-the-fly upgrades of the index and cache files.
Start Offset (4 bytes)
Size of index header.
Record Size (4 bytes)
Size of an index record.
Exists (4 bytes)
How many messages are in the mailbox.
Last Appenddate (4 bytes)
(time_t) of the last time a message was appended
Last UID (4 bytes)
Highest UID of all messages in the mailbox (UIDNEXT - 1).
Quota Mailbox Used (8 bytes)
Total amount of storage used by all of the messages in the mailbox. Platforms that don't support 64-bit integers only use the last 4 bytes.
POP3 Last Login (4 bytes)
(time_t) of the last POP3 login to this INBOX, used to enforce the "poptimeout" imapd.conf option.
UIDvalidity (4 bytes)
The UID validity of this mailbox. Cyrus currently uses the time() when this mailbox was created.
Deleted, Answered, and Flagged (4 bytes each)
Counts of how many messages have each flag.
Mailbox Options (4 bytes)
Bitmask of mailbox options, consisting of any combination of the following:
POP3_NEW_UIDL
Flag signalling that we're using "uidvalidity.uid" instead of just "uid" for the output of the POP3 UIDL command.
IMAP_CONDSTORE
Flag signalling that we're supporting the IMAP CONDSTORE extension on the mailbox.
Leaked Cache (4 bytes)
Number of leaked records in the cache file.
HighestModSeq (8 bytes)
Highest Modification Sequence of all the messages in the mailbox (CONDSTORE).

There are also spare fields in the index header, to allow for future expansion without forcing an upgrade of the file.
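
The field list above translates directly into a fixed-layout decode. This is a sketch that assumes the fields are packed back to back in the listed order with no padding between them; the spare fields that follow are ignored. The names IndexHeader, parse_index_header and generations_match are hypothetical.

```python
import struct
from collections import namedtuple

# 8 x 32-bit, one 64-bit (quota), 7 x 32-bit, one 64-bit (modseq); big-endian.
HEADER_FORMAT = ">8IQ7IQ"

IndexHeader = namedtuple("IndexHeader", [
    "generation", "format", "minor_version", "start_offset", "record_size",
    "exists", "last_appenddate", "last_uid", "quota_used", "pop3_last_login",
    "uidvalidity", "deleted", "answered", "flagged", "options",
    "leaked_cache", "highest_modseq",
])


def parse_index_header(data: bytes) -> IndexHeader:
    """Decode the documented header fields from the start of cyrus.index."""
    return IndexHeader._make(struct.unpack_from(HEADER_FORMAT, data, 0))


def generations_match(index_data: bytes, cache_data: bytes) -> bool:
    """Check that cyrus.index and cyrus.cache carry the same generation number."""
    index_gen = struct.unpack_from(">I", index_data, 0)[0]
    cache_gen = struct.unpack_from(">I", cache_data, 0)[0]
    return index_gen == cache_gen
```

Reading the quota as one big-endian 64-bit value also covers platforms that only fill the last 4 bytes, provided the first 4 bytes are zero.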

detail of cyrus.index records

These records start immediately following the cyrus.index header, and are all fixed size. They are in-order by sequence number of the message.

UID (4 bytes)
UID of the message
INTERNALDATE (4 bytes)
INTERNALDATE of the message
SENTDATE (4 bytes)
Contents of the Date: header normalized to a Unix time_t.
SIZE (4 bytes)
Size of the whole message (in octets)
HEADER SIZE (4 bytes)
Size of the message header (in octets)
CONTENT_OFFSET (4 bytes)
Offset into the message file where the message content begins.
CACHE_OFFSET (4 bytes)
Offset into the cache file for the beginning of this message's cache entry.
LAST UPDATED (4 bytes)
(time_t) of the last time this record was changed
SYSTEM FLAGS (4 bytes)
Bitmask showing which system flags are set/unset
USER FLAGS (MAX_USER_FLAGS / 32 bytes)
Bitmask showing which user flags are set/unset
CONTENT_LINES (4 bytes)
Number of text lines contained in the message content (body).
CACHE_VERSION (4 bytes)
Indicates the version number of the cache record for the message (determines which headers are cached).
UUID (MESSAGE_UUID_SIZE bytes)
Universal UID of the message (used by replication code).
MODSEQ (8 bytes)
Modification Sequence of the message (CONDSTORE).
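
A record can be decoded in the same way as the header. Since the sizes of the USER FLAGS and UUID fields depend on build-time constants that are not given here, this sketch takes both as parameters. It also assumes the fields are packed contiguously in the listed order. All names are hypothetical.

```python
import struct

RECORD_FIELDS = (
    "uid", "internaldate", "sentdate", "size", "header_size",
    "content_offset", "cache_offset", "last_updated", "system_flags",
    "user_flags", "content_lines", "cache_version", "uuid", "modseq",
)


def record_format(user_flags_bytes: int, uuid_size: int) -> str:
    """Build the struct format for one index record (big-endian, unpadded)."""
    return f">9I{user_flags_bytes}s2I{uuid_size}sQ"


def parse_record(rec: bytes, user_flags_bytes: int, uuid_size: int) -> dict:
    """Decode one index record into a dict keyed by field name."""
    values = struct.unpack_from(record_format(user_flags_bytes, uuid_size), rec, 0)
    return dict(zip(RECORD_FIELDS, values))
```

The user_flags and uuid values are returned as raw bytes; the cache_offset value is the position in cyrus.cache where this message's entries begin.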

cyrus.cache file format detail

The order of fields per record in the cache file is as follows (keep in mind that each is preceded by a 4-byte size in network byte order).

Envelope Response
Raw IMAP response for a request for the envelope.
Bodystructure Response
Raw IMAP response for a request for the bodystructure.
Body Response
Raw IMAP response for an (old style) request for the body.
Binary Bodystructure

Offsets into the message file to pull out various body parts. Because of the nature of MIME parts, this is somewhat recursive.

This looks like the following (starting at the octet following the cache field size). All of the fields are bit32s.

  [
   [Number of message parts+1 for the rfc822 header if present]
   [
    [Offset in the message file of the header of this part]
    [Size (octets) of the header of this part]
    [Offset in the message file of the content of this part]
    [Size (octets) of the content of this part]
    [Encoding Type of this part]
   ]
      (repeat for each part as well as once for the headers)
   [zero *or* number of sub-parts in the case of a multipart.
    if nonzero, this is a recursion into the top structure]
      (repeat for each part)
  ] 

Note that if this is not a message/rfc822, then the sizes for part 0 are -1 (to indicate that it doesn't exist). Sub-parts are not possible for a part 0, so they aren't included when finding recursive entries.

The offset and size info for both the mime header and content part are useful in order to do fast indexing on the appropriate parts of the message file when a client does a FETCH request for BODY[HEADER], or BODY[2.MIME].

Note that the top-level RFC822 headers are treated as a separate part from their body text ("0" or "HEADER").

In the case of a multipart/alternative, the content size and offset refer to the size of the entire mime part.

A very simple message (with a single text/plain part) would therefore look like:

  [[2][rfc822 header][text/plain body part info][0]]

A simple multipart/alternative message might look like:

  [[3][rfc822 header][text/plain message part info]
      [second message part info][0][0]]

A message with an attachment that has two subparts:

  [[3][rfc822 header info][rfc822 first body part info][attachment info][0][
	[3][NIL header info][sub part 1 info][sub part 2 info][0][0]]]
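
The recursive layout can be decoded with a short recursive function. This sketch reflects one reading of the structure and examples in this section: a count, then five bit32 values for each part (including part 0), then one sub-structure per part other than part 0, where a sub-structure is either a single zero or a nested structure beginning with its own count. The names are hypothetical.

```python
import struct

MISSING = 0xFFFFFFFF  # how -1 appears when read as an unsigned bit32


def parse_bodystructure(buf: bytes, pos: int = 0):
    """Decode a binary bodystructure; return (node, next_pos).

    A node is None where the count is zero (no sub-parts); otherwise it is
    a dict with the part descriptions and one child node per part after part 0.
    """
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    if count == 0:
        return None, pos
    parts = []
    for _ in range(count):
        h_off, h_size, c_off, c_size, enc = struct.unpack_from(">5I", buf, pos)
        pos += 20
        parts.append({
            "header_offset": h_off, "header_size": h_size,
            "content_offset": c_off, "content_size": c_size,
            "encoding": enc,
        })
    children = []
    for _ in range(count - 1):  # part 0 never has sub-parts
        child, pos = parse_bodystructure(buf, pos)
        children.append(child)
    return {"parts": parts, "children": children}, pos
```

For the single text/plain message shown earlier, this yields two parts and one child of None. A part 0 that does not exist can be recognized by comparing its sizes against MISSING.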

A message with an attached message/rfc822 message with the following total structure:

    message/rfc822
      0 headers; content-type: multipart/mixed
      1 text/plain
      2 message/rfc822
        0 headers; content-type: multipart/alternative
        1 text/plain
        2 text/html

  [[3][rfc822 header part 0][text/plain part 1][overall attachment info][0][
       [3][rfc822 header part 2.0][text/plain part 2.1][text/html part 2.2]
          [0][0]]]

Cache Header

Any cached header fields. These are in the same format they would appear in the message file:

  HeaderName: headerdata\r\n

Examples include: References, In-Reply-To, etc.
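
Splitting the cached header block into name/value pairs could be sketched as below. The function name is hypothetical. Two assumptions are made that this document does not state: folded continuation lines (those starting with a space or tab) may occur, as they can in a message file, and the bytes are decoded as Latin-1 purely for convenience.

```python
def parse_cached_headers(raw: bytes):
    """Split 'HeaderName: headerdata\\r\\n' lines into (name, value) pairs."""
    headers = []
    for line in raw.decode("latin-1").split("\r\n"):
        if not line:
            continue
        if line[0] in " \t" and headers:
            # Continuation of the previous header: join with a single space.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
            continue
        name, colon, value = line.partition(":")
        if not colon:
            continue  # not a header line; skipped in this sketch
        headers.append((name.strip(), value.strip()))
    return headers
```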

From
The From header.
To
The To header.
Cc
The Cc header.
Bcc
The Bcc header.
Subject
The Subject header.
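
Putting the field order together, one message's cache record could be read into named fields as sketched below. The starting position is the CACHE_OFFSET value from the message's index record. As elsewhere, the padding of each entry to a 4-byte boundary is an assumption, and the names are hypothetical.

```python
import struct

# Field order as listed in this section.
CACHE_FIELDS = (
    "envelope", "bodystructure", "body", "binary_bodystructure",
    "cache_header", "from", "to", "cc", "bcc", "subject",
)


def read_cache_record(cache: bytes, cache_offset: int) -> dict:
    """Read the ten (size)(data) entries of one message, keyed by field name."""
    record = {}
    pos = cache_offset
    for name in CACHE_FIELDS:
        (size,) = struct.unpack_from(">I", cache, pos)
        pos += 4
        record[name] = cache[pos:pos + size]
        # Assumption: data is padded to the next 4-byte boundary.
        pos += (size + 3) & ~3
    return record
```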

notes

Future considerations