Two weeks ago, we were plagued by several unnerving IMAP and POP performance incidents. That is a euphemism for “too slow to be usable”. The worst one happened on Monday morning, coincidentally at the very time the whole country checks its mail. Most likely we had hit a hidden capacity limit. But where? Our capacity monitoring showed nothing but happy green lights: low CPU load, low network throughput, low disk wait times. Time for some investigation and countermeasures.
TL;DR: we achieved a jaw-dropping drop in network and NFS load by switching from Courier to Dovecot. It took some effort to do this transparently, though.
We have some 40K mailboxes and 2TB of mail storage. This is all handled by a bunch of servers called “mail-fetchX” running the Courier IMAP, POP and auth daemons. They mount NFS shares from a big-ass ZFS Nexenta fileserver. Other servers (“mail-in”) handle mail delivery. Because we use the Maildir storage format, we don’t need any sophisticated locking services: every message is a distinct file with a (globally) unique name. The major disadvantage of Maildir on ZFS is that readdir operations are expensive, especially over NFS. This means that a basic index request (which happens at the start of every POP/IMAP login) triggers an expensive readdir on a folder that can hold up to 100K messages.
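To make the trade-off concrete, here is a minimal sketch of the Maildir model (paths and the message name are made up for illustration): delivery is lock-free because each message is one uniquely named file, but any folder listing is a full directory scan.

```shell
#!/bin/sh
# Sketch of the Maildir model with a hypothetical mailbox path.
set -e
box=$(mktemp -d)/Maildir
mkdir -p "$box/tmp" "$box/new" "$box/cur"

# Delivery: write into tmp/, then rename into new/ -- rename is atomic on a
# single filesystem, so no locking is needed.
msg="$(date +%s).12345.mailhost.example"   # unique name: time.pid.hostname
printf 'Subject: hi\n\nbody\n' > "$box/tmp/$msg"
mv "$box/tmp/$msg" "$box/new/$msg"

# A folder index (start of every POP/IMAP login) must readdir the whole
# directory: one entry per message, so 100K messages means 100K entries.
ls "$box/new" | wc -l
```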
With some basic logical deduction, we pinpointed the bottleneck in the NFS/ZFS subsystem. To avert the first wave of incidents, our primary focus was to scale out the centralized storage. We used a custom Solaris dtrace script to analyze the per-mailbox I/O load on the ZFS server. With a spare ZFS beast and some quick Perl and rsync wizardry, we transparently migrated the top-40 most demanding mailboxes, which accounted for roughly 40% of the I/O load. (Thanks to rsync and the Maildir model, we could do this almost atomically, guaranteeing that no mail would be lost between synchronisations. In the very worst (and rare) case, some messages could be delivered twice.) Phew, that took some pressure off the cooker!
However, we didn’t feel comfortable with this preliminary solution. Given that a single mailbox with 100K messages would eat a significant slice of the ZFS system’s capacity, what would happen when three such mailboxes were checked simultaneously? After lots of Googling, we decided to create a test setup with Dovecot.
Dovecot is arguably more experimental than Courier (as evidenced by its higher release and development velocity), which implies a higher risk of unknown bugs. However, Dovecot has the killer feature of maintaining index files for Maildir folders. In effect, whenever the IMAP/POP command frequency exceeds the new-mail delivery rate (which is almost always the case), these indexes eliminate some to most of the readdir calls.
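For NFS-backed Maildir setups like ours, the relevant knobs look roughly like this. A minimal, hypothetical fragment (the directive names are Dovecot's own; the index path is made up), not our production config:

```
# dovecot.conf (fragment) -- keep Dovecot's index files on fast local disk
# instead of the NFS share, so folder listings hit the index, not readdir.
mail_location = maildir:~/Maildir:INDEX=/var/dovecot/indexes/%u

# Dovecot's documentation recommends disabling mmap when mail lives on NFS.
mmap_disable = yes
```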
In a matter of a few days, we built a multi-server test platform. Thanks to our earlier investment in integrated black-box testing, we could quickly identify behavioural differences between Dovecot and Courier. Apart from differences in RFC interpretation (luckily negligible, but incorporated into our tests to be on the safe side), we hit the problem that both systems maintain their own, incompatible cache of new/read/deleted flags. Without intervention, all of our POP users would re-download their whole mailbox upon their first encounter with Dovecot (or vice versa, in case of an emergency rollback). There is a conversion script, but converting all 40K of our mailboxes would have required a freeze (shutdown) of our mail services for up to 2 hours. Our engineering heroes Vincent, Robin and Jeroen took up the challenge of converting without downtime, and set up a system where the cache conversion was executed on the fly the first time a user logged in. Hurray!
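One way to sketch that lazy, once-per-mailbox conversion is a login hook that checks a marker file and runs the stock conversion script (courier-dovecot-migrate.pl, which ships with Dovecot) exactly once. The real mechanism Vincent, Robin and Jeroen built may well differ; the marker-file name and the `CONVERT_CMD` override below are inventions for illustration.

```shell
#!/bin/sh
# Hypothetical once-per-mailbox conversion hook, run at a user's first login.
convert_once() {
    maildir=$1
    marker="$maildir/.dovecot-converted"   # hypothetical marker file
    [ -e "$marker" ] && return 0           # already converted: nothing to do
    # One-way Courier -> Dovecot run of the stock conversion script
    # (CONVERT_CMD exists only so tests can stub the converter out).
    "${CONVERT_CMD:-courier-dovecot-migrate.pl}" --to-dovecot --convert "$maildir" \
        && touch "$marker"
}
```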
Upon release, we did encounter a flaw in our configuration that had slipped through our acceptance tests and made Dovecot crumble beyond 400 concurrent connections. This was quickly fixed, and within 24 hours all of our mailboxes were successfully converted. An issue with the Roundcube webmail system was quickly resolved by Jeroen (by disabling Roundcube’s internal cache). All that was left was to wait excitedly for the first live performance graphs to come in.
After a week:
As you can see, we have slashed the load on our network and NFS subsystem. Both traffic and the number of NFS calls have shrunk 5-fold, with the biggest drop in readdir calls.
Time for a celebration drink!