Adventures of buffer cache and I/O bottlenecks
Earlier this year, we noticed a growing stability problem with our main mailserver, mail1.bytemine.net. At first we blamed it on the increased load from the ever-increasing volume of spam. We decided to move the biggest resource hog, SpamAssassin, away from this machine.
Since it mostly processes text and binary content in e-mails, SpamAssassin eats CPU for breakfast. For a few weeks this seemed to help quite a bit, but then our problems returned, even worse than before. Thanks to the nice system data graphing suite symon, Bernd noticed that our mailserver was constantly doing roughly 3 to 4 MB/s of disk I/O. While this does not seem like much at first, it meant that on average the disks were always busy and data meant to go to disk was waiting around in the buffer cache.
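To see why a few MB/s can keep disks busy, it helps to count requests rather than bytes. The following is only a rough sketch: the request size, the per-disk request budget and the number of disks are assumptions chosen for illustration, not values measured on mail1, and MB is treated as MiB for simplicity.

```python
def required_iops(throughput_mb_s, request_kib):
    """Requests per second needed to move throughput_mb_s in request_kib chunks."""
    return throughput_mb_s * 1024 / request_kib


def busy_fraction(iops_needed, iops_available):
    """Share of the array's request budget that the workload consumes."""
    return iops_needed / iops_available


# All numbers below are assumptions for illustration, not measurements.
REQUEST_KIB = 8        # assumed average request size for mail traffic
IOPS_PER_DISK = 100    # assumed random-I/O budget of one uncached SATA disk
DISKS = 4              # assumed number of disks behind the controller

for mb_s in (3, 4):
    needed = required_iops(mb_s, REQUEST_KIB)
    share = busy_fraction(needed, IOPS_PER_DISK * DISKS)
    print(f"{mb_s} MB/s -> {needed:.0f} requests/s, {share:.0%} of the assumed budget")
```

Under these assumed numbers, 3 to 4 MB/s of small requests already sits at or above what the disks can serve, which would match the "always busy" picture even though the byte rate looks harmless.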
Additionally, we were now watching the machine with systat(1), which shows the I/O stats very nicely in real time.
Around the same time we got a new machine with an LSI MegaRAID controller (MegaRAID SAS 8408, driven in OpenBSD by the mfi(4) driver). While debugging performance issues with the RAID, we figured out that it came with very safe but slow default settings:
- all disk caches were disabled
- the controller cache was disabled
- the controller was in Write-Through mode
Then we checked the manual for the LSI MegaRAID controller in our mailserver (LSI MegaRAID SATA 150-4, driven by the ami(4) driver), discovered that the same options were present there, and assumed they would have the very same defaults. The mailserver was initially deployed in 2005 and until November 2007 never showed any signs of I/O congestion. Yet it definitely looked to us as if we were having I/O problems big time.
By mid-February we were seeing resource starvation symptoms every other day, with highly sluggish behavior during peak hours. Out of nowhere userland would freeze while the kernel still answered ICMP packets.
Once we went down to Frankfurt to look at the controller and saw that the very same defaults were in use, it became quite clear. Bernd reconfigured the controller to use saner settings, and within minutes we saw totally different I/O behavior. The following graph, showing the days before and after the changed controller settings, displays this very well.
Looking at the week after we adjusted the controller settings, the difference in I/O is enormous. What happened was:
- With the default controller settings, data could not get to the disks quickly enough, so we constantly dragged pending I/O along with us. With the adjusted settings, data gets to the disks quickly, so we don’t build up such a congested situation.
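The "dragging pending I/O along" effect can be sketched as a simple queue: whatever the disks cannot complete in one interval is carried into the next. This is a toy model only; the load pattern and both service rates are made-up numbers, not values taken from our graphs.

```python
def simulate_backlog(arrivals_mb, service_mb_per_tick):
    """Return the pending-I/O backlog (in MB) after each tick.

    arrivals_mb: MB of new I/O arriving in each tick (hypothetical numbers).
    service_mb_per_tick: MB the storage can complete per tick.
    """
    backlog = 0.0
    history = []
    for arriving in arrivals_mb:
        # New requests join whatever is still pending from earlier ticks.
        backlog += arriving
        # The disks drain at most service_mb_per_tick, never below zero.
        backlog = max(0.0, backlog - service_mb_per_tick)
        history.append(backlog)
    return history


# Hypothetical load: quiet, a busy peak, then quiet again (MB per tick).
load = [2, 2, 6, 6, 6, 2, 2, 2]

slow = simulate_backlog(load, service_mb_per_tick=3)  # stands in for the slow defaults
fast = simulate_backlog(load, service_mb_per_tick=8)  # stands in for the adjusted settings

print("slow:", slow)
print("fast:", fast)
```

In this toy run the slow configuration is still working off the peak long after the load has dropped again, while the fast one never accumulates a backlog at all.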
We then took some time to look at some of the recorded data from last year, and suddenly the picture became clear. Since November 2007 we have had an increase in I/O of roughly 700%. No wonder that the “safe” (but slow) default settings didn’t hinder us before: the machine was bored. But with more and more I/O coming in, they suddenly became a bottleneck. As the big increase in I/O is reads from disk, we suspect that more and more of our customers have switched to IMAP, which causes a bigger performance hit than POP3. The following graph shows I/O over the last year.
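To put the 700% figure in perspective: an increase of 700% means eight times the earlier volume. The small sketch below turns that into a utilization number. The old load and the sustainable rate are hypothetical values picked only to be in the same ballpark as the 3 to 4 MB/s mentioned earlier; they are not taken from the recorded data.

```python
def growth_factor(percent_increase):
    """Convert a percentage increase into a multiplier of the old value."""
    return 1 + percent_increase / 100


def utilization(load, capacity):
    """Fraction of capacity in use; values above 1.0 mean a growing backlog."""
    return load / capacity


factor = growth_factor(700)   # 700% more I/O than before
old_load = 0.5                # hypothetical: MB/s before November 2007
capacity = 3.5                # hypothetical: MB/s the slow settings can sustain
new_load = old_load * factor

print("growth factor:", factor)
print(f"utilization before: {utilization(old_load, capacity):.0%}")
print(f"utilization after:  {utilization(new_load, capacity):.0%}")
```

With these assumed numbers the machine goes from mostly idle to asking for more than the storage can deliver, which is the "bored, then suddenly a bottleneck" pattern described above.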





