Making Gravatar fast again

26 Oct

As Matt blogged, Automattic recently purchased Gravatar. The first thing we did was move the service onto the WordPress.com infrastructure. Since the application is very different from WordPress.com, what this really means is applying what we have learned from scaling WordPress.com to make the service faster and more reliable, and using our existing hardware and network infrastructure to stabilize it. The current infrastructure is laid out as follows:

  • 2 application servers (in 2 different data centers for redundancy). One of these servers primarily handles the main Gravatar website, which is a Ruby on Rails application, while the other serves the images themselves. If either of these servers or data centers were to fail, we could easily switch things around to work around the outage.
  • 2 cache servers (1 in each data center). These servers run Varnish. They cache requested images for 10 minutes, so frequently requested images are not repeatedly fetched from the application servers. We are seeing about a 65% cache hit rate and about 1000 requests/second at peak times, although as adoption of the service increases, we expect this number to go up significantly. A single server running Varnish can serve many thousands of requests/sec. The amount of data we are caching is small enough to fit in RAM, so disk I/O is not currently an issue.
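
For readers who want to see the caching behavior in miniature, here is a rough Python sketch of a cache with a 10-minute time-to-live that also tracks its hit rate. It is not how Varnish works internally and it is not our configuration (Varnish is configured in its own language, VCL). The class name TTLCache, the fetch callback, and the injectable clock are all made up for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory cache where every entry expires after a fixed TTL.

    Illustrative only: a hypothetical stand-in for the idea of a
    10-minute image cache, not a model of Varnish itself.
    """

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl = ttl_seconds      # 600 s = the 10 minutes from the post
        self.clock = clock          # injectable so expiry can be tested
        self.store = {}             # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        """Return the cached value, calling fetch(key) on a miss or expiry."""
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: go to the backend and cache the result.
        self.misses += 1
        value = fetch(key)
        self.store[key] = (now + self.ttl, value)
        return value

    def hit_rate(self):
        """Fraction of lookups served from the cache (0.0 if none yet)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

As a back-of-the-envelope check on the numbers above: at a 65% hit rate and about 1000 requests/second at peak, roughly 350 requests/second still miss the cache and have to be served by the backend.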

On the hardware side, for those of you who are curious, we are using HP DL365s for the application servers, and HP DL145s for the caching servers. 4GB of RAM and 2 x AMD Opteron 2218s all around. The application servers have 4 x 73GB 15k SAS drives in a RAID 5, while the caching servers are just single 80GB SATA drives. We use the same hardware configurations extensively for WordPress.com and they work well.

Previously, the service was using Apache2 + Mongrel to serve the main site and lighttpd + mod_magnet to serve the images. We decided to simplify this, and we are currently using lighttpd to serve everything. It is working well for the most part. We seem to have a memory usage issue with lighttpd, which may be related to this long-standing bug. For now, we are just monitoring the memory usage of lighttpd with monit and restarting the service before memory usage gets too high.
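
The monitor-and-restart idea is simple enough to sketch. In production monit does this for us, so the Python below is only an illustration of the check itself, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line. The function names and the 512 MB limit are hypothetical, not our actual settings.

```python
# Hypothetical threshold: restart once resident memory passes 512 MB.
LIMIT_KB = 512 * 1024


def parse_rss_kb(status_text):
    """Return resident memory in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB".
            return int(line.split()[1])
    return None


def should_restart(rss_kb, limit_kb=LIMIT_KB):
    """True when memory usage is known and above the limit."""
    return rss_kb is not None and rss_kb > limit_kb


def read_status(pid):
    """Read the raw status text for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return f.read()
```

A watchdog would call should_restart(parse_rss_kb(read_status(pid))) on a timer and restart the service when it returns True. The restart itself is left out here because it depends on how the service is managed.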

