Barry on WordPress

Making Gravatar fast again

As Matt blogged, Automattic recently purchased Gravatar. The first thing we did was move the service onto the WordPress.com infrastructure. Since the application is very different from WordPress.com, what this really means is applying what we have learned from scaling WordPress.com to make the service faster and more reliable, and using our existing hardware and network infrastructure to stabilize it. The current infrastructure is laid out as follows:

2 application servers (in 2 different data centers for redundancy). One of these servers primarily handles the main Gravatar website, which runs on Ruby on Rails, while the other serves the images themselves. If either of these servers or data centers were to fail, we could easily switch things around to work around the outage.
2 cache servers (1 in each datacenter). These servers are running Varnish. They cache requested images for a period of 10 minutes, so frequently requested images are not repeatedly requested from the application servers. We are seeing about a 65% cache hit rate and about 1000 requests/second at peak times, although as adoption of the service increases, we expect this number to go up significantly. A single server running Varnish can serve many thousands of requests/sec. The amount of data we are caching is small enough to fit in RAM, so disk I/O is not currently an issue.
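
To make the caching behaviour above concrete, here is a minimal Python sketch of a time-limited cache with hit-rate counters. It is only an illustration of the idea, not how Varnish is implemented; the names `TTLCache`, `origin_load`, `fetch` and `clock` are hypothetical, and the only values taken from the post are the 10-minute lifetime, the 65% hit rate and the 1000 requests/second peak.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed lifetime."""

    def __init__(self, fetch: Callable[[str], bytes],
                 ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # stands in for a request to an application server
        self._ttl = ttl_seconds  # 600 s is the 10-minute period from the post
        self._clock = clock      # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, bytes]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            self.hits += 1
            return entry[1]
        # Missing or expired: go back to the origin and cache the fresh copy.
        self.misses += 1
        body = self._fetch(key)
        self._store[key] = (now + self._ttl, body)
        return body

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def origin_load(requests_per_second: float, hit_rate: float) -> float:
    """Requests per second that still reach the application servers."""
    return requests_per_second * (1.0 - hit_rate)
```

With the figures quoted above, `origin_load(1000, 0.65)` works out to roughly 350 requests/second reaching the application servers at peak.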

On the hardware side, for those of you who are curious, we are using HP DL365s for the application servers, and HP DL145s for the caching servers. 4GB of RAM and 2 x AMD Opteron 2218s all around. The application servers have 4 x 73GB 15k SAS drives in a RAID 5, while the caching servers are just single 80GB SATA drives. We use the same hardware configurations extensively for WordPress.com and they work well.

Previously, the service was using Apache2 + Mongrel to serve the main site and lighttpd + mod_magnet to serve the images. We decided to simplify this, and we are currently using lighttpd to serve everything. It is working well for the most part, although we seem to have a memory usage issue with lighttpd, which may be related to this long-standing bug. For now, we are monitoring the application’s memory usage with monit and restarting the service before memory usage gets too high.
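
The memory check itself is simple enough to sketch. The Python below is not monit configuration and not the actual setup; it just illustrates the decision being made, assuming a Linux host where resident memory can be read from the `VmRSS` line of `/proc/<pid>/status`. The function names and the `MEMORY_LIMIT_KB` threshold are hypothetical, since no limit is stated here.

```python
import re
from typing import Optional

# Hypothetical limit; the post does not say what threshold is used.
MEMORY_LIMIT_KB = 1_000_000


def parse_rss_kb(status_text: str) -> Optional[int]:
    """Extract VmRSS (resident memory, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_restart(status_text: str, limit_kb: int = MEMORY_LIMIT_KB) -> bool:
    """True when resident memory has reached the limit."""
    rss = parse_rss_kb(status_text)
    # If the value cannot be read, do not restart on a guess.
    return rss is not None and rss >= limit_kb
```

A watchdog would read the status file for the lighttpd process on a timer, call `needs_restart` on its contents, and restart the service when it returns `True`.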

Barry

October 26, 2007

scaling

gravatar

12 responses to “Making Gravatar fast again”

john allspaw

October 27, 2007 at 7:11 pm

Great stuff, Barry. I’m surprised, though, that the working set of images can fit into RAM on your varnish boxes. I would have guessed it would be bigger.

Since varnish now has at least an LRU eviction policy, what would stop you from lifting the 10 minute expiry and just caching everything ‘forever’?

james

October 27, 2007 at 7:45 pm

I assume you are getting the stats from ‘varnishstat’. Are you using homemade scripts to save that information and graph it?

ArtLung Blog » Misc, Misc, everywhere… and not a drop to drink

October 29, 2007 at 3:48 pm

[…] moved Gravatar to their infrastructure which has gone well, the blog High Scalability pointed out Making Gravatar Fast Again. Cool stuff, and will help them avoid “crashing hard” moments. The Gravatar article […]

Links for Mon 29 Oct 2007 | Joseph Scott’s Blog

October 30, 2007 at 12:27 am

[…] Making Gravatar fast again | Barry on WordPress – Details from Barry on what Automattic is using to run Gravatar. Tags: gravatar […]

Barry

October 30, 2007 at 4:05 am

John,

The working set is only about 1GB (lucky us!)

Having an infinite expiry would require cache invalidation which is not something that we are currently doing on Gravatar.

Barry

October 30, 2007 at 4:08 am

James,

We are using Munin to graph the data. There are Varnish graphing plugins available.

Donncha’s Tuesday Links at Holy Shmoly!

October 30, 2007 at 9:41 am

[…] reveals some of the details behind what powers Gravatar […]

Bruno

November 5, 2007 at 1:30 am

Hi, interesting post.

So, you have a cache & application server in each of your datacenters.

Do you manually fall back to the other one, by changing the DNS entry, if one datacenter is in trouble?

Or do you have something more automatic?

Barry

November 6, 2007 at 10:26 am

Bruno,

Currently it is manual because the automatic failover portion is not complete yet, but it should be finished this week. The basic idea is that each datacenter will also have a server that serves DNS requests. Each datacenter’s DNS server will only return its own IP when queried. Additional monitoring scripts check the application and cache servers to make sure all is functioning normally, and stop the DNS service if it is not. In the case of a server or datacenter outage, the IP of the failed node will not be returned via DNS, so traffic will automatically fail over to the other location. There are DNS TTLs to deal with, but they can be set very low so the impact is minimal.

I will probably write a separate post on this setup with more details once it’s complete.
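
The failover rule described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the monitoring scripts themselves: the function names, the shape of the health checks and the IP addresses (taken from the documentation-only ranges) are all assumptions.

```python
from typing import Callable, Dict, List

# A health check is any callable that returns True when the service is fine.
HealthCheck = Callable[[], bool]


def dns_should_answer(checks: List[HealthCheck]) -> bool:
    """A datacenter's DNS server keeps answering only if every local check passes."""
    for check in checks:
        try:
            if not check():
                return False
        except Exception:
            # A check that raises is treated as a failure.
            return False
    return True


def reachable_ips(datacenters: Dict[str, List[HealthCheck]],
                  ips: Dict[str, str]) -> List[str]:
    """IPs clients can still obtain: one per datacenter whose DNS server is up."""
    return [ips[name] for name, checks in datacenters.items()
            if dns_should_answer(checks)]
```

For example, with `ips = {"dc1": "192.0.2.10", "dc2": "198.51.100.10"}`, a failing application or cache check in `dc1` leaves only the `dc2` address in the result, which is the traffic shift described above once cached DNS answers expire.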

Trent

November 10, 2007 at 11:15 pm

The IT world is pretty lucky that you put up the architecture and strategies that you use, Barry, on all these different setups! Sometimes, though, I think you might need a glossary of terms for us people who don’t speak wrangler, but great stuff nonetheless!

kenpem

March 23, 2008 at 3:36 pm

Now we just need the rest of the world to adopt gravatar as a standard and we’re happy 😉

varnish reverse proxy server by Video Sharing Script

April 9, 2008 at 7:19 am

[…] Read about wordpress.com scaling at https://barry.wordpress.com/2007/10/26/making-gravatar-fast-again/ […]
