Archive for the ‘servers’ Category


Dell MD3000 Multipath on Debian

December 16, 2009

We are in the process of deploying some new infrastructure to store the 150+GB of new content (media only, not including text) uploaded to WordPress.com daily.

WordPress.com data in GB

After some searching and testing, we have decided to use the open source software MogileFS developed in part by our friends at Six Apart. Our initial deployment is going to be 180TB of storage in a single data center and we plan to expand this to include multiple data centers in early 2010. In order to get that amount of storage affordably, the options are limited. We thought about building Backblaze devices, but decided that the ongoing management of these in our hosting environment would be prohibitively complicated. We eventually settled on Dell’s MD PowerVault series. Our configuration consists of:

  • 4 x Dell R710 (32GB RAM, 2 x Intel E5540, 2 x 146GB SAS in RAID 1)
  • 4 x Dell MD3000 (15 x 1TB 7200 RPM HDD each)
  • 8 x Dell MD1000 (15 x 1TB 7200 RPM HDD each)

Each Dell R710 is connected to an MD3000, and 2 MD1000s are then connected to each MD3000. The end result is 4 self-contained units, each providing 45TB of storage for a total of 180TB.
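As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic. The variable names are just for illustration, and the figures are raw capacity before any filesystem overhead.

```python
# Raw capacity of the deployment described above.
UNITS = 4                  # each unit: 1 x R710 + 1 x MD3000 + 2 x MD1000
ENCLOSURES_PER_UNIT = 3    # 1 x MD3000 plus 2 x MD1000
DRIVES_PER_ENCLOSURE = 15
TB_PER_DRIVE = 1

drives_per_unit = ENCLOSURES_PER_UNIT * DRIVES_PER_ENCLOSURE
tb_per_unit = drives_per_unit * TB_PER_DRIVE
total_tb = UNITS * tb_per_unit

print(drives_per_unit, tb_per_unit, total_tb)  # 45 45 180
```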

Illustration by Joe Rodriguez

Our proof of concept was deployed on a single Dell 2950 connected to an MD1000, and things worked almost flawlessly. We could use all of our existing tools to monitor, manage, and configure the devices when needed. Little did I know the MD3000s would be so much more of a pain :) Since we are using MogileFS, which handles the distribution of files across various hosts and devices, we wanted these devices set up in what I thought was a relatively simple JBOD configuration. Each drive would be exported as a device to the OS, then we would mount 45 devices per machine and MogileFS would take care of the rest. It didn’t exactly work that way.

When the hardware was initially deployed to us, the units were configured in a high availability (HA) setup, with each controller on the MD3000 connected to a controller on the R710. This way, if a controller fails, in theory the storage is still accessible. The problem with this type of setup is that in order to make it work flawlessly, you need to use the Dell multi-path proxy and mpt drivers, not the ones provided by the Linux kernel. Dell’s provided drivers don’t work on Debian. Initially, without multipath configured, some confusing stuff happens: we had 90 devices detected by the OS (/dev/sdb through /dev/sdcn), but every other device was unreachable. After some trial and error with various multipath configurations, and some help, I ended up with this:

apt-get install multipath-tools

Our multipath.conf:

defaults {
        getuid_callout "/lib/udev/scsi_id -g -u -s /block/%n"
        user_friendly_names on
}
devices {
        device {
                vendor DELL*
                product MD3000*
                path_grouping_policy failover
                getuid_callout "/lib/udev/scsi_id -g -u --device=/dev/%n"
                features "1 queue_if_no_path"
                path_checker rdac
                prio_callout "/sbin/mpath_prio_rdac /dev/%n"
                hardware_handler "1 rdac"
                failback immediate
        }
}
blacklist {
       device {
               vendor DELL.*
               product Universal.*
       }
       device {
               vendor DELL.*
               product Virtual.*
       }
}

multipath -F
multipath -v2
/etc/init.d/multipath-tools start

This gave me a bunch of device names in /dev/mapper/* which I could access, partition, format, and mount. A few things to note:

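  • user_friendly_names doesn’t seem to work. The devices were all still labeled by their WWID even with that option enabled. One possible cause is that the multipath.conf documentation lists yes and no as the values for this option, not on.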
  • The status of the paths as shown by multipath -ll seemed to change over time (from active to ghost). Not sure why.
  • Even with all of this set up and working, I was still seeing the occasional I/O error and path failure reported in the logs

After a few hours of “fun” with this, I decided that it wasn’t worth the hassle or complexity, and since we have redundant storage devices anyway, we would just configure the devices in “single path” mode, mount them directly, and forgo multipath. Not so fast… according to Dell engineers, “single path mode” is not supported. Easy enough: let’s unplug one of the controllers, creating our own “single path mode”, and everything should work, right? Sort of.

If you just go and unplug the controller while everything is running, nothing works. The OS needs to re-scan the devices in order to address them properly. The easiest way for this to happen is to reboot (sure this isn’t Windows?). After a reboot, the OS properly saw 45 devices (/dev/sdb – /dev/sdau), which is what I would have expected. The only problem was that every other device was inaccessible! It turns out that the MD3000 tries to balance the devices across the 2 controllers, and half of the drives had been assigned a preferred path of controller 1, which was unplugged. After some additional MD3000 configuration, we were able to force all of the devices to prefer controller 0 and everything was accessible once again.
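For reference, Linux names SCSI disks sda through sdz, then sdaa, sdab, and so on. The sketch below reproduces that naming scheme so that ranges like the ones above are easy to check. The helper names `sd_name` and `sd_index` are made up for this example.

```python
import string


def sd_name(index):
    """Return the SCSI disk name for a zero-based index (0 -> 'sda')."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index
    while True:
        n, rem = divmod(n, 26)
        letters = string.ascii_lowercase[rem] + letters
        if n == 0:
            break
        n -= 1  # bijective base 26: there is no "zero" letter
    return "sd" + letters


def sd_index(name):
    """Inverse of sd_name ('sdb' -> 1)."""
    n = 0
    for ch in name[2:]:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


# Number of device names in the inclusive range sdb..sdau
print(sd_index("sdau") - sd_index("sdb") + 1)
```

By this count, sdb through sdau spans 46 names rather than 45. That would be consistent with the extra management LUN described below also showing up as a disk, though that is an inference and not something I verified.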

The only other thing worth noting here is that the MD3000 exports an additional device that you may not recognize:

scsi 1:0:0:31: Direct-Access DELL Universal Xport 0735 PQ: 0 ANSI: 5

For us this was LUN 31 and the number doesn’t seem user configurable, but I suppose other hardware may assign a different LUN. This is a management device for the MD3000 and not a device that you can or should partition, format, or mount. We just made sure to skip it in our setup scripts.
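Here is a minimal sketch of how a setup script might skip that device. It is not our actual script. It assumes the model string the kernel reports in /sys/block/<device>/device/model starts with "Universal Xport", as in the log line above, and the function name `usable_disks` is made up for this example.

```python
from pathlib import Path


def usable_disks(sys_block="/sys/block", skip_model="Universal Xport"):
    """Return sd* device names whose SCSI model is not the management LUN."""
    names = []
    for dev in sorted(Path(sys_block).glob("sd*")):
        try:
            # sysfs pads the model string with spaces, so strip it
            model = (dev / "device" / "model").read_text().strip()
        except OSError:
            continue  # no readable model attribute; leave this device alone
        if model.startswith(skip_model):
            continue  # management device: do not partition, format, or mount
        names.append(dev.name)
    return names
```

Calling `usable_disks()` on the storage host would return names such as "sdb" in plain string order (so "sdaa" sorts before "sdb"), and the caller would prepend /dev/ before partitioning or mounting. Matching on the model string rather than on LUN 31 avoids depending on a LUN number that other hardware may assign differently.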

I suppose if we were running Red Hat Enterprise Linux, CentOS, SUSE, or Windows, this would have all worked a bit more smoothly, but I don’t want to run any of those. We have over 1000 Debian servers deployed and I have no plans on switching just because of Dell. I really wish Dell would make their stuff less distro-specific — it would make things easier for everyone.

Is anyone else successfully running this type of hardware configuration on Debian using multipath? Have you tested a failure? Do you have random I/O errors in your logs? Would love to hear stories and tips.

I have some more posts to write about our adventures in Dell MD land. The next one will be about getting Dell’s SMcli working on Debian, and then after that a post with some details of our MogileFS implementation.

* Thanks to the fine folks at Layered Tech for helping us tweak the MD3000 configuration throughout this process.


AMD Barcelona vs. Intel Nehalem

May 22, 2009

We are looking at switching some of our servers from AMD Opteron Barcelona quad-core processors to the new Intel 5520 Nehalem processors. Both are 4-core CPUs, but the Intels use hyper-threading, so the OS sees 8 logical cores per CPU. It wasn’t that long ago that the first thing you did with a hyper-threading-enabled CPU was switch it off in the BIOS, but I have heard good things about Intel’s reincarnation of hyper-threading, so I decided to give it a shot.

I ran some real-world stress tests against these servers, adding them into the WordPress.com web pool and seeing how many requests per second they could serve before becoming 100% CPU bound and effectively falling over. The types of requests served are varied; a lot are rendering web pages, but there are also quite a few image resizing operations thrown in, as we spread this image work evenly over the 2500 cores in our web tier. Everything is PHP executed via FastCGI. I was a bit skeptical that there would be much of a difference between the two processors, but the numbers proved me wrong: the Nehalems are impressive.

2 x AMD Opteron 2356 Barcelona Quad-core 2.3GHz
40 requests/second at 87.5% CPU utilization

2 x Intel 5520 Nehalem Quad-core 2.26GHz
78 requests/second at 94% CPU utilization

A few things that I thought were interesting:

  • On a per request basis, there isn’t much of a difference between the two. They both generate a given page in roughly the same amount of time.
  • As CPU utilization approaches 100%, the Intels scale rather linearly, while the AMDs seem to struggle over the 85% range.
  • The load averages were pretty high during these tests (35+ on the Intel box), but request times didn’t seem to suffer.
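To put the reported figures side by side, here is a small sketch of the arithmetic. The per-core numbers use physical cores (2 sockets x 4 cores on both boxes). The projection to 100% CPU is a naive linear extrapolation, which is an assumption and not a measurement, and given the AMD behaviour above 85% it is probably optimistic for that box.

```python
# Figures as reported above; both servers have 2 sockets x 4 physical cores.
results = {
    "2 x AMD Opteron 2356": {"rps": 40, "cpu": 0.875, "cores": 8},
    "2 x Intel 5520": {"rps": 78, "cpu": 0.94, "cores": 8},
}


def rps_per_core(r):
    """Requests per second per physical core."""
    return r["rps"] / r["cores"]


def rps_at_full_cpu(r):
    """Naive linear extrapolation to 100% CPU (an assumption)."""
    return r["rps"] / r["cpu"]


amd = results["2 x AMD Opteron 2356"]
intel = results["2 x Intel 5520"]
speedup = intel["rps"] / amd["rps"]

for name, r in results.items():
    print(f"{name}: {rps_per_core(r):.2f} req/s per core, "
          f"~{rps_at_full_cpu(r):.1f} req/s if scaling stayed linear")
print(f"Intel/AMD throughput ratio: {speedup:.2f}x")
```

On the measured numbers, the Intel box serves 1.95 times the requests of the AMD box, or 9.75 versus 5.0 requests per second per physical core.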

Has anyone else seen the same sort of results, or maybe something to the contrary? These 2 configurations are roughly the same price, which makes the Intels seem like a no-brainer for web applications.


New Datacenter for WordPress.com

February 16, 2009

Towards the end of 2008, we brought online a new datacenter to serve the over 5.5 million blogs now hosted on the WordPress.com platform. Adding the data center in Chicago, IL gives us a total of 3 data centers across the US which serve live content at any given time. We have decommissioned one of our facilities in the Dallas, TX area. Our friends at Layered Technologies were kind enough to shoot this footage for us (think The Blair Witch Project) and the always awesome Michael Pick took care of the editing. Here’s a peek at what a typical WordPress data center installation looks like…

This movie requires Adobe Flash for playback.

For those interested in technical details here is a hardware overview of the installation:

  • 150 x HP DL165 (dual quad-core AMD 2354 processors, 2GB-4GB RAM)
  • 50 x HP DL365 (dual dual-core AMD 2218 processors, 4GB-16GB RAM)
  • 5 x HP DL185 (dual quad-core AMD 2354 processors, 4GB RAM)

And here is a graph of what the current CPU usage looks like across about 700 CPU cores. As you can see, there is plenty of idle CPU for those big spikes or in case one of the other 2 data centers fails and we have to route more traffic to this one.

Graph: CPU usage in the Chicago data center


Redundancy and power outages

July 25, 2007

Scott Beale reports that many Web 2.0 websites were affected by today’s power outage at 365 Main in San Francisco. While unfortunate, as a systems guy I have to assume things like this are going to happen. They shouldn’t happen, but they can and they will. At the data center level, there should be multiple levels of redundancy that minimize the probability of a power outage. Things such as multiple power circuits, redundant UPSes, and generators are standard. For a complete power outage to occur there should have to be multiple simultaneous system failures. I looked for a statement from 365 Main as to what the problem was, but couldn’t find one.

The system architecture behind WordPress.com and Akismet is designed to take entire data center failures into account. For WordPress.com, we serve live content in real-time from 3 data centers (33% from each data center) and in the event of a data center failure, traffic is automatically re-routed to the 2 remaining data centers. Syncing content in real-time between multiple data centers has not been easy, but at times like this I am sure that we made the right decision.
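The arithmetic of that failover is worth spelling out. The sketch below is not how our routing is implemented. It only shows what happens to the traffic shares when one of three equally weighted data centers drops out. The data center names and the `reroute` function are hypothetical.

```python
def reroute(weights, failed):
    """Return traffic shares after `failed` drops out.

    The failed site's share is spread across the survivors in
    proportion to their existing weights.
    """
    survivors = {dc: w for dc, w in weights.items() if dc != failed}
    total = sum(survivors.values())
    if total <= 0:
        raise ValueError("no surviving data center can take traffic")
    return {dc: w / total for dc, w in survivors.items()}


# Hypothetical names; the post only says there are three data centers.
normal = {"dc-a": 1 / 3, "dc-b": 1 / 3, "dc-c": 1 / 3}
after_failure = reroute(normal, failed="dc-b")
print(after_failure)  # {'dc-a': 0.5, 'dc-c': 0.5}
```

Each surviving data center goes from a third of the traffic to half of it, which is 1.5 times its normal load, so each site needs at least that much headroom.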


Keeping track of 300 servers

July 18, 2007

Since WordPress.com broke 10 million pageviews today, I thought it would be a good time to talk a little bit about keeping track of all the servers that run WordPress.com, Akismet, WordPress.org, Ping-o-matic, etc. Currently we have over 300 servers online in 5 different data centers across the country. Some of these are colocated, and others are with dedicated hosting providers, but the bottom line is that we need to keep track of them all as if they were our children! Currently we use Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have suited us well up until now, but there have been some scaling issues.

  • MediaWiki: We use MediaWiki, the software behind Wikipedia, for a page with a table that contains all of our server information, from hardware configuration to physical location, price, and IP information. Unfortunately, MediaWiki tables don’t seem to be very flexible and you cannot perform row or column-based operations. This makes simple things such as counting how many servers we have somewhat time consuming. Also, when you get to 300 rows, editing the table becomes a very tedious task. It is very easy to make a mistake that throws the entire table out of whack. Even dividing the data into a few tables doesn’t make it much easier. In addition, there is no concept of unique records (nor do I really think there should be), so it is very easy to end up with 2 servers that have the same IP or the same hostname listed.
  • Munin: Munin has become an invaluable tool for us when troubleshooting issues and planning future server expansion. Unfortunately, scaling Munin hasn’t been the best experience. At about 100 hosts, we started running into disk IO problems caused by the various data collection, graphing, and HTML output jobs Munin runs. It seemed the solution was to switch to the JIT graphing model, which only draws the graphs when you view them. Unfortunately, this only seemed to make the interface excruciatingly slow and didn’t help the IO problems we were having. At about 150 hosts we moved Munin to a dedicated server with 15k RPM SCSI drives in a RAID 0 array in an attempt to give it some more breathing room. That worked for a while, but we then started running into problems where the process of polling all the hosts actually took longer than the monitoring interval. The result was that we were missing some data. Since then, we have resorted to removing some of the things we graph on each server in order to lighten the load. Every once in a while, we still run into problems where a server is a little slow to respond and it causes the polling to take longer than 5 minutes. Obviously, better hardware and fewer graphed items aren’t a scalable solution, so something is going to have to change. We could put a Munin monitoring server in each datacenter, but we currently sum and stack graphs across datacenters, and I am not sure if or how that works when the data is on different servers. The other big problem I see with Munin is that if one host’s graphs stop updating and that host was part of a totals graph, the totals graph will just stop working. This happened today, which was very frustrating.
  • Nagios: I feel this has scaled the best of the 3. We have this running on a relatively light server and have no load or scheduling issues. I think it is time, however, to look at moving to Nagios’ distributed monitoring model. The main reason for this is that since we have multiple datacenters, each of which has its own private network, it is important for us to monitor each of these networks independently in addition to the public internet connectivity to each datacenter. The simplest way to do this is to put a Nagios monitoring node in each data center, which can then monitor all the servers in that facility and report the results back to the central monitoring server. Splitting up the workload should also allow us to scale to thousands of hosts without any problems.
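The Munin polling problem above comes down to a simple budget: one pass over all hosts has to finish inside the 5-minute interval. Here is a rough sketch of that budget. It does not model how Munin actually schedules its work. The per-host time of 1.5 seconds and the worker counts are hypothetical, and the function name is made up for this example.

```python
import math


def poll_pass_seconds(hosts, seconds_per_host, workers=1):
    """Rough wall-clock time for one polling pass.

    Assumes every host takes the same time and that `workers` hosts
    are polled concurrently; real polling times vary per host.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return math.ceil(hosts / workers) * seconds_per_host


INTERVAL = 300  # the 5-minute polling interval mentioned above

# Hypothetical per-host cost of 1.5 seconds.
serial = poll_pass_seconds(hosts=300, seconds_per_host=1.5)
parallel = poll_pass_seconds(hosts=300, seconds_per_host=1.5, workers=4)
print(serial, serial <= INTERVAL)      # 450.0 False
print(parallel, parallel <= INTERVAL)  # 112.5 True
```

Under these made-up numbers, polling 300 hosts one at a time overruns the interval, while splitting the same work four ways fits comfortably. That is the same reasoning behind putting a monitoring node in each data center.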

Anyone have recommendations on how to better deal with these basic server monitoring needs? I have looked at Zabbix, Cacti, Ganglia, and some others in the past, but have never been super-impressed. Barring any major revelations in the next couple weeks, I think we are going to continue to scale out Nagios and Munin and replace the wiki page with a simple PHP/MySQL application that is flexible enough to integrate into our configuration management and deploy tools.
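One thing the wiki table cannot do is flag two servers that share an IP or a hostname. Here is a minimal sketch of that check. The inventory rows and the `find_duplicates` function are hypothetical, and in a MySQL-backed application a unique index on each column would likely be the more natural way to enforce this.

```python
from collections import defaultdict


def find_duplicates(servers, field):
    """Map each value of `field` used more than once to its row numbers."""
    rows_by_value = defaultdict(list)
    for row_number, server in enumerate(servers, start=1):
        rows_by_value[server[field]].append(row_number)
    return {value: rows for value, rows in rows_by_value.items() if len(rows) > 1}


# Hypothetical inventory rows; hostnames and IPs are made up
# (the addresses come from the 192.0.2.0/24 documentation range).
inventory = [
    {"hostname": "web1", "ip": "192.0.2.10"},
    {"hostname": "web2", "ip": "192.0.2.11"},
    {"hostname": "db1", "ip": "192.0.2.10"},   # same IP as row 1
    {"hostname": "web2", "ip": "192.0.2.12"},  # same hostname as row 2
]

print(find_duplicates(inventory, "ip"))        # {'192.0.2.10': [1, 3]}
print(find_duplicates(inventory, "hostname"))  # {'web2': [2, 4]}
```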