iPad Stats Visualization

30 Nov

Evan has a cool post showing some of our internal heat map stats and some interesting points on data visualization.

Speaking at WordCamp San Francisco

20 Jul

After a 3-year speaking hiatus from WordCamp SF, I am excited to be speaking again this year. The most interesting part of my talks is usually the Q&A at the end, so this time we decided to get rid of the talk and go straight to the Q&A. It will focus on running large WordPress installations, but I’m sure there will be time to discuss other WordPress-related things. Bring your questions and make them difficult! If you have a question but won’t be able to attend, please ask in the comments and I will try to answer it during the session (which I think will be recorded).

HyperDB Replication Lag Detection

20 Jul

Howdy – Iliya here again. Seems like I am taking over Barry’s blog. Hopefully this will motivate him to blog more.

On WordPress.com we have over 218 million tables and perform tens of thousands of queries per second. To scale all of this, we shard our 24 million blogs across more than 550 MySQL servers. This allows us to cope with load bursts and to handle database server failures.

For those who are unfamiliar, MySQL data replication is asynchronous and works as follows:

  1. [Master] Receives a query that modifies database structure or content (INSERT, UPDATE, ALTER, etc.).
  2. [Master] The query is written to a log file (aka the binlog).
  3. [Master] The query is executed on the master.
  4. [Slaves] Create a “Slave I/O” thread that connects to the [Master] and requests all new queries from the master’s binlog.
  5. [Master] Creates a “Binlog dump” thread for each connected slave that reads the requested events from the binlog and sends them to the slave.
  6. [Slaves] Start a “Slave SQL” thread which reads queries from the log file written by the “Slave I/O” thread and executes them.
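
To make the moving parts above a bit more concrete, here is a minimal sketch (not part of HyperDB or our production tooling; the host, credentials, and output are placeholders) that inspects the slave-side threads, and the lag estimate MySQL itself reports, via SHOW SLAVE STATUS:

<?php
// Illustration only: connect to a slave and look at its replication threads.
// Host and credentials are placeholders, not real configuration.
$slave = new mysqli( 'db-slave.example.com', 'monitor', 'secret' );

$result = $slave->query( 'SHOW SLAVE STATUS' );
$status = $result->fetch_assoc();

// The "Slave I/O" thread (step 4) and the "Slave SQL" thread (step 6).
echo 'Slave I/O thread running: ' . $status['Slave_IO_Running'] . "\n";
echo 'Slave SQL thread running: ' . $status['Slave_SQL_Running'] . "\n";

// MySQL's own estimate of how far this slave is behind the master, in seconds.
echo 'Seconds behind master:    ' . $status['Seconds_Behind_Master'] . "\n";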

There are a number of things to be considered in this scenario, which can lead to a condition known as replication lag, where the slaves have older data than the master:

  • Since only one thread on the slave executes write queries, and there are many execution threads on the master, there is no guarantee that the slave will be able to execute queries with the same speed as the master.
  • Long-running SELECTs or explicit locks on the slave will cause the “Slave SQL” thread to wait, thus slowing it down.
  • Long-running queries on the master would take at least the same amount of time to run on the slave, causing it to fall behind the master.
  • I/O (disk or network) issues can prevent the slave from reading and replaying the binlog events, or slow it down.

In order to deal with this, we needed a way to avoid connections to lagged slaves as long as there are slaves that are current. This would allow the lagged ones to recover faster and avoid returning old data to our users. It also had to be flexible enough that we could have different settings for acceptable replication lag per dataset, or stop tracking it altogether. Since we use the advanced database class HyperDB for all our database connections, it was the obvious place to integrate this.

We implemented it in the following steps:

  • If a connection modifies data in a given table, then all subsequent SELECTs on the same connection for that table are sent to the master. Chances are replication won’t be fast enough to propagate the changes to the slaves on the same page load.  This logic has existed in HyperDB for a while.
  • Before we make a connection to a slave, we use a callback to check whether we have information about this slave’s lag in the cache, and we skip the slave based on that, unless all slaves in the dataset are considered lagged. In case replication breaks on all slaves, we would rather return old data than overload the master with read queries and cause an outage.
  • After a successful connection to a slave, if there was nothing in the cache regarding its lag status and not all slaves are considered lagged, we execute a second callback that checks whether this slave is lagged and updates the cache.

A slave is considered lagged when it has a “lag threshold” defined in its dataset configuration and the current lag is more than this threshold.
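
For illustration, here is a rough sketch of how that wiring might look in a HyperDB configuration file, assuming HyperDB’s add_database()/add_callback() interface; the server details, dataset name, callback names, registration signature, and the exact lag-threshold key are illustrative rather than copied from our production setup:

<?php
// Illustrative db-config.php fragment: names and values are examples only.
$wpdb->add_database( array(
    'host'          => 'db1-slave.example.com', // a read slave for this dataset
    'user'          => 'wpuser',
    'password'      => 'secret',
    'name'          => 'wordpress',
    'write'         => 0,
    'read'          => 1,
    'dataset'       => 'blogs',
    'lag_threshold' => 2, // seconds of replication lag we tolerate here
) );

// Called before connecting: return the cached lag for the slave so it can be
// skipped when it is over the threshold (unless all slaves are lagged).
$wpdb->add_callback( 'my_get_lag_cache', 'get_lag_cache' );

// Called after connecting when the cache had no answer: measure the lag
// and store it in the cache for subsequent requests.
$wpdb->add_callback( 'my_get_lag', 'get_lag' );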

We considered the following options for checking whether a slave is lagged. No MySQL patches are required for any of them:

  • Checking the value of Seconds_Behind_Master from the SHOW SLAVE STATUS statement executed on the slave. It shows the difference between the timestamp of the currently executed query and that of the latest query received from the master. Although it is easy to implement and has low overhead, the main problem with this option is that it is not completely reliable, as it can be tricked by I/O latency and/or master connection problems.
  • Tracking the “File” and “Position” from SHOW MASTER STATUS executed on the master and comparing them to Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS on the slave. This way we can wait until the slave has executed the queries up to that binlog “File” and “Position” before sending certain queries to it, effectively waiting for the data to be replicated to the point where we need it. While very reliable, this option is more complex, has a lot of overhead, and doesn’t give us a clock-time value that we can track and set thresholds on across servers.
  • Tracking the difference between the current time on the slave and a replicated timestamp that the master updates every second. This is basically what mk-heartbeat does. It requires proper time sync between the master and the slave servers but is otherwise very reliable.

The third option fit our needs best; however, the code is flexible enough to easily support any of them. For caching, we decided to go with memcached, since it works well in our distributed, multi-server, multi-datacenter environment, but other methods (APC, shared memory, a custom daemon, etc.) would work just fine.
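
As a rough, self-contained sketch of that third option (the table name, cache key, hosts, and credentials are assumptions, and this is not HyperDB’s internal code), a heartbeat-based lag check with memcached caching could look something like this:

<?php
// Sketch of a heartbeat-style lag check with memcached caching.
// Assumes an mk-heartbeat style table whose `ts` column the master
// updates every second. Names and credentials are illustrative.

$cache = new Memcached();
$cache->addServer( 'memcached.example.com', 11211 );

$slave = new mysqli( 'db1-slave.example.com', 'monitor', 'secret', 'monitoring' );

$cache_key     = 'lag:db1-slave.example.com:3306';
$lag_threshold = 2; // seconds, the per-dataset setting

// Reuse a recent measurement if another web server has already made one.
$lag = $cache->get( $cache_key );

if ( false === $lag ) {
    // Once the master's once-a-second update replicates, the gap between
    // the heartbeat timestamp and the slave's own clock is the lag.
    $result = $slave->query(
        'SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag FROM heartbeat LIMIT 1'
    );
    $row = $result->fetch_assoc();
    $lag = (int) $row['lag'];

    // Cache it briefly so other requests skip both the connection and the query.
    $cache->set( $cache_key, $lag, 30 );
}

// Skip this slave when it is over the threshold, unless every slave in the
// dataset is lagged (in that case, stale reads beat overloading the master).
$use_slave = ( $lag <= $lag_threshold );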

HyperDB is free, open source, and easy to integrate into your WordPress installation. You can download it here. We hope you enjoy this new functionality, and please let us know in the comments if you have any questions.

Uptime related server crashes

14 Jun

This is a guest post by Iliya Polihronov.  Iliya is the newest member of the global infrastructure, systems, and security team at Automattic and the first ever guest blogger here on barry.wordpress.com.

Hey, my name is Iliya and as a Systems Wrangler at Automattic, I am one of the people handling the server-side issues across the 2000 servers running WordPress.com and other Automattic services.

Last week, within two hours of each other, two of our MogileFS storage servers locked up with the same kernel stack trace.

The next day, a few more servers crashed with similar traces.

We started searching for a common pattern. All of the hosts were running Debian kernels ranging from 2.6.32-21 to 2.6.32-24; some of them were in different data centers and served different purposes in our network.

One thing we noticed was that all of the servers crashed after having an uptime of a little more than 200 days. After some research and investigation, we found that the culprit appears to be quite an interesting kernel bug.

As part of the scheduler load-balancing algorithm, the kernel searches for the busiest group within a given scheduling domain. In order to do that, it has to take into account the average load across all groups, which is calculated in the function find_busiest_group() with:

sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

sds.total_load is the sum of the load on all CPUs in the scheduling domain, based on the run queue tasks and their priority.

SCHED_LOAD_SCALE is a constant used to increase resolution.

sds.total_pwr is the sum of the power of all CPUs in the scheduling domain. This sum ends up being zero, and that’s what causes the crash: division by zero.

The “CPU power” is used to take into account how much computing capability a CPU has compared to the other CPUs. The main factors for calculating it are:

1. Whether the CPU is shared, for example by using multithreading.
2. How many real-time tasks the CPU is processing.
3. In newer kernels, how much time the CPU has spent processing IRQs.

The currently suggested fix for this bug relies on the theory that, while taking into account the real-time tasks (#2 above), scale_rt_power() could return a negative value, and thus the sum of all CPU powers may end up being zero.

This was merged into the 2.6.32.29 vanilla kernel, together with the IRQ accounting in cpu_power (#3 above). It has also been merged into the Debian 2.6.32-31 kernel.

Alternatively, scheduler load balancing can be turned off, which effectively skips the related code. This can be done using control groups; however, it should be used with caution, as it may cause performance issues:

mount -t cgroup -o cpuset cpuset /cgroups
echo 0 > /cgroups/cpuset.sched_load_balance

As it is not yet absolutely clear whether the suggested fix really solves the problem, we will try to post updates on any new developments as we observe them.

WordPress.com DDoS Details

7 Mar

As you may have heard, on March 3rd and into the 4th, 2011, WordPress.com was targeted by a rather large Distributed Denial of Service Attack. I am part of the systems and infrastructure team at Automattic and it is our team’s responsibility to a) mitigate the attack, b) communicate status updates and details of the attack, and c) figure out how to better protect ourselves in the future.  We are still working on the third part, but I wanted to share some details here.

One of our hosting partners, Peer1, provided us these InMon graphs to help illustrate the timeline. What we saw was not one single attack, but 6 separate attacks beginning at 2:10AM PST on March 3rd. All of these attacks were directed at a single site hosted on WordPress.com’s servers. The first graph shows the size of the attack in bits per second (bandwidth), and the second graph shows packets per second. The different colors represent source IP ranges.

The first 5 attacks caused minimal disruption to our infrastructure because they were smaller in size and shorter in duration. The largest attack began at 9:20AM PST and was mostly blocked by 10:20AM PST. The attacks were TCP floods directed at port 80 of our load balancers. These types of attacks try to fill the network links and overwhelm network routers, switches, and servers with “junk” packets, which prevents legitimate requests from getting through.

The last TCP flood (the largest one on the graph) saturated the links of some of our providers and overwhelmed the core network routers in one of our data centers. In order to block the attack effectively, we had to work directly with our hosting partners and their Tier 1 bandwidth providers to filter the attacks upstream. This process took an hour or two.

Once the last attack was mitigated at around 10:20AM PST, we saw a lull in activity. On March 4th around 3AM PST, the attackers switched tactics. Rather than a TCP flood, they switched to an HTTP resource consumption attack. Enlisting a botnet consisting of thousands of compromised PCs, they made many thousands of simultaneous HTTP requests in an attempt to overwhelm our servers. The source IPs were completely different from those of the previous attacks, but mostly still from China. Fortunately for us, the WordPress.com grid harnesses over 3,600 CPU cores in our web tier alone, so we were able to quickly mitigate this attack and identify the target.

We see denial of service attacks every day on WordPress.com and 99.9% of them have no user impact. This type of attack made it difficult to initially determine the target since the incoming DDoS traffic did not have any identifying information contained in the packets.  WordPress.com hosts over 18 million sites, so finding the needle in the haystack is a challenge. This attack was large, in the 4-6Gbit range, but not the largest we have seen.  For example, in 2008, we experienced a DDoS in the 8Gbit/sec range.

While it is true that some attacks are politically motivated, contrary to our initial suspicions, we have no reason to believe this one was.  We are big proponents of free speech and aim to provide a platform that supports that freedom. We even have dedicated infrastructure for sites under active attack.  Some of these attacks last for months, but this allows us to keep these sites online and not put our other users at risk.

We also don’t put all of our eggs in one basket.  WordPress.com alone has 24 load balancers in 3 different data centers that serve production traffic. These load balancers are deployed across different network segments and different IP ranges.  As a result, some sites were only affected for a couple minutes (when our provider’s core network infrastructure failed) throughout the duration of these attacks.  We are working on ways to improve this segmentation even more.

If you have any questions, feel free to leave them in the comments and I will try to answer them.

Dell MD3000 Multipath on Debian

16 Dec

We are in the process of deploying some new infrastructure to store the 150+GB of new content (media only, not including text) uploaded to WordPress.com daily.

WordPress.com data in GB

After some searching and testing, we have decided to use the open-source software MogileFS, developed in part by our friends at Six Apart. Our initial deployment is going to be 180TB of storage in a single data center, and we plan to expand this to include multiple data centers in early 2010. The options for getting that amount of storage affordably are limited. We thought about building Backblaze devices, but decided that the ongoing management of these in our hosting environment would be prohibitively complicated. We eventually settled on Dell’s MD PowerVault series. Our configuration consists of:

  • 4 x Dell R710 ( 32GB RAM/2 x Intel E5540/2 x 146GB SAS RAID 1)
  • 4 x Dell MD3000 (15 x 1TB 7200 RPM HDD each)
  • 8 x Dell MD1000 (15 x 1TB 7200 RPM HDD each)

Each Dell R710 is connected to an MD3000, and 2 MD1000s are then connected to each MD3000. The end result is 4 self-contained units, each providing 45TB of storage, for a total of 180TB.

Illustration by Joe Rodriguez

Our proof of concept was deployed on a single Dell 2950 connected to an MD1000, and things worked relatively flawlessly. We could use all of our existing tools to monitor, manage, and configure the devices when needed. Little did I know the MD3000s would be so much of a pain :) Since we are using MogileFS, which handles the distribution of files across various hosts and devices, we wanted these devices set up in what I thought was a relatively simple JBOD configuration. Each drive would be exported as a device to the OS, then we would mount 45 devices per machine and MogileFS would take care of the rest. It didn’t exactly work that way.

When the hardware was initially deployed to us, it was configured in a high-availability (HA) setup, with each controller on the MD3000 connected to a controller on the R710. This way, if a controller fails, the storage is in theory still accessible. The problem with this type of setup is that in order to make it work flawlessly, you need to use the Dell multi-path proxy and mpt drivers, not the ones provided by the Linux kernel, and Dell’s provided stuff doesn’t work on Debian. Initially, without multipath configured, some confusing stuff happens: we had 90 devices detected by the OS (/dev/sdb through /dev/sdcn), but every other device was unreachable. After some trial and error with various multipath configurations, and some help, I ended up with this:

apt-get install multipath-tools

Our multipath.conf:

defaults {  
        getuid_callout "/lib/udev/scsi_id -g -u -s /block/%n"  
        user_friendly_names on
}  
devices {  
        device {  
                vendor DELL*  
                product MD3000*  
                path_grouping_policy failover  
                getuid_callout "/lib/udev/scsi_id -g -u --device=/dev/%n"
                features "1 queue_if_no_path"  
                path_checker rdac  
                prio_callout "/sbin/mpath_prio_rdac /dev/%n"  
                hardware_handler "1 rdac"  
                failback immediate  
        }  
}  
blacklist {  
       device {  
               vendor DELL.*  
               product Universal.*  
       }  
       device {  
               vendor DELL.*  
               product Virtual.*  
       }  
}

multipath -F
multipath -v2
/etc/init.d/multipath-tools start

This gave me a bunch of device names in /dev/mapper/* which I could access, partition, format, and mount. A few things to note:

  • user_friendly_names doesn’t seem to work. The devices were all still labeled by their WWID even with that option enabled.
  • The status of the paths as shown by multipath -ll seemed to change over time (from active to ghost). Not sure why.
  • Even with all of this set up and working, I was still seeing the occasional I/O error and path failure reported in the logs.

After a few hours of “fun” with this, I decided that it wasn’t worth the hassle or complexity, and since we have redundant storage devices anyway, we would just configure the devices in “single path” mode, mount them directly, and forgo multipath. Not so fast… according to Dell engineers, “single path mode” is not supported. Easy enough, let’s unplug one of the controllers, creating our own “single path mode”, and everything should work, right? Sort of.

If you just go and unplug the controller while everything is running, nothing works. The OS needs to re-scan the devices in order to address them properly. The easiest way for this to happen is to reboot (sure this isn’t Windows?). After a reboot, the OS properly saw 45 devices (/dev/sdb – /dev/sdau), which is what I would have expected. The only problem was that every other device was inaccessible! It turns out that the MD3000 tries to balance the devices across the two controllers, and half of the drives had been assigned a preferred path of controller 1, which was unplugged. After some additional MD3000 configuration, we were able to force all of the devices to prefer controller 0, and everything was accessible once again.

The only other thing worth noting here is that the MD3000 exports an additional device that you may not recognize:

scsi 1:0:0:31: Direct-Access DELL Universal Xport 0735 PQ: 0 ANSI: 5

For us this was LUN 31, and the number doesn’t seem to be user-configurable, but I suppose other hardware may assign a different LUN. This is a management device for the MD3000 and not a device that you can or should partition, format, or mount. We just made sure to skip it in our setup scripts.

I suppose if we were running Red Hat Enterprise Linux, CentOS, SUSE, or Windows, this would have all worked a bit more smoothly, but I don’t want to run any of those. We have over 1000 Debian servers deployed and I have no plans on switching just because of Dell. I really wish Dell would make their stuff less distro-specific — it would make things easier for everyone.

Is anyone else successfully running this type of hardware configuration on Debian using multipath? Have you tested a failure? Do you have random I/O errors in your logs? Would love to hear stories and tips.

I have some more posts to write about our adventures in Dell MD land. The next one will be about getting Dell’s SMcli working on Debian, and then after that a post with some details of our MogileFS implementation.

* Thanks to the fine folks at Layered Tech for helping us tweak the MD3000 configuration throughout this process.

WordCamp Presentations

5 Dec

I have uploaded my slides from WordCamp NYC and WordCamp Orlando to Slideshare.  Check ‘em out!
