Dell MD3000 Multipath on Debian

We are in the process of deploying some new infrastructure to store the 150+GB of new content (media only, not including text) uploaded to WordPress.com daily.

[Chart: WordPress.com data in GB]

After some searching and testing, we have decided to use the open source software MogileFS developed in part by our friends at Six Apart. Our initial deployment is going to be 180TB of storage in a single data center and we plan to expand this to include multiple data centers in early 2010. In order to get that amount of storage affordably, the options are limited. We thought about building Backblaze devices, but decided that the ongoing management of these in our hosting environment would be prohibitively complicated. We eventually settled on Dell’s MD PowerVault series. Our configuration consists of:

  • 4 x Dell R710 (32GB RAM / 2 x Intel E5540 / 2 x 146GB SAS in RAID 1)
  • 4 x Dell MD3000 (15 x 1TB 7200 RPM HDD each)
  • 8 x Dell MD1000 (15 x 1TB 7200 RPM HDD each)

Each Dell R710 is connected to a MD3000 and then 2 MD1000s are connected to each MD3000. The end result is 4 self-contained units, each providing 45TB of storage for a total of 180TB.

[Illustration by Joe Rodriguez]

Our proof of concept was deployed on a single Dell 2950 connected to a MD1000 and things worked essentially flawlessly. We could use all of our existing tools to monitor, manage, and configure the devices when needed. Little did I know the MD3000s would be such a pain 🙂 Since we are using MogileFS, which handles the distribution of files across various hosts and devices, we wanted these devices set up in what I thought was a relatively simple JBOD configuration. Each drive would be exported as a device to the OS, then we would mount 45 devices per machine and MogileFS would take care of the rest. Didn't exactly work that way.
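
The end state I had in mind was nothing fancier than this sketch. The device names are illustrative, and the /var/mogdata/devN layout is just MogileFS's usual mogstored convention, not something dictated by the hardware:

mkfs.ext3 /dev/sdb                  # format a drive the array exports straight to the OS
mkdir -p /var/mogdata/dev1
mount /dev/sdb /var/mogdata/dev1    # becomes MogileFS device "dev1"
# ...and likewise for the other 44 drives on each host (dev2 through dev45)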

When the hardware was initially deployed to us, the arrays were configured in a high availability (HA) setup, with each controller on the MD3000 connected to a controller on the R710. This way, if a controller fails, the storage is (in theory) still accessible. The problem with this type of setup is that in order to make it work flawlessly, you need to use Dell's multi-path proxy (MPP) drivers, not the ones provided by the Linux kernel, and Dell's drivers don't work on Debian. Initially, without multipath configured, some confusing things happen: we had 90 devices detected by the OS (/dev/sdb through /dev/sdcn), but every other device was unreachable. After some trial and error with various multipath configurations, and some help, I ended up with this:

apt-get install multipath-tools

Our multipath.conf:

defaults {
        getuid_callout "/lib/udev/scsi_id -g -u -s /block/%n"
        user_friendly_names on
}

devices {
        device {
                vendor "DELL"
                product "MD3000"
                path_grouping_policy failover
                getuid_callout "/lib/udev/scsi_id -g -u --device=/dev/%n"
                features "1 queue_if_no_path"
                path_checker rdac
                prio_callout "/sbin/mpath_prio_rdac /dev/%n"
                hardware_handler "1 rdac"
                failback immediate
        }
}

blacklist {
        device {
                vendor "DELL.*"
                product "Universal.*"
        }
        device {
                vendor "DELL.*"
                product "Virtual.*"
        }
}

multipath -F                          # flush any existing multipath maps
multipath -v2                         # re-detect paths and build the maps (verbose)
/etc/init.d/multipath-tools start     # start the multipath daemon

This gave me a bunch of device names in /dev/mapper/* which I could access, partition, format, and mount (a short sketch follows the notes below). A few things to note:

  • user_friendly_names doesn't seem to work; the devices were all still labeled by their WWID even with that option enabled.
  • The status of the paths as shown by multipath -ll seemed to change over time (from active to ghost). Not sure why.
  • Even with all of this set up and working, I was still seeing the occasional I/O error and path failure reported in the logs.
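
For completeness, putting one of the mapper devices to use looked roughly like this. A sketch only: the mpath0 name is hypothetical (ours were raw WWIDs, per the first note above), and the partition name kpartx creates varies by version:

parted -s /dev/mapper/mpath0 mklabel msdos mkpart primary ext3 0% 100%
kpartx -a /dev/mapper/mpath0      # maps the new partition, e.g. /dev/mapper/mpath0p1
mkfs.ext3 /dev/mapper/mpath0p1
mkdir -p /mnt/test && mount /dev/mapper/mpath0p1 /mnt/test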

After a few hours of “fun” with this, I decided that it wasn't worth the hassle or complexity; since we have redundant storage devices anyway, we would just configure the devices in “single path” mode, mount them directly, and forgo multipath. Not so fast… according to Dell engineers, “single path mode” is not supported. Easy enough, let's unplug one of the controllers, creating our own “single path mode”, and everything should work, right? Sort of.

If you just go and unplug the controller while everything is running, nothing works. The OS needs to re-scan the devices in order to address them properly. The easiest way to make that happen is to reboot (sure this isn't Windows?). After a reboot, the OS properly saw 45 devices (/dev/sdb – /dev/sdau), which is what I would have expected. The only problem was that every other device was inaccessible! It turns out that the MD3000 tries to balance the devices across the 2 controllers, and half of the drives had been assigned a preferred path of controller 1, which was unplugged. After some additional MD3000 configuration, we were able to force all of the devices to prefer controller 0 and everything was accessible once again.
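
For what it's worth, you can also make the kernel forget and re-discover devices without a reboot via sysfs. A sketch; the host and device numbers will vary per system:

echo 1 > /sys/block/sdb/device/delete             # tell the kernel to drop a stale device
echo "- - -" > /sys/class/scsi_host/host1/scan    # rescan all channels/targets/LUNs on the HBA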

The only other thing worth noting here is that the MD3000 exports an additional device that you may not recognize:

scsi 1:0:0:31: Direct-Access DELL Universal Xport 0735 PQ: 0 ANSI: 5

For us this was LUN 31, and the number doesn't seem user-configurable, but I suppose other hardware may assign a different LUN. This is a management device for the MD3000, not a device that you can or should partition, format, or mount. We just made sure to skip it in our setup scripts.
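
One way to do that automatically is to match on the model string the kernel exposes in sysfs, rather than hard-coding LUN 31. A sketch:

for dev in /sys/block/sd*; do
        # the management LUN identifies itself as "Universal Xport"
        if grep -q "Universal Xport" "$dev/device/model"; then
                echo "skipping /dev/${dev##*/} (MD3000 management device)"
                continue
        fi
        # ...partition, format, and mount the real drives here
done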

I suppose if we were running Red Hat Enterprise Linux, CentOS, SUSE, or Windows, this would have all worked a bit more smoothly, but I don’t want to run any of those. We have over 1000 Debian servers deployed and I have no plans on switching just because of Dell. I really wish Dell would make their stuff less distro-specific — it would make things easier for everyone.

Is anyone else successfully running this type of hardware configuration on Debian using multipath? Have you tested a failure? Do you have random I/O errors in your logs? Would love to hear stories and tips.

I have some more posts to write about our adventures in Dell MD land. The next one will be about getting Dell’s SMcli working on Debian, and then after that a post with some details of our MogileFS implementation.

* Thanks to the fine folks at Layered Tech for helping us tweak the MD3000 configuration throughout this process.

Responses to “Dell MD3000 Multipath on Debian”

  1. Ah yes, welcome to the wonderful world of Multipath.

    Haven’t worked with the MD3000s, but have worked with lots of other arrays, and pretty much every one has its quirks. 😉 One big thing you need to find out is whether the controllers support true active/active or active/passive – the config can vary significantly based on that.

    Friendly names – your config turns them on, but you need to define them! What it exports by default is the WWID; with friendly names, you can rename that to whatever else you want.

    Also, just to make sure, you are not ever touching /dev/sdX, right? You should -only- touch the DM devices that multipath-tools creates for you.

    Just curious – since you don’t actually take advantage of the RAID controllers at all, why not just get a bunch of raw shelves (MD1000’s?) and plug them in directly?

    Or go to Supermicro and buy a server with 24 drive bays, and then another chassis to support another 24 drives? 😉 (48 drives in 8U!) With this config, you could also do 1.5TB or 2TB drives, and get more space.

    Or go to Sun, and buy a X4540 with 48 internal drives, and then run Debian on it! (48 drives in 4U! But not cheap!)

    😉

    1. Hi Nate!

      Thanks for commenting!

      Supposedly the MD3000 supports active/active (at least according to Dell’s docs).

      Friendly names – I would be happy with the defaults of mpath0, mpath1, etc. From my understanding this should not require additional configuration. The bindings file has this comment in it:

      # Multipath bindings, Version : 1.0
      # NOTE: this file is automatically maintained by the multipath program.
      # You should not need to edit this file in normal circumstances.
      

      When multipath was enabled I wasn’t touching the /dev/sd* devices, but now that I ditched it I am accessing the devices directly. If I could get multipath working reliably, then I would be open to going back to it 🙂

      Sure, there are other hardware options, for us this was the one we chose for various reasons (cost, maintainability, provider availability, serviceability, etc). We lease all the hardware, so exact configs aren’t always 100% our choice.

      1. Got’cha on hardware selection. 😉 One trick with some arrays is to plug in on the cascade side instead of the controller side, which will bypass the controllers altogether and give you direct access to all the disks. Not sure if that’s an option with the MD3000.

        Aliases – interesting; from my recollection it didn’t actually set up friendly names by default unless you defined them.. but it’s been a long time since I’ve left them undefined. 😉 If you set up a ‘multipaths’ section (for each wwid) in the config file, you can specify the wwid/alias combinations.. however, in your case, that would be a royal pain in the rear! At some point I’ll have to play around with this again.

        If you have a spare MD3000 to play with, it’d be interesting to leave it in the HA config, and see how multipath works with that. It may just be acting out since you’re exporting all the disks individually instead of as arrays managed by the controller.. I wouldn’t be surprised if multipath would work properly out of the box without I/O errors if you had a plain’ol RAID6 array on there or something.

        It’d also be interesting to see a performance comparison between exporting all the raw disks to use for MogileFS and using the RAID functionality on the controller so you can take advantage of the cache, etc.

        In any case, thanks for the post.. it’s always fun to play with new stuff. 😉 I’m looking forward to actually getting to deploy MogileFS in production at some point.

        1. One cool thing about these Dells is that you can access the raw devices for reads, but still send writes through the cache (at least according to Don @ Smugmug). Still need to test it.

  2. Thanks for this most informative post, Barry. I have recently been doing some head scratching with Dell arrays and this has at least answered some of the problems that I have been struggling with.
    Thanks again and have a great Christmas.

  3. Barry,

    I understand the desire for consistency, but a Debian shop shouldn’t have much trouble supporting a few Red Hat servers. The question then would be whether Red Hat is worth the pain. (I once switched our production network from FreeBSD to Red Hat due to multipath.)

    I am curious if this setup will have enough bandwidth to the 7200RPM/1TB backend devices once you get to scale. Hopefully MogileFS will duplicate hot content to multiple nodes. I interviewed at a few shops selling access to streaming video content, and the common story was that network bandwidth wasn’t the problem: balancing streaming video requests through disk access bottlenecks was the issue.

    Anyway, I look forward to reading more about this. I hope that supporting and managing this backend doesn’t suck up too much staff time. Good luck!

    Sincerely,
    -daniel

    1. Hi Daniel,

      Thanks for stopping by! Supporting multiple distros is a pain, especially when everything is automated like our setup is. We rely heavily in our deployment scripts on things like apt and Debian-specific package names, etc. Supporting Red Hat wouldn’t be impossible, but it’s not desirable. Dell should officially support “Linux”, not just “Red Hat and SUSE”. As far as the bandwidth goes, I think we will be ok. These boxes will serve as warm and cold storage for us; all of our hot content is cached in Varnish. I think our main bottleneck is going to be seeks, if anything.

  4. Just wanted to point out that your ‘code’ block encodings are bad (for some reason, characters like & and ' are being rendered as HTML entities such as &amp; and &quot; when they should be plain text instead).

    1. Wow, that was an annoying bug to track down (it was an issue with my SyntaxHighlighter plugin). Fixed though! 🙂

      1. Thanks for fixing Alex!

  5. I assume if an individual disk dies you just clone the data from another replica?

    And what about your uplink? Is it just 1GbE port or are you using trunking?

    80MB/s over Ethernet doesn’t seem super awesome when you have 48 disks behind it…..

    1. Mogile handles the re-replication of data when a drive fails. The uplink is currently 1Gbit per R710, and in our tests we aren’t even going to use a full 1Gbit across them all. We could trunk easily though – each R710 has 4 onboard GigE NICs.
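
      If we ever do trunk them, it would be the standard Debian bonding setup. A sketch, assuming the ifenslave package (addresses are illustrative and exact option names vary a bit between versions):

      # /etc/network/interfaces
      auto bond0
      iface bond0 inet static
              address 10.0.0.10
              netmask 255.255.255.0
              bond-slaves eth0 eth1 eth2 eth3
              bond-mode 802.3ad
              bond-miimon 100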

      1. yeah… I guess if they’re blogs and immutable then re-replicating them is easy…

        Trunking is a decent answer which is probably what I would do…

        1. Yeah, we are only talking about images/videos/other media here. No text or anything else that we would store in a database.

  6. Hi Barry,

    first of all thanks a lot for this post – it gave me a lot of fresh ideas.

    Try:

    user_friendly_names yes

    (instead of “on”); this works for me on Debian 5.0 and gives me the desired names (mpath0, mpath1).

    And yes, I get tons of I/O errors, too. IMHO these are caused by the round-robin policy of the multipath setup. In my setup, the errors only appear on the sd* devices that are part of the multipath device.

    Therefore it would be a good workaround to suppress the logging for these devices…

    I’m still struggling with my MD3000 – 2950 – 1950 setup. I’ll let you know if I find out something new 🙂

    cheers,

    David

    1. Hey David –

      user_friendly_names yes

      worked! Thanks for that. Going to give running multipath in production another shot.

  7. Just wanted to let you know that it’s not showing up properly on the BlackBerry Browser (I have a Pearl). Anyway, I’m now on the RSS feed on my laptop, so thanks!

  8. Great write up!

    Although I did not use your article to set up multipath on Ubuntu with a PowerVault 3000i, I found it informative.

    I have also run into problems where I’m seeing the management LUN 31. You say that you managed to avoid these in your setup? Please share. I would really like to get rid of all those ugly I/O errors in my log files.

    Thanks again.

    Terry

    1. Nice write up.

      Wish Dell was more like HP. The HP gear (MSA line, 585s, etc.) and their controllers all have awesome Debian support.

      A home-made Backblaze pod scares me a bit; I don’t wanna have to deal with failures and troubleshooting due to rotational vibration/oscillation/loose rubber bands/whatever. Just my .02 euros. If SiliconMechanics or someone along those lines releases a cheap one… that’d be a different story.

  9. LUN 31: Use your storage management software and remove the “Access” mapping (Modify Tab -> Edit Host-to-Virtual Disk Mapping). If you see “Access” mapped to LUN 31, you can most likely remove it.

    The failover solution that works on Red Hat and SUSE will behave a lot better than any alternative out there (with the dual-controller configuration). I would highly recommend giving it a try if you have the time.

    I’m also looking forward to your SMcli comments.

    1. To add, you do not want to remove that LUN 31 if you are using the SMagent on any of your host systems connected to the MD3000. If you don’t use SMagent, the Access LUN is not needed.

  10. Unfortunately your guess that things are better in RHEL is grass-is-greener thinking. Multipath(d) is a remarkably convoluted crapshoot over here too. My favorite is the diagram on the upstream’s website.

  11. Thanks for the doc. Still busy trying to get it to work; we get a lot of errors. One thing I can help you with: for friendly names it is not “on” but “yes”; that works for me.

  12. Hey Barry,

    It sounded like you ended up bypassing some of the MD3000 “benefits” during this process and it is more similar to a setup we use for backups:

    1 * R710 or 2950 loaded up appropriately
    2 * 512MB Perc6/E (4 channels)
    4 * MD1000

    Each MD1000 directly connected to its own channel on the Perc6s.

    60TB minus RAID requirements

    We did have some trouble getting all the MD1000s running just the way we wanted as well – 2 were really simple, and I forget what we had to do for the next 2. If this is interesting to you, let me know and I will dig up my notes.

    Hope that might help you/save you some headache the next time around on this.

    Todd

  13. I had issues with the open-iscsi variant in Ubuntu 8.04 and Debian – but if you use the latest from http://www.open-iscsi.org it works well. I had success with 2.0.869, 2.0.870 and 2.0.871. With the Ubuntu/Debian default (2.0.865) I had lots of weird issues (random hangs on boot/shutdown).

    For the management/access drive, you can blacklist it in multipath.conf:

    blacklist {
            device {
                    vendor "DELL"
                    product "PERC|Universal"
            }
    }

    Multipath on Debian/Ubuntu of all versions has been a breeze – it’s a hassle only if you want to boot from iSCSI. Boot from iSCSI is supported in Ubuntu 9.10(+) at least, although I have managed to mangle it into 8.04. (My shutdown is currently something like: remount read-only, schedule a reboot of the system’s power in 60 seconds (yay remote power), then iscsiadm -m session -u, i.e. log out the iSCSI sessions, which causes the system to hang until the power reboots it.)

    If you just drop the iscsi sessions by halting, it sometimes triggers a path failure on the MD3000i controller, which causes all the LUN paths to flap (read: everything that uses the MD3000i suffers from pauses for a few minutes).

    I use LVM on top of the multipath systems to keep the drive mappings consistent as well.
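
    Roughly like this per multipath device (volume names made up):

    pvcreate /dev/mapper/mpath0
    vgcreate vg_store /dev/mapper/mpath0
    lvcreate -l 100%FREE -n data vg_store
    mkfs.ext3 /dev/vg_store/data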

    //chris

  14. I was searching for a term paper on Yandex and came across this page. I gathered a bit of information for my paper topic. I would have liked more, but thanks all the same!

  15. awesome, this article is very helpful for me. thx

  16. Hey, I found your blog on google and read a few of your other posts. I like what you have to say. I just added you to my Google News Reader. Keep up the good work. Look forward to reading more from you in the future.

  17. […] but no. After exchanging some tweets with @frederik_vl, I quickly looked up some figures. According to Barry on WordPress, at the end of last year, over 150 GB of content were uploaded daily to WordPress. Now, looking at […]

  18. I have had my own headaches with failing paths. If you ever see a “not on preferred path” message then you might have had the same problem that I had. The solution is very simple: put a dollar sign on the end of the blacklist entry, like so: blacklist “^sda$” (if you are blacklisting /dev/sda from multipath). The reason is that ^sda (without the trailing $) also matches sdaa, sdab, sdac, etc., and those devices cannot participate in multipath if they are blacklisted.
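
    In multipath.conf terms that looks like:

    blacklist {
            devnode "^sda$"
    }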

  19. I’m trying to configure an Ubuntu 10.04 server to use an MD3000. In dmesg, I’m getting errors like this:

    [ 27.754987] scsi1 : ioc0: LSISAS1068 B0, FwRev=000a3300h, Ports=1, MaxQ=366, IRQ=38
    [ 27.799440] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 0, phy 4, sas_addr 0x5002219482364704
    [ 27.803760] scsi 1:0:0:0: Direct-Access DELL MD3000 0735 PQ: 0 ANSI: 5
    [ 27.827210] sd 1:0:0:0: Attached scsi generic sg2 type 0
    [ 27.827213] sd 1:0:0:0: Embedded Enclosure Device
    [ 27.829438] sd 1:0:0:0: [sdb] 1754664960 512-byte logical blocks: (898 GB/836 GiB)
    [ 27.831059] sd 1:0:0:0: [sdb] Write Protect is off
    [ 27.831064] sd 1:0:0:0: [sdb] Mode Sense: 77 00 10 08
    [ 27.833334] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
    [ 27.836356] sdb:
    [ 28.271083] scsi 1:0:0:31: Direct-Access DELL Universal Xport 0735 PQ: 0 ANSI: 5
    [ 28.294607] scsi 1:0:0:31: Attached scsi generic sg3 type 0
    [ 28.294610] scsi 1:0:0:31: Embedded Enclosure Device
    [ 28.360107] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [ 28.360115] sd 1:0:0:0: [sdb] Sense Key : Illegal Request [current]
    [ 28.360122] sd 1:0:0:0: [sdb] <<vendor>> ASC=0x94 ASCQ=0x1
    [ 28.360135] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
    [ 28.360148] end_request: I/O error, dev sdb, sector 0

    I end up with a defunct /dev/sdb device, and it takes a little while for /dev/sdc to show up. Do you know of a way to fix that? On a RHEL 5 box connected to the same DAS, there is no defunct device; the array shows up correctly as /dev/sdb. We are also not using any sort of multipathing, as far as I can tell.

    1. Hi Cam,

      Did you follow all of the instructions in the post? It sounds like multipath is not working. The RHEL drivers support multipath, whereas getting it working on Ubuntu/Debian requires some hacking, thus the reason I posted this.

      1. Hi Barry,
        We’re not trying to use multipath as far as I know. We just want to talk to a single array that was created on the MD3000. I made progress on this issue though, by loading the scsi_dh_rdac module. Now I don’t get the IO errors, and performance seems normal, but my array still shows up as sdc instead of sdb.
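
        For reference, that was just the following (adding the module to /etc/modules makes it load at boot):

        modprobe scsi_dh_rdac
        echo scsi_dh_rdac >> /etc/modules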

          1. Ok – the problem is that even though you don’t “want” to use multipath, the MD3000 is still exporting multiple paths (seen as devices to the OS). You can disable multipath on the MD3000 side, I think. We stopped using MD3000s long ago because they were “too smart” for us. We switched to MD1000s. IIRC, you can also unplug the cable providing the redundant path, but I seem to remember we needed a firmware hack to get the MD3000 to operate normally in that state.

            1. Ah. Actually, I see that I forgot to mention: I read somewhere that the defunct device is actually some sort of management interface for the array; it’s not even supposed to be a path. The last thing I’m going to try is upgrading to the natty-backports kernel.

          2. I found some servers still using MD3000s 🙂

            In dmesg on our servers with no multipath we see something like this:

            [   34.021111] scsi 3:0:0:31: Direct-Access     DELL     Universal Xport  0735 PQ: 0 ANSI: 5
            [   34.044820] scsi 3:0:0:31: Embedded Enclosure Device
            [   34.066712] scsi 3:0:0:31: FIXME driver has no support for subenclosures (2)
            [   34.066784] scsi 3:0:0:31: Failed to bind enclosure -12
            [   34.066903] scsi 3:0:0:31: Attached scsi generic sg34 type 0
            

            Compared to an actual drive which looks like this:

            [   33.972831] scsi 3:0:0:30: Direct-Access     DELL     MD3000           0735 PQ: 0 ANSI: 5
            [   33.996173] sd 3:0:0:30: Embedded Enclosure Device
            [   33.996657] sd 3:0:0:30: [sdaf] 1952473088 512-byte logical blocks: (999 GB/931 GiB)
            [   34.001681] sd 3:0:0:30: [sdaf] Write Protect is off
            [   34.001751] sd 3:0:0:30: [sdaf] Mode Sense: 77 00 10 08
            [   34.007364] sd 3:0:0:30: [sdaf] Write cache: enabled, read cache: enabled, supports DPO and FUA
            [   34.011219]  sdaf:
            [   34.043662]  sdaf1
            [   34.050290] sd 3:0:0:30: [sdaf] Attached SCSI disk
            

            The MD3000 controller doesn’t show up as a device to the OS that I can see. We are running Debian Lenny, with the lenny-backports kernel (2.6.32) on this machine. On our machines running multipath with the MD3000s we blacklist the MD3000 controller so it also doesn’t show up.
