A low point for HighPoint RocketRAID

Alternate title: How kernel module sata_mv held my arms behind my back while module hptmv punched me in the balls.

Upgrading a database server from SuSE 9.1 to openSUSE 10.2, I wasn’t really worried. We had nearly a terabyte of MySQL 4 data stored in RAID5 on a HighPoint RocketRAID 1820A, and although I had backed up just about everything, I didn’t expect to actually need those backups. Usually what happens when performing an upgrade like this on Linux is the following:

  1. Upgrade OS (reinstall in this case)
  2. Notice that the array is not showing up as a device (e.g. /dev/sdb1)
  3. Download/compile/install driver
  4. modprobe the new module to make sure the array shows up
  5. Use yast2 to add the module to INITRD_MODULES
  6. Done!
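
Steps 2 and 4 boil down to asking whether the kernel can see the array yet. Here is a minimal sketch of that check, assuming the array would appear as sdb1 (the example name from the list above) and reading the standard /proc/partitions layout. The helper name has_partition is made up for illustration.

```shell
# has_partition NAME
# Succeeds if NAME appears in the 4th column ("name") of
# /proc/partitions-style input supplied on stdin.
has_partition() {
    awk -v dev="$1" '$4 == dev { found = 1 } END { exit !found }'
}
```

Usage would look something like `has_partition sdb1 < /proc/partitions || echo "array not visible yet"`, run once before loading the driver and once after.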

I’ve noticed that with some 3ware controllers you can skip most of those steps and go directly to “Done!” because the drivers are apparently already included.

Anyway, looking over HighPoint’s drivers, I saw tarballs for SuSE versions as new as 10.1, but not for 10.2. I decided to skip the precompiled modules and get right to the action: the open source drivers.

Before compiling, for some reason I decided to update the RocketRAID BIOS from v1.13 to v1.18. Since the source code had the same version (1.18) I assumed they sort of “went together”. Going over the readmes and changelogs, I probably could’ve skipped the BIOS update. As it was, the BIOS came up fine and everything looked ready for compilation.

After installing the kernel sources, the make went smoothly except for a cryptic WARNING about some dot file. I had an hptmv module hot off the presses and was ready to rock.

I launched off a modprobe hptmv… and all hell broke loose.

The modprobe command looked frozen. I started to monitor /var/log/messages and was sort of comforted by what appeared to be an iterative initialization of the devices in the array. Comfort turned to confusion and worry as I/O errors and other evil signs started scrolling. I tried to unload the module and, failing that, to halt the machine.

When I walked into the server room, the RAID card was literally alarming. I thought I had a bad UPS for a second, but it turned out that the fucking card was yelling at me!

I had to power down the server manually. When the thing came back online, the RAID BIOS was screaming something to the effect of: your array is hosed.

And it was at that point that I fell to my knees to thank the Bit Gods for the preciousss, preciousss backup data sitting on the file server.

I soon guessed that the problem was related to something that had been nagging me all along: the fact that each drive in the array had been showing up as an individual SCSI device in lsscsi. Usually the array shows up as a single device and there is no mention in Linux of its individual components. I chalked it up to some weirdness that I would resolve once I got the module up and running. That, uh, obviously wasn’t the best move.

I had to PXE boot into rescue mode to adjust the module configuration and quickly pinned sata_mv as the culprit. I think this module exists to support Marvell products. I hadn’t heard of them before, but I’ll always remember that company for the module conflict that nuked my data.
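
In hindsight, the conflict is cheap to detect before touching modprobe. The following is only a sketch: it assumes the usual lsmod output (a header line, then the module name in the first column), and the function name conflict_check is hypothetical.

```shell
# conflict_check
# Reads lsmod-style output on stdin and reports which of the two
# drivers are loaded. Exit status is 1 if sata_mv is loaded, else 0.
conflict_check() {
    awk '
        NR > 1 && $1 == "sata_mv" { mv = 1 }
        NR > 1 && $1 == "hptmv"   { hpt = 1 }
        END {
            if (mv && hpt) print "both sata_mv and hptmv are loaded"
            else if (mv)   print "sata_mv is loaded: do not load hptmv yet"
            else if (hpt)  print "only hptmv is loaded"
            else           print "neither module is loaded"
            exit (mv ? 1 : 0)
        }'
}
```

Something like `lsmod | conflict_check && modprobe hptmv` would then refuse to load the HighPoint driver while sata_mv still has hold of the disks.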

One thing I made sure to do was add the following to /etc/modprobe.conf.local:

blacklist sata_mv

I left out the lengthy (and bitchy) comment I also added to the file.
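
Since a reinstall can silently replace that file, it seems worth confirming that the blacklist line is still there before the next modprobe. Below is a rough sketch with a made-up helper name, is_blacklisted. It only matches a bare `blacklist NAME` line, so a line with a trailing comment would be missed.

```shell
# is_blacklisted MODULE FILE
# Succeeds if FILE contains a line of the form "blacklist MODULE"
# (leading and trailing whitespace allowed, nothing else on the line).
is_blacklisted() {
    grep -Eq "^[[:space:]]*blacklist[[:space:]]+$1[[:space:]]*\$" "$2"
}
```

For example: `is_blacklisted sata_mv /etc/modprobe.conf.local || echo "sata_mv is NOT blacklisted"`. Keep in mind that a blacklist entry only stops the module from being pulled in automatically through its aliases; an explicit `modprobe sata_mv` will still load it.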

I always feel stupid when stuff like this happens to me, but Google reveals that I am not alone. My favorite is this proposed kernel patch to sata_mv.c that apparently spits out this kernel warning message:

“Highpoint RocketRAID BIOS CORRUPTS DATA on all attached drives, regardless of if/how they are configured. BEWARE!”

Anyway, there is probably plenty of blame to go around between the module conflict and my own foolishness. I should have dealt with those SCSI device listings first. On the other hand, I had never before seen the simple act of loading a module end in something as disastrous as a trashed RAID array. Looking at the binary driver packages that HighPoint has for SuSE, I see interesting files like:

sata_mv.ko -> hptmv.ko

There also appears to be code that removes the sata_mv module entirely. Yet running grep sata_mv * against the source code I used returns nothing. Maybe I should have assumed that their “Open Source” driver was different and more dangerous than the binary versions.

Or maybe I should have just gone with 3ware.

UPDATE: doing a full (as opposed to quick) initialization of the RAID5 is fun. It goes through about 1% every minute or two, then somewhere in the mid-60s jumps to completion. But when the OS boots up, the array hasn’t been created; each disk is separate. I then have to go back into the BIOS and set the array up with quick initialization, and then it appears to “stick”. I actually ran the full initialization again and checked every 10-15 minutes to try to catch what was happening at the end, but I apparently missed the window; it must zoom through 30% in like 5 minutes. Guess I’ll copy that terabyte to the array overnight and just see if it works.

Yeah, definitely 3ware next time.


4 Responses to A low point for HighPoint RocketRAID

  1. B says:

    I think I might have hit this bug. The RocketRAID utility shows the RAID5 array as degraded and somehow also managed to take the hot spare and claim it’s an array. However, the data seems to be intact, though the ext2 filesystem has errors. I am hoping it can hang in there till I pull off all the data. Did you encounter anything of this sort?

  2. theoden says:

    This doesn’t sound like the problem I had; in my situation the erroneous loading of the sata_mv module wiped out the array. I couldn’t access the data at all.

  3. Faust says:

    Hi there,

    When would data corruption typically take place?
    I have the RR 2300 with 3x 400GB disks attached. In BIOS they are seen as ‘NEW’.

    The sata_mv module is loaded and I added all 3 disks as a physical volume.
    Then I created a volume group and a logical volume. On this logical volume I created an XFS filesystem of about 1.1TB and mounted it.

    I then copied about 800GB of data on this new filesystem and using sha1sum verified the original files with the copied data. I then rebooted, remounted and ran the sha1 check again. Still all seems well…

    So when can I expect data corruption?

    Thanks!

  4. theoden says:

    Just going on my one experience here, I would think your main concern with data corruption would be after an OS or kernel upgrade. Obviously your array is working fine right now so your module configuration is probably safe. If you do an `lsmod` and find that both sata_mv and hptmv are loaded, then maybe this bug has been fixed. If only sata_mv is loaded, then I wouldn’t be surprised if there is a sata_mv.ko -> hptmv.ko symlink.
