Sunday, December 02, 2007

Recently, my 30 year old fridge decided it had had enough, and it promptly died.

In the process (I think the motor bearings seized), it overloaded the circuit and tripped the circuit breaker.

Unfortunately, the timing of this caused the power to go out just as a disk write was occurring on one of the disks in my media server, which caused it to get reasonably bad corruption, rendering it unmountable.

I didn't find any of this out until about 10.15, when I woke up, late for work, because my alarm clock hadn't gone off.

During boot, I would get:

[17179594.952000] hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[17179594.952000] hdd: dma_intr: error=0x40 { UncorrectableError }, LBAsect=4349, high=0, low=4349, sector=4343
[17179594.952000] ide: failed opcode was: unknown
[17179594.952000] end_request: I/O error, dev hdd, sector 4343
[17179594.952000] JBD: IO error reading journal superblock
[17179594.952000] EXT3-fs: error loading journal.

I added a "spare" disk I had hanging around (another disk that's on its way out and that I need to send back, but it was the only thing big enough), and attempted to use dd_rescue to get a copy of the contents of the drive.

I figured I could then muck around, and attempt to e2fsck that file, or loop mount it or something. This wasn't very successful. The first 50 MB of the drive looked toasted and took about an hour to read through, and the rest of the drive was reading at about 1 MB/sec, so after 4 1/2 hours or so I had an 11 GB file, which I couldn't do anything with.

I'd previously used Stellar Phoenix for recovering FAT/NTFS partitions in a similar state, and since it looked like it was mainly the superblock/journal that was stuffed, and not the data, it seemed worth a go. I found they had a Linux tool but, ironically, it runs on Windows.

I had a disk configured for doing recovery a while ago, but I couldn't find it, so I had to get another spare disk (a 120 GB one I used to use in my TiVo), wipe it, and install Windows 98 and the Stellar Phoenix Linux program on it.

I gave up on the dd_rescue, and moved the drive to the Windows machine. I ran Stellar across it, and it immediately found the drive and the logical drive, and listed stacks of files. This was looking somewhat promising.

The eval version of the software doesn't allow you to recover any files though, so I found an "alternate" version of the software, that does, and I rescanned the disk, and was able to recover a handful of files.

Most of them were corrupt, and when I checked the file listing in Stellar, files were either 0 bytes, or massively huge (a 6gb text file?). This wasn't working.

I rescanned the disk, doing an advanced scan, which I was hoping would find the alternate superblocks, and try reading the file listing from one of them, but after 42 hours, it hadn't finished, and I was fed up with waiting, since I didn't think it would make any difference to the file listing.

In the meantime, I had gone and bought some new hard drives. Plan B was to rebuild the data that was on the disk, from a backup from 3 months ago, and the contents of my ipod, but that would be a real pain, and it would require undeleting some data I moved from another disk to the corrupted one recently.

I had done some googling, and found processes to recover disks using debugfs etc.

So I gave up on Windows, and moved the disk back to the Linux machine.

Just trying to mount it would get me:

[17180880.456000] VFS: Can't find ext3 filesystem on dev hdd1.

What I ended up doing was using dumpe2fs on the corrupted partition to get a list of the alternate superblocks. This worked:

Backup superblock at 32768, Group descriptors at 32769-32783
Backup superblock at 98304, Group descriptors at 98305-98319
Backup superblock at 163840, Group descriptors at 163841-163855

(etc, with 11 more).
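A minimal sketch of pulling just the block numbers out of that listing, so they can be tried one by one. The function name list_backup_superblocks is made up for illustration; it reads dumpe2fs-style text on stdin, and the sample input is the three lines quoted above.

```shell
# Print one backup superblock block number per line, given
# dumpe2fs-style output on stdin. (Hypothetical helper name.)
list_backup_superblocks() {
    sed -n 's/^ *Backup superblock at \([0-9][0-9]*\),.*/\1/p'
}

# Sample input: the lines shown above. On the real disk this would be
# something like: dumpe2fs /dev/hdd1 | list_backup_superblocks
printf '%s\n' \
    'Backup superblock at 32768, Group descriptors at 32769-32783' \
    'Backup superblock at 98304, Group descriptors at 98305-98319' \
    'Backup superblock at 163840, Group descriptors at 163841-163855' |
    list_backup_superblocks
```

With the sample input this prints 32768, 98304 and 163840, one per line.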

I then had to pick out a backup superblock and convert its block number for mount. dumpe2fs reports the location in filesystem blocks, which are 4k on this partition, while mount's sb= option expects 1k units, so this is just multiplying it by 4.

819200 x 4 = 3276800
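The same arithmetic as a small shell function, in case the block size isn't 4k. This is only a sketch: the name sb_in_1k_units is invented here, and it assumes the block size is a multiple of 1024 bytes (true for ext2/ext3, where blocks are 1k, 2k or 4k).

```shell
# Convert a filesystem block number into the 1k units that mount's
# sb= option expects. (Hypothetical helper name.)
#   $1 = block number as reported by dumpe2fs
#   $2 = filesystem block size in bytes (e.g. 4096)
sb_in_1k_units() {
    echo $(( $1 * ($2 / 1024) ))
}

sb_in_1k_units 819200 4096   # prints 3276800
```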

I then passed an option to mount to use the alternate superblock. The first few times I tried to do this, it looked like it was trying, but then came back telling me things like the magic didn't match, or it was invalid etc.

[17180969.348000] EXT3-fs: Magic mismatch, very weird !

When I found a non-corrupt superblock backup and checked dmesg, I saw it was trying to load the journal, which was corrupt:

[17181822.756000] hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[17181822.756000] hdd: dma_intr: error=0x40 { UncorrectableError }, LBAsect=4349, high=0, low=4349, sector=4343
[17181822.756000] ide: failed opcode was: unknown
[17181822.756000] end_request: I/O error, dev hdd, sector 4343
[17181822.756000] JBD: IO error reading journal superblock
[17181822.756000] EXT3-fs: error loading journal.

I tried passing "noload" to skip loading the journal, but that didn't work.

I then tried forcing it to mount as an ext2 partition, and bang, it mounted.

mount -v -t ext2 /dev/hdd1 /disks/mp3 -o sb=3276800

dmesg says:

[17179910.436000] EXT2-fs warning (device hdd1): ext2_fill_super: mounting ext3 filesystem as ext2
[17179910.436000] EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
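Trying the backup superblocks by hand gets tedious, so here is a rough sketch of looping over them until one mounts. It is not what was run above: the function name try_backup_superblocks and the DRY_RUN switch are made up for illustration, and the ro option is an extra precaution added here so nothing gets written to the damaged disk. By default it only prints the mount commands; running them for real needs root and DRY_RUN=0.

```shell
# Try each backup superblock in turn as an ext2 mount.
# (Hypothetical helper; DRY_RUN=1, the default, only prints commands.)
#   $1 = device, $2 = mount point, $3 = block size in bytes,
#   remaining args = block numbers as reported by dumpe2fs
try_backup_superblocks() {
    dev=$1; mnt=$2; bs=$3
    shift 3
    for blk in "$@"; do
        sb=$(( blk * (bs / 1024) ))   # mount's sb= is in 1k units
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "mount -v -t ext2 -o ro,sb=$sb $dev $mnt"
        elif mount -v -t ext2 -o "ro,sb=$sb" "$dev" "$mnt"; then
            echo "mounted using backup superblock $blk (sb=$sb)"
            return 0
        fi
    done
    # Succeeds in dry-run mode; fails if no real mount attempt worked.
    [ "${DRY_RUN:-1}" = 1 ]
}

try_backup_superblocks /dev/hdd1 /disks/mp3 4096 32768 98304 163840
```

In dry-run mode this prints three mount commands, with sb=131072, sb=393216 and sb=655360.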

I then used rsync to copy selected directories from the corrupted disk to a new disk I had mounted on the machine.

And I'll finish off with a rant: yes, I should back up properly, yes, I should get a UPS (again, the last one caught fire), but.. CRAPPY SEAGATE DRIVE AGAIN.

Update: Just for fun, I decided to see if I'd get anywhere using dd_rescue, and e2fscking the image on a different drive.

I found and used dd_rhelp, instead of dd_rescue directly, since it works a bit more to my liking: when it hits a hard bit, it skips it, keeps going, and at the end, comes back to have another go at the dodgy reading bits.

This got me about 99.99% of the drive in an image. I ran e2fsck on this, and it said the journal was corrupt, so I trashed the journal, converting the image to ext2. e2fsck then "fixed" the problems and recreated the journal, converting it back to ext3.

I loop mounted the fixed image, and found that while everything was there, it was all under lost+found. It would have been a bit of a pain to work out what everything was, move it out, and rename it, but I could have done it if the alternate superblock mounting hadn't worked.

There is another method again, I found here, using debugfs, but I didn't try it.