Fixing a corrupted encrypted LVM partition

If you are desperate to recover your system, skip down to “Useful info starts here” to cut to the chase.

I just went through this ordeal so I thought I would pass these notes along for the next traveler. I have an MSI CR620 running Fedora 14. Everything has pretty much worked flawlessly on this machine (webcam, audio, wireless), except suspend/resume/hibernate has become flaky with the Fedora 14 kernels (2.6.35+). I have therefore been running the last Fedora 13 kernel (2.6.34.mumble), with which I’ve had no suspend/resume/hibernate issues. On the newer kernels, the machine will not resume correctly, and I have to hard-shutdown and start back up. That’s the background.

Enter a new UPS I purchased recently due to a suspicion that I have less than perfect utility input or wiring in the computer room. So far I like this unit a lot. It comes with a USB cable and (Windows) software that displays UPS status and allows you to change configuration. Great.

Well, I decided since it’s USB, I’ll just hook it up to my KVM and see what Linux thinks. Initially it was great. Fedora just recognized it, it showed up in the Gnome power management applet with battery status and all sorts of geeky statistics (graphs!).

Well. At some point that changed, and I’m not sure why (possibly because both my Windows and Linux machines were on?). The power applet seemed to always show 0.0% for battery charge. Unfortunately, the power management scheme (at least what is accessible via a UI) has built-in action triggers for UPS status. Want to guess what happens when it thinks your UPS battery is going to run out? Yeah. Hibernate. No. You can’t tell it not to (at least via the UI).

Unfortunately I discovered this while Shotwell was importing a bunch of my images (note: this does a massive amount of disk reads/writes). Apparently the import completely consumes CPU and memory, which would have been a problem on its own, even without the machine trying to hibernate every few seconds.

The machine hibernated on me (I was using a more recent kernel because VirtualBox-OSE will not run on the old kernel; apparently only the newer kernel modules are available). I went through the cold shutdown reboot process (I had been doing it so frequently at this point that it did not faze me at all). When it booted, I was presented with this horrifying message:

/var: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
[FAILED]
*** An error occurred …
*** Dropping you to a shell …
Give root password …

My face: D:

This would be disconcerting enough. But a little more background. I use whole-drive encryption. Since I dual-boot Windows and Linux, I use Truecrypt for the Windows half of my “whole drive” (I would consider just cannibalizing this partition, but, hey, I paid for that Windows license), and an LVM volume group with a LUKS-encrypted root and home partition. I also set up sudo so that I can sudo to root. I change the root password to a random strong password with mkpasswd (this is in the ‘expect’ package of all places). Which I do not keep. I figure I don’t ever really need to log in as root right? (Needless to say I have reconsidered this).

In any case, I don’t have my root password.

Useful info starts here.

All is not lost. There actually isn’t anything that magical about LVM or LUKS encryption, and it is possible to recover from a situation like this in a fairly straightforward fashion. I owe a debt of gratitude to nteon and hannes in the #fedora channel on Freenode for walking me through it:

Boot into a LiveCD (or some other CD that has LVM and LUKS support). If you don’t have the media, well, you better have a machine with access to the net to get it.
Run 'cryptsetup luksOpen /dev/<encrypted partition> <name>' to “unlock” the LUKS partition on your LVM volume group, where <name> is any label you choose for the unlocked device (it shows up as /dev/mapper/<name>). This should prompt you for the encryption password. You can also use the file browser (at least Nautilus on Gnome), or the Gnome Disk Utility app, to “unlock” the partition. I believe these all mean and do the same thing, although I use the command line to be absolutely sure.
Since the volume was unlocked manually, you will need to run a couple of other arcane commands to tell the system that no, really, the volumes are actually there:
'vgscan --mknodes'. This scans for LVM volume groups and makes device nodes for their partitions. You need device nodes to do anything with the partitions (like, say, fsck them).
'vgchange -ay'. This tells the system to make all the volumes/partitions “active” (I don’t totally understand it, but I infer that volumes can be active and inactive).
Now you should have partition devices under '/dev/<volgroup>/<partition names>'. They should be recognizable, because you set them up and presumably named them. My volume group was named after the machine (vg_msicr620) and my partitions were named conventionally based on what they did (lv_root, lv_home; I think the install did that).
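The unlock steps above can be sketched as a small script. This is only an illustration under assumptions: /dev/sdXN and the mapping name "recovered" are placeholders that do not come from my setup, and the final 'lvs' call is just one way to list what was found. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the LiveCD unlock sequence. Placeholders (assumptions):
#   LUKS_DEV - the encrypted partition, e.g. /dev/sdXN
#   MAP_NAME - an arbitrary name for the unlocked mapping
# With DRY_RUN=1 (the default) commands are printed, not executed.
LUKS_DEV="${LUKS_DEV:-/dev/sdXN}"
MAP_NAME="${MAP_NAME:-recovered}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

unlock_and_activate() {
    run cryptsetup luksOpen "$LUKS_DEV" "$MAP_NAME"  # prompts for the passphrase
    run vgscan --mknodes                             # find volume groups, create device nodes
    run vgchange -ay                                 # activate all logical volumes
    run lvs                                          # list the logical volumes that were found
}

unlock_and_activate
```

To run it for real you would set DRY_RUN=0 and LUKS_DEV to your actual encrypted partition, as root, from the LiveCD.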

Now you can ’fsck’ your partitions to figure out what is wrong:

fsck /dev/vg_msicr620/lv_root
fsck /dev/vg_msicr620/lv_home
fsck /dev/non_volgroup_partition
etc.

(If you know the file system type, you may want to try using the '-t <fstype>' flag. I fscked several times, and I think fsck was able to figure out the type when I omitted it.)
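If you would rather not guess the type, something like the following might help. It is a sketch, assuming 'blkid' is on the LiveCD and that the file system is one whose checker understands '-n' (report only, change nothing), which is the case for ext2/3/4. The function name check_volume is made up for this example, and the volume should not be mounted when you check it.

```shell
#!/bin/sh
# Sketch: detect the file system type, then do a report-only check.
# With DRY_RUN=1 (the default) the fsck command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

check_volume() {
    dev="$1"
    # Ask blkid for just the TYPE value (e.g. "ext4").
    fstype=$(blkid -o value -s TYPE "$dev")
    if [ -z "$fstype" ]; then
        echo "could not detect a file system on $dev" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: fsck -t $fstype -n $dev"
    else
        # -n is handed to the file-system-specific checker: no repairs.
        fsck -t "$fstype" -n "$dev"
    fi
}

# Example (as root, volume unmounted):
#   DRY_RUN=0 check_volume /dev/vg_msicr620/lv_root
```

Once the report-only pass looks sane, a plain 'fsck' on the same device, as shown earlier, does the actual repair.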

In my case, apparently the sole problem with the partitions, and the cause of this entire ordeal, was that the superblock had a future date as its last-mounted date. I assume this was because I had enabled power management tracing (with 'echo 1 > /sys/power/pm_trace') as described in the Fedora common problems guide, which is noted to change the clock; some sequence of events must have left the last-mounted date with the wrong time.
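For ext2/3/4 volumes you can look at that timestamp yourself. The sketch below assumes 'dumpe2fs' (from e2fsprogs) and GNU 'date' are available on the LiveCD; the helper names last_mount_time and is_future are made up for this example.

```shell
#!/bin/sh
# Sketch: read the superblock's last mount time and compare it to the clock.

# Print the "Last mount time" field from the superblock of an ext2/3/4 device.
last_mount_time() {
    dumpe2fs -h "$1" 2>/dev/null | sed -n 's/^Last mount time:[[:space:]]*//p'
}

# Succeed if the given timestamp is later than the current time.
# Returns 2 if GNU date cannot parse the timestamp.
is_future() {
    ts=$(date -d "$1" +%s) || return 2
    now=$(date +%s)
    [ "$ts" -gt "$now" ]
}

# Example (as root):
#   t=$(last_mount_time /dev/vg_msicr620/lv_root)
#   if is_future "$t"; then echo "last mount time is in the future: $t"; fi
```

Keep in mind that if the clock itself is wrong (which is what pm_trace is noted to cause), the comparison is only as good as the LiveCD's idea of the current time.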

Once you are done fscking around you can reboot and boot from the HD. If you did not have serious problems with the partitions, everything will be fine and you will boot into your system. If, like me, you either do not have your root password, or have forgotten it, you can reset it by mounting your root volume while booted into the LiveCD and editing your '/etc/shadow' file. Obviously do this with care. There are several different ways to reset a root password. Booting into single user mode did not work for me (probably related to corruption on the root volume).
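One way to do the '/etc/shadow' edit is to blank out root's password hash, which on many setups lets root log in without a password so you can immediately set a new one with 'passwd'. This is a sketch of that one approach, not necessarily the edit I made; the function name clear_root_password and the /mnt/sysroot mount point are hypothetical. It keeps a backup and rewrites the file in place so ownership and permissions stay as they were.

```shell
#!/bin/sh
# Sketch: blank the root password hash in a shadow file (keeps a .bak copy).

clear_root_password() {
    shadow="$1"
    # Keep a copy with the original mode and timestamps.
    cp -p "$shadow" "$shadow.bak" || return 1
    tmp=$(mktemp) || return 1
    # Empty the second (hash) field on the root line only.
    sed 's/^root:[^:]*:/root::/' "$shadow" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the existing file rather than replacing it.
    cat "$tmp" > "$shadow"
    rm -f "$tmp"
}

# Example (as root, from the LiveCD, after unlocking and activating):
#   mkdir -p /mnt/sysroot
#   mount /dev/vg_msicr620/lv_root /mnt/sysroot
#   clear_root_password /mnt/sysroot/etc/shadow
#   umount /mnt/sysroot
```

After rebooting, log in as root and run 'passwd' right away, since an empty password field is not something you want to leave lying around.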

To take forward:

Recovering from filesystem corruption on an encrypted partition is really no different from recovering any other type of partition; there are just a few more steps to “open” the encrypted volume. After that it’s pretty much the same process, although it’s no less stressful.
If you set a synthetic root password, record it somewhere (like KeePass) because you may in fact need it.
I’m glad I had established a frequent and comprehensive backup schedule (via Deja Dup). It would not have been fun to lose this disk, but at least I had reliable backups.
Investigate options that can make ext4 more reliable. There has been some controversy over ext4, and after this I am totally not in a mood to sacrifice reliability for…anything, really.
#fedora is good people. Without them I probably would have rocked myself to sleep in a corner crying.
For now I have just disabled sleep and hibernate while running F14 kernels. This is not a great situation but it’s better than constantly locking up the machine. Since this worked basically flawlessly in the past, I assume it’s just a matter of time before the regression gets fixed.

Devoured By Lions

the eternal struggle to tame complexity
