Glider
"In het verleden behaalde resultaten bieden geen garanties voor de toekomst"
About this blog

These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.

Most content on this site is licensed under the WTFPL, version 2 (details).

Questions? Praise? Blame? Feel free to contact me.

My old blog (pre-2006) is also still available.

Recovering data from a failing hard disk with HFS+

Recently, a customer asked me to have a look at an external hard disk he was using with his Macbook. It would show a file listing just fine, but when trying to open actual files, it would start failing. Of course there was no backup, but the files were very precious...

This started out as a small question, but ended up in an adventure that spanned a few days and took me deep into the ddrescue recovery tool, through the HFS+ filesystem and past USB power port control. I learned a lot, discovered some interesting things and produced a pile of scripts that might be helpful to others. Since the journey seems as interesting as the end result, I will describe the steps I took here, "ter leering ende vermaeck" (for learning and amusement).

I started out confirming the original problem. Plugging the disk into my Linux laptop, it showed up as expected in dmesg. I could mount the disk without problems, see the directory listing and even open up an image file stored on the disk. Opening other files didn't seem to work.

SMART

As you do with bad disks, you try to get their SMART data. Since smartctl did not support this particular USB bridge (and I wasn't game to try random settings to see if they worked on a failing disk), I gave up on SMART initially. I later opened up the case to bypass the USB-to-SATA controller (in case the problem was there, and to make SMART work), but found that this particular hard drive had the converter built into the drive itself (so the USB part was directly attached to the drive). Even later, I found a page online (I have not saved the link) showing that the disk was indeed supported by smartctl, including the option to pass to smartctl -d to make it work. SMART confirmed that the disk was indeed failing, based on the number of reallocated sectors (2805).
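
For reference, the -d option tells smartctl which USB bridge protocol to use. A typical invocation looks something like this (the device name and the "sat" protocol are just examples; the right value depends on the bridge):

# Query SMART data through a USB-to-SATA bridge; "sat" is a common
# pass-through protocol, other bridges need e.g. -d usbjmicron or -d usbsunplus.
# Attribute 5 (Reallocated_Sector_Ct) is the reallocated sector count mentioned above.
sudo smartctl -a -d sat /dev/sdd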

Fast-then-slow copying

Since opening up files didn't work so well, I prepared to make a sector-by-sector copy of the partition on the disk, using ddrescue. This tool has a good approach to salvaging data, where it tries to copy off as much data as possible quickly, skipping data when it comes to a bad area on disk. Since reading a bad sector on a disk often takes a lot of time (before returning failure), ddrescue tries to steer clear of these bad areas and focus on the good parts first. Later, it returns to these bad areas and, in a few passes, tries to get out as much data as possible.

At first, copying data seemed to work well, giving a decent read speed of some 70MB/s. But very quickly the speed dropped terribly and I suspected the disk ran into some bad sector and kept struggling with it. I reset the disk (by unplugging it) and did a few more attempts, and quickly discovered something weird: the disk would work just fine after plugging it in, but after a while the speed would plummet to a whopping 64Kbyte/s or less. This happened every time. Even more, it happened pretty much exactly 30 seconds after I started copying data, regardless of what part of the disk I copied data from.

So I quickly wrote a one-liner script that would start ddrescue, kill it after 45 seconds, wait for the USB device to disappear and reappear, and then start over again. I then spent some time replugging the USB cable about once every minute, so I could at least back up some data while I was investigating other things.

Since the speed was originally 70MB/s, I could pull a few GB worth of data every time. Since it was a 2000GB disk, I "only" had to plug the USB connector around a thousand times. Not entirely infeasible, but not quite comfortable or efficient either.

So I investigated ways to further automate this process: using hdparm to spin down or shut down the disk, using USB power saving to let the disk reset itself, disabling the USB subsystem completely. But nothing restored the speed other than completely powering down the disk by removing the USB plug.

While I was trying these things, the speed during those first 30 seconds dropped as well, at some point even below 10MB/s. At that point, I could salvage around 200MB with each power cycle and was looking at pulling the USB plug around 10,000 times: no way that would be happening manually.

Automatically pulling the plug

I resolved to further automate this unplugging and planned to use an Arduino (or perhaps the GPIO of a Raspberry Pi) and something like a relay or transistor to interrupt the power line to the hard disk, effectively "unplugging" it.

For that, I needed my Current measuring board, which makes it easy to interrupt the USB power lines, but I had to bring it from home first. In the meanwhile, I found uhubctl, a small tool that uses low-level USB commands to individually control the port power on some hubs. Most hubs don't support this (or advertise support, but simply don't have the electronics to actually switch power, apparently), but I noticed that newer Raspberry Pis support this (for port 2 only, but that would be enough).
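
For reference, a minimal example of how uhubctl is used (the hub location and port number are examples and differ per machine):

# List hubs that support per-port power switching, and their port status
sudo uhubctl
# Power off and back on port 2 of the hub at location 1-1, with a 5 second delay
sudo uhubctl -l 1-1 -p 2 -a cycle -d 5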

Coming into the office the next day, I set up a Raspberry Pi and tried uhubctl. It did indeed toggle USB power, but the toggle would affect all USB ports at the same time, rather than just port 2. So I could switch power to the faulty drive, but that would also cut power to the good drive I was storing the recovered data on, and I was not quite prepared to give the good drive 10,000 powercycles.

The next plan was to connect the recovery drive through the network, rather than directly to the Raspberry Pi. On Linux, setting up a network drive using SSHFS is easy, so that worked in a few minutes. However, somehow ddrescue insisted it could not write to the destination file and logfile, citing permission errors (even though the permissions seemed just fine). I suspect it might be trying to mmap or do something else that does not work across SSHFS...

The next plan was to find a powered hub, so the recovery drive could stay powered while the failing drive was powercycled. I rummaged around the office looking for USB hubs, and eventually came up with a USB-based docking station that was externally powered. When connecting it, I tried the uhubctl tool on it, and found that one of its six ports actually supported power toggling. So I connected the failing drive to that port, and prepared to start the backup.

When trying to mount the recovery drive, I discovered that a Raspberry Pi only supports filesystems up to 2TB (probably because it uses a 32-bit architecture). My recovery drive was 3TB, so that would not work on the Pi.

Time for a new plan: do the recovery from a regular PC. I already had one ready that I had used the previous day, but now I needed to boot a proper Linux on it (previously I had used a minimal Linux image from UBCD, but that didn't have a compiler installed to allow building uhubctl). So I downloaded a Debian live image (over a mobile connection, since we were still waiting for fiber to be connected) and 1.8GB and 40 minutes later, I finally had a working setup.

The run.sh script I used to run the backup basically does this (a rough sketch of such a loop follows the list):

  1. Run ddrescue to pull off data
  2. After 35 seconds, kill ddrescue
  3. Tell the disk to sleep, so it can spin down gracefully before cutting the power.
  4. Tell the disk to sleep again, since sometimes it doesn't work the first time.
  5. Cycle the USB power on the port
  6. Wait for the disk to re-appear
  7. Repeat from 1.
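
The real run.sh is in the scripts download mentioned at the end of this post, but a rough sketch of such a loop (run as root; the device names, hub location, port and timings are examples that need adjusting) looks like this:

#!/bin/sh
DISK=/dev/sdd            # the failing disk
PART=/dev/sdd2           # the partition being rescued
IMG=backup.img           # image on the (healthy) recovery drive
LOG=backup.logfile       # ddrescue mapfile/logfile

while true; do
    # 1+2: copy data, giving up after 35 seconds; if ddrescue finishes
    # the whole job before the timeout, stop looping
    timeout 35 ddrescue -d "$PART" "$IMG" "$LOG" && break

    # 3+4: ask the disk to spin down gracefully (twice, since it
    # sometimes ignores the first request)
    hdparm -Y "$DISK"
    sleep 2
    hdparm -Y "$DISK"

    # 5: power cycle the USB port (hub location and port are examples)
    uhubctl -l 1-1 -p 1 -a cycle -d 5

    # 6: wait for the disk to reappear, then start over
    while [ ! -b "$PART" ]; do sleep 1; done
    sleep 2
done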

By now, the speed of recovery had been fluctuating a bit, but was between 10MB/s and 30MB/s. That meant I was looking at somewhere between a few thousand and ten thousand powercycles, and anywhere from a few days up to a week to back up the complete disk (and more if the speed dropped further).

Selectively backing up

Realizing that there was a fair chance the disk would indeed get slower, or even die completely due to all these power cycles, I had to assume I could not back up the complete disk.

Since I was making the backup sector by sector using ddrescue, this meant a risk of not getting any meaningful data at all. Files are typically fragmented, so they can be stored anywhere on the disk, possibly spread over multiple areas as well. If you just start copying at the start of the disk, but do not make it to the end, you will have backed up some data, but that data could belong to all kinds of different files. That means that you might have some files in a directory, but not others. Also, a lot of files might only be partially recovered, with the missing parts being read as zeroes. Finally, you will also end up backing up all unused space on the disk, which is rather pointless.

To prevent this, I had to figure out where all kinds of stuff was stored on the disk.

The catalog file

The first step was to make sure the backup file could be mounted (using a loopback device). On my first attempt, I got an error about an invalid catalog.

I looked around for some documentation about the HFS+ filesystem, and found a nice introduction by infosecaddicts.com and a more detailed description at dubeiko.com. The catalog is apparently where the directory structure, filenames and other metadata are all stored in a single place.

This catalog is not in a fixed location (since its size can vary), but its location is noted in the so-called volume header, a fixed-size data structure located 1024 bytes from the start of the partition. More details (including easier-to-read offsets within the volume header) are provided in this example.

Looking at the volume header inside the backup gives me:

root@debian:/mnt/recover/WD backup# dd if=backup.img bs=1024 skip=1 count=1 2> /dev/null | hd
00000000  48 2b 00 04 80 00 20 00  48 46 53 4a 00 00 3a 37  |H+.... .HFSJ..:7|
00000010  d4 49 7e 38 d8 05 f9 64  00 00 00 00 d4 49 1b c8  |.I~8...d.....I..|
00000020  00 01 24 7c 00 00 4a 36  00 00 10 00 1d 1a a8 f6  |..$|..J6........|
                                   ^^^^^^^^^^^ Block size: 4096 bytes
00000030  0e c6 f7 99 14 cd 63 da  00 01 00 00 00 01 00 00  |......c.........|
00000040  00 02 ed 79 00 6e 11 d4  00 00 00 00 00 00 00 01  |...y.n..........|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  a7 f6 0c 33 80 0e fa 67  |...........3...g|
00000070  00 00 00 00 03 a3 60 00  03 a3 60 00 00 00 3a 36  |......`...`...:6|
00000080  00 00 00 01 00 00 3a 36  00 00 00 00 00 00 00 00  |......:6........|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000c0  00 00 00 00 00 e0 00 00  00 e0 00 00 00 00 0e 00  |................|
000000d0  00 00 d2 38 00 00 0e 00  00 00 00 00 00 00 00 00  |...8............|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000110  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
00000120  00 0d 82 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000160  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
00000170  00 00 e0 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400

00000110  00 00 00 00 12 60 00 00  12 60 00 00 00 01 26 00  |.....`...`....&.|
          ^^^^^^^^^^^^^^^^^^^^^^^ Catalog size, in bytes: 0x12600000

00000120  00 0d 82 38 00 01 26 00  00 00 00 00 00 00 00 00  |...8..&.........|
                      ^^^^^^^^^^^ First extent size, in 4k blocks: 0x12600
          ^^^^^^^^^^^ First extent offset, in 4k blocks: 0xd8238
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

I have annotated the parts that refer to the catalog. The content of the catalog (just like that of all other files) is stored in "extents". An extent is a single, contiguous block of storage that contains (a part of) the content of a file. Each file can consist of multiple extents, to prevent having to move file content around each time things change (i.e. to allow fragmentation).

In this case, the catalog is stored in a single extent (since the subsequent extent descriptors contain only zeroes). All extent offsets and sizes are in blocks of 4k bytes, so this extent lives at 0xd8238 * 4k = byte 3626205184 (~3.4G) and is 0x12600 * 4k = 294MiB long. So I backed up the catalog by adding -i 3626205184 to ddrescue, making it skip ahead to the location of the catalog (and then power cycled a few times until it had copied the needed 294MiB).
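
In shell terms, the arithmetic and the corresponding ddrescue invocation look roughly like this (the -s option to stop after the extent is optional but saves power cycles; device and file names as used elsewhere in this post):

# First catalog extent: offset 0xd8238 blocks, length 0x12600 blocks of 4k each
CAT_OFFSET=$(( 0xd8238 * 4096 ))   # 3626205184 bytes (~3.4G)
CAT_SIZE=$(( 0x12600 * 4096 ))     # 308281344 bytes (294MiB)
sudo ddrescue -d -i "$CAT_OFFSET" -s "$CAT_SIZE" /dev/sdd2 backup.img backup.logfile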

After backing up the catalog, I could mount the image file just fine and navigate the directory structure. Trying to open files would mostly fail, since most files would only read zeroes now.

I did the same for the allocation file (which tracks free blocks), the extents file (which tracks the content of files that are more fragmented and whose extent list does not fit in the catalog) and the attributes file (not sure what that is, but I grabbed it for good measure).

Afterwards, I wanted to continue copying from where I had previously left off, so I tried passing -i 0 to ddrescue, but it seems this option can only be used to skip ahead, not back. In the end, I just edited the logfile, which is just a text file, to set the current position to 0. ddrescue is smart enough to skip over blocks it already backed up (or marked as failed), so it then continued where it previously left off.
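
The current position is on the first non-comment line of the mapfile, so resetting it can even be scripted. A hedged one-liner, assuming GNU sed and the standard mapfile layout:

# Rewrite the first field of the first non-comment line (the current
# position) to zero, so ddrescue starts scanning from the start again
sed -i '0,/^0x/ s/^0x[0-9a-fA-F]*/0x00000000/' backup.logfile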

Where are my files?

With the catalog backed up, I needed to read it to figure out which files were stored where, so I could make sure the most important files were backed up first, followed by all other files, skipping any unused space on the disk.

I considered and tried some tools for reading the catalog directly, but none of them seemed workable. I looked at hfssh from hfsutils (which crashed), hfsdebug (which is discontinued and no longer available for download) and hfsinspect (which calls itself "quite buggy").

Instead, I found the filefrag commandline utility, which uses a Linux filesystem syscall to figure out where the contents of a particular file are stored on disk. To coax the output of that tool into a list of extents usable by ddrescue, I wrote a one-liner shell script called list-extents.sh:

sudo filefrag -e "$@"  | grep  '^   ' |sed 's/\.\./:/g' | awk -F: '{print $4, $6}'

Given any number of filenames, it produces a list of (start, size) pairs for each extent in the listed files (in 4k blocks, which is the Linux VFS native block size).

With the backup image loopback-mounted at /mnt/backup, I could then generate an extent list for a given subdirectory using:

sudo find /mnt/backup/SomeDir -type f -print0 | xargs -0 -n 100 ./list-extents.sh > SomeDir.list

To turn this plain list of extents into a logfile usable by ddrescue, I wrote another small script called post-process.sh, that adds the appropriate header, converts from 4k blocks to 512-byte sectors, converts to hexadecimal and sets the right device size (so if you want to use this script, edit it with the right size). It is called simply like this:

./post-process.sh SomeDir.list

This produces two new files: SomeDir.list.done, in which all of the selected files are marked as "finished" (and all other blocks as "non-tried") and SomeDir.list.notdone which is reversed (all selected files are marked as "non-tried" and all others are marked as "finished").

Backing up specific files

Armed with a couple of these logfiles for the most important files on the disk and one for all files on the disk, I used the ddrescuelog tool to tell ddrescue what stuff to work on first. The basic idea is to mark everything that is not important as "finished", so ddrescue will skip over it and only work on the important files.

ddrescuelog backup.logfile --or-mapfile SomeDir.list.notdone | tee todo.original > todo

This uses the ddrescuelog --or-mapfile option, which takes my existing logfile (backup.logfile) and marks all bytes as finished that are marked as finished in the second file (SomeDir.list.notdone). In other words, it marks all bytes that are not part of SomeDir as done. This generates two copies (todo and todo.original) of the result; I'll explain why in a minute.

With the generated todo file, we can let ddrescue run (though I used the run.sh script instead):

# Then run on the todo file
sudo ddrescue -d /dev/sdd2 backup.img todo -v -v

Since the generation of the todo file effectively threw away information (we can no longer see from the todo file which parts of the non-important sectors were already copied, or had errors, etc.), we need to keep the original backup.logfile around too. Using the todo.original file, we can figure out what the last run did, and update backup.logfile accordingly:

ddrescuelog backup.logfile --or-mapfile <(ddrescuelog --xor-mapfile todo todo.original) > newbackup.logfile

Note that you could also use SomeDir.list.done here, but actually comparing todo and todo.original helps in case there were any errors in the last run (so the error sectors will not be marked as done and can be retried later).

With backup.logfile updated, I could move on to the next subdirectories, and once all of the important stuff was done, I did the same with a list of all file contents to make sure that all files were properly backed up.

But wait, there's more!

Now I had the contents of all files backed up, so the data was nearly safe. I did however find that the disk contained a number of hardlinks and/or symlinks, which did not work. I did not dive into the details, but it seems that some of the metadata and perhaps even file content is stored in a special "metadata directory", which is hidden by the Linux filesystem driver. So my filefrag-based "all files" method above did not back up sufficient data to actually read these link files from the backup.

I could have figured out where on disk these metadata files were stored and done a backup of that, but then I still might have missed some other special blocks that are not part of the regular structure. I could of course back up every block, but then I would be copying around 1000GB of mostly unused space, of which only a few MB or GB would actually be relevant.

Instead, I found that HFS+ keeps an "allocation file". This file contains a single bit for each block in the filesystem, storing whether the block is allocated (1) or free (0). Simply looking at this bitmap and backing up all blocks that are allocated should make sure I had all data, leaving only unused blocks behind.

The position of this allocation file is stored in the volume header, just like the catalog file. In my case, it was stored in a single extent, making it fairly easy to parse.

The volume header says:

00000070  00 00 00 00 03 a3 60 00  03 a3 60 00 00 00 3a 36  |......`...`...:6|
          ^^^^^^^^^^^^^^^^^^^^^^^ Allocation file size, in bytes: 0x3a36000

00000080  00 00 00 01 00 00 3a 36  00 00 00 00 00 00 00 00  |......:6........|
                      ^^^^^^^^^^^ First extent size, in 4k blocks: 0x3a36
          ^^^^^^^^^^^ First extent offset, in 4k blocks: 0x1

This means the allocation file takes up 0x3a36 blocks of 4096 bytes each. With 8 bits per byte, it can store the status of 0x3a36 * 4k * 8 = 0x1d1b0000 blocks, which is the volume's total size of 0x1d1aa8f6 blocks, rounded up.
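
This rounding is easy to verify with some shell arithmetic:

# bits in the allocation file = number of blocks its bitmap can describe
printf '0x%x\n' $(( 0x3a36 * 4096 * 8 ))   # prints 0x1d1b0000
# ...which is the volume's total of 0x1d1aa8f6 blocks, rounded up to the
# bitmap's granularity (each 4096-byte bitmap block covers 32768 blocks)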

First, I got the allocation file off the disk image (this uses bash arithmetic expansion to convert hex to decimal; you can also do this manually):

dd if=/dev/backup of=allocation bs=4096 skip=1 count=$((0x3a36))

Then, I wrote a small Python script, parse-allocation-file.py, to parse the allocation file and output a ddrescue mapfile. I started out in bash, but that got tricky with bit manipulation, so I quickly switched to Python.

The first attempt at this script would just output a single line for each block and let ddrescuelog merge adjacent blocks, but that produced such a large file that I stopped it and improved the script to do the merging directly.

cat allocation | ./parse-allocation-file.py > Allocated.notdone

This produces an Allocated.notdone mapfile, in which all free blocks are marked as "finished", and all allocated blocks are marked as "non-tried".

As a sanity check, I verified that there was no overlap between the non-allocated areas and all files (i.e. the output of the following command showed no done/rescued blocks):

ddrescuelog AllFiles.list.done --and-mapfile Allocated.notdone | ddrescuelog --show-status -

Then, I looked at how much data was allocated, but not part of any file:

ddrescuelog AllFiles.list.done --or-mapfile Allocated.notdone | ddrescuelog --show-status -

This marked all non-allocated areas and all files as done, leaving a whopping 21GB of data that was somehow in use, but not part of any file. This size includes things like the volume header, the catalog and the allocation file itself, but 21GB seemed a lot to me. It also includes the metadata directory, so perhaps there's a bit of data in there for each file on disk, or perhaps the file content of hard-linked data?

Nearing the end

Armed with my Allocated.notdone file, I used the same commands as before to let ddrescue back up all allocated sectors and made sure all data was safe.

For good measure, I then let ddrescue continue backing up the remainder of the disk (i.e. all unallocated sectors), but it seemed the disk was nearing its end now. The backup speed (even during the "fast" first 30 seconds) had dropped to under 300kB/s, so I was looking at a couple more weeks (and thousands of powercycles) for the rest of the data, assuming the speed did not drop further. Since the rest of the backup should only be unused space, I shut down the backup and focused on the recovered data instead.

What was interesting was that during all this time, the number of reallocated sectors (as reported by SMART) had not increased at all. So it seems unlikely that the slowness was caused by bad sectors (unless the disk firmware somehow tried to recover data from these reallocated sectors in the background and locked itself up in the process). The slowness also did not seem related to which sectors I had been reading. I'm happy that the data was recovered, but I honestly cannot tell why the disk was failing in this particular way...

In case you're in a similar position, the scripts I wrote are available for download.

So, with a few days of work, around a week of crunch time for the hard disk and about 4,000 powercycles, all 1000GB of files were safe again. Time to get back to some real work :-)

 
0 comments -:- permalink -:- 14:53
Automatically remotely attaching tmux and forwarding things

GnuPG logo

I recently upgraded my systems to Debian Stretch, which caused GnuPG to stop working within Mutt. I'm not exactly sure what was wrong, but I discovered that GnuPG version 2 changed quite some things and relies more heavily on the gpg-agent, and that recent SSH versions can forward unix domain sockets instead of just TCP sockets, which allows forwarding a gpg-agent connection over SSH.

Until now, I had my GPG private keys stored on my server, Tika, where my Mutt mail client also runs. However, storing private keys, even with a passphrase, on a permanently connected multi-user system never felt quite right. So this seemed like a good opportunity to set up proper forwarding for my gpg-agent, and keep my private keys confined to my laptop.

I already had some small scripts in place to easily connect to my server through SSH, attach to the remote tmux session (or start it), set up some port forwards (in particular a reverse port forward for SSH so my mail client and IRC client could open links in my browser), and quickly reconnect when the connection fails. However, one annoyance was that when the connection failed, the server might not immediately notice, so reconnecting usually left me with failed port forwards (since the remote listening port was still taken by the old session). This seemed like a good occasion to fix that as well.

The end result is a reasonably complex script, that is probably worth sharing here. The script can be found in my scripts git repository. On the server, it calls an attach script, but that's not much more than attaching to tmux, or starting a new session with some windows if no session is running yet.

The script is reasonably well-commented, including an introduction on what it can do, so I will not repeat that here.

For the GPG forwarding, I based my setup on this blogpost. There, they suggest configuring an extra-socket in gpg-agent.conf, but I found that gpg-agent already creates an extra socket (whose path I could query with gpgconf --list-dirs), so I didn't use that extra-socket configuration line. They also talk about setting StreamLocalBindUnlink to clean up a lingering socket when creating a new one, but that is already handled by my script instead.
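
For reference, a minimal sketch of such a forward, leaving out all the reconnection logic in my script (the hostname is an example; the socket paths are queried with gpgconf on both sides, and the stale-socket removal is what takes the place of StreamLocalBindUnlink here):

# Socket of the local gpg-agent that is meant for forwarding to other hosts
LOCAL_SOCK=$(gpgconf --list-dirs agent-extra-socket)
# Socket where gpg on the server expects to find its agent
REMOTE_SOCK=$(ssh tika gpgconf --list-dirs agent-socket)

# Remove a possibly lingering socket from a previous session, then connect
# with the agent socket forwarded and attach to (or start) the tmux session
ssh tika "rm -f '$REMOTE_SOCK'"
ssh -t -R "$REMOTE_SOCK:$LOCAL_SOCK" tika tmux new-session -A -s main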

Furthermore, to prevent a gpg-agent from being autostarted by GnuPG server-side (in case the forwarding fails, or when I connect without this script, etc.), I added no-autostart to ~/.gnupg/gpg.conf. I'm not running a systemd user session on my server, but if you are, you might need to disable or mask some gpg-agent sockets and/or services to prevent systemd from creating sockets for gpg-agent and starting it on demand.

My next step is to let gpg-agent also be my ssh-agent (or perhaps just use plain ssh-agent) to enforce confirming each SSH authentication request. I'm currently using gnome-keyring / seahorse as my SSH agent, but that just silently approves everything, which doesn't really feel secure.

 
0 comments -:- permalink -:- 16:46
Bouncing packets: Kernel bridge bug or corner case?

Tux

While setting up Tika, I stumbled upon a fairly unlikely corner case in the Linux kernel networking code that prevented some of my packets from being delivered at the right place. After quite some digging through debug logs and kernel source code, I found the cause of this problem in the way the bridge module handles netfilter and iptables.

Just in case someone else finds himself in this situation and actually manages to find this blogpost, I'll detail my setup, the problem and its solution here.

See more ...

 
0 comments -:- permalink -:- 18:40
CrashPlan: Cheap cloud backup that runs on Linux

For some time, I've been looking for a decent backup solution. Such a solution should:

  • be completely unattended,
  • do off-site backups (and possibly onsite as well)
  • be affordable (say, €5 per month max)
  • run on Linux (both desktops and headless servers)
  • offer plenty of space (couple of hundred gigabytes)

Up until now I hadn't found anything that met my demands. Most backup solutions don't run on (headless) Linux, and most generic cloud storage providers are way too expensive (because they offer high-availability, high-performance storage, which I don't really need).

Backblaze seemed interesting when they launched a few years ago. They just took enormous piles of COTS hard disks and crammed a couple dozen of them in a custom designed case, to get a lot of cheap storage. They offered an unlimited backup plan, for only a few euros per month. Ideal, but it only works with their own backup client (no normal FTP/DAV/whatever supported), which (still) does not run on Linux.

Crashplan

Crashplan logo

Recently, I had another look around and found CrashPlan, which offers an unlimited backup plan for only $5 per month (note that they advertise with $3 per month, but that is only when you pay in advance for four years of subscription, which is a bit much. Given that if you cancel beforehand, you will still get a refund of any remaining months, paying up front might still be a good idea, though). They also offer a family pack, which allows you to run CrashPlan on up to 10 computers for just over twice the price of a single license. I'll probably get one of these, to back up my laptop, Brenda's laptop and my colocated server.

The best part is that the CrashPlan software runs on Linux, and even on a headless Linux server (which is not officially supported, but CrashPlan does document the setup needed). The headless setup is possible because CrashPlan runs a daemon (as root) that takes care of all the actual work, while the GUI connects to the daemon through a TCP port. I still need to double-check what this means for security, though (especially on a multi-user system, I don't want every user with localhost TCP access to be able to administer my backups), but it seems that CrashPlan can be configured to require the account password when the GUI connects to the daemon.

The CrashPlan software itself is free and allows you to do local backups and backups to other computers running CrashPlan (either running under your own account, or computers of friends running on separate accounts). Another cool feature is that it keeps multiple snapshots of each file in the backup, so you can even get back a previous version of a file you messed up. This part is entirely configurable, but by default it keeps up to one snapshot every 15 minutes for recent changes, reducing that to one snapshot per month for snapshots over a year old.

When you pay for a subscription, the software transforms into CrashPlan+ (no reinstall required) and you get extra features such as multiple backup sets, automatic software upgrades and most notably, access to the CrashPlan Central cloud storage.

I've been running the CrashPlan software for a few days now (it comes with a 30-day free trial of the unlimited subscription) and so far, I'm quite content with it. It's been backing up my homedir to a local USB disk and into the cloud automatically, and I don't need to check up on it all the time.

CrashPlan runs on Java, which doesn't usually make me particularly enthusiastic. However, the software seems to run fast and reliably so far, so I'm not complaining. Regarding the software itself, it does seem to me that it's not intended for micromanaging. For example, when my external USB disk is not mounted, the interface shows "Destination unavailable". When I then power on and mount the external disk, it takes some time for CrashPlan to find out about this, and in the meanwhile, there's no button in the interface to convince CrashPlan to recheck the disk. Also, I can add a list of filenames/path patterns to ignore, but there's not really any way to test these regexes.

Having said that, the software seems to do its job nicely if you just let it work in the background. One piece of micromanagement which I do like is that you can manually pause and resume the backups. If you pause the backups, they'll be automatically resumed after 24 hours, which is useful if the backups are somehow bothering you, without the risk that you forget to turn them back on.

Backing up only when docked

Of course, sending away backups is nice when I am at home and have 50Mbit fiber available, but when I'm on the road, running on some wifi or even 3G connection, I really don't want to load my connection with the sending of backup data.

Of course I can manually pause the backups, but I don't want to be doing that every time I pick up my laptop and get moving. Since I'm using a docking station, it makes sense to simply pause backups whenever I undock and resume them when I dock again.

The obvious way to implement this would be to simply stop the CrashPlan daemon when undocking, but when I do that, the CrashPlanDesktop GUI becomes unresponsive (and does not recover when the daemon is started again).

So, I had a look at the "admin console", which offers "command line" commands, such as pause and resume. However, this command line seems to be available only inside the GUI, which is a bit hard to script (also note that not all of the commands seem to work for me; sleep and help seem to be unknown commands, which cause the console to close without an error message, just like when I type something random).

It seems that these console commands are really just sent verbatim to the CrashPlan daemon. Googling around a bit more, I found a small script for CrashPlan PRO (the business version of their software), which allows sending commands to the daemon through a shell script. I made some modifications to this script to make it useful for me:

  • don't depend on the current working dir, hardcode /usr/local/crashplan in the script instead
  • fixed a bashism (== vs =)
  • removed -XstartOnFirstThread argument from java (MacOS only?)
  • don't store the commands to send in a separate $command variable, but instead pass "$@" to java directly. The latter prevents bash from splitting arguments with spaces in them into multiple arguments, which would cause the command "pause 9999" to be interpreted as two commands instead of one command with an argument.

I have this script under /usr/local/bin/CrashPlanCommand:

#!/bin/sh
BASE_DIR=/usr/local/crashplan

if [ "x$@" == "x" ] ; then
  echo "Usage: $0 <command> [<command>...]"
  exit
fi

hostPort=localhost:4243
echo "Connecting to $hostPort"

echo "Executing $@"

CP=.
for f in "$BASE_DIR"/lib/*.jar; do
    CP=${CP}:$f
done

java -classpath $CP com.backup42.service.ui.client.ConsoleApp $hostPort "$@"

Now I can run CrashPlanCommand 'pause 9999' and CrashPlanCommand resume to pause and resume the backups (9999 is the number of minutes to pause, which is about a week, since I might be undocked for more than 24 hours, which is the default pause time).

To make this run automatically on undock, I created a simple udev rules file as /etc/udev/rules.d/10-local-crashplan-dock.rules:

ACTION=="change", ATTR{docked}=="0", ATTR{type}=="dock_station", RUN+="/usr/local/bin/CrashPlanCommand 'pause 9999'"
ACTION=="change", ATTR{docked}=="1", ATTR{type}=="dock_station", RUN+="/usr/local/bin/CrashPlanCommand resume"

And voilà! Automatic pausing and resuming on undocking/docking of my laptop!

 
1 comment -:- permalink -:- 17:05
Getting Screen and X (and dbus and ssh-agent and ...) to play well

When you use Screen together with Xorg, you'll recognize this: You log in to an X session, start screen and use the terminals within screen to start programs every now and then. Everything works fine so far. Then, you logout and log in again (or X crashes, or whatever). You happily re-attach the still running screen, which allows you to continue whatever you were doing.

But now, whenever you want to start a GUI program, things get wonky. You'll get errors about not being able to find configuration data, connect to gconf or DBUS, or your programs will not start at all, with the ever-informative error message "No protocol specified". You'll also notice that your ssh-agent and gpg-agent stop working within the screen session...

What is happening here is that all those programs use "environment variables" to communicate. In particular, when you log in, various daemons get started (like the DBUS daemon and your ssh-agent). To allow other programs to connect to these daemons, they put their contact info in an environment variable in the login process. Whenever a process starts another process, these environment variables get transferred from the parent process to the child process. Since these environment variables are set in the X session startup process, which starts everything else, all programs should have access to them.

However, you'll notice that, after logging in a second time, the screen you re-attach to was not started by the current X session. That means its environment variables still point to the old (no longer running) daemons from the previous X session. This includes any shells already running in the screen as well as new shells started within the screen (since the latter inherit the environment variables from the screen process itself).

To fix this, we would like to somehow update the environment of all processes that are already running when we log in, so they pick up the addresses of the new daemons. Unfortunately, we can't change the environment of other processes (unless we resort to scary stuff like using gdb or poking around in /dev/mem...). So, we'll have to convince those shells to update their own environments.

So, this solution has two parts: First, after login, saving the relevant variables from the environment into a file. Then, we'll need to get our shell to load those variables.

The first part is fairly easy: Just run a script after login that writes out the values to a file. I have a script called ~/bin/save-env to do exactly that. It looks like this (full version here):

#!/bin/sh

# Save a bunch of environment variables. This script should be run just
# after login. The saved variables can then be sourced by every bash
# shell, so long running shells (e.g., in screen) or incoming SSH shells
# can also use these services.

# Save the DBUS sessions address on each login
if [ -n "$DBUS_SESSION_BUS_ADDRESS" ]; then
echo export DBUS_SESSION_BUS_ADDRESS="$DBUS_SESSION_BUS_ADDRESS" > ~/.env.d/dbus
fi

if [ -n "$SSH_AUTH_SOCK" ]; then
echo export SSH_AGENT_PID="$SSH_AGENT_PID" > ~/.env.d/ssh
echo export SSH_AUTH_SOCK="$SSH_AUTH_SOCK" >> ~/.env.d/ssh
fi

# Save other variables here

This script fills the directory ~/.env.d with files containing environment variables, separated by application. I could probably have thrown them all into a single file, but it seemed like a good idea to separate them. Anyway, these files are created in such a way that they can be sourced by a running shell to pick up the new values.

If you download and install this script, don't forget to make it executable and create the ~/.env.d directory. You'll need to make sure it gets run as late as possible after login. I'm running a (stripped down) Gnome session, so I used gnome-session-properties to add it to my list of startup applications. You might call this script from your .xsession, KDE's startup program list, or whatever.

For the second part, we need to set our saved variables in all of our shells. This sounds easy: just run for f in ~/.env.d/*; do source "$f"; done in every shell (don't be tempted to do source ~/.env.d/*, since that sources just the first file, with the other files as arguments!). But of course we don't want to do this manually; we want every shell to do it automatically.

For this, we'll use a tool completely unintended for, but suitable enough for, this job: $PROMPT_COMMAND. Whenever Bash is about to display a prompt, it evals whatever is in the variable $PROMPT_COMMAND. So it ends up evaluating that command all the time, which makes it a perfect place to load the saved variables. By setting the $PROMPT_COMMAND variable in your ~/.bashrc file, it will become enabled in every shell you start (except for login shells, so you might want to source ~/.bashrc from your ~/.bash_profile):

# Source some variables at every prompt. This is to make stuff like
# ssh agent, dbus, etc. working in long-running shells (e.g., inside
# screen).
PROMPT_COMMAND='for f in ~/.env.d/*; do source "$f"; done'

You might need to be careful where you place this line, in case PROMPT_COMMAND already has some other value, as it does by default on Debian for example. Here's my full .bashrc file; note the += and the leading ; in the second assignment of $PROMPT_COMMAND.
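
For completeness, appending instead of overwriting looks like this, assuming PROMPT_COMMAND already holds a command (as it does by default on Debian; with an empty PROMPT_COMMAND the leading ; would be a syntax error):

# Append our loader to the existing PROMPT_COMMAND; the leading ';'
# separates it from whatever command was already in there
PROMPT_COMMAND+='; for f in ~/.env.d/*; do source "$f"; done'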

The astute reader will have noticed that this only works for existing shells once a prompt is displayed, meaning you might need to press enter at an existing prompt (to force a new one) after logging in the second time to get the values loaded. But that's a small enough burden, right?

So, with these two components, you'll be able to optimally use your long-running screen sessions, even when your X sessions are not so stable ;-)

Additionally, this stuff also allows you to use your faithful daemons when you SSH into the machine. I use this so I can start GUI programs from another machine (in particular, to open up attachments from my email client, which runs on a server somewhere). See my recent blogpost about setting that up. However, since running a command through SSH non-interactively never shows a prompt and thus never evaluates $PROMPT_COMMAND, you'll need to source the variables directly in your .bashrc as well. I do this at the top of my ~/.bashrc.

Man, I need to learn how to write shorter posts...

 
0 comments -:- permalink -:- 15:26
xauth breaking X11 forwarding over SSH

This morning, I was trying to enable X forwarding, to run applications on my server (where I have GHC available) while displaying them on my local workstation (where I have an X server running). The standard way to do this is to use SSH with the -X option. However, this didn't work for me:

mkooijma@ewi1246:~> ssh -X kat
Last login: Wed May 20 13:48:13 2009 from ewi1246.ewi.utwente.nl
matthijs@katherina:~$ xclock
X11 connection rejected because of wrong authentication.

Running ssh with -vvv showed me another hint:

debug2: X11 connection uses different authentication protocol.

It turned out this problem was caused by some weird entries in my .Xauthority file, which contains tokens to authenticate to X servers. The entries in the file can be queried with the xauth command:

matthijs@katherina:~$ xauth list
#ffff##:  MIT-MAGIC-COOKIE-1  00000000000000000000000000000000
#ffff##:  XDM-AUTHORIZATION-1  00000000000000000000000000000000
localhost/unix:10  MIT-MAGIC-COOKIE-1  00000000000000000000000000000000

(I replaced the actual authentication keys with zeroes here). The last entry is the useful one: it is the proxy key added by ssh when I logged in. That is the one X clients should send over the ssh-forwarded X connection (where ssh will replace it with the actual key; this is called authentication spoofing). However, I found that for some reason X clients were sending the XDM-AUTHORIZATION-1 key instead (hence the "different authentication protocol" message), causing the connection to fail.

I've solved the issue by removing the #ffff## entries from the .Xauthority file (but since I couldn't just run xauth remove #ffff##, I turned it around by re-adding only the one I wanted):

matthijs@katherina:~$ rm ~/.Xauthority
matthijs@katherina:~$ xauth add localhost/unix:10  MIT-MAGIC-COOKIE-1  00000000000000000000000000000000

I'm still not sure what these #ffff## entries do or mean (I suspect xdm has added them, since I am running xdm on this machine), but I've made inquiries on the xorg list.

As a last note: If you want to use X forwarding and enable the GLX protocol extensions for OpenGL rendering, you need to disable security checks in the X forwarding, by running ssh -Y instead of ssh -X.

 
0 comments -:- permalink -:- 16:38
Awesome window manager

Awesome

Recently, I switched window managers again. I still haven't found the window manager that really works for me (having run Ratpoison, Enlightenment, ion3 and wmii in the past); perhaps I should write my own. Anyway, a while back a friend of mine recommended XMonad, a Haskell-based window manager. Even though having a functionally programmed window manager is cool, I couldn't quite find my way around in it (especially its DIY statusbar approach didn't really suit me).

While googling around for half-decent statusbar setups for XMonad, I came across the Awesome window manager. It's again a tiling window manager, as are most of the ones I have been using lately, but it just made a cool impression on me (in particular the clock in the statusbar, which is by default just a counter of type time_t, i.e. the number of seconds since the epoch. Totally unusable, and I replaced it already, but cool).

Awesome allows you to tag windows, possibly with multiple tags, and then show one tag at a time (which is similar to having workspaces, but slightly more powerful). But one other feature which looks promising is that it can show multiple tags at once. So you can quickly peek at your IRC screen without losing focus of your webbrowser, for example. I'm not yet quite used to this setup, so I'm not sure if it is really handy, but it looks promising. I will need to fix my problems with terminal resizes and screen for this to work first, though...

 
0 comments -:- permalink -:- 19:09
USB Suspend breaking USB Mass storage

I've recently been fiddling around with my kernel, configuring most options as modules to enable faster dehibernation (which seems to work). Anyway, afterwards I could no longer mount my USB stick. For some reason, USB Mass Storage support got broken. The USB device was properly recognized, but no SCSI device was generated.

[   62.166775] usb 2-1: new full speed USB device using ohci_hcd and address 2
[   62.455087] scsi0 : SCSI emulation for USB Mass Storage devices
[   62.460703] usb-storage: device found at 2
[   62.460707] usb-storage: waiting for device to settle before scanning
[   65.047735] usb 2-1: reset full speed USB device using ohci_hcd and address 2
[   65.409629] usb 2-1: reset full speed USB device using ohci_hcd and address 2
[   65.769525] usb 2-1: reset full speed USB device using ohci_hcd and address 2
[   66.041867] usb-storage: device scan complete

To cut a long story (and debugging session) short: the problem was a kernel option, "USB Selective suspend/resume and wakeup" (CONFIG_USB_SUSPEND), that I had enabled in the process (it sounded useful). Disabling this option made everything work again. Yay!
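
If you want to check whether your own kernel has this option enabled, something along these lines should work (which of the two files exists depends on your kernel configuration and distribution):

# Check the config of the running kernel
grep CONFIG_USB_SUSPEND /boot/config-$(uname -r)
# or, if the kernel exposes its config via procfs
zgrep CONFIG_USB_SUSPEND /proc/config.gz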

[  629.199144] usb 1-1: new full speed USB device using ohci_hcd and address 4
[  629.786207] Initializing USB Mass Storage driver...
[  629.787051] scsi0 : SCSI emulation for USB Mass Storage devices
[  629.789039] usb-storage: device found at 4
[  629.789043] usb-storage: waiting for device to settle before scanning
[  629.789055] usbcore: registered new driver usb-storage
[  629.789058] USB Mass Storage support registered.
[  632.291508]   Vendor: USB-DISK  Model: FreeDik-FlashUsb  Rev: 1.06
[  632.291519]   Type:   Direct-Access                      ANSI SCSI revision: 00
[  632.302503] SCSI device sda: 32736 512-byte hdwr sectors (17 MB)
[  632.305490] sda: Write Protect is off
[  632.305493] sda: Mode Sense: 0b 00 00 08
[  632.305496] sda: assuming drive cache: write through
[  632.312994] SCSI device sda: 32736 512-byte hdwr sectors (17 MB)
[  632.315988] sda: Write Protect is off
[  632.315992] sda: Mode Sense: 0b 00 00 08
[  632.315994] sda: assuming drive cache: write through
[  632.325982] SCSI device sda: 32736 512-byte hdwr sectors (17 MB)
[  632.328979] sda: Write Protect is off
[  632.328982] sda: Mode Sense: 0b 00 00 08
[  632.328984] sda: assuming drive cache: write through
[  632.329014]  sda: unknown partition table
[  632.334028] sd 0:0:0:0: Attached scsi removable disk sda
[  632.335561] usb-storage: device scan complete

 
0 comments -:- permalink -:- 19:53
Copyright by Matthijs Kooijman