G-RAID, Time Machine, and Spotlight Headache

A few days ago, of course right before the “Money Back Guarantee” expired on our G-RAID 8TB Time Machine drive at work, my S3 and I were battling a rather nasty, pernicious bug that was plaguing the device on our fancy new Mac Pro Server running OS X Mountain Lion.

The problem was this: you plug the drive in over FireWire 800, Time Machine sees it and starts backing up files. That works just fine. Time Machine works gamely for about three or four hours, and then, after about 1TB of files has been backed up, the drive suddenly goes deaf. What I mean is that the drive is still connected and its icon is on the Desktop, but you can’t do anything with it. It gives you a fusillade of vague, meaningless errors like “Unspecified error with file system”, and Time Machine is stuck and can’t do anything at all with the drive. It’s not really a headache for us yet because the server is brand-spanking-new, but it’s still a concern. You have to eject the drive, and not with a plain eject either, but a Force Eject. When you move it to another computer, plug it in, and run fsck on the drive, everything pans out fine. Everything is hunky-dory: journal is fine, structures are peachy, the works. So annoying.

So off to Google we go. Turns out there MIGHT be a bug in the “Turn off Hard Drives when possible” option in the Energy Saver preference pane in System Preferences. This strikes me as a wee bit of bullshit; the drive should go to sleep and wake up elegantly like anything connected to a Mac should (and almost always does!), but fine, turn that off. Testing. Ah, failed. So the next stop was to try to irritate the drive with constant activity. To that end I created a script:

#!/bin/bash

# Poke the drive once a minute so it never gets a chance to fall asleep.
while true
do
    touch /Volumes/G-RAID/keepalive
    sleep 60
done

What this script does is run touch, a Unix command on the Mac that creates a file if it doesn’t exist (its size is zero) and otherwise just updates its timestamps. It’s about the most basic file operation you can run on a drive. If you touch a file on a sleeping drive, it should wake it up. If the drive is counting down until it goes to sleep, this operation should reset that counter. Then the whole thing takes a nap for a minute and does it again, over and over, forever.
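One wrinkle with the loop above: after a Force Eject, /Volumes/G-RAID no longer exists and touch just complains every minute. Here is a minimal sketch of a slightly more careful version, assuming the same volume path and marker file name. The function name keepalive_once is made up for this example, and the directory test is only a rough stand-in for a real "is it mounted?" check.

```shell
#!/bin/bash

# keepalive_once (hypothetical helper): touch a marker file on a volume,
# but only if the volume's directory is actually there.
keepalive_once() {
    local volume="$1"

    # After an eject the mount point under /Volumes normally disappears,
    # so a missing directory is treated as "not mounted".
    if [ ! -d "$volume" ]; then
        echo "volume $volume is not mounted, skipping" >&2
        return 1
    fi

    # Same basic file operation as the original loop.
    touch "$volume/keepalive"
}

# The loop itself is left commented out so that sourcing this file
# has no side effects:
# while true; do keepalive_once /Volumes/G-RAID; sleep 60; done
```

Uncomment the last line (or paste it into a Terminal window) to get the same once-a-minute poke as before, minus the error spam when the drive is gone.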

We tried that and still ended up with a failed Time Machine backup and a drive that had gone deaf. The exact error you get from Time Machine is “com.apple.backupd: Error: (22) setxattr for key:com.apple.backupd.HostUUID … ” So, still no solution to our problem. We finally figured out what the silver bullet was, and it came from an unexpected source. We added the G-RAID drive to the Privacy pane of Spotlight in System Preferences on the server and voilà! Magical solution!
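We made the change in the Privacy pane, but if you'd rather poke at it from Terminal, OS X ships a command called mdutil that reports and changes Spotlight indexing per volume. The sketch below is just that, a sketch: the helper name spotlight_status is made up, and I'm assuming (not claiming) that turning indexing off with mdutil has the same effect as the Privacy pane did for us.

```shell
#!/bin/bash

# spotlight_status (hypothetical helper): print the Spotlight indexing
# status for a volume. mdutil only exists on OS X, so bail out politely
# anywhere else.
spotlight_status() {
    local volume="$1"

    if ! command -v mdutil >/dev/null 2>&1; then
        echo "mdutil not available on this system" >&2
        return 1
    fi

    # -s asks mdutil for the indexing status of the given volume.
    mdutil -s "$volume"
}

# Usage on the server (not run here):
#   spotlight_status /Volumes/G-RAID
#   sudo mdutil -i off /Volumes/G-RAID   # turn indexing off for that volume
```

Checking the status before and after the change is a cheap way to confirm Spotlight really has let go of the drive.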

Since I made that change about a week ago, the drive has been working happily. My working theory is that mds (the daemon behind the Spotlight service) either locks a file or does something sneaky with the extended attribute on the HostUUID object and that, somehow, ruins access to the entire file system on that drive. It’s not that the file system is damaged; it’s just not working.

So, where’s the bug? Is it in mdworker, mds itself, backupd (Time Machine), FireWire 800, the FireWire 800 cable, or the G-RAID drive? The answer is a definitive YES. Somewhere. Something is causing it, and the only solution seems to be to keep mds’s muddy hands off the drive and pester it every minute with a meaningless file operation via touch.

The upside is the damn thing works, so we’ll keep going with it until it stops working. I wish there was something clearer than this Error 22 from backupd to go on, but alas, this seems to be a valid workaround, and frankly I don’t really need Spotlight futzing about on the drive anyway. There won’t be any searching done on it, just the indexing that Time Machine needs, and that’s it.

I guess in the end, all’s well that ends well.