OSX Mavericks Possible Data Corruption Bug

Over the past two weeks there has been much upheaval in my life. Involved with this upheaval has been one of the most unwanted activities any IT professional has to do as part of their professional life, and that is bowing out gracefully. Sometimes IT professionals can actually achieve this state of grace; however, most of the time fear overwhelms grace and trust. The morality I will leave to another blog post to come.

In rescuing data from a computing device a few days ago, I discovered that using a USB external hard drive with a MacBook Pro running OSX Mavericks may have a nasty bug lying in the tall grass. I had about 212GB of data that needed to be moved to another medium, and I elected to use a Western Digital external hard drive over USB 2. This drive had never before shown any signs of failure. However, after I copied the data onto the drive using OSX Mavericks, the HFS filesystem on the drive suffered some mystery damage that I’ve never witnessed before. Thankfully the volume was mountable, so I could rescue the data from the errant drive, copy it to another drive, and effectively save my bacon. When fsck was asked to diagnose the HFS Journaled filesystem on the suspect drive, it reported a failure in the node structure. Now I can’t say for sure that OSX Mavericks caused this failure, but the timing, together with an earlier email from Western Digital stating that there might be drive problems with OSX Mavericks, points to this particular possible bug. The Western Digital warning was only for their drives that use the extended WD software to mount to the Macintosh file system, but I suspect that the bug is deeper than even WD knows, or perhaps Apple.
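For anyone wanting to run a similar check on an external drive, here is a minimal sketch. It is not necessarily the exact command used above; it assumes the drive is mounted and leans on diskutil verifyVolume, which runs a read-only check of the filesystem. The function name verify_volume is made up for illustration, and the sketch assumes diskutil's exit status reflects whether problems were found.

```bash
#!/bin/bash
# Sketch only: read-only check of a mounted volume on OS X.
# verify_volume is a hypothetical helper name, not a system command.

verify_volume() {
    local mount_point="$1"

    # diskutil ships with OS X; bail out politely anywhere else.
    if ! command -v diskutil >/dev/null 2>&1; then
        echo "diskutil not found; this sketch only runs on OS X" >&2
        return 2
    fi

    # verifyVolume checks the filesystem without attempting repairs.
    # Assumption: a non-zero exit status means problems were reported.
    if diskutil verifyVolume "$mount_point"; then
        echo "No problems reported on $mount_point"
    else
        echo "Problems reported on $mount_point; copy the data off before attempting repairs" >&2
        return 1
    fi
}
```

Calling verify_volume with the drive's mount point under /Volumes runs the check. If it reports trouble, the safe order is the one described above: get the data off first, worry about repairs later.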

If you are using WD, or perhaps any other external hard drive or memory-stick technology, with OSX Mavericks, the smart money is on frequent backup and sync to multiple locations. Really smart administrators will back up over the network to some other computing platform with its own independent drive technology. If you are using Macintosh OSX Mavericks, I would say it’s better to be safe than sorry and, for the love of all that is cute and fuzzy, make your backups!
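A backup is only as good as your ability to prove it matches the original, so here is a rough sketch of "copy, then verify". It copies a folder to a destination and compares checksums of every file on both sides. The function names manifest and backup_and_verify are invented for this example. It uses cksum, which is a simple CRC: good enough to catch corruption, not meant as a security check. A file read back right after copying may come from cache rather than the disk, so remounting the drive before verifying is the more paranoid route.

```bash
#!/bin/bash
# Sketch only: copy a directory, then confirm the copy matches the source.
# manifest and backup_and_verify are hypothetical helper names.

# Print "checksum size ./relative/path" for every file under a directory,
# sorted by path so two trees can be compared as plain text.
manifest() {
    (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# Copy the contents of $1 into $2 and report whether the two trees match.
backup_and_verify() {
    local src="$1" dest="$2"

    mkdir -p "$dest" || return 1

    # -R recurses, -p preserves timestamps and modes;
    # the trailing /. copies the contents rather than the folder itself.
    cp -Rp "$src"/. "$dest"/ || return 1

    if [ "$(manifest "$src")" = "$(manifest "$dest")" ]; then
        echo "OK: $dest matches $src"
    else
        echo "MISMATCH: $dest differs from $src" >&2
        return 1
    fi
}
```

Run backup_and_verify once per destination, for example once for a second external drive and once for a network share, so that no single drive or filesystem holds the only copy.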

G-RAID, Time Machine, and Spotlight Headache

A few days ago, of course right before the “Money Back Guarantee” expired on our G-RAID 8TB Time Machine drive at work, both my S3 and I were battling a rather nasty, pernicious bug that was plaguing this device on our fancy new Mac Pro Server running OSX Mountain Lion.

The problem was this: you plug the drive in using Firewire 800, and Time Machine sees it and starts backing up files. That works just fine. After, say, 1TB of files get backed up, Time Machine works gamely for about three or four hours and then the drive suddenly goes deaf. What I mean is that the drive is still connected and the icon is on the Desktop, but you can’t do anything with it. It gives you a fusillade of meaningless errors, vague ones like “Unspecified error with file system” and the like, and Time Machine is stuck and can’t do anything at all with the drive. It’s not really a headache currently because the server is brand-spanking-new, but still, it’s a concern for us. You have to eject the drive, and not a plain eject either, but a Force Eject. When you move it to another computer, plug it in, and run fsck on the drive, everything pans out fine. Everything is hunky-dory, journal is fine, structures are peachy, the works. So annoying.

So off to Google we go. Turns out there MIGHT be a bug in the “Put hard disks to sleep when possible” option in the Energy Saver preference pane in System Preferences. This strikes me as a wee bit of bullshit; the drive should go to sleep and wake up elegantly like anything connected to a Mac should (and almost always does!). So, fine, turn that off. Testing. Ah, failed. So the next stop was to try to irritate the drive with constant actions. To that end I created a script:

#!/bin/bash

# Touch a file on the G-RAID volume once a minute so the drive never sits idle.
while true
do
    touch /Volumes/G-RAID/keepalive
    sleep 60
done

So what this script does is run touch, a Unix command on the Mac that creates the file if it doesn’t exist and otherwise just updates its timestamps. The file’s size is zero, so this is about the most basic file operation you can run on a drive. If you touch a file on a sleeping drive, it should wake it up. If the drive is counting down until it goes to sleep, this operation should reset that counter. Then the entire thing takes a nap for a minute and does it again, over and over forever.

We tried that, and still ended up with a failed Time Machine backup and a drive that’s gone deaf. The exact error you get in Time Machine is “com.apple.backupd: Error: (22) setxattr for key:com.apple.backupd.HostUUID … ” So, still no solution to our problems. We finally figured out what the silver bullet was, and it came from an unexpected source. We added the G-RAID drive to the Privacy pane of Spotlight in the System Preferences on the server and voilà! Magical solution!
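If you would rather not click through System Preferences, there is a command-line route that should get you close. This is a sketch, not what we actually did on the server: it uses mdutil to switch Spotlight indexing off for a single volume, and I am assuming that has much the same effect as adding the drive to the Privacy pane. The function name spotlight_off is invented for the example.

```bash
#!/bin/bash
# Sketch only: turn Spotlight indexing off for one volume with mdutil.
# spotlight_off is a hypothetical helper name.
# Assumption: this has much the same effect as the Privacy pane entry.

spotlight_off() {
    local volume="$1"

    # mdutil ships with OS X; bail out politely anywhere else.
    if ! command -v mdutil >/dev/null 2>&1; then
        echo "mdutil not found; this sketch only runs on OS X" >&2
        return 2
    fi

    # -i off disables indexing for the volume; changing it needs root.
    sudo mdutil -i off "$volume" || return 1

    # -s prints the volume's current indexing status as confirmation.
    mdutil -s "$volume"
}
```

For the drive in this post that would be spotlight_off /Volumes/G-RAID, using the same mount point as the keepalive script.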

Since I made that change about a week ago, the drive has been working happily. My working theory is that mds (which runs the Spotlight service) either locks a file or does something sneaky with this extended attribute on the HostUUID object and that, somehow, ruins access for the entire file system on that drive. It’s not that the file system is damaged, it’s just not working.

So, where’s the bug? Is it in mdworker, mds itself, backupd (Time Machine), Firewire 800, the Firewire 800 cable, or the G-RAID drive? The answer is a definitive YES. Somewhere. Something is causing it, and the only solution seems to be to keep mds’s muddy hands off the drive and pester it every minute with a meaningless file operation via touch.

The upside is the damn thing works, so we’ll keep going with it until it stops working. I wish there was something clearer than this Error 22 from backupd to go on, but alas, this seems to be a valid workaround and frankly I don’t really need Spotlight to go futzing about on the drive anyways. There won’t be any searching done on it anyhow, just the indexing that Time Machine needs and that’s it.

I guess in the end, all’s well that ends well.

It’s no fun being sick

The past two days have been a literal blur for me. I started feeling the chills and off-feeling early on and as time went on, it just grew worse. Eventually I started having severe GI issues, pretty regular headaches, a fever, hot-flashes and chills, profuse sweating – all in all very not fun. I skipped out on pretty much all food the first day; I just couldn’t risk anything. I then realized that I needed some calories to keep going, so I opened a can of Coca Cola, which usually either clears my system or ultimately resolves my problems one way or another. After holding down several bouts of severe nausea I found a place to lie down that felt good. For some reason the big blue couch in my living room, some throw pillows, and a big blue comforter were EXACTLY what I needed. I pounded down at least 4 hours of solid dead-to-the-world sleep and felt far better when I awoke. Once I got over whatever hit me (I figure the likely suspects are either E. coli or cryptosporidium), I eased back into eating. I started with a glass of milk, then moved up to toast. This morning I hazarded cooked oatmeal and that wasn’t a problem.

Tomorrow I will return to work after two days of being sick. I bust on Western for a lot, mostly petty internal bickering and annoyances, but when it comes to being really sick and needing time to recover, it’s worth its weight in gold to have over 480 accumulated hours of Sick Leave available to use. It’s just one possible stressor that is nowhere to be found.

Everything is better now. My stomach feels fine, my system feels fine, and everything is working as nature intended. I don’t know where I caught it, but I know it was bacterial. Something that survived stomach acid and set off an attempted coup in my system. With lots of rest and a really quite on-top-of-things immune system I bounced back handily.

I missed my post-a-day yesterday, but there was just no way; I wasn’t able to type, let alone flop off the couch. I don’t like being sick. It makes me incredibly emotional, unbelievably needy, and a giant mush in pretty much every regard – assuming I have enough energy available for any of it. Scott was wonderful, as were all our friends who came to visit and enjoy Glee Night with us, but my two boys, Owien and Griffin, were very cute. The minute I would flop down they’d leap on wherever I was and curl up and sleep with me, as if to protect me. So unbearably cute.

With luck I can avoid the matrix of things I did two days ago that got me sick. I have some suspicions, and I can do things to avoid running into those things again. I hate having to feel that way. I don’t ever want to do it again.