During the last couple of months I have been without my 38TB NAS as it broke down on me.
We all have equipment failures from time to time, and most of us take relevant actions to prevent these things from happening. My journey to broken NAS hell started with an upgrade of the NAS software. I have been a big fan of QNAP NAS systems and have been using them for quite some time, starting with a 4-bay unit, and I even ran a 19″ rack unit when I was running my own business.
The NAS in question, the TS-831X, is a dual 10 Gigabit NAS, perfect for my 10 Gigabit network where I need speed and reliability, as most of my virtual machines are running on the NAS (with iSCSI). The unit has 8 bays, which fit 7x6TB drives together with an SSD drive for caching. I should have been more careful when I first installed it, as I had several issues with the RAID structure as well as failing drives. At that time my reasoning was purely around the drives themselves and not the NAS, a big mistake.
The updated software offered a way to schedule a “scrub” of your drives. Scrubbing is a procedure where the actual bits on the drive, the RAID and the filesystem(s) are compared to checksums to identify potential issues. I’m not a big fan of turning on new features, but instead of disabling the feature I scheduled it to run the first test within 30 days, which I of course completely forgot about…
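To make the idea of scrubbing a bit more concrete, here is a minimal sketch in Python. It is not how QNAP or Linux MD actually implement it (as far as I know, a RAID scrub verifies parity or mirror consistency rather than per-block hashes). The block size and the function names `build_checksums` and `scrub` are my own assumptions, chosen only to show the principle of reading everything back and comparing it to what was recorded earlier.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this illustration


def checksum(block: bytes) -> str:
    """Return a SHA-256 hex digest for one block of data."""
    return hashlib.sha256(block).hexdigest()


def build_checksums(data: bytes) -> list[str]:
    """Split the data into fixed-size blocks and checksum each one."""
    return [
        checksum(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def scrub(data: bytes, expected: list[str]) -> list[int]:
    """Re-read every block and return the indices whose checksum changed."""
    actual = build_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# A "healthy" volume of four blocks, with checksums recorded at write time.
healthy = bytes(range(256)) * 64
expected = build_checksums(healthy)

# Simulate silent corruption: flip the bits of one byte inside block 1.
damaged = bytearray(healthy)
damaged[5000] ^= 0xFF

bad_blocks = scrub(bytes(damaged), expected)
print("Blocks failing the scrub:", bad_blocks)
```

The important point is that a scrub reads every block, which puts the whole chain (drives, cables, SATA ports) under sustained load. That is presumably why a marginal port tends to show up during a scrub rather than during normal use.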
Essentially what happened was that the scrubbing started and one of the SATA ports failed, resulting in a damaged RAID filesystem.
Repairing the filesystem was not possible from within the QNAP interface, which was strange as the main reason for the failure was corrupted metadata. QNAP support has been surprisingly reluctant to help. I’ve had a ticket open for over 4 weeks and the only promise was that a “developer soon will look into this”. It did not help that their “Remote support” did not work in most cases, requiring assistance from me. At some point the tools on their end that manage SSH keys stopped working and I was not able to run the Remote tool. Of course QNAP was not aware of it and just blamed my NAS. They did come back a week later, informing me that the issue was on their end.
So instead of trying to talk to someone who has no clue about what’s going on, I reached out to other companies, mostly online. I realised that if I was going to get this solved I had to take action myself. So I installed all 7 drives into a PC with 8 motherboard SATA ports. That worked fine for the drives. However, booting an OS from a USB drive didn’t work at all. Windows 10 booted but crashed while loading.
So I grabbed a RAID PCI card (MegaRAID), attached an SSD drive to it, created a virtual volume and installed Windows 10. Now I was ready for the next step.
Actually, before performing any actions I decided to add some cooling. Hard drives can get really warm under heavy load, so I decided to throw an office cooler into the mix, making the whole setup quite weird.
Time to restore
I looked at a number of different software packages for RAID recovery and none of them seemed to do the job well. There were both commercial and open source alternatives. After spending some time on this I found the suite(s) from DiskInternals to meet my needs very well. They support Linux software RAID (MD) systems, EXT4 and VMFS (for my VMware iSCSI setup).
However, the filesystem was empty and a low-level scan took forever (about a week, actually), so I decided to contact their support. Here I must say I’ve never seen support that has been so responsive. They were able to quickly determine (with the help of TeamViewer) that the software RAID (MD) was broken. After enabling “developer mode” in their RAID recovery tool we were able to find that some logical volumes had duplicate sets of PVs:
As can be seen in the picture above, my NAS has grown from 38TB to twice the size (oops). That’s of course not the case, and the software was not able to correctly identify that error. DiskInternals however did not give up that easily. Instead they gave the case to their developers, and after just a few days they had a fix.
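I don’t know how DiskInternals solved this internally, but the symptom itself is easy to illustrate. Below is a small sketch, under my own assumptions, of why duplicate PV entries double the reported size and how deduplicating on the PV identifier gives the real number. The `PhysicalVolume` class, the UUID strings and the sizes are all made up for the example and are not taken from my actual volume group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalVolume:
    uuid: str       # hypothetical identifier, not a real LVM UUID
    size_tb: float  # made-up size in terabytes


def naive_total(pvs: list[PhysicalVolume]) -> float:
    """Sum every PV entry as found, counting duplicates twice."""
    return sum(pv.size_tb for pv in pvs)


def deduplicated_total(pvs: list[PhysicalVolume]) -> float:
    """Sum PV sizes, counting each UUID only once."""
    unique: dict[str, PhysicalVolume] = {}
    for pv in pvs:
        unique.setdefault(pv.uuid, pv)  # keep the first entry per UUID
    return sum(pv.size_tb for pv in unique.values())


# Two example PVs that together make up 38TB.
real_pvs = [PhysicalVolume("pv-a", 19.0), PhysicalVolume("pv-b", 19.0)]

# Damaged metadata: every PV shows up twice in the scan result.
scanned = real_pvs + real_pvs

print("Naive total:", naive_total(scanned), "TB")
print("Deduplicated total:", deduplicated_total(scanned), "TB")
```

A real recovery tool obviously has to do much more than this, such as deciding which of the duplicate metadata copies is the valid one, but the doubled capacity in the screenshot is consistent with each PV simply being counted twice.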
Please remember that most software solutions out there for recovering damaged files/filesystems offer a trial that is able to fully recover the drives but will not allow you to export any data without paying for it. This software even offers a preview of files for a huge number of file formats. I could more or less preview my entire movie collection and even individual VMDK files for VMware, just to be sure that I would be able to export the disks.
Here DiskInternals support was able to help without me paying for professional support. Over a period of about two weeks they delivered a few patches to their software, and I was able to browse the entire file system without having to do a deep scan.
When one is repairing EXT file systems it’s not uncommon that the filesystem metadata is lost. All your files are there, but the file system structure is gone. For example, if you have pictures structured in folders by year and month with relevant filenames and that metadata is lost, you will end up restoring perhaps thousands of files that have lost both their folder structure and their file names.
This was by far my biggest concern. I was fairly confident that the data was not lost (as I did not write to the drives), but losing the metadata would be equally bad, as I have around 2M files and folders and organizing them would be almost impossible.
So reaching a point where I could just “mount” the filesystem on my Windows 10 PC and browse my EXT4 files was a great success! (pardon my bad taste in music)
Why couldn’t I just restore the files from a backup?
That is an excellent question. Most virtual machines get backed up every night to another NAS, so I was able to restore the important virtual machines that were running on the broken NAS. Most of the VMs that are even more important run on an SSD RAID-5 setup on the ESX host, so I didn’t lose any important stuff like firewall, emails etc. during the breakdown.
The same process was planned to take place for regular files, but QNAP and rsync just didn’t work. So these files were not copied over to my backup. Well, in fact they were initially (using a regular CIFS copy), just to make sure I would not lose any files. However, I decided to re-arrange my backup NAS filesystem and therefore lost the initial backup I had. *bummer*
Now the company I originally bought my NAS from has gone bankrupt and I have not yet replaced it. However, most of the important stuff has been moved to a new (home-built) SAN with Fibre Channel. That’s another post.