This is a horror story, but don’t worry… I just lost a lot of time… this time.
What happened?
For context, I’ve been messing around with self-hosting. I have a Proxmox host, and inside it I’m running TrueNAS as a VM with the disks passed through to it.
Before all that, I spent a lot of time running Badblocks and SMART tests, both of which check the health of the disks and are especially helpful in finding problems in a new disk, and then I started dumping data from my laptop and external drives.
I watched a lot about TrueNAS and ways to speed it up with caches, RAID levels (for data protection even if a disk or two were to stop working), and other important things.
Then, some days after dumping everything, something bugged me about the disk layouts and which /dev/sdX they were using… since most were just virtualized disks I was passing in to speed up the NAS operations… should be easy, right? Just remove them and then add them again…
But wait! Can’t remove? Why?
You see…
RTFM
RTFM or Read the fucking manual. I was click-happy and just created and added disks willy-nilly. Hell! I even added “mirrored” drives because of “redundancy”.
The problem?
- All those were virtualized drives coming from a single SSD.
- Some of those drives became an integral part of the NAS Pool.
What is a Pool?
If the part above didn’t scare you… it’s because you don’t know what a pool is.
You gather multiple disks in a pool and they work as a unit. You don’t have multiple disks anymore, you have one Pool that works as such.
With RAID levels you distribute data between multiple disks so that you can lose disks without losing data (there are trade-offs, but you probably want some redundancy).
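To put numbers on that trade-off, here is a minimal sketch of the arithmetic. It assumes equal-size disks and ignores filesystem overhead; the function name and the six 4 TB disks are made up for illustration, they are not my actual layout.

```python
def usable_capacity_tb(disk_count: int, disk_size_tb: float, parity_disks: int) -> float:
    """Rough usable space for a parity-based layout built from equal-size disks."""
    if parity_disks >= disk_count:
        raise ValueError("need more disks than parity disks")
    return (disk_count - parity_disks) * disk_size_tb


# Hypothetical example: six 4 TB disks under different parity levels.
layouts = [("no redundancy", 0), ("RAID 5 (1 parity disk)", 1), ("RAID 6 (2 parity disks)", 2)]
for name, parity in layouts:
    usable = usable_capacity_tb(6, 4.0, parity)
    print(f"{name}: {usable} TB usable, survives {parity} disk failure(s)")
```

More parity means less usable space but more disks you can lose before the data goes with them.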
The horror time
Some of the “extra” disks you can add also became part of the pool.
They were the dedup disks (which hold the deduplication tables, to speed up deduplication) and the metadata disks (which hold metadata and small files).
Each of those, when lost, WILL brick your data. So if you are going to assign disks to them, you really want to assign several to each, so there is redundancy there too.
Meanwhile, what I did was give them slices of one single SSD.
I set up RAID 6, giving up two disks’ worth of space for redundancy, only to add a single point of failure that would absolutely destroy all my data if that one SSD died. Isn’t this scary?
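If you want to see why this is so bad, here is a toy model of the situation. It is a simplified sketch, not how TrueNAS or ZFS is actually implemented: the `Vdev` class, the `pool_survives` function, and the disk names are all hypothetical, and I am assuming six data disks just to have a number. The one rule it encodes is the one that bit me: the pool only survives if every group of disks in it survives.

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    name: str
    disks: list[str]         # physical devices backing this group
    tolerated_failures: int  # how many member disks may fail


def pool_survives(vdevs: list[Vdev], failed: set[str]) -> bool:
    """The pool survives only if every one of its vdevs survives."""
    for vdev in vdevs:
        lost = sum(1 for disk in vdev.disks if disk in failed)
        if lost > vdev.tolerated_failures:
            return False
    return True


# Hypothetical layout: RAID 6 data disks, plus dedup and a "mirrored"
# metadata group that are all virtual disks carved from the same SSD.
pool = [
    Vdev("data (RAID 6)", ["hdd1", "hdd2", "hdd3", "hdd4", "hdd5", "hdd6"], 2),
    Vdev("dedup", ["ssd1"], 0),
    Vdev("metadata 'mirror'", ["ssd1", "ssd1"], 1),
]

print(pool_survives(pool, {"hdd1", "hdd2"}))  # two data disks die -> True
print(pool_survives(pool, {"ssd1"}))          # the single SSD dies -> False
```

Note that the "mirror" does nothing here: both halves live on the same physical SSD, so one failure takes out both.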
Test your backup!
I’ve heard this more than a couple of times, you probably have too, and I’ll say it here again: TEST YOUR BACKUP!
One part is: if you do lose your data, you don’t want to be stuck trying to figure out how to retrieve it.
The other part is: can you even retrieve that data?
I removed some of those virtualized drives and bricked the pool. Lesson learned: I still had the original data and was able to dump everything again.
I also know now that I can turn off a couple of drives and still have my data. (Also, after RTFMing, I’m only using actual cache drives for reads and writes, which won’t affect the pool.)
How are your backups?
Are they available? Can you actually restore them in a pinch?
Better to take a look when you don’t need them than to need them and find yourself without any.
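One way to "take a look" is to restore a backup somewhere temporary and compare it against the originals. The sketch below is just one possible approach, assuming both the source and the restored copy are plain directories you can read; `sha256_of`, `verify_restore`, and the paths in the comment are names I made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return a list of files that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"differs: {rel}")
    return problems


# Hypothetical usage, with made-up paths:
# problems = verify_restore(Path("/home/me/photos"), Path("/tmp/restore-test/photos"))
# print(problems or "restore looks good")
```

An empty list means every file in the source was found in the restore with identical contents. It does not tell you how long a full restore would take, so it is still worth timing one.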