In October 2019, we lost about 10,000 unique and important files in one night due to a silly mistake. That incident ended up saving our new business in 2025. Here’s the story.
Before we dive in, I should mention that this isn't a deeply technical article; you don't need to know the technical terms to follow the main story or the experience I'm sharing.
Gama (gama.ir) is a K-12 educational content-sharing platform that started in 2014 in Iran. It has more than 10 million users worldwide and offers over seven different services, including Past Papers, Tutorials in different formats, Online Exams, a School Hub, Live Streaming, a highly active Q&A Community, a Tutoring service, and more.
Gamatrain.com is our new product based in the UK. Its focus is on more internationally recognized education systems, such as CIE (Cambridge International Education), AQA, Edexcel, and others.
The Disaster Begins
We were experiencing the first real hype of our business, and as a content monetization company, most of our revenue depended on our content. Since a large portion of our content was user-generated, not just produced by our internal team, ensuring the safety of our files was one of our core principles.
However, our backup strategy was overly simplistic. We kept a copy of each file on a very large, isolated HDD.
Everything was running smoothly, and we had never been forced to rely on our backups except for a few minor incidents caused by human error.
Then one of our engineers suggested GlusterFS, a better-known distributed file system used in many production environments, as a replacement for the MooseFS cluster we were running. We reviewed the costs and benefits, and after much discussion we decided to migrate from MooseFS to GlusterFS. Two months later, the migration was complete.
At first, the team was thrilled with GlusterFS and its features. But soon, we realized that our backup HDD was nearly 90% full, and we had to make a decision. Since the team believed that GlusterFS was more stable and reliable than MooseFS — and given that we had never needed our full backups before — we made a terrible decision: we canceled our backup strategy and relied solely on Gluster’s replication system.
A truly stupid decision!
The Day Everything Went Wrong
About two months later, one day, we started noticing errors on our website — pages displaying missing files. Initially, we assumed it was just a minor network issue. However, after checking our monitoring tools and logs, we found that the Gluster network was reporting sync errors and missing chunks — meaning Gluster couldn’t locate certain parts of files.
Soon, the errors spread to more pages and more files. We decided to restart our Gluster network, still believing it was just a database and network issue that could be fixed with a simple reboot.
But nature always has a lesson to teach you, right? :)
At 3:30 AM, after restarting Gluster and remounting it to our edge servers, the errors disappeared! Thank God! Everything was working again.
“Let’s do a final check and then get some sleep after this long day,” I told my colleague.
Just then, a WhatsApp message came in from our content team:
“The files are empty.”
Wait… what?
“What do you mean by empty?”
“There’s nothing in them!”
I checked a few files myself. They were indeed empty.
Maybe it was just a mounting issue? I ran ls on a few of the empty files. It showed that the files existed and their sizes weren't zero!
Okay, this was no ordinary incident anymore.
Long story short: we had lost about 10,000 of our 50,000 files, a massive blow to our company's revenue and reputation. And the worst part? These weren't just our company's files. Imagine waking up one day to find that 20 of the 100 YouTube videos you earn money from are gone!
But wait, we had an old backup HDD! Surely we could restore some files from there, right?
Nope.
The Backup That Was Useless
After migrating to GlusterFS, we had made improvements to our directory structure, which meant all files had been assigned new hashes and new paths stored in the database.
There were no longer any direct links between our missing files and the backup pool. The backup was useless.
We made multiple attempts to recover the files, but nothing worked. Even when we managed to restore some files, they had different names, leaving us unable to link them back to their original posts.
Finally, we had to email all affected users and ask them to re-upload their files.
But for our engineering team, this incident led to tough decisions. We completely redesigned our file storage system and implemented a solid backup strategy.
The Birth of GFK (Gama File Keeper)
The solution was simple in name but complex in execution.
In our new file keeper system, every uploaded file is mapped in a dedicated table that records all crucial metadata, including its checksum.
Why is this important? Because a checksum is the true identity of a file — as long as the file isn’t modified, its checksum remains the same.
This means that even if a file is deleted and later recovered with a different name, we can still match it with our database records and restore it correctly to its original post or user.
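To make that concrete, here is a minimal sketch of the idea in Go. It assumes SHA-256 as the checksum and a hypothetical `files` table stored in SQLite; the article doesn't describe GFK's actual schema, storage, or checksum algorithm, so treat all names here as illustrative.

```go
package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // illustrative driver; GFK's real metadata store isn't specified
)

// fileChecksum streams a file through SHA-256 so large uploads
// never have to be loaded into memory at once.
func fileChecksum(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	db, err := sql.Open("sqlite3", "file_keeper.db") // hypothetical metadata store
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A recovered file with a meaningless name: its checksum is still its identity.
	sum, err := fileChecksum("/recovered/unknown_1234.bin")
	if err != nil {
		log.Fatal(err)
	}

	// Match the checksum back to the original post and path in the metadata table.
	var originalPath string
	var postID int64
	err = db.QueryRow(
		"SELECT original_path, post_id FROM files WHERE checksum = ?", sum,
	).Scan(&originalPath, &postID)
	switch {
	case err == sql.ErrNoRows:
		fmt.Println("no record found for checksum", sum)
	case err != nil:
		log.Fatal(err)
	default:
		fmt.Printf("file belongs to post %d, original path %s\n", postID, originalPath)
	}
}
```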
Another major improvement: file deletions are now soft deletes. Files aren’t permanently removed for three months. If no restoration requests are made in that period, an automated script removes them safely from the disk.
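And here is a rough sketch of the cleanup job that could sit behind that rule, using the same hypothetical table as above. The three-month window comes from the article; the table, column names, and SQLite date syntax are assumptions.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // same illustrative metadata store as above
)

func main() {
	db, err := sql.Open("sqlite3", "file_keeper.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Select files soft-deleted more than three months ago that were never purged.
	// A restoration request is assumed to clear deleted_at, so restored files never match.
	rows, err := db.Query(`
		SELECT id, stored_path FROM files
		WHERE deleted_at IS NOT NULL
		  AND deleted_at < datetime('now', '-3 months')
		  AND purged_at IS NULL`)
	if err != nil {
		log.Fatal(err)
	}
	type victim struct {
		id   int64
		path string
	}
	var victims []victim
	for rows.Next() {
		var v victim
		if err := rows.Scan(&v.id, &v.path); err != nil {
			log.Fatal(err)
		}
		victims = append(victims, v)
	}
	rows.Close()

	for _, v := range victims {
		// Remove from disk first; tolerate files that are already gone.
		if err := os.Remove(v.path); err != nil && !os.IsNotExist(err) {
			log.Printf("skipping %s: %v", v.path, err)
			continue
		}
		// Record the purge so the row is never picked up again.
		if _, err := db.Exec(
			"UPDATE files SET purged_at = datetime('now') WHERE id = ?", v.id); err != nil {
			log.Fatal(err)
		}
	}
}
```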
After extensive testing across multiple scenarios, the new file keeper worked flawlessly.
But it still wasn’t enough.
The Birth of Gama Backapp
Our biggest lesson? Relying on a single system, technology, or even location is not safe.
We needed something much more robust for backups.
Backapp (short for "Backup Application") was designed around a three-layer backup strategy for files, plus a separate schedule for databases:
- Warm Layer (Every 2 Hours): An incremental sync between the Hot Disk (used by the production application) and a Warm Disk in the same data center. It never propagates deletions; only new or modified files are synced (see the sketch after this list).
- Cold Layer (Every 6 Hours): The Warm Disk syncs with a Cold Disk in another data center, ensuring resilience against incidents local to the primary site.
- Offline Layer (Once a Week & Monthly Checks): A physical hard drive kept outside the data centers, in our office. The drive is locked and identified by its serial number, and it syncs with the Cold Layer once a week. A monthly verification process confirms that everything is intact before old files are permanently removed; a file deletion takes six months to fully propagate across all layers (detailed below).
- Databases (Every Day): We take a full database backup every 24 hours and keep the last 30 daily backups plus one backup per month for the last 12 months.
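Backapp's actual code isn't shown in this article, but since its core is Go and it syncs with rsync, the warm layer could look roughly like the sketch below. The paths, flags, and the endless loop are illustrative; the important detail is that `--delete` is never passed, so removals on the Hot Disk don't propagate on their own.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// syncWarmLayer mirrors new and modified files from the Hot Disk to the
// Warm Disk. Because --delete is NOT passed, files removed from the Hot
// Disk stay on the Warm Disk until the scheduled purge reaches it.
func syncWarmLayer(hotPath, warmPath string) error {
	cmd := exec.Command("rsync",
		"-a",        // archive mode: recurse, keep permissions and timestamps
		"--partial", // resume interrupted transfers instead of restarting them
		hotPath+"/", // trailing slash: copy the contents, not the directory itself
		warmPath,
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("warm sync failed: %v\n%s", err, out)
		return err
	}
	return nil
}

func main() {
	// Illustrative mount points; the real paths depend on the data center setup.
	for {
		if err := syncWarmLayer("/mnt/hot", "/mnt/warm"); err == nil {
			log.Println("warm layer in sync")
		}
		time.Sleep(2 * time.Hour) // the article's warm-layer interval
	}
}
```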
How File Deletion Works in the Backup System
If a file is deleted, it takes six months for it to be permanently removed across all backup layers. The process follows these steps:
- Now: Soft delete from the file keeper (the file is marked as deleted but still recoverable from the file keeper admin).
- 3 months: The file is removed from the Hot Disk (active production storage).
- 4 months: The file is removed from the Warm Backup.
- 5 months: The file is removed from the Cold Backup.
- 6 months: The file is finally removed from the Offline Backup.
This gradual deletion process ensures that we always have enough time to restore accidentally deleted files before they are lost forever.
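Expressed as data, that timeline is just a per-layer retention schedule. Here is a small illustrative sketch; the layer names are mine, only the intervals come from the list above.

```go
package main

import (
	"fmt"
	"time"
)

// monthsUntilPurge encodes the timeline above: how long after a soft delete
// each layer keeps its copy of a file.
var monthsUntilPurge = map[string]int{
	"hot":     3,
	"warm":    4,
	"cold":    5,
	"offline": 6,
}

// purgeDue reports whether a file soft-deleted at deletedAt may now be
// removed from the given layer.
func purgeDue(layer string, deletedAt, now time.Time) bool {
	months, ok := monthsUntilPurge[layer]
	if !ok {
		return false // unknown layer: never purge by accident
	}
	return now.After(deletedAt.AddDate(0, months, 0))
}

func main() {
	deletedAt := time.Date(2025, time.January, 1, 0, 0, 0, 0, time.UTC)
	now := time.Date(2025, time.May, 15, 0, 0, 0, 0, time.UTC)
	for _, layer := range []string{"hot", "warm", "cold", "offline"} {
		fmt.Printf("%-8s purge due: %v\n", layer, purgeDue(layer, deletedAt, now))
	}
}
```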
Technology Stack of Backapp
Backapp is built with a reliable and efficient stack to handle automated backup processes:
- Golang — Used for the core logic and the offline Windows application.
- Bash scripts — Automate backup and restore tasks.
- SQLite — Stores logs and backup metadata.
- Rsync — Efficient, incremental file syncing.
- Percona XtraBackup — Manages database backups efficiently.
These improvements were painful to implement and required precise migration operations, but they were worth it.
Because five years later, this design saved our new company (Gamatrain.com) from a rare disaster — a story I’ll share soon in my next article.
But let’s just say… it wasn’t as easy as I had expected.
Cheers!
We later realized that our mistake might have been due to a misconfiguration in GlusterFS. It could have been an issue like an improper replica count, a failed volume heal, a split-brain scenario, or even a filesystem mounting problem. Whatever it was, it silently corrupted thousands of files, making them appear to exist while being completely empty.