In 1936(!), Dale Carnegie wrote in his book How to Win Friends and Influence People about how important names are to most people. I personally learned this from another great book, Code Complete, which Steve McConnell wrote in 1993. I must have been lucky enough to read it shortly after it was published, and it changed the way I programmed forever. It is one of the most important books I have ever read about programming, and I fervently recommend it to anyone wanting to learn programming or anyone who just wants to refresh the things they should be doing as they program every day.
In this book (and in hundreds of blog posts around the web) you will find exceptional explanations of how fundamental it is to give your variables, your files, your directories, your tiddlers... any information you save, a significant and meaningful NAME. Nothing new there, but I want to share with you an extreme case of bad naming that could have cost us a lot if it weren't for the resilience of the great operating system that Linux is.
This past week, to close the year with something to remember, our team had a surreal experience with a case of terrible naming. Watch carefully how Linux saves us!
Context and setup of the disaster
At some point in time, someone decides to install Redash on one of our servers. They, very correctly, install it in a directory named /opt/redash.
They also add the new directory to our backups following company policies. So we have local and off-server backups.
All is good up to here.
Sometime later, a requirement appears to set up a docker container with a MySQL production database. The person in charge of this task decides to mount the MySQL data directory inside /opt/redash. Yes! You read that correctly. We now have a Redash install AND a MySQL data directory living inside /opt/redash. But wait for it, it gets better.
Following good practice, a script is created to make a daily dump of the database that is inside the container. Good. Where do we save the dump? Well, of course! Inside /opt/redash, where else? Anyone who reads "redash" will clearly understand that you can save unrelated database dumps in there. To further entangle the situation, we save the MySQL database password that the script uses and all the docker configuration files inside /opt/redash as well. Obviously, no documentation of this special case is created.
Now that I am writing this, I think I understand that they did it out of laziness, just to avoid having to add the database dump to the backups. Saving it in /opt/redash, which was already in the backup definition, was an easy win. When the disaster occurred my mind was not calm enough to see this.
Time goes by. The person who set up the situation above is no longer with us. A review of the server contents is done. It turns out that no one is using Redash; it wasn't what they needed. So, what do we do? We decide to eliminate it. First, we shut down the service and wait a month or so to see if anyone complains. Next, we create a directory named deleteme_after_202202 (see what a significant name that is!!) and we move (this is where Linux saved us) /opt/redash inside deleteme_after_202202. We stop making backups of /opt/redash. Our monitoring system does not report anything unusual and everything seems to be working.
Some weeks go by. In an unrelated succession of events, we run out of space on our main backup server, so we decide to clean up. One of the first candidates is the redash directory: nobody has been using it for months anyway and it just contains an old Redash install, right? The same lack of space then hits our other backup server (where we had another backup of that redash directory), so we delete redash from there too. Do you see the trend of how Murphy catches you no matter what you do?
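One cheap check before retiring a directory is to ask whether anything inside it has been written recently. What follows is a minimal Python sketch, not something we actually ran: the function name recently_modified and the 30-day window are assumptions made for illustration. Had something like it been pointed at /opt/redash, the live MySQL files would probably have shown up.

```python
import os
import time
from pathlib import Path


def recently_modified(root, days=30, now=None):
    """Return the files under `root` modified within the last `days` days.

    Hypothetical helper for illustration; `days=30` is an arbitrary default.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    recent = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                recent.append(path)
    return sorted(recent)
```

An empty result does not prove a directory is unused (a process may only read from it), but a non-empty one is a strong hint that something still depends on it.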
Now disaster strikes. The server is restarted. On reboot, our monitoring system starts complaining that we have a site down. A database is missing. You can imagine our confusion. We start investigating and see that the mount point of the docker container has disappeared. The production docker container had restarted and mounted the directory (/opt/redash/mysql_data), but there was no MySQL data directory there any more (we had moved it out some weeks ago), so the database started empty. We start looking for backups but none are to be found. Total despair.
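A database that silently starts on an empty directory is the worst possible failure mode here; a loud refusal to start would have pointed us at the problem immediately. As a rough sketch only, a start-up script could run a guard like the one below before launching the container. The names require_populated and EmptyDataDirError are hypothetical, and nothing like this existed in our setup.

```python
from pathlib import Path


class EmptyDataDirError(RuntimeError):
    """Raised when a data directory that should hold data does not."""


def require_populated(data_dir):
    """Return the directory as a Path, or raise if it is missing or empty.

    Hypothetical guard for illustration: call it before starting a service
    that would otherwise initialise a brand new, empty database.
    """
    path = Path(data_dir)
    if not path.is_dir():
        raise EmptyDataDirError(f"{path} does not exist or is not a directory")
    if not any(path.iterdir()):
        raise EmptyDataDirError(f"{path} is empty; refusing to start")
    return path
```

The point of raising instead of returning False is that a forgotten check of the return value cannot let the service start anyway.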
The save happened with the "move". We were cautious enough to move the /opt/redash directory instead of deleting it outright. A move within the same filesystem is just a rename: the inodes stay the same even though the parent directory is different, so the docker container, which was already running, kept working on the data inside the deleteme_after_202202 directory until the reboot.
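If that sounds like magic, it can be reproduced on a small scale. The sketch below is only an illustration on a temporary directory, assuming Linux (or another POSIX system) and a move that stays on one filesystem; the open file handle stands in for the running database, and the file name table.db is made up.

```python
import os
import tempfile
from pathlib import Path


def demo_rename_keeps_inode():
    """Move a directory while a file inside it is open, then keep writing."""
    with tempfile.TemporaryDirectory() as base:
        base = Path(base)
        original = base / "redash"
        (original / "mysql_data").mkdir(parents=True)
        data_file = original / "mysql_data" / "table.db"
        data_file.write_text("row 1\n")

        inode_before = data_file.stat().st_ino
        handle = open(data_file, "a")  # stands in for the running database

        # Same-filesystem move: only the directory entry changes.
        graveyard = base / "deleteme_after_202202"
        graveyard.mkdir()
        os.rename(original, graveyard / "redash")

        # The already-open handle still points at the same inode.
        handle.write("row 2\n")
        handle.close()

        moved_file = graveyard / "redash" / "mysql_data" / "table.db"
        return {
            "same_inode": moved_file.stat().st_ino == inode_before,
            "old_path_exists": data_file.exists(),
            "content": moved_file.read_text(),
        }


if __name__ == "__main__":
    print(demo_rename_keeps_inode())
```

The old path is gone, yet the write through the open handle lands in the moved file. A process that starts afterwards and looks up the old path finds nothing, which is what the reboot exposed.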
It took us a while to stumble on that. A while I would prefer to forget, but that is part of the work. In the end, we were able to recover everything and get some training for when we won't be so lucky.
Some of the things I took away
follow standards: we don't use docker containers for production databases, and we shouldn't make exceptions
create standards (document things): if exceptions must be made, for whatever reason, create procedures for it, convert that exception into a recognized company standard, write things down, and reach a consensus with the rest of the team. Leave README files where they can be seen for these special cases!
talk with the team (share): reaching a consensus would oblige us to define how backups are made, how those backups are monitored, and probably even a naming convention for the process. If you see something that does not follow the company standards, raise the issue, share your concern with the rest of the affected team, and put it on the endless to-do list
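Monitoring the backups themselves can start very small. As a hedged sketch, a check like the following could run from the monitoring system and complain when the daily dump is missing, stale, or suspiciously small. The function name backup_problems and the thresholds (26 hours, 1 byte) are assumptions for illustration, not what we run.

```python
import time
from pathlib import Path


def backup_problems(dump_path, max_age_hours=26, min_bytes=1, now=None):
    """Return a list of problems with a dump file; an empty list means OK.

    Hypothetical check: thresholds are illustrative defaults, tune them.
    """
    path = Path(dump_path)
    if not path.is_file():
        return [f"{path}: dump file is missing"]
    problems = []
    stat = path.stat()
    now = time.time() if now is None else now
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: last written {age_hours:.1f} hours ago")
    if stat.st_size < min_bytes:
        problems.append(f"{path}: only {stat.st_size} bytes")
    return problems
```

A check like this would have started failing the day we moved the directory, weeks before the reboot, because the dump would no longer appear where the monitoring expected it.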
Please, please stop to think about what you are doing before you give something a name!
Don't be lazy! There are a lot of people counting on your work, even if nobody sees it until it breaks. Even if no one is going to come and say what a good job you are doing protecting their data, do it! Do it to the best of your knowledge.
As John Lennon said:
“When you do something noble and beautiful and no one notices, don't be sad. Sunrise is a beautiful spectacle but without a doubt, the greater part of the audience is still asleep.”
Live and learn.