Set up redundant backup systems. Test those systems regularly. Ideally, that would be the end of this chapter, but it’s not enough. It’s not uncommon for companies, even well-established companies, to drop the ball with backups. In some ways, it’s understandable. It’s the type of problem that humans simply aren’t good at dealing with. As a result, the process never gets the attention it deserves.
I personally made this mistake with Sifter in the early days. Because I only had nightly backups in place, I lost about eight hours of data. Even though we were able to successfully restore from the backups, those backups were incomplete. The bookmarking service Ma.gnolia also suffered backup failures that ultimately led to its demise. More recently, GitLab experienced multiple failures with its backup process that led to lost data. There are plenty of stories like these, but the key takeaway is that this isn’t a theoretical problem. Make sure that your application isn’t another casualty of faulty backups.
First and foremost, set up automated backups. However you do it, ensure that it runs automatically and doesn’t rely on anyone manually doing anything. These must be fully automatic. Period. You also want to design this process such that if it fails, it fails loudly. Emails. Text messages. Phone calls. From a process standpoint, backup failures should be just as critical as actual downtime.
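As a sketch of the fail-loudly principle, a cron-driven wrapper might look like the following. Here `run_backup` and `send_alert` are hypothetical stand-ins; wire them to your real dump command and whatever alerting service actually wakes someone up:

```shell
#!/bin/sh
# Sketch of a cron-driven backup wrapper that fails loudly.
# run_backup and send_alert are hypothetical stand-ins: wire them to
# your real dump command and your paging/email/SMS service.

run_backup() {
  # e.g., pg_dump mydb | gzip > "/backups/mydb-$(date -u +%Y%m%dT%H%M).sql.gz"
  # Simulated here so the sketch runs anywhere:
  [ "${SIMULATE_FAILURE:-0}" -eq 0 ]
}

send_alert() {
  # e.g., POST to your incident tool; treat this like real downtime.
  echo "ALERT: backup failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

if run_backup; then
  echo "backup ok"
else
  send_alert
  exit 1
fi
```

The important design choice is that the success path is quiet and the failure path is noisy and non-zero, so cron, systemd, or any scheduler can escalate it.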
Beyond having backups, there are two more key points to remember: backup frequency and backup location. With frequency, you’ll likely want to maintain some combination of hourly, daily, weekly, and monthly backups. You’ll also want a replica database.
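To make the frequency tiers concrete, here’s a minimal retention sketch. The directory layout, filename scheme, and retention counts are my own placeholders; it assumes snapshots land in per-tier directories with filenames that sort chronologically:

```shell
#!/bin/sh
# Sketch of tiered retention: keep only the newest N snapshots per tier.
# Directory layout and retention counts are assumptions; tune both.
set -eu

prune() {
  dir="$1"; keep="$2"
  # Assumes filenames sort chronologically, e.g. db-2024-05-01.sql.gz.
  ls -1 "$dir" | sort | head -n "-$keep" | while read -r f; do
    rm -- "$dir/$f"
  done
}

# Demo against a throwaway directory so the sketch runs anywhere.
# In production you'd call prune on each tier, e.g.:
#   prune /backups/hourly 24; prune /backups/daily 7
#   prune /backups/weekly 4;  prune /backups/monthly 12
demo=$(mktemp -d)
for day in 01 02 03 04 05; do touch "$demo/db-2024-05-$day.sql.gz"; done
prune "$demo" 3
ls -1 "$demo"    # only the three newest snapshots remain
```

Note that `head -n "-$keep"` relies on GNU coreutils; on BSD systems you’d use `tail -r` or similar to get the same effect.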
Think of these as your lines of defense. If something goes wrong with your database server, your replica database is your first line of defense. It’s the most current copy, and it’s already loaded and connected to your production environment. However, because the replica simply mirrors your production database, there are whole categories of problems, such as an errant DELETE statement, that will be faithfully propagated to the replica. In those situations, you’ll have to go to your regular backups.
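For reference, standing up a streaming replica can be a one-time clone. This is a PostgreSQL-flavored sketch (the hostname, user, and data directory are placeholders, and other databases have their own equivalents):

```shell
# On the new replica host (PostgreSQL 12+), clone the primary and
# start the server in standby mode. primary.internal, replicator, and
# the data directory are placeholders for your own setup.
pg_basebackup -h primary.internal -U replicator \
  -D /var/lib/postgresql/data \
  --wal-method=stream --write-recovery-conf
# --write-recovery-conf creates standby.signal and sets primary_conninfo,
# so once started, the replica continuously streams changes from the primary.
```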
Once your replica has the same problem and you’ve had to turn to your regular backups, there’s a good chance that the most recent of those backups contain the corrupted data as well. If you noticed the problem and caught it within the hour, you may be all right. If it’s been a day or two, however, you may have to go further back than you’d like. In that case, you will have lost some data. Just remember: losing some data is a lot better than losing all of it.
That covers backup frequency at a high level, but we still need to talk about location. This is a bit simpler. Never put all of your backups in one place. The short version is that disasters happen. Whether natural disasters or business disasters, it’s entirely possible for your primary data center to go completely offline and be unreachable. If your backups are there, you’re out of luck. If, however, your backups are off site at another location, you can rebuild.
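In practice, the off-site rule can be as simple as copying every snapshot to a second, independent destination as part of the backup job. A sketch, with a local directory standing in for the remote side; in production the second copy would go to something like `aws s3 cp` or `rsync` against another region or, better, another provider:

```shell
#!/bin/sh
# Sketch: never let the only copies of a snapshot live in one place.
# The "offsite" directory is a stand-in for a remote destination; swap
# the second cp for a copy to another provider or region.
set -eu

snapshot=$(mktemp)                 # stand-in for a real dump file
echo "fake database dump" > "$snapshot"

onsite=$(mktemp -d)                # local backup storage
offsite=$(mktemp -d)               # stand-in for the remote location

cp "$snapshot" "$onsite/db.sql"
cp "$snapshot" "$offsite/db.sql"   # the copy that survives a data-center loss

cmp -s "$onsite/db.sql" "$offsite/db.sql" && echo "stored in two locations"
```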
Creating backups isn’t enough. You have to diversify in both frequency and location to ensure you have full coverage in the event of an emergency.
The other critical aspect of those backup snapshots is encryption. Whenever you create snapshots of your database, and especially when you save those snapshots outside of your production environment, you’ll want to ensure that they’re encrypted. This will create an extra layer of security, but it also means you’ll need to securely store the secret to decrypt those backups if you ever need them.
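A minimal encrypt-and-verify round trip with `openssl` might look like this. The passphrase handling here is deliberately simplistic and purely illustrative; a real setup would pull the key from a secrets manager or use an asymmetric key pair, stored somewhere you can still reach when production is down:

```shell
#!/bin/sh
# Sketch: encrypt a snapshot before it leaves the production environment.
# Uses a symmetric passphrase from the environment for illustration only;
# real key management (secrets manager, asymmetric keys) is up to you.
set -eu

snapshot=$(mktemp)                  # stand-in for a real dump file
echo "fake database dump" > "$snapshot"

export BACKUP_PASSPHRASE="${BACKUP_PASSPHRASE:-change-me}"

# Encrypt...
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -pass env:BACKUP_PASSPHRASE -in "$snapshot" -out "$snapshot.enc"

# ...and prove you can decrypt, because an unrestorable backup is useless.
openssl enc -d -aes-256-cbc -pbkdf2 \
  -pass env:BACKUP_PASSPHRASE -in "$snapshot.enc" -out "$snapshot.dec"

cmp -s "$snapshot" "$snapshot.dec" && echo "encrypt/decrypt round trip ok"
```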
In addition to automatically creating the backups, you also need to automatically test them and ensure that they’re working. There are few things more insidious than backups that are silently incorrect. If the process runs without errors but the data is wrong, it’s a tragedy waiting to happen. The best safeguard is to automatically restore the backups somewhere and verify that the data is changing.
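One way to sketch that check: restore the latest snapshot into a disposable database and compare a simple measure, such as a row count, against the previous run. `restore_snapshot` and `row_count` below are hypothetical stand-ins for your real restore command and a query against the restored copy:

```shell
#!/bin/sh
# Sketch of automated backup verification. restore_snapshot and
# row_count are hypothetical stand-ins for your real restore command
# and a query against the restored copy.
set -eu

restore_snapshot() {
  # e.g., createdb verify && gunzip -c latest.sql.gz | psql verify
  :
}

row_count() {
  # e.g., psql verify -Atc 'select count(*) from events'
  # Simulated here so the sketch runs anywhere:
  echo "${SIMULATED_COUNT:-100}"
}

verify_backup() {
  restore_snapshot
  count=$(row_count)
  last=$(cat "$STATE_FILE")
  if [ "$count" -le "$last" ]; then
    echo "ALERT: restored data has not grown since the last check"
    return 1
  fi
  echo "$count" > "$STATE_FILE"
  echo "backup verified: $count rows (was $last)"
}

STATE_FILE=$(mktemp)
echo 0 > "$STATE_FILE"
verify_backup
```

A row count is a crude measure, and a static table would trip it, so in practice you’d pick a table or timestamp you know should grow between runs. The point is that the check fails loudly when the restored data stops changing.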
The final step to setting up a solid backup process is documenting it. If it’s not well-documented, it’s not complete. If the time comes that you need to fall back on your backups, you’re going to have more important things to focus on than remembering how your backup and restore processes work. You need to be confident that detailed instructions are available so that you don’t have to waste time figuring out how to get back on track.
My parting advice would be to publish a sanitized version of your backup and restore process on a publicly available security page on your web site. If you’re not proud enough of the effort you’ve invested in the process to make it public, you’re not doing enough.
Automatically test your database backups. Marco Arment details his process for automatically testing database backups and making sure he pays attention to whether or not they’re still working.