- Chances are good that you’ll make a mistake that affects a lot of people.
- Be honest and transparent.
- Mistakes don’t become problems until you try to cover them up.
- Spin is tempting, but it will only make things worse.
- Focus on taking care of customers first and foremost.
- If a mistake/problem is ongoing, keep customers in the loop. Radio silence only multiplies the cost of mistakes.
Launching your own application may be the first time that you’ve ever been responsible for your company’s mission-critical systems. It may also be your first time on the front line interacting with customers. There’s a good chance that something bad will eventually happen, and it won’t be easy to tell your customers. But I hope that with a little guidance, you can be ready for it.
We’re only human, and we all make mistakes. Hopefully yours will be small and won’t significantly affect your customers, but there’s a good chance that at some point one of them will. Whether it’s something as fleeting as downtime or as serious as losing customer data, something will happen. Since you can bet that something is eventually going to go wrong, it’s best to prepare a few things ahead of time. This is one of the few areas where I believe premature optimization is not only acceptable but necessary.
First and foremost, you’ll need some sort of status website. It should be ready and available at status.yourdomain.com, and it should be hosted on a server that’s completely separate from your application–because if your main server goes down, you don’t want to lose your status page too. You’ll also want to make sure that people in the know can post to your Twitter account or other social media to help keep your customers in the loop.
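One way to keep a status page independent of your application is to make it a plain static file that you generate locally and upload to a separate host. The following is only a minimal sketch of that idea, not a recommendation of any particular tool: the `Update` class, the `render_status_page` function, and the "All systems operational." wording are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html import escape


@dataclass
class Update:
    """One status update (hypothetical structure for this sketch)."""
    posted_at: datetime
    message: str


def render_status_page(app_name: str, updates: list[Update]) -> str:
    """Return a complete static HTML page with the newest update first."""
    if updates:
        newest_first = sorted(updates, key=lambda u: u.posted_at, reverse=True)
        items = "\n".join(
            f"<li><time>{u.posted_at:%Y-%m-%d %H:%M} UTC</time> "
            f"{escape(u.message)}</li>"
            for u in newest_first
        )
        body = f"<ul>\n{items}\n</ul>"
    else:
        # Assumed default wording when there is nothing to report.
        body = "<p>All systems operational.</p>"
    return (
        "<!doctype html>\n"
        f"<title>{escape(app_name)} status</title>\n"
        f"<h1>{escape(app_name)} status</h1>\n"
        f"{body}\n"
    )


if __name__ == "__main__":
    updates = [
        Update(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
               "We're investigating elevated error rates."),
    ]
    # Write the page to disk; uploading it to the separate host is up to you.
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(render_status_page("Example App", updates))
```

Because the output is a single HTML file with no dependency on your application or its database, it can live on any inexpensive static host and stay up even when your main server is down.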
Don’t leave your customers in the dark while you’re fixing the problem. The moment something significant goes wrong, let your customers know–you don’t have to have answers right away, but let them know that you’re working on it. You can do this through Twitter or your status site, or you could even make an announcement within your application (if it’s still online). Huge corporations may be able to get by with poor communication, but you’ll be much better off if you can keep your customers in the loop while you’re fixing the problem.
Don’t overreact: even the worst mistakes rarely lead to the worst-case scenarios that you may imagine at the time. In all likelihood, you’ll have a handful of angry customers along with a batch of frustrated ones, but the large majority will be understanding as long as you handle the situation honestly and effectively.
After you’ve taken a deep breath, fix it. Once you’ve let your customers know about the problem, get to work and try not to worry about anything else. You’ll probably have a handful of customers reach out to you during this time because they may not have seen your announcements. If you’re on your own, don’t worry about replying right away unless you have time. You’ll be able to reply after everything’s fixed–and with far more useful information. If you have the time, you might send them a quick reply pointing them to your status updates while you work on the problem.
Once the storm has passed, be honest, clear, and precise–no matter what. There’s no benefit to even the slightest sugarcoating at this point. If you gloss over what happened, that’ll only make things worse. I’ve read quite a few postmortems, and the only ones that ever go well are those that are down to earth and honest. Explain exactly what went wrong and where you made your mistakes. Include any relevant technical details that you think your customers would want.
Don’t forget to clearly outline the steps that you’re taking to prevent this kind of problem in the future. It’s okay if you don’t have a perfect answer right off the bat, but you should have a plan that you can share with your customers within twenty-four to forty-eight hours. Communication makes all the difference.
Anecdote: My Big Mistake with Sifter
At one point, we had begun talking about upgrading our infrastructure, and we were exploring the idea of moving hosts and improving our backups while continuing to work on some improvements to the application. Hoping that it might buy us some time and lessen the urgency of the looming infrastructure upgrades, I decided that I’d just upgrade the size of our virtual machine.
I chose poorly.
When I reviewed our performance the next morning, I noticed that the upgraded virtual machine hadn’t made much of a difference, and so we decided to revert to the original virtual machine. All I was thinking about at the time was the downtime that we were facing–a quick revert would just need a reboot, and we’d be down for less than a minute. Leaving the upgraded virtual machine in place for now and resizing downward later would mean twenty to thirty minutes of downtime. I chose the former since it would minimize downtime. But within seconds of making that decision, I realized it was a bad idea.
When we reverted to the old copy of our virtual machine, we overwrote all the customer data that had been created on the new virtual machine overnight–it was about eleven hours’ worth. We managed to recover about three hours’ worth from our backups, but the remaining eight hours of lost data coincided with Europe’s peak business hours, so some of our customers were seriously affected. I had never felt a sinking feeling like that before. My initial overreaction was that our customers would leave in droves, and Sifter wouldn’t survive. Fortunately, that was far from the case.
We immediately went into recovery mode, and we remained transparent as we talked with our customers about our mistake, the consequences, and our plans. We issued a month’s credit to every affected customer (and lost a fair amount of revenue as a result). It wasn’t the lost revenue that got to me–I was disappointed with myself. We have thousands of people who trust us with their data, and I had let them down. But over the next couple of days, I learned just how wonderful and understanding our customers could be. To the best of my knowledge, no one canceled as a direct result of the data loss, and most were very supportive.
Since then, we’ve dramatically improved our infrastructure and backups. Losing some of our customers’ data was a painful lesson, but it’s a lesson we won’t forget.