How I Keep Limited Run Running
A little less than three years ago, I wrote a post that talked about how I kept our web services running, while being the only developer and operations person in our company. Since then, the amount of traffic we handle has increased more than 3000%, our revenue has grown over 5000%, and this topic is more important than ever.
And yes, I’m still our only developer and operations person.
Background
I’m Tom, and I’m half of Limited Run, Card Included, and a few other things. I also organize most of the Rails Rumble, but that’s a seasonal thing.
For our consumer-facing services, I’m the only developer and the only operations person. Having an actual work-life balance while in this role has taken a serious amount of time to develop, but that’s for another post. This post is about how I manage to keep our services, specifically Limited Run, up and running. While some services’ availability is less critical than others, Limited Run is a self-serve store platform. We have customers all over the world. These store owners must have the ability to work on their store 24 hours a day, 7 days a week, 365 days a year. Additionally, every second that one of their customers cannot visit their store or make a purchase is money they’re losing. It doesn’t matter to our customers, or our customers’ customers, that I’m out to dinner, or at a baseball game, or asleep at 3 am. If something should happen, it needs to be fixed as quickly as possible.
In companies with a larger staff than ours and similar availability requirements, on-call schedules help to ease the situation. Further, operations people spread out across the world help to keep everyone sane. We’re not in that situation, so we’ve built an enormous number of automated systems, checks, and fail-safes over the years to ensure we can keep everything operating as smoothly as possible. However, simply having more employees available to resolve issues is no excuse for not doing everything possible to fortify your infrastructure.
There are three important areas I focus on to keep things running smoothly: monitoring, automation, and on-the-go preparedness. Here’s how I approach each of them.
Monitoring
Without thorough monitoring, you absolutely cannot expect to be able to keep your service operating. I use and seriously endorse the following services to help with monitoring:
- Honeybadger - If your application isn’t configured to notify you of errors, just stop reading right now and go set something up. There is just no excuse for that. We used to rely on email notifications directly from our application for this, and that served us well for a while. That is, until we ran into an issue with some DKIM signing logic that prevented email notifications - even those for errors - from being delivered. So, we went with Honeybadger to handle notifying us of errors, and haven’t looked back.
- Clicky - This is a fantastic, realtime analytics service. Not only does it offer everything you’d expect from a Google Analytics type service, but statistics come in as they occur. Clicky will tell you approximately how many people are currently on your site and, with Spy, will even let you see every action as it happens. If you aren’t using a realtime analytics service like this, you’re flying blind. The benefit of knowing exactly what’s going on, at any given time, is immeasurable.
- Pingdom - On its own, Pingdom is extremely useful. Get an email, SMS, or push notification if your site goes down. An uptime monitor like Pingdom is essential if you want to know when there is a service disruption.
- Amazon CloudWatch - We are heavy users of and believers in Amazon Web Services. Their CloudWatch service is essential for not only automation (see below), but monitoring as well. We’ve configured various triggers, mostly related to anomalies, that will send us email notifications that can be acted upon in different ways.
- PagerDuty - Email notifications are well and good, but at some point, they may become just noise. It’s important to separate the most important notifications, such as those from Pingdom, and bring them to your attention no matter where you are or what time of day it is. That’s where PagerDuty comes in. Additionally, I’m a heavy sleeper, so Pingdom downtime emails are not enough for me in the middle of the night. With just Pingdom, you may still find yourself worrying. However, if you couple Pingdom with PagerDuty, you will never worry again. PagerDuty will call your cellphone within seconds of receiving an email from an uptime monitor, like Pingdom. Knowing that you aren’t missing downtime notifications is extremely comforting. I recommend using this sparingly, for only the most important things, like downtime notifications. Also, as an iOS user who loves scheduled Do Not Disturb so my phone doesn’t constantly buzz from emails all night, I’ve found that adding PagerDuty’s alert phone number to my favorites, and only allowing phone calls from those contacts while in Do Not Disturb, can’t be beat.
- Twitter - A lesser known feature of Twitter is their SMS and Push Notification delivery of individual tweets (called “Mobile Notifications”). I subscribe to @PagerDutyOps and @Cloud_Status’s Mobile Notifications so that I know when PagerDuty and AWS are having their own problems. We rely on these services to ensure our services stay up, but sometimes shit hits the fan for them too. Stay informed.
- New Relic - New Relic is like analytics for your application, rather than your traffic. Together with CloudWatch, I use New Relic to track down performance regressions and determine where the bottlenecks are. The faster your application is, especially in the places that are hit hardest under heavy load, the easier time you’ll have scaling and handling traffic surges. New Relic was critical for us during a major refactoring of our entire cart and checkout process late last year, both of which became orders of magnitude less resource intensive.
- Sprint.ly - While not strictly related to keeping our services running, we rely heavily on Sprint.ly for organizing and planning our day-to-day development efforts.
In addition to paid services, we’ve found it necessary to build a number of custom services to help keep an eye on various things. Here are just a few services that get run via cronjobs:
- Clickylert - This is a little script that polls Clicky for the number of people currently on our sites and detects significant, upward changes. When a large spike is detected, Clickylert emails me how many people are currently on the site along with how the traffic changed over the past hour. Additionally, Clickylert can be configured to send an email to PagerDuty, which would call my phone. If Pingdom is how I get notified when something very bad happens, Clickylert is what notifies me when something out of the ordinary is happening, which may or may not lead to something bad. Over the years, our scale has grown so much, and our infrastructure has improved even more, that Clickylert is no longer configured to call me. We have massive stores running on Limited Run sending waves and waves of traffic at all hours of the day and night. It simply became easier to handle huge influxes of traffic out of nowhere than to worry about spikes in traffic.
- EC2 Instance Watcher - A few years ago, we had an issue that resulted in runaway EC2 instance replacements. This was a costly mistake, as you pay for a minimum of one hour for each instance, even if that instance was only booted for a couple of minutes. This script will deliver notifications, first by email, then to PagerDuty if things get really out of hand, whenever our instance count (running + terminated) goes above the range we normally expect it to be in. Nowadays, you can configure EC2 auto scaling groups to automatically email you on every event (which I highly suggest), so this script isn’t quite as useful as it once was.
- Backlog Job Watcher - A while back, we had an issue that prevented our background jobs from being properly run. Everything from email notifications to processing lossless audio albums halted for a few hours. This script checks our backlog of jobs every so often and ensures it doesn’t get too long. If it does, it contacts me.
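All three of these watchers follow the same pattern: measure a number on a schedule, compare it against what is normal, and escalate from an email to a phone call as it gets worse. Here is a minimal Python sketch of that pattern, not the actual scripts. The names `Thresholds`, `classify`, and `spike_ratio`, and every number used with them, are assumptions for illustration. Fetching the measurements (from Clicky, EC2, or the job queue) and delivering the alerts are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Alert(Enum):
    NONE = "none"
    EMAIL = "email"   # worth a look, but not urgent
    PAGE = "page"     # forward to the paging service, which calls a phone


@dataclass(frozen=True)
class Thresholds:
    email_at: float   # hypothetical level at which an email is sent
    page_at: float    # hypothetical level at which the alert escalates


def classify(value: float, limits: Thresholds) -> Alert:
    """Map a measurement (instance count, queue length, spike ratio) to an alert level."""
    if value >= limits.page_at:
        return Alert.PAGE
    if value >= limits.email_at:
        return Alert.EMAIL
    return Alert.NONE


def spike_ratio(visitors_now: int, recent_counts: Sequence[int]) -> float:
    """Current visitors divided by the recent average; 1.0 means no change."""
    if not recent_counts:
        return 1.0
    baseline = sum(recent_counts) / len(recent_counts)
    if baseline == 0:
        # Any traffic after a silent period counts as an unbounded spike.
        return float("inf") if visitors_now > 0 else 1.0
    return visitors_now / baseline
```

Run from cron, a watcher built this way would call `classify` with its latest measurement and either send an email or forward to PagerDuty depending on the result.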
One other important thing we do is use god (there are more modern alternatives, but this is simple and has worked well for us for many years) to ensure various services continue running on each of our servers. We use god to ensure HAProxy, Nginx, Apache, and our background workers continue running. We also have a cronjob that starts god every so often in case it dies, which does happen. Just like subscribing to push notifications about PagerDuty’s and Amazon’s status, layering your monitoring is important. Things may fail, but multiple failures at once are much less likely.
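The "watch the watcher" cronjob can be very small. The Python sketch below shows the idea under a few assumptions: the supervisor appears in the process table under the exact name passed to `pgrep -x` (it may not, depending on how it is launched), and `ensure_running` and `process_is_running` are hypothetical helper names, not part of god.

```python
import subprocess
from typing import Callable


def process_is_running(name: str) -> bool:
    """True if a process with exactly this name exists (relies on pgrep)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    return result.returncode == 0


def ensure_running(is_running: Callable[[], bool], start: Callable[[], None]) -> bool:
    """Start the supervisor if it is down. Returns True if a start was attempted."""
    if is_running():
        return False
    start()
    return True
```

Passing the check and the start action in as functions keeps the logic independent of how the supervisor is actually launched on a given server.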
Architecture & Automation
Invest in your architecture. Not necessarily with money directly, but with time. The more of your architecture you can automate and depend on, the better you’ll sleep at night and the more quickly you’ll be able to recover if something goes wrong. It seems like an obvious point, but I think a lot of people don’t take this seriously enough at first. If monitoring will get you halfway, self-configuring server instances and auto scaling will get you the rest of the way. Being a solo developer is not an excuse to not invest time in automation. In fact, being a solo developer is even more of a reason to invest the time.
I’m a big believer in Amazon Web Services. As someone who was never really great at server administration, AWS has really helped me wrap my head around a lot of concepts and has given me a great place to experiment with configuring and running various systems. If you’re the only developer of your web service and don’t want to or can’t pay for a fully managed service like Heroku, you’re going to eventually have to learn to configure and fix things. AWS is a great place to learn and practice because you can spin up as many fresh, clean installs of a server instance as you need to get things right and it won’t cost more than a few dollars.
Going further than just using AWS directly for testing, there have been a ton of recent advancements in automated server configuration. I personally use Ansible for our entire infrastructure, but there are other alternatives like Chef, Puppet, and Amazon OpsWorks. I cannot stress enough how important Vagrant and automation software like Ansible are to building out your infrastructure. Pick one, learn it, and use it.
I use Ansible to develop and configure our base machines, but I then turn each into a self-configuring AMI. When an instance launches, it pulls down the latest configuration files for the role it’s going to play, checks out any code that needs to be run, and connects itself to the rest of the infrastructure. When you combine self-configuring EC2 instances with AWS’ auto scaling and load balancing (or with your own custom load balancers, like we do), you’ve already gone a long way in ensuring your service stays up and running. When most people hear “auto scaling”, they think about when you get a rush of traffic and you automatically fire up additional server instances to keep up. Although that’s a main benefit of auto scaling, another benefit is simply keeping your service online. If an instance dies for some reason, a new instance can automatically start up and take its place. Additionally, having all of your instances behind a load balancer (even just a single instance) means that application servers can come and go without affecting connectivity to your service.
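To make the "keeping your service online" side of auto scaling concrete, here is a small Python simulation of the replacement decision. It is a sketch of the idea only, not the AWS API or how AWS implements it. `Instance`, `reconcile`, and the desired count are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    healthy: bool


def reconcile(instances: Sequence[Instance], desired: int) -> Tuple[List[str], int]:
    """Return (ids to terminate, number of new instances to launch).

    Unhealthy instances are replaced rather than repaired, which only works
    because each instance configures itself on boot.
    """
    unhealthy = [i.instance_id for i in instances if not i.healthy]
    healthy_count = len(instances) - len(unhealthy)
    to_launch = max(0, desired - healthy_count)
    return unhealthy, to_launch
```

With two instances desired and one of them failing its health check, `reconcile` asks for that instance to be terminated and for one replacement to be launched, with no traffic spike involved.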
To take the idea of auto scaling further, you have to design your systems to be loosely coupled. For most web services, if you remove data from the equation (into Amazon RDS, for example), you’re simply left with processing. For Limited Run, we use a combination of RDS, S3, background queues, and EC2 instances that can be turned on and off at any time, without affecting anything. When you can think of EC2 instances running your web application as disposable processing servers, auto scaling is no longer just about adjusting for traffic, it’s about keeping your service online, no matter what happens.
One final piece of the infrastructure puzzle is redundancy. As AWS users, we leverage RDS’s Multi-AZ support, which places a standby replica of our main database in another availability zone. Additionally, we run EC2 instances across multiple availability zones, and have Route 53 configured with failovers to backup load balancers in various availability zones. Whatever your specific setup is, the main point is that taking out a single server, or every server in a single availability zone, shouldn’t cause you any problems. Okay, there may be a brief issue, a minute or so, when things fail over, but you aren’t left scrambling to fix things. As much as possible, things should be fixing themselves.
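The "lose a whole availability zone" test is easy to run on paper. The Python sketch below checks a planned layout. The function name, the zone names, and the minimum instance count are assumptions for illustration, and it only counts instances; it says nothing about databases or load balancers.

```python
from typing import Dict, List


def zones_that_break_service(instances_by_zone: Dict[str, int], minimum: int) -> List[str]:
    """List the zones whose loss would leave fewer than `minimum` instances running."""
    total = sum(instances_by_zone.values())
    return sorted(
        zone
        for zone, count in instances_by_zone.items()
        if total - count < minimum
    )
```

An empty result means any single zone can disappear and enough instances remain elsewhere. A non-empty result names the zones that are carrying too much of the load.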
On-The-Go Preparedness
Over the years, as our monitoring, automation, and infrastructure have improved, I’ve become much less dependent on having to be prepared to deal with an issue while away from home (where my main office is). Additionally, while I may run web services and work on software all day, I’m actually not that into technology when I leave the house. For me, as I get closer and closer to 30, part of that work-life balance I mentioned earlier means being able to step away from technology, or at least not have it in my face all the time. Having said that, I don’t go anywhere without, at the very least, my phone. But if I’m going out for the entire day, I will generally throw a bag in my car that also has an iPad and MacBook Air in it, just in case. Also worth noting, my iPhone is on AT&T and my iPad is on Verizon. Cell coverage isn’t amazing where I live, so being able to tether on either network is nice.
On my iPhone and iPad, I have access to the following apps:
- Prompt - A really nice SSH application from Panic that works on iPhones and iPads.
- Screens - A fantastic VNC application. Screens will get me into a machine in my home office, letting me do whatever I need to do from a development environment. This is more of a last resort.
In addition to these, I also have apps for the AWS Console, PagerDuty, New Relic, and Pingdom, but rarely use them. If I have my iPad with me and need access to the AWS Console, I end up just using the website, which works great.
Again, this is all just a final safety net in case automated systems don’t resolve something that arises while I’m out. The goal is to not need to rely on these things, but if something does come up, it can be resolved from my phone. Having an iPad or a laptop with me can sometimes speed things up in certain circumstances, but generally isn’t necessary.
Conclusion
Werner Vogels, Amazon’s CTO, famously said: “everything fails, all the time”. You need to design for failure and you need to be ready to react when something goes wrong. For me, as a single developer, that means thorough monitoring, being able to rely on an automated architecture, and always being connected.