Tom Mango

How I Keep Limited Run Running

A little less than three years ago, I wrote a post about how I kept our web services running while being the only developer and operations person in our company. Since then, the amount of traffic we handle has increased more than 3000%, our revenue has grown over 5000%, and this topic is more important than ever.

And yes, I’m still our only developer and operations person.

Background

I’m Tom, and I’m half of Limited Run, Card Included, and a few other things. I also organize most of the Rails Rumble, but that’s a seasonal thing.

For our consumer-facing services, I’m the only developer and the only operations person. Having an actual work-life balance while in this role has taken a serious amount of time to develop, but that’s for another post. This post is about how I manage to keep our services, specifically Limited Run, up and running. While some services’ availability is less critical than others, Limited Run is a self-serve store platform, and we have customers all over the world. These store owners must be able to work on their stores 24 hours a day, 7 days a week, 365 days a year. Additionally, every second that one of their customers can’t visit their store or make a purchase is money they’re losing. It doesn’t matter to our customers, or our customers’ customers, that I’m out to dinner, or at a baseball game, or asleep at 3 am. If something happens, it needs to be fixed as quickly as possible.

In companies with a larger staff than ours and similar availability requirements, on-call schedules help to ease the situation. Further, operations people spread out across the world help to keep everyone sane. We’re not in that situation, so we’ve built an enormous number of automated systems, checks, and fail-safes over the years to keep everything operating as smoothly as possible. However, simply having more employees available to resolve issues is no excuse for not doing everything possible to fortify your infrastructure.

There are three important areas I focus on to keep things running smoothly: monitoring, automation, and on-the-go preparedness. Here’s how I approach each of them.

Monitoring

Without thorough monitoring, you absolutely cannot expect to be able to keep your service operating. I use and seriously endorse the following services to help with monitoring:

In addition to paid services, we’ve found it necessary to build a number of custom services to help keep an eye on various things. Here are just a few services that get run via cronjobs:
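To give a flavor of what one of these cron-driven checks looks like, here’s a minimal sketch in Python. The endpoint, thresholds, and alerting hook are placeholders rather than our real configuration; the point is just that each check is a small script cron runs on a schedule, and anything it prints becomes an alert.

```python
#!/usr/bin/env python3
# Minimal sketch of a cron-driven health check. The URL and thresholds are
# hypothetical. A crontab entry like the following would run it every minute:
#   * * * * * /usr/local/bin/check_storefront.py
# Anything printed to stderr gets emailed by cron; a paging service could be
# triggered inside alert() instead.
import sys
import time
import urllib.request

CHECK_URL = "https://shop.example.com/health"   # placeholder endpoint
TIMEOUT_SECONDS = 10
SLOW_THRESHOLD_SECONDS = 3.0


def alert(message):
    # For this sketch, writing to stderr and exiting non-zero is the "alert".
    print(message, file=sys.stderr)
    sys.exit(1)


def main():
    started = time.time()
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=TIMEOUT_SECONDS) as response:
            status = response.status
    except Exception as exc:
        alert("storefront check failed: %s" % exc)
    elapsed = time.time() - started
    if status != 200:
        alert("storefront returned HTTP %d" % status)
    if elapsed > SLOW_THRESHOLD_SECONDS:
        alert("storefront responded in %.1fs (threshold %.1fs)" % (elapsed, SLOW_THRESHOLD_SECONDS))


if __name__ == "__main__":
    main()
```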

One other important thing we do is use god (there are more modern alternatives, but it’s simple and has worked well for us for many years) to keep various services running on each of our servers: HAProxy, Nginx, Apache, and our background workers. We also have a cronjob that starts god every so often in case it dies, which does happen. Just like subscribing to push notifications for PagerDuty’s and Amazon’s status pages, layering your monitoring is important. Things may fail, but multiple failures at once are much less likely.
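As a rough illustration of that “watch the watcher” idea, here’s a sketch of what the cronjob that restarts god could look like. The paths and command-line flags are placeholders, not our actual setup:

```python
#!/usr/bin/env python3
# Sketch of a "watch the watcher" script run from cron, e.g.:
#   */5 * * * * /usr/local/bin/ensure_god.py
# It restarts god if the process recorded in its pid file is gone.
# Paths and flags are illustrative only.
import os
import subprocess

PID_FILE = "/var/run/god.pid"   # hypothetical pid file location
GOD_COMMAND = ["god", "-c", "/etc/god/master.god", "-P", PID_FILE, "-l", "/var/log/god.log"]


def god_is_running():
    try:
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)   # signal 0 only checks that the process exists
        return True
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    if not god_is_running():
        # Any output here gets emailed by cron, which doubles as a lightweight alert.
        print("god was not running; starting it")
        subprocess.check_call(GOD_COMMAND)
```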

Architecture & Automation

Invest in your architecture. Not necessarily with money directly, but with time. The more of your architecture you can automate and depend on, the better you’ll sleep at night and the more quickly you’ll be able to recover if something goes wrong. It seems like an obvious point, but I think a lot of people don’t take this seriously enough at first. If monitoring gets you half the way there, self-configuring server instances and auto scaling will get you the rest of the way. Being a solo developer is not an excuse to skip investing time in automation. In fact, being a solo developer is even more of a reason to invest the time.

I’m a big believer in Amazon Web Services. I was never really great at server administration, and AWS has helped me wrap my head around a lot of concepts and given me a great place to experiment with configuring and running various systems. If you’re the only developer of your web service and don’t want to, or can’t, pay for a fully managed service like Heroku, you’re eventually going to have to learn to configure and fix things yourself. AWS is a great place to learn and practice because you can spin up as many fresh, clean installs of a server instance as you need to get things right, and it won’t cost more than a few dollars.
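For example, spinning up and tearing down a throwaway instance to practice on takes only a few lines. This sketch uses Python and boto3 (not necessarily the tools you’d pick); the AMI, key pair, and region are placeholders:

```python
#!/usr/bin/env python3
# Sketch: launch a small throwaway EC2 instance to experiment on, then
# terminate it when you're done. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-00000000000000000",   # placeholder: any base Linux AMI
    InstanceType="t3.micro",
    KeyName="my-experiment-key",       # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("launched", instance_id)

# ...SSH in, try a configuration, break things, learn...

# Throw the instance away when you're done experimenting.
ec2.terminate_instances(InstanceIds=[instance_id])
```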

Going further than just using AWS directly for testing, there have been a ton of recent advances in automated server configuration. I personally use Ansible for our entire infrastructure, but there are other alternatives like Chef, Puppet, and Amazon OpsWorks. I cannot stress enough how important Vagrant and automation software like Ansible are to building out your infrastructure. Pick one, learn it, and use it.

I use Ansible to develop and configure our base machines, but I then turn each into a self-configuring AMI. When an instance launches, it pulls down the latest configuration files for the role it’s going to play, checks out any code that needs to be run, and connects itself to the rest of the infrastructure. When you combine self-configuring EC2 instances with AWS’ auto scaling and load balancing (or with your own custom load balancers, like we do), you’ve already gone a long way toward ensuring your service stays up and running. When most people hear “auto scaling”, they think about getting a rush of traffic and automatically firing up additional server instances to keep up. Although that’s a major benefit of auto scaling, another benefit is simply keeping your service online. If an instance dies for some reason, a new instance can automatically start up and take its place. Additionally, having all of your instances behind a load balancer (even just a single instance) means that application servers can come and go without affecting connectivity to your service.
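To make the “self-configuring” part concrete, here’s a heavily simplified sketch of the kind of boot script such an AMI might run at launch. The bucket name, paths, and the idea of passing the role via user data are placeholder assumptions, not our exact layout:

```python
#!/usr/bin/env python3
# Illustrative boot script baked into a self-configuring AMI and run at launch
# (e.g. from an init script). Bucket, paths, and metadata format are placeholders.
import os
import subprocess
import urllib.request

import boto3

USER_DATA_URL = "http://169.254.169.254/latest/user-data"   # classic instance metadata endpoint
CONFIG_BUCKET = "example-infrastructure-config"             # hypothetical S3 bucket


def instance_role():
    # Assume the role name ("app", "worker", ...) is passed in as user data.
    with urllib.request.urlopen(USER_DATA_URL, timeout=5) as response:
        return response.read().decode().strip()


def main():
    role = instance_role()

    # 1. Pull down the latest configuration files for this role.
    s3 = boto3.client("s3")
    os.makedirs("/etc/myapp", exist_ok=True)
    s3.download_file(CONFIG_BUCKET, "roles/%s.tar.gz" % role, "/tmp/role.tar.gz")
    subprocess.check_call(["tar", "-xzf", "/tmp/role.tar.gz", "-C", "/etc/myapp"])

    # 2. Check out the latest code (the AMI already has a clone in place).
    subprocess.check_call(["git", "-C", "/srv/app", "pull", "--ff-only"])

    # 3. Run the role's setup script, which starts services and registers the
    #    instance with the rest of the infrastructure (load balancer, god, etc.).
    subprocess.check_call(["/etc/myapp/%s/configure.sh" % role])


if __name__ == "__main__":
    main()
```

The important property is that nothing on the instance is precious: the same AMI plus a role name is enough to rebuild it from scratch.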

To take the idea of auto scaling further, you have to design your systems to be loosely coupled. For most web services, if you remove data from the equation (into Amazon RDS, for example), you’re simply left with processing. For Limited Run, we use a combination of RDS, S3, background queues, and EC2 instances that can be turned on and off at any time without affecting anything. When you can think of the EC2 instances running your web application as disposable processing servers, auto scaling is no longer just about adjusting for traffic; it’s about keeping your service online, no matter what happens.
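The background workers are the easiest place to see this. Here’s a sketch of a disposable worker loop, shown with SQS purely as an example queue; the queue URL and the work itself are placeholders. Because the jobs and their state live in the queue and in RDS/S3 rather than on the instance, the instance running this loop can be terminated at any moment and a replacement picks up where it left off:

```python
#!/usr/bin/env python3
# Sketch of a disposable queue worker (SQS shown as an example queue).
# The queue URL and process() body are placeholders.
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-jobs"  # placeholder

sqs = boto3.client("sqs", region_name="us-east-1")


def process(body):
    # Real work goes here (render an order, send an email, ...).
    print("processing:", body)


while True:
    # Long-poll for work. If this instance dies mid-job, the message becomes
    # visible again after its visibility timeout and another worker retries it.
    result = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                 WaitTimeSeconds=20)
    for message in result.get("Messages", []):
        process(message["Body"])
        # Delete only after successful processing so failed jobs are retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```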

One final piece of the infrastructure puzzle is redundancy. Since we’re on AWS, we leverage RDS’s Multi-AZ support, which places a standby replica of our main database in another availability zone. Additionally, we run EC2 instances across multiple availability zones, and have Route 53 configured with failovers to backup load balancers in various availability zones. Whatever your specific setup is, the main point is that taking out a single server, or every server in a single availability zone, shouldn’t cause you any problems. Okay, there may be a brief minute or so of disruption when things fail over, but you aren’t left scrambling to fix things. As much as possible, things should be fixing themselves.
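For reference, here’s roughly what a Route 53 failover pair looks like when set up programmatically, sketched with Python and boto3. The hosted zone ID, IPs, and health check ID are placeholders; the real records obviously point at our actual load balancers:

```python
#!/usr/bin/env python3
# Sketch of a Route 53 failover pair: a PRIMARY record tied to a health check
# and a SECONDARY record pointing at a backup load balancer in another AZ.
# Zone ID, IPs, and health check ID are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",   # placeholder hosted zone
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "shop.example.com.",
                "Type": "A",
                "SetIdentifier": "primary-lb",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],   # primary load balancer
                "HealthCheckId": "00000000-0000-0000-0000-000000000000",
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "shop.example.com.",
                "Type": "A",
                "SetIdentifier": "backup-lb",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.20"}],   # backup LB in another AZ
            },
        },
    ]},
)
print("failover records updated")
```

With records like these in place, Route 53 stops answering with the primary address as soon as its health check fails, which is exactly the kind of self-healing failover described above.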

On-The-Go Preparedness

Over the years, as our monitoring, automation, and infrastructure have improved, I’ve become much less dependent on being prepared to deal with an issue while away from home (where my main office is). Additionally, while I may run web services and work on software all day, I’m actually not that into technology when I leave the house. For me, as I get closer and closer to 30, part of that work-life balance I mentioned earlier means being able to step away from technology, or at least not have it in my face all the time. Having said that, I don’t go anywhere without, at the very least, my phone. But if I’m going out for the entire day, I’ll generally throw a bag in my car that also has an iPad and MacBook Air in it, just in case. Also worth noting: my iPhone is on AT&T and my iPad is on Verizon. Cell coverage isn’t amazing where I live, so being able to tether on either network is nice.

On my iPhone and iPad, I have access to the following apps:

In addition to these, I also have apps for the AWS Console, PagerDuty, New Relic, and Pingdom, but rarely use them. If I have my iPad with me and need access to the AWS Console, I end up just using the website, which works great.

Again, this is all just a final safety net in case automated systems don’t resolve something that arises while I’m out. The goal is to not need to rely on these things, but if something does come up, it can be resolved from my phone. Having an iPad or a laptop with me can sometimes speed things up in certain circumstances, but generally isn’t necessary.

Conclusion

Werner Vogels, Amazon’s CTO, famously said: “everything fails, all the time”. You need to design for failure, and you need to be ready to react when something goes wrong. For me, as a single developer, that means thorough monitoring, being able to rely on an automated architecture, and always being connected.