Tuesday, November 9, 2010

Fix it or Kick It and the ten minute maxim

One of the things I brought up in my presentation to the Atlanta DevOps group was the concept of "Payment". One of the arguments that people like to trot out when you suggest an operational shift is that "We can't afford to change right now". My argument is that you CAN'T afford to change. It's going to cost you more in the long run. The problem is that in many situations, the cost is detached from the original event.

Take testing. Let's assume you don't make unit testing an enforced part of your development cycle. There are tons of reasons people do this but much of it revolves around time. We don't have time to write tests. We don't have time to wait for tests to run. We've heard them all. Sure you get lucky. Maybe things go out the door with no discernible bugs. But what happens 3 weeks down the road when the same bug that you solved 6 weeks ago crops up again? It's hard to measure the cost when it's so far removed from the origination.

Configuration management is the same way. I'm not going to lie. Configuration management is a pain in the ass especially if you didn't make it a core concept from inception. You have to think about your infrastructure a bit. You'll have to duplicate work initially (i.e. templating config files). It's not easy but it pays off in the long run. However as with so many things, the cost is detached from the original purchase.

Fix it?

Walk with me into my imagination. A scary place where a server has started to misbehave. What's your initial thought? What's the first thing you do? You've seen this movie and done this interview:

  • log on to the box
  • perform troubleshooting
  • think
  • perform troubleshooting
  • call vendor support (if it's an option)
  • update trouble ticket system
  • wait
  • troubleshoot
  • run vendor diag tools

What's the cost of all that work? What's the cost of that downtime? Let's be generous. Let's assume this is a physical server and you paid for 24x7x4 hardware support and a big old RHEL subscription. How much time would you spend on each task? What's the turn around time to getting that server back into production?

Let's say that the problem was resolved WITHOUT needing replacement hardware but came in at the four hour mark. That's three hours that the server was costing you money instead of making you money. Assuming a standard SA salary of $75k/year in Georgia, that works out to $150. That's just doing a base salary conversion not calculating all the other overhead associated with staffing an employee. What if that person consulted with someone else during that time, a coworker at the same rate, for two of those hours. $225. Not too bad, right? Still a tangible cost. Maybe one you're willing to eat.

But let's assume the end result was to wipe and reinstall. Let's say it takes another hour to get back to operational status. Woops. Forgot to make that tweek to Apache that we made a few weeks ago. Let's spend an hour troubleshooting that.

But we're just talking man power at this point. This doesn't even take into account end-user productivity, loss of customers from degraded performance or any host of issues. God forbid that someone misses something that causes problems to other parts of the environment (like not setting the clock and inserting invalid timestamps into the database or something. Forget that you shouldn't let your app server handle timestamps). Now there's cleanup. All told your people spent 5 hours to get this server back into production while you've been running in a degraded state. What does that mean when our LOB is financial services and we have an SLA and attached penalties? I'm going to go easy on you and let you off with 10k per hour of degraded performance.

Get ready to credit someone $50k or worse cut a physical check.

Kick it!

Now I'm sure everyone is thinking about things like having enough capacity to maintain your SLA even with the loss of one or two nodes but be honest. How many companies actually let you do that? Companies will cut corners. They roll the dice or worse have a misunderstanding of HA versus capacity planning.

What you should have done from the start was kick the box. By kicking the box, I mean performing the equivalent of a kickstart or jumpstart. You should, at ANY time, be able to reinstall a box with no user interaction (other than the action of kicking it) and return it to service in 10 minutes. I'll give you 15 minutes for good measure and bad cabling. My RHEL/CentOS kickstarts are done in 6 minutes on my home network and most of that time is the physical hardware power cycling. With virtualization you don't even have a discernible bootup time.

Unit testing for servers

I'll go even farther. You should be wiping at least one of your core components every two weeks. Yes. Wiping. It should be a part of your deploy process in fact. You should be absolutely sure that should you ever need to reinstall under duress that you can get that server back into service in an acceptable amount of time. Screw the yearly DR tests. I'm giving you a world where you can perform bi-monthly DR tests as a matter of standard operation. All it takes is a little bit of up front planning.

The 10 minute maxim

I have a general rule. Anything that has to be done in ten minutes can be afforded twenty minutes to think it through. Obviously, it's a general rule. The guy holding the gun might not give you twenty minutes. And twenty minutes isn't a hard number. The point is that nothing is generally so critical that it has to be SOLVED that instant. You can spend a little more time up front to do things right or you can spend a boatload of time on the backside trying to fix it.

Given the above scenario, you would think I'm being hypocritical or throwing out my own rule. I'm not. The above scenario should have never happened. This is a solved problem. You should have spent 20 minutes actually putting the config file you just changed into puppet instead of making undocumented ad-hoc changes. You should have spent an hour when bringing up the environment to stand up a CM tool instead of just installing the servers and doing everything manually. That's the 10 minute maxim. Take a little extra time now or take a lot of time later.

You decide how much you're willing to spend.

No comments: