Tuesday, November 9, 2010

Fix it or Kick It and the ten minute maxim

One of the things I brought up in my presentation to the Atlanta DevOps group was the concept of "Payment". One of the arguments that people like to trot out when you suggest an operational shift is that "We can't afford to change right now". My argument is that you CAN'T afford NOT to change. It's going to cost you more in the long run. The problem is that in many situations, the cost is detached from the original event.

Take testing. Let's assume you don't make unit testing an enforced part of your development cycle. There are tons of reasons people do this, but most of them revolve around time. We don't have time to write tests. We don't have time to wait for tests to run. We've heard them all. Sure, sometimes you get lucky. Maybe things go out the door with no discernible bugs. But what happens 3 weeks down the road when the same bug you solved 6 weeks ago crops up again? It's hard to measure the cost when it's so far removed from its origination.

Configuration management is the same way. I'm not going to lie. Configuration management is a pain in the ass, especially if you didn't make it a core concept from inception. You have to think about your infrastructure a bit. You'll have to duplicate some work initially (e.g. templating config files). It's not easy, but it pays off in the long run. However, as with so many things, the cost is detached from the original purchase.

Fix it?

Walk with me into my imagination. A scary place where a server has started to misbehave. What's your initial thought? What's the first thing you do? You've seen this movie and done this interview:

  • log on to the box
  • perform troubleshooting
  • think
  • perform troubleshooting
  • call vendor support (if it's an option)
  • update trouble ticket system
  • wait
  • troubleshoot
  • run vendor diag tools

What's the cost of all that work? What's the cost of that downtime? Let's be generous. Let's assume this is a physical server and you paid for 24x7x4 hardware support and a big old RHEL subscription. How much time would you spend on each task? What's the turnaround time for getting that server back into production?

Let's say that the problem was resolved WITHOUT needing replacement hardware but came in at the four hour mark. That's four hours the server was costing you money instead of making you money. Assuming a standard SA salary of $75k/year in Georgia (roughly $36/hour), that works out to about $150. That's just doing a base salary conversion, not calculating all the other overhead associated with staffing an employee. What if that person consulted with a coworker at the same rate for two of those hours? $225. Not too bad, right? Still a tangible cost. Maybe one you're willing to eat.

But let's assume the end result was to wipe and reinstall. Let's say it takes another hour to get back to operational status. Whoops, forgot to reapply that Apache tweak we made a few weeks ago. Let's spend an hour troubleshooting that.

But we're just talking manpower at this point. This doesn't even take into account end-user productivity, loss of customers from degraded performance, or any of a host of other issues. God forbid someone misses something that causes problems in other parts of the environment (like not setting the clock and inserting invalid timestamps into the database or something. Forget that you shouldn't let your app server handle timestamps). Now there's cleanup. All told, your people spent six hours getting this server back into production while you've been running in a degraded state. What does that mean when our line of business is financial services and we have an SLA with attached penalties? I'm going to go easy on you and let you off with $10k per hour of degraded performance.

Get ready to credit someone $60k or, worse, cut a physical check.

Kick it!

Now I'm sure everyone is thinking about things like having enough capacity to maintain your SLA even with the loss of one or two nodes, but be honest: how many companies actually let you do that? Companies will cut corners. They roll the dice or, worse, confuse HA with capacity planning.

What you should have done from the start was kick the box. By kicking the box, I mean performing the equivalent of a kickstart or jumpstart. You should, at ANY time, be able to reinstall a box with no user interaction (other than the action of kicking it) and return it to service in 10 minutes. I'll give you 15 minutes for good measure and bad cabling. My RHEL/CentOS kickstarts are done in 6 minutes on my home network and most of that time is the physical hardware power cycling. With virtualization you don't even have a discernible bootup time.
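To make "kick it" concrete, here's a minimal sketch of the sort of kickstart file that gets you there. The mirror URL, root password hash, partitioning and %post contents are all placeholders; your real file carries your own choices:

install
url --url http://mirror.example.com/centos/5/os/x86_64
lang en_US.UTF-8
keyboard us
network --bootproto dhcp
rootpw --iscrypted PASTE_YOUR_HASH_HERE
authconfig --enableshadow --enablemd5
firewall --disabled
selinux --disabled
timezone --utc America/New_York
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@core

%post
# set the clock and bootstrap your CM tool here so the box configures itself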

Unit testing for servers

I'll go even farther. You should be wiping at least one of your core components every two weeks. Yes. Wiping. In fact, it should be part of your deploy process, as sketched below. You should be absolutely sure that, should you ever need to reinstall under duress, you can get that server back into service in an acceptable amount of time. Screw the yearly DR tests. I'm giving you a world where you can perform a DR test every two weeks as a matter of standard operation. All it takes is a little up-front planning.
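A sketch of what that deploy step might look like. Everything here except ipmitool is a hypothetical stand-in for whatever your load balancer and provisioning setup actually provide:

#!/bin/bash
# rebuild one node per deploy cycle (sketch - the helper commands are made up)
node=$(pick_rebuild_candidate)                 # hypothetical: node longest without a wipe
remove_from_pool "$node"                       # hypothetical: drain it from the load balancer
flag_for_reinstall "$node"                     # hypothetical: next PXE boot runs the kickstart
ipmitool -H "$node-mgmt" -U admin -P secret chassis power cycle
until health_check "$node"; do sleep 30; done  # hypothetical health check
add_to_pool "$node"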

The 10 minute maxim

I have a general rule: anything that has to be done in ten minutes can be afforded twenty minutes to think it through. Obviously, it's a general rule. The guy holding the gun might not give you twenty minutes. And twenty minutes isn't a hard number. The point is that almost nothing is so critical that it has to be SOLVED that instant. You can spend a little more time up front to do things right, or you can spend a boatload of time on the backside trying to fix it.

Given the above scenario, you might think I'm being hypocritical or throwing out my own rule. I'm not. The above scenario should never have happened. This is a solved problem. You should have spent 20 minutes actually putting the config file you just changed into Puppet instead of making undocumented ad-hoc changes. You should have spent an hour when bringing up the environment standing up a CM tool instead of just installing the servers and doing everything manually. That's the 10 minute maxim: take a little extra time now or take a lot of time later.
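For the Apache example above, the 20-minute version is a few lines of Puppet. This is a sketch; the config file path and module name are made up:

file { '/etc/httpd/conf.d/myapp.conf':
  ensure => present,
  owner  => 'root',
  group  => 'root',
  mode   => '644',
  source => 'puppet:///modules/apache/myapp.conf',
  notify => Service['httpd'],
}

service { 'httpd':
  ensure => running,
  enable => true,
}

Kick the box and the tweak comes back on its own.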

You decide how much you're willing to spend.

Monday, November 8, 2010

Transitions

I haven't had a chance to mention this, but those of you I'm connected with on LinkedIn already know that I'm starting with a new company on Wednesday. I'm taking a few days to get some work done around the house and then diving in. I don't like switching companies in general, but I'm really excited about this opportunity. In addition to having an almost blank slate, I'll be working with a much smaller team and have a chance to contribute back to the community. It's also a chance for me to work in the Atlanta startup scene, something I've been hoping to do for a few years now.

So what about the previous company? They're looking to backfill my position, so please feel free to contact me if you're interested; I can put you in touch with the right people. Fair warning: it's a challenging place to work. They'll tell you the same thing. I've blogged about working at a "traditional" company before right here, so you can go back and glean some information from that.

Tuesday, November 2, 2010

Using Hudson and RVM for Ruby unit testing

As with everything lately, something popped up on Twitter that prompted a blog post. In this case, @wakaleo was looking for any stories/examples for his Hudson book. I casually mentioned I could throw in some notes about how we use Hudson on the Padrino project.

Prerequisites

Here's what you'll need:

  • A working Hudson install
  • RVM
  • The usual toolchain for compiling rubies (gcc, make and the relevant development headers)

I'll leave you to get Hudson working. There are prebuilt packages for every distro under the sun. If you can't get past this step, you'll need to rethink a few things.

Setting up RVM

Once you have it installed, log in as your Hudson user and set up RVM.

RVM Protip - If there are any gems (like, say, Bundler) that you ALWAYS install, edit .rvm/gemsets/default.gems and .rvm/gemsets/global.gems and add them there. In my examples, I did not do that.
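If you do want that, it's one gem name per line in those files (format as I recall it; check the RVM docs for your version):

echo bundler >> ~/.rvm/gemsets/global.gems
echo bundler >> ~/.rvm/gemsets/default.gems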

You'll want to go ahead and install all the VMs you plan on testing against. We use 1.8.7, 1.9.1, 1.9.2, JRuby, RBX and REE:

for i in 1.8.7 1.9.1 1.9.2 jruby ree rbx; do rvm install ${i}; done

This will take a while. When it's done, we can dive into configuring our job in Hudson.

What is the Matrix?

So you've got Hudson running and RVM all set up? Open the Hudson console and create a new job of type "Build multi-configuration project". From the job configuration screen, you'll want to set some basics - repository, SCM polling and the like. The key to RVM comes under "Configuration Matrix".

The way user-defined variables work in Hudson, whether as a build parameter or a matrix configuration, is that you provide a "key" and then a value for that key. That value is accessible to your build steps as a dollar-prefixed variable. So if your key is my_funky_keyname_here, you can reference $my_funky_keyname_here in your build steps to get its value. With a configuration matrix, each permutation of the matrix provides the value for that key in the given permutation. So if I have:

foo as one axis with six values (1, 2, 3, 4, 5, 6) and bar with three values (1, 2, 3)

each combination of foo and bar will be available to my build steps as $foo and $bar. The first run will have $foo as 1 and $bar as 1. The second run will have $foo as 2 and $bar as 1. On and on until the combinations are exhausted.
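A contrived build step makes it concrete:

#!/bin/bash
# Hudson runs this step once per permutation - 18 runs for the axes above
echo "building with foo=$foo and bar=$bar"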

This makes for some REALLY powerful testing matrices. In our case, however, we only need one axis: rubyvm.

Hudson Protip - Don't get creative with your axis or parameter names. In our case, we'll be performing shell script steps, so don't call your axis "HOME" - it will clobber the real $HOME environment variable in your shell and confuse everything. Just don't do it.

So now we've added an axis called 'rubyvm' and provided it with the values '1.8.7 1.9.1 1.9.2 jruby rbx ree'. As explained, this means Hudson will repeat our build steps once for each value of 'rubyvm'.

Configuring your job

Now that you've got your variables in place, you can write the steps for your job. It took me a little time to work out the best flow. There were some things about how RVM interacts with the shell that caught me off guard initially (the rvm command being a shell function rather than an executable). I've broken the test job into three steps:

  • Create my gemset, install bundler and run bundle install/bundle check
  • Run my unit tests
  • Destroy my gemset

In addition to taking advantage of the variable provided by the configuration matrix, we're also going to use one of the variables Hudson exposes for a given job run: $BUILD_NUMBER. Using these two bits of information, we can build a gemset name for RVM that is unique to that run and that ruby VM.

Step 1:

#!/bin/bash -l

# create (or switch to) a gemset unique to this run and this ruby VM
rvm use $rubyvm@padrino-$rubyvm-$BUILD_NUMBER --create

# install bundler into the fresh gemset, then install and verify our dependencies
gem install bundler
bundle install
bundle check

This uses the --create option of RVM to create our gemset. If our build number is 99 and our ruby VM is ree, we're creating a gemset called padrino-ree-99 for ree. Pretty straightforward.

Next we install bundler and then run the basic bundler tasks. All operations are performed in the workspace for your Hudson project, which is typically the root directory of your SCM checkout. If the root of your repo doesn't contain your Gemfile and Rakefile, you'll probably want to make your first line a 'cd' to that directory, as in the sketch below.
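Something like this, where the subdirectory name is made up:

#!/bin/bash -l

cd framework   # wherever your Gemfile and Rakefile actually live
rvm use $rubyvm@padrino-$rubyvm-$BUILD_NUMBER --create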

The reason for the full shebang line is that bash -l gives you a login shell, which sources your profile and makes sure the RVM shell function is loaded properly.

Step 2:

#!/bin/bash -l

# re-select the gemset created in step 1, then run the suite
rvm use $rubyvm@padrino-$rubyvm-$BUILD_NUMBER
rake test

Each build step is a distinct shell session. For that reason we need to "use" the previously created gemset. Then we run our rake tasks.

Step 3:

#!/bin/bash -l

# move off the temporary gemset, then delete it
rvm use $rubyvm@global
rvm --force gemset delete padrino-$rubyvm-$BUILD_NUMBER

This is the "cleanup" step. This cleans up our temporary gemsets that we created for the test run. My understanding was the each step was "independent". Should the middle step fail, the final step would still be executed. This doesn't appear to be the case anymore. For this reason, you'll probably want to occasionally go in and clean up gemsets from failed builds. If your build passes, the gemset will clean itself up. There's probably justification for some sort of "cleanup" job here but I haven't gotten around to trying to pass variables as artifacts to other build steps.
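A rough manual cleanup, assuming the padrino- naming convention above (double-check the output format of 'rvm gemset list' on your RVM version before trusting the grep):

#!/bin/bash -l
# walk each ruby VM and delete leftover padrino-* gemsets from failed builds
for vm in 1.8.7 1.9.1 1.9.2 jruby ree rbx; do
  rvm use $vm
  for gs in $(rvm gemset list | grep -o 'padrino-[^ ]*'); do
    rvm --force gemset delete "$gs"
  done
done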

Now you can run the job and watch as Hudson gleefully executes your test cases against each ruby VM. How many of those run concurrently depends on how many executors you have configured globally in Hudson.

Unit Testing Protip - One thing you'll find out early on is how concurrent your unit tests REALLY are. In the case of Padrino, ALL of our unit tests were using a hardcoded path (/tmp/sample_project) for testing. My first major task once I was added to the project was to refactor ALL of our tests to make that path dynamic so that we could run more than one permutation at a time. You can see an example of how I did that here. Essentially I generated a unique temp directory per run with UUID.new.generate and stored it in an instance variable. It was the quickest way to resolve the problem. If your tests aren't capable of running in parallel, that's one way to address it.
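The same idea in shell terms, if all you need is a scratch path that concurrent runs can't fight over:

# a unique scratch directory per run instead of a hardcoded /tmp/sample_project
SAMPLE_DIR=$(mktemp -d /tmp/sample_project.XXXXXX)
trap 'rm -rf "$SAMPLE_DIR"' EXIT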

One thing to be aware of: if you have intensive unit tests and your Hudson server isn't very powerful, you simply may not have the capacity to run multiple tests at the same time. I had to spin up some worker VMs on other machines around the house to serve as Hudson slave nodes. Our unit tests were actually taking LONGER when we tried to run them in parallel because of the strain of compiling native-extension gems on top of actually running the tests.

Optional profit! step

Code coverage is important. However, it makes NO sense to run code coverage on EVERY VM permutation. You only need to run it once (unless you have VM-dependent code in your application). What I've done is take advantage of "Post-build actions" to kick off a second job I've defined. This job does nothing but run our code coverage rake tasks. Steps 1 and 3 are the same as above, minus the rubyvm variable. Step 2 is different:

#!/bin/bash -l

# coverage runs once, on a single VM, so no $rubyvm in the gemset name
rvm use 1.8.7@padrino-rcov-$BUILD_NUMBER

bundle exec rake hudson:coverage:clean
bundle exec rake hudson:coverage:unit

We've broken the coverage tests into their own rake tasks so they don't impact normal testing. This creates a code coverage report that's visible in Hudson on that project's page. Currently we don't run the coverage report job unless the primary job finishes.

Wrap up

That's pretty much it in a nutshell. I'm looking to move Hudson to a more powerful VM here at the house as soon as the hardware comes in. I should then be able to run all the tests across all VMs at one time. Screenshots for each of the steps described in this post are available here.