Monday, March 29, 2010

DevOps - Operations to Developers

This is part 2 in a general set of discussions on DevOps. Part 1 is here.

Production
I have a general rule I've lived by that has served me well and it's NSFW. I learned it in these exact words from a manager many years ago:

"Don't f*** with production"

Production is sacrosanct. You can dick around with anything else, but if it's critical to business operations, don't mess with it without good reason and without an audit trail. There is nothing more frustrating than trying to diagnose an outage because someone did something they THOUGHT was irrelevant (like a DNS change - I speak from experience) and caused a two hour outage of a critical system. It's even more frustrating when there's no audit trail of WHAT was done so that it can be undone. Meanwhile, you've got 20 different concerned parties calling you every five minutes asking "are we there yet?" How much development work would get done if you operated in the same interrupt-driven environment?

Change Control
Yes, it's a hassle, it's boring, and it's not very rockstar, but it's not only critical - sometimes it's the law.

Side note: I pretty much hate meetings in general, but they do serve a purpose. My main frustration is that meetings take away time where work could actually be getting done. They always devolve into glorified gossip sessions. What should have taken 15 minutes to discuss ends up taking an hour as the conversations that started while waiting for that last person to show up carry over into the meeting proper. Sadly, the person who is late is usually Red Leader, and we can't seem to stay on target. Everyone has something they would rather be doing, and usually it's something that will actually accomplish something rather than the stupid meeting.

The exception for me has always been change control meetings. I typically enjoy those because that's when things happen. We're finally going to release your cool new feature into production that you've spent a month developing and fine-tuning. Of course, this is when we find out that you neglected to mention that you needed firewall rules to go along with it. This is when we find out exactly what that new table is going to be used for and that we MIGHT want to put it in its own bufferpool. All of the things you didn't think of? We bring them to the surface in these meetings because these are pain points we've seen in the past. We think of these things.

Auditing
As mentioned under Production, we typically don't have the benefit of looking over changes in source control. We can't check a physical object into SVN. Sure, there are amazing products like Puppet and Cfengine that make managing server configurations easier. We have applications that can track changes. We have applications that map our switch ports. But it's simply not that easy for us to track down what changed.
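For the curious, here's the kind of thing we mean by an audit trail, as a rough Python sketch and nothing more - the file name and field names are invented, not any real tool of ours. The point is just who, what, why, and when, written somewhere it can't quietly disappear.

    import getpass
    import json
    import time

    # Hypothetical, bare-bones change journal: the minimum audit trail ops
    # wants when a change can't live in SVN. In real life this would go
    # somewhere central; the local file here is just for illustration.
    LOG_PATH = "infra-changes.jsonl"

    def record_change(system, what, why):
        """Append a who/what/why/when record for an infrastructure change."""
        entry = {
            "when": time.strftime("%Y-%m-%d %H:%M:%S"),
            "who": getpass.getuser(),
            "system": system,   # e.g. "dns", "core-switch-2", "firewall"
            "what": what,       # the actual change that was made
            "why": why,         # the reason, so 2AM-you can undo it sanely
        }
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps(entry) + "\n")

    if __name__ == "__main__":
        record_change(
            system="dns",
            what="repointed app.example.com from 10.0.0.5 to 10.0.0.7",
            why="CHG-1234: web tier migration",
        )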

Your application, sitting in source control, is encapsulated in a way our infrastructure isn't. You know what changed, who changed it, and (with appropriate comments) WHY it was changed.

Meanwhile a DNS change may have happened, a VLAN change, a DAS change...you name it. Production isn't just your application. It's all the moving parts underneath that power it. The application you developed and tested on a single server doesn't always account for the database being on a different machine or the firewall rules that go with it.

Yes, we'd love to have a preproduction environment that mimics production, but that's not always an option. We have to have an audit trail. Things have to be repeatable. So no, we can't just change a line in a JSP for you to fix a bug that didn't get caught in testing. It would take us longer to do that by hand on 10 servers than it would to just push a new build.
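To make that concrete, here's a rough Python sketch of what "just push a new build" looks like from our side - the hostnames, the artifact name, and the paths are all made up. One artifact, one loop, and the push itself leaves a record of what went where.

    import subprocess

    # Hypothetical deploy loop: the same artifact goes to every server the
    # same way. Hand-editing a JSP on ten boxes is neither repeatable nor
    # auditable; nobody can say later which box got what.
    SERVERS = [f"app{n:02d}.example.com" for n in range(1, 11)]
    ARTIFACT = "builds/myapp-1.4.2.war"
    DEPLOY_PATH = "/opt/tomcat/webapps/myapp.war"

    def deploy(artifact, servers):
        for host in servers:
            subprocess.run(
                ["rsync", "-az", artifact, f"{host}:{DEPLOY_PATH}"],
                check=True,
            )
            print(f"deployed {artifact} to {host}")

    if __name__ == "__main__":
        deploy(ARTIFACT, SERVERS)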

Outages
Outages are bad, mmkay? You probably won't lose your job over a bug, but I've had to deal with someone being fired because he didn't follow the process and caused an outage. It sucks, but we're the ones who get the phone call at 2AM when something is amiss.

And even AFTER the outage, we have to fill out Root Cause Analysis reports, sometimes after being up for 24 hours straight fixing a serious issue. A lot of those RCAs trace back to the same regressions. You can either write a unit test for a bit of code or you can keep fixing the same bug after every release. We'd prefer you write the unit test, personally.
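If it helps, this is all we're asking for. It's a toy example - the function and the bug are invented, it's the habit that matters - but a couple of lines of unittest pin the 2AM fix down so the next release can't quietly reintroduce it.

    import unittest

    def apply_discount(price, percent):
        """The bit of code that keeps breaking; made up for illustration."""
        return round(price * (1 - percent / 100), 2)

    class DiscountRegressionTest(unittest.TestCase):
        # Once the fix is captured here, the same bug can't sneak back into
        # the next release without the build going red first.
        def test_zero_percent_leaves_price_alone(self):
            self.assertEqual(apply_discount(19.99, 0), 19.99)

        def test_full_discount_is_free(self):
            self.assertEqual(apply_discount(19.99, 100), 0.0)

    if __name__ == "__main__":
        unittest.main()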

I know all of these things make us look like a slow, unmoving beast. I know you hate sitting in meeting after meeting explaining that the bug will be fixed just as soon as ops pushes the code. I know that we get pissy and blame you for everything that goes wrong with an application. We're sorry. We're just running on two hours of sleep over three days getting the new hardware installed for your application that someone thinks has to go online yesterday. Meanwhile, we're dealing with a full disk on this server and a flaky network connection on another. Cut us some slack.
