Andreas Krüger

So you have a nice farm of servers you have been administering. You have a hodgepodge collection of scripts on your hard drive, quite useful for you yourself, though largely unknown to the rest of the world.

You want to lift your script infrastructure to a new level. “Automate! Automate! Automate!” has been the battle cry, rumbling in the distance for quite some time now.

It seems to make sense. So you try it out. You’ve grabbed one of these hip automation tools (Ansible, Puppet, Chef, or similar – this blog post doesn’t care which one). You’ve played around a bit. Initial experiments were promising. You’re hooked.

You resolve to adopt automation. When some task comes your way, you’ll not do it the manual way (like you’ve done for years). Instead, you’ll take the time to automate that task. You hope to collect a growing number of automation solutions, and eventually arrive in the wonderful new land of hip present-day tooling.

The reality is: You won’t. It’s a trap. You’ve been tempted to enter a tar pit, where movement is slow.

Personally, I love and highly recommend automation. But please, don’t try to introduce it horizontally, one task at a time, across servers. Instead, introduce it vertically, one server at a time.

There are two problems with horizontal automation, which will reduce your movements to tar pit slowness: You’ll need extremely robust automation, as it needs to deal with whatever your previous manual maintenance has left behind. And it’s difficult to test.

Let me explain this with a (toy) example. Say, you want to add a new admin’s key to /root/.ssh/authorized_keys on a bunch of servers. In your particular organization, you find yourself confronted with the following issues:

  • Most files end with '\n', but some don’t.

  • Each key has a “comment” field at its end. You have a company policy on what information should be in there. This is largely followed, but not always.

  • In your organization, those authorized_keys files are long, so they should be sorted by comment field.

  • Ah – you also have some glitches, where the files are not entirely sorted (although they should be).

  • You even occasionally encounter duplicate keys in the same file. Sometimes the entire line is duplicated, sometimes only the key. It may happen that one line featuring the duplicate key obeys the company comment field policy while the other one doesn’t.

  • An additional # comment line ought to record who added the key, and when. It typically precedes the key. For historical reasons, on database servers that comment line follows the key. For different historical reasons, front-end servers typically don’t have it at all.

If you’re now scratching your head, thinking “this is ridiculous”, if you are wondering what even the right concept is to deal with such a mess, I completely agree with you. You do not want to expend effort on conceptual work here.

I agree this example is a bit exaggerated to support my point. But then, authorized_keys is about the simplest configuration file format imaginable. Other stuff we deal with daily is considerably more complicated. Issues like the above do pop up everywhere in legacy, manually maintained systems.
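To give a flavor of how quickly this grows, here is a minimal Python sketch of "just add one key" against such a file. It is an illustration, not a recommended tool. The function names (key_of, add_key) are made up for this post, and the sketch assumes every key line has the plain "type key comment" shape with no leading options field. Even so, it only copes with three of the quirks listed above and gives up on sorting and on the comment policy.

```python
def key_of(line):
    """Return the key material of a key line.

    Assumption: lines look like "type key comment" with no options field.
    Real authorized_keys lines may start with options, which this ignores.
    """
    fields = line.split()
    return fields[1] if len(fields) >= 2 else line.strip()


def add_key(text, new_line):
    """Add new_line to authorized_keys content, tolerating some legacy quirks."""
    lines = text.split("\n")
    # Quirk: the file may or may not end with '\n'.
    if lines and lines[-1] == "":
        lines.pop()

    seen = set()
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Quirk: the "who added this, and when" comment may precede the
            # key, follow it, or be missing. We leave such lines where they
            # are, which also means we cannot safely re-sort the file.
            kept.append(line)
            continue
        key = key_of(stripped)
        if key in seen:
            # Quirk: duplicate keys. We keep the first occurrence, even if
            # the later one has the comment field that obeys the policy.
            continue
        seen.add(key)
        kept.append(line)

    if key_of(new_line) not in seen:
        kept.append(new_line)
    return "\n".join(kept) + "\n"
```

Every "we keep the first" and "we leave it where it is" in those comments is a policy decision somebody would have to make, defend, and test against every server in the farm.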

The main point here: Just say “no”. Do not even try to automate on top of existing systems, which have been maintained via a traditional, manual process. In theory, it could be done. But in practice, you’ll burn incredible amounts of time and brain power.

In all of the many places I’ve seen over the years, people with admin skills and roles are either slightly overloaded, or worse. They simply do not have time on their hands to burn. Attempts to introduce automation the horizontal way, one task at a time, are doomed to failure.

Instead, automate vertically: One server at a time.

Take a server, and produce an automation solution which builds that server. And does so from scratch. The entire server. It should be built with no manual step whatsoever, other than the initial pressing of one button.

When something needs to be improved about that server, the standard way to go about it should be: Get yourself a new, fresh “clean slate” and rebuild the entire thing, with the improvement. Test it to see it works as intended. (This, too, can and should be automated.) Switch over active duty from the present to the newly constructed server. Ditch the old one.
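The cycle just described (build, test, switch over, ditch) can be sketched in a few lines of Python. This is a hedged outline only: roll_out and its four callables (build, test, switch_to, destroy) are hypothetical placeholders for whatever your automation tool and infrastructure actually provide, and servers are represented by plain string identifiers.

```python
def roll_out(build, test, switch_to, destroy, active=None):
    """Build a fresh server, test it, and only then switch duty to it.

    build()        -> identifier of a newly built server (hypothetical)
    test(server)   -> True if the new server works as intended
    switch_to(s)   -> move active duty to server s
    destroy(s)     -> ditch server s
    active         -> identifier of the server currently on duty, if any

    Returns the identifier of whichever server is on duty afterwards.
    """
    candidate = build()          # from scratch, no manual steps
    if not test(candidate):
        # The running system was never touched, so a bad build costs
        # nothing but the candidate itself.
        destroy(candidate)
        return active
    switch_to(candidate)
    if active is not None:
        destroy(active)          # ditch the old one
    return candidate
```

Note that the old server is only destroyed after the switch, and the candidate is the only thing at risk when a test fails.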

Never touch a running system. Rebuild it from scratch.

This may seem absurdly wasteful. And, in some way, it is. Doing things this way requires availability of more (virtual) machines. You’ll also be burning CPU cycles on a grand scale, rebuilding whole things again and again when only very little has actually changed.

But CPU cycles and even virtual machine construction have become cheap. You and your time remain precious. Constructing from scratch is much easier and more pleasant to automate (and also much less error-prone) than fiddling with existing, manually constructed infrastructure.

Other benefits include:

  • You’ll never again need to touch a running system. Your automation leaves those alone. It builds up the new servers instead. So if you get it horribly wrong, just ditch that new server, instead of making it active. Nobody but you will know you goofed. No harm is done.

  • You’ll put your automation solution under version control. So there’s no longer a need to maintain information like “who added whom to authorized_keys”, certainly not in those authorized_keys files themselves. The version control system will maintain such information, consistently, automatically, reliably, for free.

  • Big upgrades (a new version of the operating system or similar) are nightmares no longer. Just change the automation and build the server anew. You can check at your leisure whether it works. Switch duty from the old to the new one if it does, ditch the new one if it doesn’t.

  • Hopefully, your hardware setup is redundant. If it isn’t, each hardware crash results in an unpleasant cocktail of “interruption of service of undetermined duration”, “emergency”, “lots of manual labor”, “overtime”, and, last but not least, “let’s hope those back-ups actually work” (assuming you do back-ups). With clean-slate automation in place, you know that you can rebuild that single-point-of-failure server from scratch. You know in advance how long it’ll take. You’ve done it before. Normal business. No sweat.

  • Development and testing will appreciate it if you can provide them with test environments. These will be exact duplicates of the production machines. Or, if there’s a need to deviate, the nature and extent of the deviation will be known.

  • Automating from scratch opens whole new opportunities for securing servers. For one example, you can try changing whole file systems to read-only.
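As an illustration of the version control point above: once a server is built from scratch, a file like authorized_keys is no longer edited, it is generated. The following is a minimal Python sketch under assumptions of my own: the admin list, its field names ("comment", "key") and the placeholder key strings are all hypothetical, and a real automation tool would use its own templating instead of a hand-written function.

```python
# Hypothetical source of truth, kept under version control.
# The key strings are placeholders, not real keys.
ADMINS = [
    {"comment": "carol@example.org", "key": "ssh-ed25519 PLACEHOLDER_KEY_CAROL"},
    {"comment": "alice@example.org", "key": "ssh-ed25519 PLACEHOLDER_KEY_ALICE"},
]


def render_authorized_keys(admins):
    """Render the complete file from the declared admins.

    The output is always sorted by comment field, always ends with a
    newline, and never contains duplicate keys, because it is generated
    from clean data instead of patched on top of an existing file.
    """
    keys = [admin["key"] for admin in admins]
    if len(set(keys)) != len(keys):
        raise ValueError("duplicate key in admin list")
    ordered = sorted(admins, key=lambda admin: admin["comment"])
    return "".join(
        "{} {}\n".format(admin["key"], admin["comment"]) for admin in ordered
    )
```

Adding a new admin becomes a one-line change to the list. Who made that change, and when, is recorded by the version control history, so no "#" bookkeeping lines are needed in the file itself.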

So, don’t resolve to automate tasks horizontally, one task at a time, fiddling with existing servers. It’s a waste of your time. Instead, resolve to get rid of servers that have been installed manually, one server type at a time. You’ll find your automation effort progresses much faster this way.

Also, your progress becomes more visible. If your organization is anything like those I tend to encounter, this matters.

To nail it down, let’s return to the initial example. If root’s authorized_keys files need augmentation, would I reconfigure that part in my automation system and actually rebuild a whole bunch of servers anew from scratch?

Short answer: It would at least feel extremely good to be able to do so. And if the cost is just a heap of CPU cycles, why not?

But I admit to glossing over a few things here. In particular, what is the cost of “switching over of active duty”?

That may be easily done for a stateless microservice without database. It may be tough for a big-iron central database server.

So, my real, practical answer is: You should fight hard to be able to re-build even the tough ones. And it’s important you actually do so, every so many weeks.

But, until your entire infrastructure and the software it runs become automation-friendly, the switching over will probably end up having a price tag attached: Typically, the price is a certain amount of down-time of service.

Whether that price is worth whatever you’re trying to achieve has (of course) to be judged on a case-by-case basis.

But, some weeks ago, you did roll out that server from scratch automatically, right? Then you’ll find smuggling in a bit of horizontal automation to be much easier and much more pleasant, compared with the tar-pit experience of having to deal with a manually installed, warty, legacy system.