So you have a nice farm of servers you have been administering. You have a hodgepodge collection of scripts on your hard drive, quite useful for you yourself, though largely unknown to the rest of the world.

You want to lift your script infrastructure to a new level. “Automate! Automate! Automate!” has been the battle cry, rumbling in the distance for quite some time now.

It seems to make sense. So you try it out. You’ve grabbed one of these hip automation tools (Ansible, Puppet, Chef, or similar – this blog post doesn’t care which one). You’ve played around a bit. Initial experiments were promising. You’re hooked.

You resolve to adopt automation. When some task comes your way, you’ll not do it the manual way (like you’ve done for years). Instead, you’ll take the time to automate that task. You hope to collect a growing number of automation solutions and eventually arrive in the wonderful new land of hip present-day tooling.

The reality is: You won’t. It’s a trap. You’ve been tempted to enter a tar pit, where movement is slow.

Personally, I love and highly recommend automation. But please, don’t try to introduce it horizontally, one task at a time, across servers. Instead, introduce it vertically, one server at a time.

There are two problems with horizontal automation, which will reduce your movements to tar pit slowness: You’ll need extremely robust automation, as it needs to deal with whatever results your previous manual maintenance has left you. And it’s difficult to test.

Let me explain this with a (toy) example. Say, you want to add a new admin’s key to /root/.ssh/authorized_keys on a bunch of servers. In your particular organization, you find yourself confronted with the following issues:

If you’re now scratching your head, thinking “this is ridiculous”, if you are wondering what even the right concept is to deal with such a mess, I completely agree with you. You do not want to expend the effort doing conceptual work here.

I agree this example is a bit exaggerated to support my point. But then, authorized_keys is about the simplest configuration file format imaginable. Other stuff we deal with daily is considerably more complicated. Issues like the above do pop up everywhere in legacy, manually maintained systems.
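For contrast, here is roughly what the task looks like when the file is known to be in a clean, predictable state. This is a minimal Python sketch, not a recommendation of any particular tool: the function name `ensure_key` is made up for illustration, the key strings are placeholders, and the file content is passed in as a string so the logic can be tried without touching a real server. It assumes one key per line, with no option prefixes and no near-duplicates that differ only in their comment, which is exactly the assumption a hand-maintained system will not let you make.

```python
def ensure_key(authorized_keys_text: str, new_key: str) -> str:
    """Return file content that contains new_key, adding it only if absent.

    Hypothetical helper for illustration; assumes a clean file with
    one plain key per line.
    """
    new_key = new_key.strip()
    lines = authorized_keys_text.splitlines()
    # Compare stripped lines so trailing whitespace does not cause duplicates.
    if any(line.strip() == new_key for line in lines):
        return authorized_keys_text  # already present: change nothing
    lines.append(new_key)
    return "\n".join(lines) + "\n"


# Placeholder keys, not real key material.
before = "ssh-ed25519 AAAA alice@example\n"
after = ensure_key(before, "ssh-ed25519 BBBB bob@example")
# Running it a second time leaves the content unchanged.
again = ensure_key(after, "ssh-ed25519 BBBB bob@example")
```

Every additional quirk a legacy file can exhibit turns into another branch in a function like this, and each branch needs its own test case.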

The main point here: Just say “no”. Do not even try to automate on top of existing systems, which have been maintained via a traditional, manual process. In theory, it could be done. But in practice, you’ll burn incredible amounts of time and brain power.

In all of the many places I’ve seen over the years, people with admin skills and roles are either slightly overloaded, or worse. They simply do not have time on their hands to burn. Attempts to introduce automation the horizontal way, one task at a time, are doomed to failure.

Instead, automate vertically: One server at a time.

Take a server, and produce an automation solution which builds that server. And does so from scratch. The entire server. It should be built with no manual step whatsoever, other than the initial pressing of one button.

When something needs to be improved about that server, the standard way to go about it should be: Get yourself a new, fresh “clean slate” and rebuild the entire thing, with the improvement. Test it to see it works as intended. (This, too, can and should be automated.) Switch over active duty from the present to the newly constructed server. Ditch the old one.

Never touch a running system. Rebuild it from scratch.
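The rebuild cycle described above can be sketched in a few lines of Python. This is only an outline under stated assumptions: `rebuild_and_switch` and its four callables (`build`, `test`, `switch_over`, `retire`) are hypothetical names standing in for whatever your automation tool and infrastructure actually provide, and the server names are placeholders.

```python
from typing import Callable


def rebuild_and_switch(
    build: Callable[[], str],
    test: Callable[[str], bool],
    switch_over: Callable[[str], None],
    retire: Callable[[str], None],
    current: str,
) -> str:
    """Build a fresh server, test it, switch over, retire the old one.

    Returns the name of the server that is on active duty afterwards.
    """
    candidate = build()  # always from scratch, never patched in place
    if not test(candidate):
        retire(candidate)  # discard the failed build; the old server keeps serving
        return current
    switch_over(candidate)
    retire(current)
    return candidate


# Stub callables that only record what would happen.
log = []
active = rebuild_and_switch(
    build=lambda: "web-v2",
    test=lambda name: True,
    switch_over=lambda name: log.append(("switch", name)),
    retire=lambda name: log.append(("retire", name)),
    current="web-v1",
)
```

Note that the running server is never modified: it is either kept untouched (when the new build fails its test) or replaced wholesale.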

This may seem absurdly wasteful. And, in some way, it is. Doing things this way requires more (virtual) machines to be available. You’ll also be burning CPU cycles on a grand scale, rebuilding whole things again and again when only very little has actually changed.

But CPU cycles and even virtual machine construction have become cheap. You and your time remain precious. Constructing from scratch is much easier and more pleasant to automate (and also much less error-prone) than fiddling with existing, manually constructed infrastructure.

Other benefits include:

So, don’t resolve to automate tasks horizontally, one task at a time, fiddling with existing servers. It’s a waste of your time. Instead, resolve to get rid of servers that have been installed manually, one server type at a time. You’ll find your automation effort progresses much faster this way.

Also, your progress becomes more visible. If your organization is anything like those I tend to encounter, this matters.

To nail it down, let’s return to the initial example. If root’s authorized_keys files need augmentation, would I reconfigure that part in my automation system and actually rebuild a whole bunch of servers anew from scratch?

Short answer: It would at least feel extremely good to be able to do so. And if the cost is just a heap of CPU cycles, why not?

But I admit to glossing over a few things here. In particular, what is the cost of “switching over of active duty”?

That may be easily done for a stateless microservice without database. It may be tough for a big-iron central database server.

So, my real, practical answer is: You should fight hard to be able to rebuild even the tough ones. And it’s important you actually do so, every so many weeks.

But, until your entire infrastructure and the software it runs become automation-friendly, the switching over will probably end up having a price tag attached: Typically, the price is a certain amount of service downtime.

Whether that price is worth whatever you’re trying to achieve has (of course) to be judged on a case-by-case basis.

But, some weeks ago, you did roll out that server from scratch automatically, right? Then you’ll find smuggling in a bit of horizontal automation to be much easier and much more pleasant, compared with the tar-pit experience of having to deal with a manually installed, warty, legacy system.