
Ranjib is a system administrator at Google. Prior to Google, Ranjib was a senior consultant with ThoughtWorks. He works on private cloud implementation strategies, cloud adoption, system automation, etc. He has worked on both application development and system administration for the past 6 years. Prior to ThoughtWorks, Ranjib worked with Persistent Systems. Ranjib did his graduation in life science and his master's in Bioinformatics. Ranjib is a staunch FOSS supporter. Ranjib is a DZone MVB and is not an employee of DZone and has posted 13 posts at DZone. You can read more from them at their website.

Infrastructure Tooling Anti-Patterns: Accumulator


As our (or our clients') infrastructure grows and runs for longer durations, I have noticed that certain parts of it are known only to certain people, and only to a certain extent. Due to the nature of IT operations, most engineers stay in firefighting mode and fix some of the problems with a manual hotfix (be it for stability, security, or performance issues).

Over time these pieces of infrastructure (or infrastructure services) accumulate features or functionality that are neither automated nor documented. Slowly a server reaches a state where, if you kill it, it will be difficult to recreate, not only because you don't know the exact steps needed to bring it back to its original state, but also because there are dependencies on other integration points that you need to worry about. In the community we call them 'Works of Art'.

There are many ways to fix them, but this post is about how to catch them.

An ounce of prevention is worth a pound of cure. 

I prefer to kill the whole environment (staging, pre-production, UAT) every weekend, or to have non-functional releases where I just recreate the production infrastructure at regular intervals. This does not eliminate the accumulation of manual fixes, but it does reveal whether any manual fixes are present that are critical for the services to run. By doing this more frequently, I reduce the risk of large, accumulated manual fixes. To me this is a litmus test, or gold standard, for automated infrastructure.
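One way to make the rebuild test concrete is to compare a snapshot of the long-running server with a snapshot of the freshly recreated one. The following is only a minimal sketch under assumptions of my own: the function name `find_manual_drift`, the flat key/value snapshot format, and the sample values are all hypothetical, and how you collect the snapshots (package lists, sysctl values, cron entries) is left to whatever tooling you already use.

```python
def find_manual_drift(running, rebuilt):
    """Return settings whose values differ between two state snapshots.

    Both arguments are plain dicts mapping a setting name to its value.
    A setting missing from one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(running) | set(rebuilt)):
        old = running.get(key)
        new = rebuilt.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


# Hypothetical snapshots, for illustration only.
running = {
    "pkg:nginx": "1.4.1",
    "sysctl:net.core.somaxconn": "1024",  # hand-tuned during an incident
    "cron:logrotate": "daily",            # added manually, never automated
}
rebuilt = {
    "pkg:nginx": "1.4.1",
    "sysctl:net.core.somaxconn": "128",
}

for key, (old, new) in find_manual_drift(running, rebuilt).items():
    print(f"{key}: running={old!r} rebuilt={new!r}")
```

Anything this reports is a candidate manual fix that exists only on the old server. It does not tell you whether the fix is critical; that still shows up only when the services run on the rebuilt environment.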

Published at DZone with permission of Ranjib Dey, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)