Automating Dockerized Jenkins Upgrades

Introduction

If you’ve used Jenkins in production for a while, you will be aware that Jenkins frequently publishes updates to its server for security fixes and functionality changes.

On a dedicated, non-dockerized host, this is generally managed for you through package management. With Docker it can get slightly more complicated to reason about upgrades, as you’ve likely separated out the context of the server from its data.

Problem

You want to reliably upgrade your Jenkins server.

Solution

This technique is delivered as a Docker image composed of a number of parts. First we will outline the Dockerfile that builds the image. This Dockerfile draws from the ‘docker’ library image (which contains a Docker client) and adds a script that manages the upgrade.

The image is run in a docker command that mounts the relevant Docker folders and socket from the host, giving it the ability to manage any required Jenkins upgrade.

Dockerfile

We start with the Dockerfile:

FROM docker                                                    <1>
ADD jenkins_updater.sh /jenkins_updater.sh                     <2>
RUN chmod +x /jenkins_updater.sh                               <3>
ENTRYPOINT /jenkins_updater.sh                                 <4>

<1> – Use the ‘docker’ standard library image

<2> – Add in the ‘jenkins_updater.sh’ script (see below)

<3> – Ensure that the ‘jenkins_updater.sh’ script is runnable

<4> – Set the default entrypoint for the image to be the ‘jenkins_updater.sh’ script

The above Dockerfile encapsulates the requirements for upgrading (and backing up) Jenkins in a runnable Docker image. It uses the ‘docker’ standard library image to get a Docker client running within a container. This container runs the script in the next listing to manage any required upgrade of Jenkins on the host.

NOTE: If your Docker daemon’s version differs from the client version in the ‘docker’ image, you may run into problems. Try to use matching versions.
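A quick way to check what you are running against (a sketch; the pinned tag shown is only an example) is to ask the daemon for its version and then pin the Dockerfile’s base image to match:

# Ask the host's daemon which version it is running
docker version --format '{{.Server.Version}}'
# Then pin the base image in the Dockerfile accordingly, for example:
# FROM docker:17.09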

jenkins_updater.sh

This is the shell script that manages the upgrade within the container:

#!/bin/sh                                                        <1>
set -e                                                           <2>
set -x                                                           <3>
if ! docker pull jenkins | grep up.to.date                       <4>
then
 docker stop jenkins                                             <5>
 docker rename jenkins jenkins.bak.$(date +%Y%m%d%H%M)           <6>
 cp -r /var/docker/mounts/jenkins_home \                         <7>
       /var/docker/mounts/jenkins_home.bak.$(date +%Y%m%d%H%M)   <7>
 docker run -d \                                                 <8>
     --restart always \                                          <9>
     -v /var/docker/mounts/jenkins_home:/var/jenkins_home \      <10>
     --name jenkins \                                            <11>
        -p 8080:8080 \                                           <12>
     jenkins                                                     <13>
fi

<1> – This script uses the ‘sh’ shell (not the ‘/bin/bash’ shell) because only ‘sh’ is available on the ‘docker’ Docker image

<2> – This ‘set’ command ensures the script will fail if any of the commands within it fail

<3> – This ‘set’ command logs all the commands run in the script to standard output

<4> – The ‘if’ block only fires if ‘docker pull jenkins’ does not output ‘up to date’

<5> – When upgrading, begin by stopping the jenkins container

<6> – Once stopped, rename the jenkins container to ‘jenkins.bak.’ followed by the time to the minute

<7> – Copy the Jenkins state folder (mounted from the host) to a backup folder

<8> – Run the docker command to start up Jenkins, and run it as a daemon

<9> – Set the jenkins container to always restart

<10> – Mount the jenkins state volume to a host folder

<11> – Give the container the name ‘jenkins’ to prevent multiple of these containers running simultaneously by accident

<12> – Publish the 8080 port in the container to the 8080 port on the host

<13> – Finally, the jenkins image name to run is given to the docker command

The above script tries to pull jenkins from the Docker Hub with the ‘docker pull’ command. If the output contains the phrase ‘up to date’, then the ‘docker pull | grep …’ command returns true. However, we only want to upgrade when we did _not_ see ‘up to date’ in the output, which is why the ‘if’ statement is negated with a ‘!’ sign after the ‘if’.

The result is that the code in the ‘if’ block only runs if we downloaded a new version of the ‘latest’ Jenkins image. Within this block, the running Jenkins container is stopped and renamed. We rename it rather than delete it in case the upgrade does not work and we need to reinstate the previous version.

As a further part of this rollback strategy, the host mount folder containing Jenkins’ state is also backed up.

Finally, the newly downloaded Jenkins image is started up with the docker run command.

NOTE: You may want to change the host mount folder and/or the name of the running Jenkins container based on personal preference.

The attentive reader might be wondering how this jenkins-updater image is connected to the host’s Docker daemon. To achieve this, the image is run using a method commonly used in the book:

The jenkins-updater image invocation

docker run \                                               <1>
    --rm \                                                 <2>
    -d \                                                   <3>
    -v /var/lib/docker:/var/lib/docker \                   <4>
    -v /var/run/docker.sock:/var/run/docker.sock \         <5>
    -v /var/docker/mounts:/var/docker/mounts \             <6>
    dockerinpractice/jenkins-updater                       <7>

<1> – The docker run command

<2> – You want the container to be removed when it has completed its job

<3> – Run the container in the background

<4> – Mount the host’s docker daemon folder to the container

<5> – Mount the host’s docker socket to the container so the docker command will work within the container

<6> – Mount the host’s docker mount folder where the Jenkins data is stored, so that the jenkins_updater.sh script can copy the files

<7> – The dockerinpractice/jenkins-updater image is the image to be run

Automating the upgrade

This one-liner makes it easy to run within a crontab. We run this on our home servers. The crontab line looks like this:

0 * * * * docker run --rm -d -v /var/lib/docker:/var/lib/docker -v /var/run/docker.sock:/var/run/docker.sock -v /var/docker/mounts:/var/docker/mounts dockerinpractice/jenkins-updater 

NOTE: The above is all on one line because crontab does not allow a command to be continued across lines with a trailing backslash in the way that shell scripts do.
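If the long line bothers you, another option is to put the docker run command into a small wrapper script and call that from cron instead. A sketch, with an illustrative script path:

#!/bin/sh
# /usr/local/bin/jenkins-updater-cron.sh (illustrative path)
docker run --rm -d \
    -v /var/lib/docker:/var/lib/docker \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/docker/mounts:/var/docker/mounts \
    dockerinpractice/jenkins-updater

The crontab line then becomes:

0 * * * * /usr/local/bin/jenkins-updater-cron.sh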

The end result is that a single crontab entry can safely manage the upgrade of your Jenkins instance without you having to worry about it. The task of automating the cleanup of old backed-up containers and volume mounts is left as an exercise for the reader.
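If you want a starting point for that exercise, here is a minimal sketch (not part of the technique itself) that keeps only the most recent backup container and data copy, relying on the timestamped names sorting chronologically:

#!/bin/sh
# Remove all but the most recent jenkins.bak.* container
docker ps -a --format '{{.Names}}' | grep '^jenkins\.bak\.' | sort | sed '$d' | \
    while read -r old; do docker rm "$old"; done
# Remove all but the most recent backed-up jenkins_home folder
ls -d /var/docker/mounts/jenkins_home.bak.* 2>/dev/null | sort | sed '$d' | \
    while read -r old; do rm -rf "$old"; done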

Discussion

This technique exemplifies a few patterns that come up throughout the book and that can be applied in contexts other than Jenkins.

First, it uses the core docker image to communicate with the Docker daemon on the host. Other portable scripts might be written to manage Docker daemons in other ways. For example, you might want to write scripts to remove old volumes, or report on the activity on your daemon.

More specifically, the ‘if’ block pattern could be used to update and restart other images when a new one is available. It is not uncommon for images to be updated for security reasons, or to make minor upgrades.

If you are concerned about the difficulty of upgrading between versions, it’s also worth pointing out that you need not track the ‘latest’ image tag (as this technique does). Many images have different tags that track different version numbers.

For example, your image ‘exampleimage’ might have an exampleimage:latest tag, as well as an exampleimage:v1.1 tag and an exampleimage:v1 tag. Any of these might be updated at any time, but the :v1.1 tag is less likely to move to a new version than the :latest one. The :latest tag could move to the same version as a new :v1.2 one (which might require steps to upgrade), or even to a :v2.1 one, where the new major version ‘2’ indicates a change more likely to be disruptive to any upgrade process.
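If you prefer that stability, the updater script only needs the image reference parameterised so that it pulls and runs a pinned tag rather than ‘latest’. A sketch (the tag shown is only an example):

#!/bin/sh
set -e
set -x
JENKINS_IMAGE=jenkins:2.60.3    # whichever tag/line you want to track
if ! docker pull "$JENKINS_IMAGE" | grep up.to.date
then
 docker stop jenkins
 docker rename jenkins jenkins.bak.$(date +%Y%m%d%H%M)
 cp -r /var/docker/mounts/jenkins_home \
       /var/docker/mounts/jenkins_home.bak.$(date +%Y%m%d%H%M)
 docker run -d \
     --restart always \
     -v /var/docker/mounts/jenkins_home:/var/jenkins_home \
     --name jenkins \
     -p 8080:8080 \
     "$JENKINS_IMAGE"
fi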

This technique also outlines a rollback strategy for Docker upgrades. The separation of container and data (using volume mounts) can create uncertainty about the stability of any upgrade. By retaining the old container, and a copy of the old data from the point at which the service was working, it is easier to recover from failure.
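If an upgrade does go wrong, the recovery steps are mostly mechanical. A sketch of rolling back to the most recent backup, assuming the names and paths used above:

#!/bin/sh
set -e
# Find the most recently renamed container and its timestamp
LAST=$(docker ps -a --format '{{.Names}}' | grep '^jenkins\.bak\.' | sort | tail -n 1)
STAMP=${LAST#jenkins.bak.}
# Discard the failed upgrade and restore the backed-up state
docker rm -f jenkins
rm -rf /var/docker/mounts/jenkins_home
cp -r /var/docker/mounts/jenkins_home.bak."$STAMP" /var/docker/mounts/jenkins_home
# Reinstate the old container under its original name and start it
docker rename "$LAST" jenkins
docker start jenkins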

Database Upgrades and Docker

Database upgrades are a particular context in which these stability concerns are germane.

If you want to upgrade your database to a new version, you have to consider whether the upgrade requires a change to the data structures and storage of the database’s data. It’s not enough simply to run the new version’s image as a container and expect it to work.

The picture changes if the database is ‘smart’ enough to know which version of the data it is ‘seeing’, and can perform the upgrade itself accordingly. In these cases, you might be more comfortable upgrading.

Many factors feed into your upgrade strategy. Your app might tolerate an ‘optimistic’ approach (as we see here in the Jenkins example), which assumes everything will be OK and prepares for failure when (not if) it occurs. On the other hand, you might demand 100% uptime and not tolerate failure of any kind at all. In such cases, a fully-tested upgrade plan and a deeper knowledge of the platform than running ‘docker pull’ are generally required (with or without the involvement of Docker).

Although Docker does not remove the upgrade problem, the immutability of versioned images can make it simpler to reason about them. Docker can also help you prepare for failure in two ways: backing up state in host volumes, and making it easier to test against a predictable state. The effort you put into managing and understanding what Docker is doing can give you more control and certainty about the upgrade process.
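For example, one way to use those state backups (a sketch; the temporary path, container name, and port are illustrative) is to rehearse the upgrade against a copy of the data before touching the live container:

# Rehearse the upgrade against a throwaway copy of the Jenkins state
cp -r /var/docker/mounts/jenkins_home /tmp/jenkins_home_upgrade_test
docker run -d --name jenkins-upgrade-test \
    -v /tmp/jenkins_home_upgrade_test:/var/jenkins_home \
    -p 8081:8080 \
    jenkins
# Check http://localhost:8081 behaves as expected, then clean up
docker rm -f jenkins-upgrade-test
rm -rf /tmp/jenkins_home_upgrade_test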

 

 

This technique is taken from the upcoming second edition of my book Docker in Practice:

Get 39% off with the code: 39miell

Buying the print or ebook editions will get you ebook updates as they are published.


Things I Wish I Knew Before Using Jenkins Pipelines

I started playing with Pipelines using the web interface, then hit a block as I didn’t really know the ropes.

Here are some things I wish I’d known first:

Wrap Steps in a Node

All code that performs steps in a pipeline should be wrapped in a node block:

node() {
  sh('env')
}

If the code is not wrapped in a node block, eg:

def myvariable='blah'

then by default it will run on the master.

You can specify the node by supplying an argument:

node('mynode') {
  [...]
}

If the pipeline code is not in a node block, it’s run on the master in a lightweight executor/thread.

‘checkout scm’ applies only when the Jenkinsfile comes from source control

This was a gotcha for me. ‘checkout scm’ is a great one-liner that checks out the source the Jenkinsfile is taken from.

But when you update the pipeline script in the browser, this won’t work!

Once your pipeline script is stored in source control, you can switch to using ‘checkout scm’. Otherwise, use the ‘git’ function.

Wrap in try / catch

Your code can be wrapped in a try/catch block.

I use this along with timeout() to see whether a node is available before using it:

def nodetest() { 
  sh('echo alive on $(hostname)') 
} 
// By default we use the 'welles' node, which could be offline.
usenode='welles' 
try { 
  // Give it 5 seconds to run the nodetest function
  timeout(time: 5, unit: 'SECONDS') { 
    node(usenode) { 
      nodetest() 
    } 
  } 
} catch(err) { 
  // Uh-oh. welles not available, so use 'cage'.
  usenode='cage' 
} 
// We know the node we want to use now.
node(usenode) {
  [...]
}

Oh yeah, and functions are available to you too.

Handy, eg for seeing whether a node is available.

Wrap things in stages

Want those neat stages to show up in the Jenkins job homepage?

Then wrap your stuff in stages, eg:

[...]
stage('setupenv') {
  node(nodename) {
    sh 'mkdir -p ' + builddir
    dir(builddir) {
      checkout([$class: 'GitSCM', branches: [[name: '*/master']], doGenerateSubmoduleConfigurations: false, extensions: [[$class: 'SubmoduleOption', disableSubmodules: false, parentCredentials: false, recursiveSubmodules: true, reference: '', trackingSubmodules: false]], submoduleCfg: [], userRemoteConfigs: [[url: 'https://github.com/ianmiell/shutit']]])
    }
  }
}
stage('shutit_tests') {
  node(nodename) {
    dir(builddir + '/shutit-test') {
      sh('PATH=$(pwd)/..:${PATH} ./run.sh -s tk.shutit.shutit_test shutit_branch master -l info 2>&1')
    }
  }
}
[...]

It’s a good idea to keep stages discrete, as they can be isolated from one another – for example, you could switch one stage to another node if you want (but then you might want to look into stash()ing files…).

WTF the Deal is with Pipeline Syntax vs Groovy?

Declarative Pipeline syntax may be preferable to scripted (Groovy) pipelines, but it is newer. See here:

https://jenkins.io/blog/2016/12/19/declarative-pipeline-beta/

The docs confusingly assume a familiarity with both.

It’s Still a Bit Buggy

I tried to change the branch the Jenkinsfile was pulled from, but the old branch persisted and it wouldn’t pick up changes from the new one. I ended up having to create a new job and delete the old one.

Input

Want to force user input before continuing?

Simple:

input('OK to continue?')

But it seemed to work better for me when I had defined stages first!

Examples

There are Jenkinsfile examples here but they look a bit unloved.

Some gists were more useful to me.

This intro was pretty good too.

And the canonical reference is here.

 


Five Books I Advise Every DevOps Engineer to Read

Here is a list of books that have helped and changed not only my career, but also my life.

If there’s a theme to them, it’s that they are less about IT than how people interact with technology, and how an understanding of that can make you and your organisation more efficient.

The Goal 

Published in 1984, before Windows 1.0 was released, The Goal is still read by many and recommended by the likes of Jeff Bezos.


Unusually for a business book, it’s a novel. A manager of a factory threatened with closure has three months to turn around its dysfunctional organisation. After going on a bender and having a row with his wife, he sobers up and bumps into an old friend who guides him on how to debug his business. He makes it up with his wife and figures out how to turn things around at work.

Many readers will already be aware of The Phoenix Project, a book popular among the DevOps community. TPP is based on The Goal, and has a similar plot. Personally, I prefer The Goal over The Phoenix Project for a couple of reasons.

First, it’s really well-written. It’s a good enough novel that my wife (a mental health nurse with zero interest in IT) read it and enjoyed it in a couple of sittings.

Second, the fact that it’s not about 21st-century software encourages you to think about your work in terms of systems, rather than in our own specific, local terms. Continuous improvement, problems of delivery flow, disaffected staff, and angry spouses have always been with us, and the solutions can be surprisingly similar.

For me, this book emphasised and backed up my instinct that, in an imperfect world, focussing on the biggest problems first and on the human factor is vital in improving any delivery environment.

The Checklist Manifesto


How have different industries dealt with failure?

Atul Gawande is a surgeon and public health researcher who here looks at three fields: medicine, construction, and aviation. All these fields have little tolerance for failure (buildings that fall down, planes that fall out of the sky, and doctors who kill tend to attract headlines and unwanted attention).

What becomes clear is that there is a maturity model for dealing with failure – first a ‘hero’ model is espoused (think of the aviation heroes of early-20th-century adventure and wars, or 18th-century doctors), and then complexity ensues, reducing the pool of ‘heroes’ to zero. Following that there’s a crisis, and the implementation of simple processes helps manage the chaos. Aviation went from the ‘fighter ace’ to ‘too much plane to fly’ and moved to a training model using checklists and human-friendly processes. In medicine, simple checklists help reduce error (and the cost of lawsuits). Standard processes, along with creativity applied reliably and when required, help buildings stay up.

This book stiffened my resolve to improve documentation and process in a growing business, which I wrote about at more length here.

The Practice of Management


Another oldie, this time from 1954, it discusses businesses of the day and their challenges in a timeless way (aside from the complete absence of the feminine pronoun).

A glance through it will reveal the same concerns we have always had – the section on ‘Automation’ is itself a fascinating historical document, and applies to today just as much as it did 60 years ago. There is a section on the importance of ‘innovation’ that reads like a contemporary call to arms. If you thought Google was the first organisation to try and do without middle management, there’s a chapter headed ‘Ford’s Attempt To Do Without Managers’.

If last century is so last century, then he looks to the Roman army and the Jesuits (‘the oldest elite corps’) for how management training has been shown to work.

This book is a great mind-expander for those who need to start thinking about human organisations and their challenges in delivering what we now call ‘value’.

The Art of Business Value


This is more a work of practical philosophy than a book about business. Mark Schwartz is a CIO working in the field, who here deconstructs some of the lazy assumptions and rhetoric surrounding what has come to be known as capital-A ‘Agile’.

Practical and down-to-earth, he first breaks down what ‘business value’ might mean, and shows that there is little clarity about what this often taken-for-granted concept stands for. Other terms get similar treatment: who is the ‘customer’ in agile? Are profit and business success the same thing? How granular can the organisation be?

This book gives you the courage to ask simple questions and not take for granted that the messages you’re getting about how to work are based on solid foundations.

I talked about trusting your local knowledge a while back here and wrote the talk up here.

It’s not often you get a book that’s lucid, useful, and quotes French philosophy while making it relevant.

Getting Things Done


I picked up this book almost by accident in a bookstore while on holiday. I’d read so many stories and references to it on Hacker News that I was ready to mock its easy slogans and trite advice.

Damn, was I wrong! This book didn’t so much change my life as turn it upside down. I was a stressed out, time-poor SRE who couldn’t possibly fulfil all his obligations, and using the advice and guidance here I have since transformed my career by writing a book, developing this blog, and moving jobs.

Again, this is less about technology than the human element to improving efficiency.

Its advice was so pragmatic and sensible that I rue the fact that I didn’t read it decades before.

Did I Miss One?

I’m always on the lookout for good and classic works to read and make me think. If I missed any, let me know.

Ad

My book Docker in Practice:

Get 39% off with the code: 39miell


Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites

tl;dr

For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a couple of years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re discussed usefully enough in what I read about them.

If you want to read something far longer, Google’s SRE book is a great resource.

Process

Process is essential to running and scaling an SRE operation. It’s the core of everything we achieved. When I joined the team, habits were bad – there was a ticketing system, but one-journal resolutions were not uncommon (‘Site down. Fixed, closing.’).

An SRE operation is basically a factory processing information and should act accordingly. You wouldn’t have a factory running without processes to take care of the movement of goods, and by the same token you shouldn’t have a knowledge-intensive SRE operation running without processes to take care of the movement of knowledge.

One frequent objection to process I heard is that it ‘stifles creativity’. In fact, effective process (bad process implemented poorly can mess anything up!) clears your mind to allow creative thought.

A great book on this subject is ‘The Checklist Manifesto’, which inspired many of the changes we made and was widely read within the team. It cites the aviation industry’s approach to process, which enables remarkable creativity under stressful conditions through the mental automation of routine operations. There’s even a film about one incident it discusses, and the pilot himself cited checklists and routine as enablers of his fast-thinking creativity and control in that stressful situation. In fact, we used a similar process ourselves: in emergency situations, an experienced engineer would dive into finding a solution, while a more junior one would follow the checklist.

Another critique of process is that process can inhibit effective working and collaboration. It absolutely can if process is treated as an entity justified by its own existence rather than another living asset. The only thing that can guard against this is culture. More on that later.

Process – Tooling

The first thing to get right is the ticketing system. Like monitoring solutions, people obsess over which ticketing system is best. And they are wrong to. You will generally end up preferring whichever ticketing system you use, simply through familiarity. The ticketing system is only bad if it drives or encourages bad processes. What a bad process is depends on the constraints of your business.

It’s far more important to have a ticketing system that functions reliably and supports your processes than the other way round.

Here’s an example. We moved from RT to JIRA during my tenure. JIRA offered many advantages over RT, and I would generally recommend JIRA as a collaborative tool. The biggest problem we had switching, however, was the loss of some functionality we’d built into RT, which was critical to us. RT allowed us to get real-time updates on tickets, which meant that collaboration on incidents was somewhere between chat and ticketing. This record was invaluable in post-incident review. RT also allowed us to hide entries from customers, which again was really hard to lose. We got over it, but these things were surprisingly important because they’d become embedded in our process and culture.

When choosing or changing your ticketing system, think about what’s really important to operations, not specific features that seem nice when on a list. What’s important to you can vary from how nice it looks (seriously – your customers might take you more seriously, and your brand might be about good design), to whether the reporting tools are powerful.

Documentation

After process, documentation is the most important thing, and the two are intimately related.

There’s a whole book to be written on documentation, because, again, people focus on the wrong things. The critical thing to understand is that documentation is an asset like any other. Like any business asset, documentation:

  • If properly looked after, will return investment many times over
  • Requires investment to maintain (like the fabric of a factory)
  • If out of date, costs money simply because it’s there (like out-of-date inventory)
  • If of poor quality, or not usable, is a liability, not an asset

But this is not controversial – few people disagree with the idea that good documentation is useful. The point is: what do you do about it?

Documentation – Where We Were

We were in a situation where documentation provided to us was not useful (eg from devs: ‘a network partition is not covered here as it is highly unlikely’. Well, guess what happened! And that was documentation they kindly bothered to write…), or we simply relied on previously-journalled investigations (by this time we were writing things down) to figure out what to do next time something similar happened.

This was frustrating for all of us, and we spent a long time complaining about the documentation fairy not visiting us before we took responsibility for it ourselves.

Documentation – What I Did


Here’s what I did.

  • I took two years’ worth of priority incidents (ie those that triggered – or would have triggered – an out of hours call), and listed them. There were over 1700 of them.
  • Then I categorised them by type of issue.
  • Then I went through each type of issue and summarised the steps needed to either resolve it or get to a point where escalation was required.

This took seven months of my full-time attention. I was a senior employee and I was costing my company lots of money to sit there and write. And because I had a clueful boss, I never got questioned about whether this was a good use of time. I was trusted (culture, again!). I would say it took four months before any dividends at all were seen from this effort. I remember this four-month period as a nerve-wracking time, as my attention was taken away from operations to what could have been a complete waste of my time and my employer’s money and an embarrassing failure.

Why not give it to an underling to do? For a few reasons. This was so important, and so new to us, that I needed to know it was being done properly. I knew exactly what was needed, so I knew I could write it in such a way that it would be useful to me at the very least. I was also a relatively experienced writer (arts grad, former journalist), so I liked to think that would help me write well.

We called these ‘Incident Models’ as per ITIL, but they can also be called ‘run books’, ‘crib sheets’, whatever. It doesn’t matter. What mattered was:

  • They were easy to find/search for
  • It was easy to identify whether you got a match
  • They were not duplicated
  • They could be trusted

We put this documentation in plain text within the ticketing system, under a separate JIRA project.

The documentation team got wind of what we were up to and tried to pressure us to use an internal wiki for this. We flat-out refused, and that was critical: the documentation system’s colocation with the ticketing system meant that searching and updating the documentation had no impedance mismatch. Because it was plain text it was fast, simple to update, and uncluttered. We resisted process that jeopardised the utility of what we were doing.

Documentation and the Criticality of De-cluttering

When we started, we designed a schema for these Incident Models which was a thing of beauty, covering every scenario and situation that could crop up.


In the end it was almost a complete waste of time. What we ended up using was a really dumb structure of:

  • Statement of problem
  • Steps 1-n of what to do
  • Further/deeper discussion, related articles

That was it. Attempts to structure it more thoroughly all failed: they were either confusing to newcomers, created too much administrative overhead, or didn’t cover enough. Some articles developed their own schema appropriate to the task, and new categories (eg the ‘jump-off’ article that told you which article to go to next) evolved over time. We couldn’t design for these things in advance because we didn’t know what would work and what would not.

Call it ‘agile documentation’ if you want – agile’s what sells these days (it was ITIL back then). Again, what was critical was that simplicity and utility trumped everything else.

There Is No Documentation Fairy

Having spent all this time and effort a couple of other things became clear regarding documentation.

 


First, we gave up accepting documentation from other teams. If they commented their code, great; if there was something useful on the wiki for us to find, also great. But when it came to handing over projects, we stopped ‘asking for documentation’. Instead we’d arrange sessions with experienced SREs where the design of the project would be discussed.

Invariably (assuming they had no ops experience), the developer would focus on the things they’d built and how it worked – and these things were often the most thoroughly tested and least likely to fail.

By contrast, the SRE would focus on the weak points, the things that would go wrong. ‘What happens if the network gets partitioned? What if the database runs out of disk? Can we work out from the logs why the user didn’t get paid?’

We’d then go away and write our own documentation and get the engineer to sign off on it – the reverse of the traditional flow! They’d often make useful comments and give us added insights in the process.

The second thing we noticed was that our engineers were still reluctant to update the docs that only they were using. There was still a sense that documentation should be given to them. The leadership had to constantly reinforce that this was their documentation, not tablets of stone handed down from on high, and that if they didn’t constantly maintain it, it would become useless.


This was a cultural problem and took a long time to undo. Undoing it also required the documentation changes to be reinforced by process.

In the end, I’d say about 10% of the ongoing working time was spent maintaining and writing documentation. After the initial 7-month burst, most of that 10% was spent on maintenance rather than producing new material.

Documentation – Benefits

After getting all this documentation done, we experienced benefits far in excess of the 10% ongoing cost. To call out a few:

  • Easier onboarding

Before this process started we were reluctant to take on less experienced staff. After, onboarding became a breeze. Among other things the training involved following incidents as they happened and shadowing more experienced staff. New staff were tasked with helping maintain docs, which helped them understand what gaps they had in their knowledge.

  • Better training

The docs gave us a resource that allowed us to identify training requirements. This ended up being a curriculum of tools and techniques that any engineer could aim to get a working knowledge of.

  • Less stress through simpler escalation

This was a big one. Before we had the step-by-step incident models, when to escalate was a stressful decision. Some engineers had a reputation for escalating early, and all were insecure about whether they’d ‘missed something obvious’ before calling a responsible tech lead out of hours. SREs would also get called out for not escalating early enough!

The incident models removed that problem. Pretty soon, the first question an escalated-to techie asked was ‘have you followed the incident model?’ If so, and something obvious had still been missed, then the gaps in the model became clear and were quickly fixed. Soon, non-SREs were busy updating and maintaining the docs themselves for when they were escalated to. It became a virtuous circle.

  • Better discipline

The obvious value of documentation to the team helped improve discipline in other respects. Interestingly, SREs previously had the reputation for being the ‘loudest’ team – there was often a lot of ‘lively’ debate, and the team was very social – which made sense, as we relied on each other as a team to cover a large technical area, dealt with often non-technical customer execs, and sharing knowledge and culture was critical.

As time progressed, the team became quieter and quieter – partly due to the advent of chatrooms, increased remote working, and international teams, but also due to the fact that so much of the work became routine: follow the incident model, when you’re done, or don’t understand something, escalate to someone more senior.

  • Automation

Automating the investigations this way meant that the way was clear to further automate them with software.

Having metrics on which tickets were linked to which incident models meant that we knew where best to focus our effort. We wrote scripts to comb through log files in the background, make encoding issues quicker and simpler to figure out, automate responses to customers (‘Issue was caused by a change made by app admin user XXX’), and a lot more.

These automations inspired an automation tool we built for ourselves based on pexpect: http://ianmiell.github.io/shutit/ But that’s another story. Basically, once we got going it was a virtuous circle of continuous improvement.

Back to Process

Given you have all these assets, how do you prevent them from degrading in value over time? This is where process is critical.

Two processes were critical in ensuring everything continued smoothly: triage and post-incident review.

Process – Triage

(Cartoon caption: 'Here's a copy of our new triage plan... the order is now walking wounded first, the dying and dead second, lawyers last...')

5%-10% of time was spent on the triage process. Again, it took a long time to get the process right, but it resulted in massive savings:

  • Reduce the steps to the minimum useful steps

It’s so tempting to put as much as possible into your triage process, but it’s vital to prioritise the value of the process over its completeness. Any step that is not often useful tends to get skipped over and ignored by the triager.

  • Focus on saving cost in the process

Looking for duplicates, finding the relevant incident model, reverting quickly to the customer, and escalating early all reduced the cost per ticket significantly. It also saved other engineers the context switch of being asked a question while they’re thinking about something else. It’s hard to evaluate the benefits of these items, but we were able to deal with increased volumes of incidents with fewer people and less difficulty. Senior management and customers noticed.

Recording the details of these efforts also saved time, as (for example) engineers given a triaged ticket could see which string the triager had used to search for previous incidents, and perhaps improve on it. It also meant that more experienced staff could review the triage quality.

  • Review triage

Experienced staff need to review the triage process regularly to ensure it’s actually being applied effectively.

When I moved to another operations team (in a domain I knew far less about), I cut the incident queue in half in about 3 days, just by applying these techniques properly. The triage process was there, but it wasn’t being followed with any thought or oversight, and was given to a junior member of staff who was not the most capable. Big mistake. Triage must be done – or overseen – by someone with a lot of experience, as while it looks routine and mechanical it involves a lot of significant decisions that rest on experience in the field.

And yes, I was the new boss, and I chose to spend my first week doing the ‘lowly’ task of triage. That’s how important I thought it was.

  • Rota the task

No-one wants to do triage for long, so we rota’d it per week. This allowed some continuity and consistency, but stopped engineers from going crazy by spending too long doing the same task over and over.

Process – Post-Incident Review


The mirror image of triage was the ‘post-incident review’. Every ticket was reviewed by an experienced team member. Again, this was a process that took up about 5% of effort, but its benefits were also significant.

A standard form was filled out and any recommendations were added to a list of backlog ‘improvement’ tasks which could be prioritised. This gave us a number for technical/process debt that we wanted to look at.

Culture


I’ve mentioned culture a few times, and it’s what you always return to if you’re trying to enact any kind of change at all, since culture is at root a set of conceptual frameworks that underlie all our actions.

I’ve also mentioned that people often focus on the ‘wrong thing’. Time and again I hear people focus on tools and technology rather than culture. Yes, tools and technology are important, but if you’re not using them effectively then they are worse than useless. You can have the best golf clubs in the world, but if you don’t know how to swing and you’re playing baseball then they won’t help much.

Culture requires investment far more than technology does (I invested over half a year just writing documentation, remember). If the culture is right, people will look for the right tools and technology when they need to.

When given a choice about what to spend time and money on, always go for culture first. It cost me a lot of budget, but forcibly removing an ‘unhelpful’ team member was the best thing I did when I took over another team. The rest of the team flowered once he left, no longer stifled by his aggressive behaviour, and many things got done that didn’t before.

We also built a highly effective team with a budget so small that recruiters would phone me up to yell at me that what I was looking for was ‘impossible’. But by focussing on the right behaviours, investing time in the people we found, and having good processes in place, we got an extremely effective and loyal team that all went on to bigger and better things within and outside the company (but mostly within!).

Politics

A quick word on politics. You’ve got to pick your battles. You’re unlikely to get the resources you need, so drop the stuff that won’t get done to the floor.

Yes, you need a monitoring solution, better documentation, better-trained staff, more testing… but you are not going to get all these things unless you have a money machine, so pick the most important one and try to solve that first. If you try to improve all these things at once, you will likely fail.

After process, and documentation, I tried to crack the ‘reproducible environment’ puzzle. That led me to Docker, and a complete change of career. I talk about these things a little here and here.

Any Questions?

Reach me on twitter: @ianmiell

Or LinkedIn

My book Docker in Practice:

Get 39% off with the code: 39miell

 


Clustered VM Testing How-To

Recently I’ve been testing clusters of VMs running on my local host.

I thought that there must be a standard way to test multi-node VM setups, but asking around at work and on GitHub yielded no answers.

So I came up with my own solution, which I outline here.

ShutItFile

A ShutItFile is a superset of a Dockerfile that allows straightforward scripting of automation tasks.

Here’s an example of a ShutItFile that manipulates two VMs, and tests network connectivity between them.

It creates two machines (machine1 and machine2) and logs into them in turn using the ‘VAGRANT_LOGIN’ directive. On each machine it installs Python and sets up a simple Python HTTP server which serves the text ‘hi from machine1’ (on machine1) or ‘hi from machine2’ (on machine2).

It then tests that the output matches expectation from both machines using the ‘ASSERT_OUTPUT’ directive.

To demonstrate the ‘testing’ nature of the ShutItFile, a ‘PAUSE_POINT’ directive is included, which drops you into the run with a terminal, and a deliberately wrong ‘ASSERT_OUTPUT’ directive is included to show what happens when a test fails (and the terminal is interactive). This makes debugging a _lot_ easier.

DELIVERY bash

# Set up trivial webserver on machine1
VAGRANT_LOGIN machine1
INSTALL python
# Add file 
RUN echo 'hi from machine1' > /root/index.html
RUN nohup python -m SimpleHTTPServer 80 &
VAGRANT_LOGOUT

# Set up trivial webserver on machine2
VAGRANT_LOGIN machine2
INSTALL python
RUN echo 'hi from machine2' > /root/index.html
RUN nohup python -m SimpleHTTPServer 80 &
VAGRANT_LOGOUT

# Test machine2 from machine1
VAGRANT_LOGIN machine1
INSTALL python
RUN curl machine2
ASSERT_OUTPUT hi from machine2
VAGRANT_LOGOUT

# Test machine1 from machine2
VAGRANT_LOGIN machine2
INSTALL python
RUN curl machine1
ASSERT_OUTPUT hi from machine1
VAGRANT_LOGOUT

# Example debug
VAGRANT_LOGIN machine1
INSTALL python
PAUSE_POINT 'Have a look around, debug away'
# Trigger a 'failure'
RUN curl machine2
ASSERT_OUTPUT will never happen
VAGRANT_LOGOUT

To run this ShutItFile (which we call ‘ShutItFile.sf’ here), you run it like this:

# Install shutit
pip install shutit
shutit skeleton --shutitfile ShutItFile.sf \
    --name /tmp/shutitfile_build \
    --domain twovm.twovm \
    --delivery bash \
    --pattern vagrant \
    --vagrant_num_machines 2 \
    --vagrant_machine_prefix machine

The code for this example is available here.

Video

There’s a video of the above run here:

Create Your Own

If you want to create your own multinode test:

pip install shutit  #use sudo if needed, --upgrade if upgrading
shutit skeleton

Follow the instructions, choosing ‘shutitfile’ as the pattern, and ‘vagrant’ as the delivery method, eg:

$  shutit skeleton

# Input a name for this module.
# Default: /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers


# Input a ShutIt pattern.
Default: bash

bash:              a shell script
docker:            a docker image build
vagrant:           a vagrant setup
docker_tutorial:   a docker-based tutorial
shutitfile:        a shutitfile-based project (can be docker, bash, vagrant)

shutitfile


# Input a delivery method from: bash, docker, vagrant.
# Default: ' + default_delivery + '

docker:      build within a docker image
bash:        run commands directly within bash
vagrant:     build an n-node vagrant cluster

vagrant
# ShutIt Started... 
# Loading configs...
# Run:
cd /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers && ./run.sh
# to run.
# Or
# cd /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers && ./run.sh -c
# to run while choosing modules to build.

and follow the commands given (the ‘cd … && ./run.sh’ line above) to run.

Initially you are given empty ShutItFiles. You could start by adding the commands from the example here.

A cheatsheet for the various ShutItFile commands is available here.

Watch me do this here.

Real-world Usage

As an example of real-world usage, this technique is being used to regression test Chef recipes used to provision OpenShift.

The Chef scripts are here, and the regression tests are here.

 

 

 


Easy Shell Automation

Regular readers will be familiar with ShutIt, a framework I work on that allows me to automate all sorts of workflows and tools that I publish on GitHub.

This article demonstrates a new feature that uses this platform to make doing expect-type tasks trivial.

Embedded ShutIt

In response to a request, I recently added a feature which may be useful to others.

All this is available in python scripts if you:

pip install shutit

You can now automate interactions in python scripts. This script just gets the hostname and logs it:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
hostname_str = shutit_obj.send_and_get_output('hostname')
shutit_obj.log('Hostname is: ' + hostname_str, 
                loglevel=logging.CRITICAL)

Since ShutIt is a big wrapper/platform built on pexpect, it takes care of setting up the prompt, figuring out when the command is done, and a whole load of other stuff about terminals that you never want to worry about.

Log Into Server Example

This example logs into a server, taking the password from user input, and ensures git is installed on it before logging out:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password', ispass=True)
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
shutit_obj.install('git')
shutit_obj.logout()

ShutIt takes care of determining what package manager is on the host. If you’re not logged in as root it prompts you for a sudo password before attempting the install.

Pause Mid-Flight to Look Around

If you want to insert yourself in the middle of the run, you can add a ‘pause_point’, which will hand you back the terminal until you hit CTRL+[, after which it continues:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password', ispass=True)
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
shutit_obj.pause_point('Take a look around!')
shutit_obj.install('git')
shutit_obj.logout()

Send Commands Until Specific Output Seen

If you need to wait for something to happen, you can ‘send_until’ a regexp is seen in the output. This trivial example runs a command to wait 20 seconds and then create a file, and the ‘send_until’ command does not complete until the file is created.

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password', ispass=True)
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
shutit_obj.send('rm -f newfile && sleep 20 && touch newfile &')
shutit_obj.send_until('ls newfile | wc -l', '1')
shutit_obj.logout()

Challenge!

This can do a lot more, but I just want to give a flavour here.

I challenge you to give me a real-world automation task I can’t automate!

Ad

My book Docker in Practice:

Get 39% off with the code: 39miell


1-Minute Multi-Node VM Setup

tl;dr

Quickly spin up multiple VMs with useful DNSs on your local machine and automate complex environments easily.

Here’s a video:

Introduction

Maintaining Docker at scale, I’m more frequently concerned with clusters of VMs than the containers themselves.

The irony of this is not lost on me.

Frequently I need to spin up clusters of machines. Either this is very slow/unreliable (Enterprise OpenStack implementation) or expensive (Amazon).

The obvious answer to this is to use Vagrant, but managing this can be challenging.

So I present here a very easy way to set up a useful Vagrant cluster. With this framework, you can then automate your ‘real’ environment and play to your heart’s content.

$ pip install shutit
$ shutit skeleton
# Input a name for this module.
# Default: /Users/imiell/shutit_resins
[hit return to take default]
# Input a ShutIt pattern.
Default: bash
bash: a shell script
docker: a docker image build
vagrant: a vagrant setup
docker_tutorial: a docker-based tutorial
shutitfile: a shutitfile-based project
[type in vagrant]
vagrant
How many machines do you want (default: 3)? 3
[hit return to take default]
What do you want to call the machines (eg superserver) (default: machine)?
[hit return to take default]
Do you want to have open ssh access between machines? (default: yes) yes
Initialized empty Git repository in /Users/imiell/shutit_resins/.git/
Cloning into ‘shutit-library’...
remote: Counting objects: 1322, done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 1322 (delta 20), reused 0 (delta 0), pack-reused 1289
Receiving objects: 100% (1322/1322), 1.12 MiB | 807.00 KiB/s, done.
Resolving deltas: 100% (658/658), done.
Checking connectivity… done.
# Run:
cd /Users/imiell/shutit_resins && ./run.sh
to run.
[follow the instructions to run up your cluster.]
$ cd /Users/imiell/shutit_resins && ./run.sh

This will automatically run up an n-node cluster and then finish up.

NOTE: Make sure you have enough resources on your machine to run this!

BTW, if you re-run the run.sh it automatically clears up previous VMs spun up by the script to prevent your machine grinding to a halt with old machines.

Going deeper

What you can do from there is automate the setup of these nodes to your needs.

For example:

def build(self, shutit):
    [... go to end of this function ...]
    # Install apache
    shutit.login(command='vagrant ssh machine1')
    shutit.login(command='sudo su -')
    shutit.install('apache2')
    shutit.logout()
    shutit.logout()
    # Go to machine2 and call machine1's server
    shutit.login(command='vagrant ssh machine2')
    shutit.login(command='sudo su -')
    shutit.install('curl')
    shutit.send('curl machine1.vagrant.test')
    shutit.logout()
    shutit.logout()

This will set up an Apache server on the first machine and curl a request to it from the second.

Examples

This is obviously a simple example. I’ve used this approach for these more complex setups, which can be instructive and useful:

Chef server and client

Creates a chef server and client.

Docker Swarm

Creates a 3-node docker swarm

OpenShift Cluster

This one sets up a full OpenShift cluster using the standard Ansible scripts.

Automation of an etcd migration on OpenShift

This branch of the above code sets up OpenShift using the alternative Chef scripts, and migrates an etcd cluster from one set of nodes to another.

Docker Notary

Sets up a Docker Notary sandbox.

Help Wanted

If you have a need for an environment, or can improve the setup of any of the above please let me know: @ianmiell

Learn More

My book Docker in Practice:

Get 39% off with the code: 39miell
