Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites

tl;dr

For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a couple of years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re talked about usefully enough where I do read about them.

If you want to read something far longer, Google’s SRE book is a great resource.

Process

Process is essential to running and scaling an SRE operation. It’s the core of everything we achieved. When I joined the team, habits were bad – there was a ticketing system, but one-journal resolutions were not uncommon (‘Site down. Fixed, closing.’).

An SRE operation is basically a factory processing information and should act accordingly. You wouldn’t have a factory running without processes to take care of the movement of goods, and by the same token you shouldn’t have a knowledge-intensive SRE operation running without processes to take care of the movement of knowledge.

One frequent objection to process I heard is that it ‘stifles creativity’. In fact, effective process (bad process implemented poorly can mess anything up!) clears your mind to allow creative thought.

A great book on this subject is ‘The Checklist Manifesto’, which inspired many of the changes we made, and was widely read within the team. It cites the examples of the aviation industry’s approach to process, which enables remarkable creativity under stressful conditions by mental automation of routine operations. There’s even a film about one incident discussed and the pilot himself cited checklists and routine as an enabler of his fast-thinking creativity and control in that stressful situation. In fact, we used a similar process ourselves: in emergency situations, an experienced engineer would dive into finding a solution, while a more junior one would follow the checklist.

Another critique of process is that process can inhibit effective working and collaboration. It absolutely can if process is treated as an entity justified by its own existence rather than another living asset. The only thing that can guard against this is culture. More on that later.

Process – Tooling

The first thing to get right is the ticketing system. As with monitoring solutions, people obsess over which ticketing system is best, and they are wrong to. You will generally end up preferring whichever ticketing system you use, simply through familiarity. A ticketing system is only bad if it drives or encourages bad processes, and what counts as a bad process depends on the constraints of your business.

It’s far more important to have a ticketing system that functions reliably and supports your processes than to bend your processes to fit the ticketing system.

Here’s an example. We moved from RT to JIRA during my tenure. JIRA offered many advantages over RT, and I would generally recommend JIRA as a collaborative tool. The biggest problem we had switching, however, was the loss of some functionality we’d built into RT, which was critical to us. RT allowed us to get real-time updates on tickets, which meant that collaboration on incidents was somewhere between chat and ticketing. This record was invaluable in post-incident review. RT also allowed us to hide entries from customers, which again was really hard to lose. We got over it, but these things were surprisingly important because they’d become embedded in our process and culture.

When choosing or changing your ticketing system, think about what’s really important to operations, not specific features that seem nice when on a list. What’s important to you can vary from how nice it looks (seriously – your customers might take you more seriously, and your brand might be about good design), to whether the reporting tools are powerful.

Documentation

After process, documentation is the most important thing, and the two are intimately related.

There’s a book in documentation, because, again, people focus on the wrong things. The critical thing to understand is that documentation is an asset like any other. Like any business asset, documentation:

  • If properly looked after, will return investment many times over
  • Requires investment to maintain (like the fabric of a factory)
  • If out of date, costs money simply because it’s there (like out-of-date inventory)
  • If of poor quality, or not usable, is a liability, not an asset

But this is not controversial – few people disagree with the idea that good documentation is useful. The point is: what do you do about it?

Documentation – Where We Were

We were in a situation where documentation provided to us was not useful (eg from devs: ‘a network partition is not covered here as it is highly unlikely’. Well, guess what happened! And that was documentation they kindly bothered to write…), or we simply relied on previously-journalled investigations (by this time we were writing things down) to figure out what to do next time something similar happened.

This was frustrating all of us, and we spent a long time complaining about the documentation fairy not visiting us before we took responsibility for it ourselves.

Documentation – What I Did

Here’s what I did.

  • I took two years’ worth of priority incidents (ie those that triggered – or would have triggered – an out of hours call), and listed them. There were over 1700 of them.
  • Then I categorised them by type of issue.
  • Then I went through each type of issue and summarised the steps needed to either resolve, or get to a point where escalation was required

This took seven months of my full-time attention. I was a senior employee and I was costing my company lots of money to sit there and write. And because I had a clueful boss, I never got questioned about whether this was a good use of time. I was trusted (culture, again!). I would say it took four months before any dividends at all were seen from this effort. I remember this four-month period as a nerve-wracking time, as my attention was taken away from operations to what could have been a complete waste of my time and my employer’s money and an embarrassing failure.

Why not give it to an underling to do? For a few reasons. This was so important, and we had not done it before so I needed to know it was being done properly. I knew exactly what was needed, so I knew I could write it in such a way that it would be useful to me at the very least. I was also a relatively experienced writer (arts grad, former journalist), so I liked to think that that would help me write well.

We called these ‘Incident Models’ as per ITIL, but they can also be called ‘run books’, ‘crib sheets’, whatever. It doesn’t matter. What mattered was:

  • They were easy to find/search for
  • It was easy to identify whether you got a match
  • They were not duplicated
  • They could be trusted

We put this documentation in plain text within the ticketing system, under a separate JIRA project.

The documentation team got wind of what we were up to and tried to pressure us to use an internal wiki for this. We flat-out refused, and that was critical: the documentation system’s colocation with the ticketing system meant that searching and updating the documentation had no impedance mismatch. Because it was plain text it was fast, simple to update, and uncluttered. We resisted process that jeopardised the utility of what we were doing.

Documentation and the Criticality of De-cluttering

When we started, we designed a schema for these Incident Models which was a thing of beauty, covering every scenario and situation that could crop up.

In the end it was almost a complete waste of time. What we ended up using was a really dumb structure of:

  • Statement of problem
  • Steps 1-n of what to do
  • Further/deeper discussion, related articles

That was it. Attempts to structure it more thoroughly all failed as it was either confusing to newcomers, created too much administrative overhead, or didn’t cover enough. Some articles developed their own schema over time that was appropriate to the task, and new categories (eg the ‘jump-off’ article that told you which article to go to next) evolved over time. We couldn’t design for these things in advance because we didn’t know what would work or what would not.
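As a rough illustration of how little structure this is, the three-part layout can be sketched in Python. This is not the tooling we used (the models were plain text in the ticketing system); the class name, field names and the example incident are all invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentModel:
    """Hypothetical container for the three parts of an incident model."""
    problem: str                      # statement of problem
    steps: List[str]                  # steps 1-n of what to do
    further_discussion: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the model as plain text, ready to paste into a ticket."""
        lines = ["PROBLEM: " + self.problem, "", "STEPS:"]
        lines += ["%d. %s" % (n, step) for n, step in enumerate(self.steps, 1)]
        if self.further_discussion:
            lines += ["", "FURTHER DISCUSSION:"]
            lines += ["- " + item for item in self.further_discussion]
        return "\n".join(lines)


# Made-up example content, for illustration only.
model = IncidentModel(
    problem="Customer reports a settled bet was not paid out",
    steps=[
        "Search the ticketing system for open tickets with the same customer ID",
        "Check the settlement logs for the bet ID",
        "If no settlement record exists, escalate to the on-call tech lead",
    ],
    further_discussion=["Related: jump-off article for payment issues"],
)
print(model.render())
```

Anything beyond these three fields is exactly the kind of schema we found ourselves abandoning.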

Call it ‘agile documentation’ if you want – agile’s what sells these days (it was ITIL back then). Again, what was critical was that simplicity and utility trumped everything else.

There Is No Documentation Fairy

Having spent all this time and effort a couple of other things became clear regarding documentation.

 

First, we gave up accepting documentation from other teams. If they commented code, great; if there was something useful on the wiki for us to find, also great. But when it came to handing over projects, we stopped ‘asking for documentation’. Instead we’d arrange sessions with experienced SREs where the design of the project would be discussed.

Invariably (assuming they had no ops experience), the developer would focus on the things they’d built and how it worked – and these things were often the most thoroughly tested and least likely to fail.

By contrast, the SRE would focus on the weak points, the things that would go wrong. ‘What happens if the network gets partitioned? What if the database runs out of disk? Can we work out from the logs why the user didn’t get paid?’

We’d then go away and write our own documentation and get the engineer to sign off on it – the reverse of the traditional flow! They’d often make useful comments and give us added insights in the process.

The second thing we noticed was that our engineers were still reluctant to update the docs, even though only they were using them. There was still a sense that documentation should be given to them. The leadership had to constantly reinforce that this was their documentation, not tablets of stone handed down from on high, and that if they didn’t constantly maintain the docs, the docs would become useless.

This was a cultural problem and took a long time to undo. Undoing it also required the documentation changes to be reinforced by process.

In the end, I’d say about 10% of the ongoing working time was spent maintaining and writing documentation. After the initial 7-month burst, most of that 10% was spent on maintenance rather than producing new material.

Documentation – Benefits

After getting all this documentation done, we experienced benefits far in excess of the 10% ongoing cost. To call out a few:

  • Easier onboarding

Before this process started we were reluctant to take on less experienced staff. After, onboarding became a breeze. Among other things the training involved following incidents as they happened and shadowing more experienced staff. New staff were tasked with helping maintain docs, which helped them understand what gaps they had in their knowledge.

  • Better training

The docs gave us a resource that allowed us to identify training requirements. This ended up being a curriculum of tools and techniques that any engineer could aim to get a working knowledge of.

  • Less stress through simpler escalation

This was a big one. Before we had the step-by-step incident models, when to escalate was a stressful decision. Some engineers had a reputation for escalating early, and all were insecure about whether they’d ‘missed something obvious’ before calling a responsible tech lead out of hours. SREs would also get called out for not escalating early enough!

The incident models removed that problem. Pretty soon, the first question an escalated-to techie asked was ‘have you followed the incident model?’ If so, and something obvious had been missed, then the gaps in the model became clear and were quickly fixed. Soon, non-SREs were busy updating and maintaining the docs themselves for when they were escalated to. It became a virtuous circle.

  • Better discipline

The obvious value of documentation to the team helped improve discipline in other respects. Interestingly, SREs previously had the reputation for being the ‘loudest’ team – there was often a lot of ‘lively’ debate, and the team was very social – which made sense, as we relied on each other as a team to cover a large technical area, dealt with often non-technical customer execs, and sharing knowledge and culture was critical.

As time progressed, the team became quieter and quieter – partly due to the advent of chatrooms, increased remote working, and international teams, but also due to the fact that so much of the work became routine: follow the incident model, when you’re done, or don’t understand something, escalate to someone more senior.

  • Automation

Automating the investigations this way meant that the way was clear to further automate them with software.

Having metrics on which tickets were linked to which incident models meant that we knew where best to focus our effort. We wrote scripts to comb through log files in the background, make encoding issues quicker and simpler to figure out, automate responses to customers (‘Issue was caused by a change made by app admin user XXX’), and a lot more.
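To give a flavour of what such a script might look like, here is a minimal sketch in Python. It is not one of our actual scripts: the audit-log format, the regular expression and the function name are assumptions made up for the example. It scans log lines for the most recent admin change and drafts the customer response.

```python
import re

# Hypothetical audit-log format (an assumption for this sketch):
#   "<timestamp> ADMIN_CHANGE user=<name> setting=<key>"
ADMIN_CHANGE = re.compile(r"ADMIN_CHANGE user=(?P<user>\S+) setting=(?P<setting>\S+)")


def draft_customer_response(log_lines):
    """Return a draft reply naming the user behind the latest admin change.

    Returns None if no admin change appears in the log lines.
    """
    last_match = None
    for line in log_lines:
        match = ADMIN_CHANGE.search(line)
        if match:
            last_match = match
    if last_match is None:
        return None
    return "Issue was caused by a change made by app admin user %s (setting: %s)" % (
        last_match.group("user"),
        last_match.group("setting"),
    )


# Made-up log lines for illustration.
sample_log = [
    "2016-01-01T10:00:00 INFO request served",
    "2016-01-01T10:05:00 ADMIN_CHANGE user=alice setting=max_stake",
    "2016-01-01T10:06:00 ERROR bet rejected",
]
print(draft_customer_response(sample_log))
```

The point is less the code than the fact that the incident model had already reduced the investigation to steps simple enough to script.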

These automations inspired an automation tool we built for ourselves based on pexpect: http://ianmiell.github.io/shutit/ But that’s another story. Basically, once we got going it was a virtuous circle of continuous improvement.

Back to Process

Given you have all these assets, how do you prevent them from degrading in value over time? This is where process is critical.

Two processes were critical in ensuring everything continued smoothly: triage and post-incident review.

Process – Triage

5%-10% of time was spent on the triage process. Again, it took a long time to get the process right, but it resulted in massive savings:

  • Reduce the steps to the minimum useful steps

It’s so tempting to put as much as possible into your triage process, but it’s vital to keep the value in the process over completeness. Any step that is not often useful tends to get skipped over and ignored by the triager.

  • Focus on saving cost in the process

Looking for duplicates, finding the relevant incident model, reverting quickly to the customer, and escalating early all reduced the cost per ticket significantly. It also saved other engineers the context switch of being asked a question while they’re thinking about something else. It’s hard to evaluate the benefits of these items, but we were able to deal with increased volumes of incidents with fewer people and less difficulty. Senior management and customers noticed.

Recording the details of these efforts also saved time, as (for example) engineers given a triaged ticket could see that the triager searched for previous incidents with a string that maybe they could improve on. It also meant that more experienced staff could review the triage quality.

  • Review triage

Experienced staff need to review the triage process regularly to ensure it’s actually being applied effectively.

When I moved to another operations team (in a domain I knew far less about), I cut the incident queue in half in about 3 days, just by applying these techniques properly. The triage process was there, but it wasn’t being followed with any thought or oversight, and was given to a junior member of staff who was not the most capable. Big mistake. Triage must be done – or overseen – by someone with a lot of experience, as while it looks routine and mechanical it involves a lot of significant decisions that rest on experience in the field.

And yes, I was the new boss, and I chose to spend my first week doing the ‘lowly’ task of triage. That’s how important I thought it was.

  • Rota the task

No-one wants to do triage for long, so we rota’d it per week. This allowed some continuity and consistency, but stopped engineers from going crazy by spending too long doing the same task over and over.
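The duplicate search and the recording of triage details described above can be sketched as follows. This is an illustration only: the record fields, the similarity threshold and the use of Python’s difflib are assumptions for the example, not a description of what our ticketing system did.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List


@dataclass
class TriageRecord:
    """What the triager did, kept so the next engineer can see and improve on it."""
    ticket_id: str
    search_string: str
    possible_duplicates: List[str] = field(default_factory=list)


def triage(ticket_id: str, summary: str, open_tickets: Dict[str, str],
           threshold: float = 0.8) -> TriageRecord:
    """Record the search string used and any open tickets that look like duplicates.

    open_tickets maps ticket ID to summary. The threshold is an arbitrary
    starting point, not a tuned value.
    """
    record = TriageRecord(ticket_id=ticket_id, search_string=summary.lower())
    for other_id, other_summary in open_tickets.items():
        ratio = SequenceMatcher(None, record.search_string, other_summary.lower()).ratio()
        if ratio >= threshold:
            record.possible_duplicates.append(other_id)
    return record


# Made-up tickets for illustration.
open_tickets = {
    "OPS-101": "Login page returns 500 error",
    "OPS-102": "Payout report missing rows",
}
record = triage("OPS-103", "login page returns 500 error", open_tickets)
print(record)
```

Keeping the search string on the record is what lets a reviewer say ‘you would have found the duplicate if you had searched for X instead’.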

Process – Post-Incident Review

The mirror image of triage was the ‘post incident review’. Every ticket was reviewed by an experienced team member. Again, this was a process that took up about 5% of effort, but was also significant.

A standard form was filled out and any recommendations were added to a list of backlog ‘improvement’ tasks which could be prioritised. This gave us a number for technical/process debt that we wanted to look at.

Culture

I’ve mentioned culture a few times, and it’s what you always return to if you’re trying to enact any kind of change at all, since culture is at root a set of conceptual frameworks that underlie all our actions.

I’ve also mentioned that people often focus on the ‘wrong thing’. Time and again I hear people focus on tools and technology rather than culture. Yes, tools and technology are important, but if you’re not using them effectively then they are worse than useless. You can have the best golf clubs in the world, but if you don’t know how to swing and you’re playing baseball then they won’t help much.

Culture requires investment far more than technology does (I invested over half a year just writing documentation, remember). If the culture is right, people will look for the right tools and technology when they need to.

When given a choice about what to spend time and money on, always go for culture first. It cost me a lot of budget, but forcibly removing an ‘unhelpful’ team member was the best thing I did when I took over another team. The rest of the team flowered once he left, no longer stifled by his aggressive behaviour, and many things got done that didn’t before.

We also built a highly effective team with a budget so small that recruiters would phone me up to yell at me that what I was looking for was ‘impossible’. But by focussing on the right behaviours, investing time in the people we found, and having good processes in place, we got an extremely effective and loyal team that all went on to bigger and better things within and outside the company (but mostly within!).

Politics

A quick word on politics. You’ve got to pick your battles. You’re unlikely to get the resources you need, so drop the stuff that won’t get done to the floor.

Yes, you need a monitoring solution, better documentation, better trained staff, more testing… you are not going to get all these things unless you have a money machine, so pick the most important and try and solve that first. If you try and improve all these things at once, you will likely fail.

After process, and documentation, I tried to crack the ‘reproducible environment’ puzzle. That led me to Docker, and a complete change of career. I talk about these things a little here and here.

Any Questions?

Reach me on twitter: @ianmiell

Or LinkedIn

My book Docker in Practice:

Get 39% off with the code: 39miell2

 

Posted in Uncategorized | 14 Comments

Clustered VM Testing How-To

Recently I’ve been testing clusters of VMs running on my local host.

I thought that there must be a standard way to test multi-node VM setups, but asking around at work, and on github yielded no answers.

So I came up with my own solution, which I outline here.

ShutItFile

A ShutItFile is a superset of a Dockerfile that allows straightforward automation of tasks.

Here’s an example of a ShutItFile that manipulates two VMs, and tests network connectivity between them.

It creates two machines (machine1 and machine2) and logs into them in turn using the ‘VAGRANT_LOGIN’ directive. On each machine it installs python and sets up a simple python http server which serves the text ‘hi from machine1’ (from machine1) or ‘hi from machine2’ (from machine2).

It then tests that the output matches expectation from both machines using the ‘ASSERT_OUTPUT’ directive.

To demonstrate the ‘testing’ nature of the ShutItFile, a ‘PAUSE_POINT’ directive is included, which drops you into the run with a terminal, and a deliberately wrong ‘ASSERT_OUTPUT’ directive is included to show what happens when a test fails (and the terminal is interactive). This makes debugging a _lot_ easier.

DELIVERY bash

# Set up trivial webserver on machine1
VAGRANT_LOGIN machine1
INSTALL python
# Add file 
RUN echo 'hi from machine1' > /root/index.html
RUN nohup python -m SimpleHTTPServer 80 &
VAGRANT_LOGOUT

# Set up trivial webserver on machine2
VAGRANT_LOGIN machine2
INSTALL python
RUN echo 'hi from machine2' > /root/index.html
RUN nohup python -m SimpleHTTPServer 80 &
VAGRANT_LOGOUT

# Test machine2 from machine1
VAGRANT_LOGIN machine1
INSTALL python
RUN curl machine2
ASSERT_OUTPUT hi from machine2
VAGRANT_LOGOUT

# Test machine1 from machine2
VAGRANT_LOGIN machine2
INSTALL python
RUN curl machine1
ASSERT_OUTPUT hi from machine1
VAGRANT_LOGOUT

# Example debug
VAGRANT_LOGIN machine1
INSTALL python
PAUSE_POINT 'Have a look around, debug away'
# Trigger a 'failure'
RUN curl machine2
ASSERT_OUTPUT will never happen
VAGRANT_LOGOUT
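For readers who have not met these directives before, the check performed by each ‘RUN curl’ / ‘ASSERT_OUTPUT’ pair above can be mimicked on a single host in plain Python. This is only an analogy built on the standard library, not ShutIt code: it serves the same text from a local port instead of a VM, and the handler and function names are invented for the sketch.

```python
import http.server
import threading
import urllib.request


class HiHandler(http.server.BaseHTTPRequestHandler):
    """Serve a fixed greeting, standing in for index.html on machine1."""
    greeting = b"hi from machine1\n"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(self.greeting)))
        self.end_headers()
        self.wfile.write(self.greeting)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


def fetch_from_local_server():
    """Start the server on a free port, fetch '/', and return the body."""
    # Port 0 asks the OS for any free port, so no root access is needed.
    server = http.server.HTTPServer(("127.0.0.1", 0), HiHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An empty ProxyHandler stops any configured proxy intercepting localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.read().decode().strip()
    finally:
        server.shutdown()
        server.server_close()


output = fetch_from_local_server()
# The equivalent of 'ASSERT_OUTPUT hi from machine1'
assert output == "hi from machine1", output
print(output)
```

The ShutItFile does the same thing across real machines, which is what makes it useful for testing networking between nodes rather than just the server itself.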

To run this ShutItFile (called ‘ShutItFile.sf’ here), run:

# Install shutit
pip install shutit
shutit skeleton --shutitfile ShutItFile.sf \
    --name /tmp/shutitfile_build \
    --domain twovm.twovm \
    --delivery bash \
    --pattern vagrant \
    --vagrant_num_machines 2 \
    --vagrant_machine_prefix machine

The code for this example is available here.

Video

There’s a video of the above run here:

Create Your Own

If you want to create your own multinode test:

pip install shutit  #use sudo if needed, --upgrade if upgrading
shutit skeleton

Follow the instructions, choosing ‘shutitfile’ as the pattern, and ‘vagrant’ as the delivery method, eg:

$  shutit skeleton

# Input a name for this module.
# Default: /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers


# Input a ShutIt pattern.
Default: bash

bash:              a shell script
docker:            a docker image build
vagrant:           a vagrant setup
docker_tutorial:   a docker-based tutorial
shutitfile:        a shutitfile-based project (can be docker, bash, vagrant)

shutitfile


# Input a delivery method from: bash, docker, vagrant.
# Default: ' + default_delivery + '

docker:      build within a docker image
bash:        run commands directly within bash
vagrant:     build an n-node vagrant cluster

vagrant
# ShutIt Started... 
# Loading configs...
# Run:
cd /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers && ./run.sh
# to run.
# Or
# cd /space/git/shutitfile/examples/vagrant/simple_two_machine/shutit_sabers && ./run.sh -c
# to run while choosing modules to build.

and follow the commands given (after ‘# Run:’ in the output above) to run.

Initially you are given empty ShutItFiles. You could start by adding the commands from the example here.

A cheatsheet for the various ShutItFile commands is available here.

Watch me do this here.

Real-world Usage

As an example of real-world usage, this technique is being used to regression test Chef recipes used to provision OpenShift.

The Chef scripts are here, and the regression tests are here.

 

 

 


Easy Shell Automation

Regular readers will be familiar with ShutIt, a framework I work on that allows me to automate all sorts of workflows and tools that I publish on GitHub.

This article demonstrates a new feature that uses this platform to make doing expect-type tasks trivial.

Embedded ShutIt

In response to a request, I recently added a feature which may be useful to others.

All this is available in python scripts if you:

pip install shutit

You can now automate interactions in python scripts. This script just gets the hostname and logs it:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
hostname_str = shutit_obj.send_and_get_output('hostname')
shutit_obj.log('Hostname is: ' + hostname_str, 
                loglevel=logging.CRITICAL)

Since ShutIt is a big wrapper/platform built on pexpect, it takes care of setting up the prompt, figuring out when the command is done, and a whole load of other stuff about terminals that you never want to worry about.

Log Into Server Example

This example logs into a server, taking the password from user input, and ensures git is installed on it before logging out:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
# Prompt for connection details; only the password needs to be hidden
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password: ', ispass=True)
# Log in over ssh, make sure git is installed, then log out
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
shutit_obj.install('git')
shutit_obj.logout()

ShutIt takes care of determining what package manager is on the host. If you’re not logged in as root it prompts you for a sudo password before attempting the install.

Pause Mid-Flight to Look Around

If you want to insert yourself in the middle of the run, you can add a ‘pause_point’, which will hand you back the terminal until you hit CTRL+], after which it continues:

import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
# Prompt for connection details; only the password needs to be hidden
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password: ', ispass=True)
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
# Hand the terminal to the user before carrying on with the install
shutit_obj.pause_point('Take a look around!')
shutit_obj.install('git')
shutit_obj.logout()

Send Commands Until Specific Output Seen

If you need to wait for something to happen, you can ‘send_until’ a regexp is seen in the output. This trivial example runs a command to wait 20 seconds and then create a file, and the ‘send_until’ command does not complete until the file is created.

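import shutit_standalone
import logging
shutit_obj = shutit_standalone.create_bash_session()
# Prompt for connection details; only the password needs to be hidden
username = shutit_obj.get_input('Input username: ')
server = shutit_obj.get_input('Input server: ')
password = shutit_obj.get_input('Input password: ', ispass=True)
shutit_obj.login('ssh ' + username + '@' + server,
                 password=password)
# Create the file in the background after 20 seconds
shutit_obj.send('rm -f newfile && sleep 20 && touch newfile &')
# Keep sending the command until its output matches the regexp '1'
shutit_obj.send_until('ls newfile | wc -l','1')
shutit_obj.logout()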
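If you are curious about the idea behind ‘send_until’, here is a minimal standalone sketch of the same retry pattern using only the Python standard library. It is not ShutIt’s implementation, which works inside a pexpect session and may differ in detail; the function name, retry count and pause are assumptions for the example.

```python
import re
import subprocess
import time


def run_until(command, pattern, retries=10, pause=1.0):
    """Re-run a shell command until its output matches a regexp.

    Returns True as soon as the pattern is seen in stdout, or False
    if the retries run out first.
    """
    regexp = re.compile(pattern)
    for _ in range(retries):
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if regexp.search(result.stdout):
            return True
        time.sleep(pause)
    return False


# Trivial usage: the first attempt already matches.
print(run_until("echo ready", r"ready", retries=3, pause=0.1))
```

What this sketch leaves out is everything ShutIt does for you around it: running inside the logged-in session, handling prompts, and logging each attempt.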

Challenge!

This can do a lot more, but I just want to give a flavour here.

I challenge you to give me a real-world automation task I can’t automate!


1-Minute Multi-Node VM Setup

tl;dr

Quickly spin up multiple VMs with useful DNS names on your local machine and automate complex environments easily.

Here’s a video:

Introduction

Maintaining Docker at scale, I’m more frequently concerned with clusters of VMs than the containers themselves.

The irony of this is not lost on me.

Frequently I need to spin up clusters of machines. Either this is very slow/unreliable (Enterprise OpenStack implementation) or expensive (Amazon).

The obvious answer to this is to use Vagrant, but managing this can be challenging.

So I present here a very easy way to set up a useful Vagrant cluster. With this framework, you can then automate your ‘real’ environment and play to your heart’s content.

$ pip install shutit
$ shutit skeleton
# Input a name for this module.
# Default: /Users/imiell/shutit_resins
[hit return to take default]
# Input a ShutIt pattern.
Default: bash
bash: a shell script
docker: a docker image build
vagrant: a vagrant setup
docker_tutorial: a docker-based tutorial
shutitfile: a shutitfile-based project
[type in vagrant]
vagrant
How many machines do you want (default: 3)? 3
[hit return to take default]
What do you want to call the machines (eg superserver) (default: machine)?
[hit return to take default]
Do you want to have open ssh access between machines? (default: yes) yes
Initialized empty Git repository in /Users/imiell/shutit_resins/.git/
Cloning into 'shutit-library'...
remote: Counting objects: 1322, done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 1322 (delta 20), reused 0 (delta 0), pack-reused 1289
Receiving objects: 100% (1322/1322), 1.12 MiB | 807.00 KiB/s, done.
Resolving deltas: 100% (658/658), done.
Checking connectivity... done.
# Run:
cd /Users/imiell/shutit_resins && ./run.sh
to run.
[follow the instructions to run up your cluster]
$ cd /Users/imiell/shutit_resins && ./run.sh

This will automatically run up an n-node cluster and then finish up.

NOTE: Make sure you have enough resources on your machine to run this!

BTW, if you re-run the run.sh it automatically clears up previous VMs spun up by the script to prevent your machine grinding to a halt with old machines.

Going deeper

What you can do from there is automate the setup of these nodes to your needs.

For example:

def build(self, shutit):
    # [... existing generated code; add the following at the end of this function ...]
    # Install apache on machine1 (log in as the vagrant user, then become root)
    shutit.login(command='vagrant ssh machine1')
    shutit.login(command='sudo su -')
    shutit.install('apache2')
    shutit.logout()
    shutit.logout()
    # Go to machine2 and call machine1's server
    shutit.login(command='vagrant ssh machine2')
    shutit.login(command='sudo su -')
    shutit.install('curl')
    shutit.send('curl machine1.vagrant.test')
    shutit.logout()
    shutit.logout()

This will set up an Apache server on the first machine and curl a request to it from the second.

Examples

This is obviously a simple example. I’ve used this for the more complex setups below, which can be instructive and useful:

Chef server and client

Creates a chef server and client.

Docker Swarm

Creates a 3-node docker swarm

OpenShift Cluster

This one sets up a full OpenShift cluster, setting it up using the standard ansible scripts.

Automation of an etcd migration on OpenShift

This branch of the above code sets up OpenShift using the alternative Chef scripts, and migrates an etcd cluster from one set of nodes to another.

Docker Notary

Setting up of a Docker notary sandbox.

Help Wanted

If you have a need for an environment, or can improve the setup of any of the above please let me know: @ianmiell


Migrating an OpenShift etcd Cluster

Summary

Following on from my previous post setting up an OpenShift cluster in Vagrant, this post discusses migrating an etcd cluster within a live OpenShift instance to newer servers.

Moving a standalone etcd cluster is relatively straightforward, but when it’s part of an OpenShift cluster — and especially one that’s live and operational — it is a little more involved.

The ordering of actions is important and there are several aspects to consider when planning such a move:

  • Config management preparation
  • Stopping the cluster
  • Creation and distribution of certificates
  • Data migration
  • Update of OpenShift config
  • Update of config management

Here we are using Ansible to provision and maintain the environment.

You can also use Chef to manage your OpenShift cluster.


Code

The code for this is here:

Video

Here’s a video of the upgrade process:

Steps

VM Setup

This section of the code sets up the VMs using Vagrant.

Cluster Setup

The next section sets up the OpenShift cluster. It:

  • sets up ssh access across all the hosts
  • writes the ansible hosts config file
  • triggers the ansible playbook
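As a rough illustration of the 'writes the ansible hosts config file' step, here is a minimal sketch that renders an INI-style inventory with one group per role. The group names follow the openshift-ansible convention as I understand it, the hostnames in the usage note are hypothetical, and the variables a real playbook run needs are left out, so treat it as a starting point rather than the file the code actually writes.

```python
from typing import List


def render_inventory(masters: List[str], nodes: List[str], etcd: List[str]) -> str:
    """Render a minimal INI-style Ansible inventory with one group per role."""
    # Group names are an assumption based on openshift-ansible conventions.
    groups = {"masters": masters, "nodes": nodes, "etcd": etcd}
    lines = ["[OSEv3:children]"]
    lines.extend(groups)  # the child group names, in insertion order
    for name, hosts in groups.items():
        lines.append("")
        lines.append(f"[{name}]")
        lines.extend(hosts)
    return "\n".join(lines) + "\n"
```

Calling render_inventory(['master1.vagrant.test'], ['node1.vagrant.test'], ['etcd1.vagrant.test']) returns the text of the file, which can then be written out on the host that runs the playbook.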

Take a Backup

Take a backup of etcd on all three nodes, just in case.
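A sketch of how the backup commands could be built per node, assuming the etcd v2 'etcdctl backup' subcommand. The data directory and backup root below are assumptions, not paths taken from the code, and the hostnames are whatever you pass in.

```python
import shlex
from typing import Dict, List


def backup_command(data_dir: str, backup_dir: str) -> List[str]:
    """Build the etcdctl (v2 syntax) backup command for one node."""
    return ["etcdctl", "backup", "--data-dir", data_dir, "--backup-dir", backup_dir]


def backup_plan(
    hosts: List[str],
    data_dir: str = "/var/lib/etcd",            # assumed default data directory
    backup_root: str = "/var/lib/etcd-backup",  # hypothetical backup location
) -> Dict[str, str]:
    """Map each host to the shell command that backs up its etcd data."""
    plan = {}
    for host in hosts:
        cmd = backup_command(data_dir, f"{backup_root}/{host}")
        plan[host] = " ".join(shlex.quote(part) for part in cmd)
    return plan
```

Each command string would then be run on its own node, for example with shutit.send after logging in to that node.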

Stop the Cluster

Generate New Certs

For each new node, run the commands to generate its certs, and copy them to the node.

Add etcd Nodes One-By-One

Again for each node:

  • add the new node to the etcd cluster
  • go to the node
  • install etcd
  • extract the certificates
  • update the etcd config
  • restart etcd

NOTE: If you have a lot of data in your cluster, you will want to give the new node ample time to receive the data from the other nodes. In this trivial example, there is little data to transfer. Alternatively, you can copy over the data from one of the original nodes.
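The per-node steps above can be sketched roughly as follows, assuming etcd v2 'etcdctl member add' syntax and the standard ETCD_* environment settings. The member names and URLs are hypothetical, and the TLS and endpoint flags a secured cluster needs are omitted.

```python
from typing import Dict, List


def member_add_command(name: str, peer_url: str) -> List[str]:
    """etcdctl (v2 syntax) command that registers a new member with the cluster."""
    return ["etcdctl", "member", "add", name, peer_url]


def new_member_env(name: str, members: Dict[str, str]) -> Dict[str, str]:
    """Settings for the joining node's etcd config.

    `members` maps member name to peer URL and must already include the
    new node as well as the members currently in the cluster.
    """
    initial_cluster = ",".join(f"{n}={url}" for n, url in members.items())
    return {
        "ETCD_NAME": name,
        "ETCD_INITIAL_CLUSTER": initial_cluster,
        # 'existing' tells etcd to join a running cluster, not bootstrap a new one.
        "ETCD_INITIAL_CLUSTER_STATE": "existing",
    }
```

Adding the nodes one at a time matters here: each new member's ETCD_INITIAL_CLUSTER has to list the cluster as it stands at the moment that member joins.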

Drop the Old Members

Now drop the old members from the cluster and remove etcd from those hosts:
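A minimal sketch of the drop step, assuming the etcd v2 'etcdctl member list' output format (a hex ID, a colon, then name=... fields) and the 'etcdctl member remove' subcommand. The member names are hypothetical and TLS flags are again left out.

```python
import re
from typing import Dict, List

# Assumed line shape: "<hex id>: name=<name> peerURLs=... clientURLs=..."
MEMBER_LINE = re.compile(r"^(?P<id>[0-9a-f]+): name=(?P<name>\S+)")


def parse_member_ids(member_list_output: str) -> Dict[str, str]:
    """Map member name to member ID from 'etcdctl member list' output."""
    ids = {}
    for line in member_list_output.splitlines():
        match = MEMBER_LINE.match(line.strip())
        if match:
            ids[match.group("name")] = match.group("id")
    return ids


def removal_commands(member_list_output: str, old_names: List[str]) -> List[List[str]]:
    """Build one 'member remove' command per old member still in the cluster."""
    ids = parse_member_ids(member_list_output)
    return [["etcdctl", "member", "remove", ids[n]] for n in old_names if n in ids]
```

Names that no longer appear in the member list are skipped, so the sketch is safe to re-run after a partial removal.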

Update the Master Config and Bring the OpenShift Cluster Back Up

The /etc/origin/master/master-config.yaml file needs to be updated to reflect the new etcd cluster before bringing the OpenShift cluster back up.
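As a sketch of the change involved, the function below swaps the etcd URLs in an already-parsed master config. It assumes the URLs live under etcdClientInfo.urls, which is how I recall the OpenShift Origin 3.x file being laid out, and that the client port is 2379. Loading and writing the YAML is left out because it needs a third-party library, and the hostnames are hypothetical.

```python
import copy
from typing import Any, Dict, List


def with_new_etcd_urls(
    master_config: Dict[str, Any],
    etcd_hosts: List[str],
    port: int = 2379,  # assumed etcd client port
) -> Dict[str, Any]:
    """Return a copy of the parsed master config pointing at the new etcd hosts."""
    updated = copy.deepcopy(master_config)
    # 'etcdClientInfo' / 'urls' are assumed key names; check them against your file.
    client_info = updated.setdefault("etcdClientInfo", {})
    client_info["urls"] = [f"https://{host}:{port}" for host in etcd_hosts]
    return updated
```

Working on a copy keeps the original config available for comparison, or for rollback if the masters fail to come back up.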

Update Config Manager and Re-Run

A Complete OpenShift Cluster on Vagrant, Step by Step

tl;dr

Following on from my Kubernetes post here, I have automated an OpenShift Origin cluster using the same tools.

Video

Here is a video of the whole process.

It gets (relatively) interesting later on, as a lot of the process is Vagrant starting up and yum installs failing on bad mirrors. Also, Ansible needs to be run several times for it to work (I suspect due to resource limitations, see Gotchas below).

Architecture

Here is a layout of the VMs. The host uses the landrush plugin to allow transparent DNS lookup from the host, and between boxes.

OpenShift Vagrant Cluster VM Layout

Code

The code is here:

Run Yourself

You will need at least 6.5G spare memory (maybe more) on your host. Even then it may struggle to provision in a timely way.

Do get in touch if you think you can help improve it.

Tech Used

  • Vagrant (Virtualbox)
  • ShutIt
  • Ansible

I am interested in porting to libvirt also. Please get in touch if you want to help.

Why?

One of the big problems with running OpenShift in production is the complexity of each environment. You can have test, UAT and prod environments, but sometimes you want to quickly spin up a realistic environment for development or

At that point you’re usually offered an ‘all-in-one’ or single-command setup, which, while very convenient, doesn’t represent the reality of the system you’re running elsewhere.

This is less didactic than the Kubernetes post (the steps to set up take a good while to run even if you’re using ansible…) but still has its uses.

Because this is in vagrant and is automated, it gives you a reliable, fast, and realistic representation of a real live infrastructure. This comes in very handy if you’re trying to determine the memory usage of etcd, the effect of tuning some config variables, or failover scenarios.

Gotchas

Here are some of the things I had to overcome to make this work. They’re fairly instructive:

Learn Kubernetes the Hard Way (the Easy and Cheap Way)

tl;dr

Building on Kelsey Hightower’s fantastic work exposing the internals of Kubernetes by setting up Kubernetes on public cloud providers, I’ve automated all the steps to set up a cluster on your local machine, with a walkthrough mode that takes you through step-by-step. Watch a video here (the interesting stuff happens from about 3 minutes in):

It’s free?

There is no charge as it will run on your host, but you need 2G of memory spare on your host by default.

It helps if you have Virtualbox and Vagrant already installed (works on Mac too!), although the script will try and set this up for you.

How do I run it?

Here are the commands to run it yourself:

sudo pip install shutit
git clone --recursive https://github.com/ianmiell/shutit-k8s-the-hard-way
cd shutit-k8s-the-hard-way
./walkthrough.sh

What’s going on?

Here’s a diagram of the setup.

The host runs Vagrant and Virtualbox. Each box inside the host box (the big rectangle) represents a virtual machine. There are workers (which run the pods), controllers (which run the Kubernetes cluster), a client (which has the Kubernetes binaries installed on it) and a load balancer (which represents the entry point to the cluster).

Is it safe?

All work (including the Kubernetes client commands) is done within your locally-provisioned VMs, so it won’t install crazy things on your machine.

How Does it Work?

The script uses ShutIt to automate the steps to bring up the cluster and walk through the build. Contact me for more info: @ianmiell

Code

The code is here:

Help Wanted

I’m sure this can be improved, both in terms of the functionality elucidated once the cluster is up, as well as the descriptions in the notes.

Please help to contribute if you can!
