Continuous Delivery: a blog sharing experience of working with Continuous Delivery, Test Driven Development, Architecture and Agile Methodologies. By Tomas Riha.

Service Simulation rather than Integration Testing of Micro Services (2016-10-02)<h3>
Micro Services or Distributed Monolith</h3>
One of the main characteristics of a Micro Service architecture is that each Micro Service has its own lifecycle. This is very important, as it is the only way to break the monolith pattern. If the lifecycle of a Micro Service is tied to another entity such as a Service, System or Solution, then we have a distributed monolith and not a Micro Service architecture.<br />
<br />
I've found that almost everyone agrees in theory, but in practice everyone still wants to do integration testing before release. Integration Testing is in our DNA; we have been doing it for years and don't really know any other way.<br />
<br />
Let's explore the problem of Integration Testing a little bit. Say we have Micro Service X, and it is consumed by Micro Services Y and Z. Then there is no denying there is a dependency between X, Y and Z. Part of a Micro Service architecture is API versioning and backward compatibility. But even with these practices and good test coverage on component X, it's hard to deny that the integration between X, Y and Z can still fail. But if we rely on Integration Testing to find these failures, then we create a dependency between the teams developing Y and Z and the team developing X. That means X no longer has a lifecycle of its own.<br />
<br />
Hence if we do rely on Integration Testing we have a distributed monolith, which imho is the only thing worse than a monolith.<br />
<br />
<h3>
Consumer Contract Testing</h3>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZndQ4NoD0ARRFKcPM3sZ9MXlYxDOS0IE6BhjFt6wK1YsnB7aAR5DAmwHSH6g9zqnlYVXYHi9L2yBCitx3rhyphenhyphenl5_D7odgL5Hk-I1_eKKKkH5N-_YhylrUpJBWdawz4NkYLPtAWDQuqEMN2/s1600/ConsumerDrivenContracts.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZndQ4NoD0ARRFKcPM3sZ9MXlYxDOS0IE6BhjFt6wK1YsnB7aAR5DAmwHSH6g9zqnlYVXYHi9L2yBCitx3rhyphenhyphenl5_D7odgL5Hk-I1_eKKKkH5N-_YhylrUpJBWdawz4NkYLPtAWDQuqEMN2/s320/ConsumerDrivenContracts.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">http://martinfowler.com/articles/consumerDrivenContracts.html</td></tr>
</tbody></table>
Martin Fowler talks about Contract Testing and Consumer Driven Contracts which is a great pattern. In short it means that the developers of Micro Services Y and Z provide contract tests for Micro Service X. The tests provided by the developers of Y and Z correspond to what parts of the X API they consume. These tests are executed as part of the test suite belonging to service X. This way the developers of X know if they broke any backwards compatibility.<br />
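As a sketch of what such a test can look like, the contract test below checks only the fields that consumer Y actually depends on. The service name, fields and helper are hypothetical illustrations, not the author's actual services; in practice a dedicated tool such as Pact is often used for this.

```python
# A minimal consumer-driven contract test, as the team behind consumer Y
# might hand over to the team behind provider X. All names and fields
# here are made up for illustration.

def get_order(order_id):
    # Stand-in for provider X's real endpoint; in a real suite this
    # would be an HTTP call against a deployed instance of X.
    return {"id": order_id, "status": "SHIPPED", "total": 125.0}

def test_order_contract_for_consumer_y():
    # Consumer Y only depends on "id" and "status", so the contract
    # asserts exactly that subset -- extra fields in X's response are fine.
    response = get_order("42")
    assert response["id"] == "42"
    assert response["status"] in {"NEW", "SHIPPED", "CANCELLED"}

test_order_contract_for_consumer_y()
```

The point is that the provider's test suite runs this file, so a backwards-incompatible change to X fails X's own build rather than Y's.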
<br />
<br />
<div>
I think this is a great pattern and one that helps developers maintain the right amount of backwards compatibility with a fast feedback mechanism. Still I am not sure it is enough. At least in our organization everyone still wants to Integration Test.</div>
<div>
<br /></div>
<div>
There are still many integration failures that can occur outside the component tests of X. In production-like environments there can be deployments in different networks, service discovery and other factors at play.</div>
<div>
<br /></div>
<h3>
Service Simulation</h3>
<div>
One solution is to introduce Service Simulation. Instead of implementing test automation we can implement simulator bots. These bots continuously exercise our services. This creates a constant and even load on our solution, driven by the bots. The monitoring of our services is used to notify the developers if something failed. No alarms for a given time period after a deploy means our solution still works, the redeployment of our Micro Service didn't break anything, and we can go on to deploy into the next environment.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXqugghbIfXqM0K88CMRjnsuc4ub4DPHjyFtoFXod4shBy8lNymzTZPXeEJeXn1uEK4Ww8l1hskxWBuB_7hPJ6qjv3aAMqhyh4KNOwTxuSA3Rd_lwk1SpMyHmLwyhLs1m7s7MWpsRx1Ry_/s1600/servicesimulation.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="207" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXqugghbIfXqM0K88CMRjnsuc4ub4DPHjyFtoFXod4shBy8lNymzTZPXeEJeXn1uEK4Ww8l1hskxWBuB_7hPJ6qjv3aAMqhyh4KNOwTxuSA3Rd_lwk1SpMyHmLwyhLs1m7s7MWpsRx1Ry_/s320/servicesimulation.png" width="320" /></a>It's also relatively easy to implement this as part of the continuous delivery build pipe: deploy into an environment, check the alarms in that environment for five minutes, and if all is green then the Micro Service is verified and the Continuous Delivery implementation continues.<br />
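The "deploy, watch the alarms, then continue" gate can be sketched roughly like this. The alarm-state interface is an assumption for illustration; a real implementation would query the monitoring system (for example CloudWatch alarms) instead of an injected callable.

```python
import time

def verify_deploy(get_alarm_state, watch_seconds=300, poll_seconds=30,
                  clock=time.monotonic, sleep=time.sleep):
    """Gate a pipeline step on monitoring: return True only if no alarm
    fires during the watch window after a deploy. get_alarm_state is a
    hypothetical callable returning 'OK' or 'ALARM'; clock and sleep are
    injected so the gate itself can be unit tested."""
    deadline = clock() + watch_seconds
    while clock() < deadline:
        if get_alarm_state() != "OK":
            return False  # alarm fired: stop the pipe, notify the team
        sleep(poll_seconds)
    return True  # quiet for the whole window: promote to next environment
```

In a build pipe this would sit right after the deploy step; a False return stops promotion to the next environment.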
<br />
It's not necessary to have an advanced Continuous Delivery Engine for this pattern to work; alarm notifications to team channels on Slack are powerful by themselves.<br />
<br />
Another important thing this solves is that the developers and testers start using production grade monitoring to verify services. I think it's a very big obstacle in a DevOps transformation that Developers and Testers view Test Reports to understand a system, while these aren't available in production. Developers and Testers are usually lost when it comes to production systems. This moves their understanding towards runtime operations of a system, which bridges the gap between Dev and Ops in a nice way.<br />
<br />
Bots can be deployed into any environment, providing the same process of deploy and verify in all environments. Personally I like the idea of having a Bot-workload-only environment as the first full featured environment. The now obsolete Integration Testing environment could easily be converted into a Bot only environment.<br />
<br />
While a Bot can basically be implemented in a number of ways, I like the idea of using Function as a Service such as AWS Lambda to implement Service Simulation Bots. The scheduled nature and high specialization of a Bot is a perfect fit for a FaaS workload.</div>
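A simulation bot in this style might look roughly like the following. The endpoint, metric name and health-check shape are made up for illustration; only the handler(event, context) signature follows the AWS Lambda Python convention, and the bot logic is kept behind injected callables so it can be unit tested without a network.

```python
import json
import urllib.request

SERVICE_URL = "https://example.internal/orders/health"  # hypothetical endpoint

def run_bot(fetch, emit_metric):
    """One scheduled bot run: exercise the service, report the outcome.
    fetch and emit_metric are injected so the bot logic stays testable."""
    try:
        status, body = fetch(SERVICE_URL)
        ok = status == 200 and json.loads(body).get("status") == "UP"
    except Exception:
        ok = False
    # Alarms are driven off this metric; no alarms for a while after a
    # deploy means the redeployment didn't break anything.
    emit_metric("simulation.order_service.ok", 1 if ok else 0)
    return ok

def handler(event, context):
    # AWS Lambda entry point, triggered by a schedule rule.
    def fetch(url):
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read()
    # A real bot would push to the monitoring system instead of printing.
    return run_bot(fetch, lambda name, value: print(name, value))
```

The same bot code can run in every environment, which is what makes the deploy-and-verify process uniform from the first test environment to prod.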
<div>
<br /></div>
<div>
<h3>
Test Pyramid</h3>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLGuCpif-_C30sKu5wC5UbGoObMcPruBEHT-xoicfyxYvVcndn9DCLwr9dB9OSQPURAs5GZouYaEK0-5DM3qba5rJ_XwAFVt_RyCf661PaYIFjYbZue4xjfYwx3aJCvMwRZzcIb6YZX4wq/s1600/testpyramid.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="249" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLGuCpif-_C30sKu5wC5UbGoObMcPruBEHT-xoicfyxYvVcndn9DCLwr9dB9OSQPURAs5GZouYaEK0-5DM3qba5rJ_XwAFVt_RyCf661PaYIFjYbZue4xjfYwx3aJCvMwRZzcIb6YZX4wq/s320/testpyramid.png" width="320" /></a></div>
<br />
<ul>
<li>Simulation - Bots executing in all environments; monitoring and alarms verify the application.<br /></li>
<li>Component & Contract Tests - Deployed functional tests using mocks to stub consumed services, plus contract tests provided by the developers consuming this service.<br /></li>
<li>Unit Tests - Well, nothing new there.</li>
</ul>
<br />
<br />
I think this is what we need in order to release our micro services with high confidence and with lifecycles of their own.<br />
<br />
<br /></div>
Optimizing Runtime Cost through Continuous Delivery (2016-01-07)<div style="margin-bottom: .0001pt; margin: 0cm;">
<span style="font-size: 13.5pt;">We have always had the need to understand our application runtime costs. In legacy enterprises this has often been done at a finance level, and someone in the Runtime Operations department has been responsible for that cost. There have been different ways of distributing the costs to the right organization: either each application has had its own infrastructure and owned the full cost of it, or applications have shared infrastructure in order to save money, utilize large servers in a better way, or just to minimize setup and maintenance work on the servers.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">In many legacy enterprises there has been a huge disconnect between the application evolution and the infrastructure evolution. Application development has changed towards micro services, with a lot of small applications with lifecycles of their own. The legacy enterprise's runtime operations are largely unable to deliver on these changes due to processes, licenses and hardware leases of huge "enterprise scale servers" for the old monoliths. Requests of several servers per micro service have been waved off as unrealistic.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">Several ways around this have been invented in order to cohost multiple applications on the same servers in a somewhat isolated way. Docker draws so much attention partially because legacy enterprises can still keep their big old servers and isolate the micro services on them. (Of course that is not the only good thing about Docker.)</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">Regardless of how this was solved, understanding runtime cost becomes exponentially harder with micro services in legacy enterprise infrastructure.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">When transitioning to DevOps and Micro Services each team is responsible for delivering its services end to end in all environments, with the right functionality, the right performance and, imho, at the right cost. So what is the right cost? Before even answering that question, let's start with what the cost is. The DevOps team needs to know the cost of its services. In the legacy enterprise that cost might, in the best case, arrive as some kind of cost split formula calculated by a finance guy based on how many micro services shared the servers, a cost split formula for any license costs, and a cost split formula for the man hours to maintain the servers. At best a team would get this information once a month.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">Part of the reason we want DevOps is so that the team has the full competence and ability to improve its services. This needs to include the runtime costs of the services. A performance optimization of a service that cuts resource requirements by 25% should result in lower runtime costs for the service. In the legacy world of cost splits and financial formulas to calculate cost, this just doesn't happen.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">Thankfully we have cloud providers. Not only do we get the right resources to handle our services when we need them, but we get billed real money for them.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">With auto scaling, elasticity and all the other nice features we get, there is also a risk: we get away with writing increasingly bad applications. Bad performance? Doesn't matter as long as it scales horizontally. We can IO-block ourselves to hell and back, and as long as we scale horizontally we get away with it. Well, as long as we don't have to take responsibility for the cost of our service. This is why it's so important for the DevOps team to take responsibility for the runtime cost of the service.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">For a few years I have had a dream to be able to stamp a runtime cost on a load test and correlate the performance measured in the test with the cost of the runtime environment used to run the test. We are still not there with our applications, since we run on AWS Auto Scaling Groups and AWS bills us per started hour. This makes the billing data too blunt to give the correlation between performance and cost from a shorter test. With a Micro Service architecture that uses AWS services such as Lambda, DynamoDB and Kinesis this would actually be achievable today.</span></div>
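With per-request billing, stamping a cost on a load test becomes simple arithmetic. A rough sketch of the idea, where all prices and numbers are placeholder assumptions and not actual AWS rates:

```python
def load_test_cost(requests, avg_ms, memory_gb,
                   price_per_request, price_per_gb_second):
    """Cost of one load test against a per-request-billed service.
    Prices must be supplied by the caller; nothing here reflects
    real AWS pricing."""
    gb_seconds = requests * (avg_ms / 1000.0) * memory_gb
    return requests * price_per_request + gb_seconds * price_per_gb_second

# Hypothetical example: 100k requests at 120 ms average on 0.5 GB functions.
cost = load_test_cost(100_000, 120, 0.5,
                      price_per_request=2e-7, price_per_gb_second=1.7e-5)
```

Run against two builds with the same load profile, the difference in this number is the cost impact of a performance change, which is exactly the correlation described above.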
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">With this vision in mind we have been able to integrate the runtime costs of our Micro Services in AWS with our Continuous Delivery as a Service implementation (Delivery Engine) in another way. For us, DevOps Teams are the owners of services and Solutions are the drivers of the cost. So everything that we launch in AWS we tag with the name of the micro service, the owning team and the cost driving solution.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">This allows our DevOps Teams to see the cost of their services.</span></div>
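With every resource tagged, team-level cost views boil down to grouping billing rows by tag. A minimal sketch, where the row shape and the tag names are made-up illustrations of a tagged billing export:

```python
from collections import defaultdict

def cost_by_tag(billing_rows, tag):
    """Sum billing rows per value of one tag. Rows only mimic the shape
    of a tagged billing export; they are not a real AWS format."""
    totals = defaultdict(float)
    for row in billing_rows:
        totals[row["tags"].get(tag, "untagged")] += row["cost"]
    return dict(totals)

# Illustrative daily rows for one team.
rows = [
    {"cost": 12.50, "tags": {"service": "order", "team": "alpha"}},
    {"cost": 3.10,  "tags": {"service": "invoice", "team": "alpha"}},
    {"cost": 7.40,  "tags": {"service": "order", "team": "alpha"}},
]
```

Grouping the same rows by "service" gives the per-service top list, and grouping by "team" gives the team total shown in the graph.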
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYaCNh5xUF6yRqqnv8FT6jrry1QKxnJ3DsYsHGWCFLP8VRyQJ1b05vfHXkfZXU_MI5lxpaLe37zJrC5vKH3QKbwKUU0UtLNAj102BeJ7zXKcFu4p-zbY7uZyCQ2rmJ5iqeM5c03f2YHqdy/s1600/de-cost-2.png" imageanchor="1"><img border="0" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYaCNh5xUF6yRqqnv8FT6jrry1QKxnJ3DsYsHGWCFLP8VRyQJ1b05vfHXkfZXU_MI5lxpaLe37zJrC5vKH3QKbwKUU0UtLNAj102BeJ7zXKcFu4p-zbY7uZyCQ2rmJ5iqeM5c03f2YHqdy/s400/de-cost-2.png" width="400" /></a></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">Here we provide a graph of total costs for the services owned by a DevOps team. The services are listed in a top list below the graph, along with a long running trend of the cost of each component. This allows the Product Owners and Team Members to act on increasing costs. Links are provided to the Dashboard Page of each service.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">The Service Dashboard provides the version history of every version built in the Delivery Engine. Our Delivery Engine builds the service, black box tests the deployed service, load tests it, bakes AWS AMIs for it and launches it into the right environments. On the dashboard we visualize the total cost of the service grouped by environment (delivery engine, exploratory testing, qa and prod environments).</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH5fhN23FF1zQOie4aPxo9IcTmJRatnUWwXcGE2DGcNEPAEfGTIGAOirlZtbL9Q5n2uHnYWWqW_kg1EpfbyMAkDZ00jin_LU_dQjclQGTwl3JDoS-MWjgpIXezEpsyCMbOm5xX96i10CjZ/s1600/de-cost-0.png" imageanchor="1"><img border="0" height="282" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH5fhN23FF1zQOie4aPxo9IcTmJRatnUWwXcGE2DGcNEPAEfGTIGAOirlZtbL9Q5n2uHnYWWqW_kg1EpfbyMAkDZ00jin_LU_dQjclQGTwl3JDoS-MWjgpIXezEpsyCMbOm5xX96i10CjZ/s400/de-cost-0.png" width="400" /></a></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">From the same dashboard we visualize service usage. The example below is a visualization of CPU consumption across an auto scaling group.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjziLPnBT7grJ9h5ID2vlSSolSpHFv_nJiFBH1X3TfFTrqL1O03IQJlxyXEoZsymEV7XHAzqTTVqNR8jCLlNUQ6DRMAdxKKK_75zfW3foyFLA4S05Z7cVvhRNp0KDHZZJaRkLGNKkBdKWi0/s1600/de-cost-1.png" imageanchor="1"><img border="0" height="282" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjziLPnBT7grJ9h5ID2vlSSolSpHFv_nJiFBH1X3TfFTrqL1O03IQJlxyXEoZsymEV7XHAzqTTVqNR8jCLlNUQ6DRMAdxKKK_75zfW3foyFLA4S05Z7cVvhRNp0KDHZZJaRkLGNKkBdKWi0/s400/de-cost-1.png" width="400" /></a></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">This way we allow the team to ensure that it has the right scaling rules for its services and the right amount of resources in each environment.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<span style="font-size: 13.5pt;">By visualizing the cost of the runtime environment for each service and combining it with Continuous Delivery, Continuous Performance Testing and DevOps, we allow our developers to constantly tweak and improve performance and optimize the cost of our runtime environments. The lead time in the cost reporting is still at the "next day" level, as we report the cost per day and per month, but I still think this is a reasonable feedback loop when it comes to cost optimizations.</span></div>
<div style="margin: 0cm 0cm 0.0001pt;">
<br /></div>
<br />
<div class="MsoNormal">
<br /></div>
Pipes as Code (2015-02-27)

Finally we have started to move away from having build pipes as a chain of Jenkins jobs. A lot has been written on the subject of CI systems not being well suited to implement CD processes. Let me first give a short recap of why, before I get into how we now deliver our Pipes as Code.<br />
<br />
First of all, pipes in CI systems have bad portability. They are usually a chain of jobs set up either through a manual process or through some sort of automation based on an API provided by the CI system. The inherited problem here is that the pipe executes in the CI system. This means that it is very hard to test and develop a pipe using Continuous Delivery. Yes, we need to use Continuous Delivery when implementing our Continuous Delivery tooling, otherwise we will not be able to deliver our CD processes in a qualitative, rapid and reliable way.<br />
<br />
Then there is the problem of the data that we collect during the pipe. By default the data in a CI system is stored in that CI system, often on disk on that instance of the CI server. Adding insult to injury, navigation of the build data is often tied to the current implementation of the build pipe. This means that a change to the build pipe means that we can no longer access the build data.<br />
<br />
For a few years now we have been offloading all the build data into different types of storage depending on what type of data it is. Metadata around the build we store in a custom database. Logs go to our ELK stack, metrics to Graphite and reports to S3.<br />
<br />
Still we have had trouble delivering quality Pipes. Now that has changed.<br />
<br />
We still use a CI server to trigger the Pipe. On the CI server we now have one job, "DoIt". The "DoIt" job executes the right build pipe for every application. Let's talk a bit about how we pick the pipe.<br />
<br />
Each git repo contains a YML file that says how we should build that repo. That's more or less the only thing that has to be in the repo for us to start building it. We listen to all the Gerrit triggers and ignore all repos without the YML file.<br />
<br />
The YML is pretty much just:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">pipe: application-pipe</span><br />
<span style="font-family: Courier New, Courier, monospace;">jdk: JDK8</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
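Picking the pipe from such a descriptor could be sketched like this. Our actual tooling is Groovy; this Python sketch also fakes the YML parsing, handling only the flat key-value form shown above, where a real implementation would use a proper YAML parser.

```python
def pick_pipe(repo_file_text):
    """Decide how to build a repo from its build descriptor. Returns the
    pipe name, or None when the repo has no descriptor (or no pipe entry)
    and the trigger should be ignored. Only handles flat 'key: value'
    lines, as in the example above."""
    if repo_file_text is None:
        return None  # no YML in the repo: ignore the Gerrit trigger
    config = {}
    for line in repo_file_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            config[key.strip()] = value.strip()
    return config.get("pipe")
```

The "DoIt" job then maps the returned name to a pipe definition and runs it, so one generic CI job serves every repo.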
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">We describe our build pipes in YML and implement our tasks in Groovy. Here is a simple definition. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">build: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> first:</span><br />
<span style="font-family: Courier New, Courier, monospace;">    - do: setup.Clean</span><br />
<span style="font-family: Courier New, Courier, monospace;">    - do: setup.Init</span><br />
<span style="font-family: Courier New, Courier, monospace;"> main:</span><br />
<span style="font-family: Courier New, Courier, monospace;">    - do: build.Build</span><br />
<span style="font-family: Courier New, Courier, monospace;">    - do: test.Test</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> - do: log.ReportBuildStatus</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> - do: notify.Email</span><br />
<br />
Each task has a lifecycle of first, main, last. The first section is always executed, and all of the "do's" in the first section are executed regardless of result. In the main section the "do's" are only executed if everything has gone well so far. Last is always executed regardless of how things went.<br />
<br />
The "do's" are references to Groovy classes with the first mandatory part of the package stripped. So there is a com.something.something.something.setup.Clean class.<br />
<br />
A Context object is passed through all the execute methods of the "do's". By setting context.mock=true the main executing process adds the suffix "Mock" to all "do's". This allows us to unit test the build pipe in order to assert that all the steps that we expect to happen do happen, in the correct order.<br />
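A sketch of these lifecycle semantics, including the Mock-suffix trick, could look like this in Python. The real engine resolves Groovy classes by name and passes a richer Context; a plain dict of callables and a dict context stand in here as assumptions.

```python
def run_task(task, registry, context):
    """Execute one task's first/main/last sections. 'do' names map to
    classes in the real engine; a dict of callables stands in here.
    With context['mock'] set, 'Mock' is appended to every do-name so a
    unit test can assert the execution order without side effects."""
    failed = False
    for section in ("first", "main", "last"):
        for step in task.get(section, []):
            if section == "main" and failed:
                continue  # main do's only run while everything is green
            name = step["do"] + ("Mock" if context.get("mock") else "")
            try:
                registry[name](context)
            except Exception:
                failed = True  # first and last keep running regardless
    return not failed
```

A unit test then registers Mock do's that record their names and asserts the recorded order, which is exactly what context.mock enables.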
<br />
When a lot of things start happening it's not really practical to have a build task be all that verbose, especially since we have multiple pipes that share the same build task. So we can create a "build.yml" and a "notify.yml" which we can then include like this:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">build: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref : build</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> - do: notify</span><br />
<br />
<span style="font-family: inherit;">So this is how our build pipes look, and we can unit test the pipe, the tasks and each "do" implementation.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Looking at a full pipe example we get something like this.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">init:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: init</span><br />
<span style="font-family: Courier New, Courier, monospace;">build:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> parallel:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> build:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: build.deployable</span><br />
<span style="font-family: Courier New, Courier, monospace;"> provision:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: provision.create-test-environment </span><br />
<span style="font-family: Courier New, Courier, monospace;">deploy: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: deploy.deploy-engine</span><br />
<span style="font-family: Courier New, Courier, monospace;">test: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> </span><span style="font-family: 'Courier New', Courier, monospace;">functional-test</span><span style="font-family: Courier New, Courier, monospace;">: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: test.functional-tests</span><br />
<span style="font-family: Courier New, Courier, monospace;"> load-test: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: test.load-tests</span><br />
<span style="font-family: Courier New, Courier, monospace;">release: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> parallel:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> release:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: release.publish-to-nexus</span><br />
<span style="font-family: Courier New, Courier, monospace;"> bake:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: bake.ami-with-packer</span><br />
<span style="font-family: Courier New, Courier, monospace;">last: </span><br />
<span style="font-family: Courier New, Courier, monospace;"> parallel:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> deprovision:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ref: </span><span style="font-family: 'Courier New', Courier, monospace;">provision.</span><span style="font-family: 'Courier New', Courier, monospace;">destroy-test-environment</span><br />
<span style="font-family: Courier New, Courier, monospace;">  end:</span><br />
<span style="font-family: Courier New, Courier, monospace;">   ref: end</span><br />
<br />
That's it!<br />
<br />
This pipe builds, functional tests, load tests and publishes our artifacts as well as baking images for our AWS environments. All the steps report to our meta data database, elk, graphite, s3 and slack.<br />
<br />
And of course we use our build pipes to build our build pipe tooling.<br />
<br />
Continuous Delivery of Continuous Delivery through build Pipes as Code. High score on the buzzword bingo!

Continuous Deployment in the Cloud Part 2: The Pipeline Engine in 100 lines of code (2014-07-07)

As I talked about in my previous post in this series, we need to treat our <a href="http://continuous-delivery-and-more.blogspot.se/2014/06/continuous-deployment-in-cloud-part1.html">Continuous Delivery process as a distributed system</a>, and as part of that we need to move the Pipe out of Jenkins and into a first class citizen of its own. Aside from the fact that a CI tool is a very bad Continuous Delivery/Deploy orchestrator, I find the potential of executing my pipe from anywhere in the cloud very tempting.<br />
<br />
If my pipe is a first class citizen of its own and executable on any plain old node, then I can execute it anywhere in the cloud. Which means I can execute it locally on my laptop, on a simple minion in my own cloud, or in one of the many managed CI services that have surfaced in the cloud.<br />
<br />
To accomplish this we need five very basic and simple things:<br />
<br />
<ol>
<li>a pipeline configuration that defines what tasks to execute for that pipe</li>
<li>a pipeline engine that executes tasks</li>
<li>a library of task implementations</li>
<li>a definition of what pipe to use with my artefact</li>
<li>a way of distributing the pipeline engine to the node where I want to execute my pipeline</li>
</ol>
Let's have a look.<div>
<br /></div>
<div>
<b>Define the pipeline</b></div>
<div>
<b><br /></b></div>
<div>
The pipeline is a relatively simple process that executes tasks. For the purpose of this blog series, and for simple small deliveries, sequential execution of tasks can be sufficient, but at my work we run a lot of parallel sub pipes to improve the throughput of the test parts. So our pipe should be able to handle both.</div>
<div>
<br /></div>
<div>
We also want to be able to run the pipe to a certain stage. Like from start to regression test, and then step by step launch in QA and launch in Prod. Obviously, if we want to do continuous deployment we don't need to worry too much about that capability. But I include it just to cover a bit more scope.</div>
<div>
<br /></div>
<div>
Defining pipelines is no real rocket science, and in most cases a few archetype pipelines will cover some 90% of the pipes needed for a large scale delivery. So I like to define a few flavours of pipes, which gives us the ability to distribute a base set of pipes from CI to CD.</div>
<div>
<br /></div>
<div>
Once we have defined the base set of pipe flavours, each team should configure which pipe they want to handle their deliverables.</div>
<div>
<br /></div>
<div>
I define my pipes something like this.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">name: Strawberry</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">pipe: </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - do:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Build</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - do:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Deploy A</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Test A</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Deploy B</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Test B</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> parallel: true </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - do:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Publish</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: mock.Dummy</span></div>
<div>
</div>
</div>
<div>
<span style="font-family: Times, Times New Roman, serif;">A pipe named Strawberry builds our service, deploys it in two parallel branches where it executes two test suites, and finally publishes the artefacts to our artefact repo. At this stage each task is just executed with a Dummy task implementation.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b>The pipeline engine</b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">We need a mechanism that understands our yml config and links it to our library of executable tasks.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">I use Groovy to build my engine, but it can just as easily be built in any language. </span><span style="font-family: Times, 'Times New Roman', serif;">I've intentionally stripped out some of the logging, but this is basically it. In about 80 lines of code we have an engine that loads the tasks defined in a yml file, executes them in serial or parallel, and can run all tasks, the tasks up to a given point, or a single task.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">@Log</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">class BalthazarEngine {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int start(Map context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def status = 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def definition = context.get "balthazar-pipe" </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> for (def doIt: definition["pipe"] ){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> status = executePipe(doIt, context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return status</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int executePipe(Map doIt, Map context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def status = 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (doIt.parallel == true){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> status = doItParallel(doIt,context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> } else {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> status = doItSerial(doIt,context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return status</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int doItSerial(def doIt, def context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def status = 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> for (def tasks : doIt.do.tasks){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> status = executeTasks(tasks, context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return status</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">  def int doItParallel(def doIt, def context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    //collect the status reported by every parallel branch</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    def statuses = [].asSynchronized()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    def threads = []</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    for (def tasks : doIt.do.tasks){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">      def cloneContext = deepcopy(context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">      def cloneTasks = deepcopy(tasks)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">      threads.add(Thread.start {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">        statuses.add(executeTasks(cloneTasks, cloneContext))</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">      })</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    //wait for every branch, not just the last thread started</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    threads.each { it.join() }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    //return the first failing status if any branch failed</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    def failed = statuses.find { it != BalthazarTask.TASK_SUCCESS }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">    return failed == null ? BalthazarTask.TASK_SUCCESS : failed</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">  }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int executeTasks(def tasks, def context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def status = 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> for (def task : tasks){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">      //execute if the run-task is not specified or if run-task equals this task</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (!context["run-task"] || context["run-task"] == task.name){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> log.info "execute ${task.name}"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context["this.task"] = task</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def impl = loadInstanceForTask task;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> status = impl.execute context </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (status != BalthazarTask.TASK_SUCCESS){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> break</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (context["run-to"] == task.name){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> log.info "Executed ${context["run-to"]} which is the last task, done executing."</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> break</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return status</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def loadInstanceForTask(def task){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def className = "balthazar.tasks.${task.type}"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def forName = Class.forName className</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return forName.newInstance()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }<span class="Apple-tab-span" style="white-space: pre;"> </span></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def deepcopy(orig) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def bos = new ByteArrayOutputStream()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def oos = new ObjectOutputStream(bos)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> oos.writeObject(orig); oos.flush()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def bin = new ByteArrayInputStream(bos.toByteArray())</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def ois = new ObjectInputStream(bin)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return ois.readObject()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
A common question I get is "why not implement it as a lifecycle in Maven or Gradle?". Well, I want a process that can support building with Maven, Gradle or any other tool, for any other language. Also, as soon as we use another tool to do our job (be it a build tool, a CI server or whatever) we need to adapt to its definition of how it executes its processes. Maven has its lifecycle stages quite rigidly defined and I find it a pain to redefine them. Jenkins has its pre, build and post stages, where it is a pain to share variables. And so on. But most importantly: use build tools for what they do well and CI tools for what they do well, and neither of those things is implementing CD pipes.</div>
<div>
<br /></div>
<div>
<b>Task library</b></div>
<div>
<b><br /></b></div>
<div>
We need tasks for our pipe engine to execute. The interface for a task is simple.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">public interface BalthazarTask {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> int execute(Map<String, Object> context);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
Then we just implement them. For my purposes I package tasks as "balthazar.tasks.<type>.<task>" and just reference the type and task in my yml. </div>
<div>
<br /></div>
<div>
Writing tasks in a custom framework, rather than as jobs in Jenkins, is a joy. You no longer need to work around tooling limitations for simple things such as setting variables during execution. </div>
<div>
<br /></div>
<div>
Anything you want to share you just put it on the context.</div>
<div>
<br /></div>
<div>
Here is an example of how two tasks share data.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks: </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Initiate Pipe</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: init.Cerebro</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - tasks: </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> - name: Build</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> type: build.Gradle</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> command: gradle clean fatJar</span></div>
</div>
<div>
<br /></div>
<div>
I have two tasks. The first creates a new version of the artefact we are building in my master data repository, which I call Cerebro (more on Cerebro in the next post). Cerebro is the master of all my build data and hence my version numbers come from there. So the init.Cerebro task takes the version from Cerebro and puts it on the context.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">@Log</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">class Cerebro implements BalthazarTask {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> @Override</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int execute(Map<String, Object> context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def affiliation = context.get("cerebro-affiliation")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def hero = context.get("cerebro-hero")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def key = System.env["CEREBRO_KEY"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def reincarnation = CerebroClient.createNewHeroReincarnation(affiliation, key, hero)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context.put("cerebro-reincarnation",reincarnation)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return TASK_SUCCESS</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
My build.Gradle task takes the version number from Cerebro (called reincarnation) and sends it to the build script. As you can see I can use custom commands, and in this case I do, since fat JARs are what I build; by default the task runs gradle build. I can also define what log level I want my Gradle script to run at.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">@Log</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">class Gradle implements BalthazarTask {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> @Override</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int execute(Map<String, Object> context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def affiliation = context["cerebro-affiliation"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def hero = context["cerebro-hero"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def reincarnation = context["cerebro-reincarnation"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def command = context["this.task"]["command"] == null ? "gradle build": context["this.task"]["command"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def loglevel = context["this.task"]["loglevel"] == null ? "" : "--${context["this.task"]["loglevel"]}"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def gradleCommand = """${command} ${loglevel} -Dcerebro-affiliation=${affiliation} -Dcerebro-hero=${hero} -Dcerebro-reincarnation=${reincarnation}"""</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def proc = gradleCommand.execute() </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> proc.waitFor() </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> return proc.exitValue()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">This is how hard it is to build tasks (jobs) when it's done with code instead of configured in a CI tool. Sure, some tasks, like building Amazon AMIs, take a bit more code (j/k, they don't). But OK, a launch task that implements a rolling deploy on Amazon using an A/B release pattern does; I will come back to that specific case.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b>Configure my repository</b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b><br /></b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">So I have a build pipe executor, pre built build pipes and tasks that execute in them. Now I need to configure my repository.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">In my experience 90% of your teams will be able to use prefab pipes, without you investing too much effort into building tons of prefabs. A few CI pipes, a few simple CD pipes and a few parallelized pipes should cover a lot of demand, provided you put a clean interface between the deploy tasks and the deploy tools, and between the test tasks and the test tools.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">So in my repo I have a .balthazar.yml which contains:</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">balthazar-pipe: Strawberry</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b>Distributing the pipeline engine and the task library</b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b><br /></b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">The first thing we need is a Balthazar client that starts the engine using the configuration provided inside the repository. A simple Groovy script does the trick.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">@Log</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">class BalthazarRunner {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def int start(Map<String, Object> context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> Yaml yaml = new Yaml()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (!context){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def projectfile = new File(".balthazar.yml") </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> if (projectfile.exists()){</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context = yaml.load projectfile.text</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> } else {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> throw new Exception("No .balthazar.yml in project")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> def name = context.get "balthazar-pipe" </span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> def definition = yaml.load this.getClass().getResource("/processes/${name}.yml").text </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> BalthazarEngine engine = new BalthazarEngine()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context["balthazar-pipe"] = definition</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context["run-to"] = System.properties["run-to"]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> context["run-task"] = System.properties["run-task"]</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> return engine.start(context)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">}</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">def runner = new BalthazarRunner()</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">runner.start([:])</span></div>
</div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><b><br /></b></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">Now we need to distribute the client, our engine and our library of tasks, together with our code repository, to the node where we want to execute the pipeline. This can be done in many ways.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">We can package Balthazar as an installable package and install it using yum or a similar tool. This works quite well on build servers but it limits where we can run it, as we need "that installer" present on the target environment. In many cases it really isn't a problem: if you're a Debian shop then you have your deb everywhere, and if you're a Red Hat shop then you have your yum. </span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">I personally opted for another way of distributing the client, partially because I'm lazy and partially because it works in a lot of environments. When I create my .balthazar.yml I also check out the Balthazar client project as a git submodule.</span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">So all my projects have a .balthazar.yml and a balthazar-client folder. In the client folder I have a balthazar.sh and a build.gradle file. I use Gradle to fetch the latest artefacts from my repo and then the shell script does the java -jar part. </span><span style="font-family: Times, 'Times New Roman', serif;">Not all that pretty, but it works.</span></div>
<div>
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div>
<span style="font-family: Times, 'Times New Roman', serif;"><b>Summary</b></span></div>
<div>
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div>
<span style="font-family: Times, 'Times New Roman', serif;">So now on all my repos I can just do...</span></div>
<div>
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">> ./balthazar-client/balthazar.sh</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">... and I run the pipe configured in .balthazar.yml on that repo. Since all my tasks integrate with my Build Data Repository I get ONE view of all my pipe executions regardless of where they were executed.</span></div>
<div>
<br /></div>
<div>
<span style="font-family: Times, Times New Roman, serif;">CD made fun! Cheers!</span></div>
<div>
<br /></div>
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-21887212132825318662014-06-27T13:29:00.001+02:002014-06-27T13:29:52.116+02:00Continuous Deployment in the Cloud Part1: The Distributed Continuous X processThis is the first part of the Blog Series <i>"Continuous Deployment in the Cloud".</i><br />
<i><br /></i>
When we started doing Continuous Delivery, many of us built the process around a CI Server, and many ran into problems building their pipelines with Jenkins or other CI Tools. There are several reasons for these problems; these two blog posts, http://www.cloudsidekick.com/blog/stretch-armstrong.html and http://www.alwaysagileconsulting.com/pipeline-antipattern-deployment-build/, outline them really well.<br />
<br />
<b>CI Server is a bad Continuous Delivery/Deployment Orchestrator</b><br />
<br />
Personally I'd boil the core of the problem down to a lack of portability and of separation of concerns.<br />
<br />
If you model the process in a CI Tool then the process will never be portable. Even if you can distribute the process across multiple instances of the CI Tool, through various means of generating and publishing the process definition, you always need an instance of that tool to run the process. This makes development quite hard, since you need a local development instance of that tool.<br />
<br />
In most CI Tools the data collected from each job is stored in the CI Tool itself. To make matters even worse, it's often stored in that particular instance of the CI Tool. This means that the only way to access the data gathered by the jobs of the pipe is by navigating that instance of the pipe on that instance of the CI Server. This makes it very hard to distribute the Continuous X process over multiple CI Servers; we can use master/slave setups, but the problem persists with multiple masters.<br />
<br />
Since the data is often tied to the implementation of the pipe, it becomes very hard to visualise historical data. Even if the current layout of the pipe has changed, we still want to be able to visualise old pipe runs of our system.<br />
<br />
Another problem that arises is data retention. As both historical data and visualisation are tied to the CI Tool, we need to manage disk space on the CI Tools, and we don't want to mix runtime and historical data in the CI Tool.<br />
<br />
<br />
<b>Separating Process Implementation and Process Data Storage</b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixQg82nKUCiRBMKZWSnQwebHzzeDXiNVIcTDvGrb57VeroEoUWJPDZqWk7IpYbMCsjVVdlVSC_FdPO2JNMR4SEIRCA3ZhusNuaH-m_ZtsrA0cgIQESoQumroDWxN1p_j9jP_1W5w7TE3yg/s1600/PortableCDProcess1.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixQg82nKUCiRBMKZWSnQwebHzzeDXiNVIcTDvGrb57VeroEoUWJPDZqWk7IpYbMCsjVVdlVSC_FdPO2JNMR4SEIRCA3ZhusNuaH-m_ZtsrA0cgIQESoQumroDWxN1p_j9jP_1W5w7TE3yg/s1600/PortableCDProcess1.png" /></a></div>
So the first thing we have to deal with in order to distribute our Continuous X process is to move the data out of the process implementation.<br />
<br />
In fact the first problem we encounter when distributing a Continuous X process is the Version Number.<br />
What do we use as a version number and where do we get it?<br />
<br />
<ul>
<li>Using the CI Server Build Number is extremely bad, as you can't even reset your Build Job without encountering problems. </li>
<li>Using a property checked into your source code repository, such as the version number in the Maven pom, is almost worse. There is a time gap between the repo fetch, the version update and the commit back to the repository; if other jobs start in this time frame they will get the same version number.</li>
</ul>
<br />
So the answer is: we get it from a central Build Data Repository, a single database that keeps track of all our deliverables, their versions and their state. By delegating responsibility for the version number to the Build Data Repository, we ensure that the Version Number is created through an atomic update.<br />
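The atomic update is the key property. Here is a minimal sketch of how a Build Data Repository could allocate version numbers atomically. This is my illustration, not the author's implementation: it is in-memory, and a real repository would back the counter with a database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical Build Data Repository: version allocation is a single atomic
// increment per deliverable, so two pipes starting at the same instant can
// never be handed the same version number.
public class BuildDataRepository {
    private final Map<String, AtomicLong> versions = new ConcurrentHashMap<>();

    public long nextVersion(String deliverable) {
        return versions
                .computeIfAbsent(deliverable, k -> new AtomicLong())
                .incrementAndGet();
    }

    public static void main(String[] args) {
        BuildDataRepository repo = new BuildDataRepository();
        System.out.println(repo.nextVersion("service-x")); // 1
        System.out.println(repo.nextVersion("service-x")); // 2
        System.out.println(repo.nextVersion("service-y")); // 1
    }
}
```

Note that there is no fetch-then-commit gap here: `computeIfAbsent` plus `incrementAndGet` make the whole allocation one atomic step, which is exactly what the Maven-pom approach above cannot give you.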
<br />
The first thing our Process Implementation does is get its version number from the Build Data Repository. Then everything we do during the process is reported to the Build Data Repository.<br />
<br />
So, for example, we report:<br />
<br />
<ol>
<li><b>Init Pipe</b> - Get Version Number</li>
<li><b>Build Start </b>- Environment, TimeStamp</li>
<li><b>Unit Test Start </b>- Environment, TimeStamp</li>
<li><b>Unit Test Done</b> - Environment, TimeStamp, Report</li>
<li><b>Build Done</b> - Environment, TimeStamp, Report</li>
<li><b>Deploy to Test Start</b> - Environment, TimeStamp</li>
<li><b>Deploy to Test Done</b> - Environment, TimeStamp, Report</li>
<li><b>Test Start</b> - Environment, TimeStamp</li>
<li><b>Test Done</b> - Environment, TimeStamp, Report</li>
<li><b>Promote</b> - Environment, TimeStamp, Promoted to PASSED_TEST</li>
<li><b>Deployment Production Start</b> - Environment, TimeStamp</li>
<li><b>Deployment Production Done</b> - Environment, TimeStamp, Report</li>
</ol>
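Each of those report calls can be as thin as appending a timestamped event keyed by version number. A sketch of what the repository's reporting side might look like; the class and method names are mine, not from any real implementation:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical event log inside the Build Data Repository: every step of the
// pipe, wherever it runs, appends an event for the version it is processing.
public class BuildEventLog {
    public record PipeEvent(String version, String step, String environment, Instant at) {}

    private final List<PipeEvent> events = new ArrayList<>();

    public synchronized void report(String version, String step, String environment) {
        events.add(new PipeEvent(version, step, environment, Instant.now()));
    }

    public synchronized List<PipeEvent> eventsFor(String version) {
        List<PipeEvent> result = new ArrayList<>();
        for (PipeEvent e : events) {
            if (e.version().equals(version)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BuildEventLog log = new BuildEventLog();
        log.report("42", "Build Start", "ci-server-1");
        log.report("42", "Build Done", "ci-server-1");
        log.report("43", "Build Start", "dev-laptop");
        // Version 42's history is complete even though version 43
        // executed on an entirely different machine.
        System.out.println(log.eventsFor("42").size()); // 2
    }
}
```

Because the event carries the environment it ran in, the same log serves executions from any CI server instance and any developer machine.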
<div>
Now we have a Continuous X process implementation that gets its version number from the same place and reports all its data into one repository, regardless of where it's executed. This means the same process implementation can run on any CI Server instance as well as on any developer machine. This enables a portable, and hence distributable, Continuous X process.<br />
<br />
There are several reasons we might want to run the process locally on our Dev Environment. We need the capability to develop the process implementation, and we need to be able to do so without a CI Server. It also gives us a good mindset when making a distributed implementation: if it can run locally then it can run on any build environment. In some cases we may not need a full-fledged build environment to run our Continuous X process at all; applications with few developers, few changes and a short pipeline execution time don't really need a remote build environment.<br />
<br />
Another important reason is that we need to be able to push our code to our customers even if our Continuous X Service is down. Regardless of how high the SLA of the service (internal or cloud based) is, we will eventually run into situations where we need to execute the process while the service is down.<br />
<br />
Though it's important to note that a pipeline should never allow execution on changes that aren't pushed to the SCM. Every execution has to be traceable to an SCM revision. </div>
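That traceability gate could be sketched like this, assuming a git checkout with an upstream branch configured. The function names are mine; the point is simply that the pipe refuses to run on anything that is not a pushed revision.

```python
import subprocess


def assert_traceable(status_output, unpushed_log):
    """Refuse execution unless the working tree is clean and all commits are
    pushed, so every pipe run maps to a revision in the central SCM."""
    if status_output.strip():
        raise RuntimeError('Refusing to run pipe: uncommitted changes')
    if unpushed_log.strip():
        raise RuntimeError('Refusing to run pipe: unpushed commits')


def current_revision():
    """Hypothetical entry point: gate a local pipe execution on git state
    and return the revision every reported event will be tagged with."""
    run = lambda *args: subprocess.run(
        ['git', *args], capture_output=True, text=True).stdout
    # `git status --porcelain` is empty for a clean tree;
    # `git log @{u}..` lists commits not yet on the upstream branch.
    assert_traceable(run('status', '--porcelain'),
                     run('log', '--oneline', '@{u}..'))
    return run('rev-parse', 'HEAD').strip()
```

Splitting the pure check out of the git plumbing keeps the rule itself testable without a repository.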
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxhvq6KAjrKJQebDhq9lRA48MhvmTwH6X-UXunQ-j-By8gh2Y-WP47w7bhNdAgXBpG65t1tGOlEdGUYPGjdXE2HNLhV-OKh_0ZHUloDYFHUwqtItgNXJTjSUQ0MfNnyYvFF_OplUi7k0S4/s1600/PortableCDProcess2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxhvq6KAjrKJQebDhq9lRA48MhvmTwH6X-UXunQ-j-By8gh2Y-WP47w7bhNdAgXBpG65t1tGOlEdGUYPGjdXE2HNLhV-OKh_0ZHUloDYFHUwqtItgNXJTjSUQ0MfNnyYvFF_OplUi7k0S4/s1600/PortableCDProcess2.png" height="290" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<b>Process Visualization</b></div>
<div>
<br /></div>
<div>
One of the main problems with CI Tools such as Jenkins is the visualization. It's just not built for the purpose of Continuous Delivery/Deploy and, as I've mentioned above, it's based on local instances of job pipes. This makes it impossible to visualize a distributed process in a good way, and it requires the user to have an understanding of the CI Tool. A Continuous Delivery/Deploy process needs to be visualized so that non-technical employees can view it.</div>
<div>
<br /></div>
<div>
Good thing we have a Build Data Repository then. Based on the information in the Build Data Repository we can visualize our pipe, its status and its result, no matter where it was executed. We can also visualize the pipe based on how it looked when it was executed, not how it's implemented now.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcjOqli8o31LRw7L7HbM9wi-VcyL4Ndui4kPPZyTdx5AsxnmWLkRLenHbnaKcF5CzxsLu3Y4ypZa2OHbBoU7WMjeigL2HbUkJj8g7pc91DQ9eqCx30ozR9X6DrvDtXr8tHSIFhC7Yy8bRX/s1600/PortableCDProcess3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcjOqli8o31LRw7L7HbM9wi-VcyL4Ndui4kPPZyTdx5AsxnmWLkRLenHbnaKcF5CzxsLu3Y4ypZa2OHbBoU7WMjeigL2HbUkJj8g7pc91DQ9eqCx30ozR9X6DrvDtXr8tHSIFhC7Yy8bRX/s1600/PortableCDProcess3.png" height="391" width="400" /></a></div>
<div>
<b><br /></b>
<b>Logging, Monitoring and Metrics vs the Build Data Repository</b><br />
<br />
What do we store in the Build Data Repository? Do we store everything in there, like execution logs from the test executions and metrics?<br />
<br />
No. The Build Data Repository stores the reports from all events (builds, deployments, tests, etc.), but the actual runtime logging should go where it belongs: log files into a log repository such as Logstash, metrics into a metrics repository such as Graphite.<br />
<br />
The Process Visualization could/should aggregate the data from the Build Data, Log and Metric repositories to provide a good view of our process and system.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Provisioning Environments and Executing Tests</b><br />
<br />
It's really important that our Continuous X Process Implementation initiates the full provisioning of the test/prod environment. By ensuring that it does, we guarantee that all environmental changes go through our pipe.<br />
<br />
What do I define as an Environment? Everything: load balancers, network rules, middleware nodes, caches, databases, scaling rules, etc.<br />
<br />
This concept is a great extension of the Immutable Server pattern. By releasing images instead of artefacts out of our build step (actually build+bake), we enable the creation of Immutable Servers through the entire build pipe.<br />
<br />
The process builds a new test environment for each black-box deployed test execution (basically anything beyond unit tests) and then destroys it again once the test is finished. When we deploy into our next environment, call it QA/PreProd or whatever, we do it by creating new middleware servers in that environment based on the same image and virtualisation spec as we had in our Test Environment. Once they are up and running we rotate them into our cluster/load balancer.<br />
<br />
The big difference between our Test, QA and Prod environment deploys is that the first one is a create-and-destroy scenario while the others are update scenarios, as we cannot build new full environments for production. We cannot build new databases, bring up new load balancers, etc. for each production deploy. So preferably we separate the handling of Mutable and Immutable infrastructure within one environment, so that we can create and destroy our Immutable Servers/Clusters and mutate our databases.<br />
<br />
So, at a very low level of detail, the topology of a Distributed Continuous X implementation looks something like this.<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJDNQbTtvNwpp2hX5tdLTeOFTJ7TzNFuAfv9WYQiEmin-kXp-UJm9mFMh44HbDSX9lrIAw6tolfVIKHD-9DrO1C0Pru9TjH2gTwvJ8DXDl433ZZ5nhN11E2wJmLFeIlDWEdNIKzZyWyi8I/s1600/PortableCDProcess4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJDNQbTtvNwpp2hX5tdLTeOFTJ7TzNFuAfv9WYQiEmin-kXp-UJm9mFMh44HbDSX9lrIAw6tolfVIKHD-9DrO1C0Pru9TjH2gTwvJ8DXDl433ZZ5nhN11E2wJmLFeIlDWEdNIKzZyWyi8I/s1600/PortableCDProcess4.png" height="529" width="640" /></a></div>
<div>
<br /></div>
<div>
This gives us the fundamental understanding of what we need to do in order to build a scalable distributed system that handles our Continuous X processes in our company.<br />
<br />
In my next post in this series I will go into an example of how to build a portable Continuous X Process Implementation that is independent of any CI and Build Tools.<br />
<br /></div>
<div>
</div>
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com2tag:blogger.com,1999:blog-5682735369888209586.post-54610510755323981052014-06-02T12:50:00.001+02:002014-06-02T12:51:23.233+02:00Blog Series: Continuous Deployment in the CloudFor my next few posts I am going to focus on writing a series of articles on how to do Continuous Delivery & Deployment in a cloud environment. I've always been a bit cautious when it comes to tutorial-style blogs, talks and articles. I usually find them to be too shallow; they never reveal the true issues that need to be solved. This often leads to bad, premature and uninformed decisions made by the consumer of the tutorial.<br />
<br />
So instead I am going to try to provide a much richer series of articles that focus on how to Architect, Test, Deploy and Deliver in a Cloud environment.<br />
<br />
In my conference talks I often talk about the importance of having a build data repository that is separate from the build engine (CI Tool). More than once I have been asked if I can open source this tool of ours. Well, I'm not sure, has been my answer. So instead I've decided to implement a new, similar tool and open source it. I'm going to over-architect it a bit on purpose, as it's going to be the main example in the article series. The Process Implementation in this series is going to use that tool as its build repo.<br />
<br />
This is how the Process Implementation will look. I will go deeper into this picture as I move forward, but a few quick words about it.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimcwyjZ2wR52v5RI6JFJDDkZOKox72UjBX4flephX_lhi-8vgvE12OrJ-fUn96z5d4EGCcfQjQKUCnr8qziiPS2P2HGjqzVjnxhcZC8F-2SK94oaKOdkPCSf067nzAIkVmnJD_2Xnrz2gD/s1600/PortableCDProcess.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimcwyjZ2wR52v5RI6JFJDDkZOKox72UjBX4flephX_lhi-8vgvE12OrJ-fUn96z5d4EGCcfQjQKUCnr8qziiPS2P2HGjqzVjnxhcZC8F-2SK94oaKOdkPCSf067nzAIkVmnJD_2Xnrz2gD/s1600/PortableCDProcess.png" height="262" width="400" /></a></div>
<br />
The key to a scalable CD process is for it to be independent of its runtime environment. The CD process drawn here can be executed from a Build Environment and/or a Dev Environment. The dev can push from his own environment right into production, or he/she can let the build environment do it for him/her. Regardless of where the process is initiated it will be executed in the same way, and it will integrate with the build data repository, to which it reports any events that happen on that build and from which it also gets its version number.<br />
<br />
Will I talk about tooling this time? Yes I will. This article series will be based on Git as SCM, AWS as Test and Prod Runtime Environments, Travis CI as Build Environment and Gradle as Build Tool. I still have not decided on a test tool, but most likely it will be RESTAssured.<br />
<br />
This series will take some time to write and will mostly be done during the later part of this summer and the fall. If you are interested in this article series and/or the build data repository then please +1 this article to show me that there is interest.<br />
<br />
ThanksTomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com1tag:blogger.com,1999:blog-5682735369888209586.post-18928552302941867902014-05-14T13:04:00.001+02:002014-05-14T13:04:32.845+02:00Tomorrow is the premiere of "Scaling Continuous Delivery" at GeeCon 2014Tomorrow, on the 15th of May, I've got a talk at <a href="http://2014.geecon.org/">GeeCon</a> in Krakow. It's the first outing of my new talk, Scaling Continuous Delivery. The talk is an experience report on all the struggles we have had scaling our continuous delivery rollout. Hopefully the talk will provide an insight into what we have done and the steps we have taken while scaling. Sometimes it's not just the end goal that is interesting but also the journey.<br />
<br />
Hopefully it will be appreciated.<br />
<br />
Here are the slides for the talk <a href="http://www.slideshare.net/TomasRiha/scaling-continuous-delivery-geecon-2014"> http://www.slideshare.net/TomasRiha/scaling-continuous-delivery-geecon-2014</a><br />
<br />
<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-82014390555812887742014-04-22T21:41:00.002+02:002014-04-22T21:42:52.056+02:00Environment PortabilityI've talked about this a lot before and we have done a lot of work in this area but it cannot be stressed how important it is. In fact I think portability is the key success factor to building a good Continuous Delivery Implementation. Without portability there is no scalability.<br />
<br />
There are two things that need to be built with portability in mind: the build pipe itself and our dev, test and prod environments. If the build pipe isn't portable then it will become a bottleneck. In fact, if the build pipe can run on each individual developer machine without the use of a build server, then it is portable enough to scale in a build server environment.<br />
<br />
Though in this post I will focus on Environment Portability from Desktop to Production.<br />
<br />
<b>Why do we need Portable Production Environments?</b><br />
<b><br /></b>
For years we have accepted the fact that the Test Environment isn't really like the Prod Environment. We know that there are differences, but we live with it because it's too hard and too expensive to do anything about it. When doing Continuous Delivery it becomes very hard to live with it and accept it.<br />
<br />
"If it's hard, do it more often" applies here as well.<br />
<br />
The types of problems we run into as a result of non-portable Production Environments are problems that are hard by nature: scalability, clustering, failover, etc. These are non-functional requirements that need to be designed in early. By exposing the complexity of the production environment late in the pipe, we create a very long feedback loop for the developers. It can be days or weeks depending on how often you hit prod with your deployments.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycWpw4HBfLLktM3Mbic47RfOEc4gEMgWRbUQZaa0PYiSBVHsCUrY3uAOwTzo5Z0ky2uy0OUKBXV3iyq8o5Cpoke_LhyCZOAltkNe05xoYZSulOllq8rJAmDGMAE4pfMwxJX18pElNuU8e/s1600/Portable+Development+Environments+Graphics.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjycWpw4HBfLLktM3Mbic47RfOEc4gEMgWRbUQZaa0PYiSBVHsCUrY3uAOwTzo5Z0ky2uy0OUKBXV3iyq8o5Cpoke_LhyCZOAltkNe05xoYZSulOllq8rJAmDGMAE4pfMwxJX18pElNuU8e/s1600/Portable+Development+Environments+Graphics.png" /></a></div>
<br />
By increasing the portability of the production environment we increase the productivity of our developers and the quality of our application. This is obvious, but there is another very important issue to deal with as well. Every time a deployment in UAT or Prod fails it undermines the credibility of Continuous Delivery. Each time something that worked in UAT fails in Prod, the person responsible for UAT will call for more manual tests before production, obviously increasing the lead time even further and making the problem worse. We need to constantly improve in order to manage fear.<br />
<br />
If we have Portable Production Environments then the issues that stem from Environment Complexity will never hit production as they get caught much earlier in the pipe.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJUWGPxMzf1ZBmJGbjcy59d7Zqptwc2Xn52ErV7j5Ze48Pr6FeJUkSDy4VsyUsMftvvZFTzItX1Md1qhbYKAKJnwmEThPC-lfSfeBm2qfNUqlnJxr3gS6k6mTsXWE6w1Cj_yvx5KNrOTlX/s1600/Portable+Development+Environments+Graphics+(1).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJUWGPxMzf1ZBmJGbjcy59d7Zqptwc2Xn52ErV7j5Ze48Pr6FeJUkSDy4VsyUsMftvvZFTzItX1Md1qhbYKAKJnwmEThPC-lfSfeBm2qfNUqlnJxr3gS6k6mTsXWE6w1Cj_yvx5KNrOTlX/s1600/Portable+Development+Environments+Graphics+(1).png" /></a></div>
<b>Who owns the definition of our Environments?</b><br />
<b><br /></b>
Who defines and who owns the different environments that we have in an organization? There are a lot of variations here, as there are a lot of variations in environment definitions and organizations. In most dysfunctional organizations out there the Ops team owns the Production environments and often the machines in the earlier environments. How the earlier environments are set up and used is often the responsibility of the Dev team, and if the organization is even more dysfunctional then there is often a Delivery Team involved which defines how its environments are used.<br />
<br />
This presents us with the problem that developers quite often have no clue what the production environment looks like and implement the system based on assumptions or ignorance. In the few cases where they actually have to implement something that has to do with scaling, clustering or failover, it's often just guesswork as they don't have a way to reproduce it.<br />
<br />
Going into production, this most often creates late requests for infrastructural changes, and often even cases where solutions were based on assumptions that cannot be realized in a production environment.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRvFd_dKmVHVPYMONG_ug-2J6-uhKnA5aEShqbogsjWanBAcShVVpI-wqeJKKYibmMIcEazTX13Ii61KNIXXb_5YY-7GjAl9yKDRiICRjpqvtapdnaJtyumocDfuQCksm6GrO1g06gEbRL/s1600/Portable+Development+Environments+Graphics+(2).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRvFd_dKmVHVPYMONG_ug-2J6-uhKnA5aEShqbogsjWanBAcShVVpI-wqeJKKYibmMIcEazTX13Ii61KNIXXb_5YY-7GjAl9yKDRiICRjpqvtapdnaJtyumocDfuQCksm6GrO1g06gEbRL/s1600/Portable+Development+Environments+Graphics+(2).png" /></a></div>
<br />
<b>What is portability?</b><br />
<b><br /></b>
When we talk about Portable Production Environments, what do we really mean? Does it mean that we have to move all our development to online servers, and that all dev teams get their own servers that are identical to production but at a smaller scale? Not really. It is doable, especially with the use of cloud providers, but I find it highly inconvenient to force the developer to be online and on a corporate network in order to develop. Having the ability to create a server environment for a developer on a need-to-have basis is great, because it covers the gap where local environments cannot be fully portable.<br />
<br />
Assuming one cloud provider will solve all your needs for portability across all environments is not realistic in the short term in most enterprise organizations, unless you are already fully committed and fully rolled out to a cloud provider. There are always those legacy environments that need to be dealt with.<br />
<br />
I think the key is to have portability in the topology of the environment. An environment built in AWS with ELBs will never be fully portable, since you will never have an ELB locally. But having a load balancer in your dev environment and having multiple application nodes forces you to build for horizontal scalability, and it will capture a whole lot more than just running on one local node.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_LposWlCJdZ5iRg3Y_eyAKfN8lqmlc_EC4IaDkyQcPm8UDiMPeS_jcWN5sI8fkgICUn_55COVH-l01oHe7h-6WnXBrJZgs4tBqvvpq_4NgWrVgYaB6epKCdbVpU5fyDGSgac8wV2k4YNb/s1600/Portable+Development+Environments+Graphics+(3).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_LposWlCJdZ5iRg3Y_eyAKfN8lqmlc_EC4IaDkyQcPm8UDiMPeS_jcWN5sI8fkgICUn_55COVH-l01oHe7h-6WnXBrJZgs4tBqvvpq_4NgWrVgYaB6epKCdbVpU5fyDGSgac8wV2k4YNb/s1600/Portable+Development+Environments+Graphics+(3).png" height="260" width="640" /></a></div>
Running an Oracle XE isn't really the same as running Enterprise Oracle, but it provides good enough portability. Firing it up on a VirtualBox of its own will force the DB away from localhost.<br />
<br />
In our production environment we monitor our applications using tools like Graphite and Logstash+Kibana. These tools should be part of the development environment as well, to increase the awareness of runtime and enable "design for runtime".<br />
<br />
Creating the Development Environment described here with Vagrant is super easy. It can be built using other tools as well, such as Docker or a combination of Vagrant and Docker. But if you use Docker then it needs to be used all the way. My example here, with just Vagrant and VirtualBox, is to show that portability can be achieved without introducing new elements to your production environment.<br />
<br />
<b>Individual Environment Specifications and One Specification to rule them all.</b><br />
<br />
To create a portable topology we need one way to describe our environments and then a way to scale them. A common human- and machine-readable Topology Specification, defining what clusters, stores, gateways and integrations there are in the topology of our production environment, gives us the ability to share the definition of our environments. Then an environment-specific specification defines the scale of the environment and any possible mocks for integrations in development and test environments.<br />
<br />
In an enterprise organisation we will always have our legacy environments. Over time these might migrate to a cloud solution, but some will most likely live out their existence in a legacy environment. For these solutions we still benefit hugely from portable production environments and one way to define them. In a cloud environment we can recreate their topology and leverage the benefits of portability, even if we cannot really benefit from the cloud in the production environment itself.<br />
<br />
For these solutions the Topology Specification and Environment Specification can be used to generate documentation; a little bit of Graphviz can do wonders here. This documentation can be used for change-request contracts and for documentation purposes.<br />
<br />
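As a sketch of that documentation generation, here is how a topology specification could be rendered as a Graphviz digraph. The spec below mirrors the Groovy example that follows, but is written in Python so it runs standalone; the function name and the shape choices are my own assumptions.

```python
def to_dot(topology):
    """Render a topology specification as Graphviz DOT: one node per
    cluster and storage, one edge per allowed network flow."""
    lines = ['digraph "%s" {' % topology['name']]
    for cluster in topology.get('clusters', []):
        lines.append('  "%s" [shape=box];' % cluster['name'])
    for storage in topology.get('storages', []):
        lines.append('  "%s" [shape=cylinder];' % storage['name'])
        # Draw each allowed inbound flow as an edge labelled port/protocol.
        for rule in storage.get('network', {}).get('allow', []):
            lines.append('  "%s" -> "%s" [label="%s/%s"];'
                         % (rule['from'], storage['name'],
                            rule['ports'], rule['protocol']))
    lines.append('}')
    return '\n'.join(lines)


# Cut-down version of the SomeService topology used as the running example.
topology = {
    'name': 'SomeService',
    'clusters': [{'name': 'SomeServiceCluster'}],
    'storages': [{'name': 'Rdbms',
                  'network': {'allow': [
                      {'from': 'SomeServiceInternal',
                       'ports': '1521', 'protocol': 'TCP'}]}}],
}
print(to_dot(topology))
```

Piping the output through `dot -Tpng` would give a diagram suitable for a change-request document, generated from the same spec that drives provisioning.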
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTMlcGvvTSCutfzCZEZG9NLirBZn075P9Rt2QCqFzWqIiKXnXq55tCmPtLWe5oseax_a-S8mxD-zJ2gQ02C7cjza89tAGqtPX0VfuBT3fzm9uc3sOMLEhNgKR7MuvkSJUWz4ljFNJik3Z7/s1600/Portable+Development+Environments+Graphics+(4).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTMlcGvvTSCutfzCZEZG9NLirBZn075P9Rt2QCqFzWqIiKXnXq55tCmPtLWe5oseax_a-S8mxD-zJ2gQ02C7cjza89tAGqtPX0VfuBT3fzm9uc3sOMLEhNgKR7MuvkSJUWz4ljFNJik3Z7/s1600/Portable+Development+Environments+Graphics+(4).png" /></a></div>
<br />
We like Groovy scripts for our Infrastructure as Code. Here is an example of how the above example with the Cluster and the Oracle database could be defined using the Topology Specification. This example covers just clusters, storages and network rules, but more can be added, such as HTTP gateway fronts as 'gateways' and pure network integrations with partners as 'integrations'.<br />
<br />
<br />
<div dir="ltr" style="line-height: 1.3636359999999998; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-size: x-small;"><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">def </span><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">topologySpec = [name:'SomeService'<br class="kix-line-break" /> ,clusters:[[name:'SomeServiceCluster'<br class="kix-line-break" /> ,image_prefix:'some-service'<br class="kix-line-break" /> ,cluster_port:80<br class="kix-line-break" /> ,node_port:8080<br class="kix-line-break" /> ,health_check_uri:'/ping/me'<br class="kix-line-break" /> ,networks:[[<br class="kix-line-break" /> name:'SomeServiceInternal'<br class="kix-line-break" /> ,allow_inbound:[<br class="kix-line-break" /> [from:'0.0.0.0',ports:'80',protocol:'TCP']<br class="kix-line-break" /> ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']<br class="kix-line-break" /> ],allow_outbound:[<br class="kix-line-break" /> [from:'0.0.0.0',ports:'80',protocol:'TCP']<br class="kix-line-break" /> ,[to:'OracleInternal',ports:'1521',protocol:'TCP']<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" /> ],storages:[[name:'Rdbms'<br class="kix-line-break" /> ,type:'oracle'<br class="kix-line-break" /> ,network:[<br class="kix-line-break" /> name:'OracleInternal'<br class="kix-line-break" /> ,allow:[<br class="kix-line-break" /> [from:'SomeServiceInternal',ports:'1521',protocol:'TCP']<br class="kix-line-break" /> ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" /> ]<br class="kix-line-break" />]</span></span><span 
style="background-color: transparent; color: #333333; font-family: Arial; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span></div>
<span id="docs-internal-guid-9c40a865-8a26-6b32-ba67-35fcef8c7553"><span style="font-size: x-small;"><br /></span></span>
<br />
Environment Specifications contain the scaling of the Topology, but can also contain integration Mocks as 'clusters' defined just for that environment.<br />
<span style="font-size: x-small;"><br /></span>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-size: x-small;"><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">def </span><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">devEnvSpec = [name:'SomeService'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> , clusters:[[name:'SomeServiceCluster'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,cluster_size:2</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,node_size:nodeSize.SMALL</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,storages:[</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> [name:'Rdbms'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,cluster_size:1</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,node_size:nodeSize.SMALL</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;">]</span></span></div>
<b id="docs-internal-guid-9c40a8a4-8a26-e383-f422-e2642ac7f52b" style="font-weight: normal;"><span style="font-size: x-small;"><br /></span></b>
<br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-size: x-small;"><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">def </span><span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">prodEnvSpec = [name:'SomeService'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> , clusters:[[name:'SomeServiceCluster'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,cluster_size:3</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,node_size:nodeSize.MEDIUM</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,storages:[</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> [name:'Rdbms'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,cluster_size:1</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,node_size:nodeSize.LARGE</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ]</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;">]</span></span></div>
<span style="font-size: x-small;"><br /></span>
<br />
Then the definitions are pushed to the Provisioner implementation with an input argument specifying which environment to provision.<br />
<span style="font-size: x-small;"><br /></span>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"><b>def</b> envSpecs = ['DEV':devEnvSpec</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,'PROD':prodEnvSpec</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;"> ,'LEGACY':prodEnvSpec]</span></span></div>
<b id="docs-internal-guid-9c40a8aa-8a27-2ee1-2391-c62c085496e0" style="font-weight: normal;"><span style="font-size: x-small;"><br /></span></b>
<br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;">//args[0] is env name 'DEV', 'PROD' or 'LEGACY'</span></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: #333333; font-family: 'Courier New'; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: x-small;">//Provisioner picks the right implementation (Vagrant, AWS or PDF) for the right environment</span></span></div>
<span style="font-size: x-small;"><br /><span style="color: #333333; font-family: 'Courier New'; vertical-align: baseline; white-space: pre-wrap;"><b>new</b> Provisioner().provision(topologySpec, envSpecs, args)</span></span><br />
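The dispatch the comment describes can be sketched roughly like this. The original is Groovy; this is a Python sketch, and the implementation classes and return values are made up for illustration:

```python
# Illustrative sketch of the Provisioner dispatch described above: one
# topology spec, per-environment specs, and an implementation picked by
# environment name. All class names and return values are hypothetical.

class VagrantProvisioner:
    def provision(self, topology, env_spec):
        return "vagrant:%s" % env_spec["name"]

class AwsProvisioner:
    def provision(self, topology, env_spec):
        return "aws:%s" % env_spec["name"]

# Which implementation serves which environment.
IMPLEMENTATIONS = {"DEV": VagrantProvisioner,
                   "PROD": AwsProvisioner,
                   "LEGACY": AwsProvisioner}

class Provisioner:
    def provision(self, topology_spec, env_specs, args):
        env_name = args[0]                  # e.g. 'DEV', 'PROD' or 'LEGACY'
        impl = IMPLEMENTATIONS[env_name]()  # pick implementation per environment
        return impl.provision(topology_spec, env_specs[env_name])

# Usage mirrors the Groovy call above:
# Provisioner().provision(topologySpec, envSpecs, ["PROD"])
```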
<br />
Test Environments should be non-persistent environments that are provisioned for the test run and decommissioned when the test execution is finished.<br />
<br />
Development environments should be provisioned in the morning and decommissioned at the end of the day. This also solves the issue of building the Dev Environment, which can be a tedious manual process in many organisations.<br />
<br />
Production environments, on the other hand, need to support provision, update and decommission, as it's not always convenient to build a new environment for each topological change.<br />
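The non-persistent lifecycle for test environments can be sketched as a wrapper that guarantees teardown regardless of the test outcome (the function names here are illustrative stand-ins, not our actual tooling):

```python
# Sketch: a non-persistent test environment is provisioned for the run and
# always decommissioned afterwards, even when the tests fail.
# provision/decommission/run_tests are hypothetical stand-ins.

def with_test_environment(provision, decommission, run_tests):
    env = provision()
    try:
        return run_tests(env)
    finally:
        decommission(env)  # teardown happens regardless of test outcome

result = with_test_environment(
    provision=lambda: {"name": "test-env-1"},
    decommission=lambda env: None,
    run_tests=lambda env: "passed on %s" % env["name"],
)
print(result)  # → passed on test-env-1
```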
<br />
Also understand that provisioning an environment is not the same as deploying the application; there can be many deployments into a provisioned environment. The Topology Specification doesn't specify which version of the application is deployed, just what the base name of the artefact is. I find that convenient, as the base name can be used to identify which image should be used to build the cluster.<br />
<br />
<b>One Specification, One Team Ownership</b><br />
<b><br /></b>
The Topology Specification should be owned by one team, the team that is responsible for developing and putting the system into runtime. Yes, I do assume that some sort of DevOps-like organisation is in place at this stage. If it isn't, then I would say that the specification should be owned by the Dev team and the generated documentation should be the contract between Dev and Ops. Consolidating ownership of as many environments as possible into one team should be the aim.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf0NM5W55bzS6L2_s9nyyJPCXX7tEtlWFk1dyANdLZCrjQUaw1PJHnQ3mIjkx71kQgkIHA7hfOkHbNvJYQRwCidmwYjuuvg3R_TwzSZMV8xf-kmRP6UuuEmWVHpuG0QG49mjId_dxdmI77/s1600/Portable+Development+Environments+Graphics+(5).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf0NM5W55bzS6L2_s9nyyJPCXX7tEtlWFk1dyANdLZCrjQUaw1PJHnQ3mIjkx71kQgkIHA7hfOkHbNvJYQRwCidmwYjuuvg3R_TwzSZMV8xf-kmRP6UuuEmWVHpuG0QG49mjId_dxdmI77/s1600/Portable+Development+Environments+Graphics+(5).png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Summary</b></div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: left;">
I think using these mechanisms to provision environments in a Continuous Delivery pipe will increase the quality of the software that goes through the pipe immensely. Not only will feedback be faster, but we will also be able to start tailoring environments for specific test scenarios. The possibilities for increasing quality are enormous.</div>
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com6tag:blogger.com,1999:blog-5682735369888209586.post-72014113515274328962014-04-22T11:14:00.002+02:002014-04-22T11:14:56.975+02:00New Talk: Continuous Testing<span style="font-family: inherit;">April 29th I will be visiting HiQ here in Gothenburg. I will have the opportunity to talk about the super important subject of Continuous Delivery and Testing. This talk is an intro-level talk that describes Continuous Delivery and how we need to change the way we work with Testing.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">If anyone else is interested in this talk then please don't hesitate to contact me. The talk can also be focused on a practitioner audience, digging a bit deeper into the practices.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Here is a brief summary of the talk.</span><br />
<br />
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Why do we want to do Continuous Delivery</span></span></div>
<div dir="ltr" style="margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="line-height: 24px; white-space: pre-wrap;"><span style="font-family: inherit;">Introduction to Continuous Delivery and why we want to do it. What we need to do in order to do it, principles and practices and a look at the pipe. (Part of the intro level talk)</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="font-family: inherit;"><br /></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Test Automation for Continuous Delivery</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="background-color: transparent; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Starting with a look at how we have done testing in the past and our efforts to automate it. Moving on to how we need to work with Application Architecture in order to Build Quality In so that we can do fast, scalable and robust test automation.</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Test Driven Development and Continuous Delivery</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="background-color: transparent; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">How we need to work in a Test Driven way in order to have our features verified as they are completed. A look at how Test Architecture helps us define who does what.</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Exploratory Testing and Continuous Delivery</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="background-color: transparent; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">In our drive to automate everything it's easy to forget that we still NEED to do Exploratory Testing. We need to understand how to do Exploratory Testing without doing Manual Release Testing. The two are vastly different.</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Some words on Tooling</span></span></div>
<span style="font-family: inherit;"><span id="docs-internal-guid-e35bbfd0-88ac-1e4b-f584-9fc55671190e"></span></span><br />
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="background-color: transparent; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">A lot of the time, discussions start with tools. This is probably the best way to fail your test automation efforts.</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Areas Not Covered</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: left;">
<span style="background-color: transparent; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: inherit;">Before we are done we need to take a quick look at Test Environments, Testing Non Functional Requirements, A/B Testing, Power of Metrics. (This section expands in the practitioner level talk)</span></span></div>
<div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt; text-align: center;">
<span style="background-color: transparent; color: #5b595a; font-family: Arial; font-size: 24px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-38419388364592000912014-04-01T11:38:00.001+02:002014-04-01T11:38:38.349+02:00Upcoming talksI've gotten the honor to speak at two fantastic conferences this spring.<br />
<br />
First one is <a href="http://web.pipelineconf.info/">PipelineConf</a>, 8th of April in London, where I will talk about the people side of Continuous Delivery. This is the talk I've given at the Netlight EDGE and JDays conferences, though it's been updated with the experiences from the last 6-8 months of working with Continuous Delivery.<br />
<br />
The second one is <a href="http://2014.geecon.org/">GeeCon</a>, 14th-16th of May in Krakow, Poland, where I will be speaking about Scaling Continuous Delivery. This is a new talk that focuses on lessons learned from our journey to scale Continuous Delivery from a team of 5 to an organization of hundreds.<br />
<br />
If you are interested in hearing me speak at a conference, seminar or workshop, don't hesitate to contact me.Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-89574711925480983932014-03-17T10:32:00.000+01:002014-03-17T10:33:08.910+01:00PortabilityI've talked about Portability of the CD process before, but it continuously becomes more and more evident to us how important it is. The closer the CD process comes to the developer, the higher the understanding of the process. Our increase in portability has gone through stages.<br />
<br />
Initially we deployed locally in a way that was totally different from the way we deployed in the continuous delivery process. Our desktop development environments were not part of our CD process at all. Our deploy scripts handle stopping and starting of servers, moving artifacts on the server, linking directories and running Liquibase to upgrade/migrate the database. We did all this manually on the local environments. We ran Liquibase, but we ran it using the Maven plugin (which we don't do in our deploy scripts; there we run it using java -jar). We moved artifacts by hand or by other scripts.<br />
<br />
Then we created a local bootstrap script which executed the CD process deploy scripts on a local environment. We built environment-specific support into the local bootstrap so that we supported Linux and Windows. Though in order to start JBoss and Mule we needed to add support for the local environment in the CD process deploy script as well. We moved closer to portability, but we diluted our code and increased our complexity. Still, this was an improvement, but the process was still not truly portable.<br />
<br />
In recent times we have decided to shift our packaging of artifacts from zip files to RPMs. All our prod and test environments are Red Hat, so the dependency on technology is not really an issue for us here. What this gives us is the ability to manage dependencies between artifacts and infrastructure in a nice way. The war file depends on a JBoss version, which depends on a Java version, and all are installed when needed. This also finally gives us a clear separation between install and deploy: the yum installer installs files on the server, our deploy application brings the runtime online, configures it and moves the artifacts into runtime.<br />
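The dependency chain described above can be expressed in RPM metadata along these lines. This is only a sketch; the package names and versions are made up, not our actual packages:

```
# Illustrative fragment of the application's RPM spec file.
# yum resolves the chain: war package -> JBoss package -> Java package.
Name:     someservice-war
Version:  2.4.1
Release:  1
Summary:  SomeService web application artifact
Requires: jboss-as >= 7.1

# ...and in the hypothetical jboss-as package's own spec:
# Requires: java-1.7.0-openjdk
```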
<br />
In order for us to maintain portability to the development environment, this finally forced us to go all in and make the decision that "development is done in a linux environment". We won't be moving to Linux clients, but our local deploy target will be a virtual Linux box. This finally puts everything into place for us, creating a fully portable model. It's important to understand that we still don't have a cloud environment in our company.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmMGXaq_Z7PjH1TrdKvQK6C8TZrPPGWtkUL7OuwTnGr4BtjI_1UXhedOw_1K3llhfXknEKIU-zW1fNAK3fRPSr0pvso7WjbiAdvgWyNFgbO1VbI_i7YhigvJ6AGVnPPjAV-7EfGrkZnuNl/s1600/Untitled+presentation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmMGXaq_Z7PjH1TrdKvQK6C8TZrPPGWtkUL7OuwTnGr4BtjI_1UXhedOw_1K3llhfXknEKIU-zW1fNAK3fRPSr0pvso7WjbiAdvgWyNFgbO1VbI_i7YhigvJ6AGVnPPjAV-7EfGrkZnuNl/s1600/Untitled+presentation.png" height="360" width="640" /></a></div>
<br />
This image, created by my colleague Mikael, is a great visualization of how much portability we can build in our environment now and when we get a cloud. By defining a Portability level and its interface we manage to build a mini cloud on each Jenkins slave and on a local dev machine using the exact same process as we would for a QA or test deploy. The Nodes above the Portability level can be local on the workstation/Jenkins slave or remote in a Prod Environment. The process is the same regardless of environment: Provision, Install and Deploy.<br />
<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com2tag:blogger.com,1999:blog-5682735369888209586.post-47032776962538295012014-02-21T12:07:00.004+01:002014-02-21T12:08:21.042+01:00Scaling Continuous DeliveryIt's been a while since I posted. The main reason is that we have been very focused on our main deliveries and feature development for the last six months. Whenever the feature train hits central station it's always work such as build, release and test automation that gets hit first.<br />
<br />
There are upsides, though, to not touching your Continuous Delivery process for a few months: if you just keep working through your backlog you don't get time to analyze the impact of the changes you just made. Several times we have realized that the number two and three items in the backlog dropped significantly in priority once we fixed the most important issue, with others rising fast in priority.<br />
<br />
Now we have had time to analyse a lot of new issues and it's time for us to pick up the pace again.<br />
<br />
<b>Scaling the Organization </b><br />
<br />
The good thing, the awesome thing (!), is that during these six or so months our organization has changed and we have actually been able to create a line organization that owns and takes responsibility for the Continuous Delivery process.<br />
<br />
One of the major bottlenecks we found in our process was our platform/tools team. The team was small, and resources in that team were always the first to go when feature pressure increased. The team became just another "IT function" that didn't have time to be proactive due to all the reactive support work it had to do.<br />
<br />
There were a few reasons behind this. First, it was the way the team worked in the past: it actually built the pipes and processes for all the teams by hand, tailored to the custom needs of each team. On some teams there were individuals who picked up the work and kept on configuring the Jenkins jobs to tailor them even more, but on some teams there was no interest whatsoever and their jobs degraded.<br />
<br />
The result of this was that no one really knew how the pipes looked or how they should look. Introducing process change was horribly slow, as it was all manual and dependent on the platform/tools team.<br />
<br />
One of the first changes we made was to increase the bandwidth of the team and reduce the dependency on that team. <a class="twitter-atreply pretty-link" dir="ltr" href="https://twitter.com/TobiasPalmborg" style="background-color: white; color: #0084b4; font-family: Arial, sans-serif; font-size: 14px; line-height: 18px; text-decoration: none; white-space: pre-wrap;"><span style="color: #66b5d2;">@</span>TobiasPalmborg</a> provided a great solution for this over a chat this summer: instead of the platform/tools team supporting the development teams, the development teams put resources into the platform/tools team. Each team was invited to add a 50% resource on a voluntary basis. This way the real-life issues got much better attention in the platform/tools team and the competence about the Continuous Delivery process spread in a much more organic way.<br />
<br />
This did not eliminate the bottleneck organization, but it gave us bandwidth to change the way we work and, long term, gave us the ability to scale with the number of teams that use the process.<br />
<br />
<b>Scaling the Process</b><br />
<br />
The main issue with why we were a bottleneck was the way we worked. We preached Automate Everything, Test Everything, If It's Hard Do It More Often, etc., but when it came to the Continuous Delivery process we didn't do what we were teaching.<br />
<br />
We had ONE Jenkins environment, so all the changes happened directly in production. Testing plugins and new configurations on a production environment isn't really the way to deliver stability, reliability and performance.<br />
<br />
Manually created Jenkins pipes aren't really a way to create sustainable pace and continuous improvement.<br />
<br />
Developing deploy scripts without explicit unit tests isn't really a good way of creating a stable process. We have been priding ourselves on our deployment being tested hundreds of times before a production deploy, which was true but very dumb. Implicit testing means that someone else takes the pain for my mistakes. Deployment scripts are applications and need to be treated as first class citizens.<br />
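As a sketch of what "explicit unit tests for deploy scripts" can mean: pull the deploy logic into plain functions and test them directly, long before any server is involved. The helper below is hypothetical, not our actual script:

```python
# Sketch: treat a deploy step as a plain, testable function.
# plan_symlink_switch is an illustrative helper, not from the real scripts.

def plan_symlink_switch(releases, current):
    """Return (new_target, rollback_target) for the 'current' symlink."""
    if not releases:
        raise ValueError("no releases available")
    ordered = sorted(releases)   # simple sort; fine for these sample versions
    new_target = ordered[-1]     # newest release goes live
    rollback_target = current    # remember what to roll back to
    return new_target, rollback_target

# An explicit unit test, runnable on any workstation or build slave:
def test_plan_symlink_switch():
    new, rollback = plan_symlink_switch(["1.0.0", "1.1.0"], "1.0.0")
    assert new == "1.1.0"
    assert rollback == "1.0.0"

test_plan_symlink_switch()
```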
<br />
This had to change.<br />
<br />
The first thing we did was to use the extra bandwidth we had obtained to build a totally new way of delivering Continuous Delivery. Automate everything, obvious, huh?<br />
<br />
We also decided to deliver a Continuous Delivery environment per development team and not have them all in one environment. So we started with automating provisioning of Jenkins & test environments. We don't have a cloud solution in our company at this time, so we have a fake cloud that we work with, which is a huge pool of virtual servers. This pool we provision and maintain using Chef.<br />
<br />
The second thing was to automate the build pipe setup. We built a simple little pipe generator which has defined pipe templates of 5-8 different layouts to support the different needs. We actually managed to get the development teams to adjust to a stricter Maven project naming convention to use the generated pipes, as everyone saw the benefits of this.<br />
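The idea of generating pipes from templates plus a naming convention can be sketched like this. The template layouts, the naming convention, and the component names are all made up for illustration; the real generator worked against Jenkins job configurations:

```python
# Illustrative sketch of a pipe generator: pick a pipeline template from a
# component's type, derived from a Maven artifactId naming convention.
# Templates and the '<team>-<name>-<type>' convention are hypothetical.

TEMPLATES = {
    "lib": ["build", "unit-test", "publish"],
    "service": ["build", "unit-test", "deploy-test", "fitnesse", "deploy-qa"],
}

def generate_pipe(component_name):
    """Derive the pipeline stages for a component from its name."""
    kind = component_name.rsplit("-", 1)[-1]   # last segment is the type
    if kind not in TEMPLATES:
        raise ValueError("unknown component type: " + kind)
    return ["%s.%s" % (component_name, stage) for stage in TEMPLATES[kind]]

print(generate_pipe("payments-core-lib"))
# → ['payments-core-lib.build', 'payments-core-lib.unit-test', 'payments-core-lib.publish']
```

The point of the design is that a bug fix in a template regenerates every team's pipe, instead of being patched by hand in each Jenkins job.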
<br />
The pipes we have are basically typed by what they build, libs or deployable components, and by how they are tested, as we still need to initiate our Fitnesse tests a bit differently from our other tests.<br />
<br />
We made it the responsibility of the platform/tools team to develop the pipe templates and the responsibility of the development teams to configure their generator to generate the pipes they needed for their components.<br />
<br />
Getting to this stage was a lot of work, and a lot of migration work for all the teams, but the results have been terrific. The support load on the platform/tools team has gone down a lot, and each bug fix is rolled out within minutes to all the pipes.<br />
<br />
We have also been able to take on new development teams very easily. Not all teams in our company are ready to do Continuous Delivery, but they are all heading in this direction and we can now provide environments and pipelines that match their maturity.<br />
<br />
<b>Summary</b><br />
<b><br /></b>
We have gone from a process developed as skunkworks to Continuous Delivery as a Service within our organization. We always run into new bottlenecks and challenges; this time the bottleneck was much more us than anything else. I assume that the next big bottleneck is going to be hardware and our inability to deliver on a cloud solution, since we can now roll out to more and more teams. But who knows, I can be wrong; only time will tell.<br />
<br />
<br />
<br />
<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com4tag:blogger.com,1999:blog-5682735369888209586.post-31003250946956797742013-06-18T16:19:00.001+02:002013-06-18T16:19:55.850+02:00Its about the people.Last week I attended QCon New York. Fantastic conference as usual, and it was comforting to see that basically everyone was saying the same thing: "Continuous Delivery is not about the technology, it's about the people". Which also happens to be the title of <a href="http://www.netlight.com/edge/#thomas-riha">my talk at Netlight's EDGE conference</a> in September.<br />
<br />
<a href="https://qconnewyork.com/node/296">In his talk</a> Steve Smith (<span style="font-family: Verdana, sans-serif; font-size: 13px; line-height: 19px;">@agilestevesmith) talked about how 5% is technology and 95% is organization. While I agree with that, I think that the non-technical </span><span style="font-family: Verdana, sans-serif; font-size: 13px; line-height: 19px;">95% can be divided into organization, change of role definitions and individual maturity. It's these three that my talk will cover.</span><br />
<span style="font-family: Verdana, sans-serif; font-size: 13px; line-height: 19px;"><br /></span>
<span style="font-family: Verdana, sans-serif; font-size: 13px; line-height: 19px;">Hopefully I will be able to give this talk in Gothenburg as well, as it's been <a href="http://www.jdays.se/">submitted to JDays</a>.</span>Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-7089581952717546032013-04-08T10:38:00.004+02:002013-05-06T12:41:06.512+02:00Talk at HiQ 24th of April<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Continuous Delivery - Enabling Agile.</span><br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The key to agile development is a fast feedback loop. Continuous Delivery strives towards always having tested releases in a deliverable state. Continuous Delivery is not just a technical process but a change to the entire organization and the individuals within it. This presentation describes the principles of Continuous Delivery, gives a brief overview of how it can be implemented, and shows how it changes the organization and how it impacts the individuals.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Target audience for this presentation is Developers, Architects, Testers, Scrum Masters, Project Managers and Product Owners, in no particular order. The presentation is not rich in technical detail and is based on real life experiences.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Please use this post to provide questions and feedback.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Welcome</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Tomas<br />
<br />
Link to the presentation: <a href="http://www.slideshare.net/TomasRiha/continuous-delivery-hi-q">http://www.slideshare.net/TomasRiha/continuous-delivery-hi-q</a></div>
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com3tag:blogger.com,1999:blog-5682735369888209586.post-86978029673230508032013-02-24T22:44:00.000+01:002013-02-28T18:22:40.541+01:00Architect to re-ArchitectWe spend so much time trying to make the right decisions. It's one of the downsides of working on a next generation platform. "You better get it right this time!". We have all been there when a current generation solution just doesn't cut it anymore. Implementing that next requirement is going to be so expensive that we might just as well rewrite the whole thing. Thing is they also tried to "get it right this time!".<br />
<br />
Why does it "always" go wrong? Why do we always run into dead ends with systems? Sure, not always, but almost always when an application is exposed to a lot of changes and new requirements.<br />
<br />
<b>Select technology then abstract and isolate it in the architecture.</b><br />
<br />
Historically we have put a lot of thought into the selection of technology when we build something new. It's important not to get it wrong, so we think a lot about getting it right. We also think a lot about patterns so that we can replace tech A with tech B if the decision has to be reversed. Who hasn't written hundreds of DAOs so that we can one day change our database? How often do we change database? Historically, well, I have never done it. A change from Oracle to DB2 or whatever other SQL database has never been the reason for a major rewrite. In fact, I've been part of more than one rewrite that has thrown out everything but the data layer.<br />
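To make the pattern concrete, the DAO abstraction being discussed typically looks something like this. This is a minimal sketch with invented names; an in-memory map stands in for the database, which is exactly the kind of swap the interface is supposed to enable:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Callers depend only on this interface, so the backing store
// can in theory be replaced without touching the business code.
interface SessionDao {
    void save(String sessionId, String ownerId);
    List<String> findByOwner(String ownerId);
}

// One concrete implementation; a HashMap stands in for the database table.
class InMemorySessionDao implements SessionDao {
    private final Map<String, String> ownerBySession = new HashMap<>();

    public void save(String sessionId, String ownerId) {
        ownerBySession.put(sessionId, ownerId);
    }

    public List<String> findByOwner(String ownerId) {
        return ownerBySession.entrySet().stream()
                .filter(e -> e.getValue().equals(ownerId))
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

The irony the text points out is that this seam is almost never exercised: the interface survives rewrites while everything around it is thrown away.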
<br />
In the future we will see more database changes due to NoSQL, but if and when we do, do we really want to keep our DAO interfaces? If we do, then we sure aren't going to accomplish much with our rewrite. If we change, then we change because we need to solve a bottleneck problem. In order to solve it we need to make an optimization using a niche product. So we need to write and query our data differently. <br />
<br />
The cause of a major rewrite is either lack of scalability or customer requirements that are too hard, too expensive or too high-risk to implement. The latter almost always happens when everything has become so interconnected that the change can no longer be done in a safe and isolated way. We need to refactor so much in order to make the change possible that it's cheaper to rewrite.<br />
<br />
<b>A distributed system can still be a monolith.</b><br />
<br />
In standard monolithic design we monolithized everything, not just the components of the system but also the data model and the business logic. By normalizing our data model and constantly striving towards decreasing code redundancy we entangle all the services of our application into a huge ball of concrete. It's when we end up with our services entangled in a solid ball of concrete that we need to blow it up, all of it, in order to rewrite it. It doesn't matter how well we modeled our database, how nice our DAOs are or how much inversion of control we use. If we don't treat our services independently we will run into trouble down the road.<br />
<br />
Decoupling the monolith into subsystems doesn't necessarily help either. If we still normalize our data and strive towards reusing as much code as possible within the components, then all we have done is distribute the monolith. Chances are quite high that you will need to rewrite multiple components when the requirement change appears.<br />
<br />
<b>Let's take an example.</b><br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAya8gmoqKQlAOpwriqsy-Hv_UHjqLusVn-1iMHJMrdXzabP0wgx1ChQcqrSlA3AOL3b5Pm-svyolxWBC52gIHHSssUE7_jjfkVzR3S0A-LEey8opPXjaCSo5JsUu6uNTVZHHdBc50w0iv/s1600/reArchitect1.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="247" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAya8gmoqKQlAOpwriqsy-Hv_UHjqLusVn-1iMHJMrdXzabP0wgx1ChQcqrSlA3AOL3b5Pm-svyolxWBC52gIHHSssUE7_jjfkVzR3S0A-LEey8opPXjaCSo5JsUu6uNTVZHHdBc50w0iv/s320/reArchitect1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
We have a training application aimed towards running and cycling. We have users, training sessions and races. Training sessions and races are really the same thing; they both contain a number of users, equipment, time, distance and a route. We provide views of user training sessions, user races and race results by race. We sell the application to race organizers and it's free to users. We have an agreement to keep the race results highly available and to keep all history of previous years.<br />
<br />
So we have a simple data model with users and sessions, with a many-to-many relationship and a type defining whether it's a race or a training session. Simple. Done. Delivered.<br />
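As a sketch, that model could look something like this in plain Java (illustrative names; in the real application these would be mapped entities backed by a join table):

```java
import java.util.ArrayList;
import java.util.List;

// The type column: one entity for both cases, a flag to tell them apart.
enum SessionType { TRAINING, RACE }

class User {
    final String name;
    final List<Session> sessions = new ArrayList<>();
    User(String name) { this.name = name; }
}

class Session {
    final SessionType type;
    final List<User> participants = new ArrayList<>();
    Session(SessionType type) { this.type = type; }

    // Many-to-many: both sides of the relationship are kept in sync.
    void addParticipant(User u) {
        participants.add(u);
        u.sessions.add(this);
    }
}
```

Simple indeed, and that very simplicity — one table, one type flag — is what makes the later queries so hard to untangle.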
<br />
<br />
Now the application becomes really popular as a training application among users, so we start gaining a lot of data. This data is mostly written, since no one other than the user really cares about it. It does impact our race data though, since people tend to look at that more.<br />
<br />
Someone realizes that all the training data is interesting since we also added a heart rate integration. So we build queries on the training data to provide to medical studies. Sweet extra income that the sales dudes came up with. It's no real issue performance-wise as we run them once a year and that's done over Christmas.<br />
<br />
Now someone sells our services of race data, training and fitness trending to UCI (the cycling union) as a tool for their fight against doping. We just need to add a query to correlate our sweet training reports with race results, how hard can that be? We develop for a sprint or two and go live. So now we get a serious tonnage of data and we run our queries more often. *gag* It doesn't work, we can't scale and we can't add a new query without totally killing our SLAs with the other race organizers. We need to rewrite.<br />
<br />
<b>Components are not the silver bullet.</b><br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaxJrmn1irWA_YDv6X8HK3OZMrVTOjQB5Xw2dXbl5lqrC6NeYDq-luAjYYVM6wjvfnvG4X03PFIwDWJ10Zp-98sGoiV4uMhgLqhtAc-Bqb596KNBL-iDyL0nFgROz1u6z3tHJyyKqrPHx_/s1600/reArchitect2.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaxJrmn1irWA_YDv6X8HK3OZMrVTOjQB5Xw2dXbl5lqrC6NeYDq-luAjYYVM6wjvfnvG4X03PFIwDWJ10Zp-98sGoiV4uMhgLqhtAc-Bqb596KNBL-iDyL0nFgROz1u6z3tHJyyKqrPHx_/s320/reArchitect2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Components dont really help us</td></tr>
</tbody></table>
Having our system distributed into a user repository, session storage and an integration component providing REST services to our GUI component wouldn't help us all that much. Sure, we have separated users and their equipment from the sessions, but it's the queries on the sessions that are the problem, and they are killing our SLAs with the other race organizers.<br />
<br />
<br />
<br />
<br />
<b>Design by Services </b><br />
<b><br /></b>
So what we really need is to move the race result service into a service of its own. We need to isolate it, even though all the data is identical to the race data by the user. Then we need to separate the integration code for the race organizer service into a service of its own so that we can deploy it separately.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQox3vYf3NO4nG1AthfNhURGtntAwEoR6duNWm16pa87lBwpyacoe1NHPU_wRaEZCp4zMqKf9Hty2EB32iJtgo4p3tPv3PXGK31Gw6DQ5iK1R5f4EJ96YUrTkrvw7VEBnFhe9klbDH_7QG/s1600/reArchitect3.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="197" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQox3vYf3NO4nG1AthfNhURGtntAwEoR6duNWm16pa87lBwpyacoe1NHPU_wRaEZCp4zMqKf9Hty2EB32iJtgo4p3tPv3PXGK31Gw6DQ5iK1R5f4EJ96YUrTkrvw7VEBnFhe9klbDH_7QG/s320/reArchitect3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Services do help us</td></tr>
</tbody></table>
Doing this when hitting the wall is hard, costly and risky. Just the database split is a nightmare if the data has grown big.<br />
<br />
If we had done this from the get-go we could just have re-architected the user race and training session service. We could have moved that from our MySQL to a Bigtable-style database or whatever without affecting our race-by-organizer service. But doing this upfront feels so awkward; we would have had duplicate tables and redundant code.<br />
<br />
<b>Define and isolate services in the architecture.</b><br />
<br />
If we focus on isolating services across our components instead of isolating technology, then we can actually re-architect our bottlenecks. In fact, in our example we could just have added a UCI service that duplicates the other services, and if it ran into performance issues we could just have re-architected it. But that would have forced us to duplicate more upfront and to increase our initial development costs.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis6veOzYfJvV02m2yWE4F-_iV67vIPgqDxzdtEaexLxD9ZZqi77FKkWnsijRKFCioU5IWWIzT3a0BF4vWSL1kqBhB_73j4C_T1xJiGvA_itmbg2rN5M1JP3eG_oNX2MAnRrbxC4crEaR05/s1600/reArchitect4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis6veOzYfJvV02m2yWE4F-_iV67vIPgqDxzdtEaexLxD9ZZqi77FKkWnsijRKFCioU5IWWIzT3a0BF4vWSL1kqBhB_73j4C_T1xJiGvA_itmbg2rN5M1JP3eG_oNX2MAnRrbxC4crEaR05/s400/reArchitect4.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Services can be extremely similar and <br />yet be different services</td></tr>
</tbody></table>
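A minimal sketch of what "duplicate upfront" means in code: two services exposing near-identical data, but each owning its own types and its own storage, so one can be re-architected without touching the other. All names here are invented for illustration, and simple in-memory maps stand in for each service's independent store:

```java
import java.util.List;
import java.util.Map;

// Deliberately duplicated DTOs: same underlying data, separately owned shapes.
record UserRaceResult(String raceId, long seconds) {}
record OrganizerRaceResult(String userId, long seconds) {}

// Serves the user's view of their races; free to move to any storage.
class UserSessionService {
    private final Map<String, List<UserRaceResult>> byUser;
    UserSessionService(Map<String, List<UserRaceResult>> byUser) { this.byUser = byUser; }
    List<UserRaceResult> racesForUser(String userId) {
        return byUser.getOrDefault(userId, List.of());
    }
}

// Serves the organizer's highly-available race results; its SLA is unaffected
// by whatever happens to the user-facing service above.
class RaceByOrganizerService {
    private final Map<String, List<OrganizerRaceResult>> byRace;
    RaceByOrganizerService(Map<String, List<OrganizerRaceResult>> byRace) { this.byRace = byRace; }
    List<OrganizerRaceResult> resultsForRace(String raceId) {
        return byRace.getOrDefault(raceId, List.of());
    }
}
```

The duplication looks wasteful, but it is exactly what buys each service its own lifecycle.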
<br />
It's hard to "get it right" when the right thing can be against everything you have been taught for years. What we must learn to understand better is how we define and isolate services, so that we can re-architect the bottlenecks for the services that experience them and not the entire system.<br />
<br />
<br />
<br />
<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-21684152025352135162013-02-24T15:51:00.002+01:002013-02-24T22:44:30.499+01:00So it took a year.When we first started building our continuous delivery pipe I had no idea that the biggest challenges would be non-technical. Well, I did expect that we would run into a lot of dev-vs-ops related issues and that the rest would be just technical issues. I was so naive.<br />
<br />
We seriously underestimated how continuous delivery changes the everyday work of each individual involved in the delivery of a software service. It affects everyone: Developer, Tester, PM, CM, DBA and Operations professionals. Really it shouldn't be a big shocker, since it changes the process of how we deliver software. So yes, everyone gets affected.<br />
<br />
The transition for our developers took about a year. Just over a year ago we scaled up our development and added, give or take, 15-20 developers. All these developers have been of very high quality and are very responsible individuals. However, none of them had worked in a continuous delivery process before and all were more or less new to our business domain.<br />
<br />
When introducing them, everyone got the rundown of the continuous delivery process: how it works, why we have it and that they need to make sure to check in quality code. So off you go, make code, check in tested stuff, and if something still breaks you fix it. How hard can it be?<br />
<br />
Much, much harder than we thought. As I said, all our developers are very responsible individuals. Still, it was a change for them. What was once considered responsible, like "if it compiles and passes the unit tests, check it in so that it doesn't get lost", now leads to broken builds. Doing this before leaving early on Friday becomes a huge issue because others have to fix the build pipe. The same goes for a lot of things, like having to ensure that database scripts work all the time, that everything related to the database is versioned, that rollbacks work, etc. So everyone has had to step up their game a notch or two.<br />
<br />
Continuous delivery really forces the developer to test much more before he/she checks in the code. Even for the developers that like to work test-driven with their JUnit tests this is a step up. For many it's a change of behavior, and changing a behavior that has become second nature doesn't happen overnight.<br />
<br />
We had a few highly responsible developers that took on this change seamlessly. These individuals had to carry a huge load during this first year. When responsibility was dropped by one individual, it was these who always ensured that the pipe was green. This has been the biggest source of frustration. I get angry, frustrated and mad when the lack of responsibility by one individual affects another individual. They get angry and frustrated as well, because they don't want to leave it in a bad state and their responsibility prevents them from going home to their families. I'm so happy that we didn't lose any of these individuals during this period.<br />
<br />
Now, after about a year, things have actually changed: everyone takes much more responsibility and fixing the build pipe is much more of a shared effort. Which is so nice. But why did it take such a long time? I'd really like to figure out if this transition could have been made smoother and faster.<br />
<br />
Key reasons why it took so much time:<br />
<br />
<b>A change to behavior.</b><br />
Developers need to test much more, not just now and then but all the time. No matter how much you talk about "test before check-in", "test", "test", "test", the day the feature pressure increases a developer will fall back on second-nature behavior and check in what he/she believes is done. We can talk lean, kanban, queues, push and pull all we want, but the fact is there will always be situations of stress. Not until a behavior change has become second nature will we do it under pressure.<br />
<b><br /></b>
<b>Immature process.</b><br />
Visibility, portability and scalability issues have made it hard to take responsibility. Knowing when, where and how to take responsibility is super important. Realizing that lack of responsibility is tied to these took us quite some time to figure out. If it's hard to debug a test case, it's going to take a lot of time to figure out why things are failing, and it's going to require more senior developers to figure it out. It's also hard to be proactive with testing if the portability between the development environment and the test environment is bad.<br />
<b><br /></b>
<b>A lot of new things at once</b><br />
When you tell a developer about a new system, a new domain and a new process, I'm quite sure the developer will always listen more to the system- and domain-specific talks.<br />
The developer's head is full of "this system communicates with that system and it's that type of interface". Then I start going on about "Jira, bla bla bla, test bla, check-in bla bla, Jenkins bla, deploy, bla, FitNesse, test bla, bla" and the developer goes "Yeah yeah yeah, I'll check in and it gets tested, I hear you, sweet!".<br />
<br />
I definitely think it's much easier for a developer to make the transition if the process is more mature, has optimized feedback loops, scales and is portable. Honestly, I think that easily takes 3-6 months off the learning curve. But it's still going to take a lot of time, in the range of months, if we don't become better at understanding behavioral changes.<br />
<br />
Today we go straight from intro session (slides or whiteboard) to live scenario in one step. Here is the info, now go and use it. At least now we are becoming better at mentoring, so there is help to get: you can be talked through the process, and the new developer is usually not working alone, which they were a year ago. Still, I don't think it's enough.<br />
<br />
<b>Continuous Delivery Training Dojos</b><br />
<br />
I think we really need to start thinking about having training dojos where we learn the process from start to finish. I also think this is extremely important when transitioning to acceptance-test-driven development, but simply for getting a feeling for the process it matters: what is tested where and how, what happens when I change this and that, how I should test things before committing, and what should be done in which order.<br />
<br />
I think if we practiced this, and worked on how to break and unbreak the process in a non-live scenario, the transition would go much faster. In fact, I don't think these dojos should be just to train new team members; they would also be an extremely effective way of sharing information and the consequences of process change over time.<br />
<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com2tag:blogger.com,1999:blog-5682735369888209586.post-57252194319054504932013-02-11T18:58:00.002+01:002013-02-13T12:29:02.659+01:00Talk at ÅF Consult 2013-01-12On Tuesday the 12 January I have a talk about Continuous Delivery at ÅF Consult, Gothenburg.<br />
<br />
This is the agenda of the day.<br />
<b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"></b><br />
<div dir="ltr" style="margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
</div>
<ol>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Intro to Continuous Delivery</span></b></li>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><div dir="ltr" style="display: inline !important; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Principles of Continuous Delivery</span></div>
</b></li>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><div dir="ltr" style="display: inline !important; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Look at a Pipe</span></div>
</b></li>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><div dir="ltr" style="display: inline !important; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Impact on Scrum</span></div>
</b></li>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><div dir="ltr" style="display: inline !important; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Feature Driven Development</span></div>
</b></li>
<li><b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><div dir="ltr" style="display: inline !important; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;">Impact on Developers and Testers</span></div>
</b></li>
</ol>
<div>
<span style="font-family: Arial;"><span style="white-space: pre-wrap;">Participants please use this post for feedback and any questions that you didn't get a chance to ask and would like me to answer. </span></span><br />
<span style="font-family: Arial;"><span style="white-space: pre-wrap;"><br /></span></span>
<span style="font-family: Arial;"><span style="white-space: pre-wrap;">The slides from the presentation can be found <a href="http://www.slideshare.net/TomasRiha/continuous-delivery-f-consult-16503096">here</a>.</span></span></div>
<br />
<br />
<b id="internal-source-marker_0.8984657451510429" style="font-weight: normal;"><span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"></span></b>Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-15581625590062153442013-02-07T17:52:00.002+01:002013-02-07T17:53:15.187+01:00The world upside down.Sometimes the world just goes upside down. I talked in a previous post about how continuous delivery and test driven development<a href="http://continuous-delivery-and-more.blogspot.se/2012/12/the-impact-of-continuous-delivery-on.html"> change the role of the tester</a>. I retract that post. Well, maybe not fully, but let me elaborate.<br />
<br />
Our testers are asking "HOW should we automate?" and our developers are asking "WHAT are you trying to do?"<br />
<br />
I've thought that our problem has been that we haven't been able to find testers that know HOW to automate. It's not our problem. Our problem is that we are asking our testers to automate instead of asking them WHAT to test. If a tester figures out what to test, then any developer can solve the how to automate with ease.<br />
<br />
I still believe that the role of the tester will change over time and that anyone who can answer both the what and the how questions will be the most desired team member in the future. Until then we need to have testers thinking about WHAT and developers thinking about HOW.<br />
<br />
Frustrating how such a simple fact can stare us in the eyes for such a long time without us noticing it.<br />
<br />
Next time a tester asks how and a developer has to question what, then it's time to stop the line and get everyone into the same room asap!Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-72800239867085862782013-02-05T15:14:00.000+01:002013-02-10T12:00:17.395+01:00My dear Java what have you done?!?!<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpczK_vIfxkDSdTtdnYjp69kLUBcsQYtrtTx-9acJ6TzIdFN4GJq1UMkLAQ1u2gLHZPTcepRQ8vcVDGEgK2MTfZgJUgsfnrf7BuVSKUiFj9YP_peinodQ_HM4Ohh_7dd6_mIg1dRZuqZ6I/s1600/Java&Database+plattform+sucks+(1).png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpczK_vIfxkDSdTtdnYjp69kLUBcsQYtrtTx-9acJ6TzIdFN4GJq1UMkLAQ1u2gLHZPTcepRQ8vcVDGEgK2MTfZgJUgsfnrf7BuVSKUiFj9YP_peinodQ_HM4Ohh_7dd6_mIg1dRZuqZ6I/s400/Java&Database+plattform+sucks+(1).png" height="300" width="400" /></a></div>What is this diagram? Well, it's a generic lightweight storage component. In previous posts on this blog I've talked about how we have refactored our main components into smaller specialized components and how fruitful that has been for us. It really has, and it's been one of the best architectural steps we have taken; the components are super clean and very well written. They are very well written based on our requirements and on how we as a community expect Java applications to be built. But they have really made me think: are we actually putting the right requirements on them?<br />
<br />
Let's take an example: if our system was a fitness application, then we could have four of these components, one for user profiles, one for assets (bikes, shoes, skis, etc.), one for track routes (run, bike, ski routes) and one for training schedules. Small, simple components that store and maintain well-defined data sets. Then we have a service component that aggregates these data sets into a consumer service.<br />
<br />
So say that our route track component has the responsibility of storing a set of GPS track points and classifying them as run, bike or ski. How hard can it be? Well, look at the diagram.<br />
<br />
First of all, what is the value of the component? It's providing a set of track points. There could be some rules on what data is required and how missing or irregular data is handled, since GPS can be out of reach. But it's still very simple: receive, validate, store and retrieve. The value is the data. The request explicitly asks for data related to a user and an activity type. So why does the request go through layer upon layer of frameworks and other crap in order to even get to the data? Shouldn't we strive to keep the data as close as possible? Shouldn't we always strive to keep the value as close to the implementation as possible?<br />
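The receive, validate, store and retrieve cycle really is this small. A sketch, with an invented validation rule (points missing coordinates, e.g. when GPS was out of reach, are rejected rather than stored):

```java
import java.util.ArrayList;
import java.util.List;

// A GPS point; lat/lon may be null when the GPS signal was lost.
record GpsPoint(Double lat, Double lon, long timestamp) {}

class TrackPointStore {
    private final List<GpsPoint> stored = new ArrayList<>();

    // "Receive and validate": keep complete points, report the irregular ones.
    List<GpsPoint> receive(List<GpsPoint> incoming) {
        List<GpsPoint> rejected = new ArrayList<>();
        for (GpsPoint p : incoming) {
            if (p.lat() == null || p.lon() == null) {
                rejected.add(p);   // irregular data is filtered out
            } else {
                stored.add(p);     // "store"
            }
        }
        return rejected;
    }

    // "Retrieve": hand back the data, which is the whole value of the component.
    List<GpsPoint> retrieve() {
        return List.copyOf(stored);
    }
}
```

That is the entire business value; everything else in the diagram is plumbing.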
<br />
I've always been a huge fan of persistence frameworks. I worked back in the good old dot-com era, when everyone and his granny wanted into IT to become programmers. I've seen these developers trying to work directly with JDBC and SQL. The mess they created, and the mess much better developers created even after the dot-com era, has made me believe that abstracting away the database through persistence frameworks is a must. The gain on the 90% of vanilla code written by the average Joe heavily outweighs the corner cases where you need a seasoned developer to write the native queries and mappings.<br />
<br />
Though I'm starting to believe that JPA and Hibernate have to be the worst thing that has happened to Java since EJB2. Throwing a blanket over a problem doesn't make the problem go away. The false notion of not having to care about the database can have catastrophic consequences. Good developers understand this and have learned to understand the mapping between the JPA model and the DB model. They have learned how the framework works and they have learned how HQL maps to SQL. By trying to mitigate the bad code written by the average Joes we have created even more complexity, and developers now have to master not just databases but also a new framework on top of them.<br />
<br />
So for us to get our data from our database to our REST interface, we create objects that map to a model of another system. Then we write queries in a query language that is super hard to test and experiment with and that still doesn't give us feedback at compile time. These queries and mapped objects get translated into the query language of a remote system. Note that we transform it in the Java application into the language of the other system, which means that we in our system need to have a full notion and understanding of the other system. Then we take this translated query and feed it through a straw to that other system. Down in that system the query is executed in the query engine, which we have to understand in our system since what we generate maps directly onto it. Then this query engine executes on the logical data model, which happens to be the core value of our application. The only thing that does make sense is the mapping between logical and physical storage of the data, since this we don't have to care about and it's actually been abstracted out pretty well.<br />
<br />
The database engine then feeds us the data back through our straw, where Hibernate maps it back to our Java objects. Then of course we have been good students and listened to the preaching about patterns, so we transform our entity objects into DTOs that we then feed through multiple layers of frameworks so that they can be written to the HTTP response.<br />
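The entity-to-DTO shuffling looks roughly like this in a typical layered Java application. A hedged sketch: the names are invented, and a plain list stands in for the whole repository/JPQL/SQL round trip being complained about:

```java
import java.util.List;

// Shape dictated by the ORM mapping of the other system's data model.
class TrackPointEntity {
    final String userId;
    final double lat, lon;
    TrackPointEntity(String userId, double lat, double lon) {
        this.userId = userId; this.lat = lat; this.lon = lon;
    }
}

// Shape exposed over REST; same data, copied again.
record TrackPointDto(double lat, double lon) {}

class TrackPointService {
    // Stand-in for the repository layer; really this would be a JPQL query
    // that Hibernate translates to SQL and ships through the "straw".
    private final List<TrackPointEntity> store;
    TrackPointService(List<TrackPointEntity> store) { this.store = store; }

    List<TrackPointDto> trackPointsFor(String userId) {
        return store.stream()
                .filter(e -> e.userId.equals(userId))
                .map(e -> new TrackPointDto(e.lat, e.lon))  // entity -> DTO copy
                .toList();
    }
}
```

Every representation in this chain holds exactly the same track-point data; only the class names change on the way to the HTTP response.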
<br />
To quote a colleague: "Im so not impressed by the data industry". I totally agree. We have really made a huge mess of things. Not of our delivery though; for being what it is, a delivery on the Java platform with an RDBMS, it's a very well written and solid application.<br />
<br />
Am I saying that we should just throw out all the frameworks and go back to using JDBC and Servlets straight off? Well, no, obviously (even though I say it when I'm frustrated). The problem needs to be solved at the root cause. JDBC and SQL were equally bad because they were still pushing data through a straw. Queries written in one system, pushed through a straw into another system where they are executed, is never ever going to be a good model. Then the fact that the data structure of the database system doesn't match the object structure of the requesting application is another huge issue.<br />
<br />
I really think that the query engine and the logical data storage model need to become part of the application, not just mapping frameworks. Some of the NoSQL databases do this, but most of them still work too much like JDBC, where you create your connection to the database engine and then send it a query. Instead I'd like to just query my data. I want data, not a connection. The connection has to be there, but it should be between the query engine and the persistent store, not between me and the query engine.<br />
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">Query q = Query.createQuery(TrackPoint.class);</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">List<TrackPoint> trackpoints = q.where("userId").eq("triha")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">                               .and().where("date")</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">                               .between("2012-01-01", "2013-01-30");</span><br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX3WZWzdB4IEbRZeCWKqiTsUtrEOHq6HCxArqUTxhhr7NOMiKGe_w-m-8TQ_Jpo26NLTkMzooKn67gek92_GnioWJJyuzQESRz4XxGS3Q_-OsxWbEzWuaTDkuS-tCIxACHfoaKnMR34Y9a/s1600/Java&Database+plattform+sucks+(3).png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX3WZWzdB4IEbRZeCWKqiTsUtrEOHq6HCxArqUTxhhr7NOMiKGe_w-m-8TQ_Jpo26NLTkMzooKn67gek92_GnioWJJyuzQESRz4XxGS3Q_-OsxWbEzWuaTDkuS-tCIxACHfoaKnMR34Y9a/s400/Java&Database+plattform+sucks+(3).png" height="300" width="400" /></a><br />
This should be it. My object should be queried and logically stored as close to me as possible. This way we would mitigate the problems of JDBC and SQL by removing them, not covering them up with yet another framework. This would give us a development platform easy enough for average Joes and yet not dumbed down to a level that restricts our bright minds.<br />
<br />
We could use more of our time actually writing code that adds value rather than managing frameworks that are just a necessity and not a value. Our simple components would look something like the diagram to the right. Actually pretty simple.<br />
<br />
How hard can it be? Obviously quite hard. The most amazing thing is that we get all this complexity to work. No wonder it's super expensive to build systems.<br />
<br />
<br />
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-63998741637666755152013-02-03T15:51:00.001+01:002013-02-03T15:52:10.412+01:00Working the trunkWhen my colleague Tomas brought up the idea of continuous delivery, the first thing that really caught my attention was "we do all work on the trunk". I've always hated branches. I've worked with many different branching strategies and honestly they have all felt wrong. <br />
<br />
My main issue has always been that regardless of branch strategy (Release Branches or Feature Branches) it's a lot of double testing, and debugging after a merge is always horrible. It's also hard to have a clear view of a "working system": what is the system, which branch do you refer to? Always having a clean and tested version of the trunk felt very compelling. No double work and a clear notion of "the system"! I'm game!<br />
<br />
So we test everything all the time. How hard can it be?<br />
<br />
Well, it has proven to be a lot harder than we thought, not to continuously test but to manage everyone's desire to branch. Somehow people just love branches. Developers want their feature branches where they can work in their sandbox. Managers want their branches so that they get nothing other than just that explicit bug fix for their delivery and don't risk impact from anything else.<br />
<br />
These are two different core problems: one is about taking responsibility and one is about trust.<br />
<br />
<b>Managers don't trust "Jenkins".</b><br />
<br />
Managers don't trust developers but somehow they do trust testers. It's interesting how much more credit a QA manager has when he/she says "I've tested everything" than a blue light on Jenkins has. In fact managers have MORE confidence in a manual regression test that has executed "most of the test-cases on the current build" than in an automated process which executes "all the test-cases on every build". I think the reasons are twofold: one is that the process is "something that the devs cooked up" and the other is that Jenkins can't look a manager in the eyes. It would be much easier if Jenkins was actually a person who had a formal responsibility in the organisation and could be blamed, shouted at and fired if things went wrong.<br />
<br />
It takes a lot of hard work to sell "everything we have promised works as we have promised it". For each new build that we push into user acceptance testing we need to fight the desire to branch the previous release. Each time we have to go through the same discussion.<br />
<br />
<i>"I just want my bug fix"</i><br />
<i>"you get the newest version"</i><br />
<i>"I don't want the other changes"</i><br />
<i>"Everything we have promised works as we have promised it"</i><br />
<i>"How can you guarantee that"</i><br />
<i>"We run the tests on each check in"</i><br />
<i>"Doesn't matter I don't want the other changes they can break something, I want you to branch"</i><br />
<br />
I don't know how many times we have had this argument. Interestingly, we have yet to break something in a production deploy as a result of releasing a bug fix from the trunk (and hence including half-done features). We have, however, had a failed deploy due to giving in to the urge to branch. We made a bad call and caved to the branch pressure. We branched, but we didn't build a full pipe for the branch, which resulted in us not picking up an incompatible configuration change.<br />
<br />
<b>Developers love sandboxes</b><br />
<b><br /></b>
It's interesting: developers push for more releases and smaller work packages, yet they love their feature branches. I despise feature branches even more than release branches. The reason is that they make it very hard to refactor an application, and the merging process is very error prone. The design and implementation of a feature is based on a state of the "working system" which can be totally different from the system it's merged onto. It also breaks all the intentions to do smaller work packages and test them often; a merge is always bigger than a small commit.<br />
<br />
The desire to feature branch comes from <i>"all the repo updates we have to do all the time slow us down so much"</i> and <i>"we can't just check stuff in that doesn't work without breaking stuff"</i>. The latter isn't just from developers wanting to be irresponsible, it's also from us running SVN and not Git. Developers do want to share code in a simple way. Small packages that two team mates want to share without touching the trunk is a valid concern, so the ability to micro branch would be nice. So yes, I do recommend Git if you can, but it's not a viable option for us. Though I'm quite sure that if we were using Git we would end up having problems related to micro branches turning into stealth feature branches.<br />
<br />
I think the complaint "all the repo updates we have to do all the time slow us down so much" is a much more interesting one. In general I think that developers need to adopt more continuous integration patterns in their daily work, but this is actually a scalability issue. If you have too many developers working in the same part of the repo you are gonna get problems. When developers do adopt good continuous integration patterns in their daily work and their productivity still drops, then there is an issue. This is one of the reasons why we have seen feature branches in the past.<br />
<br />
<b>Distribute development across the repository</b><br />
<br />
When we started building our delivery platform we based it on an industry standard pattern that clearly defines component responsibility. Early on in the life cycle we saw no issues with developers contesting the same repository space, as we had just one or two working on each component. But as we scaled we started to see more and more of this. We also saw that some of the components were too widely defined in their responsibility, which made them bloated and hard to understand. So we decided to refactor the main culprit component into several smaller, more well defined components.<br />
<br />
The result of this was very good. The less bloated a component is, the easier it is to understand and to test, which leads to increased stability. By creating sub components we also spread out the developers across the repository. So we actually created stable contextual sandboxes that are easy to understand and manage.<br />
<br />
Obviously we shouldn't just create components to spread out our developers, but I think that if developers start stepping on each other then it's a symptom of either bad architecture or overstaffing. If a component needs so many developers that they are in the way of each other, then the chance is quite good that the component does way too much or that management is trying to meet feature demand by just pouring in more developers.<br />
<br />
<b>Backwards compatibility</b><br />
<b><br /></b>
Another key to working on the trunk has been our interface versioning strategy. Since we mostly provide REST services, we actually were forced into branching once or twice where we had no other option, and that was due to not being backwards compatible on our interfaces. We couldn't take the trunk into production because the changes were not backwards compatible and our tests had been changed to map the new reality. This is what led to our new interface strategy where we, among other things, never ever change interfaces or payloads, just add new ones and deprecate old ones.<br />
<br />
Everything that interfaces the outside world needs to be kept backwards compatible or program management and timing issues will force inevitable branching.<br />
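In code, the "add, never change" rule might look something like the following Java sketch. The class names and payloads here are invented for illustration, not our actual API; the point is that a published payload version is frozen forever, and new needs get a new version added alongside it so old consumers never break.<br />

```java
// Illustrative sketch of additive interface versioning.
class RegistrationV1 {                  // frozen forever once released
    String name;
    String email;
}

class RegistrationV2 {                  // added next to v1, never replacing it
    String name;
    String email;
    String phoneNumber;                 // the new field that motivated v2
}

class RegistrationService {
    // Existing consumers keep calling the v1 operation untouched.
    @Deprecated
    String registerV1(RegistrationV1 r) {
        return "registered " + r.email;
    }

    // New consumers opt in to v2 at their own pace.
    String registerV2(RegistrationV2 r) {
        return "registered " + r.email + " (" + r.phoneNumber + ")";
    }
}
```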
<br />
<b>Not what I expected</b><br />
<b><br /></b>
When we first decided to work solely on the trunk I thought it was gonna be all about testing. That's important, but I think people management has been a bigger investment (at least measured in mental energy drain), and the importance of good architecture was underrated.<br />
<br />Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-37881371933092251122013-01-15T12:34:00.000+01:002013-01-20T00:17:57.813+01:00Package power!We often talk about pipe design and how to implement it in Jenkins or other CI tools, that everything should be versioned and that everything should be tested all the time. These things are very important, but something I didn't realize for quite some time was how important packaging is.<br />
<br />
<b>Our packaging was giving us problems.</b><br />
<br />
Early on when building our continuous delivery pipe we were a bit worried about the number of artifacts we were spewing out of our pipe and the impact it would have on our Nexus repo. So we did release our war and jar files into our repo, but the final deliverable assembly we released was just a property file containing versions. These property files were used by our rudimentary bash deploy scripts. The scripts basically did a bunch of wgets to retrieve the artifacts from the Nexus repo before deploying them. Yeah, laugh away, I now know how dumb this was. <br />
<br />
Our main problem due to this was that our scripts were very delivery specific. For delivery Y we had components A, B and C, while for delivery Z we had components A, D and E. We couldn't reuse things well enough, so we had duplicates of our scripts. Another issue was that there was no portability in this whatsoever. We didn't really make the connection between lack of packaging and our huge developer environment problems. Switching between working on delivery X and Z was tedious because we were managing the local deployments in Eclipse with the JBoss plugin. It also required full understanding of what components needed to be deployed. <br />
<br />
Manual tasks and a required high level of domain knowledge didn't make things easy for our new developers. In fact it also made life a PITA for our architects, who develop fewer hours a week than the developers. For them the rotting of the development environment was a huge issue. Since all components were managed manually, all had to be updated, built and deployed.<br />
<br />
<b>Inspiration and goals.</b><br />
<br />
When my colleague and I were at <a href="http://qconnewyork.com/">QCon NY</a> (awesome conf that everyone should try to attend) we listened to talks by <a href="http://techblog.netflix.com/">Netflix</a> and <a href="http://codeascraft.etsy.com/">Etsy</a>. We were totally blown away by two things: Etsy's practice that a new developer should code and deploy a production change on the first day, and Netflix's baking of images instead of releasing wars and ears. These were two of the main things we brought back with us and two things that we keep revisiting as we iterate our process.<br />
<br />
Since we don't do continuous deployment, we set the goal that a new developer should be able to commit a change that is ready for delivery on the first day. The continuous delivery part of the goal wasn't the problem since we already had that in place; it's the most obvious part of that goal. The next obvious task for us was that we really had to do something about our dev env setup. Then with some thought we realized that this wasn't enough: we needed to do something about our entire onboarding process, with mentoring and level of knowledge in the team. In order to mentor someone a developer needs to have a good understanding of most tasks in Jira. At this stage this wasn't the case. <br />
<br />
We made the knowledge increase our priority since this was biting us in many ways. I won't go much more into that. Then we tried to prioritize the setup of our developer environment, but doing something about our <a href="http://continuous-delivery-and-more.blogspot.se/2012/12/deploy-scripts-how-hard-can-it-be_12.html">deploy scripts</a> ended up being a higher priority. This was a very good and honestly lucky decision. We knew how to do our deploy script changes, and our production deployments were really more important. But we were also not sure how to do our developer environment changes, so sleeping on it was what we decided, even though our devs were literally screaming in frustration.<br />
<br />
<b>Addressing the problems.</b><br />
<br />
First thing we did when we started to rewrite our scripts was to sort out our packaging once and for all. We killed the property file and started using Maven for everything. We had already been using Maven to release all components and most configurations, but we were not using Maven to package our final deployables and we were not using it to release our deploy scripts. We had already been made very well aware that we had to tie our deploy scripts to our deployable assembly. We changed both these things. We started to release everything, not just version everything. This imho is a very important thing that's not mentioned enough. Blogs, articles and demos talk about versioning everything but not so much about the importance of actually releasing everything and treating each release as an artifact, even if it's "just" an httpd.conf. <br />
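For illustration, releasing the final deliverable as a real Maven artifact can be done with the maven-assembly-plugin and a descriptor along these lines. The directory layout, output folders and id are assumptions for the sketch, not our actual setup:<br />

```xml
<!-- Illustrative assembly descriptor: bundle the wars plus config,
     deploy and liquibase scripts into one versioned, released tarball. -->
<assembly>
  <id>delivery</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <dependencySets>
    <!-- Pull in the deployable components as resolved Maven dependencies. -->
    <dependencySet>
      <outputDirectory>wars</outputDirectory>
      <scope>runtime</scope>
    </dependencySet>
  </dependencySets>
  <fileSets>
    <!-- Ship the deploy scripts and config with the artifacts they deploy. -->
    <fileSet>
      <directory>src/main/deploy</directory>
      <outputDirectory>bin</outputDirectory>
    </fileSet>
    <fileSet>
      <directory>src/main/config</directory>
      <outputDirectory>conf</outputDirectory>
    </fileSet>
    <fileSet>
      <directory>src/main/liquibase</directory>
      <outputDirectory>db</outputDirectory>
    </fileSet>
  </fileSets>
</assembly>
```

The key property is that the scripts and the deployables travel together in one released, versioned package, so they can never drift apart.<br />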
<br />
Once we started building these packages and setting our structure, it was so clear how Netflix came to the conclusion that they should bake images. The package contains war files, config files, deploy scripts, liquibase scripts, custom JBoss control scripts, httpd.conf, etc, etc. The more we package and the more servers we get in our park, the more things we notice that we need to put into the package. Then it becomes even more obvious, since we take this package and transfer it to tons of servers for different test purposes. Once at the server we run our deploy scripts that copy and link stuff into place on the server. Remind me, why are we doing this over and over? Wouldn't it be better to just do this once, make an image out of it and mount this image on different nodes? Of course it would; Netflix know what they are talking about! Most importantly it would bring the final missing pieces into the package: JBoss, Java and Linux distributions. That would give us the power to actually roll out and test even OS patches through the same process as any other change. We aren't there yet, but the path is obvious and it's nice to feel that what was once an overwhelming w000t is now a definite possibility.<br />
<br />
So through a good packaging strategy we managed to improve and solve our deploy script problems. We now had one script to distribute and deploy them all! This also resulted in far fewer changes to the deploy scripts, which in turn made them more stable. A lot of changes that previously required changes to deployment scripts now just require a change to the packaging, which makes the entire deployment process much more robust.<br />
<br />
<b>Portability!</b><br />
<br />
Still, we hadn't solved our issues with our developer environments. I had had a hunch for some time that our packaging could help us, but it took us some time before we realized that we actually had created an almost fully portable deployment solution. Our increased Maven usage had made us so portable that we could just write a simple script that combined the essence of the assembly job and the deployment job of our Jenkins pipe into a local dev env script. By adding "snapshots true" to our Maven version properties update we allowed our assemblies to be built including snapshots. Then we could just use our deploy scripts and voila, our local JBosses and Mule ESBs were deployed with artifacts containing our code changes and, most importantly, our rebel.xmls, giving us full JRebel power with our production deploy scripts.<br />
<br />
Our packaging strategy had made our continuous delivery process portable to our development environment, allowing us to use the same assemble+deploy from local dev env to prod. Our developers now just need to know what assembly to deploy, and they don't need to rebuild all included components, just the ones they are currently working with; the others are added by Maven from the Nexus repo. So now our developers can quickly and easily switch between single component deploys and full deliveries.<br />
<br />
<b>Getting closer to our goals.</b><br />
<br />
By adding JBoss &amp; Mule installations to the script we further simplified the setup process for the new developers. We still have a few things we want to add to the script, such as IDE install and initial source code checkout, in order to simplify things further, but that will have to rest a bit since we have other higher priorities. Still, we have taken huge steps towards our Etsy inspired goal of having new developers commit a code change on the first day.<br />
<br />
It feels like all these levels of improvement have been unlocked by a good packaging strategy!<br />
<br />
If there's one thing I would change about the way we went about our implementation, it's the packaging. It's easy to say in hindsight, but I'd really try to do it properly off the bat.Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-33729945938949072982013-01-09T15:51:00.000+01:002013-01-09T21:29:12.971+01:00Test for runtimeTraditionally our testers have been responsible for functional testing, load testing and in some cases for some failover testing. This covers our functional requirements and some of our supplemental requirements as well. Though it doesn't cover the full set of supplemental requirements, and we haven't really taken many stabs at automating these in the past. <br />
<br />
The fact that we haven't really tested all the supplemental requirements also leaves a big question: whose responsibility is the verification of supplemental requirements? Let's park that question for a little bit. To be truthful we don't really design for runtime either. Our supplemental requirements almost always come as an afterthought and after the system is in production. They always tend to get lost in the race to get features ready.<br />
<br />
In our current project we try to improve on this but we are still not doing it well enough. We added some of the logging related requirements early but we have no requirement spec and no verification of the requirements.<br />
<br />
The logging we added was checkpoint logging and performance logging. Both of these are requirements from our operations department. The checkpoint log is a functional log which just contains key events in an integration. It's used by our first line support to do initial investigation. The performance log is for monitoring the performance of defined parts of the system. It's used by operations for monitoring the application.<br />
<br />
Let's use user registration as an example (it's a fictive example).<br />
<br />
1. User enters name, username, password and email into a web form.<br />
2. System verifies the form.<br />
3. System calls a legacy system to see if the email is registered in that system as well.<br />
3a. If the user is registered in the legacy system with matching username and password, the userid is returned from that system.<br />
4. System persists user.<br />
5. Email is sent to user.<br />
6. Confirmation view displayed.<br />
<br />
From this we can derive some good checkpoints.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:07:974 | null | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst </span><br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:08:234 | usr123 | User found in legacy system</span><br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:08:567 | usr123 | User persisted</span><br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:08:961 | usr123 | User notified at joe@some.tst</span><br />
<br />
The performance log could look something like this.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:07:974 | usr123 | Legacy lookup completed | 250 | ms</span><br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:08:566 | usr123 | User persisted | 92 | ms</span><br />
<span style="font-family: Courier New, Courier, monospace;">2013-01-07 21:30:08:961 | usr123 | User registration completed | 976 | ms</span><br />
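A minimal helper producing these two formats could look like the following sketch. The class and method names are hypothetical; the timestamp pattern and pipe-separated layout just follow the sample lines above.<br />

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper emitting the checkpoint and performance log
// formats shown above: "timestamp | correlationId | message" and
// "timestamp | correlationId | message | value | ms".
class RuntimeLog {
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS");

    // Checkpoint line: key functional event tied to a correlation id.
    static String checkpoint(String correlationId, String message) {
        return FMT.format(new Date()) + " | " + correlationId + " | " + message;
    }

    // Performance line: same prefix plus a measured duration in ms.
    static String performance(String correlationId, String message, long millis) {
        return checkpoint(correlationId, message) + " | " + millis + " | ms";
    }
}
```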
<br />
This is all nice but who decides what checkpoints should be logged? Who verifies it?<br />
<br />
Personally I would like to make the verification the responsibility of the testers, though I've never been in a project where testers have owned the verification of any kind of logging. This logging is in fact not "just" logging but system output, and hence should definitely be verified by the testers. By making this the responsibility of the tester, it also trains the tester in how the system is monitored in production.<br />
<br />
So how can this be tested?<br />
<br />
Let's make a pseudo Fitnesse table to describe the test case.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">| our functional fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| go to | user registration form |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| enter | name | Ted | username | JohnDoe | email | joe@some.tst |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | status | Registration completed |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | email account | registration mail received |</span><br />
<br />
This is how most functional tests would end. But let's expand the responsibility of the tester to also include the supplemental requirements.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">| checkpoint fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User found in legacy system |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User persisted |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User notified at joe@some.tst |</span><br />
<br />
So now we are verifying that our first line support can see a registration main flow in their tool that imports the checkpoint log. We have also taken responsibility for officially defining how a main flow is logged, and we are regression testing it as part of our continuous delivery process.<br />
<br />
That leaves us with the performance log. How should we verify that? How long should it take to register a user? Well we should have an SLA on each use case. The SLA should define the performance under load and we should definitely not do load testing as part of our functional tests. But we could ensure that the function can be executed within the SLA. More importantly we ensure that we CAN monitor the SLA in production.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">| performance fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | Legacy lookup completed | sub | 550 | ms |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User persisted | sub | 100 | ms |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User registration completed | sub | 1000 | ms |</span><br />
<br />
Now we take responsibility for making the system monitorable in production. We also take responsibility for officially defining what measuring points we support, and since we do continuous regression testing we make sure we don't break the monitorability.<br />
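Behind tables like these, the fixture logic can be sketched roughly as follows. These classes are hypothetical: a real FitNesse fixture would extend the fit or SLIM base classes, so here the matching logic is shown standalone against captured log lines in the formats above.<br />

```java
import java.util.List;

// Sketch of a checkpoint fixture: each | verify | message | row checks
// that a checkpoint line ending in that message was logged.
class CheckpointFixture {
    private final List<String> logLines;
    CheckpointFixture(List<String> logLines) { this.logLines = logLines; }

    boolean verify(String message) {
        return logLines.stream().anyMatch(l -> l.endsWith("| " + message));
    }
}

// Sketch of a performance fixture: each | verify | message | sub | N | ms |
// row checks that the measured duration stayed under the threshold.
class PerformanceFixture {
    private final List<String> logLines;
    PerformanceFixture(List<String> logLines) { this.logLines = logLines; }

    boolean verifySub(String message, long maxMillis) {
        return logLines.stream()
                .filter(l -> l.contains("| " + message + " |"))
                .anyMatch(l -> {
                    // Format: timestamp | id | message | value | ms
                    String[] parts = l.split("\\|");
                    long measured =
                            Long.parseLong(parts[parts.length - 2].trim());
                    return measured <= maxMillis;
                });
    }
}
```

This keeps the supplemental requirements executable: the same log lines operations will read in production are asserted on every build.<br />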
<br />
If all our functional test cases look like this then we Test for runtime.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">| our functional fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| go to | user registration form |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| enter | name | Ted | username | JohnDoe | email | joe@some.tst |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | status | Registration completed |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | email account | registration mail received |</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br />
</span><span style="font-family: Courier New, Courier, monospace;">| checkpoint fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User found in legacy system |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User persisted |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User notified at joe@some.tst |</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br />
</span><span style="font-family: 'Courier New', Courier, monospace;">| performance fixture |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | Legacy lookup completed | sub | 550 | ms |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User persisted | sub | 100 | ms |</span><br />
<span style="font-family: Courier New, Courier, monospace;">| verify | User registration completed | sub | 1000 | ms |</span><br />
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com2tag:blogger.com,1999:blog-5682735369888209586.post-87232509681843626432013-01-05T22:51:00.001+01:002013-01-05T23:34:17.133+01:00Continuous Delivery and DevOps in a legacy organizationI've been using the term legacy organization. My definition of a legacy organization is a slow changing organization that separates professions in silos. The slow changing nature can, but doesn't have to, be due to size. The separation of professions into silos materializes into a process where responsibility is handed over from profession to profession.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeDnGDjLP8L5Jva0rWbH35llnhiyQpIn0EJAfcPfPnHpfq8JqHBZRGCpXygHYORlpJOaPmzDDHlscfQGNg_9T_hW9QMQjO2bGxikfNr7H-q4mAxo17HFEp7sIBpwbvpMQxTo_qZ7INJcrU/s640/blogger-image-1372400027.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeDnGDjLP8L5Jva0rWbH35llnhiyQpIn0EJAfcPfPnHpfq8JqHBZRGCpXygHYORlpJOaPmzDDHlscfQGNg_9T_hW9QMQjO2bGxikfNr7H-q4mAxo17HFEp7sIBpwbvpMQxTo_qZ7INJcrU/s320/blogger-image-1372400027.jpg" width="320" /></a></div>I have intentionally put development and test into same box. In some legacy organization you see theses separated into two silos where development hands over to a QA department which tests the application. I don't want to say its impossible to do continuous delivery with that type of setup because nothing is impossible. It requires the development organization to start taking responsibility for testing. It can be done by smart recruiting of developers with test focus but its going to be hard.<br />
<br />
I refer to the above setup as a legacy noDevOps organization because it separates development and operations and suffers heavily from the wall of confusion syndrome, but it is an organization where test driven development is possible. Two of the biggest issues in a legacy noDevOps organization are the gunpoint standoff and the dropped responsibility at the wall. The standoff results in unconstructive blame games and lack of constructive change.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4AsKIIBy80XoZp-JerTJI0DG0kKeJPd6sb-sSTMnF92pdUcmD6hnj_dlWJ1FvBcwD-YlLZzK7c0u2R4S9Z455frs7h09vj7xTVGu-CptsHa6rZF4USFNc4V2s3CNK-NBlpVHKGNOAh19Z/s640/blogger-image-845567033.jpg" imageanchor="1" style="clear: left; float: right; margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4AsKIIBy80XoZp-JerTJI0DG0kKeJPd6sb-sSTMnF92pdUcmD6hnj_dlWJ1FvBcwD-YlLZzK7c0u2R4S9Z455frs7h09vj7xTVGu-CptsHa6rZF4USFNc4V2s3CNK-NBlpVHKGNOAh19Z/s640/blogger-image-845567033.jpg" width="320" /></a></div>The dropped responsibility comes when development just wants out of responsibility at the point of handoff. Project managers want to close the project. Developers want to do new cool stuff. So development picks a few members who get to run at the wall when the rest hide. At the wall the mudball of a deliverable is tossed over the wall hoping that someone on the other side catches it.<br />
<br />
A lot of talks and writeups on continuous delivery more or less assume a DevOps organization. It's definitely much easier, since continuous delivery requires the use of the same deployment mechanisms in all environments, which in turn puts a high requirement on similarity in infrastructure. Building a good process without the direct involvement of the infrastructure experts in the operations organization is extremely hard. Doing continuous delivery well requires a higher level of continuous responsibility from the developers. DevOps allows developers to take responsibility in production, which is hard in legacy noDevOps organizations. So yes, obviously continuous delivery is made so much easier with DevOps.<br />
<br />
But what should we do? Should we just sit there and wait until a manager calls a meeting and says we are gonna start doing DevOps and CD? If that happens, the DevOps is gonna be so full of friction because our professions are still at gunpoint standoff. So before anything gets done everyone needs to lower their guns and start trusting each other, and this will take time.<br />
<br />
It's my firm belief that the standoff is always the "fault" of the development organization. If we had been delivering high enough quality in a stable enough application, then there would not have been any standoff and there would have been trust. We can argue all we want that it's not possible to deliver enough stability and quality from a development organization without help and change from operations, but that's beside my point. We can only change our own behavior, and we can only do that by being the change we want to see. <br />
<br />
If we want to deploy more often in order to achieve higher quality, then we make sure to hold up our end of the bargain: higher quality and stability in our deliveries. We start by taking active responsibility for quality and stability through continuous regression testing. We test our deployments one million times if that's what it takes to make a stable deployment. We improve with each delivery. We take pride in learning from our mistakes and automating tests to ensure they don't happen again. Then over time the trust will increase and the teams will start cooperating more and more. <br />
<br />
The development organization is in charge of the full delivery process up to the wall of confusion. So make it the best delivery possible and take pride in delivering high quality out of the development silo. For each successful delivery you bring the wall down one brick at a time.<br />
<br />
Also remember that we are talking about continuous delivery, not deployment. It's super important to never speak about continuous deployment, because it scares the living crap out of the ops team when in a standoff situation. Always having a deliverable ready and tested at the wall, though, is going to be appreciated. Then the transition into production can happen with less confusion.<br />
<br />
I have to confess I'm one of the developers who hates to support an application in production. I fear being on call, and once an application is in production I want to change assignment. The reason is how legacy non-DevOps organizations go about developer support. Developers have zero trust, so we can't access logs, databases or anything in production. So each time a developer needs to help out with a production issue, it's with tied hands and a blindfold. It ends up becoming a hostage situation where the developer is held hostage. I love to trace down bugs, solve issues and improve stuff, but to be able to do that I need my eyes, my brain and my hands.<br />
<br />
We can take charge of this situation as well and stop being victims. We can drastically improve our situation by building monitoring and metrics into our application and verifying them as part of our continuous regression testing. This way we build tools that are gonna be available in production and that operations are gonna require anyway. Usually these are added late, as low-priority supplemental requirements from operations. By being proactive we can build this into our architecture and test process and use it throughout our entire delivery process. This way we build more useful monitoring tools that we understand much better. In return we aren't as blind and handicapped when helping out with production issues. Once again we take active steps towards cooperation and trust between organizations while making our own lives better.<br />
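To make the idea of "building metrics into the application and verifying them in regression tests" concrete, here is a minimal sketch of an in-process counter registry. This is illustrative only: the class and metric names are hypothetical, not from the original project, and a real setup would more likely use a library such as JMX MBeans or a metrics framework.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-process metrics registry; names are illustrative. */
public class Metrics {
    private static final Map<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

    /** Application code calls this at interesting points (requests, errors, deploys). */
    public static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
    }

    /** Regression tests and ops dashboards read the same counters back. */
    public static long value(String name) {
        AtomicLong c = COUNTERS.get(name);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) {
        increment("orders.received");
        increment("orders.received");
        increment("orders.failed");
        // A regression test can now assert on the counters it expects to move.
        System.out.println("orders.received=" + value("orders.received")); // 2
    }
}
```

The point of the sketch is that the same counters are exercised by the automated tests in every pipe run, so by the time the application reaches production the monitoring is already trusted and understood by the developers.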
<br />
DevOps makes continuous delivery easier. But continuous delivery can be how we drop the guns, tear down the wall of confusion in a legacy organization and move towards DevOps. Ultimately they should both exist in an organization, and I think they will both become as common as agile, even in large old organizations. <br />
<br />
However, until we are there, I think that continuous delivery is a fantastic tool to enable change in an organization suffering from a deadlock. It requires courage, vision, ambition and patience, but all the tools are there for us to start making that change today! Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com2tag:blogger.com,1999:blog-5682735369888209586.post-2272652959834916592013-01-01T17:37:00.000+01:002013-01-01T17:37:40.212+01:00Oops, our Continuous Delivery process became mission criticalAt some point something changed with our Continuous Delivery process: it became mission critical. When we started working on the process it was basically a side project that another Tomas and I had. We added a consultant early in the project and he ended up doing some of the work on the first version of our deployment scripts, but it wasn't anything organized and not part of any process or tools team.<br />
<div>
<br /></div>
<div>
When we increased the number of developers and started seeing issues with <a href="http://continuous-delivery-and-more.blogspot.se/2012/12/test-stability.html">stability</a> and <a href="http://continuous-delivery-and-more.blogspot.se/2012/12/process-scaleability.html">scalability</a> we also started to realize that our process had become mission critical. In fact, our continuous delivery process had become more important to us than our mail system.</div>
<div>
<br /></div>
<div>
Now we had a mission critical hobby project with the following setup.</div>
<div>
<ul>
<li>No official Owner.</li>
<li>No official Developers.</li>
<li>No official Operations professionals involved.</li>
<ul>
<li>Operations only supporting the OS of the Jenkins and Nexus instances.</li>
</ul>
<li>One "live" instance of Jenkins on a super small virtual node. </li>
<ul>
<li>All development done on live instance.</li>
</ul>
<li>One "live" instance of Nexus with a very small disk.</li>
<ul>
<li>All development done on live instance.</li>
</ul>
<li>Small number of test servers, virtual but not cloud nodes.</li>
</ul>
Having about 30 developers really depending on a process that is set up like this is obviously a no-go.</div>
<div>
<br /></div>
<div>
We started to figure out that we needed to put more effort into it when we were about to do our first <a href="http://continuous-delivery-and-more.blogspot.se/2012/12/deploy-scripts-how-hard-can-it-be_12.html">rewrite of our deploy scripts</a>. Still, we didn't think in terms of a mission critical production system. We needed a resource, and I kept insisting we needed a CM; more on that in an upcoming post. We had architecture and test working together, building the application around the process. But we needed some more hands building the deploy scripts, and also someone who could help us with the complexity of our system configuration. As I wrote in the entry on deploy scripts, this didn't work out well at all. Mostly because the CM ended up working alone in a corner of the organization, but also because he didn't share our vision of continuous delivery. Between all the discussions trying to get us to implement branching strategies, he was writing deployment scripts without any JBoss or DB competence. Obviously this didn't work out all that well, and it was during this script rewrite that we started to realize that our process was mission critical. The new deploy scripts were very unstable and, as mentioned, our tests had stability issues.<br />
<br />
Now we started realizing that we had a mission critical system on our hands and we needed to start treating it as such. Still, this was a bit of an unknown entity in our landscape: operations only supports our office IT and our customer deliveries, while development supports tooling. While this for sure falls into the tooling department, the development organization isn't equipped to support a mission critical system. Still we had to do something about it, so this was when we created our tools team. We refer to it as a platform team, as it was intended to own certain components such as logging, help desk, etc. But the main focus was to be continuous delivery. Our lacking development environment was another area of responsibility that we moved to this team; more on that as well in another entry.<br />
<br />
The team consisted of our CM, application DBA, a newly added senior Java developer and myself as architect/lead. It was obvious from the onset how effective it is when you have resources (with the full range of competence) that can focus on the process. This made us much more responsive to bugs in the process and faster in implementing changes.<br />
<br />
We still to this date have not solved all the infrastructure issues, but most of them are being worked on by the tools team and a new resource in our operations department who is responsible for our tooling servers. We still don't have a Jenkins test environment, and the operations responsibility for Jenkins and Nexus isn't really well defined. But we have resources dedicated to the process, and when something isn't working we handle it as a bug.<br />
<br />
The biggest lesson is that it's really important to get dedicated resources from dev and ops early. Getting two 50% resources is better than one full-time resource, as one isolated resource is a huge bottleneck and has a hard time prioritizing his work. Also make sure to have a bug/enhancement process in place early. Priorities should be made based on user experience, same as with any system in production. And as soon as the process is in use by the developers you need a test environment for Jenkins (or whatever build server you use to drive the process), as it's a production system after all.<br />
<br />
I think the reason we got a bit blindsided by the process becoming mission critical is that we hadn't had anything similar in our landscape before. There is actually one thing that has grown mission critical at about the same rate, hand in hand with our CD process, and that's our JIRA server. In fact we have an even bigger dependency on our JIRA: if it goes down our developers have no clue what to work on and get stranded very quickly. For us this is a new type of mission critical system. Previously these have only been supporting systems.<br />
<br />
Another reason is that the continuous delivery community talks about how easy it is to get started and how we can just take small baby steps from our nightly-build CI. That is both true and the way to go. I just guess I wasn't reading the fine print which says "and then it becomes mission critical". </div>
<div>
<br /></div>
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com0tag:blogger.com,1999:blog-5682735369888209586.post-76118938322436903182012-12-26T18:19:00.002+01:002012-12-26T18:34:52.285+01:00Process ScalabilityWhen we started working on our continuous delivery process our team was very small: three devs at two sites in different time zones. During the first six months we added two or three developers. So we were quite small for quite some time. <br />
<br />
Then we grew very quickly to our current size of about thirty developers and eight or so testers. We grew the team in about six months. Obviously this caused huge issues for us with getting everyone up to speed. It exposed all the flaws we had in setting up and handling our dev environment. But not only that, it also exposed issues with the scalability of our continuous delivery process.<br />
<br />
With the increased number of developers the number of code commits increased. Since we test everything on every code commit, our process started stacking test jobs. For each test type we had a dedicated server, so each deploy and the following test jobs had to be synchronized, resulting in a single-threaded process. This didn't bother us when we were just a few code committers, but when we grew it became a huge issue.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigc9QxvCW99Em8QmU6PMLZ7nNnqhcqG4429UxYGcSzP12ZfOCt_2zHsrXteGYlyJ2fPIwGXq4e2wCFI8Q-gUF1v6s4Hhve5UN4HmYcXw5F3FmN53MUc1e2HmTMBs0SGjm2fyUF1TvCTqf3/s640/blogger-image-278784758.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigc9QxvCW99Em8QmU6PMLZ7nNnqhcqG4429UxYGcSzP12ZfOCt_2zHsrXteGYlyJ2fPIwGXq4e2wCFI8Q-gUF1v6s4Hhve5UN4HmYcXw5F3FmN53MUc1e2HmTMBs0SGjm2fyUF1TvCTqf3/s320/blogger-image-278784758.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Dedicated test server being the bottleneck</td></tr></tbody></table>The biggest issue we had was that the devs didn't know when to take responsibility. If the process scales, then the time it takes for a commit to go through the pipe is identical regardless of how many commits were made simultaneously. The time of our pipe was about 25-30 min. A bit long, but durable IF it were the same for each commit. But since the process didn't scale, the time for a developer's check-in to go through was X*25 min, where X = the number of concurrent commits.<br />
<br />
This was particularly bad in the afternoon, when developers wanted to check in before leaving. Sometimes a check-in could take up to two or three hours to go through, and obviously devs wouldn't wait it out before leaving. So we almost always started the day with a broken pipe that needed fixing. Worse yet, our colleagues in other time zones always had broken pipes during their day, and they usually lacked the competence to fix the pipe. <br />
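The X*25 arithmetic above can be made concrete with a tiny model of a single-threaded pipe. This is just an illustration of the queueing effect; the method name and numbers are mine, not from the real pipeline.

```java
/** Models the feedback time for a commit in a single-threaded pipe. */
public class PipeWait {
    /**
     * A commit must wait for every commit queued ahead of it to run
     * through the whole pipe, then run through the pipe itself.
     */
    static int feedbackMinutes(int pipeMinutes, int commitsAhead) {
        return pipeMinutes * (commitsAhead + 1);
    }

    public static void main(String[] args) {
        // One commit alone gets feedback in 25 min.
        System.out.println(feedbackMinutes(25, 0)); // 25
        // With three commits queued ahead, the same pipe takes 100 min.
        System.out.println(feedbackMinutes(25, 3)); // 100
    }
}
```

In a pipe that scales, `commitsAhead` is effectively always zero, which is exactly what makes feedback predictable enough for developers to wait for it.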
<br />
Since the hardest thing with continuous delivery is training developers to take responsibility, it's key that it's easy to take responsibility. Visibility and feedback are very important factors, but it's also important to know WHEN to take responsibility. <br />
<br />
The solution was obviously to start working with non-dedicated test servers, though this was easier said than done. If we had had cloud nodes this would have been a walk in the park to solve: spawning a new node for each assembly, and hence having a dedicated test node per assembly, would scale very well. But our world isn't that easy. We don't use any cloud architecture. Our legacy organization isn't a very fast adopter of new infrastructure. This is quite common for most large old organizations and something we need to work around.<br />
<br />
Our solution was to take all the test servers we had, put them into a pool of servers and assign them to testing of one assembly at a time. <br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF4YUv6dk6Wq_lN0EnkgqTlhQZKugBDYHC3GluntR5Ow4oKYtnuSLln5zm0glBjpVhTInv8ASuxQXHM3ryNJqckjR9j2tTW8wXKgas1m8Vqc2nrMKYxW08FmVVqca_RrBlu7VAPCKTjGD8/s640/blogger-image--1588510384.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF4YUv6dk6Wq_lN0EnkgqTlhQZKugBDYHC3GluntR5Ow4oKYtnuSLln5zm0glBjpVhTInv8ASuxQXHM3ryNJqckjR9j2tTW8wXKgas1m8Vqc2nrMKYxW08FmVVqca_RrBlu7VAPCKTjGD8/s640/blogger-image--1588510384.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pipe 1 has to finish before any other thread can use<br />
that pooled server instance.</td></tr></tbody></table>This solves scaling but creates another problem: we need to return servers to the pool. With cloud nodes you just destroy them when done and never reuse. Since we do reuse, we need to make sure that once a deploy starts on a pooled server, all the test jobs get to finish before the next deploy starts. <br />
<br />
We were quite uncertain how we wanted to approach the pooling. Did we really want to build some sort of pool manager of our own? We really, really didn't, because we felt that there had to be some kind of tool that already does this.<br />
<br />
Then it hit us: could we do this with Jenkins slaves? Could our pool of test servers be Jenkins slaves? Yes they could! Our deploy jobs would just do a localhost deploy, and our test jobs would target localhost instead of the IP of a test server. <br />
<br />
The hard part was to figure out how to keep a pipe on the same slave and not have another pipe hijack that slave between jobs. But we finally managed to find a setup that worked for us, where an entire pipe is executed on the same slave and Jenkins blocks that slave for the duration of the pipe.<br />
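The semantics we needed from the pool can be sketched in a few lines: a pipe acquires one server, runs its whole deploy-and-test sequence on it, and only then returns the server for the next pipe. This is not the actual Jenkins configuration (we got Jenkins to do the blocking for us); it is a hypothetical sketch of the behavior, with made-up server names, showing why a blocking queue captures it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of pooled-server semantics: one pipe holds one server for all its jobs. */
public class ServerPool {
    private final BlockingQueue<String> free;

    public ServerPool(String... servers) {
        free = new ArrayBlockingQueue<>(servers.length);
        for (String s : servers) free.add(s);
    }

    /** Blocks until a server is free, like a pipe waiting for an available slave. */
    public String acquire() throws InterruptedException {
        return free.take();
    }

    /** Called only after the deploy AND every test job on the server has finished. */
    public void release(String server) {
        free.add(server);
    }

    public static void main(String[] args) throws InterruptedException {
        ServerPool pool = new ServerPool("slave-1", "slave-2");
        String server = pool.acquire();   // whole pipe runs on this one slave
        System.out.println("pipe running on " + server);
        pool.release(server);             // only now can another pipe reuse it
    }
}
```

The Jenkins-slave solution gave us exactly this behavior for free: the slave executor is the "token", and blocking the slave for the pipe's duration is the acquire/release pair.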
<br />
As of writing this post we are just about to start re-configuring our jobs to set this up. Hopefully, when we have this fully implemented in two weeks or so, we will have a process that scales. For our developers this will be a huge improvement, as they will always get feedback within 25 min of hitting check-in.<br />
<br />
Tomas Rihahttp://www.blogger.com/profile/15943315461118379953noreply@blogger.com5