
Friday, February 27, 2015

Pipes as Code

Finally we have started to move away from having build pipes as a chain of Jenkins jobs. There has been a lot written on the subject that CI systems aren't well suited to implement CD processes. Let me first give a short recap on why, before I get into how we now deliver our Pipes as Code.

First of all, pipes in CI systems have bad portability. They are usually a chain of jobs set up either through a manual process or through some sort of automation based on an API provided by the CI system. The inherent problem here is that the pipe executes in the CI system. This means that it is very hard to test and develop a pipe using Continuous Delivery. Yes, we need to use Continuous Delivery when implementing our Continuous Delivery tooling, otherwise we will not be able to deliver our CD processes in a qualitative, rapid and reliable way.

Then there is the problem of the data that we collect during the pipe. By default the data in a CI system is stored in that CI system, often on disk on that instance of that CI server. Adding insult to injury, navigation of the build data is often tied to the current implementation of the build pipe. This means that a change to the build pipe means that we can no longer access the build data.

For a few years now we have been offloading all the build data into different types of storage depending on what type of data it is. Metadata around the build we store in a custom database. Logs go to our ELK stack, metrics to Graphite and reports to S3.
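
As an example of how small such offloading can be, here is a minimal sketch of pushing one build metric to Graphite over its plaintext protocol; the host, port and metric path are assumptions for illustration, not our actual setup.

// Sketch: send one build metric to Graphite's plaintext listener.
// Host, port and metric path below are illustrative assumptions.
def reportMetric(String path, long value,
                 String host = "graphite.example.com", int port = 2003) {
    def timestamp = System.currentTimeMillis().intdiv(1000)
    new Socket(host, port).withStreams { input, output ->
        output << "${path} ${value} ${timestamp}\n"
    }
}

reportMetric("pipes.myapp.build.duration_seconds", 123)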

Still we have had trouble delivering quality Pipes. Now that has changed.

We still use a CI server to trigger the Pipe. On the CI server we now have one job, "DoIt". The "DoIt" job executes the right build pipe for every application. Let's talk a bit about how we pick the pipe.

Each git repo contains a YML file that says how we should build that repo. That's more or less the only thing that has to be in the repo for us to start building it. We listen to all the Gerrit triggers and ignore all repos without the YML file.

The YML file is pretty much just:

pipe: application-pipe
jdk: JDK8
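
To give a feel for what the "DoIt" job does with this file, here is a minimal sketch; the file names, the location of the pipe definitions and the engine call are assumptions for illustration, not our actual job.

import org.yaml.snakeyaml.Yaml

// Minimal sketch of the "DoIt" dispatch, assuming SnakeYAML is on the classpath.
def yaml = new Yaml()
def repoFile = new File("pipeline.yml")             // hypothetical name for the repo's YML file
if (!repoFile.exists()) {
    return                                          // repos without the YML file are ignored
}
def config = yaml.load(repoFile.text)               // e.g. [pipe: 'application-pipe', jdk: 'JDK8']
def pipeDefinition = yaml.load(new File("pipes/${config.pipe}.yml").text)
println "Running pipe '${config.pipe}' on ${config.jdk}"
// new PipeEngine().run(pipeDefinition, config)     // hand-off to the engine (assumed API)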


We describe our build pipes in YML and implement our tasks in Groovy. Here is a simple definition. 

build: 
    first:
     - do: setup.Clean
     - do: setup.Init
    main:
     - do: build.Build
     - do: test.Test
    last:
     - do: log.ReportBuildStatus
last: 
    last:
     - do: notify.Email

Each task has a lifecycle of first, main, last. The first section is always executed and all of the "do"s in the first section are executed regardless of result. In the main section the "do"s are only executed if everything has gone well so far. The last section is always executed regardless of how things went.

The "do´s" are references to groovy classes with the first mandatory part of the package stripped. So there is a com.something.something.something.setup.Clean class.

A Context object is passed through all the execute methods of the "do"s. By setting context.mock=true the main executing process adds the suffix "Mock" to all "do"s. This allows us to unit test the build pipe in order to assert that all the steps that we expect to happen do happen in the correct order.
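
To make that concrete, here is a minimal, self-contained sketch of how suffix-based mock resolution can work; the class names and the loading mechanism are assumptions for illustration, not our actual implementation.

// A real "do" and its mock counterpart. In our real pipes these live in
// packages like com.something.something.something.setup; here they are
// top-level classes purely to keep the sketch runnable.
class Clean     { def execute(Map ctx) { ctx.executed << "Clean" } }
class CleanMock { def execute(Map ctx) { ctx.executed << "CleanMock" } }

// Resolve a "do" reference, appending the "Mock" suffix when the context asks for it.
def loadDo(String ref, Map context) {
    Class.forName(context.mock ? "${ref}Mock" : ref).newInstance()
}

// A unit test can now run the pipe with context.mock = true and assert that
// the expected steps happened in the expected order, without side effects.
def context = [mock: true, executed: []]
loadDo("Clean", context).execute(context)
assert context.executed == ["CleanMock"]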

When a lot of things start happening it's not really practical to have a build task be that verbose, especially since we have multiple pipes that share the same build task. So we can create a "build.yml" and a "notify.yml" which we then can include like this.

build: 
    ref: build
last: 
    last:
     - do: notify

So this is how our build pipes look, and we can unit test the pipe, the tasks and each "do" implementation.

Looking at a full pipe example we get something like this.

init:
    ref: init
build:
    parallel:
        build:
            ref: build.deployable
        provision:
            ref: provision.create-test-environment
deploy:
    ref: deploy.deploy-engine
test:
    functional-test:
        ref: test.functional-tests
    load-test:
        ref: test.load-tests
release:
    parallel:
        release:
            ref: release.publish-to-nexus
        bake:
            ref: bake.ami-with-packer
last:
    parallel:
        deprovision:
            ref: provision.destroy-test-environment
        end:
            ref: end

That's it!

This pipe builds, functional tests, load tests and publishes our artifacts, as well as baking images for our AWS environments. All the steps report to our metadata database, ELK, Graphite, S3 and Slack.

And of course we use our build pipes to build our build pipe tooling.

Continuous Delivery of Continuous Delivery through build Pipes as Code. High score on the buzzword bingo!

Monday, July 7, 2014

Continuous Deployment in the Cloud Part 2: The Pipeline Engine in 100 lines of code

As I talked about in my previous post in this series, we need to treat our Continuous Delivery process as a distributed system, and as part of that we need to move the Pipe out of Jenkins and into a first class citizen of its own. Aside from the fact that a CI tool is a very bad Continuous Delivery/Deploy orchestrator, I find the potential of executing my pipe from anywhere in the cloud very tempting.

If my pipe is a first class citizen of its own and executable on any plain old node, then I can execute it anywhere in the cloud. Which means I can execute it locally on my laptop, on a simple minion in my own cloud, or in one of all the managed CI services that have surfaced in the cloud.

To accomplish this we need five very basic and simple things:

  1. a pipeline configuration that defines what tasks to execute for that pipe
  2. a pipeline engine that executes tasks
  3. a library of task implementations
  4. a definition of what pipe to use with my artefact
  5. a way of distributing the pipeline engine to the node where I want to execute my pipeline
Let's have a look.

Define the pipeline

The pipeline is a relatively simple process that executes tasks. For the purpose of this blog series, and for simple small deliveries, sequential execution of tasks can be sufficient, but at my work we run a lot of parallel sub pipes to improve the throughput of the test parts. So our pipe should be able to handle both.

We also want to be able to run the pipe to a certain stage. Like from start to regression test, and then step by step launch in QA and launch in Prod. Obviously if we want to do continuous deployment we don't need to worry too much about that capability, but I include it just to cover a bit more scope.

Defining pipelines is no real rocket science and in most cases a few archetype pipelines will cover 90% of the pipes needed for a large scale delivery. So I do like to define a few flavours of pipes that give us the ability to distribute a base set of pipes, from CI to CD.

Once we have defined the base set of pipe flavours, each team should configure which pipe they want to handle their deliverables.

I define my pipes something like this.

name: Strawberry
pipe: 
  - do:
     - tasks:
       - name: Build
         type: mock.Dummy
  - do:
     - tasks:
       - name: Deploy A
         type: mock.Dummy
       - name: Test A
         type: mock.Dummy
     - tasks:
       - name: Deploy B
         type: mock.Dummy
       - name: Test B
         type: mock.Dummy
    parallel: true     
  - do:
     - tasks:
       - name: Publish
         type: mock.Dummy
  
A pipe named Strawberry, which builds our service, then deploys it in two parallel sub pipes where it executes two test suites, and finally publishes the artefacts in our artefact repo. At this stage each task is just executed with a Dummy task implementation.

The pipeline engine

We need a mechanism that understands our yml config and links it to our library of executable tasks.

I use Groovy to build my engine but it can just as easily be built in any language. I've intentionally stripped down some of the logging I do, but this is basically it. In about 80 lines of code we have an engine that loads tasks defined in a yml, executes them in serial or parallel, and has the capability to run all tasks, the tasks up to one point, or a single task.

import groovy.util.logging.Log
import java.util.concurrent.atomic.AtomicInteger

@Log
class BalthazarEngine {

    def int start(Map context){
        def status = 0
        def definition = context.get "balthazar-pipe" 
        
        for (def doIt:  definition["pipe"] ){
            status = executePipe(doIt, context)
        }
        return status
    }
    def int executePipe(Map doIt, Map context){
        def status = 0
        if (doIt.parallel == true){
            status = doItParallel(doIt,context)
        } else {
            status = doItSerial(doIt,context)
        }
        return status
    }
    
    def int doItSerial(def doIt, def context){
        def status = 0
        for (def tasks : doIt.do.tasks){
            status = executeTasks(tasks, context)
        }
        return status
    }
    
    def int doItParallel(def doIt, def context){
        def status = new AtomicInteger()
        def threads = []
        for (def tasks : doIt.do.tasks){
            def cloneContext = deepcopy(context)
            def cloneTasks = deepcopy(tasks)
            //collect every thread so that all parallel sub pipes are awaited
            threads << Thread.start {
                def taskStatus = executeTasks(cloneTasks, cloneContext)
                //remember the first failing status from any sub pipe
                if (taskStatus != 0) {
                    status.compareAndSet(0, taskStatus)
                }
            }
        }
        threads*.join()
        return status.get()
    }
    
    def int executeTasks(def tasks, def context){
        def status = 0
        for (def task : tasks){
            //execute if the run-task is not specified or if run-task equals this task
            if (!context["run-task"] || context["run-task"] == task.name){
                log.info "execute ${task.name}"
                context["this.task"] = task
                def impl = loadInstanceForTask task;
                status = impl.execute context 
            }
            
            if (status != BalthazarTask.TASK_SUCCESS){
                break
            }
            
            if (context["run-to"] == task.name){
                log.info "Executed ${context["run-to"]} which is the last task, done executing."
                break
            }
        }
        return status
    }
    def loadInstanceForTask(def task){
        def className = "balthazar.tasks.${task.type}"
        def forName = Class.forName className
        return forName.newInstance()
    }
    def deepcopy(orig) {
        def bos = new ByteArrayOutputStream()
        def oos = new ObjectOutputStream(bos)
        oos.writeObject(orig); oos.flush()
        def bin = new ByteArrayInputStream(bos.toByteArray())
        def ois = new ObjectInputStream(bin)
        return ois.readObject()
    }
}

A common question I tend to get is "why not implement it as a lifecycle in Maven or Gradle?". Well, I want a process that can support building in Maven, Gradle or any other tool for any other language. Also, as soon as we use another tool to do our job (be it a build tool, a CI server or whatever) we need to adapt to its definition of how it executes its processes. Maven has its lifecycle stages quite rigidly defined and I find it a pita to redefine them. Jenkins has its pre, build and post stages where it's a pita to share variables. And so on. But most importantly: use build tools for what they do well and CI tools for what they do well, and none of that is implementing CD pipes.

Task library.

We need tasks for our pipe engine to execute. The interface for a task is simple.

public interface BalthazarTask {
    int execute(Map<String, Object> context);
}

Then we just implement them. For my purpose I package tasks in "balthazar.tasks.<type>.<task>" and just define the type and task in my yml.
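
For instance, the mock.Dummy task used in the Strawberry pipe above could be as small as this; a sketch of what such a placeholder might look like, assuming TASK_SUCCESS is declared on the interface (the engine references it as BalthazarTask.TASK_SUCCESS).

package balthazar.tasks.mock

// A minimal placeholder task, a sketch of what mock.Dummy might look like.
// It just reports success so a pipe definition can be wired up and unit
// tested before the real task implementations exist.
class Dummy implements BalthazarTask {
    @Override
    int execute(Map<String, Object> context) {
        println "Dummy task '${context['this.task']?.name}' executed"
        return TASK_SUCCESS
    }
}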

Writing tasks in a custom framework, compared to jobs in Jenkins, is a joy. You no longer need to work around tooling limitations for simple things such as setting variables during execution.

Anything you want to share you just put it on the context.

Here is an example of how two tasks share data.

  - tasks: 
    - name: Initiate Pipe
      type: init.Cerebro
  - tasks: 
    - name: Build
      type: build.Gradle
      command: gradle clean fatJar

I have two tasks. The first task creates a new version of the artefact we are building in my master data repository, which I call Cerebro (more on Cerebro in the next post). Cerebro is the master of all my build things and hence my version numbers come from there. So the init.Cerebro task takes the version from Cerebro and puts it on the context.

@Log
class Cerebro implements BalthazarTask {
    @Override
    def int execute(Map<String, Object> context){
        def affiliation = context.get("cerebro-affiliation")
        def hero = context.get("cerebro-hero")
        def key = System.env["CEREBRO_KEY"]
        def reincarnation =  CerebroClient.createNewHeroReincarnation(affiliation, key, hero)
        context.put("cerebro-reincarnation",reincarnation)
        return TASK_SUCCESS
    }
}


My build.Gradle task takes the version number from Cerebro (called reincarnation) and sends it to the build script. As you can see I can use custom commands, and in this case I do, as fat JARs are what I build. By default the task does gradle build. I can also define what log level I want my Gradle script to run with.

@Log
class Gradle implements BalthazarTask {
    @Override
    def int execute(Map<String, Object> context){
        def affiliation = context["cerebro-affiliation"]
        def hero = context["cerebro-hero"]
        def reincarnation = context["cerebro-reincarnation"]
        def command = context["this.task"]["command"] == null ? "gradle build": context["this.task"]["command"]
        def loglevel = context["this.task"]["loglevel"] == null ? "" : "--${context["this.task"]["loglevel"]}"
        
        def gradleCommand = """${command} ${loglevel} -Dcerebro-affiliation=${affiliation} -Dcerebro-hero=${hero} -Dcerebro-reincarnation=${reincarnation}"""

        def proc = gradleCommand.execute()                 
        proc.waitFor()                               
        return proc.exitValue()
    }
}

This is how hard it is to build tasks (jobs) if it's done with code instead of configuring it in a CI tool. Sure, some tasks like building Amazon AMIs take a bit more code (j/k, they don't). But OK, a launch task that implements a rolling deploy on Amazon using an A/B release pattern does, but I will come back to that specific case.

Configure my repository

So I have a build pipe executor, pre built build pipes and tasks that execute in them. Now I need to configure my repository.

In my experience 90% of your teams will be able to use prefab pipes without investing too much effort into building tons of prefabs. A few CI pipes, a few simple CD pipes and a few parallelized pipes should cover a lot of demand, if you are good enough at putting an interface between the deploy tasks and the deploy tools, as well as between the test tasks and the test tools.

So in my repo I have a .balthazar.yml which contains.

balthazar-pipe: Strawberry

Distributing the pipeline engine and the task library

First thing we need is a balthazar client that starts the engine using the configuration provided inside my repository. A simple Groovy script does the trick.

import groovy.util.logging.Log
import org.yaml.snakeyaml.Yaml

@Log
class BalthazarRunner {
    def int start(Map<String, Object> context){
        Yaml yaml = new Yaml()
        if (!context){
            def projectfile = new File(".balthazar.yml")            
            if (projectfile.exists()){
                context = yaml.load projectfile.text
            } else {
                throw new Exception("No .balthazar.yml in project")
            }
            
        }
        def name = context.get "balthazar-pipe" 
        def definition = yaml.load this.getClass().getResource("/processes/${name}.yml").text 
        BalthazarEngine engine = new BalthazarEngine()
        
        context["balthazar-pipe"] =  definition
        context["run-to"] = System.properties["run-to"]
        context["run-task"] = System.properties["run-task"]
        return engine.start(context)
    }
}
def runner = new BalthazarRunner()
runner.start([:])

Now we need to distribute the client, our engine and our library of tasks to the node where we want to execute the pipeline with our code repository. This can be done in many ways.

We can package balthazar as an installable package and install it using yum or a similar tool. This works quite well on build servers but it does limit us a bit on where we can run it, as we need "that installer" to be installed on the target environment. In many cases it really isn't a problem, because if you're a Debian shop then you have your deb everywhere and if you're a Red Hat shop then you have your yum.

I personally opted for another way of distributing the client, partially because I'm lazy and partially because it works in a lot of environments. When I make my .balthazar.yml I also check out the balthazar client project as a git submodule.

So all my projects have a .balthazar.yml and a balthazar-client folder. In my client folder I have a balthazar.sh and a build.gradle file. I use Gradle to fetch the latest artefacts from my repo and then the shell script does the java -jar part. Not all that pretty but it works.
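
To give an idea of what that client folder does, here is a minimal sketch of such a build.gradle; the repository URL and the artefact coordinates are assumptions, not our actual setup.

// Sketch of the client's build.gradle, assuming the engine and task library
// are published as a runnable fat jar under the coordinates below; balthazar.sh
// then just runs "java -jar" on whatever lands in build/balthazar.
apply plugin: 'base'

repositories {
    maven { url 'https://nexus.example.com/repository/releases' }   // assumed repo URL
}

configurations {
    balthazar
}

dependencies {
    balthazar 'com.example:balthazar-engine:+'                      // assumed coordinates, latest version
}

task fetchBalthazar(type: Copy) {
    from configurations.balthazar
    into "$buildDir/balthazar"
}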

Summary

So now on all my repos I can just do...

> ./balthazar-client/balthazar.sh

... and I run the pipe configured in .balthazar.yml on that repo. Since all my tasks integrate with my Build Data Repository I get ONE view of all my pipe executions, regardless of where they were executed.

CD made fun! Cheers!

Tuesday, April 22, 2014

Environment Portability

I've talked about this a lot before and we have done a lot of work in this area, but it cannot be stressed enough how important it is. In fact I think portability is the key success factor in building a good Continuous Delivery implementation. Without portability there is no scalability.

There are two things that need to be built with portability in mind: the build pipe itself, and our dev, test and prod environments. If the build pipe isn't portable then it will become a bottleneck. In fact, if the build pipe can run on each individual developer machine without the use of a build server, then it is portable enough to scale in a build server environment.

Though in this post I will focus on Environment Portability from Desktop to Production.

Why do we need Portable Production Environments?

For years we have accepted the fact that the Test Environment isn't really like the Prod Environment. We know that there are differences but we live with it, because it's too hard and too expensive to do anything about it. When doing Continuous Delivery it becomes very hard to live with and accept.

"If its hard do it more often" applies here as well.

The type of problems we run into as a result of non-portable Production Environments are problems that are hard by nature: scalability, clustering, failover, etc. These are non-functional requirements that need to be designed in early. By exposing the complexity of the production environment late in the pipe we create a very long feedback loop for the developers. It can be days or weeks depending on how often you hit prod with your deployments.


By increasing the portability of the production environment we increase the productivity of our developers and the quality of our application. This is obvious, but there is another very important issue to deal with as well. Every time a deployment in UAT or Prod fails it undermines the credibility of Continuous Delivery. Each time something that worked in UAT fails in Prod, the person responsible for UAT will call for more manual tests before production, obviously increasing the lead time even further and making the problem worse. We need to constantly improve in order to manage fear.

If we have Portable Production Environments then the issues that stem from Environment Complexity will never hit production as they get caught much earlier in the pipe.

Who owns the definition of our Environments?

The different environments that we have in an organization: who defines them and who owns them? There are a lot of variations to this, as there are a lot of variations to environment definitions and organizations. In most ordinarily dysfunctional organizations out there the Ops team owns the Production environments and often the machines in the earlier environments. How the earlier environments are set up and used is often the responsibility of the Dev team, and if the organization is even more dysfunctional then there is often a Delivery Team involved which defines how its environments are used.

This presents us with the problem that developers quite often have no clue what the production environment looks like, and implement the system based on assumptions or ignorance. In the few cases where they actually have to implement something that has to do with scaling, clustering or failover it's often just guesswork, as they don't have a way to reproduce it.

Going into production this most often creates late requests for infrastructural changes, and often even cases where solutions were based on assumptions that cannot be realized in a production environment.

What is portability?

When we talk about Portable Production Environments, what do we really mean? Does it mean that we have to move all our development to online servers and all dev teams get their own servers that are identical to production but on a smaller scale? Well, not really. It is doable, especially with the use of cloud providers, but I find it highly inconvenient to force the developer to be online and on a corporate network in order to develop. Having the ability to create a server environment for a developer on a need-to-have basis is great, because it does cover for the gap where local environments cannot fully be portable.

Assuming one cloud provider will solve all your needs for portability across all environments is not realistic short term in most enterprise organizations, unless you are already fully committed and fully rolled out to a cloud provider. There are always these legacy environments that need to be dealt with.

I think the key is to have portability in the topology of the environment. An environment built in AWS with ELBs will never be fully portable since you will never have an ELB locally. But having a load balancer in your dev environment and having multiple application nodes forces you to build for horizontal scalability, and it will capture a whole lot more than just running on one local node.

Running an Oracle XE isn't really the same as running Enterprise Oracle, but it provides good enough portability. Firing it up on a VirtualBox of its own forces the DB away from localhost.

In our production environment we monitor our applications using things like Graphite and Logstash+Kibana. These tools should be part of the development environment as well, to increase the awareness of runtime and enable "design for runtime".

Creating the Development Environment described here with Vagrant is super easy. It can be built using other tools as well, such as Docker or a combo of Vagrant and Docker. But if you use Docker then it needs to be used all the way. My example here with just Vagrant and VirtualBox is to show that portability can be achieved without introducing new elements to your production environment.

Individual Environment Specifications and One Specification to rule them all.

To create a portable topology we need one way to describe our environments and then a way to scale them. A common, human and machine readable Topology Specification that defines what clusters, stores, gateways and integrations there are in the topology of our production environment gives us the ability to share the definition of our environments. Then an environment-specific specification defines the scale of the environment and any possible mocks for integrations in development and test environments.

In an enterprise organisation we will always have our legacy environments. Over time these might migrate over to a cloud solution, but some will most likely live out their existence in a legacy environment. For these solutions we still benefit hugely from portable production environments and one way to define them. In a cloud environment we can recreate their topology and leverage the benefits of portability, even if we cannot really benefit from the cloud in the production environment itself.

For these solutions the Topology Specification and Environment Specification can be used to generate documentation; a little bit of Graphviz can do wonders here (see the small sketch after the Topology Specification example below). This documentation can be used for change request contracts and for documentation purposes.


We like Groovy scripts for our Infrastructure as Code. Here is an example of how the above example with the cluster and the Oracle database could be defined using the Topology Specification. This example covers just clusters, storages and network rules, but more can be added, such as HTTP gateway fronts as 'gateways' and pure network integrations with partners as 'integrations'.


def topologySpec = [name:'SomeService'
   ,clusters:[[name:'SomeServiceCluster'
           ,image_prefix:'some-service'
           ,cluster_port:80
           ,node_port:8080
           ,health_check_uri:'/ping/me'
           ,networks:[[
                   name:'SomeServiceInternal'
                   ,allow_inbound:[
                       [from:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
                   ],allow_outbound:[
                        [to:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[to:'OracleInternal',ports:'1521',protocol:'TCP']
                   ]
               ]
           ]
       ]
   ],storages:[[name:'Rdbms'
           ,type:'oracle'
           ,network:[
               name:'OracleInternal'
               ,allow:[
                   [from:'SomeServiceInternal',ports:'1521',protocol:'TCP']
                   ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
               ]
           ]
       ]
   ]
]
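
As mentioned above, these specifications can feed documentation. Here is a minimal sketch of rendering a Topology Specification like the one above to a Graphviz dot file; the shapes and rendering choices are illustrative assumptions, not our actual generator.

// Sketch: render the topology spec above as a Graphviz dot graph.
// Shapes and edge labels are illustrative choices, not our real generator.
def toDot(Map spec) {
    def sb = new StringBuilder("digraph \"${spec.name}\" {\n")
    spec.clusters.each { cluster ->
        sb << "  \"${cluster.name}\" [shape=box]\n"
    }
    spec.storages.each { storage ->
        sb << "  \"${storage.name}\" [shape=ellipse]\n"
        storage.network.allow.each { rule ->
            sb << "  \"${rule.from}\" -> \"${storage.name}\" [label=\"${rule.ports}/${rule.protocol}\"]\n"
        }
    }
    sb << "}\n"
    return sb.toString()
}

new File("SomeService-topology.dot").text = toDot(topologySpec)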



Environment Specifications contain scaling of the Topology but can also contain integration Mocks as 'clusters' defined just for that environment.

def devEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:2
           ,node_size:nodeSize.SMALL
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.SMALL
       ]
   ]
]


def prodEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:3
           ,node_size:nodeSize.MEDIUM
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.LARGE
       ]
   ]
]


Then the definitions are pushed to the Provisioner implementation with the input argument of which environment to Provision.

def envSpecs = ['DEV':devEnvSpec
              ,'PROD':prodEnvSpec
              ,'LEGACY':prodEnvSpec]


//args[0] is env name 'DEV', 'PROD' or 'LEGACY'
//Provisioner picks the right implementation (Vagrant, AWS or PDF) for the right environment

new Provisioner().provision(topologySpec, envSpecs, args)


Test Environments should be non persistent environments that are provisioned and decommissioned when the test execution is finished.

Development environments should be provisioned in the morning and decommissioned at the end of the day. This also solves the issue of building the Dev Environment which can be a tedious manual process in many organisations.

Production environments on the other hand need to support provision, update and decommission, as it's not always convenient to build a new environment for each topological change.
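
To make that split explicit, the provisioner can expose the three operations separately. A hypothetical sketch of such a contract; method names and parameters are assumptions, not the actual implementation.

// Hypothetical contract for an environment provisioner supporting the three
// operations discussed above; method names and parameters are assumptions.
interface EnvironmentProvisioner {
    // build the environment from scratch (dev and test environments)
    void provision(Map topologySpec, Map envSpec)
    // apply topology changes to a running environment (production)
    void update(Map topologySpec, Map envSpec)
    // tear the environment down when it is no longer needed
    void decommission(Map envSpec)
}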

Also understand that provisioning an environment is not the same as deploying the application. There can be many deployments into a provisioned environment. The Topology Specification doesn't specify what version of the application is deployed, just what the base name of the artefact is. I find that convenient, as it can be used to identify which image should be used to build the cluster.

One Specification, One Team Ownership

The Topology Specification should be owned by one team, the team that is responsible for developing and putting the system into runtime. Yes, I do assume that some sort of DevOps-like organisation is in place at this stage. If it isn't, then I would say that the specification should be owned by the Dev team and the generated documentation should be the contract between Dev and Ops. Consolidating ownership of as many environments as possible into one team should be the aim.

Summary

I think using these mechanisms to provision environments in a Continuous Delivery pipe will increase the quality of the software that goes through the pipe immensely. Not only will feedback be faster but we will also be able to start tailoring environments for specific test scenarios. The possibilities for quality increase are enormous.

Monday, March 17, 2014

Portability

I've talked about Portability of the CD process before, but it continuously becomes more and more evident to us how important it is. The closer the CD process comes to the developer, the higher the understanding of the process. Our increase in portability has gone through stages.

Initially we deployed locally in a way that was totally different from the way we deployed in the continuous delivery process. Our desktop development environments were not part of our CD process at all. Our deploy scripts handle stopping and starting of servers, moving artifacts on the server, linking directories and running Liquibase to upgrade/migrate the database. We did all this manually on the local environments. We ran Liquibase, but we ran it using the Maven plugin (which we don't do in our deploy scripts, there we run it using java -jar). We moved artifacts by hand or by other scripts.

Then we created a local bootstrap script which executed the CD process deploy scripts on a local environment. We built environment specific support into the local bootstrap so that we supported Linux and Windows. Though in order to start JBoss and Mule we needed to add support for the local environment in the CD process deploy script as well. We moved closer to portability but we diluted our code and increased our complexity. Still, this was an improvement, but the process was still not truly portable.

In recent times we have decided to shift our packaging of artifacts from zip files to RPMs. All our prod and test environments are Red Hat so the dependency on technology is not really an issue for us here. What this gives us is the ability to manage dependencies between artifacts and infrastructure in a nice way. The war file depends on a JBoss version which depends on a Java version, and all are installed when needed. This also finally gives us a clear separation between install and deploy. The yum installer installs files on the server; our deploy application brings the runtime online, configures it and moves the artifacts into runtime.

In order for us to maintain portability to the development environment, this finally forced us to go all in and make the decision "development is done in a Linux environment". We won't be moving to Linux clients, but our local deploy target will be a virtual Linux box. This finally puts everything into place for us, creating a fully portable model. It's important to understand that we still don't have a cloud environment in our company.


This image, created by my colleague Mikael, is a great visualization of how much portability we can build into our environment now and when we get a cloud. By defining a Portability level and its interface we manage to build a mini cloud on each Jenkins slave and on a local dev machine, using the exact same process as we would for a QA or test deploy. The nodes above the Portability level can be local on the workstation/Jenkins slave or remote in a Prod Environment. The process is the same regardless of environment: Provision, Install and Deploy.


Sunday, February 24, 2013

So it took a year.

When we first started building our continuous delivery pipe I had no idea that the biggest challenges would be non-technical. Well, I did expect that we would run into a lot of dev vs ops related issues and that the rest would be just technical issues. I was so naive.

We seriously underestimated how continuous delivery changes the everyday work of each individual involved in the delivery of a software service. It affects everyone: Developer, Tester, PM, CM, DBA and Operations professionals. Really, it shouldn't be a big shocker since it changes the process of how we deliver software. So yes, everyone gets affected.

The transition for our developers took about a year. Just over a year ago we scaled up our development and added, give or take, 15-20 developers. All these developers have been of very high quality and very responsible individuals, though none of them had worked in a continuous delivery process before and all were more or less new to our business domain.

When introducing them, everyone got the rundown of the continuous delivery process, how it works, why we have it and that they need to make sure to check in quality code. So off you go, make code, check in tested stuff and if something still breaks you fix it. How hard can it be?

Much, much harder than we thought. As I said, all our developers are very responsible individuals. Still, it was a change for them. What was once considered responsible, like "if it compiles and the unit tests pass, check it in so that it doesn't get lost", leads to broken builds. Doing this before leaving early on Friday becomes a huge issue because others have to fix the build pipe. And it goes for a lot of things, like having to ensure that database scripts work all the time, everything with the database is versioned, rollbacks work, etc. So everyone has had to step up their game a notch or two.

Continuous delivery really forces the developer to test much more before he/she checks in the code. Even for the developers that like to work test driven with their JUnit tests this is a step up. For many it's a change of behavior, and changing a behavior that has become second nature doesn't happen overnight.

We had a few highly responsible developers that took on this change seamlessly. These individuals had to carry a huge load during this first year. When responsibility was dropped by one individual it was these who always ensured that the pipe was green. This has been the biggest source of frustration. I get angry, frustrated and mad when the lack of responsibility by one individual affects another individual. They get angry and frustrated as well, because they don't want to leave it in a bad state, and their responsibility prevents them from going home to their families. I'm so happy that we didn't lose any of these individuals during this period.

Now, after about a year, things have actually changed: everyone takes much more responsibility and fixing the build pipe is much more of a shared effort. Which is so nice. But why did it take such a long time? I'd really like to figure out if this transition could have been made smoother and faster.

Key reasons why it took so much time.

A change to behavior.
Developers need to test much more, not just now and then but all the time. No matter how much you talk about "test before check-in", "test", "test", "test", the day the feature pressure increases a developer will fall back on second nature behavior and check in what he/she believes is done. We can talk lean, kanban, queues, push and pull all we want, but the fact is there will always be situations of stress. It's not until a behavior change has become second nature that we do it under pressure.

Immature process.
Visibility, portability and scalability issues have made it hard to take responsibility. Knowing when, where and how to take responsibility is super important. Realizing that lack of responsibility is tied to these took us quite some time to figure out. If it's hard to debug a test case, it's going to take a lot of time to figure out why things are failing and it's going to require more senior developers to figure it out. It's also hard to be proactive with testing if the portability between development environment and test environment is bad.

A lot of new things at once.
When you tell a developer about a new system, domain and a new process, I'm quite sure the developer will always listen more to the system and domain specific talks.
The developer's head is full of "this system communicates with that system and it's that type of interface". Then I start going on about "Jira, bla bla bla, test bla, check in bla bla, Jenkins bla, deploy, bla, FitNesse, test bla, bla" and the developer goes "Yeah yeah yeah, I'll check in and it gets tested, I hear you, sweet!".

I definitely think it's much easier for a developer to make the transition if the process is more mature, has optimized feedback loops, scales and is portable. Honestly I think that easily takes 3-6 months off the learning curve. But it's still going to take a lot of time, in the range of months, if we don't become better at understanding behavioral changes.

Today we go straight from intro session (slides or whiteboard) to live scenario in one step. Here is the info, now go and use it. At least now we are becoming better at mentoring, so there is help to get so that you can be talked through the process, and the new developer is usually not working alone, which they were a year ago. Still, I don't think it's enough.

Continuous Delivery Training Dojos

I think we really need to start thinking about having training dojos where we learn the process from start to finish. I also think this is extremely important when transitioning to acceptance test driven development, but just for the reason of getting a feeling for the process: what is tested where and how, what happens when I change this and that, how should I test things before committing and what should be done in which order.

I think if we practiced this, and worked on how to break and unbreak the process in a non-live scenario, the transition would go much faster. In fact I don't think these dojos should be just to train new team members; they would also be an extremely effective way of sharing information and the consequences of process change over time.


Tuesday, January 15, 2013

Package power!

We often talk about pipe design and how to implement it in Jenkins or other CI tools, that everything should be versioned and that everything should be tested all the time. These things are very important, but something I didn't realize for quite some time was how important packaging is.

Our packaging was giving us problems.

Early on when building our continuous delivery pipe we were a bit worried about the number of artifacts we were spewing out of our pipe and the impact it would have on our Nexus repo. So we did release our war and jar files into our repo, but the final deliverable assembly we released was just a property file containing versions. These property files were used by our rudimentary bash deploy scripts. The scripts basically did a bunch of wgets to retrieve the artifacts from the Nexus repo before deploying them. Yeah, laugh away, I now know how dumb this was.

Our main problem due to this was that our scripts were very delivery specific. For delivery Y we had components A, B and C while for delivery Z we had components A, D and E. We couldn't reuse things well enough so we had duplicates of our scripts. Another issue we had was that there was no portability in this whatsoever. We didn't really make the connection between lack of packaging and our huge developer environment problems. Switching between working on delivery Y and Z was tedious because we were managing the local deployments in Eclipse with the JBoss plugin. It also required full understanding of what components needed to be deployed.

Manual tasks and a required high level of domain knowledge didn't make things easy for our new developers. In fact it also made life a pita for our architects, who develop fewer hours a week than the developers. For them the rotting of the development environment was a huge issue. Since all components were managed manually, all had to be updated, built and deployed.

Inspiration and goals.

When me and my colleague were at QCon NY (awesome conf that everyone should try to attend) we listened to talks by Netflix and Etsy. We were totally blown away by two things: Etsy's practice that a new developer should code and deploy a production change on the first day, and Netflix baking images instead of releasing wars and ears. These were two of the main things we brought back with us and two things that we keep revisiting as we iterate our process.

Since we don't do continuous deploy we set the goal that a new developer should be able to commit a change that is ready for delivery on the first day. The continuous delivery part of the goal wasn't the problem since we already had that in place; it's the most obvious part of that goal. The next obvious task for us was that we really had to do something about our dev env setup. Then with some thought we realized that this wasn't enough: we needed to do something about our entire onboarding process, with mentoring and the level of knowledge in the team. In order to mentor someone, a developer needs to have a good understanding of most tasks in Jira. At this stage this wasn't the case.

We made the knowledge increase our priority since this was biting us in many ways. I won't go much more into that. Then we tried to prioritize the setup of our developer environment, but doing something about our deploy scripts ended up being a higher priority. This was a very good and honestly lucky decision. We knew how to do our deploy script changes and our production deployments were really more important. But we were also not sure how to do our developer environment changes, so sleeping on it was what we decided, even though our devs were literally screaming in frustration.

Addressing the problems.

First thing we did when we started to rewrite our scripts was to sort out our packaging once and for all. We killed the property file and started using Maven for everything. We had already been using Maven to release all components and most configurations, but we were not using Maven to package our final deployables and we were not using it to release our deploy scripts. We had already been made very well aware that we had to tie our deploy scripts to our deployable assembly. We changed both these things. We started to release everything, not just version everything. This imho is a very important thing that's not mentioned enough. Blogs, articles and demos talk about versioning everything, but not so much about the importance of actually releasing everything and treating each release as an artifact, even if it's "just" an httpd.conf.

Once we started building these packages and setting our structure it was so clear how Netflix came to the conclusion that they should bake images. The package contains war files, config files, deploy scripts, Liquibase scripts, custom JBoss control scripts, httpd.conf, etc. The more we package and the more servers we get in our park, the more things we notice that we need to put into the package. Then it becomes even more obvious, since we take this package and transfer it to tons of servers for different test purposes. Once at the server we run our deploy scripts that copy and link stuff into place on the server. Remind me, why are we doing this over and over? Wouldn't it be better to just do this once, make an image out of it and mount this image on different nodes? Of course it would be, Netflix know what they are talking about! Most importantly it would bring the final missing pieces into the package: JBoss, Java and Linux distributions. Giving us the power to actually roll out and test even OS patches through the same process as any other change. We aren't there yet, but the path is obvious and it's nice to feel that what was once an overwhelming w000t is now a definite possibility.

So through a good packaging strategy we managed to improve and solve our deploy script problems. We now had one script to distribute and deploy them all! This also resulted in much fewer changes to the deploy scripts, which in turn made them more stable. A lot of changes that previously required changes to deployment scripts now just require a change to the packaging, which makes the entire deployment process much more robust.

Portability!

Still, we hadn't solved our issues with our developer environments. I had had the hunch for some time that our packaging could help us, but it took us some time before we realized that we had actually created an almost fully portable deployment solution. Our increased Maven usage had made us so portable that we could just write a simple script that combined the essence of the assembly job and the deployment job of our Jenkins pipe into a local dev env script. By adding "snapshots true" to our Maven version properties update we allowed our assemblies to be built including snapshots. Then we could just use our deploy scripts and voila, our local JBosses and Mule ESBs were deployed with artifacts containing our code changes and, most importantly, our rebel.xmls, giving us full JRebel power with our production deploy scripts.

Our packaging strategy had made our continuous delivery process portable to our development environment, allowing us to use the same assemble+deploy from local dev env to prod. Our developers now just need to know what assembly to deploy, and they don't need to rebuild all included components, just the ones they are currently working with; the others are added by Maven from the Nexus repo. So now our developers can quickly and easily switch between single component deploys and full deliveries.

Getting closer to our goals.

By adding JBoss & Mule installations to the script we further simplified the setup process for the new developers. We still have a few things we want to add to the script, such as IDE install and initial source code checkout, in order to simplify things further, but it will have to rest a bit since we have other, higher priorities. Still, we have taken huge steps towards our Etsy-inspired goal of having new developers commit a code change on the first day.

It feels like all these levels of improvement have been unlocked by a good packaging strategy!

If there's one thing I would change about the way we went about our implementation, it's the packaging. It's easy to say in hindsight, but I'd really try to do it properly off the bat.