Continuous Delivery: Stability

Showing posts with label Stability. Show all posts

Friday, February 27, 2015

Pipes as Code

Finally we have started to move away from having build pipes as a chain of Jenkins jobs. There has been alot written of the subject that CI systems arnt well suited to implement CD processes. Let me first give a short recap on why before I get into how we now delivery our Pipes as Code.

First of all pipes in CI systems have bad portability. They are usually a chain of jobs set up through either a manual process or through some sort of automation based on a api provided by the CI system. The inherrited problem here is that the pipe executes in the CI system. This means that it is very hard to test and develope a pipe using Continuous Delivery. Yes we need to use Continuous Delivery when implementing our Continuous Delivery tooling otherwise we will nto be able to deliver our CD Processes in a qualitative, rapid and reliable way.

Then there is the problem of that data that we collect during the pipe. By default the data in a CI system is stored in that CI systems. Often on disk on that instans of that CI server. Adding insult to injury navigation of the build data is often tided to the current implementation of the build pipe. This means that a change to the build pipe means that we can no longer access the build data.

For a few years now we have been off loading all the build data into different types of storages depending on what type of data it is. Meta data around the build we store in a custom database. Logs go to our ELK stack, metrices to Graphite and reports to S3.

Still we have had trouble delivering quality Pipes. Now that has changed.

We still use a CI Server to trigger the Pipe. On the CI server we now have one job "DoIt". The "DoIt" job executes the right build pipe for every application. Lets talk a bit on how we pick the pipe.

Each git repo contains a YML file that says how we should build that repo. Thats more or less the only thing that has to be in the repo for us to start building it. We ingore all repos without the YML files. So we listen to all the gerrit triggers and ignore ones withouth

The YML is simply pretty much just

pipe: application-pipe
jdk: JDK8

We describe our build pipes in YML and implement our tasks in Groovy. Here is a simple definition.

build:
first:
- do: setup.Clean - do: setup.Init
main:
- do: build.Build - do: test.Test
last:
- do: log.ReportBuildStatus
last:
last:
- do: notify.Email

Each task has a lifecycle of first, main, last. The first section is always executed and all of the "do´s" in the first section are executed regardless of result. In the main secion the "do´s" are only execute if everything has gone well so far. Last is always executed regardless of how things went.

The "do´s" are references to groovy classes with the first mandatory part of the package stripped. So there is a com.something.something.something.setup.Clean class.

A Context object is passed through all the execute methods of the "do´s". By setting context.mock=true the main executing process adds the sufix "Mock" to all "do´s". This allows us to unit test the build pipe inorder to assert that all the steps that we expect to happen do happen in the correct order.

When alot of things start happening its not really practicall to have a build task all that verbose especially since we have multiple pipes that share the same build task. So we can create a "build.yml" and a "notify.yml" which we then can include like this.

build:
ref : build
last:
last:
- do: notify

So this is how our build pipes look and we can unit test the pipe, the tasks and each "do" implementaiton.

Looking at a full pipe example we get something like this.

init:
ref: init
build:
parallel:
build:
ref: build.deployable
provision:
ref: provision.create-test-environment
deploy:
ref: deploy.deploy-engine
test:
functional-test:
ref: test.functional-tests
load-test:
ref: test.load-tests
release:
parallel:
release:
ref: release.publish-to-nexus
bake:
ref: bake.ami-with-packer
last:
parallel:
deprovision:
ref: provision.destroy-test-environment
end:

ref: end

Thats it.!

This pipe builds, functional tests, load tests and publishes our artifacts as well as baking images for our AWS environments. All the steps report to our meta data database, elk, graphite, s3 and slack.

And ofcourse we use our build pipes to build our build pipe tooling.

Continuous Delivery of Continuous Delivery through build Pipes as Code. High score on the buzzword bingo!

Wednesday, May 14, 2014

Tomorrow is the premiere of "Scaling Continuous Delivery" at GeeCon 2014

Tomorrow on the 15th of May Ive got a talk at GeeCon in Krakow. Its the first outing of my new talk Scaling Continuous Delivery. The talk is an experience report on all the struggles we have had scaling our continuous delivery rollout. Hopefully the talk will provide an insight to what we have done and the steps we have taken while scaling. Sometimes its not just the end goal that is interesting but also the journey.

Hopefully its will be appreciated.

Here are the slides for the talk http://www.slideshare.net/TomasRiha/scaling-continuous-delivery-geecon-2014

Friday, February 21, 2014

Scaling Continuous Delivery

Its been a while since I posted. Main reason is that we have been very focused on our main deliveries and feature development for the last six month. Whenever the feature train hits central station its always work such as build, release, test automation that gets hit first.

Though there are upsides to not touching your Continuous Delivery process for a few months. If you just keep working on your backlog you don't get time to analyze the impact of the changes you just made. Several times we have realized that the number two/three items in the backlog have dropped significantly in priority as we have fixed the most important issue and others rising fast in priority.

Now we have had time to analyse a lot of new issues and its time for us to pick up the pace again.

Scaling the Organization

The good thing, the awesome thing (!) is that during these six or so months our organization has changed and we have actually be able to create a line organization that owns and takes responsibility for the continuous delivery process.

One of the major bottlenecks we found in our process was our platform/tools team. The team was small and resources in that team where always first to go when feature pressure increased. The team became just another "IT function" that didn't have time to be proactive due to all the reactive support work it had to do.

There was a few reasons behind this first it was the way the team worked in the past. It actually built the pipes and processes for all the teams by hand and tailored to the custom needs of each team. On some teams there were individuals who picked up the work and kept on configuring the jenkins jobs to tailor them even more but on some teams there was no interest whatsoever and their jobs degraded.

The result of this was that no one really knew how the pipes looked and how they should look. Introducing process change was a horribly slow process as it was all manual and dependent on the platform/tools team.

One of the first changes we made was to increase the bandwidth of the team and reducing the dependency on that team. @TobiasPalmborg provided a great solution for this over a chat this summer. Instead of the platform/tools team supporting the development teams the development teams put resources into the platform/tools team. Each team was invited to add a 50% resource on a volunteered basis. This way the real life issues got much better attention in the platform/tools team and the competence about the Continuous Delivery process got spread in a much more organic way.

This did not eliminate the bottleneck organization but it gave us bandwidth to change the way we work and long term gave us the ability to scale with the number of teams that use the process.

Scaling the Process

The main issue with why we were a bottleneck was the way we worked. We preached Automate Everything, Test Everything, If its hard do it more often, ect but when it came to the Continuous Delivery process we didn't do what we where teaching.

We had ONE Jenkins Environment so all the changes happened directly in production. Testing plugins and new configurations on a production environment isn't really the way to delivery stability, reliability and performance.

Manually created Jenkins Pipes isnt really a way to create sustainable pace and continuous improvements.

Developing Deploy scripts without explicit unit tests isnt really a good way of creating a stable process. We have been priding ourselves with our deployment being tested hundreds of times pre production deploy which was true but very dumb. Implicit testing means that someone else takes the pain for my mistakes. Deployment scripts are applications and need to be treated as first class citizens.

This had to change.

First thing we did was to use the extra bandwidth we had obtained to build a totally new way of delivering continuous delivery. Automate everything, obvious, hu?

We also decided to deliver a continuous delivery environment per development team and not have them all in one environment. So we started with automating provisioning of Jenkins & Test environments. We dont have a cloud solution in our company at this time so we have a fake cloud that we work with which is a huge pool of virtual servers. This pool we provision and maintain using chef.

Second thing was to automate the build pipe setup. We built us a little simple pipe generator which has defined pipe templates of 5-8 different layouts to support the different needs. We actually managed to get the development teams to adjust to a stricter maven project naming convention to use the generated pipes as everyone saw the benefits of this.

The pipes we have are basically typed by what they build if its libs or deployable components and how they are tested as we still need to initiate our Fitnesse tests a bit differently from our other tests.

We made it the responsibility of the platform/tools team to develop the pipe templates and the responsibility of the development teams to configure their generator to generate the pipes they needed for their components.

Getting to this stage was a lot of work and a lot of migration work for all the teams but the results have been terrific. The support load has gone down alot on the platform/tools team and each bug fix is rolled out within minutes to all the pipes.

We have also be able to take on new development teams very easily. Not all teams in our company are ready to do Continuous Delivery but they are all heading in this direction and we can now provide environments and pipelines that match their maturity.

Summary

We have gone from a process developed as skunkworkz to Continuous Delivery as a Service within our organization. We always run into new bottlenecks and challenges this time the bottleneck was much more us than anything else. I assume that the next big bottleneck is going to be hardware and our inability to deliver on a cloud solution, since we now can roll out to more and more teams. But who knows I can be wrong only time will tell.

Monday, December 17, 2012

Test stability

The key to a good Continuous Delivery process is a stable regression suite. There are a few different types of instability on can encounter.

• Lack of robustness
• Random test result
• Fluctuation in execution time

Lack of robustness often comes from incorrectly defined touch points. If a test needs to include a lot of functionality in order to verify a small detail then its bound to lack robustness. Even worse is if a test directly touches implementation.

Adding extra interfaces to create additional Touch Points.

I'm sure everyone has broken unit tests when refactoring without actually breaking any functionality. This almost always happens when unit tests evaluate implementation and not functionality. For example when you just split methods to clean up responsibility.

Tough this doesn't only happen to unit tests it can just as easily, if not easier to functional acceptance tests. If they for instance touch database in validation, instrumentation or data setup you will break your tests every time you refactor your database.

Testers seem to be very keen on verifying against the database. This is understandable since in an old monolith system with manual verification you only have two touch points GUI and db. It's important to work with test architecture and to educate both developers and testers to prevent validation of implementation.

Another culprit is dependencies between implementation and test code. For example Fitnesse fixtures that directly access, model objects, DAOs or business logic implementation.

Tests must be robust enough to survive a refactoring and addition of functionality. Change of functionality is obviously another matter.

We actually managed this quite well with our touch points.

Random test failures are a horrible thing because it will result in either unnecessary stoppage of the line or people loosing respect for the color red.

This is where we have had most our issues. At one point our line was failing 8 of 10 times it was executed and at least 7 of these where false negatives. "Just run em again" was the most commonly heard phrase. Obviously the devs totally lost respect for the color red. The result was that when now and again a true bug caused the failures no on cared to fix it and they started stacking.

Result of our Jenkins Job History for
Functional Tests could look like this.

So why did we have these problems? First of all our application is heavily asynchronous so timing is a huge issue in our tests. Many of our touch points are asynchronous triggers that fire off some stuff in the application and then we wait for another trigger from the application to the test before we validate. We don't really have much architectural room here as its an industry standard pattern. This in it self isn't a big deal except that each asynchronous task schedules other asynchronous tasks. So a lot of things happen at the same time.

Since its desirable to keep execution times short we reconfigure the system at test time to use much shorter timeouts and delays. This further increases the amount of simultaneous stuff that happens for each request.

When all these simultaneous threads hit the same record in one table you get transactional issues. We had solved this through use of optimistic locking. So we had a lot of rollbacks and retries. But it "worked". But our execution times where very unpredictable and since our tests where sensitive to timing they failed randomly.

Really though did we really congest our test scenarios so much that this became a problem? I mean how on earth could we hit the optimistic lock so often that it resulted in 7 out of 10 regressions failing due to it?

Wasn't it actually so that the tests where trying to tell us something? Eventually we found that our get requests where actually creating transactions due to an incorrectly instances empty list, marking an object as new. We also changed our pool sizing as we exhausted it causing a lot of wait.

So we had a lot of bugs that we blamed on the nature of our application.

Listen to your tests!! They speak wise things to you!

Eventually we refactored that table everyone was hitting so that we don't do any updates on it but rather track changes in a detail table. Now we didn't need any locking at all. Sweet free feeling. Almost like dropping in alone on a powder day. ;)

Fluctuation in execution time. I find it just as important that a test executes equally fast every time as it is that it always gives the same result.

First of all a failed test should be faster then a success, give me feed back ASAP! Second if it takes 15 sec to run it should always take 15 sec, not 5 or 25 at times. This is important when building your suites. You want to have your very fast smoke test suites your bang per min short suites and your longer running suites. Being able to easily group tests by time is important.

It's also important to be able to do a trend analysis on the suite execution time. It's a simplistic but effective way to monitor decrease in performance.

We have actually nailed quite a few performance bugs this way.

Asynchronous nature makes everything harder to test but don't make it an excuse for your self. Random behavior is tricky and frustrating in a pipe but remember TRUST YOUR TESTS!

About Me

I´ve worked as a Java Developer/Architect for 15 years. I´ve worked as part of a consulting organization and as part of a line organization.

Over the last 6 years I´ve had an ever increasing interest in the quality of the delivery. Initially this interest lead me to work with automation of system tests. Then more and more towards automation release and deploy processes. Now for the last two years Ive focused alot of my work on the full Continuous Delivery process.

This blog will server as a collections of lessons learned from my work. Mostly just for my self but Im happy to share my experiences if anyone is interested.

Follow @TomasRihaSE

Pages