
Friday, February 21, 2014

Scaling Continuous Delivery

It's been a while since I posted. The main reason is that we have been very focused on our main deliveries and feature development for the last six months. Whenever the feature train hits central station, it's always work such as build, release and test automation that gets hit first.

There are upsides to not touching your Continuous Delivery process for a few months, though. If you just keep working through your backlog you never get time to analyze the impact of the changes you just made. Several times we have realized that the number two and three items in the backlog dropped significantly in priority once we had fixed the most important issue, while other items rose fast.

Now we have had time to analyse a lot of new issues and it's time for us to pick up the pace again.

Scaling the Organization 

The good thing, the awesome thing (!), is that during these six or so months our organization has changed and we have actually been able to create a line organization that owns and takes responsibility for the Continuous Delivery process.

One of the major bottlenecks we found in our process was our platform/tools team. The team was small, and resources in that team were always the first to go when feature pressure increased. The team became just another "IT function" that didn't have time to be proactive due to all the reactive support work it had to do.

There were a few reasons behind this. First was the way the team had worked in the past: it built the pipes and processes for all the teams by hand, tailored to the custom needs of each team. On some teams there were individuals who picked up the work and kept configuring the Jenkins jobs to tailor them even more, but on other teams there was no interest whatsoever and their jobs degraded.

The result was that no one really knew how the pipes looked or how they should look. Introducing process change was horribly slow, as it was all manual and dependent on the platform/tools team.

One of the first changes we made was to increase the bandwidth of the team and reduce the dependency on it. A great solution for this came up over a chat this summer: instead of the platform/tools team supporting the development teams, the development teams put resources into the platform/tools team. Each team was invited to contribute a 50% resource on a voluntary basis. This way the real-life issues got much better attention in the platform/tools team, and competence about the Continuous Delivery process spread in a much more organic way.

This did not eliminate the organizational bottleneck, but it gave us bandwidth to change the way we work and, long term, the ability to scale with the number of teams that use the process.

Scaling the Process

The main reason we were a bottleneck was the way we worked. We preached Automate Everything, Test Everything, If It's Hard Do It More Often, etc., but when it came to the Continuous Delivery process we didn't practice what we were teaching.

We had ONE Jenkins environment, so all changes happened directly in production. Testing plugins and new configurations on a production environment isn't really the way to deliver stability, reliability and performance.

Manually created Jenkins pipes aren't really a way to create sustainable pace and continuous improvement.

Developing deploy scripts without explicit unit tests isn't really a good way of creating a stable process. We had been priding ourselves on our deployment being tested hundreds of times before every production deploy, which was true but very dumb. Implicit testing means that someone else takes the pain for my mistakes. Deployment scripts are applications and need to be treated as first-class citizens.
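To make the point concrete, here is a minimal sketch of treating deploy logic as testable code. The function, file and host names are hypothetical, not taken from our actual scripts; the point is simply that the logic deciding what gets deployed where can be exercised by a plain unit test instead of only implicitly through someone else's pipe.

```python
# deploy.py -- hypothetical example: deploy logic as a plain, testable function
def rollout_order(hosts, canary_count=1):
    """Split hosts into a canary group and the remaining batch."""
    if canary_count < 1 or canary_count >= len(hosts):
        raise ValueError("canary_count must be between 1 and len(hosts) - 1")
    return hosts[:canary_count], hosts[canary_count:]


# test_deploy.py -- explicit unit tests, runnable with pytest
import pytest

def test_rollout_order_splits_canary_from_batch():
    canary, batch = rollout_order(["app1", "app2", "app3"], canary_count=1)
    assert canary == ["app1"]
    assert batch == ["app2", "app3"]

def test_rollout_order_rejects_all_canary():
    with pytest.raises(ValueError):
        rollout_order(["app1"], canary_count=1)
```

Small tests like these catch mistakes before the scripts ever touch a shared pipe, instead of letting the pipe be the first test run.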

This had to change.

The first thing we did was use the extra bandwidth we had gained to build a totally new way of delivering Continuous Delivery. Automate everything, obvious, huh?

We also decided to deliver a Continuous Delivery environment per development team instead of having them all in one environment. So we started by automating the provisioning of Jenkins and test environments. We don't have a cloud solution in our company at this time, so we work with a "fake cloud": a huge pool of virtual servers. This pool we provision and maintain using Chef.
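As an illustration, here is a rough sketch of how servers from such a pool can be handed over to Chef, driven from Python. The inventory file, host names and role names are made up and this is not our actual tooling; it just shows the idea of turning a plain list of virtual servers into Chef-managed Jenkins and test nodes with `knife bootstrap`.

```python
# provision_pool.py -- hypothetical sketch: bootstrap pool servers with Chef
import subprocess

# pool.txt is assumed to contain one "hostname role" pair per line,
# e.g. "vm-0042 jenkins-master" or "vm-0043 test-node"
def bootstrap_pool(inventory_path="pool.txt"):
    with open(inventory_path) as inventory:
        for line in inventory:
            if not line.strip():
                continue
            host, role = line.split()
            # knife bootstrap installs the Chef client on the node and
            # assigns it a run list; the role names here are invented.
            subprocess.run(
                ["knife", "bootstrap", host,
                 "--node-name", host,
                 "--run-list", f"role[{role}]"],
                check=True,
            )

if __name__ == "__main__":
    bootstrap_pool()
```

The point is that the whole environment, Jenkins master included, comes out of the pool the same automated way, so a new team gets its own environment without anyone hand-crafting servers.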

The second thing was to automate the build pipe setup. We built ourselves a simple pipe generator with 5-8 predefined pipe template layouts to support the different needs. We actually managed to get the development teams to adopt a stricter Maven project naming convention in order to use the generated pipes, as everyone saw the benefits.

The pipes we have are basically typed by what they build, libraries or deployable components, and by how they are tested, as we still need to initiate our FitNesse tests a bit differently from our other tests.

We made it the responsibility of the platform/tools team to develop the pipe templates and the responsibility of the development teams to configure their generator to generate the pipes they needed for their components.
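As a rough sketch of that division of labour, assuming a hypothetical team configuration and template set (none of these names come from our actual generator): the team declares its components and their pipe type, and the generator renders the matching template into a Jenkins job definition and pushes it through Jenkins' standard create-job REST endpoint.

```python
# pipe_generator.py -- hypothetical sketch of a template-based pipe generator
import requests  # assumes the requests library is available

JENKINS_URL = "http://jenkins.example.com"          # placeholder
TEMPLATES = {                                        # placeholder config.xml templates
    "lib": "templates/lib-pipe.xml",
    "deployable-fitnesse": "templates/deployable-fitnesse-pipe.xml",
}

# A team's generator config: component name -> pipe type (invented examples).
TEAM_PIPES = {
    "billing-core": "lib",
    "billing-web": "deployable-fitnesse",
}

def generate_pipes(auth):
    for component, pipe_type in TEAM_PIPES.items():
        with open(TEMPLATES[pipe_type]) as template:
            config_xml = template.read().replace("@@COMPONENT@@", component)
        # Jenkins creates a job from a config.xml posted to /createItem
        response = requests.post(
            f"{JENKINS_URL}/createItem",
            params={"name": f"{component}-pipe"},
            data=config_xml,
            headers={"Content-Type": "application/xml"},
            auth=auth,
        )
        response.raise_for_status()

if __name__ == "__main__":
    generate_pipes(auth=("ci-admin", "api-token"))  # placeholder credentials
```

A real setup would also handle updating existing jobs and Jenkins' CSRF protection, but the shape is the same: templates owned centrally by the platform/tools team, configuration owned by each development team.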

Getting to this stage meant a lot of work, and a lot of migration work for all the teams, but the results have been terrific. The support load on the platform/tools team has gone down a lot, and each bug fix is rolled out to all the pipes within minutes.

We have also been able to take on new development teams very easily. Not all teams in our company are ready to do Continuous Delivery, but they are all heading in this direction and we can now provide environments and pipelines that match their maturity.

Summary

We have gone from a process developed as skunkworks to Continuous Delivery as a Service within our organization. We always run into new bottlenecks and challenges; this time the bottleneck was much more us than anything else. I assume the next big bottleneck is going to be hardware and our inability to deliver on a cloud solution, now that we can roll out to more and more teams. But who knows, I can be wrong; only time will tell.





Wednesday, December 26, 2012

Process Scaleability

When we started working on our Continuous Delivery process our team was very small: three devs across two sites in different time zones. During the first six months we added two or three developers, so we were quite small for quite some time.

Then we grew very quickly, in about six months, to our current size of about thirty developers and eight or so testers. Obviously this created huge problems with getting everyone up to speed, and it exposed all the flaws in how we set up and handle our dev environments. But not only that: it also exposed scalability issues in our Continuous Delivery process.

With the increased number of developers the number of code commits increased. Since we test everything on every commit, our process started stacking test jobs. For each test type we had a dedicated server, so each deploy and its following test jobs had to be synchronized, resulting in a single-threaded process. This didn't bother us when we were just a few committers, but when we grew it became a huge issue.

[Figure: the dedicated test servers being the bottleneck]
The biggest issue we had was that the devs didn't know when to take responsibility. If the process scales, the time it takes for a commit to go through the pipe is the same regardless of how many commits were made simultaneously. Our pipe took about 25-30 min. A bit long, but bearable IF it were the same for every commit. Since the process didn't scale, the time for a developer's check-in to go through was X*25 min, where X = the number of concurrent commits.

This was particularly bad in the afternoon, when developers wanted to check in before leaving. Sometimes a check-in could take two or three hours to go through, and obviously devs wouldn't wait it out before leaving. So we almost always started the day with a broken pipe that needed fixing. Worse yet, our colleagues in other time zones always had broken pipes during their day, and they usually lacked the competence to fix them.

Since the hardest thing in Continuous Delivery is training developers to take responsibility, it's key that taking responsibility is easy. Visibility and feedback are very important factors, but it's also important to know WHEN to take responsibility.

The solution was obviously to start working with non-dedicated test servers, though this was easier said than done. If we had had cloud nodes this would have been a walk in the park: spinning up a new node for each assembly, and hence having a dedicated test node per assembly, would scale very well. But our world isn't that easy. We don't use any cloud architecture, and our legacy organization isn't a fast adopter of new infrastructure. This is quite common in large, old organizations and something we need to work around.

Our solution was to take all the test servers we had, put them into a pool, and assign each one to testing a single assembly at a time.
[Figure: pipe 1 has to finish before any other thread can use that pooled server instance]
This solves the scaling but introduces another problem: we need to return servers to the pool. With cloud nodes you just destroy them when done and never reuse them. Since we do reuse, we need to make sure that once a deploy starts on a pooled server, all the test jobs get to finish before the next deploy starts.

We were quite uncertain how we wanted to approach the pooling. Did we really want to build some sort of pool manager of our own? We really, really didn't, because we felt there had to be some kind of tool that already does this.

Then it hit us. Could we do this with Jenkins slaves? Could our pool of test servers be Jenkins slaves? Yes they could! Our deploy jobs would just do a localhost deploy, and our test jobs would target localhost instead of the IP of a test server.

The hard part was figuring out how to keep a pipe on the same slave and not have another pipe hijack that slave between jobs. We finally managed to find a setup that worked for us, where an entire pipe executes on the same slave and Jenkins blocks that slave for the duration of the pipe.
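Conceptually, what this setup gives us looks roughly like the sketch below. It is a simplified Python illustration, not our actual Jenkins configuration, and the slave names, stage names and timings are made up: a slave is leased to one pipe, every deploy and test stage runs against that slave, and only when the whole pipe is done is the slave released back to the pool for the next queued pipe.

```python
# slave_pool_sketch.py -- conceptual illustration of pipes pinned to pooled slaves
import queue
import threading
import time

slave_pool = queue.Queue()
for name in ["slave-1", "slave-2", "slave-3"]:   # made-up slave names
    slave_pool.put(name)

def run_pipe(commit):
    slave = slave_pool.get()          # blocks until a slave is free
    try:
        # every stage of this pipe runs on the same slave
        for stage in ["deploy", "unit tests", "acceptance tests"]:
            print(f"{commit}: {stage} on {slave}")
            time.sleep(0.1)           # stand-in for real deploy/test work
    finally:
        slave_pool.put(slave)         # only now can another pipe use this slave

threads = [threading.Thread(target=run_pipe, args=(f"commit-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With enough slaves in the pool for the number of concurrent commits, each pipe finishes in its own roughly constant 25-30 minutes instead of waiting behind every pipe queued ahead of it.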

As of writing this post we are just about to start reconfiguring our jobs to set this up. Hopefully, when we have it fully implemented in two weeks or so, we will have a process that scales. For our developers this will be a huge improvement, as they will always get feedback within 25 minutes of their check-in.