
Wednesday, December 26, 2012

Process Scalability

When we started working on our continuous delivery process our team was very small: three devs at two sites in different time zones. During the first six months we added two or three developers, so we were quite small for quite some time.

Then we grew very quickly to our current size of about thirty developers and eight or so testers. We grew the team in about six months. Obviously this created huge issues for us with getting everyone up to speed. It exposed all the flaws we have with setting up and handling our dev environment. But not only that, it also exposed issues with the scalability of our continuous delivery process.

With the increased number of developers the number of code commits increased. Since we test everything on every code commit, our process started stacking test jobs. For each test type we had a dedicated server, so each deploy and the following test jobs had to be synchronized, resulting in a single-threaded process. This didn't bother us when we were just a few code committers, but when we grew it became a huge issue.

Dedicated Test Server being the bottleneck
The biggest issue we had was that the devs didn't know when to take responsibility. If the process scales, then the time it takes for a commit to go through the pipe is identical regardless of how many commits were made simultaneously. The time of our pipe was about 25-30 min. A bit long, but bearable IF it were the same for each commit. But since the process didn't scale, the time for a developer's check-in to go through was X*25 min, where X = the number of concurrent commits.

This was particularly bad in the afternoon when developers wanted to check in before leaving. Sometimes a check-in could take up to two or three hours to go through, and obviously devs wouldn't wait it out before leaving. So we almost always started the day with a broken pipe that needed fixing. Worse yet, our colleagues in other time zones always had broken pipes during their day, and they usually lacked the competence to fix the pipe.

Since the hardest thing with continuous delivery is training developers to take responsibility, it's key that it's easy to take responsibility. Visibility and feedback are very important factors, but it's also important to know WHEN to take responsibility.

The solution was obviously to start working with non-dedicated test servers, though this was easier said than done. If we had had cloud nodes this would have been a walk in the park to solve. Spinning up a new node for each assembly, and hence having a dedicated test node per assembly, would scale very well. But our world isn't that easy. We don't use any cloud architecture. Our legacy organization isn't a very fast adopter of new infrastructure. This is quite common in large, old organizations and something we need to work around.

Our solution was to take all the test servers we had, put them into a pool of servers and assign them to the testing of one assembly at a time.
Pipe 1 has to finish before any other thread can use that pooled server instance.
This solves the scaling but introduces another problem: we need to return servers to the pool. With cloud nodes you just destroy them when done and never reuse them. Since we do reuse, we need to make sure that once a deploy starts on a pooled server, all the test jobs get to finish before the next deploy starts.

We were quite uncertain how we wanted to approach the pooling. Did we really want to build some sort of pool manager of our own? We really, really didn't, because we felt that there had to be some kind of tool that already does this.

Then it hit us. Could we do this with Jenkins slaves? Could our pool of test servers be Jenkins slaves? Yes they could! Our deploy jobs would just do a localhost deploy, and our test jobs would target localhost instead of the IP of a test server.
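
To make that concrete, here is a minimal sketch of what a test job step could look like on a pooled slave. The build tool and property name below are just assumptions for illustration, not our actual scripts:

```bash
#!/bin/bash
# Sketch of a test job running on a pooled Jenkins slave.
# The deploy job earlier in the pipe has already deployed to this same
# slave, so the tests simply target localhost instead of a fixed server.
set -eu

# Before: each test type had its own dedicated server, e.g.
# TEST_TARGET="10.0.12.34"
# After: the whole pipe runs on one pooled slave
TEST_TARGET="localhost"

# Hypothetical test invocation; the property name is made up
mvn verify -Dtest.target.host="${TEST_TARGET}"
```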

The hard part was to figure out how to keep a pipe on the same slave and not have another pipe hijack that slave between jobs. But we finally managed to find a setup that worked for us, where an entire pipe is executed on the same slave and Jenkins blocks that slave for the duration of the pipe.
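
One possible way to wire this up (a sketch only; the URL, job name and node parameter are assumptions, not necessarily the exact plugins and configuration we ended up with) is to let each job pass the slave it ran on to the next job in the pipe, which is then restricted to that node:

```bash
#!/bin/bash
# Sketch: trigger the next job in the pipe on the same slave.
# Assumes the downstream job is parameterized with a node parameter
# (for example via the NodeLabel Parameter plugin) called TARGET_NODE.
set -eu

JENKINS_URL="http://jenkins.example.com"   # assumed Jenkins URL
NEXT_JOB="acceptance-test"                 # assumed downstream job name

# NODE_NAME is set by Jenkins to the name of the slave running this build
curl -fsS -X POST --user "builder:API_TOKEN" \
  "${JENKINS_URL}/job/${NEXT_JOB}/buildWithParameters?TARGET_NODE=${NODE_NAME}"
```

On top of that, the slave needs a single executor (or some throttling) so that no other pipe can grab it between the jobs of a pipe.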

As of writing this post we are just about to start re-configuring our jobs to set this up. Hopefully, when we have this fully implemented in two weeks or so, we will have a process that scales. For our developers this will be a huge improvement, as they will always get feedback within 25 min of their check-in.

5 comments:

  1. Thanks Tomas, very informative.

    Getting developers to gain ownership is indeed a challenge that we are facing as well on our way to CD.

    How does the 1 tester to 4 developers ratio work for you so far? Do you think you could do better or worse with fewer?

    Do you think giving ownership of features to Developers end-to-end could work better for you? That would mean that they would have to own the acceptance tests, performance tests etc.

  2. Konstantinidis

    It's actually funny that you ask. We have just recently (the week before Christmas) made a change to who owns feature implementation.

    We have teams focused around components, so each developer is seasoned in one or more of our components. Then we have rollout teams that are basically requirements and test teams. This hasn't worked too well for us, as the testing has become too detached from development and our component teams have been too hard to coordinate in their work.

    So we just changed it so that we have feature teams. Developers are still part of component teams and our rollout team still owns the requirements and tests. But all development will be done in feature teams, where the production of the feature is owned by an architect/senior dev. For us this is quite easy, as we have quite a senior team.

    But ownership of tests is interesting. We are still on the fence about whether the feature-owning dev should be responsible for the full feature from req to test, or if there should be joint ownership between an architect/dev and a tester.

    The problem is that if it's owned solely by a dev/architect, then the tests themselves will turn out crap but there will be test-driven development. If it's owned or co-owned by a tester, then there will be test-last development and non-responsible testers.

    So, to answer your question about the ratio: I think the more testers we have, the less test-driven the development is, and our continuous delivery process gets contaminated with more bugs in main flows. On the other hand, if we have fewer testers we work more test-driven, but we don't test enough details and corner cases and end up with a delivery that is only tested on main flows.

    What we need is a new type of tester: basically devs who have become so interested in quality and testing that they have learned to actually write quality tests. Until we get that, I really think that testers should work early in the process, helping developers design test cases, but not really do any of the actual testing work.

  3. This comment has been removed by the author.

  4. Great article, thanks for sharing your experiences.

    I'd like to know more about the acceptance test failures feedback. Do you send an email to a group list or do you get a list of committers from the build phase?

    About running Jenkins agents on application servers - which was a great idea - do you see any possible impact from having them running there?

    Thanks.

  5. Yes Henrique, that is a good question.

    The method of feedback is another (damn, we have many :/ ) weakness in our process. At the moment it's actually through big wall monitors using the Jenkins StatusView plugin. That way we see when it breaks, and if it's not fixed soonish we stop work and try to chase down the problem. This is far from adequate, and we will add sending mail to the code committer. But our problem is that our email and our development environment aren't in the same IT environment, so we need to access our mail through a Citrix client. Yeah, IT can be hard in 2013.. :/

    As I said, we are just about to start rebuilding our pipe to use the Jenkins slaves as pooled test server nodes, so I don't have much real evidence to provide just yet. But I think the main issue can come from all deploy and test jobs targeting localhost and not a remote IP. This means that we can go quite far in our process without realizing that a change required an infrastructure change as well, i.e. a firewall opening.

    Localhost could also become a source of bugs in our deploy scripts, as they are written in bash and *should* be implemented in a way where the source (Jenkins) and the target (app server) are two remote systems. If someone starts copying things locally without using rsync, it will not show until the first remote deploy, which now will be our partner integration environment (the step before user acceptance testing).
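
    To make that pitfall concrete, here is a minimal sketch of the copy step done the way the scripts *should* do it (the paths and user are made up), treating source and target as two systems even when the target happens to be localhost:

    ```bash
    #!/bin/bash
    # Sketch of the copy step in a deploy script.
    set -eu

    TARGET_HOST="${1:-localhost}"          # pooled slave or a real app server
    ARTIFACT="build/myapp.war"             # made-up artifact path
    DEPLOY_DIR="/opt/appserver/deploy"     # made-up target directory

    # Always go through rsync over ssh, even when TARGET_HOST is localhost.
    # A plain "cp $ARTIFACT $DEPLOY_DIR" would work fine on the slave but
    # silently break on the first real remote deploy.
    rsync -az -e ssh "${ARTIFACT}" "deployer@${TARGET_HOST}:${DEPLOY_DIR}/"
    ```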
