Tomorrow, on the 15th of May, I've got a talk at GeeCon in Krakow. It's the first outing of my new talk, Scaling Continuous Delivery. The talk is an experience report on all the struggles we have had scaling our continuous delivery rollout. Hopefully the talk will provide an insight into what we have done and the steps we have taken while scaling. Sometimes it's not just the end goal that is interesting but also the journey.
Hopefully it will be appreciated.
Here are the slides for the talk http://www.slideshare.net/TomasRiha/scaling-continuous-delivery-geecon-2014
Tuesday, April 22, 2014
Environment Portability
I've talked about this a lot before and we have done a lot of work in this area, but it cannot be stressed enough how important it is. In fact, I think portability is the key success factor in building a good Continuous Delivery implementation. Without portability there is no scalability.
There are two things that need to be built with portability in mind: the build pipe itself, and our dev, test and prod environments. If the build pipe isn't portable then it will become a bottleneck. In fact, if the build pipe can run on each individual developer machine without the use of a build server, then it is portable enough to scale in a build server environment.
Though in this post I will focus on Environment Portability from Desktop to Production.
Why do we need Portable Production Environments?
For years we have accepted the fact that the Test Environment isn't really like the Prod Environment. We know that there are differences, but we live with it because it's too hard and too expensive to do anything about. When doing Continuous Delivery it becomes very hard to live with and accept.
"If its hard do it more often" applies here as well.
The types of problems we run into as a result of non-portable Production Environments are problems that are hard by nature: scalability, clustering, failover, etc. These are non-functional requirements that need to be designed in early. By exposing the complexity of the production environment late in the pipe we create a very long feedback loop for the developers. It can be days or weeks, depending on how often you hit prod with your deployments.
By increasing the portability of the production environment we increase the productivity of our developers and the quality of our application. This is obvious, but there is another very important issue to deal with as well. Every time a deployment in UAT or Prod fails it undermines the credibility of Continuous Delivery. Each time something that worked in UAT fails in Prod, the person responsible for UAT will call for more manual tests before production. This obviously increases the lead time even further, making the problem worse, so we need to constantly improve in order to manage fear.
If we have Portable Production Environments then the issues that stem from Environment Complexity will never hit production as they get caught much earlier in the pipe.
Who owns the definition of our Environments?
The different environments that we have in an organization: who defines them and who owns them? There are a lot of variations here, as there are a lot of variations in environment definitions and in organizations. In the typical, somewhat dysfunctional organization out there, the Ops team owns the Production environments and often the machines in the earlier environments. How the earlier environments are set up and used is often the responsibility of the Dev team, and if the organization is even more dysfunctional there is often a Delivery Team involved as well, which defines how its environments are used.
This presents us with the problem that developers quite often have no clue what the production environment looks like and implement the system based on assumptions or ignorance. In the few cases where they actually have to implement something that has to do with scaling, clustering or failover, it's often just guesswork as they don't have a way to reproduce it.
Going into production, this most often creates late requests for infrastructural changes, and sometimes even cases where solutions were based on assumptions that cannot be realized in a production environment.
What is portability?
When we talk about Portable Production Environments, what do we really mean? Does it mean that we have to move all our development to online servers, and that all dev teams get their own servers that are identical to production but at a smaller scale? Well, not really. It is doable, especially with the use of cloud providers, but I find it highly inconvenient to force the developer to be online and on a corporate network in order to develop. Having the ability to create a server environment for a developer on a need-to-have basis is great, because it covers the gap where local environments cannot be fully portable.
Assuming that one cloud provider will solve all your portability needs across all environments is not realistic in the short term in most enterprise organizations, unless you are already fully committed and fully rolled out to a cloud provider. There are always those legacy environments that need to be dealt with.
I think the key is to have portability of the topology of the environment. An environment built in AWS with ELBs will never be fully portable, since you will never have an ELB locally. But having a load balancer in your dev environment and having multiple application nodes forces you to build for horizontal scalability, and it will capture a whole lot more than just running on one local node.
Running Oracle XE isn't really the same as running Oracle Enterprise Edition, but it provides good enough portability. Firing it up in a VirtualBox of its own forces the DB away from localhost.
In our production environment we monitor our applications using things like Graphite and Logstash+Kibana. These tools should be part of the development environment as well, to increase awareness of the runtime and enable "design for runtime".
Creating the Development Environment described here with Vagrant is super easy. It can be built using other tools as well, such as Docker or a combination of Vagrant and Docker, but if you use Docker then it needs to be used all the way. My example here, with just Vagrant and VirtualBox, is to show that portability can be achieved without introducing new elements into your production environment.
Individual Environment Specifications and One Specification to rule them all.
To create a portable topology we need one way to describe our environments and then a way to scale them. A common, human- and machine-readable Topology Specification that defines which clusters, stores, gateways and integrations there are in the topology of our production environment gives us the ability to share the definition of our environments. Then an environment-specific specification defines the scale of the environment and any mocks of integrations in development and test environments.
In an enterprise organisation we will always have our legacy environments. Over time these might migrate over to a cloud solution, but some will most likely live out their days in a legacy environment. For these solutions we still benefit hugely from portable production environments and one way to define them. In a cloud environment we can recreate their topology and leverage the benefits of portability, even if we cannot really benefit from the cloud in the production environment itself.
For these solutions the Topology Specification and Environment Specification can be used to generate documentation; a little bit of Graphviz can do wonders here. This documentation can then be used as the contract for change requests and for plain documentation purposes.
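As a toy illustration of that documentation idea (this is not our actual tooling, just a minimal sketch that assumes the topologySpec map shown further down), a small Groovy script can walk the specification and emit a Graphviz DOT file:
// Minimal sketch: walk a Topology Specification (the map defined below) and emit Graphviz DOT.
// Node shapes and the .dot file name are arbitrary choices for this illustration.
def writeDot(Map topologySpec) {
    // Map storage network names back to the storage node, so edges point at the store itself.
    def storageByNetwork = topologySpec.storages.collectEntries { [(it.network.name): it.name] }
    def dot = new StringBuilder("digraph \"${topologySpec.name}\" {\n")
    topologySpec.clusters.each { dot << "  \"${it.name}\" [shape=box];\n" }
    topologySpec.storages.each { dot << "  \"${it.name}\" [shape=cylinder];\n" }
    topologySpec.clusters.each { cluster ->
        cluster.networks.each { net ->
            (net.allow_outbound ?: []).findAll { it.to }.each { rule ->
                def target = storageByNetwork[rule.to] ?: rule.to
                dot << "  \"${cluster.name}\" -> \"${target}\" [label=\"${rule.ports}/${rule.protocol}\"];\n"
            }
        }
    }
    dot << "}\n"
    new File("${topologySpec.name}.dot").text = dot.toString()   // render with: dot -Tpng SomeService.dot
}
Run through Graphviz, this gives a box-and-arrow picture of clusters, stores and the allowed traffic between them, which is usually all a change request needs.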
We like Groovy scripts for our Infrastructure as Code. Here is an example of how the setup described above, with the cluster and the Oracle database, could be defined using the Topology Specification. This example covers just clusters, storages and network rules, but more can be added, such as HTTP gateway fronts as 'gateways' and pure network integrations with partners as 'integrations'.
def topologySpec = [name:'SomeService'
,clusters:[[name:'SomeServiceCluster'
,image_prefix:'some-service'
,cluster_port:80
,node_port:8080
,health_check_uri:'/ping/me'
,networks:[[
name:'SomeServiceInternal'
,allow_inbound:[
[from:'0.0.0.0',ports:'80',protocol:'TCP']
,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
],allow_outbound:[
[from:'0.0.0.0',ports:'80',protocol:'TCP']
,[to:'OracleInternal',ports:'1521',protocol:'TCP']
]
]
]
]
],storages:[[name:'Rdbms'
,type:'oracle'
,network:[
name:'OracleInternal'
,allow:[
[from:'SomeServiceInternal',ports:'1521',protocol:'TCP']
,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
]
]
]
]
]
Environment Specifications contain the scaling of the Topology, but can also contain integration mocks as 'clusters' defined just for that environment.
def devEnvSpec = [name:'SomeService'
, clusters:[[name:'SomeServiceCluster'
,cluster_size:2
,node_size:nodeSize.SMALL
]
]
,storages:[
[name:'Rdbms'
,cluster_size:1
,node_size:nodeSize.SMALL
]
]
]
def prodEnvSpec = [name:'SomeService'
, clusters:[[name:'SomeServiceCluster'
,cluster_size:3
,node_size:nodeSize.MEDIUM
]
]
,storages:[
[name:'Rdbms'
,cluster_size:1
,node_size:nodeSize.LARGE
]
]
]
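To illustrate the mock idea mentioned above, a test environment specification could (hypothetically) add a stubbed partner integration as an extra cluster that only exists in that environment. The 'PartnerMock' cluster and 'partner-stub' image name below are invented for illustration:
// Hypothetical test environment spec: same structure as above, plus a mock of an
// external partner integration defined as a cluster that only exists in this environment.
def testEnvSpec = [name:'SomeService'
, clusters:[[name:'SomeServiceCluster'
,cluster_size:1
,node_size:nodeSize.SMALL
]
,[name:'PartnerMock'          //stub for an external integration, invented for illustration
,image_prefix:'partner-stub'  //invented image name
,cluster_size:1
,node_size:nodeSize.SMALL
]
]
,storages:[
[name:'Rdbms'
,cluster_size:1
,node_size:nodeSize.SMALL
]
]
]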
Then the definitions are pushed to the Provisioner implementation, with an input argument saying which environment to provision.
def envSpecs = ['DEV':devEnvSpec
,'PROD':prodEnvSpec
,'LEGACY':prodEnvSpec]
//args[0] is env name 'DEV', 'PROD' or 'LEGACY'
//Provisioner picks the right implementation (Vagrant, AWS or PDF) for the right environment
new Provisioner().provision(topologySpec, envSpecs, args)
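The Provisioner implementation itself is out of scope for this post, but the dispatch idea is small enough to sketch. This is not our actual code; the backend class names are invented, and a real backend would drive Vagrant, the AWS APIs or a document generator instead of printing:
// Sketch of the dispatch behind Provisioner.provision(topologySpec, envSpecs, args).
// The backends are placeholders; real ones would drive Vagrant, AWS or a PDF/doc generator.
interface Backend {
    void provision(Map topologySpec, Map envSpec)
}
class VagrantBackend implements Backend {   // DEV: local VirtualBox environment
    void provision(Map t, Map e) { println "vagrant up for ${t.name}, scale ${e.clusters}" }
}
class AwsBackend implements Backend {       // PROD: real cloud environment
    void provision(Map t, Map e) { println "creating AWS resources for ${t.name}" }
}
class DocBackend implements Backend {       // LEGACY: generate change-request documentation
    void provision(Map t, Map e) { println "rendering change request for ${t.name}" }
}

class Provisioner {
    private Map<String, Backend> backends =
        ['DEV':new VagrantBackend(), 'PROD':new AwsBackend(), 'LEGACY':new DocBackend()]

    void provision(Map topologySpec, Map envSpecs, args) {
        String envName = args[0]                       // 'DEV', 'PROD' or 'LEGACY'
        backends[envName].provision(topologySpec, envSpecs[envName])
    }
}
With the maps above, calling new Provisioner().provision(topologySpec, envSpecs, ['DEV'] as String[]) would end up in the Vagrant backend.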
Test Environments should be non-persistent environments that are provisioned before the test execution and decommissioned when it is finished.
Development environments should be provisioned in the morning and decommissioned at the end of the day. This also solves the problem of building the Dev Environment, which can be a tedious manual process in many organisations.
Production environments, on the other hand, need to support provision, update and decommission, as it's not always convenient to build a new environment for each topological change.
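A rough sketch of the non-persistent test environment lifecycle, with placeholder closures standing in for the real provisioner and test runner (neither is shown in this post):
// Hypothetical test-stage wrapper: bring an environment up, run the tests against it,
// and always tear it down again, even when the tests fail.
def provisionEnv    = { String name -> println "provisioning ${name}" }        // placeholder
def decommissionEnv = { String name -> println "decommissioning ${name}" }     // placeholder
def runTestSuite    = { String name -> println "running tests against ${name}" }

def envName = "TEST-${UUID.randomUUID()}".toString()   // unique name so parallel runs never collide
try {
    provisionEnv(envName)
    runTestSuite(envName)
} finally {
    decommissionEnv(envName)
}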
Also understand that provisioning an environment is not the same as deploying the application. There can be many deployments into a provisioned environment. The Topology Specification doesn't specify which version of the application is deployed, just what the base name of the artefact is. I find that convenient, as the base name can be used to identify which image should be used to build the cluster.
One Specification, One Team Ownership
The Topology Specification should be owned by one team: the team that is responsible for developing the system and putting it into runtime. Yes, I do assume that some sort of DevOps-like organisation is in place at this stage. If it isn't, then I would say that the specification should be owned by the Dev team and the generated documentation should be the contract between Dev and Ops. Consolidating ownership of as many environments as possible into one team should be the aim.
Summary
I think using these mechanisms to provision environments in a Continuous Delivery pipe will increase the quality of the software that goes through the pipe immensely. Not only will feedback be faster, but we will also be able to start tailoring environments for specific test scenarios. The possibilities for increasing quality are enormous.
New Talk: Continuous Testing
On April 29th I will be visiting HiQ here in Gothenburg, where I will have the opportunity to talk about the super important subject of Continuous Delivery and testing. This is an intro-level talk that describes Continuous Delivery and how we need to change the way we work with testing.
If anyone else is interested in this talk then please don't hesitate to contact me. The talk can be focused on a practitioner audience as well, digging a bit deeper into the practices.
Here is a brief summary of the talk.
Why do we want to do Continuous Delivery
An introduction to Continuous Delivery and why we want to do it: what we need to do in order to get there, the principles and practices, and a look at the pipe. (Part of the intro-level talk)
Test Automation for Continuous Delivery
Starting with a look at how we have done testing in the past and our efforts to automate it. Moving on to how we need to work with Application Architecture in order to Build Quality In, so that we can do fast, scalable and robust test automation.
Test Driven Development and Continuous Delivery
How we need to work in a Test Driven way in order to have our features verified as they are completed. A look at how Test Architecture helps us define who does what.
Exploratory Testing and Continuous Delivery
In our drive to automate everything it's easy to forget that we still NEED to do Exploratory Testing. We need to understand how to do Exploratory Testing without doing Manual Release Testing. The two are vastly different.
Some words on Tooling
A lot of the time, discussions start with tools. This is probably the best way to fail in your test automation efforts.
Areas Not Covered
Before we are done we need to take a quick look at Test Environments, Testing Non-Functional Requirements, A/B Testing and the Power of Metrics. (This section is expanded in the practitioner-level talk)
Tuesday, April 1, 2014
Upcoming talks
I've been given the honor of speaking at two fantastic conferences this spring.
The first one is PipelineConf, 8th of April in London, where I will talk about the people side of Continuous Delivery. This is the talk I've given at the Netlight EDGE and JDays conferences, though it has been updated with the experiences from the last 6-8 months of working with Continuous Delivery.
The second one is GeeCon, 14th-16th of May in Krakow, Poland, where I will be speaking about Scaling Continuous Delivery. This is a new talk that focuses on lessons learned from our journey to scale continuous delivery from a team of 5 to an organization of hundreds.
If you are interested in hearing me speak at a conference, seminar or workshop, don't hesitate to contact me.
Monday, March 17, 2014
Portability
I've talked about portability of the CD process before, but it continuously becomes more and more evident to us how important it is. The closer the CD process comes to the developer, the higher the understanding of the process. Our increase in portability has gone through stages.
Initially we deployed locally in a way that was totally different from the way we deployed in the continuous delivery process. Our desktop development environments were not part of our CD process at all. Our deploy scripts handle stopping and starting of servers, moving artifacts onto the server, linking directories and running Liquibase to upgrade/migrate the database. We did all this manually in the local environments. We ran Liquibase, but we ran it using the Maven plugin (which we don't do in our deploy scripts; there we run it using java -jar). We moved artifacts by hand or with other scripts.
Then we created a local bootstrap script which executed the CD process deploy scripts on a local environment. We built environment-specific support into the local bootstrap so that we supported Linux and Windows. Though in order to start JBoss and Mule we needed to add support for the local environment in the CD process deploy script as well. We moved closer to portability, but we diluted our code and increased our complexity. Still, this was an improvement, but the process was not yet truly portable.
More recently we have decided to shift our packaging of artifacts from zip files to RPMs. All our prod and test environments are Red Hat, so the dependency on that technology is not really an issue for us here. What this gives us is the ability to manage dependencies between artifacts and infrastructure in a nice way. The war file depends on a JBoss version, which depends on a Java version, and all are installed when needed. This also finally gives us a clear separation between install and deploy. The yum installer installs files on the server; our deploy application brings the runtime online, configures it and moves the artifacts into the runtime.
In order for us to maintain portability to the development environment, this finally forced us to go all in and make the decision that "development is done in a Linux environment". We won't be moving to Linux clients, but our local deploy target will be a virtual Linux box. This finally puts everything into place for us, creating a fully portable model. It's important to understand that we still don't have a cloud environment in our company.
This image, created by my colleague Mikael, is a great visualization of how much portability we can build into our environment now and when we get a cloud. By defining a portability level and its interface we manage to build a mini cloud on each Jenkins slave and on each local dev machine, using exactly the same process as we would for a QA or test deploy. The nodes above the portability level can be local on the workstation/Jenkins slave or remote in a Prod Environment. The process is the same regardless of environment: Provision, Install and Deploy.
Friday, February 21, 2014
Scaling Continuous Delivery
It's been a while since I posted. The main reason is that we have been very focused on our main deliveries and feature development for the last six months. Whenever the feature train hits central station, it's always work such as build, release and test automation that gets hit first.
There are upsides, though, to not touching your Continuous Delivery process for a few months. If you just keep working on your backlog you don't get time to analyze the impact of the changes you just made. Several times we have realized that the number two or three items in the backlog dropped significantly in priority once we had fixed the most important issue, while others rose fast in priority.
Now we have had time to analyse a lot of new issues, and it's time for us to pick up the pace again.
Scaling the Organization
The good thing, the awesome thing(!), is that during these six or so months our organization has changed and we have actually been able to create a line organization that owns and takes responsibility for the continuous delivery process.
One of the major bottlenecks we found in our process was our platform/tools team. The team was small, and resources in that team were always the first to go when feature pressure increased. The team became just another "IT function" that didn't have time to be proactive due to all the reactive support work it had to do.
There were a few reasons behind this. First, it was the way the team had worked in the past: it actually built the pipes and processes for all the teams by hand, tailored to the custom needs of each team. On some teams there were individuals who picked up the work and kept on configuring the Jenkins jobs to tailor them even more, but on some teams there was no interest whatsoever and their jobs degraded.
The result of this was that no one really knew how the pipes looked or how they should look. Introducing process change was horribly slow, as it was all manual and dependent on the platform/tools team.
One of the first changes we made was to increase the bandwidth of the team and reduce the dependency on that team. @TobiasPalmborg provided a great solution for this over a chat this summer: instead of the platform/tools team supporting the development teams, the development teams put resources into the platform/tools team. Each team was invited to add a 50% resource on a voluntary basis. This way the real-life issues got much better attention in the platform/tools team, and competence about the Continuous Delivery process spread in a much more organic way.
This did not eliminate the bottleneck organization, but it gave us bandwidth to change the way we work and, long term, gave us the ability to scale with the number of teams that use the process.
Scaling the Process
The main reason we were a bottleneck was the way we worked. We preached Automate Everything, Test Everything, If it's hard do it more often, etc., but when it came to the Continuous Delivery process itself we didn't do what we were teaching.
We had ONE Jenkins environment, so all changes happened directly in production. Testing plugins and new configurations in a production environment isn't really the way to deliver stability, reliability and performance.
Manually created Jenkins pipes aren't really a way to create a sustainable pace and continuous improvement.
Developing deploy scripts without explicit unit tests isn't really a good way of creating a stable process. We have been priding ourselves on our deployment being tested hundreds of times before a production deploy, which was true but very dumb. Implicit testing means that someone else takes the pain for my mistakes. Deployment scripts are applications and need to be treated as first-class citizens.
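Deploy logic written in Groovy can be unit tested like any other code. Here is a minimal sketch of what such an explicit test could look like with Spock; the DeployPaths class and its rules are invented for illustration and are not our real deploy scripts:
// A sketch of treating a deploy step as a first-class, unit-tested citizen.
// DeployPaths and its symlink path logic are invented for this example.
import spock.lang.Specification

class DeployPaths {
    /** Resolve the directory a release should be linked to, e.g. /apps/some-service/releases/1.2.3 */
    static String releaseDir(String baseDir, String artifact, String version) {
        if (!version) {
            throw new IllegalArgumentException('version must not be empty')
        }
        return "${baseDir}/${artifact}/releases/${version}"
    }
}

class DeployPathsSpec extends Specification {

    def "release directory is built from base dir, artifact and version"() {
        expect:
        DeployPaths.releaseDir('/apps', 'some-service', '1.2.3') == '/apps/some-service/releases/1.2.3'
    }

    def "an empty version is rejected instead of silently linking the wrong release"() {
        when:
        DeployPaths.releaseDir('/apps', 'some-service', '')

        then:
        thrown(IllegalArgumentException)
    }
}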
This had to change.
The first thing we did was to use the extra bandwidth we had obtained to build a totally new way of delivering continuous delivery. Automate everything, obvious, huh?
We also decided to deliver a continuous delivery environment per development team and not have them all in one environment. So we started with automating the provisioning of Jenkins and test environments. We don't have a cloud solution in our company at this time, so we have a fake cloud that we work with, which is a huge pool of virtual servers. This pool we provision and maintain using Chef.
The second thing was to automate the build pipe setup. We built ourselves a simple little pipe generator with predefined pipe templates in 5-8 different layouts to support the different needs. We actually managed to get the development teams to adjust to a stricter Maven project naming convention in order to use the generated pipes, as everyone saw the benefits of this.
The pipes we have are basically typed by what they build, libraries or deployable components, and by how they are tested, as we still need to initiate our Fitnesse tests a bit differently from our other tests.
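The real generator is more involved, but the core idea fits in a few lines: the Maven artifactId, via the naming convention, decides which template a component's pipe is generated from. The suffixes and template names below are invented for illustration:
// Rough sketch of the pipe-generator idea: the maven artifactId, via the naming
// convention, decides which pipe template a component gets. Names are invented.
def templateFor = { String artifactId ->
    switch (artifactId) {
        case ~/.*-lib$/              : return 'library-pipe'           // build + publish, no deploy
        case ~/.*-service-fitnesse$/ : return 'service-pipe-fitnesse'  // deployable, Fitnesse-tested
        case ~/.*-service$/          : return 'service-pipe'           // deployable component
        default                      : return 'default-pipe'
    }
}

// Each team lists its components; the generator (re)creates their Jenkins jobs from templates.
['order-service', 'order-service-fitnesse', 'pricing-lib'].each { artifactId ->
    println "generating ${templateFor(artifactId)} for ${artifactId}"
}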
We made it the responsibility of the platform/tools team to develop the pipe templates and the responsibility of the development teams to configure their generator to generate the pipes they needed for their components.
Getting to this stage was a lot of work, and a lot of migration work for all the teams, but the results have been terrific. The support load on the platform/tools team has gone down a lot, and each bug fix is rolled out within minutes to all the pipes.
We have also been able to take on new development teams very easily. Not all teams in our company are ready to do Continuous Delivery, but they are all heading in this direction and we can now provide environments and pipelines that match their maturity.
Summary
We have gone from a process developed as skunkworks to Continuous Delivery as a Service within our organization. We always run into new bottlenecks and challenges; this time the bottleneck was much more us than anything else. I assume that the next big bottleneck is going to be hardware and our inability to deliver on a cloud solution, since we can now roll out to more and more teams. But who knows, I could be wrong; only time will tell.
Labels: Continuous Delivery, Scaling, Service, Stability
Tuesday, June 18, 2013
It's about the people.
Last week I attended QCon New York. A fantastic conference as usual, and it was comforting to see that basically everyone was saying the same thing: "Continuous Delivery is not about the technology, it's about the people". Which also happens to be the title of my talk at Netlight's EDGE conference in September.
In his talk, Steve Smith (@agilestevesmith) talked about how 5% is technology and 95% is organization. While I agree with that, I think the non-technical 95% can be divided into organization, change of role definitions and individual maturity. It's these three that my talk will cover.
Hopefully I will be able to give this talk in Gothenburg as well, as it's been submitted to JDays.