
Thursday, January 7, 2016

Optimizing Runtime Cost through Continuous Delivery

We have always had the need to understand our application runtime costs. In legacy enterprises this has often been done at a finance level, with someone in the Runtime Operations department responsible for that cost. There have been different ways of distributing the costs to the right organization: either each application has had its own infrastructure and owned the full cost of it, or applications have shared infrastructure in order to save money, utilize large servers better or simply minimize the setup and maintenance work of the servers.

In many legacy enterprises there has been a huge disconnect between the application evolution and the infrastructure evolution. Application development has moved towards micro services, with a lot of small applications that have lifecycles of their own. Legacy enterprise runtime operations are largely unable to deliver on these changes due to processes, licenses and hardware leases of huge "enterprise scale servers" for the old monoliths. Requests for several servers per micro service have been waved off as unrealistic.

Several ways around this have been invented in order to cohost multiple applications on the same servers in a somewhat isolated way. Docker draws so much attention partially because legacy enterprises can still keep their big old servers and isolate the micro services on them. (Of course that is not the only good thing about Docker.)

Regardless of how this is solved, understanding runtime cost becomes exponentially harder with micro services in legacy enterprise infrastructure.

When transitioning to DevOps and Micro Services each team is responsible for delivering its services end to end in all environments with the right functionality, the right performance and, imho, at the right cost. So what is the right cost? Before even answering that question let's start with what the cost is. The DevOps team needs to know the cost of its services. In the legacy enterprise that cost might in the best case arrive as some kind of cost split formula calculated by a finance guy based on how many micro services shared the servers, a cost split formula for any license costs and a cost split formula for the man hours to maintain the servers. At best a team would get this information once a month.

Part of the reason we want DevOps is so that the team has the full competence and ability to improve its services. This needs to include the runtime costs of the services. A performance optimization of a service that cuts resource requirements by 25% should result in a lower runtime cost for the service. In the legacy world of cost splits and financial formulas this just doesn't happen.

Thankfully we have cloud providers. Not only do we get the right resources to handle our services when we need them, but we also get billed real money for them.

With auto scaling, elasticity and all the other nice features we get, there is also a risk. We get away with writing increasingly bad applications. Bad performance? Doesn't matter as long as it scales horizontally. We can IO block ourselves to hell and back and still get away with it, as long as we scale horizontally. Well, as long as we don't have to take responsibility for the cost of our service. This is why it's so important for the DevOps team to take responsibility for the runtime cost of the service.

For a few years I have had a dream to be able to stamp a runtime cost on a load test and correlate the performance measured in the test with the cost of the runtime environment used to run the test. We are still not there with our applications, since we run on AWS Auto Scaling Groups and AWS bills us per started hour. This makes the billing data too blunt to give the correlation between performance and cost from a shorter test. With a Micro Service architecture that uses AWS Services such as Lambda, DynamoDB and Kinesis this would actually be achievable today.
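As a sketch of what that correlation could look like for usage-billed services, the cost of a single load test run could be computed directly from the usage it generates. The per-unit prices, usage numbers and throughput below are made-up placeholders, not actual AWS rates:

// Hypothetical per-unit prices and usage counters, purely illustrative (not real AWS pricing)
def unitPrice = [lambdaInvocation: 0.0000002, dynamoWrite: 0.00000125, kinesisPut: 0.000000014]
def loadTestUsage = [lambdaInvocation: 1200000, dynamoWrite: 800000, kinesisPut: 2000000]

// Cost of one load test run = sum over services of (units used * price per unit)
def testRunCost = loadTestUsage.collect { service, count -> count * unitPrice[service] }.sum()

// Correlate with the performance measured in the same run, e.g. average requests per second
def measuredThroughput = 450
println "Cost of test run: \$${testRunCost}, cost per req/s: \$${testRunCost / measuredThroughput}"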

With this vision in mind we have been able to integrate the runtime costs of our Micro Services in AWS with our Continuous Delivery as a Service implementation (Delivery Engine) in another way. For us, DevOps Teams are the owners of services and Solutions are the drivers of the cost. So everything that we launch in AWS is tagged with the name of the micro service, the owning team and the cost-driving solution.
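As a rough sketch, the tag set attached to a launched resource could look something like this (the tag keys and values are illustrative, not our exact naming convention):

// Illustrative tag set attached to every resource launched for one micro service
def costTags = [
    Service    : 'some-service',      // name of the micro service
    Team       : 'team-checkout',     // owning DevOps team
    Solution   : 'online-payments',   // cost driving solution
    Environment: 'prod'               // environment the resource runs in
]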

So this allows our DevOps Teams to see the cost of their Services.



Here we provide a graph of the total cost of the services owned by a DevOps team. The services are listed in a top list below the graph, together with a long-running cost trend for each component. This allows the Product Owners and Team Members to act on increasing costs. Links are provided to the Dashboard Page of each service.

The Service Dashboard provides the history of each version built in the Delivery Engine. Our Delivery Engine builds the service, black box tests the deployed service, load tests it, bakes AWS AMIs for it and launches it into the right environments. On the dashboard we visualize the total cost of the service grouped by environment (delivery engine, exploratory testing, qa and prod environments).
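As a simplified illustration of the grouping behind that view, summing the cost per environment for one service is just a matter of grouping the tagged billing data. The line item structure below is made up, not our actual billing export format:

// Made-up daily billing line items for one service, keyed by the tags described above
def lineItems = [
    [service: 'some-service', environment: 'delivery-engine', cost: 3.20],
    [service: 'some-service', environment: 'qa',              cost: 5.80],
    [service: 'some-service', environment: 'prod',            cost: 42.10],
    [service: 'some-service', environment: 'prod',            cost: 39.70]
]

// Total cost per environment for the service
def costPerEnvironment = lineItems
    .groupBy { it.environment }
    .collectEntries { env, items -> [env, items.sum { it.cost }] }

costPerEnvironment.each { env, total -> println "${env}: \$${total}" }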


From the same dashboard we visualize service usage. In the example below it is a visualization of CPU consumption across an auto scaling group.



This way we allow the team to ensure that it has the right scaling rules for its services and the right amount of resources in each environment.

By visualizing the cost of the runtime environment for each service and combining it with Continuous Delivery, Continuous Performance Testing and DevOps, we allow our developers to constantly tweak and improve performance and optimize the cost of our runtime environments. The lead time of the cost reporting is still at the "next day" level, as we report cost per day and per month, but I still think this is a reasonable feedback loop when it comes to cost optimization.



Tuesday, April 22, 2014

Environment Portability

I've talked about this a lot before and we have done a lot of work in this area, but it cannot be stressed enough how important it is. In fact I think portability is the key success factor in building a good Continuous Delivery implementation. Without portability there is no scalability.

There are two things that need to be built with portability in mind: the build pipe itself and our dev, test and prod environments. If the build pipe isn't portable then it will become a bottleneck. In fact, if the build pipe can run on each individual developer machine without the use of a build server, then it is portable enough to scale in a build server environment.

Though in this post I will focus on Environment Portability from Desktop to Production.

Why do we need Portable Production Environments?

For years we have accepted the fact that the Test Environment isn't really like the Prod Environment. We know that there are differences but we live with it because it's too hard and too expensive to do anything about it. When doing Continuous Delivery it becomes very hard to live with and accept.

"If its hard do it more often" applies here as well.

The types of problems we run into as a result of non-portable Production Environments are problems that are hard by nature: scalability, clustering, failover, etc. These are non-functional requirements that need to be designed in early. By exposing the complexity of the production environment late in the pipe we create a very long feedback loop for the developers. It can be days or weeks depending on how often you hit prod with your deployments.


By increasing the portability of the production environment we increase the productivity of our developers and the quality of our application. This is obvious, but there is another very important issue to deal with as well. Every time a deployment in UAT or Prod fails it undermines the credibility of Continuous Delivery. Each time something that worked in UAT fails in Prod, the person responsible for UAT will call for more manual tests before production, obviously increasing the lead time even further and making the problem worse. So we need to constantly improve in order to manage fear.

If we have Portable Production Environments then the issues that stem from Environment Complexity will never hit production as they get caught much earlier in the pipe.

Who owns the definition of our Environments?

The different environments that we have in an organization: who defines them and who owns them? There are a lot of variations here, as there are a lot of variations in environment definitions and organizations. In most somewhat dysfunctional organizations out there, the Ops team owns the Production environments and often the machines in the earlier environments. How the earlier environments are set up and used is often the responsibility of the Dev team, and if the organization is even more dysfunctional then there is often a Delivery Team involved which defines how its environments are used.

This presents us with the problem that developers quite often have no clue what the production environment looks like and implement the system based on assumptions or ignorance. In the few cases where they actually have to implement something that has to do with scaling, clustering or failover, it's often just guesswork as they don't have a way to reproduce it.

Going into production this most often creates late requests for infrastructure changes, and sometimes even cases where solutions were based on assumptions that cannot be realized in a production environment.

What is portability?

When we talk about Portable Production Environments, what do we really mean? Does it mean that we have to move all our development to online servers, with every dev team getting its own servers that are identical to production but at a smaller scale? Well, not really. It is doable, especially with the use of cloud providers, but I find it highly inconvenient to force the developer to be online and on a corporate network in order to develop. Having the ability to create a server environment for a developer on a need-to-have basis is great, because it covers the gap where local environments cannot be fully portable.

Assuming that one cloud provider will solve all your portability needs across all environments is not realistic in the short term for most enterprise organizations, unless you are already fully committed to and rolled out on a cloud provider. There are always these legacy environments that need to be dealt with.

I think the key is to have portability of the topology of the environment. An environment built in AWS with ELBs will never be fully portable, since you will never have an ELB locally. But having a load balancer in your dev environment and multiple application nodes forces you to build for horizontal scalability, and it will capture a whole lot more than just running on one local node.

Running Oracle XE isn't really the same as running Oracle Enterprise Edition, but it provides good enough portability. Firing it up in a VirtualBox of its own forces the database away from localhost.

In our production environment we monitor our applications using things like Graphite and Logstash+Kibana. These tools should be part of the development environment as well, to increase the awareness of runtime and enable "design for runtime".

Creating the Development Environment described here with Vagrant is super easy. It can be built using other tools as well, such as Docker or a combination of Vagrant and Docker. But if you use Docker then it needs to be used all the way. My example here with just Vagrant and VirtualBox is to show that portability can be achieved without introducing new elements to your production environment.

Individual Environment Specifications and One Specification to rule them all.

To create a portable topology we need one way to describe our environments and then a way to scale them. A common, human and machine readable Topology Specification that defines what clusters, stores, gateways and integrations there are in the topology of our production environment gives us the ability to share the definition of our environments. Then an environment-specific specification defines the scale of the environment and any mocks for integrations in development and test environments.

In an enterprise organisation we will always have our legacy environments. Over time these might migrate to a cloud solution, but some will most likely live out their days in a legacy environment. For these solutions we still benefit hugely from portable production environments and one way to define them. In a cloud environment we can recreate their topology and leverage the benefits of portability, even if we cannot really benefit from the cloud in the production environment itself.

For these solutions the Topology Specification and Environment Specification can be used to generate documentation; a little bit of Graphviz can do wonders here. This documentation can be used for change request contracts and for general documentation purposes.


We like Groovy scripts for our Infrastructure as Code. Here is an example of how the setup above, with the cluster and the Oracle database, could be defined using the Topology Specification. This example covers just clusters, storages and network rules, but more can be added, such as HTTP gateway fronts as 'gateways' and pure network integrations with partners as 'integrations'.


def topologySpec = [name:'SomeService'
   ,clusters:[[name:'SomeServiceCluster'
           ,image_prefix:'some-service'
           ,cluster_port:80
           ,node_port:8080
           ,health_check_uri:'/ping/me'
           ,networks:[[
                   name:'SomeServiceInternal'
                   ,allow_inbound:[
                       [from:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
                   ],allow_outbound:[
                        [to:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[to:'OracleInternal',ports:'1521',protocol:'TCP']
                   ]
               ]
           ]
       ]
   ],storages:[[name:'Rdbms'
           ,type:'oracle'
           ,network:[
               name:'OracleInternal'
               ,allow:[
                   [from:'SomeServiceInternal',ports:'1521',protocol:'TCP']
                   ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
               ]
           ]
       ]
   ]
]
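Since the specification is plain data, generating the documentation mentioned earlier is straightforward. Here is a minimal sketch, assuming the topologySpec map above and Graphviz installed to render the output, that emits a dot graph of the clusters, storages and the network rules between them:

// Minimal sketch: turn the topology specification into a Graphviz dot graph
def toDot = { spec ->
    def lines = ["digraph \"${spec.name}\" {"]
    spec.clusters.each { cluster ->
        lines << "  \"${cluster.name}\" [shape=box];"
        cluster.networks.each { net ->
            // draw an edge for every outbound rule that names a target
            net.allow_outbound.findAll { it.to }.each { rule ->
                lines << "  \"${cluster.name}\" -> \"${rule.to}\" [label=\"${rule.protocol}:${rule.ports}\"];"
            }
        }
    }
    spec.storages.each { storage ->
        lines << "  \"${storage.network.name}\" [shape=ellipse, label=\"${storage.name} (${storage.type})\"];"
    }
    lines << "}"
    lines.join('\n')
}

println toDot(topologySpec)  // feed the output to 'dot -Tpng' to render the diagram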



Environment Specifications contain the scaling of the Topology, but can also contain integration mocks as 'clusters' defined just for that environment.

def devEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:2
           ,node_size:nodeSize.SMALL
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.SMALL
       ]
   ]
]


def prodEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:3
           ,node_size:nodeSize.MEDIUM
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.LARGE
       ]
   ]
]


Then the definitions are pushed to the Provisioner implementation with an input argument specifying which environment to provision.

def envSpecs = ['DEV':devEnvSpec
              ,'PROD':prodEnvSpec
              ,'LEGACY':prodEnvSpec]


//args[0] is the env name: 'DEV', 'PROD' or 'LEGACY'
//The Provisioner picks the right implementation (Vagrant, AWS or PDF) for the environment

new Provisioner().provision(topologySpec, envSpecs, args)
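A minimal sketch of what such a Provisioner could look like. The class names, the environment-to-implementation mapping and the PDF generator for legacy environments are hypothetical illustrations of the dispatch idea, not our actual implementation:

// Hypothetical sketch: one implementation per environment type, picked by environment name
interface EnvironmentProvisioner {
    void provision(Map topologySpec, Map envSpec)
}

class VagrantProvisioner implements EnvironmentProvisioner {
    void provision(Map topologySpec, Map envSpec) { println "Generating Vagrantfile for ${envSpec.name}" }
}

class AwsProvisioner implements EnvironmentProvisioner {
    void provision(Map topologySpec, Map envSpec) { println "Creating AWS resources for ${envSpec.name}" }
}

class PdfProvisioner implements EnvironmentProvisioner {  // legacy: generate documentation instead of servers
    void provision(Map topologySpec, Map envSpec) { println "Generating change request document for ${envSpec.name}" }
}

class Provisioner {
    // which implementation handles which environment is configuration, hardcoded here for brevity
    def implementations = [DEV: new VagrantProvisioner(), PROD: new AwsProvisioner(), LEGACY: new PdfProvisioner()]

    void provision(Map topologySpec, Map envSpecs, String[] args) {
        def envName = args[0]
        implementations[envName].provision(topologySpec, envSpecs[envName])
    }
}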


Test Environments should be non-persistent environments that are provisioned and then decommissioned when the test execution is finished.

Development environments should be provisioned in the morning and decommissioned at the end of the day. This also solves the issue of building the Dev Environment which can be a tedious manual process in many organisations.

Production environments, on the other hand, need to support provision, update and decommission, as it's not always convenient to build a new environment for each topological change.

Also understand that provisioning an environment is not the same as deploying the application. There can be many deployments into a provisioned environment. The Topology Specification doesn't specify what version of the application is deployed, just what the base name of the artefact is. I find that convenient, as it can be used to identify which image should be used to build the cluster.

One Specification, One Team Ownership

The Topology Specification should be owned by one team, the team that is responsible for developing and putting the system into runtime. Yes, I do assume that some sort of DevOps-like organisation is in place at this stage. If it isn't, then I would say that the specification should be owned by the Dev team and the generated documentation should be the contract between Dev and Ops. Consolidating ownership of as many environments as possible into one team should be the aim.

Summary

I think using these mechanisms to provision environments in a Continuous Delivery pipe will increase the quality of the software that goes through the pipe immensely. Not only will feedback be faster, but we will also be able to start tailoring environments for specific test scenarios. The possibilities for quality improvement are enormous.

Sunday, February 24, 2013

So it took a year.

When we first started building our continuous delivery pipe I had no idea that the biggest challenges would be non-technical. Well, I did expect that we would run into a lot of dev vs ops related issues and that the rest would be just technical issues. I was so naive.

We seriously underestimated how continuous delivery changes the everyday work of each individual involved in the delivery of a software service. It affects everyone: Developers, Testers, PMs, CMs, DBAs and Operations professionals. Really, it shouldn't be a big shocker since it changes the process of how we deliver software. So yes, everyone gets affected.

The transition for our developers took about a year. Just over a year ago we scaled up our development and added, give or take, 15-20 developers. All of these developers have been of very high quality and very responsible individuals. Though none of them had worked in a continuous delivery process before, and all were more or less new to our business domain.

When introducing them, everyone got the rundown of the continuous delivery process: how it works, why we have it and that they need to make sure to check in quality code. So off you go: write code, check in tested stuff and if something still breaks you fix it. How hard can it be?

Much, much harder than we thought. As I said, all our developers are very responsible individuals. Still, it was a change for them. What was once considered responsible, like "if it compiles and passes the unit tests, check it in so that it doesn't get lost", leads to broken builds. Doing this before leaving early on a Friday becomes a huge issue, because others have to fix the build pipe. And it goes for a lot of things: having to ensure that database scripts work all the time, that everything with the database is versioned, that rollbacks work, etc. So everyone has had to step up their game a notch or two.

Continuous delivery really forces the developer to test much more before he/she checks in the code. Even for the developers that like to work test driven with their JUnit tests this is a step up. For many it's a change of behavior, and changing a behavior that has become second nature doesn't happen overnight.

We had a few highly responsible developers who took on this change seamlessly. These individuals had to carry a huge load during this first year. When responsibility was dropped by one individual, it was these who always ensured that the pipe was green. This has been the biggest source of frustration. I get angry, frustrated and mad when the lack of responsibility by one individual affects another individual. They get angry and frustrated as well, because they don't want to leave the pipe in a bad state and their sense of responsibility prevents them from going home to their families. I'm so happy that we didn't lose any of these individuals during this period.

Now, after about a year, things have actually changed: everyone takes much more responsibility and fixing the build pipe is much more of a shared effort, which is so nice. But why did it take such a long time? I'd really like to figure out if this transition could have been made smoother and faster.

Key reasons why it took so much time.

A change to behavior.
Developers need to test much more, not just now and then but all the time. No matter how much you talk about "test before check-in", "test", "test", "test", the day the feature pressure increases a developer will fall back on second-nature behavior and check in what he/she believes is done. We can talk lean, kanban, queues, push and pull all we want, but the fact is there will always be situations of stress. It's not until a behavior change has become second nature that we do it under pressure.

Immature process.
Visibility, portability and scalability issues have made it hard to take responsibility. Knowing when, where and how to take responsibility is super important. Realizing that lack of responsibility is tied to these issues took us quite some time. If it's hard to debug a test case, it's going to take a lot of time to figure out why things are failing, and it's going to require more senior developers to do it. It's also hard to be proactive with testing if the portability between the development environment and the test environment is bad.

A lot of new things at once
When you tell a developer about a new system, a new domain and a new process, I'm quite sure the developer will always listen more to the system and domain specific talks.
The developer's head is full of "this system communicates with that system and it's that type of interface". Then I start going on about "Jira, bla bla bla, test bla, check-in bla bla, Jenkins bla, deploy, bla, FitNesse, test bla, bla" and the developer goes "Yeah yeah yeah, I'll check in and it gets tested, I hear you, sweet!".

I definitely think it's much easier for a developer to make the transition if the process is more mature, has optimized feedback loops, scales and is portable. Honestly, I think that could easily take 3-6 months off the learning curve. But it's still going to take a lot of time, in the range of months, if we don't become better at understanding behavioral changes.

Today we go straight from intro session (slides or whiteboard) to live scenario in one step: here is the info, now go and use it. At least now we are becoming better at mentoring, so there is help to get, you can be talked through the process, and a new developer is usually not working alone, which they were a year ago. Still, I don't think it's enough.

Continuous Delivery Training Dojos

I think we really need to start thinking about having training dojos where we learn the process from start to finish. I also think this is extremely important when transitioning to acceptance test driven development, but even just for getting a feeling for the process: what is tested where and how, what happens when I change this and that, how should I test things before committing and what should be done in which order.

I think if we practiced this and worked on how to break and unbreak the process in a non-live scenario, the transition would go much faster. In fact I don't think these dojos should be just for training new team members; they would also be an extremely effective way of sharing information and the consequences of process changes over time.


Thursday, February 7, 2013

The world upside down.

Sometimes the world just goes upside down. I talked in a previous post about how continuous delivery and test driven development changes the role of the tester. I retract that post. Well maybe not fully but let me elaborate.

Our testers are asking HOW should we automate and our developers are asking WHAT are you trying to do.

I've thought that our problem has been that we haven't been able to find testers that know HOW to automate. That's not our problem. Our problem is that we are asking our testers to automate instead of asking them WHAT to test. If a tester figures out what to test, then any developer can solve the how to automate with ease.

I still believe that the role of the tester will change over time and that anyone who can answer both the what and the how question will be the most desired team member in the future. Until then we need to have testers thinking about WHAT and developers thinking about HOW.

Frustrating how such a simple fact can stare us in the eyes for such a long time without us noticing it.

Next time a tester asks how and a developer has to question what, it's time to stop the line and get everyone into the same room asap!

Sunday, February 3, 2013

Working the trunk

When my colleague Tomas brought up the idea of continuous delivery, the first thing that really caught my attention was "we do all work on the trunk". I've always hated branches. I've worked with many different branching strategies and honestly they have all felt wrong.

My main issue has always been that regardless of branch strategy (Release Branches or Feature Branches) there is a lot of double testing, and debugging after a merge is always horrible. It's also hard to have a clear view of a "working system": what is the system, which branch do you refer to? Always having a clean and tested version of the trunk felt very compelling. No double work and a clear notion of "the system"! I'm game!

So we test everything all the time. How hard can it be?

Well, it has proven to be a lot harder than we thought, not to continuously test but to manage everyone's desire to branch. Somehow people just love branches. Developers want their feature branches where they can work in their sandbox. Managers want their branches so that they get nothing other than that explicit bug fix in their delivery and don't risk impact from anything else.

These are two different core problems: one is about taking responsibility and one is about trust.

Managers don't trust "Jenkins".

Managers don't trust developers, but somehow they do trust testers. It's interesting how much more credit a QA manager gets when he/she says "I've tested everything" than a blue light on Jenkins does. In fact, managers have MORE confidence in a manual regression test that has executed "most of the test cases on the current build" than in an automated process which executes "all the test cases on every build". I think the reasons are twofold: one is that the process is "something that the devs cooked up" and the other is that Jenkins can't look a manager in the eyes. It would be much easier if Jenkins was actually a person who had a formal responsibility in the organisation and could be blamed, shouted at and fired if things went wrong.

It takes a lot of hard work to sell "everything we have promised works as we have promised it". For each new build that we push into user acceptance testing we need to fight the desire to branch the previous release. Each time we have to go through the same discussion:

"I just want my bug fix"
"you get the newest version"
"I don't want the other changes"
"Everything we have promised works as we have promised it"
"How can you guarantee that"
"We run the tests on each check in"
"Doesn't matter I don't want the other changes they can break something, I want you to branch"

I don't know how many times we have had this argument. Interestingly, we have yet to break something in a production deploy as a result of releasing a bug fix from the trunk (and hence including half-done features). We have, however, had a failed deploy due to giving in to the urge to branch. We made a bad call and gave in to the branch pressure. We branched, but we didn't build a full pipe for the branch, which resulted in us not picking up an incompatible configuration change.

Developers love sandboxes

It's interesting: developers push for more releases and smaller work packages, yet they love their feature branches. I despise feature branches even more than release branches. The reason is that they make it very hard to refactor an application, and the merging process is very error prone. The design and implementation of a feature is based on a state of the "working system" which can be totally different from the system it's merged onto. It also breaks all the intentions of doing smaller work packages and testing them often; a merge is always bigger than a small commit.

The desire to feature branch comes from "all the repo updates we have to do all the time slow us down so much" and "we can't just check stuff in that doesn't work without breaking stuff". The latter isn't just developers wanting to be irresponsible; it's also a consequence of us running SVN and not Git. Developers do want to share code in a simple way. Small packages that two teammates want to share without touching the trunk are a valid concern, so the ability to micro branch would be nice. So yes, I do recommend Git if you can, but it's not a viable option for us. Though I'm quite sure that if we were using Git we would end up having problems related to micro branches turning into stealth feature branches.

I think the complaint "all the repo updates we have to do all the time slow us down so much" is a much more interesting one. In general I think that developers need to adopt more continuous integration patterns in their daily work, but this is actually a scalability issue. If you have too many developers working in the same part of the repo you are going to get problems. When developers do adopt good continuous integration patterns in their daily work and their productivity still drops, then there is an issue. This is one of the reasons why we have seen feature branches in the past.

Distribute development across the repository

When we started building our delivery platform we based it on an industry standard pattern that clearly defines component responsibility. Early on in the life cycle we saw no issues with developers contesting the same repository space, as we had just one or two working on each component. But as we scaled we started to see more and more of this. We also saw that some of the components were too widely defined in their responsibility, which made them bloated and hard to understand. So we decided to refactor the main culprit into several smaller, more well-defined components.

The result of this was very good. The less bloated a component is, the easier it is to understand and to test, which leads to increased stability. By creating subcomponents we also spread the developers out across the repository. So we actually created stable, contextual sandboxes that are easy to understand and manage.

Obviously we shouldn't just create components to spread out our developers, but I think that if developers start stepping on each other then it's a symptom of either bad architecture or overstaffing. If a component needs so many developers that they are in each other's way, then the chances are quite good that the component does way too much or that management is trying to meet feature demand by just pouring in more developers.

Backwards compatibility

Another key to working on the trunk has been our interface versioning strategy. Since we mostly provide REST services, we were actually forced into branching once or twice when we had no other option, and that was due to not being backwards compatible on our interfaces. We couldn't take the trunk into production because the changes were not backwards compatible and our tests had been changed to map the new reality. This is what led to our new interface strategy where we, among other things, never ever change interfaces or payloads; we just add new ones and deprecate old ones.

Everything that interfaces with the outside world needs to be kept backwards compatible, or program management and timing issues will force inevitable branching.
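A small sketch of the additive idea, with made-up resource names and fields: old versions stay untouched while new ones are added alongside, and the old ones are flagged as deprecated until no consumer calls them.

// Made-up illustration: paths and payloads are never changed in place,
// new versions are added and old ones deprecated.
def api = [
    '/api/v1/orders': [deprecated: true,  fields: ['id', 'amount']],
    '/api/v2/orders': [deprecated: false, fields: ['id', 'amount', 'currency']]
]

// An old consumer still gets exactly the contract it was built against,
// so the trunk stays releasable without branching for interface changes.
assert api['/api/v1/orders'].fields == ['id', 'amount']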

Not what I expected

When we first decided to work solely on the trunk I thought it was going to be all about testing. Testing is important, but I think people management has been a bigger investment (at least measured in mental energy drain), and the importance of good architecture was underrated.