
Sunday, October 2, 2016

Service Simulation rather than Integration Testing of Micro Services

Micro Services or Distributed Monolith

One of the main characteristics of a Micro Service architecture is that each Micro Service has its own lifecycle. This is very important as this is the only way to break the monolith pattern. If the lifecycle of a Micro Service is tied to another entity such as Service, System or Solution then we have a distributed monolith and not a Micro Service architecture.

I've found that almost everyone agrees in theory, but in practice everyone still wants to do integration testing before release. Integration Testing is so ingrained in our DNA; we have been doing it for years and don't really know any other way.

Let's explore the problem of Integration Testing a little bit. Say we have Micro Service X and it is consumed by Micro Services Y and Z. There is no denying there is a dependency between X, Y and Z. Part of a Micro Service architecture is API versioning and backward compatibility. But even with these practices and good test coverage on component X, it's hard to deny that the integration between X, Y and Z can still fail. If we rely on Integration Testing to find these failures, then we create a dependency between the teams developing Y and Z and the team developing X. That means X no longer has a lifecycle of its own.

Hence, if we rely on Integration Testing we have a distributed monolith, which IMHO is the only thing worse than a monolith.

Consumer Contract Testing


http://martinfowler.com/articles/consumerDrivenContracts.html
Martin Fowler talks about Contract Testing and Consumer Driven Contracts, which is a great pattern. In short, it means that the developers of Micro Services Y and Z provide contract tests for Micro Service X. The tests provided by the developers of Y and Z correspond to the parts of the X API they consume. These tests are executed as part of the test suite belonging to service X. This way the developers of X know if they have broken backwards compatibility.


I think this is a great pattern and one that helps developers maintain the right amount of backwards compatibility with a fast feedback mechanism. Still I am not sure it is enough. At least in our organization everyone still wants to Integration Test.

There are still a lot of integration failures that can occur outside the component tests of X. In production-like environments there can be deployments in different networks, there is service discovery, and other factors come into play.

Service Simulation

One solution is to introduce Service Simulation. Instead of implementing test automation we can implement simulator bots. These bots continuously exercise our services. This creates a constant and even load on our solution, driven through our services. The monitoring of our services is used to notify the developers if something failed. No alarms for a given time period after a deploy means our solution still works, the redeployment of our Micro Service didn't break anything, and we can go on to deploy into the next environment.

It's also relatively easy to implement this as part of the continuous delivery build pipe. Deploy into an environment, check the alarms in that environment for five minutes, and if all green then the Micro Service is verified and the Continuous Delivery implementation continues.
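
As an illustration, here is a minimal sketch in Groovy of such a verification step in the pipe. The monitoring URL and the JSON shape of its response are assumptions made up for this example; use whatever alarm API your monitoring stack actually exposes.

def monitoringUrl = 'http://monitoring.internal/alarms?state=ALARM'  // hypothetical alarm API
def deadline = System.currentTimeMillis() + 5 * 60 * 1000            // watch alarms for five minutes

while (System.currentTimeMillis() < deadline) {
    // Expects a JSON list of currently firing alarms; an empty list means all green.
    def activeAlarms = new groovy.json.JsonSlurper().parse(new URL(monitoringUrl))
    if (activeAlarms) {
        println "Alarms triggered after deploy: ${activeAlarms}"
        System.exit(1)   // fail the build pipe, do not promote to the next environment
    }
    sleep(30 * 1000)     // poll every 30 seconds
}
println 'No alarms for five minutes - Micro Service verified, continue the pipe.'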

Though it's not necessary to have an advanced Continuous Delivery engine for this pattern to work; just alarm notifications to team channels on Slack are powerful by themselves.

Another important thing this solves is that developers and testers start using production-grade monitoring to verify services. I think it's a very big obstacle in a DevOps transformation that developers and testers rely on test reports to understand a system, while these aren't available in production. Developers and testers are usually lost when it comes to production systems. This moves their understanding towards the runtime operation of a system, which bridges the gap between Dev and Ops in a nice way.

Bots can be deployed into any environment, providing the same deploy-and-verify process in all environments. Personally I like the idea of having a bot-workload-only environment as the first full-featured environment. The now obsolete Integration Testing environment could easily be converted into a bot-only environment.

While a bot can basically be implemented in a number of ways, I like the idea of using Function as a Service, such as AWS Lambda, to implement Service Simulation Bots. The scheduled nature and high specialization of a bot make it a perfect FaaS workload.
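
As a sketch of what such a bot could look like, here is a plain Groovy version. The service endpoint, the Graphite host and the metric names are made-up assumptions; on AWS Lambda the same body would live inside a scheduled handler instead of a standalone script.

// Service Simulation Bot sketch: exercise one flow and emit metrics.
// The ordinary monitoring and alarms decide whether the result is OK.
def serviceUrl = 'http://some-service.internal'        // hypothetical service endpoint

def start = System.currentTimeMillis()
def status = new URL("${serviceUrl}/ping/me").openConnection().responseCode
def elapsed = System.currentTimeMillis() - start
def epoch = (System.currentTimeMillis() / 1000) as long

// Push Graphite plaintext metrics; alarms on these metrics notify the team.
new Socket('graphite.internal', 2003).withStreams { input, output ->
    output << "bots.some_service.ping.status ${status} ${epoch}\n"
    output << "bots.some_service.ping.ms ${elapsed} ${epoch}\n"
}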

Test Pyramid



  • Simulation - In all environments, bots exercising the services; monitoring and alarms verify the application.
  • Component & Contract Tests - Deployed functional tests using mocks to stub out consumed services. Contract tests provided by the developers consuming this service.
  • Unit Tests - Well nothing new there.


I think this is what we need to do in order to release our micro services with high confidence and with lifecycles of their own.

Tuesday, April 22, 2014

Environment Portability

I've talked about this a lot before and we have done a lot of work in this area, but it cannot be stressed enough how important it is. In fact I think portability is the key success factor in building a good Continuous Delivery implementation. Without portability there is no scalability.

There are two things that need to be built with portability in mind: the build pipe itself and our dev, test and prod environments. If the build pipe isn't portable then it will become a bottleneck. In fact, if the build pipe can run on each individual developer machine without the use of a build server, then it is portable enough to scale in a build server environment.

Though in this post I will focus on Environment Portability from Desktop to Production.

Why do we need Portable Production Environments?

For years we have accepted the fact that the Test Environment isn't really like the Prod Environment. We know that there are differences, but we live with it because it's too hard and too expensive to do anything about it. When doing Continuous Delivery it becomes very hard to live with and accept.

"If it's hard, do it more often" applies here as well.

The types of problems we run into as a result of non-portable Production Environments are problems that are hard by nature: scalability, clustering, failover, etc. These are non-functional requirements that need to be designed in early. By exposing the complexity of the production environment late in the pipe we create a very long feedback loop for the developers. It can be days or weeks depending on how often you hit prod with your deployments.


By increasing the portability of the production environment we increase the productivity of our developers and the quality of our application. This is obvious, but there is another very important issue to deal with as well. Every time a deployment in UAT or Prod fails it undermines the credibility of Continuous Delivery. Each time something that worked in UAT fails in Prod, the person responsible for UAT will call for more manual tests before production, obviously increasing the lead time even further and making the problem worse. We need to constantly improve in order to manage that fear.

If we have Portable Production Environments then the issues that stem from Environment Complexity will never hit production as they get caught much earlier in the pipe.

Who owns the definition of our Environments?

The different environments that we have in an organization: who defines them and who owns them? There are a lot of variations to this, as there are a lot of variations in environment definitions and organizations. In the typical dysfunctional organization out there, the Ops team owns the Production environments and often the machines in the earlier environments. How the earlier environments are set up and used is often the responsibility of the Dev team, and if the organization is even more dysfunctional then there is often a Delivery Team involved which defines how its environments are used.

This presents us with the problem that developers quite often have no clue what the production environment looks like and implement the system based on assumptions or ignorance. In the few cases where they actually have to implement something that has to do with scaling, clustering or failover, it's often just guesswork as they don't have a way to reproduce it.

Going into production, this most often creates late requests for infrastructural changes and often even cases where solutions were based on assumptions that cannot be realized in a production environment.

What is portability?

When we talk about Portable Production Environments, what do we really mean? Does it mean that we have to move all our development to online servers and all dev teams get their own servers that are identical to production but at smaller scale? Well, not really. It is doable, especially with the use of cloud providers, but I find it highly inconvenient to force the developer to be online and on a corporate network in order to develop. Having the ability to create a server environment for a developer on a need-to-have basis is great, because it covers the gap where local environments cannot be fully portable.

Assuming one cloud provider will solve all your needs for portability across all environments is not realistic in the short term in most enterprise organizations, unless you are already fully committed and fully rolled out to a cloud provider. There are always these legacy environments that need to be dealt with.

I think the key is to have portability of the topology of the environment. An environment built in AWS with ELBs will never be fully portable, since you will never have an ELB locally. But having a load balancer in your dev environment and having multiple application nodes forces you to build for horizontal scalability, and it will capture a whole lot more than just running on one local node.

Running Oracle XE isn't really the same as running Enterprise Oracle, but it provides good enough portability. Firing it up in a VirtualBox VM of its own forces the DB away from localhost.

In our production environment we monitor our applications using things like Graphite and Logstash+Kibana. These tools should be part of the development environment as well, to increase awareness of runtime and enable "design for runtime".

Creating the development environment described here with Vagrant is super easy. It can be built using other tools as well, such as Docker or a combo of Vagrant and Docker. But if you use Docker then it needs to be used all the way. My example here with just Vagrant and VirtualBox is to show that portability can be achieved without introducing new elements to your production environment.

Individual Environment Specifications and One Specification to rule them all.

To create a portable topology we need one way to describe our environments and then a way to scale them. A common human- and machine-readable Topology Specification, defining what clusters, stores, gateways and integrations there are in the topology of our production environment, gives us the ability to share the definition of our environments. Then an environment-specific specification defines the scale of the environment and any possible mocks for integrations in development and test environments.

In an enterprise organisation we will always have our legacy environments. Over time these might migrate over to a cloud solution, but some will most likely live out their existence in a legacy environment. For these solutions we still benefit hugely from portable production environments and one way to define them. In a cloud environment we can recreate their topology and leverage the benefits of portability, even if we cannot really benefit from the cloud in the production environment itself.

For these solutions the Topology Specification and Environment Specification can be used to generate documentation; a little bit of Graphviz can do wonders here. This documentation can be used as a change request contract and as reference documentation. A small sketch of such a generator follows after the example specification below.


We like Groovy scripts for our Infrastructure as Code. Here is an example of how the above example with the cluster and the Oracle database could be defined using the Topology Specification. This example covers just clusters, storages and network rules, but more can be added, such as HTTP gateway fronts as 'gateways' and pure network integrations with partners as 'integrations'.


def topologySpec = [name:'SomeService'
   ,clusters:[[name:'SomeServiceCluster'
           ,image_prefix:'some-service'
           ,cluster_port:80
           ,node_port:8080
           ,health_check_uri:'/ping/me'
           ,networks:[[
                   name:'SomeServiceInternal'
                   ,allow_inbound:[
                       [from:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
                   ],allow_outbound:[
                        [to:'0.0.0.0',ports:'80',protocol:'TCP']
                       ,[to:'OracleInternal',ports:'1521',protocol:'TCP']
                   ]
               ]
           ]
       ]
   ],storages:[[name:'Rdbms'
           ,type:'oracle'
           ,network:[
               name:'OracleInternal'
               ,allow:[
                   [from:'SomeServiceInternal',ports:'1521',protocol:'TCP']
                   ,[from:'192.168.16.0/24',ports:'22',protocol:'SSH']
               ]
           ]
       ]
   ]
]
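
Coming back to the documentation generation mentioned above: here is a small sketch of how a topologySpec like this could be rendered as a Graphviz DOT graph. The shapes and the choice of what to draw are just one possibility.

// Sketch: render the topologySpec above as a Graphviz digraph.
// Pipe the output through 'dot -Tpng' to get a topology diagram.
def toDot = { spec ->
    def lines = ["digraph \"${spec.name}\" {"]
    spec.clusters.each { cluster ->
        lines << "  \"${cluster.name}\" [shape=box]"
        cluster.networks.each { net ->
            // Draw an edge for every outbound rule that names a target.
            net.allow_outbound.findAll { it.to }.each { rule ->
                lines << "  \"${cluster.name}\" -> \"${rule.to}\" [label=\"${rule.ports}/${rule.protocol}\"]"
            }
        }
    }
    spec.storages.each { storage ->
        lines << "  \"${storage.network.name}\" [shape=cylinder, label=\"${storage.name} (${storage.type})\"]"
    }
    lines << '}'
    lines.join('\n')
}

println toDot(topologySpec)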



Environment Specifications contain the scaling of the Topology, but can also contain integration mocks as 'clusters' defined just for that environment.

def devEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:2
           ,node_size:nodeSize.SMALL
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.SMALL
       ]
   ]
]


def prodEnvSpec = [name:'SomeService'
   , clusters:[[name:'SomeServiceCluster'
           ,cluster_size:3
           ,node_size:nodeSize.MEDIUM
       ]
   ]
   ,storages:[
       [name:'Rdbms'
           ,cluster_size:1
           ,node_size:nodeSize.LARGE
       ]
   ]
]
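
The node_size values refer to a sizing enum that isn't shown in this post; for the sketch a minimal hypothetical definition is enough. The mapping to actual VM sizes or instance types would live in the Provisioner.

// Hypothetical sizing enum referenced by the Environment Specifications.
enum nodeSize { SMALL, MEDIUM, LARGE }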


Then the definitions are pushed to the Provisioner implementation with an input argument saying which environment to provision.

def envSpecs = ['DEV':devEnvSpec
              ,'PROD':prodEnvSpec
              ,'LEGACY':prodEnvSpec]


//args[0] is env name 'DEV', 'PROD' or 'LEGACY'
//Provisioner picks the right implementation (Vagrant, AWS or PDF) for the right environment

new Provisioner().provision(topologySpec, envSpecs, args)
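
The Provisioner itself is out of scope for this post, but as a hypothetical skeleton the dispatch on environment name could look like this. The backends here are just placeholders that print what they would do; real implementations would drive Vagrant, the AWS APIs or a documentation generator.

// Hypothetical Provisioner skeleton: pick a backend per environment name.
class Provisioner {
    def backends = [
        'DEV'   : { topo, env -> println "vagrant: building ${topo.name} locally, cluster sizes ${env.clusters*.cluster_size}" },
        'PROD'  : { topo, env -> println "aws: provisioning ${topo.name}" },
        'LEGACY': { topo, env -> println "pdf: generating change request documentation for ${topo.name}" }
    ]

    def provision(Map topologySpec, Map envSpecs, args) {
        def envName = args[0]                           // 'DEV', 'PROD' or 'LEGACY'
        backends[envName](topologySpec, envSpecs[envName])
    }
}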


Test environments should be non-persistent environments that are provisioned and decommissioned when the test execution is finished.

Development environments should be provisioned in the morning and decommissioned at the end of the day. This also solves the issue of building the Dev Environment which can be a tedious manual process in many organisations.

Production environments, on the other hand, need to support provision, update and decommission, as it's not always convenient to build a new environment for each topological change.

Also understand that provisioning an environment is not the same as deploying the application. There can be many deployments into a provisioned environment. The Topology Specification doesn't specify what version of the application is deployed, just what the base name of the artefact is. I find that convenient, as it can be used to identify which image should be used to build the cluster.

One Specification, One Team Ownership

The Topology Specification should be owned by one team: the team that is responsible for developing the system and putting it into runtime. Yes, I do assume that some sort of DevOps-like organisation is in place at this stage. If it isn't, then I would say that the specification should be owned by the Dev team and the generated documentation should be the contract between Dev and Ops. Consolidating ownership of as many environments as possible into one team should be the aim.

Summary

I think using these mechanisms to provision environments in a Continuous Delivery pipe will increase the quality of the software that goes through the pipe immensely. Not only will feedback be faster but we will also be able to start tailoring environments for specific test scenarios. The possibilities of quality increase are enormous.

Wednesday, January 9, 2013

Test for runtime

Traditionally our testers have been responsible for functional testing, load testing and in some cases for some failover testing. This covers our functional requirements and some of our supplemental requirements as well. Though it doesn't cover the full set of supplemental requirements, and we haven't really taken many stabs at automating these in the past.

The fact that we haven't really tested all the supplemental requirements also leaves a big question: whose responsibility is the verification of supplemental requirements? Let's park that question for a little bit. To be truthful, we don't really design for runtime either. Our supplemental requirements almost always come as an afterthought and after the system is in production. They always tend to get lost in the race to get features ready.

In our current project we try to improve on this, but we are still not doing it well enough. We added some of the logging-related requirements early, but we have no requirement spec and no verification of the requirements.

The logging we added was checkpoint logging and performance logging. Both of these are requirements from our operations department. The checkpoint log is a functional log which just contains key events in an integration. It's used by our first line support to do initial investigation. The performance log is for monitoring the performance of defined parts of the system. It's used by operations for monitoring the application.

Let's use user registration as an example (it's a fictitious example).

1. User enters name, username, password and email into a web form.
2. System verifies the form.
3. System calls a legacy system to see if the email is registered in that system as well.
3a. If the user is registered in the legacy system with matching username and password, the user id is returned from that system.
4. System persists user.
5. Email is sent to user.
6. Confirmation view displayed.

From this we can derive some good checkpoints.

2013-01-07 21:30:07:974 | null | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst
2013-01-07 21:30:08:234 | usr123 | User found in legacy system
2013-01-07 21:30:08:567 | usr123 | User persisted
2013-01-07 21:30:08:961 | usr123 | User notified at joe@some.tst

The performance log could look something like this.

2013-01-07 21:30:07:974 | usr123 | Legacy lookup completed | 250 | ms
2013-01-07 21:30:08:566 | usr123 | User persisted | 92 | ms
2013-01-07 21:30:08:961 | usr123 | User registration completed | 976 | ms
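
Producing these lines doesn't need anything fancy. Here is a minimal Groovy sketch of the two log writers; the format follows the examples above, and in a real system they would of course go through the normal logging framework with dedicated appenders.

// Minimal checkpoint and performance log writers matching the format above.
def timestamp = { new Date().format('yyyy-MM-dd HH:mm:ss:SSS') }

def checkpoint = { String userId, String message ->
    println "${timestamp()} | ${userId ?: 'null'} | ${message}"
}

def performance = { String userId, String measuringPoint, long millis ->
    println "${timestamp()} | ${userId} | ${measuringPoint} | ${millis} | ms"
}

checkpoint(null, 'Verified user form name=Ted, username=JohnDoe, email=joe@some.tst')
performance('usr123', 'Legacy lookup completed', 250)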

This is all nice but who decides what checkpoints should be logged? Who verifies it?

Personally I would like to make the verification the responsibility of the testers, though I've never been in a project where testers have owned the verification of any kind of logging. This logging is in fact not "just" logging but system output, and hence should definitely be verified by the testers. Making this the responsibility of the tester also trains the tester in how the system is monitored in production.

So how can this be tested?

Let's make a pseudo Fitnesse table to describe the test case.

| our functional fixture |
| go to | user registration form |
| enter | name | Ted | username | JohnDoe | email | joe@some.tst |
| verify | status | Registration completed |
| verify | email account | registration mail received |

This is how most functional tests would end. But let's expand the responsibility of the tester to also include the supplemental requirements.

| checkpoint fixture |
| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |
| verify | User found in legacy system |
| verify | User persisted |
| verify | User notified at joe@some.tst |

So now we are verifying that our first line support can see a registration main flow in their tool that imports the checkpoint log. We have also taken responsibility for officially defining how a main flow is logged, and we are regression testing it as part of our continuous delivery process.

That leaves us with the performance log. How should we verify that? How long should it take to register a user? Well, we should have an SLA on each use case. The SLA should define the performance under load, and we should definitely not do load testing as part of our functional tests. But we can ensure that the function can be executed within the SLA. More importantly, we ensure that we CAN monitor the SLA in production.

| performance fixture |
| verify | Legacy lookup completed | sub | 550 | ms |
| verify | User persisted | sub | 100 | ms |
| verify | User registration completed | sub | 1000 | ms |
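
A fixture backing that table only needs to find the latest matching entry in the performance log and compare the measured time with the threshold. Here is a rough Groovy sketch; the log file location and the fixture wiring are assumptions.

// Rough performance fixture sketch: read the performance log written during the
// test run and check that the measuring point stayed below the threshold.
class PerformanceFixture {
    File performanceLog = new File('target/logs/performance.log')   // assumed location

    boolean verifySub(String measuringPoint, long thresholdMs) {
        def line = performanceLog.readLines().reverse().find { it.contains("| ${measuringPoint} |") }
        if (!line) return false                         // measuring point never logged
        def millis = line.split(/\|/)[3].trim() as long
        millis < thresholdMs
    }
}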

Now we take responsibility for the system being monitorable in production. We also take responsibility for officially defining what measuring points we support, and since we do continuous regression testing we make sure we don't break the monitorability.

If all our functional test cases look like this then we Test for runtime.

| our functional fixture |
| go to | user registration form |
| enter | name | Ted | username | JohnDoe | email | joe@some.tst |
| verify | status | Registration completed |
| verify | email account | registration mail received |

| checkpoint fixture |
| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |
| verify | User found in legacy system |
| verify | User persisted |
| verify | User notified at joe@some.tst |

| performance fixture |
| verify | Legacy lookup completed | sub | 550 | ms |
| verify | User persisted | sub | 100 | ms |
| verify | User registration completed | sub | 1000 | ms |

Saturday, January 5, 2013

Continuous Delivery and DevOps in a legacy organization

I've been using the term legacy organization. My definition of a legacy organization is a slow-changing organization that separates professions into silos. The slow-changing nature can, but doesn't have to, be due to size. The separation of professions into silos materializes into a process where responsibility is handed over from profession to profession.

I have intentionally put development and test into the same box. In some legacy organizations you see these separated into two silos, where development hands over to a QA department which tests the application. I don't want to say it's impossible to do continuous delivery with that type of setup, because nothing is impossible. It requires the development organization to start taking responsibility for testing. It can be done by smart recruiting of developers with a test focus, but it's going to be hard.

I refer to the above setup as a legacy noDevOps organization because it separates development and operations and suffers heavily from the wall of confusion syndrome, but it is an organization where test driven development is possible. Two of the biggest issues in a legacy noDevOps organization are the gunpoint standoff and dropped responsibility at the wall. The standoff results in unconstructive blame games and a lack of constructive change.

The dropped responsibility comes when development just wants out of responsibility at the point of handoff. Project managers want to close the project. Developers want to do new cool stuff. So development picks a few members who get to run at the wall while the rest hide. At the wall, the mudball of a deliverable is tossed over, hoping that someone on the other side catches it.

A lot of talks and writeups on continuous delivery more or less assume a DevOps organization. It's definitely much easier, since continuous delivery requires the use of the same deployment mechanisms in all environments, which in turn puts a high requirement on similarity in infrastructure. Building a good process without the direct involvement of the infrastructure experts in the operations organization is extremely hard. Doing continuous delivery well requires a higher level of continuous responsibility from the developers. DevOps allows developers to take responsibility in production, which is hard in legacy noDevOps organizations. So yes, obviously continuous delivery is made so much easier with DevOps.

But what should we do? Should we just sit there and wait till a manager calls a meeting and says we are going to start doing DevOps and CD? If that happens, then the DevOps is going to be full of friction, because our professions are still in a gunpoint standoff. So before anything gets done everyone needs to lower their guns and start trusting each other, and this will take time.

It's my firm belief that the standoff is always the "fault" of the development organization. If we had been delivering high enough quality in a stable enough application, there would not have been any standoff and there would have been trust. We can argue all we want that it's not possible to deliver enough stability and quality from a development organization without help and change from operations, but that's beside my point. We can only change our own behavior, and we can only do that by being the change we want to see.

If we want to deploy more often in order to achieve higher quality, then we make sure to hold up our end of the bargain: higher quality and more stable deliveries. We start by taking active responsibility for quality and stability through continuous regression testing. We test our deployments one million times if that's what it takes to make a stable deployment. We improve with each delivery. We take pride in learning from our mistakes and automating tests to ensure they don't happen again. Then over time the trust will increase and the teams will start cooperating more and more.

The development organization is in charge of the full delivery process up to the wall of confusion. So make it the best delivery possible and take pride in delivering high quality out of the development silo. For each successful delivery you bring the wall down one brick at a time.

Also remember that we are talking about continuous delivery, not deployment. It's super important never to speak about continuous deployment, because it scares the living crap out of the ops team in a standoff situation. Though having a deliverable always ready and tested at the wall is going to be appreciated. Then the transition into production can happen with less confusion.

I have to confess I'm one of the developers who hates supporting an application in production. I fear being on call, and once an application is in production I want to change assignment. The reason is how legacy noDevOps organizations go about developer support. Developers have zero trust, so we can't access logs, databases or anything in production. So each time a developer needs to help out with a production issue, it's with tied hands and a blindfold. It ends up becoming a hostage situation where the developer is held hostage. I love to trace down bugs, solve issues and improve stuff, but to be able to do that I need my eyes, my brain and my hands.

We can take charge of this situation as well and stop being victims. We can drastically improve our situation by building monitoring and metrics into our application and verifying them as part of our continuous regression testing. This way we build tools that are going to be available in production and that operations are going to require anyway. Usually these are added late and as low-priority supplemental requirements from operations. By being proactive we can build this into our architecture and test process and use it throughout our entire delivery process. This way we build more useful monitoring tools that we understand much better. In return we aren't as blind and handicapped when helping out with production issues. Once again we take active steps towards cooperation and trust between organizations while making our own life better.

DevOps makes continuous delivery easier. But continuous delivery can be how we drop the guns and tear down the wall of confusion in a legacy organization and move towards DevOps. Ultimately both should exist in an organization, and I think they will both become as common as agile, even in large old organizations.

However, until we are there, I think that continuous delivery is a fantastic tool to enable change in an organization suffering from a deadlock. It requires courage, vision, ambition and patience, but all the tools are there for us to start making that change today!