Sunday, October 2, 2016

Service Simulation rather than Integration Testing of Micro Services

Micro Services or Distributed Monolith

One of the main characteristics of a Micro Service architecture is that each Micro Service has its own lifecycle. This is very important, as it is the only way to break the monolith pattern. If the lifecycle of a Micro Service is tied to another entity, such as a Service, System or Solution, then we have a distributed monolith and not a Micro Service architecture.

I've found that almost everyone agrees in theory, but in practice everyone still wants to do integration testing before release. Integration Testing is in our DNA; we have been doing it for years and don't really know any other way.

Let's explore the problem of Integration Testing a little. Say we have Micro Service X, consumed by Micro Services Y and Z. There is no denying that there is a dependency between X, Y and Z. API versioning and backward compatibility are part of a Micro Service architecture, but even with these practices and good test coverage on component X, the integration between X, Y and Z can still fail. If we rely on Integration Testing to find these failures, we create a dependency between the teams developing Y and Z and the team developing X. That means X no longer has a lifecycle of its own.

Hence, if we rely on Integration Testing, we have a distributed monolith, which imho is the only thing worse than a monolith.

Consumer Contract Testing
Martin Fowler talks about Contract Testing and Consumer Driven Contracts, which is a great pattern. In short, it means that the developers of Micro Services Y and Z provide contract tests for Micro Service X. The tests provided by the developers of Y and Z cover the parts of X's API that they consume. These tests are executed as part of the test suite belonging to service X. This way the developers of X know if they have broken backwards compatibility.
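To make the pattern a little more concrete, here is a minimal sketch in Python, assuming X returns JSON-like dictionaries. The contract format (field name mapped to expected type), the `CONTRACTS` registry and the `get_order` stand-in for X are all hypothetical; in practice a dedicated tool such as Pact would usually handle this.

```python
from typing import Any

# Hypothetical contracts contributed by the teams behind Y and Z. Each one
# lists only the fields of X's response that the consumer actually reads,
# mapped to the type it expects.
CONTRACTS: dict[str, dict[str, type]] = {
    "service-y": {"order_id": str, "total": float},
    "service-z": {"order_id": str, "status": str},
}


def get_order(order_id: str) -> dict[str, Any]:
    """Stand-in for the Micro Service X endpoint under test (hypothetical)."""
    return {"order_id": order_id, "total": 42.0, "status": "SHIPPED", "currency": "EUR"}


def verify_contract(response: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return the contract violations; an empty list means the contract holds."""
    violations = []
    for field, expected in contract.items():
        if field not in response:
            violations.append(f"missing field '{field}'")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            violations.append(f"field '{field}' is {actual}, expected {expected.__name__}")
    return violations


def run_contract_tests() -> dict[str, list[str]]:
    """Executed in X's own test suite: check X's response against every consumer contract."""
    response = get_order("o-1")
    return {consumer: verify_contract(response, c) for consumer, c in CONTRACTS.items()}


if __name__ == "__main__":
    for consumer, problems in run_contract_tests().items():
        print(consumer, "OK" if not problems else problems)
```

Because each consumer only declares what it reads, X can add fields such as `currency` freely, while removing or retyping `total` would fail Y's contract in X's own build.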

I think this is a great pattern, and one that helps developers maintain the right amount of backwards compatibility with a fast feedback mechanism. Still, I am not sure it is enough. At least in our organization, everyone still wants to Integration Test.

There are still many ways the integration can fail outside the component tests of X. In production-like environments, services may be deployed in different networks, there is service discovery, and other factors come into play.

Service Simulation

One solution is to introduce Service Simulation. Instead of implementing test automation, we can implement simulator bots. These bots continuously exercise our services, which puts a constant and even load on our solution through its services. The monitoring of our services is used to notify the developers if something fails. No alarms for a given time period after a deploy means our solution still works, the redeployment of our Micro Service didn't break anything, and we can go on to deploy into the next environment.

It's also relatively easy to implement this as part of the continuous delivery build pipeline: deploy into an environment, check the alarms in that environment for five minutes, and if all are green, the Micro Service is verified and the Continuous Delivery pipeline continues.
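The deploy-and-verify step could look roughly like the following sketch. It assumes a hypothetical callable that returns the current alarm states for the environment; the state names mirror CloudWatch's (`OK`, `ALARM`, `INSUFFICIENT_DATA`), but how they are fetched is left open.

```python
import time
from typing import Callable

# Hypothetical alarm source: returns alarm name -> state. The state names
# follow CloudWatch's ("OK", "ALARM", "INSUFFICIENT_DATA"), but how the
# states are fetched is left open.
AlarmSource = Callable[[], dict[str, str]]


def verify_deployment(
    get_alarm_states: AlarmSource,
    window_seconds: float = 300.0,  # the five-minute window from the text
    poll_seconds: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, list[str]]:
    """Watch the environment's alarms after a deploy.

    Returns (True, []) if no alarm fired during the whole window, or
    (False, firing_alarm_names) as soon as one fires, so the pipeline
    can stop promoting the build.
    """
    deadline = clock() + window_seconds
    while True:
        firing = sorted(name for name, state in get_alarm_states().items()
                        if state == "ALARM")
        if firing:
            return False, firing
        if clock() >= deadline:
            return True, []
        sleep(poll_seconds)
```

A pipeline step would call `verify_deployment(fetch_states)`, with `fetch_states` being a hypothetical adapter to the monitoring system, and promote the build only on `True`. The injectable `clock` and `sleep` let the gate be tested without waiting five minutes. The sketch treats `INSUFFICIENT_DATA` as not firing, which is a judgment call; a stricter gate could fail on it as well.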

It's not necessary to have an advanced Continuous Delivery engine for this pattern to work, though; alarm notifications to team channels on Slack are powerful in themselves.

Another important thing this solves is that developers and testers start using production-grade monitoring to verify services. I think it's a very big obstacle in a DevOps transformation that Developers and Testers rely on Test Reports to understand a system, while these aren't available in production. Developers and Testers are usually lost when it comes to production systems. This shifts their understanding towards the runtime operation of a system, which bridges the gap between Dev and Ops in a nice way.

Bots can be deployed into any environment, providing the same deploy-and-verify process in all environments. Personally, I like the idea of having a Bot-workload-only environment as the first full-featured environment. The now obsolete Integration Testing environment could easily be converted into a Bot-only environment.

While a Bot can be implemented in a number of ways, I like the idea of using Function as a Service, such as AWS Lambda, to implement Service Simulation Bots. The scheduled nature and narrow specialization of a Bot make it a perfect FaaS workload.
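A simulation bot as a Lambda function might be sketched like this. The `handler(event, context)` signature is the standard Python Lambda entry point; the service URL, the journey of endpoints and the result format are hypothetical. The bot only generates traffic, and the service's own alarms do the verifying.

```python
import json
import time
import urllib.request
from typing import Callable

# Hypothetical service URL and endpoints the bot walks through.
BASE_URL = "https://orders.example.internal"
JOURNEY = ["/health", "/orders?limit=10", "/orders/o-1"]


def http_get(url: str, timeout: float = 5.0) -> int:
    """GET a URL and return the status code (urlopen raises on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_journey(get: Callable[[str], int], base_url: str = BASE_URL) -> dict:
    """Exercise the service once, the way a simulated user would."""
    failures = []
    started = time.monotonic()
    for path in JOURNEY:
        try:
            status = get(base_url + path)
            if status >= 400:
                failures.append(f"{path} -> {status}")
        except Exception as exc:  # timeouts, connection errors, HTTPError
            failures.append(f"{path} -> {type(exc).__name__}")
    elapsed_ms = round((time.monotonic() - started) * 1000)
    return {"ok": not failures, "failures": failures, "duration_ms": elapsed_ms}


def handler(event, context):
    """Lambda entry point, intended to be triggered by a scheduled rule.

    The bot only generates steady traffic and logs what it saw; detecting
    breakage is left to the service's own monitoring and alarms.
    """
    result = run_journey(http_get)
    print(json.dumps(result))  # stdout ends up in CloudWatch Logs
    return result
```

Passing the HTTP function into `run_journey` keeps the journey logic testable without a network, and the same function can be deployed unchanged into every environment with only `BASE_URL` differing.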

Test Pyramid

  • Simulation - Bots running in all environments, with monitoring and alarms verifying the application.
  • Component & Contract Tests - Deployed functional tests using mocks to stub consumed services, plus Contract Tests provided by the developers of consuming services.
  • Unit Tests - Well nothing new there.

I think this is what we need in order to release our micro services with high confidence and with lifecycles of their own.

Thursday, January 7, 2016

Optimizing Runtime Cost through Continuous Delivery

We have always had the need to understand our application runtime costs. In legacy enterprises this has often been done at a finance level, and someone in the Runtime Operations department has been responsible for that cost. There have been different ways of distributing the costs to the right organization. Either each application has had its own infrastructure and owned its full cost, or applications have shared infrastructure in order to save money, make better use of large servers, or just minimize the setup and maintenance work of the servers.

In many legacy enterprises there has been a huge disconnect between the evolution of applications and the evolution of infrastructure. Application development has moved towards micro services: many small applications with lifecycles of their own. The runtime operations departments of legacy enterprises are largely unable to deliver on these changes due to processes, licenses and hardware leases of huge "enterprise scale servers" bought for the old monoliths. Requests for several servers per micro service have been waved off as unrealistic.

Several ways around this have been invented in order to co-host multiple applications on the same servers in a somewhat isolated way. Docker draws so much attention partly because legacy enterprises can keep their big old servers and isolate the micro services on them. (Of course, that is not the only good thing about Docker.)

Regardless of how this was solved, understanding runtime cost becomes far harder with micro services in legacy enterprise infrastructure.

When transitioning to DevOps and Micro Services, each team is responsible for delivering its services end to end in all environments with the right functionality, the right performance and, imho, the right cost. So what is the right cost? Before even answering that question, let's start with what the cost is. The DevOps team needs to know the cost of its services. In the legacy enterprise that cost might, in the best case, arrive as some kind of cost split formula calculated by a finance guy based on how many micro services share the servers, a cost split of any license costs and a cost split of the man-hours needed to maintain the servers. At best, a team would get this information once a month.

Part of the reason we want DevOps is so that the team has full competence and ability to improve its services. This needs to include the runtime costs of the services. A performance optimization that cuts a service's resource requirements by 25% should result in lower runtime costs for that service. In the legacy world of cost splits and financial formulas, this just doesn't happen.

Thankfully we have cloud providers. Not only do we get the right resources to handle our services when we need them, but we also get billed real money for them.

With auto scaling, elasticity and all the other nice features we get, there is also a risk: we get away with writing increasingly bad applications. Bad performance? Doesn't matter as long as it scales horizontally. We can block on IO to hell and back as long as we scale horizontally, and we do get away with it. Well, as long as we don't have to take responsibility for the cost of our service. This is why it's so important for the DevOps team to take responsibility for the runtime cost of the service.

For a few years I have had a dream of being able to stamp a runtime cost on a load test and correlate the performance measured in the test with the cost of the runtime environment used to run it. We are still not there with our applications, since we run on AWS Auto Scaling Groups and AWS bills us per started hour. This makes the billing data too blunt to give a correlation between performance and cost from a shorter test. With a Micro Service architecture that uses AWS services such as Lambda, DynamoDB and Kinesis, this would actually be achievable today.
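A small worked example shows why per-started-hour billing is too blunt for a short load test. The price below is hypothetical and chosen only to make the arithmetic easy; it is not an actual AWS list price.

```python
import math

# Hypothetical price for illustration only; not an actual AWS list price.
PRICE_PER_INSTANCE_HOUR = 0.10


def billed_per_started_hour(instances: int, minutes: float) -> float:
    """Every started instance-hour is billed in full."""
    return instances * math.ceil(minutes / 60) * PRICE_PER_INSTANCE_HOUR


def usage_based(instances: int, minutes: float) -> float:
    """The same capacity billed only for the time actually used."""
    return instances * (minutes / 60) * PRICE_PER_INSTANCE_HOUR


def cost_per_1000_requests(cost: float, requests: int) -> float:
    return cost / requests * 1000


if __name__ == "__main__":
    # A 10-minute load test on 4 instances serving 120,000 requests.
    for name, bill in (("per started hour", billed_per_started_hour), ("usage based", usage_based)):
        cost = bill(4, 10)
        print(f"{name}: {cost:.4f} total, {cost_per_1000_requests(cost, 120_000):.6f} per 1000 requests")
```

With 4 instances and a 10-minute test, per-started-hour billing charges 0.40 against roughly 0.067 usage-based, six times more, and a 55-minute test would be billed the same 0.40. Any efficiency difference between two builds disappears in the rounding, which is why usage-billed services make the performance-to-cost correlation feasible.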

With this vision in mind, we have been able to integrate the runtime costs of our Micro Services in AWS with our Continuous Delivery as a Service implementation (Delivery Engine) in another way. For us, DevOps Teams are the owners of services and Solutions are the drivers of the cost. So everything that we launch in AWS we tag with the name of the micro service, the owner team name and the cost-driving solution.
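The tagging convention can be enforced at launch time. This sketch assumes hypothetical tag keys (`service`, `owner-team`, `cost-driver`), since the post names the three dimensions but not the keys. The list-of-`Key`/`Value` shape matches what boto3's EC2 `create_tags` expects.

```python
# Hypothetical tag keys; the post names the three dimensions but not the keys.
REQUIRED_TAGS = ("service", "owner-team", "cost-driver")


def build_tags(service: str, owner_team: str, solution: str) -> list[dict[str, str]]:
    """Build tags in the [{'Key': ..., 'Value': ...}] shape used by EC2 create_tags."""
    values = {"service": service, "owner-team": owner_team, "cost-driver": solution}
    return [{"Key": key, "Value": values[key]} for key in REQUIRED_TAGS]


def missing_tags(tags: list[dict[str, str]]) -> list[str]:
    """Return required keys that are absent or empty, so a launch can be refused."""
    present = {tag["Key"] for tag in tags if tag.get("Value")}
    return [key for key in REQUIRED_TAGS if key not in present]
```

Refusing to launch anything for which `missing_tags` is non-empty keeps the cost data complete. Note that user-defined tags only appear in AWS cost reports after they have been activated as cost allocation tags in the billing console.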

This allows our DevOps Teams to see the cost of their Services.

Here we provide a graph of total costs for the services owned by a DevOps team. Below the graph, the services are listed in a top list together with a long-running cost trend for each component. This allows Product Owners and Team Members to act on increasing costs. Links are provided to the Dashboard Page of each service.

This Service Dashboard shows the history of every version built in the Delivery Engine. Our Delivery Engine builds the service, black-box tests the deployed service, load tests it, bakes AWS AMIs for it and launches it into the right environments. Here on the dashboard we visualize the total cost of the service grouped by environment (delivery engine, exploratory testing, QA and prod).
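Both the team view and the per-service dashboard boil down to grouping tagged cost line items. The sketch below assumes daily billing line items that already carry the tags, in a hypothetical shape with `date`, `service`, `owner-team`, `environment` and `cost` fields; it is not the Delivery Engine's actual implementation.

```python
from collections import defaultdict
from typing import Iterable, Mapping, Optional


def group_costs(
    items: Iterable[Mapping],
    keys: tuple[str, ...],
    where: Optional[Mapping[str, str]] = None,
) -> dict[tuple, float]:
    """Sum `cost` per combination of `keys` over items matching every `where` pair."""
    where = where or {}
    totals: dict[tuple, float] = defaultdict(float)
    for item in items:
        if all(item.get(k) == v for k, v in where.items()):
            totals[tuple(item[k] for k in keys)] += item["cost"]
    return dict(totals)


def top_list(totals: Mapping[tuple, float], n: int = 5) -> list[tuple[tuple, float]]:
    """Most expensive groups first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

For a hypothetical team "blue", `group_costs(items, ("date",), {"owner-team": "blue"})` gives the team's daily total for the graph, `top_list(group_costs(items, ("service",), {"owner-team": "blue"}))` gives the top list, `group_costs(items, ("service", "date"), {"owner-team": "blue"})` gives each component's trend, and `group_costs(items, ("environment",), {"service": "orders"})` gives the per-environment breakdown on a service dashboard.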

From the same dashboard we visualize service usage. In the example below, it is CPU consumption across an auto scaling group.

This way we allow the team to ensure that they have the right scaling rules for their services and the right amount of resources in each environment.
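One way a team might turn the CPU view into a sizing hint is sketched below. It assumes per-instance CPU samples in percent and a hypothetical target utilization of 60%, and it relies on a simplification: load spreads evenly and CPU scales linearly with load.

```python
import math
from statistics import mean


def suggest_instance_count(
    cpu_by_instance: dict[str, list[float]],
    target_cpu: float = 60.0,  # hypothetical target utilization in percent
    min_instances: int = 1,
) -> int:
    """Rough right-sizing estimate from CPU samples across an auto scaling group.

    Assumes load spreads evenly over instances and CPU scales linearly with
    load, which is a simplification; treat the result as a hint for tuning
    scaling rules, not as a rule.
    """
    fleet_avg = mean(mean(samples) for samples in cpu_by_instance.values())
    current = len(cpu_by_instance)
    return max(min_instances, math.ceil(current * fleet_avg / target_cpu))
```

Four instances idling at 15% CPU would suggest a single instance, while two instances at 90% would suggest three. In a QA environment the first case points at over-provisioning; in prod it may instead point at a minimum group size set too high.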

By visualizing the cost of the runtime environment for each service and combining it with Continuous Delivery, Continuous Performance Testing and DevOps, we allow our developers to constantly tweak and improve performance and optimize the cost of our runtime environments. The lead time in cost reporting is still at the "next day level", as we report cost per day and per month, but I still think this is a reasonable feedback loop when it comes to cost optimization.