Continuous Delivery

Tuesday, June 18, 2013

It's about the people.

Last week I attended QCon New York. Fantastic conference as usual, and it was comforting to see that basically everyone was saying the same thing: "Continuous Delivery is not about the technology, it's about the people". Which also happens to be the title of my talk at Netlight's EDGE conference in September.

In his talk Steve Smith (@agilestevesmith) talked about how 5% is technology and 95% is organization. While I agree with that, I think that the non-technical 95% can be divided into organization, change of role definitions and individual maturity. It's these three that my talk will cover.

Hopefully I will be able to give this talk in Gothenburg as well, as it's been submitted to JDays.

Monday, April 8, 2013

Talk at HiQ 24th of April

Continuous Delivery - Enabling Agile.

The key to agile development is a fast feedback loop. Continuous Delivery strives towards always having tested releases in a deliverable state. Continuous Delivery is not just a technical process but a change to the entire organization and the individuals within it. This presentation describes the principles of Continuous Delivery, gives a brief overview of how it can be implemented, and covers how it changes the organization and how it impacts the individuals.

The target audience for this presentation is Developers, Architects, Testers, Scrum Masters, Project Managers and Product Owners, in no particular order. The presentation is not rich in technical detail and is based on real-life experiences.

Please use this post to provide questions and feedback.

Welcome

Tomas

Link to the presentation: http://www.slideshare.net/TomasRiha/continuous-delivery-hi-q

Sunday, February 24, 2013

Architect to re-Architect

We spend so much time trying to make the right decisions. It's one of the downsides of working on a next generation platform. "You better get it right this time!". We have all been there when a current generation solution just doesn't cut it anymore. Implementing that next requirement is going to be so expensive that we might just as well rewrite the whole thing. Thing is they also tried to "get it right this time!".

Why does it "always" go wrong? Why do we always run into dead ends with systems? Sure, not always, but always when an application is exposed to a lot of changes and new requirements.

Select technology then abstract and isolate it in the architecture.

Historically we have put a lot of thought into the selection of technology when we build something new. It's important not to get it wrong, so we think a lot about getting it right. We also think a lot about patterns so that we can replace tech A with tech B if the decision has to be reversed. Who hasn't written hundreds of DAOs so that we can one day change our database? How often do we change database? Historically, well, I have never done it. A change from Oracle to DB2 or whatever other SQL database has never been the reason for a major rewrite. In fact I've been part of more than one rewrite that has thrown out everything but the data layer.

In the future we will see more database changes due to NoSQL, but if and when we do that, do we really want to keep our DAO interfaces? If we do, then we sure aren't going to accomplish much with our rewrite. If we change, then we change because we need to solve a bottleneck problem. In order to solve it we need to make an optimization using a niche product. So we need to write and query our data differently.

The cause of a major rewrite is either lack of scalability or customer requirements that are too hard, too expensive or too high risk to implement. The latter almost always happens when everything has become so interconnected that the change can no longer be done in a safe and isolated way. We need to refactor so much in order to make the change possible that it's cheaper to rewrite.

A distributed system can still be a monolith.

In standard monolithic design we monolithized everything, not just the components of the system but also the data model and the business logic. By normalizing our data model and constantly striving towards decreasing code redundancy we entangle all the services of our application into a huge ball of concrete. It's when we end up with our services entangled in a solid ball of concrete that we need to blow it up, all of it, in order to rewrite it. It doesn't matter how well we modeled our database, how nice our DAOs are or how much inversion of control we use. If we don't treat our services independently we will run into trouble down the road.

Decoupling the monolith into subsystems doesn't necessarily help either. If we still normalize our data and strive towards reusing as much code as possible within the components then all we have done is distributed the monolith. Chances are quite high that you will need to rewrite multiple components when the requirement change appears.

Let's take an example.

We have a training application aimed at running and cycling. We have users, training sessions and races. Training sessions and races are really the same thing: they both contain a number of users, equipment, time, distance and a route. We provide views of user training sessions, user races and race results by race. We sell the application to race organizers and it's free to users. We have an agreement to keep the race results highly available and to keep all history of previous years.

So we have a simple data model with users and sessions in a many-to-many relationship, and a type defining if it's a race or a training session. Simple. Done. Delivered.
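As a rough sketch of that first, shared model (all names here are made up for illustration, and an in-memory list stands in for the database table), it could look something like this in Java:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is only an illustration of the shared model.
enum SessionType { TRAINING, RACE }

// One shared type for both kinds of session.
// userIds models the many-to-many link between users and sessions.
record Session(String id, SessionType type, String route,
               double distanceKm, List<String> userIds) {}

// Stand-in for the single sessions table that every view reads from.
class SessionTable {
    private final List<Session> rows = new ArrayList<>();

    void add(Session session) {
        rows.add(session);
    }

    // Race views and training views hit the same rows, split only by type.
    List<Session> byType(SessionType type) {
        List<Session> result = new ArrayList<>();
        for (Session s : rows) {
            if (s.type() == type) {
                result.add(s);
            }
        }
        return result;
    }
}
```

The point to notice is that races and training sessions share one type and one store, which is exactly what makes the model so quick to deliver.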

Now the application becomes really popular as a training application among users, so we start gaining a lot of data. This data is mostly written, since no one other than the user really cares about it. Though it does impact our race data, since people tend to look at that more.

Someone realizes that all the training data is interesting since we also added a heart rate integration. So we build queries on the training data to provide to medical studies. Sweet extra income that the sales dudes came up with. It's no real issue performance wise as we run them once a year and that's done over Christmas.

Now someone sells our services of race data, training and fitness trending to UCI (cycling union) as a tool for their fight against doping. We just need to add a query to correlate our sweet training reports with race results, how hard can that be? We develop for a sprint or two and go live. So now we get serious tonnage of data and we run our queries more often. *gag* It doesn't work: we can't scale and we can't add a new query without totally killing our SLAs with the other races. We need to rewrite.

Components are not the silver bullet.

Components don't really help us

Having our system distributed into a user repository, session storage and an integration component providing REST services to our GUI component wouldn't help us all that much. Sure, we have separated users and their equipment from the sessions, but it's the queries on the sessions that are the problem, and they are killing our SLAs with the other race organizers.

Design by Services

So what we really need is to move the race result service into a service of its own. We need to isolate it. Even though all the data is identical to the race data by the user. Then we need to separate the integration code for the race organizer service into a service of its own so that we can deploy it separately.

Services do help us

Doing this when hitting the wall is hard, costly and risky. Just the database split is a nightmare if the data has grown big.

If we had done this from the get-go we could just have re-architected the user race and training session service. We could have moved that from our MySQL to a big table database or whatever without affecting our race by organizer service. But doing this upfront feels so awkward: we would have had duplicate tables and redundant code.

Define and isolate services in the architecture.

If we focus on isolating services across our components instead of isolating technology, then we can actually re-architect our bottlenecks. In fact, in our example we could just have added a UCI service that duplicates the other services, and if it ran into performance issues we could just have re-architected it. But that would have forced us to duplicate more upfront and to increase our initial development costs.

Services can be extremely similar and
yet be different services
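A minimal sketch of what isolating by service could look like for the training example, assuming two services that keep deliberately duplicated record types and separate storage. All class and method names are hypothetical, and the in-memory lists only stand in for whatever store each service would really use:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each service owns its own model and its own storage.
// The two record types are nearly identical on purpose; sharing one
// normalized type would couple the services again.
record TrainingSession(String userId, String route, double distanceKm, long seconds) {}

record RaceResult(String raceId, String userId, String route, double distanceKm, long seconds) {}

// Write-heavy service; could later be moved to a different store on its own.
class TrainingSessionService {
    private final List<TrainingSession> store = new ArrayList<>();

    void record(TrainingSession session) {
        store.add(session);
    }

    List<TrainingSession> sessionsFor(String userId) {
        List<TrainingSession> result = new ArrayList<>();
        for (TrainingSession s : store) {
            if (s.userId().equals(userId)) {
                result.add(s);
            }
        }
        return result;
    }
}

// Read-heavy service with its own SLA; it never touches the training store.
class RaceResultService {
    private final List<RaceResult> store = new ArrayList<>();

    void publish(RaceResult result) {
        store.add(result);
    }

    List<RaceResult> resultsFor(String raceId) {
        List<RaceResult> result = new ArrayList<>();
        for (RaceResult r : store) {
            if (r.raceId().equals(raceId)) {
                result.add(r);
            }
        }
        return result;
    }
}
```

The duplication is the price: two almost identical types and two stores. What it buys is that heavy queries against the training data cannot hurt the race results, and either service can be re-architected without the other noticing.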

It's hard to "get it right" when the right can be against everything you have been taught for years. What we must learn to understand better is how we define and isolate services so that we can re-architect our bottlenecks for the services that experience them and not the entire system.

So it took a year.

When we first started building our continuous delivery pipe I had no idea that the biggest challenges would be non-technical. Well, I did expect that we would run into a lot of dev vs ops related issues and that the rest would be just technical issues. I was so naive.

We seriously underestimated how continuous delivery changes the everyday work of each individual involved in the delivery of a software service. It affects everyone: Developer, Tester, PM, CM, DBA and Operations professionals. Really it shouldn't be a big shocker, since it changes the process of how we deliver software. So yes, everyone gets affected.

The transition for our developers took about a year. Just over a year ago we scaled up our development and added give or take 15-20 developers. All these developers have been of very high quality and are very responsible individuals. None of them, though, had worked in a continuous delivery process before and all were more or less new to our business domain.

When introducing them, everyone got the rundown of the continuous delivery process: how it works, why we have it and that they need to make sure to check in quality code. So off you go, make code, check in tested stuff and if something still breaks you fix it. How hard can it be?

Much, much harder than we thought. As I said, all our developers are very responsible individuals. Still, it was a change for them. What once was considered responsible, like "if it compiles and passes unit tests, check it in so that it doesn't get lost", leads to broken builds. Doing this before leaving early on Friday becomes a huge issue because others have to fix the build pipe. But it goes for a lot of things, like having to ensure that database scripts work all the time, everything with the database is versioned, rollbacks work, etc. So everyone has had to step up their game a notch or two.

Continuous delivery really forces the developer to test much more before he/she checks in the code. Even for the developers that like to work test driven with their JUnit tests this is a step up. For many it's a change of behavior. Changing a behavior that has become second nature doesn't happen overnight.

We had a few highly responsible developers that took on this change seamlessly. These individuals had to carry a huge load during this first year. When responsibility was dropped by one individual it was these who always ensured that the pipe was green. This has been the biggest source of frustration. I get angry, frustrated and mad when the lack of responsibility by one individual affects another individual. They get angry and frustrated as well because they don't want to leave it in a bad state and their responsibility prevents them from going home to their families. I'm so happy that we didn't lose any of these individuals during this period.

Now, after about a year, things have actually changed: everyone takes much more responsibility and fixing the build pipe is much more of a shared effort. Which is so nice. But why did it take such a long time? I'd really like to figure out if this transition could have been made smoother and faster.

Key reasons why it took so much time.

A change to behavior.
Developers need to test much more, not just now and then but all the time. No matter how much you talk about "test before check in", "test", "test", "test", the day the feature pressure increases a developer will fall back on second nature behavior and check in what he/she believes is done. We can talk lean, kanban, queues, push and pull all we want, but the fact is there will always be situations of stress. It's not until a behavior change has become second nature that we do it under pressure.

Immature process.
Visibility, portability and scalability issues have made it hard to take responsibility. Knowing when, where and how to take responsibility is super important. Realizing that lack of responsibility is tied to these took us quite some time to figure out. If it's hard to debug a test case, it's going to take a lot of time to figure out why things are failing and it's going to require more senior developers to figure it out. It's also hard to be proactive with testing if the portability between development environment and test environment is bad.

A lot of new things at once.
When you tell a developer about a new system, a new domain and a new process, I'm quite sure the developer will always listen more to the system and domain specific talks.
The developer has a head full of "this system communicates with that system and it's that type of interface". Then I start going on about "Jira, bla bla bla, test bla, checkin bla bla, Jenkins bla, deploy, bla, fitnesse, test bla, bla" and the developer goes "Yeah yeah yeah, I'll check in and it gets tested, I hear you, sweet!".

I definitely think it's much easier for a developer to make the transition if the process is more mature, has optimized feedback loops, scales and is portable. Honestly I think it's easily going to take 3-6 months off the learning curve. But it's still going to take a lot of time, in the range of months, if we don't become better at understanding behavioral changes.

Today we go straight from intro session (slides or whiteboard) to live scenario in one step. Here is the info, now go and use it. At least now we are becoming better at mentoring. So there is help to get so that you can be talked through the process, and the new developer is usually not working alone, which they were a year ago. Still, I don't think it's enough.

Continuous Delivery Training Dojos

I think we really need to start thinking about having training dojos where we learn the process from start to finish. I also think this is extremely important when transitioning to acceptance test driven development. But it's worth it even just for getting a feeling for the process. What is tested where and how, and what happens when I change this and that? How should I test things before committing and what should be done in which order?

I think if we practiced this and worked on how to break and unbreak the process in a non-live scenario the transition would go much faster. In fact I don't think these dojos should be just to train new team members; they would also be an extremely effective way of sharing information and consequences of process change over time.

Monday, February 11, 2013

Talk at ÅF Consult 2013-02-12

On Tuesday the 12th of February I have a talk about Continuous Delivery at ÅF Consult, Gothenburg.

This is the agenda of the day.

Intro to Continuous Delivery
Principles of Continuous Delivery
Look at a Pipe
Impact on Scrum
Feature Driven Development
Impact on Developers and Testers

Participants please use this post for feedback and any questions that you didn't get a chance to ask and would like me to answer.

The slides from the presentation can be found here.

Thursday, February 7, 2013

The world upside down.

Sometimes the world just goes upside down. I talked in a previous post about how continuous delivery and test driven development change the role of the tester. I retract that post. Well, maybe not fully, but let me elaborate.

Our testers are asking HOW should we automate and our developers are asking WHAT are you trying to do.

I've thought that our problem has been that we haven't been able to find testers that know HOW to automate. It's not our problem. Our problem is that we are asking our testers to automate instead of asking them WHAT to test. If a tester figures out what to test then any developer can solve the how to automate with ease.

I still believe that the role of the tester will change over time and that anyone who can answer the what and how question will be the most desired team member in the future. Until then we need to have testers thinking about WHAT and developers thinking about HOW.

Frustrating how such a simple fact can stare us in the eyes for such a long time without us noticing it.

Next time a tester asks how and a developer has to question what, then it's time to stop the line and get everyone into the same room asap!

Tuesday, February 5, 2013

My dear Java what have you done?!?!

What is this diagram? Well, it's a generic lightweight storage component. In previous posts on this blog I've talked about how we have refactored our main components into smaller specialized components and how fruitful that has been for us. It really has, and it's been one of the best architectural steps we have taken, and the components are super clean and very well written. They are very well written based on our requirements and on how we as a community expect Java applications to be built. But they have really made me think: are we actually putting the right requirements on them?

Let's take an example. If our system was a fitness application then we could have four of these components: one for user profiles, one for assets (bikes, shoes, skis, etc.), one for track routes (run, bike, ski routes) and one for training schedules. Small simple components that store and maintain well defined data sets. Then we have a service component that aggregates these data sets into a consumer service.

So say that our route track component has the responsibility of storing a set of GPS track points and classifying them as run, bike or ski. How hard can it be? Well look at the diagram.

First of all, what is the value of the component? It's providing a set of track points. There could be some rules on what data is required and how missing data or irregular data is handled, since GPS can be out of reach. But it's still very simple: receive, validate, store and retrieve. The value is the data. The request explicitly asks for data related to a user and an activity type. So why does the request go through layer upon layer of frameworks and other crap in order to even get to the data? Shouldn't we strive to keep the data as close as possible? Shouldn't we always strive to keep the value as close to the implementation as possible?

I've always been a huge fan of persistence frameworks. I worked back in the good old dot com era when everyone and his granny wanted to get into IT and become programmers. I've seen these developers trying to work directly with JDBC and SQL. The mess they created, and the mess much better developers created even after the dot com era, has made me believe that abstracting away the database through persistence frameworks is a must. The gain on the 90% of vanilla code written by the average Joe heavily outweighs the corner cases where you need a seasoned developer to write the native queries and mappings.

Though I'm starting to believe that JPA and Hibernate have to be the worst thing that has happened to Java since EJB2. Adding a blanket over a problem doesn't make the problem go away. The false notion of not having to care about the database can have catastrophic consequences. Good developers understand this and have learned to understand the mapping between the JPA model and the DB model. They have learned how the framework works and they have learned how HQL maps to SQL. By trying to mitigate the bad code written by the average Joes we have created even more complexity, and developers are required to master not just databases but also a new framework on top of them.

So for us to get our data from our database to our REST interface we create objects that map to a model of another system. Then we write queries in a query language that is super hard to test and experiment with and that still doesn't give us feedback at compile time. These queries and mapped objects get translated into a query language of a remote system. Note that we transform it in the Java application to the language of the other system, which means that we in our system need to have full notion and understanding of the other system. Then we take this translated query and feed it through a straw to that other system. Down in that system the query is executed in the query engine, which we have to understand in our system since what we generate maps directly onto it. Then this query engine executes on the logical data model, which happens to be the core value of our application. The only thing that does make sense is the mapping between logical and physical storage of the data, since this we don't have to care about and it's actually been abstracted out pretty well.

The database engine then feeds us the data back through our straw, where Hibernate maps it back to our Java objects. Then of course we have been good students and listened to the preaching about patterns, so we transform our entity objects into DTOs that we then feed through multiple layers of frameworks so that they can be written to the HTTP response.

To quote a colleague: "I'm so not impressed by the data industry". I totally agree. We have really made a huge mess of things. Not in our delivery, though: for what it is, a delivery on the Java platform with an RDBMS, it's a very well written and solid application.

Am I saying that we should just throw out all the frameworks and go back to using JDBC and Servlets straight off? Well no, obviously (even though I say it when I'm frustrated). The problem needs to be solved at the root cause. JDBC and SQL were equally bad because they were still pushing data through a straw. Queries written in one system and pushed through a straw into another system where they are executed are never ever going to be a good model. Then the fact that the data structure of the database system doesn't match the object structure of the requesting application is another huge issue.

I really think that the query engine and the logical data storage model need to become part of the application and not just mapping frameworks. Some of the NoSQL databases do this, but most of them still work too much like JDBC, where you create your connection to the database engine and then you send it a query. Instead I'd like to just query my data. I want data, not a connection. The connection has to be there, but it should be between the query engine and the persistent store, not between me and the query engine.

Query q = Query.createQuery(TrackPoint.class);
List<TrackPoint> trackpoints = q.where("userId").equals("triha")
.and().where("date")
.between("2012-01-01", "2013-01-30");

This should be it. My object should be queried and logically stored as close to me as possible. This way we would mitigate the problems of JDBC and SQL by removing them, not covering them up with yet another framework. This would give us a development platform easy enough for average Joes and yet not dumbed down to a level that restricts our bright minds.
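To make the calling style a bit more concrete, here is a small sketch of a query that runs in-process over objects the application already holds. It is not a persistence engine and not any existing library; the Query class, its method names and the TrackPoint fields are all assumptions made up for this illustration, and typed getters are used instead of string field names so that the compiler can check them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical data type; dates are ISO-8601 strings, so they sort lexicographically.
record TrackPoint(String userId, String date, double lat, double lon) {}

// Hypothetical in-process query: the caller asks for data, never for a connection.
class Query<T> {
    private final List<T> source;
    private Predicate<T> filter = item -> true;

    private Query(List<T> source) {
        this.source = source;
    }

    static <T> Query<T> over(List<T> source) {
        return new Query<>(source);
    }

    // Keep items whose field equals the given value.
    <V> Query<T> whereEquals(Function<T, V> field, V value) {
        filter = filter.and(item -> value.equals(field.apply(item)));
        return this;
    }

    // Keep items whose field lies between from and to, both inclusive.
    <V extends Comparable<V>> Query<T> whereBetween(Function<T, V> field, V from, V to) {
        filter = filter.and(item -> {
            V v = field.apply(item);
            return v.compareTo(from) >= 0 && v.compareTo(to) <= 0;
        });
        return this;
    }

    // Run the accumulated filter over the source and collect the matches.
    List<T> list() {
        List<T> result = new ArrayList<>();
        for (T item : source) {
            if (filter.test(item)) {
                result.add(item);
            }
        }
        return result;
    }
}
```

With this sketch the earlier query would read Query.over(points).whereEquals(TrackPoint::userId, "triha").whereBetween(TrackPoint::date, "2012-01-01", "2013-01-30").list(). A mistyped field name becomes a compile error, which is the kind of feedback HQL strings never give. Where the list of points comes from is the hard part and is left out on purpose.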

We could use more of our time actually writing code that adds value rather than managing frameworks that are just a necessity and not a value. Our simple components would look something like the diagram to the right. Actually pretty simple.

How hard can it be? Obviously quite hard. The most amazing thing is that we get all this complexity to work. No wonder it's super expensive to build systems.

Sunday, February 3, 2013

Working the trunk

When my colleague Tomas brought up the idea of continuous delivery the first thing that really caught my attention was "we do all work on the trunk". I've always hated branches. I've worked with many different branching strategies and honestly they have all felt wrong.

My main issue has always been that regardless of branch strategy (Release Branches or Feature Branches) it's a lot of double testing, and debugging after a merge is always horrible. It's also hard to have a clear view of a "working system": what is the system, which branch do you refer to? Always having a clean and tested version of the trunk felt very compelling. No double work and a clear notion of "the system"! I'm game!

So we test everything all the time. How hard can it be?

Well, it has proven to be a lot harder than we thought, not to continuously test but to manage everyone's desire to branch. Somehow people just love branches. Developers want their feature branches where they can work in their sandbox. Managers want their branches so that they don't get anything other than that explicit bug fix for their delivery and don't risk impact from anything else.

These are two different core problems: one is about taking responsibility and one is about trust.

Managers don't trust "Jenkins".

Managers don't trust developers but somehow they do trust testers. It's interesting how much more credit a QA manager has when he/she says "I've tested everything" than a blue light on Jenkins. In fact managers have MORE confidence in a manual regression test that has executed "most of the test-cases on the current build" than an automated process which executes "all the test-cases on every build". I think the reasons are twofold: one is that the process is "something that the devs cooked up" and the other is that Jenkins can't look a manager in the eyes. It would be much easier if Jenkins was actually a person who had a formal responsibility in the organisation and could be blamed, shouted at and fired if things went wrong.

It takes a lot of hard work to sell "everything we have promised works as we have promised it". For each new build that we push into user acceptance testing we need to fight the desire to branch the previous release. Each time we have to go through the same discussion.

"I just want my bug fix"
"you get the newest version"
"I don't want the other changes"
"Everything we have promised works as we have promised it"
"How can you guarantee that"
"We run the tests on each check in"
"Doesn't matter I don't want the other changes they can break something, I want you to branch"

I don't know how many times we have had this argument. Interestingly, we have yet to break something in a production deploy as a result of releasing a bug fix from the trunk (and hence including half done features). We have, though, had a failed deploy due to having succumbed to the urge to branch. We made a bad call and gave in to the branch pressure. By doing that we branched but we didn't build a full pipe for the branch, which resulted in us not picking up an incompatible configuration change.

Developers love sandboxes

It's interesting: developers push for more releases and smaller work packages, yet they love their feature branches. I despise feature branches even more than release branches. The reason is that they make it very hard to refactor an application and the merging process is very error prone. The design and implementation of a feature is based on a state of the "working system" which can be totally different from the system it's merged onto. It also breaks all the intentions to do smaller work packages and test them often; a merge is always bigger than a small commit.

The desire to feature branch comes from "all the repo updates we have to do all the time slow us down so much" and "we can't just check stuff in that doesn't work without breaking stuff". The latter isn't just from developers wanting to be irresponsible, it's also from us running SVN and not Git. Developers do want to share code in a simple way. Small packages that two team mates want to share without touching the trunk is a valid concern. So the ability to micro branch would be nice. So yes, I do recommend Git if you can, but it's not a viable option for us. Though I'm quite sure that if we were using Git we would end up having problems related to micro branches turning into stealth feature branches.

I think the complaint "all the repo updates we have to do all the time slow us down so much" is a much more interesting one. In general I think that developers need to adopt more continuous integration patterns in their daily work, but this is actually a scalability issue. If you have too many developers working in the same part of the repo you are gonna get problems. When developers do adopt good continuous integration patterns in their daily work and their productivity drops, then there is an issue. This is one of the reasons why we have seen feature branches in the past.

Distribute development across the repository

When we started building our delivery platform we based it on an industry standard pattern that clearly defines component responsibility. Early on in the life cycle we saw no issues of developers contesting the same repository space as we had just one or two working on each component. But as we scaled we started to see more and more of this. We also saw that some of the components were too widely defined in their responsibility, which made them bloated and hard to understand. So we decided to refactor the main culprit component into several smaller, more well defined components.

The result of this was very good. The less bloated a component is the easier it is to understand and to test, which leads to increased stability. By creating sub components we also spread out the developers across the repository. So we actually created stable contextual sandboxes that are easy to understand and manage.

Obviously we shouldn't just create components to spread out our developers, but I think that if developers start stepping on each other then it's a symptom of either bad architecture or overstaffing. If a component needs so many developers that they are in the way of each other then the chance is quite good that the component does way too much or that management is trying to meet feature demand by just pouring in more developers.

Backwards compatibility

Another key to working on the trunk has been our interface versioning strategy. Since we mostly provide REST services we actually were forced into branching once or twice where we had no other option, and that was due to not being backwards compatible on our interfaces. We couldn't take the trunk into production because the changes were not backwards compatible and our tests had been changed to map the new reality. This is what led to our new interface strategy where we, among other things, never ever change interfaces or payloads, just add new ones and deprecate old ones.

Everything that interfaces the outside world needs to be kept backwards compatible or program management and timing issues will force inevitable branching.
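As a sketch of the "add new, deprecate old" idea in Java: the payload types, field names and URL paths below are invented for illustration and are not our actual interfaces. The old payload keeps its exact shape, the new field arrives in a new payload type, and both are served from the same trunk:

```java
// Hypothetical payloads and paths, illustrating "add new ones and deprecate old ones".

/** Original payload; its shape is frozen once clients depend on it. */
@Deprecated
record RouteV1(String name, double distanceKm) {}

/** New payload added alongside V1 instead of changing V1. */
record RouteV2(String name, double distanceKm, String activityType) {}

class RouteApi {
    // The internal model is free to evolve; here it simply matches the newest payload.
    private final RouteV2 stored = new RouteV2("lake loop", 10.0, "run");

    /** Old endpoint, think GET /v1/route: still served, payload unchanged. */
    @Deprecated
    RouteV1 getRouteV1() {
        return new RouteV1(stored.name(), stored.distanceKm());
    }

    /** New endpoint, think GET /v2/route: carries the added field. */
    RouteV2 getRouteV2() {
        return stored;
    }
}
```

Since the old endpoint still answers with the old payload, the existing tests for it stay valid, and the trunk can go to production even when consumers have not yet moved to the new version.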

Not what I expected

When we first decided to work solely on the trunk I thought it was gonna be all about testing. It's important, but I think people management has been a bigger investment (at least measured in mental energy drain) and the importance of good architecture was underrated.

Tuesday, January 15, 2013

Package power!

We often talk about pipe design and how to implement it in Jenkins or other CI tools, that everything should be versioned and that everything should be tested all the time. These things are very important, but something I didn't realize for quite some time was how important packaging is.

Our packaging was giving us problems.

Early on when building our continuous delivery pipe we were a bit worried about the number of artifacts we were spewing out of our pipe and the impact it would have on our Nexus repo. So we did release our war and jar files into our repo, but the final deliverable assembly we released was just a property file containing versions. These property files were used by our rudimentary bash deploy scripts. The scripts basically did a bunch of wgets to retrieve the artifacts from the Nexus repo before deploying them. Yeah, laugh, I now know how dumb this was.

Our main problem due to this was that our scripts were very delivery specific. For delivery Y we had components A, B and C, while for delivery Z we had components A, D and E. We couldn't reuse things well enough, so we had duplicates of our scripts. Another issue was that there was no portability in this whatsoever. We didn't really make the connection between the lack of packaging and our huge developer environment problems. Switching between working on delivery X and Z was tedious because we were managing the local deployments in Eclipse with the JBoss plugin. It also required full understanding of what components needed to be deployed.

Manual tasks and a required high level of domain knowledge didn't make things easy for our new developers. In fact it also made life a pita for our architects, who develop fewer hours a week than the developers. For them the rotting of the development environment was a huge issue. Since all components were managed manually, all of them had to be updated, built and deployed.

Inspiration and goals.

When my colleague and I were at QCon NY (awesome conf that everyone should try to attend) we listened to talks by Netflix and Etsy. We were totally blown away by two things: Etsy's practice that a new developer should code and deploy a production change on the first day, and Netflix's baking of images instead of releasing wars and ears. These were two of the main things we brought back with us and two things that we keep revisiting as we iterate our process.

Since we don't do continuous deploy we set the goal that a new developer should be able to commit a change that is ready for delivery on the first day. The continuous delivery part of the goal wasn't the problem since we already had that in place. It's the most obvious part of that goal. The next obvious task for us was that we really had to do something about our dev env setup. Then with some thought we realized that this wasn't enough; we needed to do something about our entire onboarding process, with mentoring and the level of knowledge in the team. In order to mentor someone a developer needs to have a good understanding of most tasks in JIRA. At this stage this wasn't the case.

We made the knowledge increase our priority since this was biting us in many ways. I won't go much more into that. Then we tried to prioritize the setup of our developer environment, but doing something about our deploy scripts ended up being a higher priority. This was a very good and honestly lucky decision. We knew how to do our deploy script changes and our production deployments were really more important. But we were also not sure how to do our developer environment changes, so sleeping on it was what we decided, even though our devs were literally screaming in frustration.

Addressing the problems.

The first thing we did when we started to rewrite our scripts was to sort out our packaging once and for all. We killed the property file and started using Maven for everything. We had already been using Maven to release all components and most configurations. But we were not using Maven to package our final deployables and we were not using it to release our deploy scripts. We had already been made very well aware that we had to tie our deploy scripts to our deployable assembly. We changed both these things. We started to release everything and not just version everything. This imho is a very important thing that's not mentioned enough. Blogs, articles and demos talk about versioning everything but not so much about the importance of actually releasing everything and treating each release as an artifact even if it's "just" an httpd conf.

Once we started building these packages and setting our structure it was so clear how Netflix came to the conclusion that they should bake images. The package contains war files, config files, deploy scripts, Liquibase scripts, custom JBoss control scripts, httpd conf, etc. The more we package and the more servers we get in our park, the more things we notice that we need to put into the package. Then it becomes even more obvious, since we take this package and transfer it to tons of servers for different test purposes. Once at the server we run our deploy scripts that copy and link stuff into place on the server. Remind me, why are we doing this over and over? Wouldn't it be better to just do this once, make an image out of it and mount this image on different nodes? Of course it would be, Netflix know what they are talking about! Most importantly it would bring the final missing pieces into the package: JBoss, Java and Linux distributions. That would give us the power to actually roll out and test even OS patches through the same process as any other change. We aren't there yet, but the path is obvious and it's nice to feel that what was once an overwhelming w000t is now a definite possibility.

So through a good packaging strategy we managed to improve and solve our deploy script problems. We now had one script to distribute and deploy them all! This also resulted in much fewer changes to the deploy scripts, which in turn made them more stable. A lot of changes that previously required changes to deployment scripts now just require a change to the packaging, which makes the entire deployment process much more robust.

Portability!

Still though, we hadn't solved our issues with our developer environments. I had the hunch for some time that our packaging could help us. Still it took us some time before we realized that we actually had created an almost fully portable deployment solution. Our increased Maven usage had made us so portable that we could actually just write a simple script that combined the essence of the assembly job and the deployment job of our Jenkins pipe into a local dev env script. By adding "snapshots true" to our Maven version properties update we allowed our assemblies to be built including snapshots. Then we could just use our deploy scripts and voila, our local JBosses and Mule ESBs were deployed with artifacts containing our code changes and most importantly our rebel.xmls, giving us full JRebel power with our production deploy scripts.

Our packaging strategy had made our continuous delivery process portable to our development environment, allowing us to use the same assemble+deploy from local dev env to prod. Our developers now just need to know what assembly to deploy, and they don't need to rebuild all included components, just the ones they are currently working with; the others are added by Maven from the Nexus repo. So now our developers can quickly and easily switch between single component deploys and full deliveries.

Getting closer to our goals.

By adding JBoss & Mule installations to the script we further simplified the setup process for the new developers. We still have a few things we want to add to the script, such as IDE install and initial source code checkout, in order to simplify things further, but that will have to rest a bit since we have other higher priorities. Still we have taken huge steps towards our Etsy inspired goal of having new developers commit a code change on the first day.

It feels like all these levels of improvement have been unlocked by a good packaging strategy!

If there's one thing I would change about the way we have gone about our implementation, it's the packaging. It's easy to say in hindsight, but I'd really try to do it properly off the bat.

Wednesday, January 9, 2013

Test for runtime

Traditionally our testers have been responsible for functional testing, load testing and in some cases some failover testing. This covers our functional requirements and some of our supplemental requirements as well. It doesn't cover the full set of supplemental requirements though, and we haven't really taken many stabs at automating these in the past.

The fact that we haven't really tested all the supplemental requirements also leaves a big question: whose responsibility is verification of supplemental requirements? Let's park that question for a little bit. To be truthful we don't really design for runtime either. Our supplemental requirements almost always come as an afterthought and after the system is in production. They always tend to get lost in the race to get features ready.

In our current project we try to improve on this but we are still not doing it well enough. We added some of the logging related requirements early but we have no requirement spec and no verification of the requirements.

The logging we added was checkpoint logging and performance logging. Both these are requirements from our operations department. The checkpoint log is a functional log which just contains key events in an integration. It's used by our first line support to do initial investigation. The performance log is for monitoring performance of defined parts of the system. It's used by operations for monitoring the application.

Let's use user registration as an example (it's a fictional example).

1. User enters name, username, password and email into a web form.
2. System verifies the form.
3. System calls a legacy system to see if the email is registered in that system as well.
3a. If the user is registered in the legacy system with a matching username and password, the userid is returned from that system.
4. System persists user.
5. Email is sent to user.
6. Confirmation view displayed.

From this we can derive some good checkpoints.

2013-01-07 21:30:07:974 | null | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst
2013-01-07 21:30:08:234 | usr123 | User found in legacy system
2013-01-07 21:30:08:567 | usr123 | User persisted
2013-01-07 21:30:08:961 | usr123 | User notified at joe@some.tst

The performance log could look something like this.

2013-01-07 21:30:07:974 | usr123 | Legacy lookup completed | 250 | ms
2013-01-07 21:30:08:566 | usr123 | User persisted | 92 | ms
2013-01-07 21:30:08:961 | usr123 | User registration completed | 976 | ms
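As a rough illustration of the two formats above, here is a small Python sketch that produces lines with the same layout. The function names are made up for this example and a real application would normally do this through its logging framework; the sketch just pins down the pipe separated fields, including "null" when no userid is known yet.

```python
from datetime import datetime
from typing import Optional


def format_timestamp(ts: datetime) -> str:
    """Format as 'YYYY-MM-DD HH:MM:SS:mmm', matching the example logs."""
    return ts.strftime("%Y-%m-%d %H:%M:%S") + ":%03d" % (ts.microsecond // 1000)


def format_checkpoint(ts: datetime, user_id: Optional[str], message: str) -> str:
    """Checkpoint line: timestamp | userid | message."""
    uid = user_id if user_id is not None else "null"
    return f"{format_timestamp(ts)} | {uid} | {message}"


def format_performance(ts: datetime, user_id: Optional[str], label: str, elapsed_ms: int) -> str:
    """Performance line: timestamp | userid | measuring point | elapsed | ms."""
    uid = user_id if user_id is not None else "null"
    return f"{format_timestamp(ts)} | {uid} | {label} | {elapsed_ms} | ms"
```

Keeping both formats in one place like this also gives the tests a single definition to verify against.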

This is all nice but who decides what checkpoints should be logged? Who verifies it?

Personally I would like to make the verification the responsibility of the testers. Though I've never been in a project where testers have owned the verification of any kind of logging. This logging is in fact not "just" logging but system output, hence should definitely be verified by the testers. By making this the responsibility of the tester it also trains the tester in how the system is monitored in production.

So how can this be tested?

Let's make a pseudo FitNesse table to describe the test case.

| our functional fixture |
| go to | user registration form |
| enter | name | Ted | username | JohnDoe | email | joe@some.tst |
| verify | status | Registration completed |
| verify | email account | registration mail received |

This is how most functional tests would end. But let's expand the responsibility of the tester to also include the supplemental requirements.

| checkpoint fixture |
| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |
| verify | User found in legacy system |
| verify | User persisted |
| verify | User notified at joe@some.tst |

So now we are verifying that our first line support can see a registration main flow in their tool that imports the checkpoint log. We have also taken responsibility for officially defining how a main flow is logged and we are regression testing it as part of our continuous delivery process.
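What a checkpoint fixture like the one above might do behind the scenes can be sketched in a few lines of Python. This is only an assumption about how such a fixture could work, not an actual FitNesse fixture, and the function names are invented. It checks that every expected checkpoint message shows up in the log, in the expected order.

```python
from typing import Iterable, List


def checkpoint_messages(log_lines: Iterable[str]) -> List[str]:
    """Extract the message field from 'timestamp | userid | message' lines."""
    messages = []
    for line in log_lines:
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) == 3:  # silently skip lines that are not checkpoint lines
            messages.append(parts[2])
    return messages


def checkpoints_logged_in_order(log_lines: Iterable[str], expected: Iterable[str]) -> bool:
    """True if every expected message appears in the log, in the given order."""
    remaining = iter(checkpoint_messages(log_lines))
    # Each any() call consumes the iterator up to the match, which enforces ordering.
    return all(any(message == wanted for message in remaining) for wanted in expected)
```

Other checkpoints may appear in between; the check only fails when an expected checkpoint is missing or out of order.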

That leaves us with the performance log. How should we verify that? How long should it take to register a user? Well we should have an SLA on each use case. The SLA should define the performance under load and we should definitely not do load testing as part of our functional tests. But we could ensure that the function can be executed within the SLA. More importantly we ensure that we CAN monitor the SLA in production.

| performance fixture |
| verify | Legacy lookup completed | sub | 550 | ms |
| verify | User persisted | sub | 100 | ms |
| verify | User registration completed | sub | 1000 | ms |

Now we take responsibility that the system is monitorable in production. We also officially define what measuring points we support, and since we do continuous regression testing we make sure we don't break the monitorability.
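The performance fixture can be sketched the same way. Again this is just an illustration with invented names, assuming the performance log format shown earlier and reading "sub 550 ms" as "strictly below 550 ms". A measuring point that is missing from the log counts as a failure, because then we can't monitor it in production either.

```python
from typing import Dict, Iterable, List, Tuple


def parse_performance_line(line: str) -> Tuple[str, int]:
    """Parse 'timestamp | userid | measuring point | elapsed | ms' into (point, elapsed)."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 5 or parts[4] != "ms":
        raise ValueError(f"not a performance log line: {line!r}")
    return parts[2], int(parts[3])


def sla_violations(log_lines: Iterable[str], limits_ms: Dict[str, int]) -> List[str]:
    """Return one message per measuring point that is missing or not below its limit."""
    measured = dict(parse_performance_line(line) for line in log_lines)
    violations = []
    for point, limit in limits_ms.items():
        if point not in measured:
            violations.append(f"{point}: not logged")
        elif measured[point] >= limit:
            violations.append(f"{point}: {measured[point]} ms is not below {limit} ms")
    return violations
```

An empty result means all defined measuring points were logged and stayed under their limits for this single functional run, which is not the same thing as a load test.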

If all our functional test cases look like this then we Test for runtime.

| our functional fixture |
| go to | user registration form |
| enter | name | Ted | username | JohnDoe | email | joe@some.tst |
| verify | status | Registration completed |
| verify | email account | registration mail received |

| checkpoint fixture |
| verify | Verified user form name=Ted, username=JohnDoe, email=joe@some.tst |
| verify | User found in legacy system |
| verify | User persisted |
| verify | User notified at joe@some.tst |

| performance fixture |
| verify | Legacy lookup completed | sub | 550 | ms |
| verify | User persisted | sub | 100 | ms |
| verify | User registration completed | sub | 1000 | ms |

Saturday, January 5, 2013

Continuous Delivery and DevOps in a legacy organization

I've been using the term legacy organization. My definition of a legacy organization is a slow changing organization that separates professions in silos. The slow changing nature can be, but doesn't have to be, due to size. The separation of professions into silos materializes into a process where responsibility is handed over from profession to profession.

I have intentionally put development and test into the same box. In some legacy organizations you see these separated into two silos where development hands over to a QA department which tests the application. I don't want to say it's impossible to do continuous delivery with that type of setup because nothing is impossible. It requires the development organization to start taking responsibility for testing. It can be done by smart recruiting of developers with test focus but it's going to be hard.

I refer to the above setup as a legacy noDevOps organization because it separates development and operations and suffers heavily from the wall of confusion syndrome, but it is an organization where test driven development is possible. Two of the biggest issues in a legacy noDevOps organization are the gunpoint standoff and dropped responsibility at the wall. The standoff results in unconstructive blame games and a lack of constructive change.

The dropped responsibility comes when development just wants out of responsibility at the point of handoff. Project managers want to close the project. Developers want to do new cool stuff. So development picks a few members who get to run at the wall while the rest hide. At the wall the mudball of a deliverable is tossed over, hoping that someone on the other side catches it.

A lot of talks and writeups on continuous delivery more or less assume a DevOps organization. It's definitely much easier, since continuous delivery requires the use of the same deployment mechanisms in all environments, which in turn puts a high requirement on similarity in infrastructure. Building a good process without the direct involvement of the infrastructure experts in the operations organization is extremely hard. Doing continuous delivery well requires a higher level of continuous responsibility by the developers. DevOps allows developers to take responsibility in production, which is hard in legacy noDevOps organizations. So yes, obviously continuous delivery is made so much easier with DevOps.

But what should we do? Should we just sit there and wait till a manager calls a meeting and says we are gonna start doing DevOps and CD? If that happens then the DevOps is gonna be so full of friction because our professions are still at gunpoint standoff. So before anything gets done everyone needs to lower their guns and start trusting each other. This will take time.

It's my firm belief that the standoff is always the "fault" of the development organization. If we had been delivering high enough quality in a stable enough application then there would not have been any standoff and there would have been trust. We can argue all we want that it's not possible to deliver enough stability and quality from a development organization without help and change from operations, but that's beside my point. We can only change our own behavior and we can only do that by being the change we want to see.

If we want to deploy more often in order to achieve higher quality then we make sure to hold up our end of the bargain: higher quality and more stable deliveries. We start by taking active responsibility for quality and stability through continuous regression testing. We test our deployments one million times if that's what it takes to make a stable deployment. We improve with each delivery. We take pride in learning from our mistakes and automating tests to ensure they don't happen again. Then over time the trust will increase and the teams will start cooperating more and more.

The development organization is in charge of the full delivery process up to the wall of confusion. So make it the best delivery possible and take pride in delivering high quality out of the development silo. For each successful delivery you bring the wall down one brick at a time.

Also remember that we are talking about continuous delivery, not deployment. It's super important not to ever speak about continuous deployment because it scares the living crap out of the ops team when in a standoff situation. Though always having a deliverable ready and tested at the wall is always going to be appreciated. Then transition into production can happen with less confusion.

I have to confess I'm one of the developers who hates to support an in production application. I fear being on call, and once an application is in production I want to change assignment. The reason is how legacy noDevOps organizations go about developer support. Developers have zero trust so we can't access logs, databases or anything in production. So each time a developer needs to help out with a production issue it's with tied hands and a blindfold. It ends up becoming a hostage situation where the developer is held hostage. I love to trace down bugs, solve issues and improve stuff, but to be able to do that I need my eyes, my brain and my hands.

We can take charge of this situation as well and stop being victims. We can drastically improve our situation by building monitoring and metrics into our application and verifying them as part of our continuous regression testing. This way we build tools that are gonna be available in production and that operations are gonna require anyway. Usually these are added late and as low priority supplemental requirements from operations. By being proactive we can build this into our architecture and test process and use it throughout our entire delivery process. This way we build more useful monitoring tools that we understand much better. In return we aren't as blind and handicapped when helping out with production issues. Once again we make active steps towards cooperation and trust between organizations while making our own life better.

DevOps makes continuous delivery easier. But continuous delivery can be how we drop the guns and tear down the wall of confusion in a legacy organization and move towards DevOps. Ultimately they should both exist in an organization and I think they will both become as common as agile even in large old organizations.

However, until we are there I think that continuous delivery is a fantastic tool to enable change in an organization suffering from a deadlock. It requires courage, vision, ambition and patience, but all the tools are there for us to start making that change today!

Tuesday, January 1, 2013

Oops, our Continuous Delivery process became mission critical

At some point something changed with our Continuous Delivery process: it became mission critical. When we started working on the process it was basically a side project that another Tomas and I had. We added a consultant early in our project and he ended up doing some of the work on the first version of our deployment scripts, but it wasn't anything organized and not part of any process or tools team.

When we increased the number of developers and started seeing issues with stability and scalability we also started to realize that our process had become mission critical. In fact our continuous delivery process had become more important to us than our mail system.

Now we had a mission critical hobby project with the following setup.

No official Owner.
No official Developers.
No official Operations professionals involved

Operations only supporting the OS of the Jenkins and Nexus instances.

One "live" instance of Jenkins on a super small virtual node.

All development done on live instance.

One "live" instance of Nexus with a very small disk.

All development done on live instance.

Small number of test servers, virtual but not cloud nodes.

Having about 30 developers really depending on a process that is set up like this is obviously a no go.

We started to figure we needed to put more effort into it when we were about to do our first rewrite of our deploy scripts. Still we didn't think in terms of a mission critical production system. We needed a resource and I kept insisting we needed a CM, more on that in an upcoming post. We had architecture and test working together building the application around the process. But we needed some more hands building the deploy scripts and also someone who could help us with the complexity of our system configuration. As I wrote in the entry on deploy scripts this didn't work out well at all, mostly because the CM ended up working alone in a corner of the organization but also because he didn't share our vision of continuous delivery. Between all the discussions trying to get us to implement branching strategies he was writing deployment scripts without any JBoss or DB competence. It was during this script rewrite that we started to realize that our process was mission critical. The new deploy scripts were very unstable and, as mentioned, our tests had stability issues.

Now we started realizing that we had a mission critical system on our hands and that we needed to start treating it as such. Still, this was a bit of an unknown entity in our landscape: operations only supports our office IT and our customer deliveries, while development supports tooling. While this for sure falls into the tooling department, the development organization isn't equipped to support a mission critical system. Still we had to do something about it, so this was when we created our tools team. We refer to it as a platform team as it was intended to own certain components such as logging, help desk, etc. But the main focus was to be continuous delivery. Our lacking development environment was another area of responsibility that we moved to this team, more on that as well in another entry.

The team consisted of our CM, an application DBA, a newly added senior Java developer and myself as architect/lead. It was obvious from the onset how effective it is when you have resources (with a full range of competence) that can focus on the process. This made us much more responsive to bugs in the process and faster in implementing changes.

To this date we still have not solved all the infrastructure issues, but most of them are being worked on by the tools team and a new resource in our operations department who is responsible for our tooling servers. We still don't have a Jenkins test environment, and the operations responsibility for Jenkins and Nexus still isn't really well defined. But we have resources dedicated to the process and when something isn't working we handle it as a bug.

The biggest lesson is that it's really important to get dedicated resources from dev and ops early. Getting two 50% resources is better than one full time, as one isolated resource is a huge bottleneck and has a hard time prioritizing his work. Also make sure to have a bug/enhancement process in place early. Priorities should be made based on user experience, same as with any system in production. Also, as soon as the process is in use by the developers you need a test environment for Jenkins (or whatever build server you use to drive the process), as it's a production system after all.

I think the reason we got a bit blindsided by the process becoming mission critical is that we haven't had anything similar in our landscape before. There is actually one thing that has grown mission critical at about the same rate, hand in hand with our CD process, and that's our JIRA server. In fact we have an even bigger dependency on our JIRA: if it goes down our developers have no clue what to work on and get stranded very quickly. For us this is a new type of mission critical system. Previously such systems have only been supporting systems.

Another reason is that the continuous delivery community talks about how easy it is to get started and how we can just take small baby steps from our nightly build CI. It is both true and the way to go. I just guess I wasn't reading the fine print which says "and then it becomes mission critical".

About Me

I've worked as a Java Developer/Architect for 15 years. I've worked as part of a consulting organization and as part of a line organization.

Over the last 6 years I've had an ever increasing interest in the quality of the delivery. Initially this interest led me to work with automation of system tests, then more and more towards automation of release and deploy processes. Now for the last two years I've focused a lot of my work on the full Continuous Delivery process.

This blog will serve as a collection of lessons learned from my work. Mostly just for myself, but I'm happy to share my experiences if anyone is interested.
