Pages

Monday, December 17, 2012

Test stability

The key to a good Continuous Delivery process is a stable regression suite. There are a few different types of instability on can encounter.

• Lack of robustness
• Random test result
• Fluctuation in execution time

Lack of robustness often comes from incorrectly defined touch points. If a test needs to include a lot of functionality in order to verify a small detail then its bound to lack robustness. Even worse is if a test directly touches implementation.

Adding extra interfaces to create additional Touch Points.
I'm sure everyone has broken unit tests when refactoring without actually breaking any functionality. This almost always happens when unit tests evaluate implementation and not functionality. For example when you just split methods to clean up responsibility.

Tough this doesn't only happen to unit tests it can just as easily, if not easier to functional acceptance tests. If they for instance touch database in validation, instrumentation or data setup you will break your tests every time you refactor your database.

Testers seem to be very keen on verifying against the database. This is understandable since in an old monolith system with manual verification you only have two touch points GUI and db. It's important to work with test architecture and to educate both developers and testers to prevent validation of implementation.

Another culprit is dependencies between implementation and test code. For example Fitnesse fixtures that directly access, model objects, DAOs or business logic implementation.

Tests must be robust enough to survive a refactoring and addition of functionality. Change of functionality is obviously another matter.

We actually managed this quite well with our touch points.

Random test failures are a horrible thing because it will result in either unnecessary stoppage of the line or people loosing respect for the color red.

This is where we have had most our issues. At one point our line was failing 8 of 10 times it was executed and at least 7 of these where false negatives. "Just run em again" was the most commonly heard phrase. Obviously the devs totally lost respect for the color red. The result was that when now and again a true bug caused the failures no on cared to fix it and they started stacking.

Result of our Jenkins Job History for
Functional Tests could look like this.
So why did we have these problems? First of all our application is heavily asynchronous so timing is a huge issue in our tests. Many of our touch points are asynchronous triggers that fire off some stuff in the application and then we wait for another trigger from the application to the test before we validate. We don't really have much architectural room here as its an industry standard pattern. This in it self isn't a big deal except that each asynchronous task schedules other asynchronous tasks. So a lot of things happen at the same time.

Since its desirable to keep execution times short we reconfigure the system at test time to use much shorter timeouts and delays. This further increases the amount of simultaneous stuff that happens for each request.

When all these simultaneous threads hit the same record in one table you get transactional issues. We had solved this through use of optimistic locking. So we had a lot of rollbacks and retries. But it "worked". But our execution times where very unpredictable and since our tests where sensitive to timing they failed randomly.

Really though did we really congest our test scenarios so much that this became a problem? I mean how on earth could we hit the optimistic lock so often that it resulted in 7 out of 10 regressions failing due to it?

Wasn't it actually so that the tests where trying to tell us something? Eventually we found that our get requests where actually creating transactions due to an incorrectly instances empty list, marking an object as new. We also changed our pool sizing as we exhausted it causing a lot of wait.

So we had a lot of bugs that we blamed on the nature of our application.

Listen to your tests!! They speak wise things to you!

Eventually we refactored that table everyone was hitting so that we don't do any updates on it but rather track changes in a detail table. Now we didn't need any locking at all. Sweet free feeling. Almost like dropping in alone on a powder day. ;)

Fluctuation in execution time. I find it just as important that a test executes equally fast every time as it is that it always gives the same result.

First of all a failed test should be faster then a success, give me feed back ASAP! Second if it takes 15 sec to run it should always take 15 sec, not 5 or 25 at times. This is important when building your suites. You want to have your very fast smoke test suites your bang per min short suites and your longer running suites. Being able to easily group tests by time is important.

It's also important to be able to do a trend analysis on the suite execution time. It's a simplistic but effective way to monitor decrease in performance.

We have actually nailed quite a few performance bugs this way.

Asynchronous nature makes everything harder to test but don't make it an excuse for your self. Random behavior is tricky and frustrating in a pipe but remember TRUST YOUR TESTS!

No comments:

Post a Comment