e-testing Blog

How Much Regression Testing is Enough?

by Don Mills

Regression Testing

The question in the title is more commonly asked about testing in general, not just regression testing.  Every tester who’s completed ISTQB Foundation-level training knows that the answer is, “It all depends …”; and if you’ve been trained since about 2010, you’ll know that what it all depends on is the amount of risk in the testing situation, versus constraints on the amount of resource available to deal with the risk.

In an ideal world, we would have enough resource to ensure that risk was completely eliminated—though in an ideal world, some would argue, there’d be no risk to eliminate.  The trouble is, as you’ll also know from Foundation level, total elimination of risk through testing requires “exhaustive testing” of all possible combinations of input values.

The biggest resource constraint there is time.  Picture a tiny application with only one input field, 26 characters wide, for which the only rule is that you must type in the 26 capital letters, once each, none missing, but in any order.  If you had a tool to test the combinations of letters at the rate of 1,000 per second, it would still take around 1.28 × 10^16 years—about 12.8 quadrillion years, which is near enough to a million times the astronomers’ current estimate for the age of the universe[1].
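For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope calculation in Python (assuming, as above, a tool that tries 1,000 orderings per second):

```python
import math

permutations = math.factorial(26)       # 26! ≈ 4.03e26 valid orderings of the letters
per_second = 1_000                      # assumed rate of the test tool
seconds = permutations / per_second
years = seconds / (365.25 * 24 * 3600)  # ≈ 3.156e7 seconds in a year

print(f"{years:.3g} years")             # roughly 1.28e+16 years
```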

Whatever risks were associated with the application’s failing might not be relevant by then.  The universe might not be relevant by then.  And you’d still be a long way from having achieved exhaustive testing.  (Why?).

Because of this inconvenient difficulty, testers work by taking samples from the set of all possible inputs, and assuming that the outcomes are representative of all possible similar values.  Such strategies are called “partitioning strategies”, and are fundamental to all test techniques.
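As a minimal illustration of the idea (the field, its limits, and the chosen values are all invented for the example), an equivalence-partitioning strategy swaps the whole input space for one or two representatives per partition:

```python
# Hypothetical example: an "age" field that accepts whole numbers from 18 to 65.
# Rather than testing every possible value, pick representatives from each
# partition, plus the boundary values.
partitions = {
    "below valid range": [17],           # representative invalid-low value
    "valid range":       [18, 40, 65],   # boundaries plus a mid-range value
    "above valid range": [66],           # representative invalid-high value
}

for name, samples in partitions.items():
    for value in samples:
        print(f"test {name!r} with age={value}")
```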

What is true of testing generally must perforce be true of regression testing too.

Regressionology
How do testers make progress?  One way is by running (useful) new test cases that haven’t been run before[2].  Boris Beizer called this progression testing.

One way in which testers aren’t making progress (or perhaps two ways) is when they re-run test cases they’ve already run.  The two main circumstances for this are confirmation testing (traditionally but confusingly known as re-testing) and regression testing.  In the first case, testers re-test with test cases that previously failed, to confirm that the software has been correctly fixed in the interim.  In the second case, testers re-test with test cases that previously passed, to see if changes have accidentally broken something that hasn’t been changed.

A regression is an unintended side effect, within a component or system, of a change elsewhere in the component or system, or in the component’s or system’s run-time environment.  Regressions are important, because capabilities that users rely on may suddenly and mysteriously stop working correctly.  And because the capability that no longer works right may be a long way away from where changes were made, users might not be looking at it and might not notice the problem for quite a long time.

Beizer estimated that, when a mature software system undergoes maintenance, perhaps as much as 5% of the code may be changed.  The remaining 95% should continue to work as though nothing had happened—and that’s where the problem of potential regressions lies, as illustrated in the Eyeball Diagram.

[Figure: the Eyeball Diagram of regression testing]

The eyeball represents an important and mature existing system—say, a Hospital Total Management System, to which all patient monitoring equipment is attached for centralised supervision, and which (more importantly) pays the doctors and nurses.

The outer part of the iris represents the addition of middleware to connect some new piece of equipment.  The inner part of the iris represents changes that had to be made in existing functionality to support the connection.  The two together represent “the project”, within which we might expect to see progression testing combined with confirmation and regression testing.

The eyeball veins[3] represent paths of connection from the changed areas into unchanged areas of the system.  They are of two sorts: direct connections, by which components directly exchange data by component-to-component transfer, and indirect connections via file or database records.  They represent lines of communication along which the unintended effects of changes might propagate from the “iris”.  Whether they will or not depends on the relationships between the changes and the connections, and the number and “strength” of connections between components.  Together, these concepts make up the “connectivity” of the components, and of the system.  As a rule of thumb, a higher degree of connectivity correlates with a higher risk of regressions.  And that rule of thumb leads us to the Four Laws of Regressions.
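To make the connectivity idea a little more concrete, here is a rough sketch (the component names, connection types, and strength values are all invented) that treats the connections as a weighted graph, and uses the total strength of the connections reachable from a changed component as a rule-of-thumb regression-risk score:

```python
from collections import defaultdict

# Hypothetical connections: (from, to, kind, strength).
# "direct"   = component-to-component data transfer,
# "indirect" = connection via a shared file or database record.
connections = [
    ("Billing",   "Payroll",    "direct",   3),
    ("Billing",   "PatientDB",  "indirect", 2),
    ("PatientDB", "Monitoring", "indirect", 2),
    ("Payroll",   "Reports",    "direct",   1),
]

graph = defaultdict(list)
for src, dst, kind, strength in connections:
    graph[src].append((dst, strength))

def regression_risk(changed, graph):
    """Sum the strength of every connection leaving a component reachable from
    the change -- a rule of thumb: the higher the connectivity, the higher the risk."""
    seen, stack, score = {changed}, [changed], 0
    while stack:
        node = stack.pop()
        for neighbour, strength in graph[node]:
            score += strength
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return score

print(regression_risk("Billing", graph))  # 8: Payroll, PatientDB, Monitoring and Reports are all reachable
```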

Four Laws of Regressions
The first law of regressions (apart from the fact that they will happen) is Law 0, which says by way of background that, whenever a regression causes a failure, there is a completely logical connection between it and the change that caused the regression; but it is not necessarily a straightforward connection, and by no means necessarily an obvious one.  Here are the remaining laws, based on that principle:

  • There is no necessarily obvious connection between where a change is made, and where a regression may occur. It could be in the same component, or it could be anywhere else in the system.  Or in another system.  Wherever the connections lead to.
  • There is no necessarily obvious connection between when a change is made, and when a regression may strike. In my personal experience, a well-intended change to a backup-and-recovery system resulted in a failure to back up part of the database correctly, and since the area affected was only occasionally touched by users, it was four months before it was realised that several thousand records had been corrupted.  Unfortunately, the organisation operated a three-month daily backup cycle.
  • There is no necessarily obvious connection between the size of a change and the impact of a resultant regression. The classic case was the change of a comma to a full stop in a Fortran program (or was it the other way round?) which resulted in a multi-million dollar rocket blowing up on the launch pad.  The opposite of making progress …

These laws may be seen as resulting in some pretty severe risks of failures caused by regressions, and it’s no good saying, “It’ll never happen to us”.  Although the “regression rate” is sometimes referred to as “the unknown unknown”, I read of a study conducted within IBM which found that roughly one software change in three produced a regression somewhere in the system; and testers participating in e-testing’s training courses confirm that regressions often blight their lives.  One-in-three may be a much higher rate than is occurring at your site; or there again, neither you nor anyone else may be measuring your local rate (there are several web sites that will tell you how[4]), and it may indeed be your unknown unknown.

“General” regression suites

“How much regression testing is enough?”

It’s been estimated that almost half the cost of software maintenance lies in the cost of regression testing[5].  How do we optimise this cost?

“It all depends …”—on the risk of damage that may be caused by regressions, versus the constraints on time, cost, tools, skills, and materials (availability of suitable test cases) for building and executing test suites.

A full regression suite would contain all the test cases ever created for the software product, minus those obsoleted by changes.  In an ideal world, or next to ideal, we’d be able to rerun every test case against a system, every time we changed the system or its environment.  A test suite of this sort, appropriately maintained, would be applicable to all versions of the software; such approaches are known as general test case prioritisation[6].

But for a decent-sized piece of software, there could be a formidable number of test cases to run each time the software is changed.  Automated regression suites can help, but even an automated full test suite may be too large to run except very occasionally.  Moreover, given the way that regressions trace back to specific changes via specific paths of connection, most test cases would be exploring areas of the product that had little or no chance of being affected.

Unnecessary testing has low cost-effectiveness.  So how can we cut regression suites down to a more cost-effective size?  There are three main ways:

  • minimisation (removing redundant test cases);
  • selection (to meet particular criteria); and
  • prioritisation (ordering test cases for early achievement of particular goals).

You could use all three methods in combination, and as we’ll see, there’s possible overlap between the second and third.  But first, we’ll consider minimisation.

Minimising test sets
Minimisation techniques use various means to identify and remove obsolete and redundant[7] test cases.  At first blush, this might look like a common-sense approach, but computer scientists have identified numerous formal criteria for deciding on redundancy, which mostly rely on “heuristic” (rule-of-thumb) techniques such as the type and degree of coverage.  Other approaches address the redundancies often produced by combinatorial test techniques such as decision tables, where a proportion of decision rules are based on inputs that won’t affect the outputs.  Still other techniques focus on redundancies in model-based test suites, such as those generated from state transition diagrams.
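One widely used heuristic of this sort is greedy, coverage-based minimisation: keep adding the test case that contributes the most not-yet-covered items, and treat whatever adds nothing new as redundant.  A minimal sketch, with invented test names and coverage data:

```python
# Hypothetical coverage data: test case -> set of covered items
# (branches, decision-table rules, state transitions, ...).
coverage = {
    "TC1": {"A", "B", "C"},
    "TC2": {"B", "C"},   # adds nothing beyond TC1, so it ends up dropped
    "TC3": {"C", "D"},
    "TC4": {"E"},
}

def minimise(coverage):
    """Greedy set cover: repeatedly keep the test adding the most new coverage."""
    needed = set().union(*coverage.values())
    kept, covered = [], set()
    while covered != needed:
        best = max(coverage, key=lambda t: len(coverage[t] - covered))
        if not coverage[best] - covered:
            break                     # nothing left adds new coverage
        kept.append(best)
        covered |= coverage[best]
    return kept

print(minimise(coverage))   # ['TC1', 'TC3', 'TC4'] -- TC2 is treated as redundant
```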

Some of these techniques, used on the same test set, may lead to opposite conclusions regarding redundancies, making decisions difficult.  All approaches face the danger that an apparently redundant test case actually covers an unrecognised risk, and so might find a bug that every other test case misses[8].

Nonetheless, combing out redundant and obsolete tests is an excellent idea, and should be done fairly regularly.  Studies show that, if the heuristics are used systematically, a minimised test suite is likely to find 98% or better of the bugs that the full suite would find, with “significant savings” in execution cost.  The payoff will generally be worthwhile (depending on the degree of risk associated with system failures, of course); but minimisation still doesn’t address the cost-effectiveness issue: the likelihood that most test cases in a regression suite still stand little realistic chance of stumbling across bugs.

Selection and prioritisation
Selection strategies try to reduce test suites to a still more cost-effective size by removing those low-likelihood test cases.  Prioritisation tries to arrange the execution of test cases into a sequence that will find bugs sooner, offering the possibility that low or very-low-priority test cases might be de-selected from the set.
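As a simple sketch of the prioritisation idea (the test names, history, and weights are invented, and real schemes use far more sophisticated criteria), test cases might be ranked by their past bug-finding record and by whether they touch changed areas, with the lowest scorers de-selected when time is short:

```python
# Hypothetical prioritisation: rank test cases by a crude score combining
# how many bugs each has found before with whether it touches changed code.
tests = [
    # (name, bugs found in the past, touches a changed area?)
    ("TC_login",      4, True),
    ("TC_backup",     2, True),
    ("TC_reports",    0, False),
    ("TC_help_pages", 0, False),
]

def priority(test):
    _name, past_bugs, touches_change = test
    return past_bugs * 2 + (5 if touches_change else 0)   # invented weights

ranked = sorted(tests, key=priority, reverse=True)
for name, *_ in ranked:
    print(name)

# De-selection: drop anything with a zero score when the schedule is tight.
selected = [t for t in ranked if priority(t) > 0]
```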

Test suite partitioning is a type of selection strategy, in which you develop separate regression suites for separate functions.  When a function changes, its regression suite is run to look for internal regressions, and other functions are regression-tested in rotation until, over a period of say a month or two, all functions have been regression tested.  Since such test suites are intended to be run across multiple successive versions of the software, the approach is still a form of “general” test case prioritisation.

Most selection strategies, by contrast, are version-specific, meaning that the approach is to identify test cases that are “relevant” to the changes made in the software or its environment.  Computer scientists have studied how to build what they call “safe” regression test suites.  Those are test sets that include all test cases exercising code elements that are not only definitely connected to the location that was changed, but connected in a way that leaves them vulnerable to side-effects of the changes.  These test sets are built using version-specific test case selection and prioritisation techniques: techniques that tailor the regression suite to the changes.

Selection techniques that have been extensively studied by computer scientists tend to concentrate on code-level criteria related either to control flow or data flow within individual components.  However, some experimenters have successfully extended such approaches to system-level black-box testing.  For example, control-flow and data-flow models may be built and explored for use cases.  Such approaches generally work by identifying the exact differences between pre- and post-changed versions of the software, using well-known techniques such as “graph-walking” through the models.  The graph walk algorithm has also been applied to test web services, despite the challenges that arise from the distributed nature of web services.
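A much-simplified sketch of the graph-walking idea follows (the dependency model and traceability data are invented): starting from the elements that differ between the two versions, walk the dependency graph to find everything the change can reach, and keep only the test cases traced to those elements.

```python
from collections import defaultdict, deque

# Hypothetical model: which elements (use cases, components, ...) depend on which.
depends_on = {
    "UC_payment": ["UC_checkout"],
    "UC_invoice": ["UC_payment"],
    "UC_reports": ["UC_invoice"],
    "UC_help":    [],
}

# Invert to: element -> elements that depend on it (where side-effects can propagate).
dependants = defaultdict(list)
for element, deps in depends_on.items():
    for dep in deps:
        dependants[dep].append(element)

# Traceability: test case -> elements it exercises (invented).
traces = {
    "TC_checkout": {"UC_checkout"},
    "TC_payment":  {"UC_payment"},
    "TC_reports":  {"UC_reports"},
    "TC_help":     {"UC_help"},
}

def select_tests(changed, dependants, traces):
    """Walk the graph from the changed elements; keep tests traced to anything reached."""
    reached, queue = set(changed), deque(changed)
    while queue:
        element = queue.popleft()
        for dep in dependants[element]:
            if dep not in reached:
                reached.add(dep)
                queue.append(dep)
    return [t for t, elements in traces.items() if elements & reached]

print(select_tests({"UC_checkout"}, dependants, traces))
# ['TC_checkout', 'TC_payment', 'TC_reports'] -- TC_help can safely be left out
```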

However, black-box approaches have the weakness that functional specifications, and models based on them, are unlikely to be sensitive to internal (structural) connections within the software, except in a superficial manner.  Those internal connections, of course, are where regressions propagate themselves.

Impact analysis via connectivity
A more reliable approach is to trace and measure the connectivity between changed areas and other areas of the system.  A very good configuration management tool might store this information in an accessible manner, but it may be better to use a specialised tool such as Semantic Designs’ “ComponentConnectivity” tool.  Such tools perform impact analyses of the changes by scanning across the source code for a system, and provide specific information about each connection, such as the type of connection (e.g., “read” versus “write”) and the information exchanged.

Such tools can provide valuable insights into the likelihood that a change at location X might result in regressions at locations Y and Z, or anywhere else.  Using the information, a repository of all test cases, and good traceability, testers can build regression test suites which are tailored to the specific changes made.  To optimise bug-finding, the order of testing might be prioritised by the relative risk of regressions.

If testers don’t have access to a tool such as ComponentConnectivity (or a really good CM tool), they might have to rely on the development team’s knowledge of the internal connections of the system architecture, and their “feel” for their levels of connectivity.

Another approach, called TestTube, is used in regression testing of system integration.  By examining the locations and types of change (“no change”, “code change”, “specification and code change”), testers build a virtual “firewall” within which lie the modules in which regressions may occur.  That region is the focus of the regression testing.  Sadly, Leung and White, who devised the method, pointed out that its weakness was the general unreliability of integration test suites[9].
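A minimal sketch of the firewall idea, reusing the same invented module names as earlier: mark each module with its change type, then draw the firewall around the changed modules plus everything that directly calls, or is called by, them.

```python
# Hypothetical change log and call relationships between modules.
change_type = {
    "Billing":    "code change",
    "Payroll":    "no change",
    "PatientDB":  "specification and code change",
    "Monitoring": "no change",
    "Reports":    "no change",
}

calls = [  # (caller, callee)
    ("Billing", "Payroll"),
    ("Payroll", "Reports"),
    ("Monitoring", "PatientDB"),
]

def firewall(change_type, calls):
    """Changed modules plus their direct callers and callees: the region to regression test."""
    changed = {m for m, kind in change_type.items() if kind != "no change"}
    region = set(changed)
    for caller, callee in calls:
        if caller in changed:
            region.add(callee)
        if callee in changed:
            region.add(caller)
    return region

print(sorted(firewall(change_type, calls)))
# ['Billing', 'Monitoring', 'PatientDB', 'Payroll'] -- Reports lies outside the firewall
```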

As might be expected, experimentation has shown that different selection techniques can result in statistically significant differences in the rates of bug-finding, as well as significant differences in cost.  Overall, however, “version-specific” techniques, even the simplest, generally improve the rate of bug-finding at both the code and function levels, as compared with “general” prioritisation techniques intended for any version of the software.

Final considerations
Experimentation appears to confirm that “version-specific” regression test suites, based on impact analysis, offer a better bet than “general” suites that try to touch all bases equally.  As usual, quality beats quantity.

When considering regression testing, analysis of the impact of change should[10]:

  • Be performed with the help of documentation abstracted from the source code;
  • Identify potential ripple effects; and
  • Consider the history of prior changes, both successful and unsuccessful.

These criteria should operate at different levels of testing.  For example, when a single component is changed (and if the risk of regressions warrants it[11]):

  • Component testing should verify and validate the changes, but also regression test the remainder of the component’s code.
  • Integration testing should regression test at least the components that communicate directly with the changed component.
  • System testing should regression test the remainder of the system.

Using the clues above, impact analysis can be applied at all levels.  Which parts of the component are most likely to be affected?  Which closely-related components are likely to be affected?  Which other parts of the system are likely to be affected?

Our objective in all things should be what James Bach called “Good-Enough Testing”—testing that strikes the cost-effective balance between risk and the constraints on testing, especially those imposed by the customer who’s paying for it.  Running all your test cases every time you do regression testing may not be the most cost-effective way to achieve this.

[1] That’s 924,733 times 13.82 billion years, for those interested.  MS Excel® did the maths …

[2] A useful test case is one that reveals a useful new truth about the test object.  It might be a confirmation of what we already suspected or hoped for, or it might be something we never expected, but it’s new and significant.  The opposite is a test case which conceals the truth by giving an unclear or ambiguous result, or which merely replicates information already provided by other test cases.

[3] Ophthalmologists among my reading audience may wish to point out that the “veins” in the eyeball are spreading in the wrong direction, from the wrong focus.  Point away, friends.

[4] A technical paper by Microsoft’s Alexander Tarvo provides a lot of useful information on measuring regression rates.  You will find it at http://tinyurl.com/TarvoRegressions .  Watch out for the formulae.

[5] Muthusamy and Seetharaman, “Effectiveness Of Test Case Prioritization Techniques Based On Regression Testing”, see http://tinyurl.com/MuthusamySeetharaman .

[6] See the discussion by Elbaum, Malishevsky, and Rothermel, “Prioritizing Test Cases for Regression Testing”, at http://tinyurl.com/ElbaumEtAl .  If you’re like me, you’ll struggle with the maths, but the concepts may be very useful if you persevere.  The authors write about “prioritisation techniques”, but of course, these might be used to “prioritise” test cases into or out of tailored test suites.

[7] A redundant test case gives you no useful information that you won’t get from other test cases.

[8] This is a reflection of Beizer’s Guarantee, which says that any test case you don’t run is guaranteed not to find any bugs at all.  That’s the only guarantee that testers have, and one of our most insoluble problems …

[9] See http://tinyurl.com/LeungWhite .

[10] IEEE Standard for Software Maintenance (IEEE Std 1219).

[11] Potentially, it always does.  It’s the unexpected regressions that cause the most trouble!
