Thursday, March 29, 2012

Test Metrics - Dangers In Communication

At the Software Testing Club meet-up last night many interesting points were made, but the one that particularly captured my interest was testing metrics.

What Are Metrics?

Testing, like most things, is full of numbers and statistics. Metrics are simply numbers with a reason behind them: "an analytical measurement intended to quantify the state of a system"1

Unfortunately, numbers will say whatever you would like them to say. I always return to an old adage taught to me at school:
Information = Data + Meaning
So the numbers in statistics are data, and what they refer to is the meaning. Or the numbers are the quantification and what they refer to is the state of the system. But this is all that these numbers refer to, and that is the inherent problem with metrics.

Why Do We Use Metrics?

Well, to "quantify the state of the system", of course! But is that really why we use metrics? We know what they refer to, but metrics have a habit of miscommunicating meaning, and I think the operative word we hit on last night was context.

Managers love metrics. They like reports, charts and graphs because they can gain information about a system they perhaps know very little about or don't have time to understand. It's easier to go to your boss with "Our bug count is 17% lower this quarter" than with "well, it looks like it's going okay and the testing guys say the quality looks higher". But do the numbers tell the truth about the system? If your goal is making you or your team look good, or keeping your head down and getting away with nominal improvement, then metrics are a gift from heaven... but if relaying the true quality of your product is the goal, then qualification and context are vital.

Our discussion last night prompted the example of "severity". Saying "We have 20% fewer Severity One bugs" sounds great, but this statement should prompt a lot of questions, such as:

  • What on earth is a "Severity One" bug?
  • What does it mean to 'have' the bug? That we found fewer? That we fixed more than we found?
  • Is 'having' fewer of these bugs really a good thing? Are we spending time fixing bugs and not finding them? Are we simply not finding them any more when we should?
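The ambiguity in those questions can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical counts (none of these figures are from the discussion) to show how the same headline, "20% fewer open Severity One bugs", can come from opposite underlying causes: finding fewer bugs, or fixing more of them.

```python
# Hypothetical quarterly counts (illustrative only) showing how one
# headline metric can hide two very different realities.

last_quarter = {"found": 50, "fixed": 30}     # 20 open Sev-1 bugs
fewer_found = {"found": 46, "fixed": 30}      # testing slowed down?
more_fixed = {"found": 50, "fixed": 34}       # genuine progress?

def open_sev1(counts):
    """Open Severity One bugs = bugs found minus bugs fixed."""
    return counts["found"] - counts["fixed"]

def percent_change(before, after):
    """Signed percentage change from one quarter to the next."""
    return 100 * (after - before) / before

before = open_sev1(last_quarter)
for label, quarter in [("fewer found", fewer_found), ("more fixed", more_fixed)]:
    now = open_sev1(quarter)
    # Both scenarios yield the same "20% fewer" headline.
    print(label, now, percent_change(before, now))
```

Both branches report a 20% drop, yet one may indicate weaker testing and the other better fixing; only the surrounding context distinguishes them.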


Science is a methodology for finding truth about the universe despite our own cognitive biases. One look at an optical illusion shows that seeing isn't believing, and starting with an answer and then trying to prove it makes it more likely that the experimental data will show the pre-selected answer ("hmmn, that doesn't look right, I'll take that reading again..."). We are pattern-seeking animals: we tend to confirm our own beliefs, look for "hits" and forget the "misses" - this is confirmation bias. It's why we see shapes in the clouds and the face of Jesus on a crisp; we are evolutionarily programmed to respond to patterns and to err on the side of false positives (is that rustling in the trees the wind or an enemy? Better run just in case).

All this means we should not decide what we want the data to say and then interpret it to fit, otherwise we are creating data that does not tell the truth. This is one of the most basic rules of scientific philosophy. It might make us look good, or cover up our mistakes or those of our colleagues, but it is not the genuine state of the world, or of the system under test.

How Should We Use Metrics?

Interpret the numbers on behalf of those receiving them; do not let them do it themselves. Those collecting the metrics are probably in the best position to understand them, so add the context and qualifications there before handing them off to someone else. Do not use confusing shorthand terms like "code coverage" and "bug count" without explanation. Formalise the meaning of domain terms like "severity" so that testers categorise issues by a domain-wide understanding of the term.
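One way to follow that advice is to make the definitions and the context travel with the number itself. This is a minimal sketch of the idea; the severity tiers and the report fields are hypothetical examples, not a standard.

```python
# Sketch: a shared, written-down definition of "severity", plus a report
# that bundles the metric with its meaning (Information = Data + Meaning).
# All names and criteria here are illustrative assumptions.
from enum import Enum

class Severity(Enum):
    ONE = "data loss, crash, or no workaround"
    TWO = "major feature broken, workaround exists"
    THREE = "minor feature broken or degraded"
    FOUR = "cosmetic issue"

def contextualised_metric(value, definition, context):
    """Hand off a number together with what it means and why it moved."""
    return {"value": value, "definition": definition, "context": context}

summary = contextualised_metric(
    value="20% fewer open Severity One bugs",
    definition=f"Severity One = {Severity.ONE.value}",
    context="Four long-standing bugs were fixed; the discovery rate is unchanged.",
)
```

The receiver of `summary` gets the qualification and context alongside the figure, rather than a bare percentage they must interpret for themselves.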

This all feeds into our need to understand more about the system than "number of bugs fixed" or "test coverage" - the ability to understand opportunity cost and apply it to schedules, risk assessment, functional prioritisation and much more.

