Deathray Research

August 2012

2 posts

If you forecast using velocity, you're going to be late

In a recent post I praised forecasts based on velocity and said everyone should use them. I would be remiss if I didn’t point out that using velocity to directly estimate end dates will cause you to be late. The culprit here, as in much of development, is bugs.

Some tools go to great lengths to lead you astray. In the Pivotal Tracker FAQ, you find this answer to “Why can’t I estimate bugs?”

By default, only features can be estimated. In contrast to features, bugs tend to emerge over time, and while they are a necessary part of your project, they can be thought of as an ongoing cost of doing business.

Tracker’s automatic velocity calculation frees you from having to account for this cost. By measuring velocity in terms of features only, Tracker can estimate how much real, business-valued work can be completed in future iteration, allowing you to predict when project milestones might be achieved, and allow you to experiment with how any change of scope might affect such milestones.

After all that, they admit that you can enable bug estimation if you must. That’s good. Still, they pack an amazing amount of wrong into those few sentences: suggesting that features don’t tend to emerge over time, for example; or that software development should be thought of in project terms (the root of much evil); or that the business value of software can be considered independently of defects (all defects reduce value; there’s some threshold above which the software would be worthless); or the subtle suggestion at the end that the impact of scope increases will be linear. But I’ll stick to the impact of bugs on velocity. [1]

If you follow their recommendation, how wrong your estimates will be depends on how you manage your rework cycle. I’ve included the diagram here for reference.

You could choose to measure velocity purely on features, ignoring the bottom half of the diagram. All bugs will accumulate in that case (while you’re off adding business value), and you’d be adding a potentially large but un-estimated pile of work to the end of your plan. Depending on the defect rate, the bug-fix effort can easily be greater than the feature work. I have to assume that’s not what the Pivotal folks have in mind.

In the best case, you fix bugs as you go. In this scenario, bugs do have an immediate impact on your velocity, and your predictions will be better, but you should keep in mind that the only way that velocity can fully account for defects is if these assumptions hold:

1. You find all of your defects with no delay.

2. You fix all of your (must have) defects before moving on to the next bit of feature work.

The first assumption will never hold in the real world; the second is just extremely unlikely.

In any other scenario, your velocity calculation will be optimistic and your projections wrong.  Going back to the Pivotal example, if you don’t fix all your bugs before moving on, you must estimate them to keep your velocity reasonably correct.

In general, if you have a highly effective test process and keep the known bug count at a constant value, the delay will also be constant. But if you allow bugs to accumulate un-estimated in either the known or unknown state, your projections will be off by a growing amount as work proceeds. And that could turn out to be a big problem.
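To make the difference concrete, here is a minimal Python sketch of that last paragraph. It is a toy model, not anything Tracker does: the function name `hidden_delay` and its parameters are hypothetical, bug work is assumed to be measured in the same points as features, and the leftover bugs are assumed to be burned down at the measured feature velocity once the features are done.

```python
def hidden_delay(iterations, velocity, rework_per_point,
                 fixed_per_iteration, initial_backlog=0.0):
    """Extra iterations hidden from a features-only forecast, per iteration.

    velocity            -- feature points completed per iteration
    rework_per_point    -- points of bug work created per feature point (assumed)
    fixed_per_iteration -- points of bug work actually fixed each iteration
    """
    backlog = initial_backlog
    delays = []
    for _ in range(iterations):
        backlog += velocity * rework_per_point        # bugs created by new features
        backlog -= min(backlog, fixed_per_iteration)  # bugs fixed as you go
        delays.append(backlog / velocity)             # un-forecast iterations
    return delays


# Bugs pile up un-estimated: the forecast error grows every iteration.
print(hidden_delay(4, velocity=20, rework_per_point=0.25, fixed_per_iteration=0))
# [0.25, 0.5, 0.75, 1.0]

# Known bug count held constant: the delay is constant too.
print(hidden_delay(3, velocity=20, rework_per_point=0.25,
                   fixed_per_iteration=5, initial_backlog=10))
# [0.5, 0.5, 0.5]
```

The numbers are made up; the shape is the point. A constant backlog shifts the end date by a fixed amount, while an accumulating one makes the forecast worse the longer you rely on it.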

Footnote:

[1] I edited the Pivotal quote slightly for readability. Pivotal is one of the best tools around, so the sheer amount of misunderstanding suggested in those paragraphs is some indication of how much folklore and urban legends still drive engineering management.

Aug 9, 2012 | 2 notes
#estimation

Customers ruin everything

The Rework Cycle is a simple, realistic model of software engineering. Here we extend that model to show how interactions with customers increase development complexity. Like the original, this version was derived from simulations using system dynamics.

image

On the left of the diagram, customers increase scope, adding to the backlog of work to be done, and in all likelihood, pushing out the completion date. On the right, they review the completed work and ask for changes. Both actions can cause outsized variations in development performance. 

Read More →

Aug 7, 2012 | 2 notes
#process models #system dynamics #simulation

July 2012

9 posts

9 steps to effective metrics

My doctor is pretty good, so I wasn’t surprised when she noticed I was fat. She backed up her observation with a metric: my weight. She also noted that the trend was not good. (Thank you, Google, for all those brownies.) When you consider blood pressure, pulse, oxygen level, cell counts, etc., maintaining human health is all about the numbers.

Not so with software development, where demonstrably useful numbers are not often seen in normal practice. Here are nine tips for using metrics for better process health:

1. Don’t measure what you won’t use

Metrics are expensive and tedious to gather. Unless they’ll drive a decision, don’t collect them. 

2. Embrace the limitations of your numbers

One of my hackers challenged code coverage as an inaccurate measure of test effectiveness. While he was correct, it was irrelevant. The role of a metric is to reduce uncertainty. Life is not a math quiz where only perfect answers count.

Read More →

Jul 22, 2012 | 1 note
#metrics

The rework cycle

The diagram shows a system dynamics model of software development I liberated from a class presentation at MIT. The researcher’s model was much more complex, but this is the heart of it. They dubbed this loop “the rework cycle.” In their model, it accounted to a large degree for the success or failure of a development effort.

The model says progress is determined by combining the number of people with how productive each is. The quality of the work determines how many bugs are created and there’s a defect discovery process that limits how quickly defects are found and re-enter the process as more work.
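As a rough illustration only (the researchers' model itself is not reproduced here), the loop described above can be sketched in a few lines of Python. The function name, the discrete time step and the stopping threshold are all assumptions made for the sketch.

```python
def simulate_rework_cycle(total_work, people, productivity, quality,
                          discovery_rate, max_steps=10_000):
    """Count time steps until (almost) no known or hidden work remains.

    quality        -- fraction of work done correctly the first time (0..1)
    discovery_rate -- fraction of hidden defects found per step (0..1)
    """
    work_to_do = float(total_work)
    undiscovered = 0.0  # flawed work nobody has noticed yet
    steps = 0
    while work_to_do + undiscovered > 0.01 and steps < max_steps:
        progress = min(people * productivity, work_to_do)
        work_to_do -= progress
        undiscovered += progress * (1.0 - quality)  # bugs created
        found = undiscovered * discovery_rate       # defect discovery
        undiscovered -= found
        work_to_do += found                         # rework re-enters the queue
        steps += 1
    return steps


perfect = simulate_rework_cycle(100, people=5, productivity=2,
                                quality=1.0, discovery_rate=0.5)
realistic = simulate_rework_cycle(100, people=5, productivity=2,
                                  quality=0.8, discovery_rate=0.5)
print(perfect, realistic)
```

With perfect quality the work finishes in exactly `total_work / (people * productivity)` steps. Lower the quality or slow the discovery rate and the same team takes longer, which is the loop's whole argument.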

Read More →

Jul 21, 2012
#process models #system dynamics #simulation

Story points reconsidered

Are story points about complexity or time? Mike (Agile Estimation) Cohn was explicit:

point-based estimating is about the time the work will take

In short, story points are a flimsy, undersized hospital gown draped over real, time-based estimates. As much as you want to hide them, they’re gonna show through.

Some processes associated with Agile (continuous integration, unit testing, short, functional delivery cycles, customer representation, etc.) help us build better software. But not story points.

Read More →

Jul 15, 2012 | 3 notes

These will make you smarter

Ten of the best books from the Deathray Research bibliography. Guaranteed to make you smarter about software engineering and the world. Inspired by the book, This Will Make You Smarter, and my teenage son, who said today “All books are self-help books”.  Couldn’t agree more. 

Read More →

Jul 14, 2012 | 4 notes

tenXer revisited

I had a brief, and pleasant, conversation with tenXer CEO Jeff Ma yesterday.  We talked about metrics, performance improvement and where tenXer is heading. 

Jeff was nice enough to provide a login where I could evaluate tenXer with real data. What I learned (other than that Jeff has been shirking his coding duties) was how the data comes together to provide a useful picture of an engineer’s work. 

Read More →

Jul 12, 2012

Measuring productivity thru code-mining

most programmers believe a great hacker is several times more productive than a marginal hacker, while simultaneously believing that it’s impossible to measure hacker productivity

I believe that quote because I wrote it.

I also believe you can measure hacker productivity by looking at the code they write, and for more or less the same reason. I developed a measure of productivity and experimented with it with some friends. In this post, I’ll tell you how to calculate it (it’s easy), and how to use these powers - along with measures of quality and test coverage, of course - for good and not evil. 

Read More →

Jul 11, 2012 | 3 notes
#code-mining #metrics

Money can't buy me performance

One of the most persistent beliefs in corporate management is that money motivates people to do a better job. It’s used to justify exorbitant executive salaries and underlies the whole performance-appraisal/salary-increase dance. It distracts managers from the vital challenge of increasing work’s intrinsic motivational power.

And it’s not true. 

Granted, I’m more likely to work for you if you offer me $100 than if you offer me $10, all else equal.  But the link from that to, say, creating a bonus plan that increases hacker productivity, is tenuous. Here’s why:

Read More →

Jul 5, 2012

tenXer, the quantified life and the quantified hacker

There’s an excellent, slightly scatological article in The Atlantic on computer scientist Larry Smarr’s quest for the quantified self. The article is by Mark Bowden of Black Hawk Down fame. If you’re interested in the intersection of Big Data, metrics and medicine you should check it out.  

On a related note, I recently came across a profile in the New York Times on tenXer, a startup that claims it can turn a ‘1x’ engineer into a ‘10x’ one through data-mining and gamification. The Economist also picked up on tenXer, showing that if nothing else, they have a flair for PR. The media interest derives no doubt from tenXer’s remarkable founder Jeff Ma, a former member of the MIT Blackjack Team featured in the film ‘21’.

I signed up for their beta test to check it out. I don’t want to rush to judgement based solely on their beta, so take this with a grain of salt: From what I’ve seen, their approach is shallow, to the extent that I wonder how much they really understand the great sausage factory of software development.  In an interview, Ma states that software is just the start, rather than the sole focus of tenXer; if so, the lack of depth is understandable.

The basic idea is that they gather metrics (counts of check-ins, lines-changed, emails sent, bugs fixed, etc.) from a variety of sources (GMail, Pivotal Tracker, GitHub, Phabricator and Jira to start with) and provide visualization tools to help you track your “progress” and encouragements to beat your prior bests.

I have two concerns: First, how all these stats relate to effective software development is unclear. Contrast tenXer with the depth of Larry Smarr’s inquiries into his health. Smarr understands that without a model of how it fits together, data is meaningless.  

My second concern is that tenXer is all about individual performance. Software development is a team sport, so beware the local optima. The last thing you want is someone optimizing his personal check-in rate to make the tenXer leader board. 

Still, I think this is a startup to watch. Data mining is the wave of the future in software engineering and it’s great to see startups moving into this space. 

(Full-disclosure: tenXer is funded by Google Ventures and my bank account is funded by Google, Inc. The opinions expressed here are my own. Google Ventures is unaware of my existence, etc. etc.)

Jul 3, 2012
#metrics #code-mining

Positive Software Engineering

I came across a story recently on Hacker News:

“My boss decided to add a ‘person to blame’ field to every bug report. How can I convince him that it’s a bad idea?”

Dear OP, Good luck with that. 

Your boss isn’t (necessarily) a bad guy. Software development is a complex dynamic system so it’s hard to see what’s really going on. Intuitions about how to fix things are often wrong and many management interventions are unproductive.

What’s unfortunate about this particular fail is not that it’s ineffective, but that it poisons the environment. Organizations can develop a pervasive culture of finger-pointing, fear and defensiveness. It becomes impossible to discuss issues because anything imperfect has to be someone’s ‘fault’. The result feels like the org-chart of fear from Joseph Heller’s novel “Something Happened”:

“In my department there are six people who are afraid of me, and one small secretary who is afraid of all of us. I have one other person working for me who is not afraid of anyone, not even me, and I would fire him quickly, but I’m afraid of him.”

All of this takes a toll on the people. If you take an activity that many people happily do for free, combine it with some of the highest salaries of any profession, and produce a work-life that sucks, that’s sub-optimal. It’s not, unfortunately, unusual: many software people are less than thrilled with their work. 

Why should anyone care? Two reasons: First, there is considerable evidence that employee well-being has a positive, causal impact on performance. Second, if you can structure work so that it supports, rather than undermines your team’s well-being, then you should. It’s morally the right thing to do, and you’ll make your own work life more meaningful in the process.

Improved well-being is clearly both a motivator for, and desired outcome of, Agile development practices, but to my mind it doesn’t go far enough. I’m looking for connections between how we build software and the relatively new fields of positive psychology and positive organizational scholarship. Positive psychology is concerned with taking healthy people and increasing their well-being. By analogy, we might take healthy development teams and help them really thrive. For lack of a better name, let’s call this line of inquiry ‘Positive Software Engineering’.

FWIW, here’s my real answer to the OP: Instead of putting a “person to blame” field in every bug report, suggest he put all the hackers’ names (and maybe photos, too) in the product itself. Most products have an “About” link or dialog. Put it there.

No one wants their name on something they’re not proud of. Open source projects credit their hackers; game developers, too. If he needs a further precedent, show him how Steve Jobs put the original Mac developers’ signatures inside every Mac. Your boss can be positive, and be like Steve. What’s not to like?

Jul 2, 2012
#software engineering #well-being #positive software engineering

May 2011

1 post

Save the baby code

Code reviews are both a standard part of the development process and the biggest wasted resource in software engineering.

Approaches vary from face-to-face discussions to online systems like Review Board.  They share two things: They’re arguably the most effective way to assess code quality, and they’re expensive. 

Yet even as we pay experts to evaluate the actual code, we manage with metrics like code-coverage and defect counts that provide indirect (and possibly delayed) signals about its health. If we could somehow quantify those reviews, the insights could lead to improvement. 

Faced with a similar problem, Virginia Apgar published in 1953 a paper titled “A Proposal for a New Method of Evaluation of the Newborn Infant” that changed obstetrics and neo-natal practice around the world. She did it by devising a simple, 10-point scale that rated newborns on five categories like muscle tone, color, etc., awarding 0 to 2 points for each.

In the words of Atul Gawande, Apgar’s score “turned an intangible and impressionistic clinical concept - the health of newborn babies - into numbers that people could collect and compare.” This led to two kinds of innovations: One produced new techniques to save babies with low scores; the second brought advances that led to increased average scores. The result was a 16X improvement in infant mortality and 140,000 lives saved each year in the US alone.

To do this, Apgar first demonstrated that her score was a true measure of newborn health. She divided 2,096 newborns into three groups according to their scores. Mortality for the middle group was an order of magnitude worse than the best group, while the lowest group’s mortality was an order of magnitude worse still:

  • Infants scoring 0, 1 or 2: 14%
  • Infants scoring 3, 4, 5, 6 or 7: 1.1%
  • Infants scoring 8, 9 or 10: 0.13%

Having established the score’s effectiveness, she went on to demonstrate the advantages of one technique over another by comparing the scores they produced.  The results for ways to deliver anesthesia, for example

  • Spinal anesthesia: 8.0 
  • General anesthesia: 5.0  
  • Epidural or caudal: 6.3

showed clear differences between the techniques. The result was the widespread adoption of the rating system and ongoing competition among doctors, hospitals and researchers for improved scores.

What does this have to do with code reviews?  The health of newborn code is also an “intangible and impressionistic” concept. It needs an Apgar score so that teams can learn and improve.  

There are complications: First, a baby is a baby, but checkins vary from a one-line bug fix to a huge body of code. This can potentially be addressed by normalizing scores with respect to the amount of code reviewed.  Second, no single attribute of code health is as unambiguous as death. This is more troubling, but it can be approached the way Apgar approached infant health: devising a score and comparing it to actual results. In this case, the results might be defect counts and other measures of quality. 

Here is my first pass based on conversations with a few hackers: First, I would measure correctness as a raw count of identified defects.  For the remaining criteria, I would assign a rating of 0, 1 or 2 points.  The categories are:

  • Readability: (Inadequately documented or poor naming, Acceptable or NA, Clearly documented, well-chosen names)
  • Test coverage: (Inadequate, NA or marginal, Fully covered)
  • Simplicity: (More complex than necessary, Acceptable or NA, Complexity appropriate to requirements)
  • Performance: (Inadequate and material, NA or Immaterial, Appropriate to requirements)
  • Reuse: (Inadequate or inappropriate use of existing code, NA, Appropriate use of existing code)

Like babies and their Apgar scores, the code would be rated twice: Once on first submission and once with approval (unless, of course, it was approved on first review).
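One way to record such a score is sketched below in Python. Treat it as an illustration under assumptions rather than a finished design: the class and field names are hypothetical, the five 0-2 ratings are simply summed into a 10-point total the way Apgar's are, and defects per thousand lines stands in for "normalizing with respect to the amount of code reviewed."

```python
from dataclasses import dataclass

CATEGORIES = ("readability", "test_coverage", "simplicity", "performance", "reuse")


@dataclass
class ReviewScore:
    defects: int         # raw count of defects identified in the review
    lines_reviewed: int  # size of the change, used for normalization
    ratings: dict        # category name -> 0, 1 or 2

    def __post_init__(self):
        # Every category must be rated, and only 0, 1 or 2 are allowed.
        for name in CATEGORIES:
            if self.ratings.get(name) not in (0, 1, 2):
                raise ValueError(f"{name} must be rated 0, 1 or 2")

    def total(self):
        """Sum of the five ratings: 0 (worst) to 10 (best)."""
        return sum(self.ratings[name] for name in CATEGORIES)

    def defects_per_kloc(self):
        """Defect count normalized by the amount of code reviewed."""
        return 1000 * self.defects / self.lines_reviewed


# The same change, rated on first submission and again at approval.
first = ReviewScore(defects=3, lines_reviewed=600, ratings={
    "readability": 1, "test_coverage": 0, "simplicity": 2,
    "performance": 1, "reuse": 2})
approved = ReviewScore(defects=0, lines_reviewed=620, ratings={
    "readability": 2, "test_coverage": 2, "simplicity": 2,
    "performance": 1, "reuse": 2})
print(first.total(), first.defects_per_kloc(), approved.total())  # 6 5.0 9
```

Keeping both records per change lets you compare first-submission scores with later defect counts, which is the validation step Apgar did with mortality.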

Other, better approaches are possible.  What would you do? 

By themselves the scores do nothing to improve your process, just as Apgar scores alone don’t improve an infant’s health.  The important step, one that will challenge your knowledge and creativity, is to relate them to your other data, understand what this tells you about your process and invent ways to improve things. 


May 22, 2011 | 11 notes
#software engineering #metrics #code reviews

April 2011

4 posts

The pathology of estimates

I recently sparred gently on Twitter with Scott Ambler regarding an assertion that repeatedly renegotiating schedules is evidence of unethical behavior. Others on the thread equated the practice with lying.

Not so.

(Full disclosure: Scott is a highly respected thought-leader in the Agile community and the Chief Methodologist at IBM Rational. And I’m jealous.)

Bernie Madoff Syndrome

When we witness a problem, we search for the cause. Often the trail seems to lead to the (mis)behavior of a few ‘bad apples’.

Bad-apple theory suggests the recent near-collapse of the world-wide financial system was caused by the misdeeds of a handful of Bernie Madoff types. The solution is to punish these people and discourage imitators.

The theory is attractive because it has a simple story line. It’s morally satisfying. Cause and effect are tied neatly together, letting us off the hook from the hard work of changing the system itself. 

I’m not going to argue that lying and deceit don’t sometimes play a role, but we should be wary of attributing the success of some projects to the virtues of Agile, and the failure of others to individual vice. 

Many bad apples?

To find other explanations, we begin by observing that the sliding schedule problem is not uncommon. Time and software are old enemies:

“More projects go awry for lack of calendar-time than all other reasons combined.” Fred Brooks, The Mythical Man-Month.

While it’s possible that engineering management attracts a disproportionate share of incompetent and unethical people, it’s more likely that the fault is systemic. When a problem recurs regularly in spite of the best efforts of bright, resourceful people, we can assume it has deep roots. 

Bazerman and Watkins argue that some problems recur because they fall into a kind of “sweet spot” for failure, where political interest, organizational dysfunction and cognitive limits align against them. Software blowups fit this model nicely.

Bad brains, and good ones

Dr. Frederick Frankenstein: Ah! Very good. Would you mind telling me whose brain I DID put in? 
Igor: Abby someone. 
Dr. Frederick Frankenstein: [pause, then] Abby someone. Abby who? 
Igor: Abby… Normal. 

Unfortunately, it’s not just abnormal brains that cause problems. Daniel Kahneman won a Nobel Prize in Economics for his research on biased decision-making under uncertainty. Its application in the financial markets helped others win more lucrative prizes- even without cheating.

More broadly, Kahneman and his followers studied a host of common heuristics and biases. One finding: people have a clear bias towards optimistic time estimates known as the Planning Fallacy.

“The phenomenon is not limited to commercial mega-projects… and its occurrence does not depend on deliberate deceit or untested technologies.”

In fact it’s been shown to be true with even simple household tasks. Better still, a person can be pessimistic about plans in general, e.g. “Software projects always run late” and still be too optimistic in their own planning: “I think I can finish on time.”

Ironically, more detailed planning can actually make people more optimistic.  The theory is that optimism derives from a mental image of success; more detail makes for a more compelling image.

Another common cure, having people estimate their own work, may also backfire. While this should increase commitment and incentives for accuracy, it goes against human nature: People tend to be optimistic regarding their own plans, but more realistic regarding the plans of others.

While the planning fallacy affects estimates generally, the heuristic called Anchoring and Adjustment undermines our attempts at revision. People often unconsciously start with a reference point (the ‘anchor’) and then ‘adjust’ from that to derive an estimate. The problem is that the adjustments are usually too small. The current scheduled completion date can be an anchor that contaminates subsequent efforts to create a viable schedule. 

From Madoff to Gandhi

How powerful is this effect? People were asked to estimate how old Gandhi was when he died, but first, half were asked if he was older than 9. The others were asked if he was younger than 140. (Obviously his age was between the two when he died.) The first group had an average estimate of 50; the others an average of 67: a difference of 17 years due to completely irrelevant anchors.

Then there’s The Confirmation Trap. Given an assertion like “We can do that in six months,” we have a strong and harmful tendency to seek supporting evidence and stop when we find some. But the presence of supporting evidence doesn’t make a plan achievable. Plans can only be proven infeasible beforehand. Searching for proof that a plan is infeasible is an uncommon project activity, to say the least.

How common are these problems? In the words of one researcher: “One of the most robust findings in the psychology of prediction is that people’s predictions tend to be optimistically biased.” If you think you’re immune, you’re suffering from a positive illusion bias. Such illusions not only harm our estimates, they lead us to think that everything will work out fine, undermining our motivation to do something before things get out of hand. Other biases have similar effects:

  • Discounting the Future: We tend to avoid small costs in the present that would prevent large problems in the future. 
  • Status quo bias: We tend to avoid actions involving any clear harm, even when the positive benefits of the action outweigh the negatives greatly.

In short, human nature leads us astray. It leads us to underestimate the task at hand and to delay corrective action when needed. If that qualifies as unethical, then we’re all guilty at least some of the time:

“when we make mistakes, we shrug and say that we are human. As bats are batty and slugs are sluggish, our own species is synonymous with screwing up” - Kathryn Schulz, Being Wrong: Adventures in the Margin of Error.

These biases partly explain the persistence of sliding schedules. If nothing else, they should make us think twice about hammering people who miss a deadline.  I plan to follow this up with posts on other factors that may play a role, but I’d rather not say exactly when I’ll be done.

(Sources on cognitive limits include Schulz, Bazerman & Watkins, Kahneman et al., and Gilovich et al. All are listed in the bibliography with links to their Amazon pages.

Mel Brooks’ Young Frankenstein was my source on problems arising from installing an abnormal brain in a giant, home-made creature.)

Apr 10, 2011 | 15 notes
#software engineering #estimation

Software, Metrics and Ethics

“It’s impossible to move, to live, to operate at any level without leaving traces, bits, seemingly meaningless fragments of personal information.” William Gibson

One of the themes of this site is that the lack of transparency in the development process is a leading cause of mis-management. This need not be the case.

Nearly every aspect of software development leaves a digital trace. Analyzing those traces can help eliminate the fog surrounding software development. I believe the current state of the process can be made available to decision makers. I also believe, though it’s unproven, that the quality and productivity of teams and individual hackers can be measured by analyzing the traces the process leaves behind.

Whether or not that assertion is true, it raises the ethical question implied by the quote from William Gibson: Under what circumstances is it acceptable to put the process under the microscope?

The question is not new, though it is new to software development, where what hackers do is generally thought too complex to measure. In my experience, most programmers believe a great hacker is several times more productive than a marginal hacker, while simultaneously believing that it’s impossible to measure hacker productivity.

There is good reason to suggest that this is not the case. The next time you tell your phone to play a song or have Google translate something, remember that you’re watching statistical natural language processing at work, and that natural languages are far more complex than programming languages. Their complexity hasn’t prevented valuable progress from being made.  Look more broadly and examples abound of the successful application of statistics in systems that are more complex than software development. Our metrics are weak not because software is so complex, but because our data sucks.

There is also good reason not to ignore the question of hacker productivity. In the long run, the only way to keep programming jobs in high wage locations is through demonstrably superior productivity. 

The question of how to measure performance in an ethical and non-threatening way is old news in industries where statistical process control (SPC) is common.  I had the privilege years ago to study with W. Edwards Deming at NYU, who was renowned for having taught SPC in Japan after the war. Japan, of course, taught it to the rest of the world by decimating their low-quality competitors. If you can drive to work without wondering if your car will break down, you owe something to Deming. 

In Deming’s view the ultimate goal of process improvement was to “provide jobs and more jobs.” He saw this both as a moral imperative and a practical necessity: only “driving out fear” could prevent the sabotage of the metrics needed for SPC. Because of that, he spent relatively less time discussing math and more time teaching managers how NOT to misinterpret data. And he consistently emphasized employee morale and security.[1]

If we are to make effective use of data in software engineering we need to be equally vigilant. The data must be used only to (1) help lower-performing individuals improve, and (2) help move the team as a whole to a higher average. If it turns out that one of your people just doesn’t have the ability to be a strong hacker, it’s on you to find a way for them to contribute. If this happens often, you need to work on your hiring process. Having good data may help. What you don’t do is fire them. The first time someone uses your data as grounds for termination, you’re lost.

[An aside: Other people are starting to move in the direction of analyzing data produced as a normal part of the development process. Michael Feathers had a recent post on his excellent blog that mentions several different SCM-mining efforts underway, though they’re a bit different from what I have in mind.]

[1] Well, that, and the infinite stupidity of America’s corporate leadership. He was a pretty cranky guy on that subject. 

Apr 5, 2011 | 1 note
#metrics #empirical software engineering

Finding meaning in manual tests

How do you assess the overall quality of your application when you have too many manual/functional acceptance tests to run them all after every sprint?  Perhaps you’ve been working on an application for some time and want to predict when the quality will be good enough to ship.

(Here some will say, “We don’t need manual tests; we have unit tests for everything.” If your automated suites thoroughly test integration and fully exercise your UI, fine. Otherwise, we’ll assume that you need or want to augment your automated tests.)

One approach is to run all the manual tests for a functional area with each iteration. This is often coordinated with a push to fix bugs in the same area. It’s an efficient way to use testing resources, and when coordinated with a bug sweep, it helps you find the things you broke when you swept.  

Be aware, though, that it tells you little about the quality of the entire application. A different approach, which can be used in combination with a focused testing effort, is to select a set of tests at random and execute those. 

Specifically:

  • select a different random set of tests to run with each iteration.
  • execute each test and record whether it passes or fails.
  • calculate an overall pass rate for the suite.

Easy. Now what do you do with the failing tests? In terms of learning about your application, it doesn’t matter whether you fix the issue or not - but it’s essential that if you do fix it, you don’t change the original pass rate. That just pollutes your data. 
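The three steps above can be sketched in Python as follows. The helper names `pick_tests` and `pass_rate` are hypothetical, and the test names and outcomes are made up for illustration.

```python
import random


def pick_tests(all_tests, sample_size, seed=None):
    """Draw a simple random sample of test cases, without replacement."""
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    return rng.sample(all_tests, sample_size)


def pass_rate(results):
    """Fraction of executed tests that passed; results maps test name -> bool."""
    return sum(results.values()) / len(results)


# Record the outcome of the first run and leave it alone,
# even if the underlying bug gets fixed later.
first_run = {"login": True, "checkout": False, "search": True, "export": True}
print(pass_rate(first_run))  # 0.75
```

Drawing a fresh sample each iteration matters: rerunning the same subset measures those tests, not the suite.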

Let’s say 90% of your sample tests pass. Can you assume that 90% of the tests you didn’t run would also pass? Not necessarily. What’s cool about sampling is that it tells you how much to trust your results.

How many tests is enough?

To know how many tests to run for a given level of precision, you can use a sample size calculator like the one at http://www.surveysystem.com/sscalc.htm. 

To calculate sample size, you have to provide some guidance. First, tell it how many tests you’re sampling from. This is your population. Let’s assume you have 1,000 tests.

Next, select a confidence interval of, say, plus or minus 5%. If your sample tests pass at 90%, you can now say the pass rate for all tests (run and not run combined) is probably 90% plus or minus 5%, i.e. between 85% and 95%.

Note that I said “probably.” To be more specific, select a confidence level (usually 95% or 99%).  If you pick 95%, you can now be very specific about what “probably” means: “I’m 95% sure the pass rate for all tests is between 85% and 95%.” Or rather, you could say that if your sample size is big enough. In this case, the calculator shows that you’d have to run 278 randomly selected tests for that level of precision.  
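If you would rather not depend on a web page, the same number falls out of the textbook sample size formula for a proportion with a finite population correction. The sketch below assumes the worst-case pass rate of 50% (which gives the largest sample) and rounds up; I am assuming the calculator does something similar, and the function name is my own.

```python
import math
from statistics import NormalDist

def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Tests to run for a given margin of error (plus or minus `margin`)."""
    # z-score for a two-sided confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size if the population were effectively infinite.
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Finite population correction: sampling from a small pool needs fewer tests.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(required_sample_size(1000, 0.05, 0.95))  # 278
```

Tightening the margin to plus or minus 3% pushes the number up sharply, which is worth trying before you promise anyone that level of precision.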

The moral of the story

If that seems like a lot of tests for little precision, then you’ve uncovered the most important lesson here. Think about how many times you’ve seen someone use a similar pass rate, taken from an even smaller, non-random sample, and act like it was perfectly accurate: “Last month we passed 91%, this month we passed at 90%. Why are we getting worse?” 

If you’re using sampling, you know that a difference that small is probably meaningless. The real value of being precise about the limits of your knowledge is that it can keep you from chasing random fluctuations and making things worse. The way to judge your improvement is to wait until you have a handful of results, plot them and look for trends.
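To put a number on "probably meaningless," you can turn the sample size formula around and ask what margin of error a given sample buys you. This is a rough sketch under the same assumptions as before (worst-case 50% pass rate, finite population correction, a made-up function name). It is not a formal two-sample test; the difference between two noisy numbers is noisier still, so if anything it understates the case.

```python
import math
from statistics import NormalDist

def margin_of_error(sample_size, population, confidence=0.95, p=0.5):
    """Half-width of the confidence interval for a sampled pass rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error * fpc

margin = margin_of_error(278, 1000)
change = abs(0.91 - 0.90)

print(f"margin of error: +/-{margin:.1%}, month-over-month change: {change:.1%}")
print("within the noise" if change <= margin else "worth a closer look")
```

With 278 tests out of 1,000, the margin comes out at roughly plus or minus 5%, so a one-point drop sits well inside it.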

The frame is not the universe

Before ending, we should be clear about one more thing: All we really know is the pass rate for our tests. We’ve been making an implicit assumption that the suite would provide an accurate measure of overall quality if we ran every test in it. That remains to be proven.

If you think of each test as exercising a particular path through the application, then some terms from sampling theory can help make the remaining limits of our understanding clearer:

Universe: What we really want to measure. In this case perhaps our quality over the set of all possible user paths.

Frame: The set of accessible paths from which we draw our sample. In this case, the list of all the paths we’ve documented as unique test cases.

Sample: A randomly selected subset of the frame.  

In software terms, we need to understand the coverage our test suite provides. There are numerous ways we can define coverage, but that’s a subject for another day. 

Apr 4, 2011
#metrics #quality
A note on quality

“We never use a screwdriver in the last week. We hammer the screws in. We slam solder on the connections, cannibalize parts from other televisions if we run out of the right ones, use glue or hammers to fix switches that were never meant for that model. All the time management is pressing us to work faster, to make the target so we all get our bonuses.” Worker in a Soviet television factory quoted in Milgrom & Roberts: Economics, Organization and Management.

Every hacker, at one time or another, has committed the software equivalent of hammering in screws when a deadline approaches. It’s understood that quality suffers, but quality can mean many things. Here we talk about three kinds of quality: design quality, conformance quality and total quality.[1]

Design Quality is a statement of intent, a measure of how the product as designed would appeal to the market’s true needs and desires if it were made perfectly. It has nothing to do with the hidden details of the internal architecture and everything to do with the user. The original iPhone had great design quality.

Conformance Quality is the degree to which the actual product reflects that design.

A product with few bugs, but many issues closed as “Works as designed” may have good conformance quality and poor design quality. Usability issues are design quality problems. Designing a product no one wants, that costs too much to operate or that isn’t competitive are others. In our simple model of software development, what we’ve labeled “quality” is conformance quality.  We’ll add design quality when we add customers to the model.

Typical bugs are conformance failures—the product doesn’t perform as specified. But if you host your software, so are performance problems, deployment problems, hardware issues, scaling problems. All detract from the user experience and create a gap between the intended value and what you delivered.

Total Quality is the combination of design quality and conformance quality. The combination is not additive: poor design quality destroys the product’s value, even if the conformance quality is very high. A product has good total quality when the implementation conforms closely to a design that meets the market’s expectations.  

We distinguish between design quality and conformance quality for a reason: Most software organizations invest far more in finding bugs than they do in the quality of their requirements, usability, or the competitiveness of the overall design. For startups, at least, this is starting to change as more adopt the Lean Startup approach that makes customer development (essentially market research) an equal partner with product development. 

[1] The distinction between these three definitions of quality is stolen with pride from Kaoru Ishikawa. 

Apr 1, 2011
#lean startup #quality #software engineering

March 2011

5 posts

A note on defect discovery

When people see defect discovery in the development model, they naturally think of quality assurance engineers hammering away on the product. But that’s just one approach, and not the most effective. (The most effective way is to demo the application to an important customer.)

Defect discovery also includes:

  • Static code analysis (including that performed by the IDE)
  • Compilation
  • Manual desktop testing by hackers
  • Code reviews
  • Unit tests
  • Integration tests
  • Customer beta tests
  • Demos (of course)

And so on.

We rarely think of what the hacker does as defect discovery, but much of it is.  One methodology developed at IBM in the eighties took developer testing so seriously they eliminated it. The Cleanroom approach aimed to provide both near zero defect code and guaranteed levels of reliability. And did.[1]

The approach included several novel elements, but the strangest to current practitioners is that they took away the compilers so no testing could be done by developers.  All test executions were performed by a separate team and the results were recorded. Having a complete history of all test executions helped enable the creation of probability models that could forecast the number of errors in production. 

Viewed another way, Cleanroom Development tells you not just how many known defects there are, but how many unknown defects are in the application.[2] On the day you launch, that’s one of the two things you most want to know. 

[1] This is not the only reason why no one uses it.

[2] More precisely, it let you project confidently the mean-time-to-failure (MTTF) for the application.

Mar 30, 2011
#Rework-Cycle #software engineering #Cleanroom Software Engineering #Defect Discovery
Romantic Agile and the universal theory of big software

“Experience alone, without theory, teaches management nothing about what to do to improve quality and competitive position, nor how to do it.” W. Edwards Deming

For some people, software engineering is a solved problem and Agile is the solution. If a project is small enough, management enlightened enough, and customers sufficiently supportive or powerless  - in other words, if it’s an easy project - Agile is the way to go. 

But it’s not enough: 

  • It leaves open the question of how to continue improving. How do you outperform the other “Agilistas” and build a firm-specific advantage in engineering? 
  • As an individual, you compete globally with engineers also using Agile who have a significant cost advantage. How will you feed your kids ten years from now?
  • Software projects differ. Complex projects differ from simple ones. Web software differs from medical device firmware. When goals are different, any technique must support some better than others. To customize the process you need to combine a solid theory with intense study of your own situation.
  • Agile methods focus heavily on coding and testing and say less about things like requirements and operations. It’s impossible to optimize an entire process while focusing on one part. 
  • Agile is likely not the final advance in development methodology. Open Source, for example, is a far more radical rethinking of software production, and one with important lessons for developers. 

In manufacturing, the techniques that make up Just-in-Time (to which Agile is often compared) are less important than the realization that every aspect of the process is a control to be manipulated, coupled with a “relentless pursuit of understanding and improvement.” 

Which brings to mind another lesson from Just-in-Time, the distinction between “pragmatic” JIT - characterized by a patient and exhaustive focus “on the concrete details of the production process” - and the “quasi-mystical hyperbole” of “romantic JIT.”  Much of the writing about Agile (“People over Processes”) tends towards the romantic, making it simultaneously more appealing and less useful.  

None of this makes Agile ‘wrong.’ It’s not, but too many people use it as a cookbook, and when they need to improvise they’re stuck.  They need to learn to cook, not just follow recipes. 

Is there a ‘universal theory of big software’?  I think the pieces exist, rooted in economics, systems dynamics and other disciplines.  They’re just waiting for someone to pull them all together. 

That someone is not me.

I’m reminded of William James’s remarks on his own ground-breaking text Principles of Psychology. He called it “a loathsome, distended, tumefied, bloated, dropsical mass, testifying to but two facts: 1st, that there is no such thing as a science of psychology, and 2nd, that WJ is an incapable.” 

In the posts that follow, I hope to produce something worthy of similar praise. 

Mar 29, 2011
#agile #software engineering
Build your own Airline Reservation System

“Air Canada suspended activity related to the implementation of a new reservations system under development with ITA Software.

The carrier recorded a second-quarter impairment charge of C$67 million (US$61.9 million) related to the development of the system, dubbed Polaris.” - Air Transport World, June 2010.

I worked on Polaris and was responsible for a few parts. Problems were not unexpected. Of nine major project failures listed in one paper, two were airline reservation systems: A United Airlines system cancelled in the early 70s after $50 million was spent, and an American Airlines system cancelled in 1992 after burning through $125 million. 

Airline reservation systems (ARS) are among the largest and most complex ever built.  The fact that there were no failures listed after 1992 is not because people figured out how to build them, but because - to the best of my knowledge - no one attempted a new, from-the-ground-up reservation system for a major international carrier between American’s attempt and Polaris. 

Although Air Canada timed out and walked away, the work on Polaris has borne fruit.  The flight schedule management system we built is in production.  The inventory control system has been purchased by American Airlines, and the core reservation system and departure control systems are being considered by other airlines. 

What made Polaris interesting to me was its size: hundreds of person-years of effort, with several hundred people working on it at its peak. I’d been building large software systems for nearly 20 years and thought I knew it all, but I saw immediately that I didn’t know how to build something this big and complex.

Now I feel like I’m starting to get it. In subsequent posts I’ll attempt to describe what I think I learned, and then you too can build an ARS for fun and profit!

Mar 28, 2011
Ramblings about software development

This is where I write down all the stuff I’m thinking about before I forget it. 

Mar 26, 2011