Measuring productivity thru code-mining

Most programmers believe a great hacker is several times more productive than a marginal hacker, while simultaneously believing that it’s impossible to measure hacker productivity.

I believe that quote because I wrote it.

I also believe you can measure hacker productivity by looking at the code they write, and for more or less the same reason. I developed a measure of productivity and, with some friends, ran some experiments with it. In this post, I’ll tell you how to calculate it (it’s easy), and how to use these powers - along with measures of quality and test coverage, of course - for good and not evil.

When I say measure, I mean to reduce uncertainty through observation. I don’t mean perfection. This measure is imperfect, but useful. And it only tells you about output, not whether you’re building the right thing. You have Agile and Lean Startup tools for that.

Looking at code is key: In no other craft is the record so complete. Despite limitations in version control systems[0], your source code revision history contains the most complete and accurate development data available. Bug and project data don’t come close, though they get more attention.

The metrics:

Productivity is a measure of change over time, so the process has three steps:

  1. Measure one version of some code
  2. Measure how much it differs from another version
  3. Total those differences over some time intervals (code changes per week, for example)

That’s it. Because the approach is rooted in information theory, I refer (pretentiously and imprecisely) to the size metric as the “information content” of the code, and the change metric as the “information distance” between two versions of the same code. They’re based on the insight that compression can be used to measure the ‘distance’ between two texts, first presented in a paper called Language Trees and Zipping by Benedetto, Caglioti, and Loreto. Their (controversial) findings were summarized by the Economist.

And now, our first metric: The ‘information content’ of a program is its compressed size in bytes. Why compressed? In short, because it squeezes out the redundancy and boilerplate in the code. And because we need it compressed for the second metric.

Here’s the other metric: The ‘information distance’ between two versions of the same code is the compressed size of both versions concatenated, minus the compressed size of the original. To convert this to a productivity measure, sum the distance measures over some time interval. 
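
To make that concrete, here’s a minimal sketch in Python. I’m using the standard bz2 module as a stand-in compressor purely for convenience; as noted in the caveats below, we got better results with libbsc, so treat the compressor choice and function names here as placeholders rather than the exact implementation.

    import bz2

    def information_content(source: bytes) -> int:
        # Compressed size of the code, in bytes.
        return len(bz2.compress(source))

    def information_distance(original: bytes, revised: bytes) -> int:
        # Compressed size of both versions concatenated,
        # minus the compressed size of the original.
        return len(bz2.compress(original + revised)) - len(bz2.compress(original))

    def productivity(changes):
        # changes: iterable of (old_bytes, new_bytes) pairs committed
        # during the interval you care about (a week, say).
        return sum(information_distance(old, new) for old, new in changes)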

A little light theory

In information theory, the Kolmogorov complexity of a string is defined as the length of its smallest representation in a given language. The more regular (less complex) a string is, the smaller its compressed representation. Hence ‘aaaaaaaaaaaa’ can be represented in fewer bytes than a random string of the same length.

Because compression algorithms approach this theoretical minimum, the relative size of two strings compressed with the same algorithm is an approximation of their Kolmogorov complexity. Given its regularity, a program file consisting mainly of accessors (“public void setFoo(String foo)…”) should compress better than one where each line is unique. This fits roughly with notions of simple and complex code. Compression also minimizes the effect of differences in formatting, variable naming, and so on.
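
If you want to see that for yourself, a tiny experiment makes the point (again with bz2 standing in for the compressor - my choice here, not necessarily what we used):

    import bz2, os

    regular = b"a" * 1000        # highly regular: compresses to a few dozen bytes
    noise = os.urandom(1000)     # random bytes: barely compresses at all

    print(len(bz2.compress(regular)))   # small
    print(len(bz2.compress(noise)))     # roughly 1000 bytes, plus header overhead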

Why I do it this way

The measures combine two key aspects of output: the amount and complexity of the code produced. There are other nice properties:

  • It’s fairly easy to implement
  • It has a reasonable theoretical basis
  • It’s language independent: the same implementation can be used to measure Ruby, Java, Python, Lisp, etc.[1]. Metrics like Cyclomatic Complexity require language-specific implementations.
  • It can be used to measure HTML, XML, CSS, maybe even documentation, as well as actual program code. Maintaining these files is a big part of what hackers do.
  • All changes produce a positive result.[2] If you refactor code to make it simpler and smaller, the zipped combination will still be larger than the zipped original, so it shows as an addition to productivity. That’s not the case with LOC, NCSS and Cyclomatic Complexity, but going from 500 LOC to 480 really is progress and should be counted as such.[3]

Yeah, but does it work? 

We take the most basic definition of ‘work’: larger differences between two files should consistently produce larger information distance values, and smaller differences should produce smaller ones.

To test that, we reasoned that each time a file is edited it tends to diverge further from the original. We compared over 100 versions of a single file individually against the original and plotted the calculated distances in version order. As expected, each subsequent version was further from the original than its predecessors - except in a few cases. In those cases, code had been removed that was also not in the original, so that version really was more like the original than its predecessor. All this suggests the method is a reasonably consistent measure of change.
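
If you want to repeat that check, here’s roughly how it might look against a git repository (footnote [0] suggests our original data came from SVN, so the tooling below is an assumption, and the file path is hypothetical). Each committed version is compared against the first, and the distances are printed in commit order:

    import bz2
    import subprocess

    def distance(original: bytes, version: bytes) -> int:
        both = len(bz2.compress(original + version))
        return both - len(bz2.compress(original))

    def file_versions(path):
        # Every committed version of one file, oldest first (git-specific).
        revs = subprocess.run(
            ["git", "log", "--reverse", "--format=%H", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        return [
            subprocess.run(["git", "show", f"{rev}:{path}"],
                           capture_output=True, check=True).stdout
            for rev in revs
        ]

    versions = file_versions("src/parser.py")   # hypothetical path
    original = versions[0]
    for i, v in enumerate(versions[1:], start=1):
        print(i, distance(original, v))         # should mostly increase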

Possible applications

We tried this metric as part of a larger code-mining experiment. In it, we calculated the changes between every version of every file for a multi-year project. Here are a few findings: 

  • Productivity did not vary with experience, though I thought it would.
  • Neither adding nor reducing headcount made productivity go up. Cost obviously does vary with team size, so that’s something to consider when sizing your team.
  • That whole “10X” thing, is it for real? It appeared to be.

Measuring the relative investment in test versus production code is an interesting application. We simply calculated productivity separately for production and test code and looked at the ratio. If you know what your total staffing costs are, this is an easy way to estimate what you spent on tests. Comparing this ratio across projects could help quantify the cost/benefit ratio for automation.
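
A sketch of that calculation (the distance helper is repeated so the snippet stands alone); the ‘test’ path convention is an assumption - adjust it to your own layout:

    import bz2

    def information_distance(original: bytes, revised: bytes) -> int:
        return len(bz2.compress(original + revised)) - len(bz2.compress(original))

    def test_vs_production(changes):
        # changes: iterable of (path, old_bytes, new_bytes) tuples mined
        # from the repository over some interval.
        test = prod = 0
        for path, old, new in changes:
            d = information_distance(old, new)
            if "/test/" in path or path.startswith("test/"):
                test += d
            else:
                prod += d
        return test, prod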

Measuring the productivity of individual hackers

An obvious application is to see who writes the most code. Since we pulled all the metadata from our repository with each change set, this was fairly easy to calculate. This metric is fraught with danger, however, so proceed with caution. You won’t be able to resist peeking, but don’t publicize these numbers or use them for reviews. Don’t even tell your boss unless you’re sure she can show the same restraint.
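
For what it’s worth, the aggregation itself is just a group-by on the commit author (a sketch, with the same plea for restraint):

    import bz2
    from collections import defaultdict

    def information_distance(original: bytes, revised: bytes) -> int:
        return len(bz2.compress(original + revised)) - len(bz2.compress(original))

    def productivity_by_author(change_sets):
        # change_sets: iterable of (author, old_bytes, new_bytes) tuples,
        # one per file change, with the author taken from the commit metadata.
        totals = defaultdict(int)
        for author, old, new in change_sets:
            totals[author] += information_distance(old, new)
        return dict(totals)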

And - please - don’t try to directly manage this number (“10% more productivity next quarter, please!”). Productivity is a dependent variable; focus on variables that are both (1) relatively independent and (2) high leverage. More on this in another post I’ve been writing for like a year now. (speaking of productivity.)

Caveats:

  • These ideas are experimental. All metrics should be subjected to repeated validation, but I decided to put this one out there now because I’m not sure when or if I’ll get back to it.
  • Always consider the sum of a person’s contribution. Some people who scored high on productivity reported very few bugs over the same period. Writing good bug reports is a useful activity that reduces the time available for coding.
  • The choice of compression algorithm is critical. Gzip produces inconsistent results. We got our best results using libbsc on the default settings. YMMV.
  • If you measure productivity and don’t track quality, you’ll get what you deserve.

Footnotes:

[0] SVN didn’t track the connection between renamed files very well. 

[1] If you want to strip comments before zipping, you need separate code for C-style comments, etc. Also note that I’m not suggesting that results for different languages are comparable.

[2] If a file is removed completely, the distance would be 0. This isn’t awful, though, because removing a file is easy, and you could probably compensate by adding a few points.  Changing other code to make a file obsolete may involve real work, but that would be picked up in the distance metrics.

[3] You can count lines added, removed, and changed, but the way diff algorithms decide when something has been changed, rather than removed and added, often seems arbitrary, if not just wrong.
