SPLASH 2013
Sat 26 - Thu 31 October 2013 Indianapolis, United States

Point

Shriram Krishnamurthi, Brown U, USA

The software sciences are rich in artifacts: programs, yes, but also datasets, tests, execution logs, models, and more. Yet, our research is evaluated entirely on the content of bitmapped encodings of partial views of these artifacts. In reality, the work is embodied at least as much in this constellation of artifacts as in the paper that describes them. It’s time artifacts got their due.

How did we arrive at the current state of affairs? Decades ago, our discipline made the important step from the specific to the general: one could no longer publish a paper merely for having built something, but rather needed to learn something from the experience and distill that learning into packaged, reusable scientific knowledge. This was a necessary step for the discipline. In making this shift, however, we have devalued the artifacts that embody and lead to that knowledge. We should restore this missing balance.

Naturally, not every paper needs to be accompanied by an artifact. In particular, entirely theoretical or speculative ideas are as valuable as ever. However, an artifact-driven paper that is not accompanied by an artifact that can be evaluated independently should perhaps rightly be regarded with some suspicion. It remains — as it already is — the responsibility of program committees to assess these submissions. Formalizing artifact evaluation merely forces authors to be clear on what type of paper they are claiming to submit, so program committees can arrive at more informed decisions.

What might we learn from evaluating artifacts? The most negative interpretation is that they increase honesty and decrease outright fraud. More neutrally, they help reviewers form a fuller picture of what a paper accomplishes: for instance, where the artifacts are executable, reviewers can experiment with new inputs, building a better understanding of what the system does. In the best case, the process can even teach reviewers new things that the papers did not cover.

The beauty and power of the peer review process is that, by applying diverse viewpoints to an effort, it raises questions the original authors did not consider. The more inputs the reviewers have to work with, the better they can function.

Counterpoint

James Noble, Victoria U Wellington, New Zealand

Science is about knowledge. Engineering is about products. We should not confuse the two, but artifact evaluation rests on exactly this confusion. Ideas, algorithms, and studies should not be accepted based on how easy they are to obtain, install, or run, or on how much memory it takes to start a virtual machine.

Artifact evaluation is biased against many kinds of research. Good human-ethics research practice requires confidentiality or anonymity. Companies often need to keep their code (and especially their customers’ code) in house. Double-blind artifact evaluation is biased against large and complex systems of systems: how can you evaluate a web-based system without leaking identity, other than by virtualizing the entire infrastructure? Under some qualitative research methods (such as grounded theory) it simply does not make sense to “re-analyze” data: reproducibility requires conducting new studies.

Finally, artifact evaluation discourages innovation and encourages rent-seeking. Toolsmiths have a vested interest in encouraging researchers to build upon their mature tools, rather than hacking together quick prototypes or just working things out on paper. New programs, languages, systems, applications, and proof techniques will obviously be more fragile than established ones. In no other research discipline are papers evaluated on how easily experimental equipment can be packaged up and moved into other labs, rather than on the quality of the results, descriptions, and analyses.

Poster: debate.pdf (190 KiB)