Artifact evaluation: theory and practice

Previously on the SIGOPS blog, we discussed the current status of artifact evaluation (AE) and how to improve it. At HotOS’23, we took another step in this direction: attendees discussed artifact evaluation in a panel led by Shriram Krishnamurthi (Brown), Margo Seltzer (UBC), and Neeraja Yadwadkar (UT Austin), and organized by Roberta De Viti (MPI-SWS), Solal Pirelli (EPFL), and Vaastav Anand (MPI-SWS). This blog post is the opinion of the panel organizers, and not necessarily that of the panelists or the HotOS chairs. Additionally, the panel organizers would like to thank Anjo Vahldiek-Oberwagner for providing feedback on the panel report as well as the opinions presented here.

As a reminder, AE is a process through which artifacts can be recognized as high-quality by awarding badges to the corresponding papers. The goals are to improve the reliability of computer science research and to make it easier for researchers to build upon each other’s work.

The key theme of this blog post is the gap between artifact evaluation in theory, as seen by some AE chairs, and in practice, as reported by some authors and evaluators. The effort required to make AE work in practice today is high, and the benefits are not always aligned with what the community expects. We propose a way forward that brings the theory closer to the expectations reported by the HotOS attendees, and that should be easier to apply consistently in practice.

The full report is available on arXiv. One important outcome of the panel discussion, which should frame future conversations about AE, is that the current “three badges” system is of unclear origin and is notably not what was intended for the first artifact evaluation. We should therefore not treat these badges as constraints written in stone; other possible classifications include the U.S. National Academies’ and Feitelson’s.

Availability

Current systems AEs award the “artifact available” badge if the artifact “exists” on some online repository. In theory, this works out well as long as authors do not forget any files, accidentally delete repositories, and so on. In practice, the HotOS panel audience felt that this does not always work well. The audience reported that some “available” artifacts only contain binaries, some are only partially available… in one instance, an artifact contained nothing at all. It is unclear why the latter passed “availability” evaluation; perhaps the evaluators trusted the authors to upload the artifact later and the authors forgot to do so, or the authors accidentally deleted their repository afterwards. While archival services such as Zenodo exist, current badges only recommend them rather than require them.

Functionality

Current systems AEs award the “artifact functional” badge if the artifact is “functional”. In theory, this works out well if the evaluators have enough time and expertise to review the artifact in depth, and if there are clear, standard guidelines that define functionality. In practice, the HotOS panel audience felt that whether an artifact can be executed is not as important as whether another researcher could use the artifact as a baseline for their own experiments, or as a building block for their own system. The latter is closer to what the programming languages and software engineering communities currently look for when evaluating artifacts.

Reproducibility

Current systems AEs award the “results reproduced” badge if the results in the paper can be reproduced up to “some tolerance”. In theory, this works out well if evaluators have enough time and expertise not only to reproduce results but also to interpret deviations that may be due, e.g., to the type of experiment or the use of different hardware. In practice, the HotOS panel audience felt that full reproducibility for all artifacts is not worth the time it demands of (i) evaluators, most of whom are students who already have enough work to do and are under enough pressure, and (ii) authors, who may need to spend significant extra time packaging their artifact, especially if it has non-trivial dependencies such as specialized hardware. Furthermore, evaluators and chairs may also have reasonably different definitions of “some tolerance”. As with functionality, the audience felt that evaluating whether the artifact is usable by others for future work would be a better time-reward tradeoff for all the parties involved.

Incentives

The three badges for availability, functionality, and reproducibility are an incentive for authors to submit their artifact for evaluation. In theory, this works out well: authors can at least make their artifact available, and perhaps functional, even if they do not have the time to make it reproducible. In practice, the HotOS panel audience was concerned that badges may not only be performative but may also disincentivize research that does not fit well within the AE framework, since badges can be seen as a prerequisite for being taken seriously. For instance, designing new hardware systems is an important area of computer systems research, yet it would be impractical to ship an entire custom server to evaluators. While current AEs typically allow video recordings as a fallback when there is no other way to show that an artifact works, it is quite difficult to evaluate functionality or reproducibility based on a video.

A way forward

We propose to decouple whether a paper’s claims are reproducible and whether an artifact can be reused:

  • Reliability tracks. Provide first-class tracks to support or contradict past claims in systems venues, instead of requiring AE to evaluate reproduction in a short time frame.
  • Reusability badging. AEs should award a single badge, reusability, whose definition should be shared across venues and publishers.

The purpose of reliability tracks would be to publish short papers that support, or contradict, the claims made by prior work. Such work must involve both testing the original artifacts if available, also known as reproduction, and performing a new study without reusing the original artifacts, also known as replication. For such tracks to be taken seriously, rather than seen as “cheap” ways to increment one’s paper count, the bar must be high. Replication must be required; merely running existing artifacts and calling it a day is not enough. The authors of the work under study must be given a chance to respond if their claims are contradicted.

Reliability tracks would also serve as clear on-ramps for budding researchers, especially those without easy access to mentorship. These tracks would give early-stage researchers clear targets to work on, and the process might also stimulate research ideas and help undergraduates decide whether or not to pursue research.

The reusability badge should reward artifacts that are public and can be productively used by any reasonable researcher without hand-holding from the original authors. The work needed to enable this is necessary anyway to ensure that future members of a research group can be productive after the original author has moved on. Evaluation should follow the same principles as reviewing a contribution from a collaborator: the goal is to spot mistakes and improve the artifact.

We advocate for precise checklists for artifacts, rather than abstract high-level goals left to the interpretation of individual evaluators. We also think that each type of artifact needs its own checklist: not all artifacts are code, so not all artifacts can be grouped under the same checklist. We believe the community should discuss what types of artifacts the AE process should consider, and what checklist would be appropriate for each type.

To give an initial example, we propose a checklist for high-level software not requiring specialized hardware:

  • The paper includes an appendix detailing what the artifact does and does not contain, and the paper reviewers agree this is enough.
  • The artifact’s source code is available on a public archive with irrevocable versioning and a clear long-term storage policy, such as Zenodo but not GitHub.
  • The artifact corresponds to the design described in the paper at the module level.
  • The artifact includes documentation explaining: (i) what claims it backs up and how; (ii) what dependencies it requires; (iii) how to run an example experiment, on any environment that satisfies the dependencies; and (iv) how to interpret all output from such an example in terms of the paper’s claims.
  • The artifact includes all data and other inputs to the paper’s experiments, such as packet traces used to benchmark a firewall.
  • The artifact reasonably minimizes the human effort necessary to use it, for example by using scripts to replicate configuration instead of requiring users to copy and paste commands (a minimal sketch of such a script follows this list).
  • The artifact has a license that allows use for research purposes including comparison and extension, such as the CC-BY or MIT licenses.
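
To make the items on documentation and on minimizing human effort concrete, here is a minimal sketch, in Python, of the kind of experiment-runner script we have in mind. All names in it (kvstore_bench, inputs/sample.trace, Figure 5, Section 6.1) are hypothetical and purely illustrative; the point is that a single command runs a small experiment, fails with an actionable message if setup is incomplete, and states which claim of the paper its output relates to.

```python
#!/usr/bin/env python3
"""Hypothetical run_example.py for an artifact accompanying a paper on an
imaginary key-value store. All names below are illustrative."""

import os
import statistics
import subprocess
import sys

BINARY = "./kvstore_bench"        # hypothetical benchmark built by `make`
WORKLOAD = "inputs/sample.trace"  # small input trace shipped with the artifact
RUNS = 3                          # enough to show variance, quick to run


def main() -> int:
    # Fail early with an actionable message instead of a cryptic traceback.
    if not os.path.isfile(BINARY):
        print("kvstore_bench not found; run `make` first (README, step 2).")
        return 1

    throughputs = []
    for i in range(RUNS):
        # The benchmark is assumed to print a single number: requests/second.
        out = subprocess.run([BINARY, "--trace", WORKLOAD],
                             capture_output=True, text=True, check=True)
        throughputs.append(float(out.stdout.strip()))
        print(f"run {i + 1}/{RUNS}: {throughputs[-1]:.0f} req/s")

    # Tell the evaluator which claim this backs up and what to expect.
    print(f"median throughput: {statistics.median(throughputs):.0f} req/s")
    print("This corresponds to Figure 5 (left bar); on different hardware, "
          "expect a different absolute number but the same trend as in "
          "Section 6.1 of the paper.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A script like this costs the authors little, since they need something similar for their own experiments anyway, and it lets an evaluator check the documentation items (what claim is backed up, what dependencies are needed, how to interpret the output) without any back-and-forth with the authors.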

Checklists may evolve over time, but should do so independently of any specific publisher, e.g., USENIX and ACM should not have different sets of badges. However, a badge can be versioned, and any venue could stick to its own version of the badge if it strongly feels this is necessary. The versions can be hosted on a community page such as sysartifacts.

This proposal gives authors an incentive to publish reusable artifacts, gives evaluators recognition for a reasonable amount of service requiring general engineering skills, and gives potential reproducers a venue to publish reproduction studies.

We propose an evolution, not a revolution. This proposal combines the decoupling of reusability and reproducibility currently found in some programming languages and software engineering venues with the formalization of requirements currently found in some systems venues. We believe the systems community should keep alive the debate started at the HotOS panel, and either refine the above proposal or make new ones to address the concerns raised by most of the HotOS panel participants.

Picture Credit: Davanum Srinivas (AWS)

Editor: Tianyin Xu (University of Illinois at Urbana-Champaign)