Science should be transparent and reproducible. In computer science, this means code and data should be publicly available and anyone should be able to validate a paper’s claims given reasonable hardware.
Dedicated scientists have made their code and data public for decades, but until recently, this was done in a decentralized fashion. Each group had its own website hosting the code and data for its papers, and groups had different policies for packaging dependencies, licensing their software, automating experiments, and other such requirements.
Artifact evaluation was created to harmonize good practices for research artifacts and to enforce them. Like any new process, it has some rough edges, and practices are neither as good nor as harmonized as they could be. In this blog post, I describe the current state of artifact evaluation, summarize its history, explain how it currently works in systems venues, and highlight some current issues. I then propose actionable changes venues could make to improve reproducibility and make artifacts first-class citizens in computer science research.
The artifact evaluation process
Artifact evaluation is a process inspired by paper reviewing during which evaluators decide whether code and data submitted by the authors of accepted papers comply with good research practices, such as availability and reproducibility. These practices are represented by different badges, which are awarded to papers and added to their first page. For instance, “Artifact Available” badges typically require a paper’s artifact to have a publicly available snapshot that links back to the paper. The list of badges and their exact definition currently varies by community. Venues such as ICSE also allow special “replication” papers that attempt to replicate past results, and retroactively award replication badges to the replicated papers.
Evaluators rate artifacts much as reviewers rate papers, collectively deciding whether a given artifact meets the criteria for each badge. Authors can choose which badges they target, and may, for instance, decide not to publicize their artifact but to provide it privately to evaluators. Unlike paper reviewing, artifact evaluation is done in cooperation with the authors: evaluators can message authors to ask clarifying questions and point out issues that need fixing. The goal is to validate a paper’s major claims, not to reproduce its exact results, which makes the process less adversarial than peer review.
Artifact evaluation increases the impact papers can have, since artifacts that pass evaluation are easier to reuse: anyone can build on a reproducible artifact to improve a technique or use it as a baseline for comparisons. Even within a group, having a project’s artifact evaluated makes onboarding future students, interns, and other collaborators easier, since the artifact has already been tested by others.
A brief history of artifact evaluation
The software engineering community pioneered artifact evaluation at ESEC/FSE 2011. Other communities followed, such as SIGPLAN at ECOOP 2013. SIGOPS introduced artifact evaluation at SOSP 2019, and now has a list of all systems artifact evaluations at sysartifacts.github.io. The security community has created a sibling website, secartifacts.github.io.
Artifact evaluation is nowadays a standard part of most computer science conferences. Some conferences, such as ESEC/FSE, even encourage authors to submit an early version of their code and data along with papers.
An example artifact evaluation process
I use the EuroSys 2022 artifact evaluation as an example since I co-chaired it along with three wonderful co-chairs and an amazing committee of early-career researchers. Each member of the committee had one or two artifacts to evaluate.
Authors of accepted papers received an invitation to submit an artifact along with their paper acceptance notification. Authors had 6 days to register their artifact, similar to abstract registration, and 6 more days to submit the full artifact. Along with the artifact, authors submitted the accepted version of their paper and an “Artifact Appendix” describing their artifact.
Evaluators and authors then exchanged messages during a “kick the tires” early evaluation period of 10 days, whose goal was to enable all evaluators to set up and run their assigned artifacts. This gave authors a clear time frame during which they had to be available to fix issues such as installation problems or forgotten data files. The process was single-blind: evaluators knew the authors’ identities but not vice versa.
Evaluators then had a little over a month to evaluate their assigned artifacts. Most, but not all, authors applied for all three available badges: “Artifact Available”, “Artifact Functional”, and “Results Reproduced”. To clarify expectations and ease reviewing, we provided checklists for badge criteria. We also provided a guide for evaluators, including how to use the CloudLab and Chameleon academic clouds that generously provided resources to replicate experiments requiring multiple machines or specific hardware.
Evaluators then discussed in online comments which badges their assigned artifacts should be awarded, and we notified authors in time for them to add badges to their camera-ready papers. A few papers also needed small corrections to their results because artifact evaluation uncovered bugs, though no claims were invalidated.
Current issues in artifact evaluation
While artifact evaluation is a step forward, it is currently not as effective as it could be, mainly because it is typically implemented as an afterthought on top of existing processes.
Reviewers on program committees usually have no access to artifacts, unless authors link to anonymized versions of their code in their submitted papers. This leaves reviewers unable to see the implementation of an algorithm, the details of how an experiment is run, or the raw data behind a figure. This is partly addressed in the software engineering community, whose conferences encourage authors to disclose data at paper submission time.
Authors have no incentive to design a reproducible artifact while writing their paper, which leaves them with little time to prepare an artifact once the paper is accepted. It also risks invalidating a paper’s results if the authors discover a fundamental bug while preparing the artifact, which could be as simple as having forgotten a step while manually executing scenarios for a key part of the evaluation.
Papers do not always lend themselves to post-hoc reproducibility since not all experiments can be reasonably reproduced by evaluators or curious readers. For instance, a paper may use a few dozen beefy virtual machines in a commercial cloud service to provide evidence of a system’s scalability. Paying for such virtual machines during an entire artifact evaluation process is unreasonable, preventing evaluators from validating the system’s scalability claims.
Because artifact evaluation is “bolted onto” existing processes, it also inherits their variance across communities, including different standards and different timelines. The same artifact for the same paper could receive different badges at different venues, even when those badges share a name, such as the ACM badges.
Recommendations
The goal of this post is to improve reproducibility and transparency in research, not merely to criticize. I therefore make the following recommendations, based on the issues above and on my experience with artifact evaluations.
While artifact evaluation may one day become a mandatory part of paper submission, as is already the case for JSys, these recommendations are actionable in the short term, without major policy shifts.
- Encourage concurrent submission with papers. Venues should ask authors to submit their code and data along with their paper, or to explain why they cannot. Authors should be incentivized to think of a reproducible artifact as a part of the research process, not something to create after the fact. Software engineering venues already do this.
- Encourage replication papers. Venues should have a separate track for papers that replicate, or fail to replicate despite reasonable effort, previous papers. Papers that are successfully replicated can retroactively be awarded a badge. This is also a great way to introduce students to research. Software engineering venues already do this.
- Encourage small reproducible experiments. Papers whose experiments cannot reasonably be reproduced at full scale should be encouraged to also provide smaller experiments that can be. For instance, the evaluation of a multi-node system can contain both an experiment with many cloud machines and a single-machine experiment that creates one container per CPU core and injects artificial network delays, as sketched after this list.
- Settle on common badge checklists. Publishers such as USENIX, ACM, and IEEE should discuss and formalize checklists defining each badge. The EuroSys 2022 checklists were successful and can be used as a base. Key decisions include defining when it is acceptable to run on hardware provided by the authors, the level of automation expected of experiments, and acceptable forms of artifact packaging.
- Settle on a common knowledge base. Instead of each community or even each venue maintaining its own good practices and instructions for evaluators and chairs, share them on a common website. Sysartifacts can be used as a base and already contains a guide for chairs.
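To make the “small reproducible experiment” idea concrete, here is a minimal sketch of how such a single-machine setup could be scripted. It is an illustration under assumptions, not a prescription: it assumes Docker and the iproute2 tools are available, and the image name, network name, and delay value are placeholders rather than parts of any specific artifact.

```python
# Minimal sketch of a scaled-down, single-machine experiment: one container
# per CPU core, with artificial network delay injected via tc/netem.
# Assumptions: Docker is installed, the hypothetical image "myapp:latest"
# contains the system under test plus iproute2, and containers get the
# NET_ADMIN capability so `tc` can shape their traffic.
import os
import subprocess

IMAGE = "myapp:latest"   # hypothetical image of the system under test
NETWORK = "benchnet"     # dedicated bridge network for the experiment
DELAY_MS = 5             # artificial delay emulating inter-machine latency

def run(*cmd: str) -> None:
    """Run a command, failing loudly if it errors."""
    subprocess.run(cmd, check=True)

def main() -> None:
    run("docker", "network", "create", NETWORK)
    nodes = [f"node{i}" for i in range(os.cpu_count() or 1)]
    for i, name in enumerate(nodes):
        # Pin each "node" to one core so they do not compete for CPU.
        run("docker", "run", "-d", "--name", name,
            "--network", NETWORK, "--cpuset-cpus", str(i),
            "--cap-add", "NET_ADMIN", IMAGE)
        # Add a fixed delay on the container's interface to mimic the
        # network latency between separate machines.
        run("docker", "exec", name, "tc", "qdisc", "add", "dev", "eth0",
            "root", "netem", "delay", f"{DELAY_MS}ms")
    # ...run the scaled-down benchmark against the nodes, then clean up
    # (docker rm -f the containers and docker network rm the network).

if __name__ == "__main__":
    main()
```

Such a script cannot demonstrate full-scale scalability, but it lets evaluators and readers exercise the same code paths and validate trends on a single machine.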
Providing code and data to reviewers, encouraging replication and reproducible experiments, and settling on common definitions and guides would make artifacts first-class citizens of the research world, bringing the computer science community one big step closer to fully reproducible research.
About the author: Solal Pirelli is a PhD student at EPFL, working on automated verification of systems software. He has evaluated systems artifacts in four conference artifact evaluations and co-chaired the EuroSys 2022 artifact evaluation.