Lessons Learned from Chairing the Artifact Evaluation at SOSP 2023

In early 2023, we received an invitation from the SOSP PC chairs to serve as Artifact Evaluation Co-chairs for SOSP 2023. This was an incredibly exciting opportunity for both of us. As junior members of the systems community, taking on a leadership role at our flagship conference is both an honor and a recognition. However, it also came with responsibility: we had to be prepared for the challenge of fairly and systematically evaluating many substantial systems research artifacts.

In this article, we share our experience as AE chairs of SOSP. Specifically, we share the help and support we received during the process and the lessons we learned that may be helpful to future chairs and the wider community. Please note that all opinions expressed here are our own, and we welcome any constructive criticism.

Background and Motivation

Artifact Evaluation (AE) has become a standard process at most conferences in the systems research community, such as SOSP, OSDI, ATC, EuroSys, ASPLOS, and others. In simple terms, AE evaluates the artifacts accompanying conference papers. The evaluation is based on an artifact’s availability, functionality, and reproducibility, with corresponding badges awarded to signify its reliability and robustness in advancing scholarly research. There have already been valuable discussions about AE, such as the HotOS Panel on the Future of Reproduction and Replication of Systems Research (and a blog post about it) and reviews of the AE process itself. Here, we try to share our (hopefully new) insights from the perspective of AE chairs.

Clearly, being an AE co-chair is not an easy task. On the one hand, co-chairs need to be familiar with the entire AE process; on the other hand, they must have the “leadership skills” to help the AE committee members.

Help We Received

When we initially embarked on this work, we quickly realized that there were numerous tasks to be addressed, such as recruiting the AE committee, issuing the call for artifacts, preparing the website, and determining important deadlines, among others. Our first challenge was to establish a well-structured plan and schedule to effectively manage all these responsibilities. Fortunately, we received invaluable help in the form of the Guide for AE Chairs by Solal Pirelli (EPFL) and Anjo Vahldiek-Oberwagner (Intel Labs). This guide proved to be an incredibly comprehensive resource and provided us with a solid foundation for developing our own timeline. We followed its suggestions to establish the timeline for this year’s AE and set up a spreadsheet as our task tracker (shown in the following figure), explicitly listing each task, its status and deadline, and the co-chair assigned to it.

Fig: Part of our “task tracker”. The full table has 34 items.

Some lessons learned from this process: First, we found that AE co-chairs greatly benefit from having an internal timeline that is more detailed than the publicly shared “important deadlines” for authors and reviewers. This internal timeline should include specific tasks, such as when to send certain emails to authors and reviewers, to avoid potential chaos. Second, we discovered how effective the task tracker was for managing our responsibilities as co-chairs. By assigning tasks to specific co-chairs and keeping a clear overview of upcoming tasks, we were able to stay organized and ensure that everything was properly addressed.
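As an illustration of what we mean, here is a minimal sketch of such a tracker written as a small Python script. The tasks, owners, dates, and field names are hypothetical examples for illustration only; our actual tracker was an ordinary shared spreadsheet.

```python
from datetime import date

# Hypothetical task-tracker entries; tasks, owners, and dates are illustrative.
tasks = [
    {"task": "Send the call for artifacts to authors", "owner": "co-chair A",
     "deadline": date(2023, 6, 1), "status": "done"},
    {"task": "Recruit AEC members", "owner": "co-chair B",
     "deadline": date(2023, 6, 15), "status": "in progress"},
    {"task": "Email reviewers about the start of the evaluation", "owner": "co-chair A",
     "deadline": date(2023, 8, 1), "status": "todo"},
]

def open_tasks(tasks):
    """Return unfinished tasks sorted by deadline, giving a clear view of what is next."""
    return sorted((t for t in tasks if t["status"] != "done"),
                  key=lambda t: t["deadline"])

for t in open_tasks(tasks):
    print(f"{t['deadline']}  {t['owner']:<12}  {t['task']} [{t['status']}]")
```

Whatever the format, the point is the same: every task has an owner, a deadline, and a status that both co-chairs can see at a glance.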

One particularly beneficial suggestion offered in the guide was to reach out to previous chairs of the venue to solicit informal feedback, especially if they had prior experience with AE.

We reached out to Robert Ricci (University of Utah) and Dan Ports (Microsoft Research, University of Washington), seeking their advice. They were incredibly kind and suggested having an online meeting to facilitate in-depth discussions. We arranged a Zoom meeting to delve into the preparations for artifact evaluation, and both Rob and Dan shared a multitude of valuable and detailed suggestions.

Fig: Screenshot of the meeting with Rob and Dan. Hannah Cohoon (postdoc at the University of Utah) also shared her comments and suggestions.

Here are a few examples of their suggestions, which we found incredibly helpful and believe will also be beneficial for future chairs:

  1. Widely recruit AEC members, aiming for 5 reviewers per submission and 2 reviews per reviewer.
  2. Request PC chairs to distribute AEC-related emails to the community members and their students, encouraging them to submit their artifacts.
  3. Be mindful and well-prepared for any special hardware requirements, such as GPU access.
  4. Leverage the tagging feature in HotCRP to effectively organize the AEC workflow.
  5. Ensure that each submission is reviewed by at least one highly experienced reviewer.
  6. Respond to inquiries and questions promptly, prioritizing quick and responsive communication, as it fosters motivation among AEC members.
  7. Account for the possibility that some AEC members may be unable to complete their reviews due to various reasons.
  8. Try to be on the side of both AEC members and authors (personally, we think this is very important for a chair).

We truly appreciate all the suggestions provided by Rob and Dan, as they form a solid foundation for this year’s SOSP AE.

Dan also shared his cats with us, and they are absolutely adorable! 😊

Throughout the whole process, almost all of the SOSP’23 organizers (Margo, Jason, Matthew, and many others!) gave us their support. We will not list them all here, but we are truly thankful for their valuable assistance!

Challenges We Did Not Expect 

Reviewer experience

It is important for the chairs to support junior AE reviewers. Being part of the AE committee offers them valuable experience from a reviewer’s perspective. This year, many of our AE reviewers were students, and 44 out of 88 were serving as reviewers for the first time. Although we provided our “Guide for AE Committee” to all reviewers, we later realized during the AE process that some reviewers were unfamiliar with the HotCRP system and encountered various issues, such as not knowing the difference between the review and discussion pages and how to switch between them. Additionally, in some cases, reviewers were overly strict, challenging the techniques used in a paper (which is outside the scope of AE) or questioning the results even when the differences were minor. The key gap here is that while senior researchers understand that AE, especially for the reproducibility badge, evaluates whether a paper’s results can be reproduced within certain tolerances, some AEC members (often students) may not share that understanding. When we observed these issues emerging, we contacted the reviewers on HotCRP or via email to address them in time.

Therefore, as AE chairs, it is necessary to periodically review the comments and discussions on artifacts during the process and provide assistance when misunderstandings arise. More detailed instructions in the guide, both on using tools like HotCRP and on the goals and purpose of AE, could help avoid such issues beforehand as much as possible.
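To make the notion of “reproduced within certain tolerances” concrete, here is a minimal sketch of how a reviewer might compare a paper’s reported numbers against reproduced ones. The 10% relative threshold and the example values are our own illustrative choices, not an official SOSP or ACM criterion; in practice, the acceptable tolerance depends on the claim being checked.

```python
# Minimal sketch: check whether reproduced results fall within a relative
# tolerance of the paper's reported numbers. The 10% threshold and the
# example values below are illustrative, not an official AE criterion.
def within_tolerance(reported: float, reproduced: float, rel_tol: float = 0.10) -> bool:
    return abs(reproduced - reported) <= rel_tol * abs(reported)

# Hypothetical (reported, reproduced) pairs for three experiments.
results = {
    "fig4_throughput_ops": (120_000, 114_500),
    "fig5_p99_latency_us": (85.0, 91.2),
    "fig6_speedup": (7.8, 7.5),
}

for name, (reported, reproduced) in results.items():
    verdict = "OK" if within_tolerance(reported, reproduced) else "outside tolerance"
    print(f"{name}: reported={reported}, reproduced={reproduced} -> {verdict}")
```

A reviewer applying this mindset would flag only differences that fall outside a reasonable tolerance or that contradict a paper’s core claim, rather than minor run-to-run variation.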

Overall, it is an ongoing challenge that many AEC members are first-timers and often work under busy schedules. As AE chairs, we need to be prepared to support them accordingly and address their needs.

Hardware dependencies

Although we expected that hardware dependencies would pose a challenge, we did not anticipate the extent of the difficulty. It is common at systems conferences for works to rely on specific hardware configurations, such as NVM, GPUs, or large-scale clusters. An effective strategy we have found is for authors to provide the necessary environment and offer a means for reviewers to connect remotely and conduct experiments. This approach has proven particularly useful, as many artifacts can be readily reproduced this way. Alternative resources such as public clouds or research platforms like CloudLab can also be explored. However, some challenges remain. For instance, in some situations, authors may be unable to grant reviewers remote access to their machines due to institutional restrictions, or the artifact may have dependencies that cannot be satisfied by public clouds or CloudLab.

In one such case, we relied on the support of our amazing reviewers and the authors. After a thorough investigation, we identified three reviewers (two of whom were already assigned to other artifacts) who had the resources necessary to conduct the experiments, and with their help we assigned the artifact to reviewers capable of executing the experiments. Additionally, the authors promptly prepared a simulation environment for the other reviewers, who did not have access to the real hardware, enabling them to evaluate the functionality of the work.

Reproducing systems research with hardware dependencies can be a struggle for AE chairs, reviewers, and authors alike. However, it also highlights the beauty of systems research: these artifacts are real systems (although mostly prototypes) running on real hardware. Future systems conferences will likely continue to face hardware-dependency challenges, and one valuable lesson we learned is the importance of early communication in such cases.

Best Artifact Award

We received significant support in our efforts to select the Best Artifact Award. One valuable suggestion came from Margo: the AE chairs should develop suggested criteria for nominating the best artifact. While the criteria for the three badges are relatively clear, judging one artifact to be superior to another poses a greater challenge, as different reviewers tend to have slightly different sets of criteria.

With this in mind, we compiled a list of criteria (with input from Margo and Jason) that includes factors such as the clarity and completeness of the documentation, ease of execution (how easily someone can use the artifact to run tests beyond those presented in the paper), and so on. Although the criteria may not be perfect at present, we believe they serve as an excellent starting point that future AE chairs can continue to refine and enhance.

The End

Despite all the challenges, organizing artifact evaluation is a profoundly meaningful task. And, again, we are grateful for the tremendous support we have received! With the unwavering support of our community, we have full confidence that future AE chairs can do better!

About the authors: Dong Du and Jiayi Meng are the SOSP’23 AE co-chairs. Dong Du is an assistant professor at Shanghai Jiao Tong University, working on operating systems, hardware-software co-design, and serverless computing. Jiayi Meng is a tenure-track assistant professor at the University of Texas at Arlington (UTA), working on building systems to support next-generation mobile applications (e.g., VR, AR, and MR) via edge computing over 5G and beyond.

Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGOPS.

Editor: Tianyin Xu (University of Illinois at Urbana-Champaign)