Editor’s notes: As more and more systems research conferences start to adopt artifact evaluation (AE), we invite Manuel Rigger and Arpan Gujarati to share their practices and experience on preparing award-winning systems research artifacts — their artifacts received the Distinguished Artifact Awards at OSDI’20.
In the first edition, Manuel shared his practice of building and maintaining SQLancer, an automatic testing tool for finding logic bugs in database management systems.
In this second edition, Arpan Gujarati from MPI-SWS will tell us how he prepared his award-winning OSDI’20 artifacts for a distributed, heterogeneous system for serving deep neural network models.
Arpan Gujarati is a postdoctoral researcher at the Max Planck Institute for Software Systems (MPI-SWS), where he works with Jonathan Mace in the Cloud Software Systems Group. He completed his PhD under the supervision of Björn B. Brandenburg in the Real-Time Systems Group at MPI-SWS. Starting February 2021, Arpan will be joining University of British Columbia’s CS department for a three-year stint as a Research Associate.
Arpan is broadly interested in real-time systems, distributed systems, fault tolerance, reliability analysis, and scheduling problems in both cloud and cyber-physical systems (CPS) domains. Most recently, Arpan has been exploring efficient system designs for deep neural network inference serving in the cloud. In the past, he has also worked on virtual machine scheduling and distributed auto-scaling for datacenter applications. During his PhD, Arpan focused on reliability and schedulability analysis problems for hard real-time systems.
Could you briefly introduce your artifact?
Let me first summarize our paper. Machine learning is adopted widely today, to the extent that it has become a core building block even for interactive web applications. This means that compute-intensive operations, such as inference using a deep neural network (DNN), are now on the critical path of our web requests. To deploy such web applications at scale, we need cloud systems and infrastructure that can concurrently serve thousands of requests from different users in a timely fashion. These are called model serving or inference serving platforms. Our OSDI 2020 paper presents Clockwork, a new DNN inference serving platform.
Inference serving platforms designed before Clockwork, like Clipper and INFaaS, resemble traditional multi-tenant cloud servers; that is, they attempt to maximize request throughput and system utilization using concurrent execution of requests, and use best-effort techniques such as power-of-two choices or reactive autoscaling for mitigating unpredictability in response times. Clockwork, on the other hand, observes that inference using a DNN is fundamentally a deterministic operation with an extremely predictable execution time. We designed Clockwork to achieve end-to-end predictability from the bottom up, using an approach we called “consolidating choice”, which centralizes all decision making and leaves little room for performance variability. In our evaluation, Clockwork is able to serve thousands of models concurrently, mitigate tail latency while supporting tight latency SLOs on the order of 10 to 100 ms, and achieve close-to-ideal goodput under overload, contention, and bursts.
Coming back to the artifact, we released two main artifacts as part of the OSDI artifact evaluation (AE) process, both of which are publicly available. The first and also the primary artifact is Clockwork’s implementation (about 26,000 lines of C++ code). This comprises Clockwork processes that run on each worker (GPU backends), the central controller (frontend), and the client-side process for workload generation. Clockwork’s implementation makes use of many third-party libraries and software such as the Asio C++ library, protocol buffers, and TVM, which can be easily installed on any Linux machine. The only non-trivial dependencies are an NVIDIA driver and CUDA, which are needed for GPU support. The second artifact comprises all our experiment scripts and extensive documentation for reproducing the exact results from our paper. As part of this artifact, we also provided a Docker image and scripts for running Clockwork out of the box. In addition to these two artifacts, we released our modified TVM repository that is used by Clockwork, and the set of all pre-compiled models and workload traces from Microsoft Azure that we used for experimentation.
How much effort did you spend on these artifacts?
Our team actually spent quite some time on developing the artifacts for the artifact evaluation. I will list some of the main tasks that we undertook specifically for artifact evaluation.
Clockwork’s implementation was always in good shape, since we followed good software engineering practices like unit testing, code review, use of separate development branches, and clean and useful commit messages, even during the prototyping phase.
For the camera-ready and for artifact evaluation though, we added support for running Clockwork with emulated workers in place of GPU workers, which allowed us to evaluate many aspects of Clockwork’s controller, like its scalability and scheduling policies, without the need for GPU backends.
This feature was developed primarily by Wei Hao during his summer internship at MPI-SWS and came in quite handy for one of our AE reviewers who did not have GPU machines, but who could still reproduce certain results from the paper.
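To make the idea of an emulated worker concrete, here is a minimal Python sketch. It is not Clockwork's actual C++ implementation. It assumes that each model has a fixed execution time known in advance, so the emulated worker only advances a virtual clock instead of running anything on a GPU. The class name, model names, and timings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EmulatedWorker:
    """Stands in for a GPU worker: nothing is executed, only time is accounted for."""
    # Hypothetical per-model execution times in milliseconds (assumed known up front).
    exec_time_ms: dict
    clock_ms: float = 0.0  # virtual time at which the worker becomes free again
    completed: list = field(default_factory=list)

    def infer(self, request_id: int, model: str, arrival_ms: float) -> float:
        """Pretend to run one inference and return its virtual completion time."""
        start = max(self.clock_ms, arrival_ms)  # wait if the worker is still busy
        self.clock_ms = start + self.exec_time_ms[model]
        self.completed.append((request_id, model, self.clock_ms))
        return self.clock_ms


worker = EmulatedWorker(exec_time_ms={"resnet50": 3.0, "bert": 8.0})
first = worker.infer(1, "resnet50", arrival_ms=0.0)
second = worker.infer(2, "bert", arrival_ms=1.0)       # queues behind request 1
third = worker.infer(3, "resnet50", arrival_ms=20.0)   # worker is idle, starts on arrival
print(first, second, third)
```

A controller under test can issue requests to such a stand-in exactly as it would to a real worker, which is what makes scheduling and scalability experiments possible on machines without GPUs.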
At the same time, I spent a lot of time writing experiment scripts and documentation. Anyone who has ever worked with distributed systems knows that evaluating such systems can be challenging, since it requires managing multiple processes on different machines, aggregating and processing telemetry data gathered from all machines, and generating graphs to visualize the processed data. More often than not, the experiments need to be run for several hours and repeated multiple times with different configurations. Clockwork’s experiments were no different.
Our goal was to facilitate reviewers who are not familiar with the system to accurately reproduce our results.
Therefore, I put in extra effort to automate our entire experiment pipeline. I also provided detailed step-by-step instructions to run the experiments manually, for anyone interested in tweaking the experiments. Both these steps were greatly appreciated!
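As an illustration of what an automated pipeline can look like, the following Python sketch loops over configurations and repetitions, collects per-request latencies, and reduces them to summary numbers. It is only a sketch under stated assumptions: `run_experiment` is a hypothetical placeholder that draws synthetic latencies instead of launching processes on remote machines, and the configuration names and the SLO value are made up.

```python
import random
import statistics


def run_experiment(num_requests: int, mean_latency_ms: float, seed: int) -> list:
    """Hypothetical placeholder for one experiment run.

    A real pipeline would start processes on several machines and gather their logs;
    here we draw synthetic latencies so that the sketch runs anywhere.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_latency_ms) for _ in range(num_requests)]


def summarize(latencies_ms: list, slo_ms: float) -> dict:
    """Reduce raw latencies to the numbers that end up in a graph or table."""
    ordered = sorted(latencies_ms)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    within_slo = sum(1 for x in ordered if x <= slo_ms)
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[p99_index],
        "slo_attainment": within_slo / len(ordered),
    }


# Hypothetical configurations; every (configuration, repetition) pair runs unattended.
configs = [
    {"name": "low_load", "mean_latency_ms": 5.0},
    {"name": "high_load", "mean_latency_ms": 20.0},
]
results = {}
for config in configs:
    for repetition in range(3):
        latencies = run_experiment(1000, config["mean_latency_ms"], seed=repetition)
        results[(config["name"], repetition)] = summarize(latencies, slo_ms=100.0)

for key, summary in sorted(results.items()):
    print(key, summary)
```

Keeping the run step and the summary step as separate functions means a reviewer can rerun only the analysis on existing logs, or change one configuration without touching the rest of the pipeline.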
Reza Karimi meanwhile worked on developing a Docker container image with the Clockwork implementation, its installation dependencies, and the runtime environment properly set up. Reza also added support to ensure that the experiment scripts I was writing worked seamlessly in a dockerized environment. Having a ready-to-use Docker image automated many configuration steps required for running Clockwork, such as setting different environment variables, and provided an easy way to test Clockwork out of the box.
This gave the artifact evaluators a few choices. First, we enabled remote access to our MPI cluster for a week, so that evaluators could replicate results on the same hardware setup we used in the paper. While this significantly eased the review process, it kept me busy for a week since I had to set up and manage the cluster machines during this period. Second, it was possible to run everything in the cloud and/or in a dockerized environment, using different machine and GPU configurations. Lastly, it was also possible to run Clockwork in an “emulated” mode if GPUs weren’t available. In addition to documenting the “ideal” evaluation setup, we provided general documentation for how to tweak and reconfigure everything for different environments.
What did the reviewers like about your artifact?
The reviewers liked our comprehensive documentation of Clockwork’s installation procedure and the evaluation methodology. They especially liked that our documentation was very detailed and that we broke down everything into small steps, as is evident from their remarks.
- “detailed artifact evaluation plan was provided and documented excellently”
- “very detailed documents in terms of preparation, installation, and execution.”
- “complete, documented well, and very easy to reuse in my opinion”
- “the documentation is more-than-expectation sufficient, very easy to follow step by step, and is beginner-friendly”
I mentioned earlier that we provided a Docker image to run Clockwork out-of-the-box, provided reviewers with access to MPI-SWS cluster machines for a week, and added a new feature to test and evaluate Clockwork without GPU machines (using dummy Clockwork workers). We also tested our setup on Google Cloud VMs and provided related instructions in the documentation. In summary, I think facilitating multiple ways to run Clockwork was useful, and the reviewers also noticed our efforts in this regard when it came to judging the artifact.
- “provides instructions to launch docker container with all dependencies pre-installed”
- “described how to set up and run Clockwork either bare-metal or using Docker, using one’s own cluster machines or using a cloud provider”
- “provide docker and cloud images to help quickly deploy the test environment.”
Finally, I think simple things like providing a starter example, a utility script for checking if the environment is set up correctly, and a troubleshooting guide detailing common-case errors were greatly appreciated by the reviewers.
- “contains starter example which walks the user through basic steps (e.g. starting up worker, controller, and client) needed to start serving inference requests”
- “contains utility script for checking if environment is set-up correctly; I found this be very useful”
- “provide handy troubleshooting guide, detailing how to fix common-case errors, and describe available workload and controller (some user-configurable) parameters in detail”
- “provides instructions for experiment setup without GPUs or with smaller memory sized GPUs.”
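A utility script of the kind the reviewers mention can be very small. The Python sketch below is not the script shipped with Clockwork. It only illustrates the idea of checking for required environment variables and command-line tools before any experiment starts. The variable names `EXAMPLE_MODEL_DIR` and `EXAMPLE_LOG_DIR` and the tool list are hypothetical.

```python
import os
import shutil


def check_environment(required_vars, required_tools, environ=None):
    """Return a list of human-readable problems; an empty list means the check passed."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in required_vars:
        if not environ.get(name):  # missing or empty counts as not set
            problems.append(f"environment variable {name} is not set")
    for tool in required_tools:
        # shutil.which searches the PATH of the current process.
        if shutil.which(tool) is None:
            problems.append(f"required tool '{tool}' was not found on PATH")
    return problems


# Hypothetical requirements, chosen only for illustration.
issues = check_environment(
    required_vars=["EXAMPLE_MODEL_DIR", "EXAMPLE_LOG_DIR"],
    required_tools=["python3"],
)
if issues:
    print("Environment check failed:")
    for issue in issues:
        print("  -", issue)
else:
    print("Environment looks good.")
```

Reporting every problem at once, rather than stopping at the first one, saves a reviewer from repeated fix-and-rerun cycles.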
Suggestions for folks who are preparing artifacts?
Clockwork’s implementation was already in good shape for artifact evaluation. I think that is the best way to secure a successful artifact evaluation. In other words, rather than aiming to “clean up” the entire project repository at the end, if authors follow practices like unit testing, daily builds and CI tests, code review, use of development branches, clean and useful commit messages, etc., the project is almost always in a reproducible state. For Clockwork, except for daily builds and CI tests, we followed all of the aforementioned practices from day one. This meant we did not have to spend more time on our system implementation again before making our code public.
I would also like to emphasize the role of automation. My advice is to automate as much as possible, and as early as possible.
It is often tempting to take (what may seem like) shortcuts: just set one more environment variable, manually copy all logs into one place for processing, dump everything into NFS, etc. Unfortunately, such shortcuts can often cause problems, especially when large-scale experiments have to be run in the short time before a deadline.
Lack of automation also causes problems during AE when reviewers want to evaluate reproducibility, since it is very difficult for someone not familiar with the setup to precisely run every step. The moment the setup changes a bit, there are bound to be discrepancies in the setup instructions. Automation prevents many of these issues.
To evaluate reproducibility, a key concern is that reviewers may not have the same hardware/software resources available.
Therefore, authors need to figure out how they can provide reviewers with an (almost) exact replica of their setup, and if not, what is the best way for the reviewers to reproduce the experiment results in a setup that differs from the authors’ setup. My suggestion therefore would be to test your system in different environments, and make provisions for the reviewers to evaluate your system in different environments. Also, given that so much work is needed for a successful artifact evaluation, start early, especially if you do not have a large project team.
What is your experience with artifact evaluation?
This is actually the first time I have built an artifact specifically for artifact evaluation. I did serve on an AE committee once, for a real-time systems conference, so I had a fair idea of how to go about it. In general, during my PhD, since AEs did not exist or were still new, I always prioritized working on a subsequent publication rather than pushing for AE. In contrast, I spent a lot of time working on Clockwork’s AE; to some extent, this was possible only because during the AE period I was mainly waiting for my UBC paperwork and did not have to jump immediately on to my next research project. Personally though, I am quite excited about the fact that AE is becoming a norm in more conferences across communities. IIRC, OSDI itself had more than 50% successful artifacts. Going forward, I am definitely planning to make this an important goal of my research.
Suggestions for improving the artifact evaluation process?
I was quite impressed with the OSDI artifact evaluation (AE) process in general. There were many volunteers, and our artifact was assigned three reviewers, which was quite nice. But I think the AE process could still be improved.
This time, we did not know whether the reviewers would have any extra budget to rent cloud VMs or credits for any specific type of cloud VM instance. If this information can be shared with authors in advance, say in the CFP, it can help authors prioritize their environment setup.
I also found the AE timeline slightly tight, possibly because I was heavily involved with the AE, camera-ready preparation, as well as the talk preparation (which was due much earlier this time since the conference was held virtually). I guess for small teams this can be very tough. So if the current model continues (i.e., paper notification followed by AE and CR), I would like to see the deadlines better spaced in the future.
As far as recognition of artifacts is concerned, I think the Distinguished Artifact awards like at OSDI and now this SIGOPS blogpost are very good starting points. They do a great job at recognizing the effort put in by authors towards AE. Maybe we could take this one step further where artifacts across communities (SIGOPS, SIGPLAN, SIGBED, etc.) are recognized on a common platform, similar to ‘papers with code’ in the machine learning community.
Another question that could be considered by future conference organizers is whether AE could be conducted concurrently with the PC review process. What if the paper is rejected? In that case, you should be able to automatically transfer your AE certificate to the next submission. I hope some of these reforms will happen in the near future.
What’s your view of recognizing research artifacts?
Ideally, for every paper we want to have a fully functional and reproducible artifact, and one that is robust enough to run in different environments. This is immensely useful. For instance, in this paper we evaluated Clockwork against INFaaS, another model serving platform proposed a couple of years back. INFaaS had a wonderful artifact in the sense that we could directly launch an EC2 instance with the image provided by the authors, follow their instructions, and run the same throughput experiment as with Clockwork without much hassle. And likewise for Clipper. But at the same time, INFaaS’s system was tightly coupled with AWS, so we couldn’t easily port it to our cluster for a fair comparison with Clockwork, which we would have preferred. In short, in an ideal world, we want a lot of research artifacts out there that are not just available and functional, but reproducible in different environments, and we want authors to give enough importance and time to research artifacts to make this possible. But this requires better incentive structures.
There is an ongoing debate regarding whether AE should be a part of PC deliberations, which will encourage more people to work on AE right from day one. Reproducibility as a first-order concern in the new Journal of Systems Research is also a welcome move. In the long run though, beyond everything that conference organizers do to encourage AE, I strongly think incentives at the university level can help. Can we include AE as part of PhD evaluation, faculty hiring, and tenure evaluation? PhD requirements today (at least in Germany) do not mandate students to submit research artifacts corresponding to their thesis. Since PhD students are integral to research in CS, change in incentives for PhD students would automatically lead to a better research artifact culture in computer science.
Cover image: Arpan with an OSDI’20 Distinguished Artifact Award and the Clockwork presentation at his MPI office before flying to the University of British Columbia where he is starting as a Research Associate.
Disclaimer: This post was created by Tianyin Xu for the SIGOPS blog. Any views or opinions represented in this blog are personal, belong solely to the blog author and the person interviewed; they do not represent those of ACM SIGOPS, or their parent organization, ACM.