Fish in a Barrel: an insider’s retrospective of the SOSP’09 multikernel paper

Editor’s notes: We invite SIGOPS Hall of Fame (HoF) winners to write about backstories behind the HoF work. In this article, Andrew Baumann wrote an insider’s retrospective of the 2009 SOSP (and recent SIGOPS HoF) paper “The Multikernel: A new OS architecture for scalable multicore systems“. Andrew is a researcher in the Systems Group at Microsoft Research Redmond. His research interests are operating systems and systems security, with a particular focus on problems driven by hardware evolution, or close to the hardware/software boundary.

Inspiration and beginnings

The story of this paper is really the story of the early days of the Barrelfish research project. The motivation for Barrelfish first formed between Timothy (Mothy) Roscoe and Paul Barham around 2006. At the time, Mothy had just decided to join ETH Zurich as a professor, and Paul was managing the systems research group at Microsoft Research, Cambridge. The two had previously worked together as graduate students on the Nemesis OS, and saw a new opportunity to collaborate on OS research.

Paul says:

The conversations started around the notion of “just how stupid are operating systems — they have all these spare compute cycles but make truly awful decisions based on sometimes decades old heuristics baked into the code”… We felt that a modern OS should be maintaining lots more measurements and statistics on the hardware performance, network topology etc so it could then “do the right thing” when you asked it to “move 3GB of files from here to there”.

Mothy says:

We had been talking for a long time about managing ensembles of physically colocated devices as a single system… at some point this pivoted into a recognition that modern computers basically were distributed systems already, and that was really what was wrong with modern OS designs.

Of course, every new research project needs a good name, and the name “Barrelfish” was born out of frustration around the same time. Paul again:

The name for the project came from the fact that we wanted to put the OS world to rights. Everywhere we looked there was some “traditional” way of doing things that was based on assumptions that compute was scarce and maintaining state was expensive. It was easier (nay essential) for the OS to make decisions using a handful of machine instructions and no data. We thought that identifying all these things and coming up with principled mechanisms to make better dynamic decisions would be like shooting fish in a barrel.

The system itself began taking shape in roughly October 2007, after Mothy had formed a group at ETH Zurich (Simon Peter, Adrian Schüpbach, Akhi Singhania and myself) that, along with our collaborators at MSR Cambridge (Paul, Tim Harris, and Rebecca Isaacs), made up the initial team. From the beginning we intended to build a new OS for multicore systems, but it wasn’t for at least another year until apparently-fundamental design decisions, including the sole reliance on message-passing communication, were actually settled.

Our initial goals for Barrelfish were to:

  • build a new OS from scratch;
  • treat a heterogeneous multicore machine as a distributed system;
  • reason about hardware using high-level constraint optimisation techniques;
  • eschew compatibility with existing systems (Linux and Windows);
  • borrow good ideas from other research, in particular:
    • minimise shared state (Tornado, K42),
    • upcall-based processor dispatch (Psyche, Scheduler Activations, K42),
    • libOS structure with policy in application domains (Exokernel, Nemesis),
    • partitioned capabilities for kernel memory management (seL4),
    • isolated user-mode device drivers (L4),
    • EDF as a per-core CPU scheduler (RBED),
    • domain-specific languages to improve OS engineering (Devil);

By mid-2008 we had implemented the guts of an OS and written a workshop paper about our ideas for diversity and heterogeneity, but were becoming increasingly nervous — many-core systems were now a hot topic, and Barrelfish was far from ready. Boyd Wickizer et al. had just published Corey at OSDI, we knew of the Berkeley Tessellation and Akaros projects, and
fos at MIT, and we were desperate not to miss our chance at an SOSP submission. It was really only in the process of writing the 2009 HotOS paper (including a memorable spring afternoon when Simon, Mothy and myself sat down outside the student bar to pick the terms multikernel and CPU driver) that the multikernel design became concrete and we adapted the system to run a separate kernel on each core, and to implement efficient shared-memory intra-core RPC.

Our predictions

The multikernel architecture was motivated by three trends in commodity computer systems:

  1. scalability to many cores;
  2. heterogeneity within a single system (for example, cores with differing ISA features and performance, accelerators, and offload devices);
  3. diversity between different systems, precluding static design-time optimisation.

Re-visiting the paper with a decade of hindsight, it’s clear that while all three motivations featured strongly, scalability was the most significant. At the time, many core systems were a hot topic. It was already obvious that Dennard scaling had broken down, but with Moore’s Law continuing it was not unreasonable to expect hundreds or thousands of cores in a system. Obviously, this didn’t happen: core counts have grown only gradually, and legacy OS architectures have continued to scale with incremental tweaks to meet the needs of commodity PCs while turning a blind eye to GPUs and accelerators. Although it’s impossible to disentangle cause and effect (more cores would certainly have been possible, had there been a market for them, but this in turn would have required a scalable OS), it’s evident that our predictions of a many-core future were somewhat off base.

On the other hand, although individual processors may not have scaled to the extent we originally anticipated, disaggregated and rack-scale systems have. We still believe that a multikernel architecture is the best match for such systems, and we are encouraged by recent work such as LegoOS that has applied it in this context.

Finally, heterogeneity and diversity are very real issues today, and it was perhaps a mistake to focus so strongly on scalability. We changed the title of the paper from “A new OS architecture for heterogeneous multicore systems” to “… scalable multicore systems” only in response to (justified) reviewer feedback that it was misleading because we didn’t evaluate any heterogeneous systems. Big/little core hybrid designs are now common, and the use of accelerators and programmable logic has increased to the extent that they are now mainstays of datacenter architecture. Barrelfish is also being used for the Enzian research computer under construction at ETH Zurich.

Our mistakes

While successful, Barrelfish was not without its missteps. We spent the first year of the project building a system without a particularly well-defined research agenda. This was a lot of fun — we built a brand new OS while scratching everyone’s personal engineering itches, but we also allowed ourselves to be distracted solving irrelevant problems.

By far the biggest distraction came from adopting the seL4 capability model for memory management. In this design, all memory, including memory that may be mapped to user-space, but also kernel objects that consume memory (for example, page tables, and thread control blocks), is managed by capabilities. User code creates new kernel objects by retyping capabilities to unused memory. Besides its conceptual elegance, the primary advantage of this scheme is that the kernel need not perform any dynamic memory allocation; because there is no kernel heap, there is no possibility that it may overflow, which would obviously be a problem for a formally-verified kernel.

In Barrelfish, we had no intention of formally-verifying the system, but nevertheless adopted the seL4 capability model for its elegance. However, unlike the embedded systems of seL4, with Barrelfish we were targeting commodity PCs and seeking to support rich (server-class and interactive user) applications. We thus found ourselves solving major design problems adapting the seL4 model to our more dynamic environment. For example: the seL4 design assumed that the location and type (RAM vs device registers, etc.) of all physical memory was known at boot time, but on a PC this is known only after walking ACPI tables, and may change as devices and memory are hot-plugged. Worse, in the seL4 design, capabilities always represented power-of-two sized memory regions, and could only be split up into equal smaller power-of-two-sized subregions, so an arbitrary-sized region may require O(log(n)) capabilities to manage it. Compounding matters, any attempt to support flexible dynamic memory allocation (such as a heap implementation) must resolve a three-way cyclic dependency between the capabilities to physical memory, capabilities to the page tables that map the memory, and capabilities to the memory storing the capabilities themselves (CNodes). It thus becomes ludicrously complex to predict how many capability slots may be required to allocate a heap block, when the need to allocate more capabilities or page tables may itself require extra heap space.

All of these problems with the capability system were solvable, but none of this work advanced our research agenda, yet Barrelfish somehow persisted with this design until 2016 when Simon Gerber finally simplified the capability system by removing the power-of-two restrictions and replacing the guarded capability table with a simple two-level structure.

Another, lesser, class of mistakes were efforts that simply didn’t pay out in terms of research benefit. These tend to be inherent in many systems-engineering projects. For Barrelfish, they include brief flirtations with the CMake build system and Deputy extensions to the C language, along with a significant effort invested in optimising the local (intra-core) IPC fast-path to be competitive with L4, which was ultimately reported as one microbenchmark in the SOSP paper,
but was largely irrelevant to our goals.

Finally, we failed to fully capitalise on one of the motivating ideas behind the project: to measure the performance properties of a system and its environment, and use that information to make dynamic data-driven decisions about OS policy and resource allocation. This idea was not lost — it led to the Barrelfish system knowledge base that was used for tasks such as multicast message routing, device configuration, interrupt routing and scheduling. However, as the current trend of applying machine learning to operating systems demonstrates, there are more fish left in that particular barrel.

The SOSP paper

Memories of the paper writing process in early 2009 remain a blur. It was, as always, a last-minute affair with key results being obtained only in the final hours before the deadline. We benefited tremendously from combined bandwidth on both engineering and paper writing sides — although it was a distributed team, all hands were on deck and committed to the effort. Pierre Evariste-Dagand, a masters student who was visiting the group to work on his masters thesis on formal guarantees for embedded languages, ended up porting SQLite to Barrelfish and debugging low-level C code. Paul and Simon put together the experiment in section 2 that made the case for treating shared memory as a message-passing substrate. Mothy, Rebecca and Tim somehow turned the first half of the paper from a collection of observations and outlandish claims into a coherent, compelling motivation for the multikernel. Adrian and Akhi soldiered through to get reliable results from network benchmarks.

My own most-memorable contribution to the paper deadline was rather ignominious. Until that year, SOSP had always used the venerable ACM sig-alternate document class, with its hardcoded 9 point font size. However, less than a day before the deadline, we discovered that the call for papers specified a different 10 point format. To minimise churn, I made the disastrous decision to keep sig-alternate, and instead kludged over an internal definition to change the font size:

changeset:   152:75801390527b
user:        Andrew Baumann <andrewb@inf.ethz.ch>
date:        Sun Mar 08 09:12:55 2009 +0100
summary:     use 10pt font as requested in CFP

diff -r 15bd80f5d1a5 -r 75801390527b paper.tex
--- a/paper.tex Sun Mar 08 08:47:54 2009 +0100
+++ b/paper.tex Sun Mar 08 09:12:55 2009 +0100
@@ -6,6 +6,17 @@
 \usepackage{txfonts}
 \urlstyle{sf}

+% AARGH! the CFP wants 10pt fonts! this looks awful -AB
+\makeatletter
+\renewcommand\normalsize{%
+   \@setfontsize\normalsize\@xpt{10.5\p@}
+    \abovedisplayskip 6\p@ \@plus2\p@ \@minus\p@
+    \belowdisplayskip \abovedisplayskip
+    \abovedisplayshortskip 6\p@ \@minus 3\p@
+    \belowdisplayshortskip 6\p@ \@minus 3\p@
+    \let\@listi\@listI}
+\makeatother
+
 % sig-alternate class declarations
 \conferenceinfo{SOSP'09}{}
 \CopyrightYear{2009}

Astute LaTeX hackers may notice my mistake here. I am not one, so I blithely edited the paper down to the desired page limit, and breathed a sigh of relief when HotCRP’s format checker accepted it. As it turned out, I had changed the font size but not the leading (line spacing), and in the process discovered both a limitation of the automated format checker and an excellent opportunity to upset our reviewers. Thankfully, the PC were ultimately lenient, and if the experience taught me anything besides the need to obsess over the details in a call for papers, it is to be mindful that format rule violations may have a benign explanation.

The reviewers were detailed and thorough, but also remarkable in their uniformity. Every one of the nine reviewers (this count was inflated by several final meta-reviews) gave exactly the same scores, equivalent to a medium-confidence weak accept. It seems noteworthy that none of the reviewers expressed high confidence for a paper squarely about operating systems principles. This points to the breadth of the systems community, but also to a lack of interest and/or expertise in core OS that we still find worrying.

Every review called for more depth on how to build high-level system services over the multikernel model — we tried to provide more of this in the camera-ready, but ultimately it took us years to flesh out the system architecture.

Fighting the hardware

Since we were building a scalable many-core OS we needed a large multiprocessor system to run it. At the time, AMD held the lead in PC server hardware, and the largest such machine we could acquire was a unique 8-socket beast that we named Gruyère (following a naming theme of Swiss cheeses). This machine consisted of a 4-socket Tyan motherboard (with 4 CPUs, DRAM and I/O) plus a separate optional expansion daughter-board with 4 more CPU sockets (and associated memory), connected together by two HyperTransport link cables.

When we ordered this machine, we knew it would be special, and we had already used its unique interconnect topology as an attention-grabbing figure in our posters, a work-in-progress talk, and our HotOS paper:

However, only in the weeks and days before the SOSP deadline did we discover just how special it was. In a multi-socket system, it is the job of platform firmware (the BIOS) to program the memory controller’s HyperTransport routing tables on each socket. After observing strange performance anomalies in our message-passing performance between certain pairs of cores, we discovered that the firmware configuration was programmed sub-optimally, so that in certain cases cache-coherence traffic between two sockets that were both located on the daughterboard would needlessly traverse the slow, contended HyperTransport link cables to the main motherboard and back again. Worse, our microbenchmarks suffered occasional huge (orders of magnitude) outliers, which we eventually diagnosed as a deadlock in the HyperTransport network leading to a 10ms timeout before the interconnect would reset and retry a cache fill.

Needless to say, when Gruyère was finally retired some years later, there was much rejoicing (not to mention ceremonial destruction)!

Making it real

At the time of submission, the system we referred to as “AnonOS” was really three divergent forks of the Barrelfish codebase. One fork ran the message-passing microbenchmarks, one the shared-memory workloads (SPLASH and OpenMP), and the third ran the network driver and web-server benchmarks. Between the paper submission and camera-ready, besides refining the paper, a huge effort was put into preparing for release a single self-contained version of the system that could reliably reproduce all the reported experiments. Major components of system functionality (including an IDL compiler for message-passing) were introduced at this time. We’re pleased to see that artifact evaluation is now making this standard of completeness the norm for systems papers.

From the very beginning, we devoted substantial effort to building Barrelfish from scratch as a clean-slate system. While all of us had written research OSes before, it was for most our first time doing so on the x86 PC with its labyrinthine boot process from single-processor 16-bit real mode to multi-processor 64-bit mode. We implemented the code to walk ACPI tables, discover APICs and boot CPUs (each with their own kernel image), program PCI ranges, and bring up device drivers for NICs. We built shared compute infrastructure and a regression testing system to keep it all functional on a range of diverse hardware from laptops and desktops to the 8-socket monster above. We wrote a rudimentary virtual filesystem, NFS client, web server, and shell, not to mention actual benchmark workloads. Shortly after the SOSP paper was published, with the help of Richard Black and Orion Hodson at MSR Cambridge, we ported the system to ARM, Intel’s “Single-Chip Cloud Computer”, and the MSR Beehive soft-CPU (it eventually ran on many more systems) and designed new message-passing transports for each. We hosted the Barrelfish website on a PC running Barrelfish. To give talks, we brought up Barrelfish on a Lenovo laptop, and implemented drivers for the framebuffer and keyboard along with a simple slide viewer. Getting the display output to work with any given projector was always something of a gamble (and, awkwardly, blanked the laptop’s internal display) but it worked every time — so smoothly that many members of the SOSP audience in Big Sky didn’t even realise that my laptop was running Barrelfish during the talk!

We intended Barrelfish to be an open-source system, with open collaboration between researchers at ETH Zurich and Microsoft Research. At the time, this was a significant departure from the norm for Microsoft, requiring not-insignificant string-pulling and legal wrangling behind the scenes by Paul when establishing the project. There were some exciting moments after we released the first version of the code (just prior to SOSP), unintentionally ruffling feathers in Redmond with a wave of highly misleading headlines such as “Microsoft, researchers release new operating system” and “Microsoft unveils Barrelfish multi-core optimized OS”. Ultimately however, projects like Barrelfish helped pave the way for Microsoft’s embrace of open source.

The demo

Our collaborators in Cambridge put together a fantastic demo to demonstrate the use of efficient messaging protocols in a large multicore system. The system was modified with lightweight in-memory tracing to trace key events such as message send and receive, context switches, and user execution to a per-core ring buffer. A shared daemon then collected these trace logs and streamed them over the network to a GUI application that visualised the trace. This allowed us to visualise how different message-passing protocols (particularly the optimised broadcast and multicast trees we evaluated in the paper) behaved in real-time.

To give this demo at SOSP we needed a suitably large multicore machine running Barrelfish, which required its own non-trivial logistics effort. A system was shipped from Cambridge to Redmond, and then driven 12+ hours to Big Sky by Sameh Elnikety (to whom we remain grateful), arriving in the small hours of Saturday morning to spend the rest of the night outside in well-below-freezing temperatures. Carrying a deep-frozen 4U server across a windy carpark is not an experience that Simon, Mothy or I wish to relive, but the machine itself survived without apparent issue and after some last-minute fine tuning we were ready for the poster session.

Big Sky Resort on a frozen morningBig Sky Resort on a frozen morning during SOSP’09. Photo by Chris Frost, CC BY-NC-SA 2.0.

Besides demonstrating message-passing workloads, our demo had a party trick that we always saved for the end. By programming a per-core service to busy wait (spin) for a configurable number of cycles in response to each message, and then sending messages to specific cores, it was possible to treat 8 individual cores as pixels, and cause messages such as “Barrelfish posse in full effect!!!” to scroll across our trace visualiser in a retro 8-bit font. This produced quite a reaction in our audience, but sadly we have no record of it in action.

An “audience choice award” was announced and voted on at the conference, but mysteriously the results were never released… far be it for us to indulge in conspiracy theories, but we did have our hopes up!

Impact and lessons learned

The multikernel paper has arguably proven influential in the research community; most significantly it established a term for a new point in the OS design space, and led to follow-on work both by the Barrelfish team and others. For example, 2010’s An Analysis of Linux Scalability to Many Cores was in some ways an explicit response to the multikernel proposal. Popcorn showed how to turn Linux into a multikernel. More recently, LegoOS applied a multikernel design to disaggregated servers.

Systems such as Arrakis and FlexNIC not only leveraged the multikernel design, but built on the Barrelfish codebase (helping validate all the work we put into open source). LXDs borrowed and refined the shared-memory message-passing protocol we developed for Barrelfish, along with the AC async framework for using it effectively. Although most of the founding members have moved on, Barrelfish remains an active project. Besides the Enzian system, Barrelfish is also used for teaching advanced OS courses at ETH Zurich and UT Austin.

If there’s a final takeaway message here for the OS research community, perhaps it’s that we should continue to encourage the risky business of building systems for blue-sky research. In simplistic terms of research output measured as papers over time, the first three years of Barrelfish were highly unproductive, and the effort we put into building and demoing a complete system was irrelevant. Yet, it seems notable that of the 23 papers presented at SOSP 2009, three described new research OSes (seL4, Helios, and Barrelfish), and two of those three are now in the SIGOPS Hall of Fame. If we had instead spent our time incrementally extending Linux or Windows, would we be reminiscing about it a decade later? Probably not.

Acknowledgements

Thanks are due to Paul Barham, Rebecca Isaacs, Simon Peter and Timothy Roscoe for feedback on this article, and to the many other contributors to Barrelfish.