The journey of real-life industry work behind an OSDI paper — global capacity management for millions of servers

As a new hire at Meta (formerly Facebook) in 2020, I was unbelievably fortunate to be entrusted with the question of how we should manage the company’s global server capacity at the scale of millions of servers. I was entrusted with this work despite having zero background in capacity management, and against all odds, not only did this work successfully land in production, but we also published a paper in OSDI’23 titled “Global Capacity Management With Flux.” This post shares my personal story—half anecdote and half technical—and attempts to shed some light on how real industry work is done at hyperscale companies, through quick iterative improvements and by leveraging great teamwork.

I joined Meta in January 2020, after having spent more than 15 years working on infrastructure problems across a number of companies and industries, including Google, Twitter, and GRAIL. I have always been fascinated by the software infrastructure that powers our modern world, and I was drawn to Meta because of its distinct culture, immense scale, and sometimes heterodox technical approaches: for example, Meta was investing in a big software monolith when the industry was firmly going the other direction, and pioneered building hardware platforms in the open.

Throughout my career I have come to believe in two big ideas that have driven both my career choices as well as the specific projects I’ve chosen to pursue:

Scale introduces unique challenges. At large scales, existing solutions often break down, requiring novel approaches to make progress. Scale also changes the underlying economies: suddenly it makes sense to pursue comparatively smaller gains that require a high degree of technical sophistication. (Interestingly, this is similar to the natural world, where the laws of nature derived at one scale do not apply readily to the another.)
People and processes matter just as much—and sometimes more—than technology choices. Solving hard challenges at scale is rarely a matter of a lone genius coming up with a solution. Instead, it requires a group of talented people to work well together.

And so it was with our recent paper on Flux, presented at OSDI’23. I believe Flux exemplifies how research and development is done at a company like Meta. It highlights how the immense scale and infrastructure complexities we face lead to novel challenges, and how addressing these challenges requires us to develop novel solutions to tackle the underlying business problems. Finally, solving those problems requires a large group of people to work together to define the problem, brainstorm solutions, and to execute and iterate on the resulting system.

The beginnings of Flux

The story begins in February 2020, as I was graduating from Meta’s bootcamp. Bootcamp is our month-long intensive program for onboarding every new employee. It takes you through a whirlwind tour of Meta’s technologies, tools, and culture, giving you the working knowledge you need to apply your skills at Meta.

After I had attended bootcamp for about a month, Kaushik Veeraraghavan suggested I come to Seattle for a day to discuss an interesting problem: our hardware orders were getting intractably complex, and we needed to find a way to simplify them. This was befuddling! As I had not had any previous experience with capacity planning, I couldn’t really appreciate at the time why that would be. Shouldn’t you just be able to order the hardware you need and call it the day?

The day in Seattle turned out to be quite fascinating. I discovered that hardware ordering is indeed a complex problem, and not at all the straightforward task that I had assumed. This is because, fundamentally, hardware ordering is a highly constrained optimization problem. Limitations in both physical space and power availability restrict how much hardware can be ordered in any given datacenter region. Additionally, the process of hardware refreshes — which involves replacing aging hardware — adds yet more constraints. These refreshes temporarily decrease a datacenter’s available capacity, since they are used to replace existing hardware, which must first be decommissioned to make room for new racks. Furthermore, hardware refresh has to account for double occupancy, the period of time where services migrate from old to new hardware, temporarily doubling the required capacity footprint.

This is all compounded by the fact that datacenter capacity is partitioned into geographically separate datacenter regions. Services generally prefer to grow their capacity footprint in specific regions, adding additional constraints to the hardware ordering problem. What’s more, services form a dependency graph: we cannot add capacity to a service in isolation, without also considering its downstream dependencies, which may in turn require additional capacity to handle the increased load created by a more capacious upstream service.

Thus we found ourselves with a kind of Gordian knot. On the one hand, capacity ordering complexity was driven by the increasing scale and complexity of our infrastructure. On the other, services had come to rely on being able to have their capacity fulfilled in specific regions. The result is that each capacity ordering cycle required an intricate negotiation process between our capacity planners, and service owners, leaving nobody happy in the process. This was further complicated by the fact that space and power limitations sometimes required us to fulfill hardware refreshes out of region. In other words, we required services to redistribute their capacity footprint in order to enable hardware refresh.

Untying the knot would require us to find a way to better decouple service placement from capacity planning.

Back to our February 2020 meeting. Kaushik invited a number of folks working on different aspects of capacity management. Our capacity planning leads were in attendance, as were some of the leads in our fleet management group. The idea was to use our combined knowledge to brainstorm solutions for what had by then become known as the “regionalization problem”.

Kaushik started off by highlighting that we had two important puzzle pieces:

First, our Infrastructure as a Service (IaaS) project was well underway. IaaS was fundamentally about moving away from services managing their own machines, and instead managing our whole fleet through a small set of shared pools. This moved us to a model where we could move away from planning for individual services, and instead plan for capacity pools.
Our experience managing Meta’s Disaster Recovery efforts showed us that we could safely redistribute traffic across our various regions, and that our large service dependency graph was manageable in practice.

However, beyond this we had nothing concrete! We spent the day 1) sharing context; 2) defining what a successful outcome looks like; and 3) setting some near term goals. This structure neatly illustrates how work gets done in an organization like ours:

Because we operate in a complex environment with large teams of people, communicating both technical and organizational context is a vital part of successfully solving problems. Nobody has the entire picture in their head, so we have to create strong shared spaces of context to actually tackle these problems successfully.
Without a shared sense of what we want to accomplish, we can fail to make progress, as we all move in different directions.
Concrete, short-term goals force us to find a way to take the next step, to make concrete progress towards our goal. In addition to being closer to solving the problem itself, doing so often brings additional clarity.

At the end of the day, we left with a much greater clarity about the problem we were facing, but only with an abstract idea of how to go about solving it. We set ourselves an ambitious goal: while the hardware orders for the current quarter had already been made, we were only a month or so away from making our next hardware orders. Could we sprint to make “hardware first” orders, requiring us to also understand the service placement problem well enough by then so that we could make the orders with confidence?

The initial hardware orders

What followed was an intense month of working with the same group of people, gathered in the same room for 6 hours a day. (We spent the remainder of the time, individually retreating to work on models, spreadsheets, and code.) As we proceeded, we uncovered many new aspects of the problem, and we slowly converged on a methodology for determining how to order hardware in this new world.

By this point we had subdivided the problem into three parts, each of which could be solved somewhat independently.

Problem 1: This was the hardware ordering itself: how much hardware do we order into our pools, and into which regions?
Problem 2: This was the service placement problem: given a hardware distribution, how do we distribute capacity to services in an optimal way? And how should product traffic be distributed?
Problem 3: This is what we call the execution problem: given a particular placement plan, how do redistribute traffic and services to execute this plan?

(The Flux paper focuses primarily on how we solved problems 2 and 3.)

Since the hardware ordering deadline was looming, we focused on the ordering problem first. We prototyped many solutions, cobbling together Excel spreadsheets with Python notebooks, and occasionally even some C++ code. Our goal was to build models to help us understand the problem better, and to uncover aspects of it we had not yet considered. We relied mainly on existing datasets, and thus had to impute a lot of data. (And doing so helped inform our roadmaps for metrics collection.) We also met with different service owners—the recipients of the capacity we were ordering—to help validate the kind of capacity distributions we were coming up with, to see whether they would be workable in practice, given their knowledge of the systems they operated.

We finally arrived at a set of hardware orders, using a combination of algorithms, constraint solvers, and manual tuning. We then validated these orders with both service owners. We also used some initial service placement models we had started building. These models helped us to validate whether the hardware orders would be feasible: could we actually allocate the ordered hardware efficiently? In this first iteration, we built some simple, ad-hoc models. After the orders were made, we spent most of our time understanding how to improve these models, and use them to directly allocate capacity to services.

Preparing for landing

We used our initial solution to problem 1 to make our hardware orders. The ordered hardware would land in about 6 months, giving us relatively little time to come up with solutions for problems 2 (service placement) and 3 (plan execution).

We decided fairly early on to use our Disaster Recovery runbook system for our initial execution. These runbooks are designed as a kind of human-in-the-loop automation system that mixes automation with human operations, feedback, and verification. While they were designed for other purposes, and for plans with lower complexities than the ones we anticipated Flux needed, it presented a good starting point so that we could focus our energies on service placement.

Lesson-1: Try to work on one part of the system at a time. Using an existing—though not purpose-fit—system for execution allowed us to focus most of our energies on solving the comparatively more novel problem.

We knew that our model needed to capture both how much capacity is required by each service, and also how service dependencies affect placement. Our model also needed an independent variable: something that we could use to affect capacity placement. We realized that we could use our traffic management systems to control how product traffic (e.g,. Facebook, WhatsApp, Instagram) is distributed across our regions, and that we could model service capacity allocation for all services as a function of product traffic. The intuition is as follows: ultimately, demand for a service is driven by its callers. These callers are in turn services whose demand is driven by their callers. This chain terminates at a frontend service (e.g., a web server) that directly receives user traffic. This also holds for internal services whose global demand distribution we control. For example, XFaaS is a global serverless platform. We can control its demand distribution by influencing how requests are enqueued across different regions.

Thus, product traffic distribution became our model’s independent variable. Intuitively, then, we can take the capacity assigned to a service and break it down into slices representing each product. For example, a shared service like memcached might be broken down into 60% Facebook, 20% Instagram, 10% WhatsApp, and 10% XFaaS. We can then combine this overall breakdown with the observed traffic distribution across regions to explain a service’s capacity footprint.

Another modeling challenge is this: which units do we use to represent capacity? Not only do different services make use of different hardware types (e.g., CPU dominant, DRAM dominant, accelerators, flash, HDD, etc.) but each service can have a different bottleneck resource (for example, some services may be bottlenecked by CPU, others by DRAM), which also affects how we model capacity for a service. In our first iteration, we decided to simplify by denominating capacity in power (Watts) per rack type. While it doesn’t capture all nuances of a service’s capacity requirements, power is a useful way to normalize capacity in the aggregate, and it represents the actual availability of floor space in our datacenters. Our paper details how we solved this problem in later iterations by incorporating bottleneck resources into our service placement models.

As we raced to prepare for hardware arriving at our datacenters, we worked to combine multiple datasets, post processing them, and occasionally adjusting them manually. We “fanned out” our team to work on different aspects of these baseline data. Ultimately we combined them all in a Python notebook. This gave us the flexibility to iterate on both data and algorithms while providing reproducibility for our experiments and ultimately our production plans.

Once we had a baseline we were comfortable with, we built a greedy algorithm to come up with capacity allocations for each service that incorporated our orders and hardware refreshes. This algorithm was simple, but legible: we could easily understand how the allocations were made, and in turn easily explain these to the service teams receiving the capacity. The main drawback of our greedy algorithm is that it made it difficult to implement more complex objective functions.

Lesson-2: Introduce complexity gradually. Even if this results in suboptimal outcomes, doing simple things first helps you better understand the problem, and to understand what constitutes an appropriate level of solution complexity.

Once we built an allocation plan this way, we worked directly with service teams to validate and adjust them. Since this was a brand new model, and a brand new way of even approaching capacity allocation at Meta, we felt we needed manual validation to be comfortable with executing on it. We also wanted to make sure that we did not impose undue operational complexity on teams that received capacity through this process.

The first execution cycle

Once we had our capacity plan in place—vetted and approved by the many teams involved—we were ready to execute. Using our runbook infrastructure, we worked with a team of operational project managers to help execute the plan. They worked with various teams to upsize their capacity allocations in regions where they increased, helped the teams navigate traffic shifts, and ultimately relinquish capacity in regions where their allocation decreased.

This process was highly manual—we did not yet have our automation systems in place—but also highly educational. As it was our first time orchestrating such large scale capacity changes in our infrastructure, we uncovered many issues we had not foreseen, and we found edge cases in our models that were not well captured by our models or baselines. This helped us understand the problem much more deeply, and to guide our ongoing efforts to automate execution.

Lesson-3: First, do non-scalable things. Without our first manual execution, we would have built the wrong kind of automation. Manual execution taught us what the real problems are, and how we could best deploy automation in subsequent execution cycles.

While we executed our first placement plan successfully, our experience showed us that automation would be a strict requirement to support any larger-scale plans. We also ran into instances of teams that had intended to use capacity in new ways, not codified in our models. This helped to determine how to evolve our modeling approach as well, and to drive towards establishing a kind of capacity contract between our capacity organization and its clients: It would otherwise be exceedingly difficult to continue our push for increased automation if we could not codify the intent behind how services intended to use their allocated capacity.

A maturing system

By early 2021, we had:

Ordered hardware using a new process that was (mostly) decoupled from service placement.
Built a capacity placement baseline that captured the relationship between service capacity allocation and frontend product traffic distribution.
Algorithmically computed a service placement plan that best utilized the ordered hardware. The placement plan took into account service interdependencies, and placed product traffic according to capacity availability.
Executed the plan using our existing disaster recovery runbook system.

It was now time to assemble these pieces into a highly automated end-to-end system. We now had enough clarity to better parallelize our work: we understood the boundaries between the constituent parts of the system, and we could work more-or-less independently within these.

Model baselines. In our initial prototype, we imputed service dependency data from numerous baselines and expert knowledge. We switched to using RPC tracing to measure these dependencies directly. This involved working with both the team responsible for our RPC tracing systems, as well as various service teams to validate the models. We joined the RPC tracing data with baseline capacity data to translate service demand data into the capacity footprints (cs,r) required for our model.

Placement planning. We abandoned our greedy solver approach in favor of formulating a mixed-integer-optimization program (MIP) to which we could apply commercially available optimizers. This allowed us to implement more complex optimization objectives, and to guarantee optimality against those objectives. While it is more difficult to reason about the results that a greedy solver gives you, the additional expressiveness and optimization more than justified this increase in complexity.

Automated execution. We built a workflow execution engine for our placement plans. This system is able to seamlessly mix automation with human actions. When services did not integrate with our capacity automation systems, our workflow engine would coordinate execution by prompting operators to perform the equivalent actions. For services that were using our capacity automation systems, the workflow might ask a human operator to ratify the results. This was used in circumstances where our models did not yield sufficiently high quality results for a service.

Data as interface. One design principle is worth highlighting: all of these systems interfaced through “plain old data”, forming a pipeline. For example, our model baseline is a dataset that is fed into the placement planning algorithm. The placement planning stage outputs another dataset containing the placement plan. Finally, our execution system translates that plan into a graph of actions to be executed. This approach had many important advantages:

It means that we could iterate on each component individually. For example, we could freeze baseline inputs and adjust various optimization parameters in our planner.
We could store these datasets to later be used for later reproducing different parts of the pipeline. This helped us to debug issues that manifested only later. For example, we could easily debug a plan defect that only showed up in the execution phase.
It allowed us to track the lineage of each plan by maintaining versions of each dataset. This helped us to understand the potential scope of issues as they came up. For example, it allowed us to determine which plans may have been affected by a defect in the baseline datasets.

We extended our use of this principle to execution as well. We built a simple capacity control plane that was used as the interface between our workflow execution engine and the underlying capacity automation systems. The execution engine stores capacity intents into a shared database. Capacity intents include upsizing (of a service in a region), downsizing (of a service in a region), and product traffic shifts (between regions). Each intent includes an estimated time of execution, and contains a state of WAITING, READY, or COMPLETE. The workflow engine changes the state of the intent from READY whenever it is safe to execute; the capacity automation system responsible for the intent changes its state to COMPLETE when it has completed.

This capacity control plane services two purposes:

It decouples the orchestration of execution from its mechanisms. Thus, we can support multiple types of capacity automation.
It provides an authoritative timeline of anticipated capacity changes. Systems besides Flux also write such capacity intents, and Flux in turn adjusts its baselines to anticipate future capacity changes. Similarly, our capacity automation systems use the record of anticipated traffic shifts to correctly size services for fault tolerance prior to a traffic shift.

Lesson-4: Decouple systems through plain old data. Data is easy to store, manipulate, and reason about—it is first and foremost legible. By structuring our systems in this way, we could easily reason about each component in isolation, and support our increasingly complex workflows with simple infrastructure.

We now had a basic system in place. It worked end to end, with reasonable automation for many services. Now came the time to iterate on improving completeness and quality. We worked out how to best measure the performance of our system, balancing technical performance (the kind of measurements you’ll see in evaluation sections of systems papers) with “product performance” — that is, how well the system works for its clients. Incorporating measurements at an early stage in the project lets us create an important positive feedback loop: the measurements expose where the system isn’t performing well enough. This often begins by looking at a high-level metric, then breaking it down into its constituent components. Pick the biggest contributor, and work to improve it. Rinse, repeat.

We also used these measurements to set project level goals. This caused the various sub-teams (we were a little more than 10 people working on Flux full time now) to figure out how to contribute to the overall goals. It’s a great way to manage large and complex projects.

Lesson-5: Measure what matters. Find a way to incorporate measurements into your project at an early stage. Use these to guide execution and maturation.

By focusing on improving these metrics, we were ultimately able to dramatically increase the amount of automation available for execution (see paper for details) and commensurately decrease the execution time, improving overall capacity efficiency.

Flux today

While Flux has now operated in production for more than 3 years, the world has done anything but stand still. In the time Flux has been used to manage capacity at Meta, we have undergone both supply and demand shocks, as well as a large increase in AI investments. Responding to these realities has pushed our capacity planning and management systems to their limits—and Flux has played a central role in our ability to respond well.

At the same time, Flux is decidedly a system that has been retrofitted onto an existing capacity management regime: Flux models (observes) how capacity is used, and influences the allocation of that capacity through automation built on top of existing capacity automation systems.

The limitations of this approach is that there is only a weak notion of a “contract” between our infrastructure and the services using that infrastructure. Namely, if a service changes its regionalization behavior, Flux will continue to accept this as a valid change, and update its models accordingly. Flux assumes that the current capacity baseline constitutes the ground truth about how to operate our services.

We realized over time that, in order to operate our infrastructure more predictably, and to be able to provide longer planning horizons, we needed to elevate the concepts of Flux into a capacity contract. For example, a service can “purchase” capacity by specifying how that capacity is intended to follow product traffic. Now, the separation of responsibilities between infrastructure and services is much more cleanly delineated, and we can make strong assumptions during planning. This also solves one of the main sticking points with Flux, as it was outlined in the paper: models can be wrong, and they often are.

By codifying regionalization constraints into capacity contracts, we instead use Flux to enforce the contract, and we can tell whether Flux has been able to deliver capacity where it is needed (per the contract). It is then up to service owners to correctly use that capacity. In the past, if Flux did not deliver enough capacity to support a service in a region, then Flux was naturally seen as the cause of this issue, as its capacity models did not accurately capture a real requirement. With capacity contracts, we are instead declaring what the capacity model is, and requiring services to specify which product traffic entities they want their capacity to follow.

Lesson-6: Codify models into contracts. Managing capacity invariably involves modeling its usage. Without a clearly defined boundary between the contract provided by the capacity management system and the services that use them, the model complexity will invariably grow intractable.

Global abstraction. Flux was designed to scale capacity management to 10+ datacenter regions. We’re now operating 20+ datacenter regions, and, while Flux (and its successor systems) solves many important capacity management problems, there are also service management problems that arise from such broad deployments. For example: How do you coordinate deployment of a service across 20+ regions? How do you load balance traffic to services that are deployed across those same regions? How do you manage replicas of sharded services? How should services manage the myriad correlated and uncorrelated failure scenarios presented by 20+ regions? These problems can only be solved if we begin treating regions as “cattle, not pets”.

Our answer to this is an effort we call Global Abstractions. The purpose of this effort is to remove “region” from the vocabulary of our service management tools. Instead of deploying a service into N regions, the service is instead deployed globally through a single control plane. The control plane is responsible for managing how to spread service jobs across regions, how to replicate data, how to route traffic, and how to prepare for and manage failure scenarios. Crucially, service owners reason about their services globally, while our control plane takes on the burden of managing how services are replicated, routed to, and deployed.

We chose this name because our intent is to introduce new, global abstractions. For example, we expect service routing to be managed in terms of latency targets, and not by managing how traffic is routed between specific regions. Similarly, we manage a global capacity allowance, rather than one for each region. Sharded services are managed against a replication budget and a set of disaster scenarios.

Providing global abstractions also lets us support services that are not deployed to every region (and whose clients can tolerate the additional latency incurred by cross-region calls). For example, some stateful services find that it’s wasteful to replicate data to every region; in some cases, we are also constrained by hardware availability in some regions. Supporting such deployments will require us to also codify latency expectations in our placement models.

Global Abstractions present fantastic opportunities to simultaneously simplify our infrastructure offerings, while presenting more optimization opportunities while managing that infrastructure. We’ll have more to say about these efforts soon.

Coda

Capacity management is: 1) important; 2) technically hard; and 3) overlooked in the systems community. We think there abound opportunities in this space, and our presentation of Flux really only scratched the surface of this rich problem space.

Thanks to CQ Tang for many suggestions and edits; and to all of my colleagues who helped make Flux a reality.

About the author: Marius Eriksen is a software engineer in Meta’s Core Systems group. His work there includes capacity management, service management, control plane infrastructure, and AI infrastructure.

Disclaimer: Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGOPS.

Editor: Dong Du (Shanghai Jiao Tong University)