The past five years have seen a significant change in the way cloud servers look. Traditionally homogeneous cloud systems are progressively shifting to heterogeneous designs, either through special-purpose chips, like Google’s TPUs, or reconfigurable fabrics, like Microsoft’s Catapult and Brainwave projects; some even combine the two.
This post provides an introduction to the reasons behind this increasing heterogeneity, the ways it is realized in warehouse-scale systems, and what its increased prevalence means for the rest of the system stack.
Why heterogeneity now?
The cloud has achieved its prevalence by offering resource flexibility and cost efficiency, both for the cloud operator and the end user. Cost efficiency, specifically, comes from leveraging the economies of scale of buying tens of thousands of servers of mostly the same type. The benefits of this homogeneity, however, go beyond cost efficiency; managing a homogeneous system is also much easier from the perspective of the operating system, the cluster manager (the datacenter-wide resource manager and scheduler), the compiler, and the application design itself. Application deployment becomes much simpler when all servers look the same, and the main placement decision has to do with resource availability. Debugging and tracing tools, which are in place at any major cloud provider, are also much easier to design and deploy in a homogeneous system.
So why change? The obvious reason is the slowdown of technology scaling, aka the obligatory computer architect’s reference to the end of Moore’s Law. Transistors can still be made smaller, although at a slower pace than before, but packing more of them in the same area now results in a significant increase in power density, which was not the case while the lesser-known Dennard scaling was still in effect. This means that if applications require more compute capability without a significant increase in power and/or cost, system designers need to look at special-purpose designs. The specific form such designs take varies depending on how set-in-stone the target application is, how much emphasis is placed on programmability, and the cost trade-off between fabricating a new chip versus reusing an expensive device for longer.
Although the slowdown of technology scaling is a major force behind specialized computing, it is not the only one. Cloud applications themselves have changed. In place of large monolithic services, where the functionality of an entire application, like a social network or search engine, was compiled and deployed as a single service, cloud services have increasingly adopted fine-grained, modular designs, like microservices and serverless compute. By design, these applications consist of hundreds if not thousands of short-lived functions, each of which has very strict latency requirements. Where the end-to-end tail latency target of a cloud service may be several milliseconds, the tail latency target of an individual microservice is on the order of microseconds. Missing these requirements not only affects the offending function itself but, through dependencies across microservices, can also cause cascading performance issues across a cluster. By default, traditional general-purpose servers are not designed to achieve such low and predictable latency. Instead, they are designed for compatibility with any application a user may launch on them. The hardware support and, more importantly, the software stack needed to enable this often introduce unpredictable performance, i.e., performance jitter. By narrowing down the hardware and software design to the needs of a specific application, heterogeneity can achieve more consistent, predictable performance.
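To make the effect of dependencies across microservices concrete, here is a minimal sketch of why strict per-function tail latency matters. It assumes that a request touches a number of microservices and that each one independently misses its latency target with the same probability; this independence is an idealization chosen for illustration, and the function name and numbers are hypothetical rather than taken from any particular system.

```python
def prob_request_hits_tail(num_services: int, miss_prob: float) -> float:
    """Probability that at least one of `num_services` calls misses its target.

    Assumes misses are independent and equally likely across services,
    which is a simplification made for this illustration.
    """
    if num_services < 0:
        raise ValueError("num_services must be non-negative")
    if not 0.0 <= miss_prob <= 1.0:
        raise ValueError("miss_prob must be in [0, 1]")
    # P(no call misses) = (1 - miss_prob) ** num_services
    return 1.0 - (1.0 - miss_prob) ** num_services


if __name__ == "__main__":
    # Hypothetical fan-out sizes, each service missing its target 1% of the time.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} services: {prob_request_hits_tail(n, 0.01):.3f}")
```

Under these assumptions, a request that depends on 100 microservices sees at least one slow call about 63% of the time, even though each individual service is slow only 1% of the time.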
What does heterogeneity look like?
The specific form heterogeneity takes in a cloud system depends on several factors, including the need for programmability, power efficiency, cost overheads, and system integration, but it usually follows one of two approaches: special-purpose chips (ASICs), designed for a specific application or application class, or reconfigurable acceleration fabrics (FPGAs), where the hardware can be adjusted to accommodate changes in application logic. The former works better for mostly stable applications, whose core computation does not change fundamentally very often and for which the chip fabrication cost is tolerable, while the latter is better suited to services where accommodating changes in application logic is important and/or the overheads of chip fabrication are undesirable.
An example of the first type of hardware acceleration is Google’s TPUs, ASICs designed to accelerate deep learning training and inference. An example of the second is Microsoft’s FPGA fabrics for accelerating aspects of web search (Catapult), deep learning (Brainwave), and network processing (Azure SmartNIC). Both approaches have shown significantly better performance and performance predictability than general-purpose designs for their respective target applications. Other types of acceleration, e.g., processing-in-memory (PIM) or near-data-processing (NDP) designs, where wimpy cores are placed under 3D-stacked memory, are also gaining traction in the cloud.
What are the implications of heterogeneity?
While it may seem like cloud heterogeneity via acceleration is primarily a computer architecture issue, it has deep implications across the system stack. Below is a short overview of only a few of them.
- Programmability
The lack of programmability in hardware accelerators has been one of the main roadblocks to more widespread adoption of heterogeneous platforms in the cloud. So far, programming accelerators has been limited to expert cloud developers with a deep understanding of both the application and the hardware platform. For acceleration to reach its full potential, it is important to further explore high-level programming abstractions (e.g., Merge and ParallelXL) that hide the complexity of heterogeneity from the user.
- Resource management
Despite its performance and power benefits, heterogeneity also complicates resource management in the cloud. Where previously every server could be expected to yield similar levels of performance, as heterogeneity becomes more prevalent, the range of performance and power behaviors across hardware platforms becomes more diverse. Cluster schedulers (e.g., Borg or Quasar) need to be aware of the different profiles of heterogeneous machines when allocating resources to applications, to avoid exacerbating performance unpredictability, especially in interactive, latency-critical services.
- Cross-ISA compilation
Linked to the issue of resource management is the fact that applications should be able to run on different hardware platforms, which means that systems need to account for variation in ISAs when compiling and deploying an application. This becomes even more challenging because cloud applications change frequently, with daily, or at least weekly, roll-outs being common, and in cases where migration across platforms needs to happen in the middle of a computation.
- Debugging tools
Monitoring, tracing, and debugging infrastructure is a critical part of any cloud system. These systems track application and system behavior over time, and help site reliability engineers (SREs) debug the sources of poor performance. Already complex today (see Dapper, The Mystery Machine, X-Trace, and Seer for some examples), they need to be enhanced to account for the impact of heterogeneity in their monitoring and trace analysis. This work ranges from architects designing hardware accelerators that expose both low-level architectural characteristics and high-level performance metrics, to distributed systems engineers extending trace analytics to account for heterogeneity when searching for the culprit of unpredictable performance.
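As an illustration of the resource management point above, the following is a minimal sketch of a placement decision that takes platform profiles into account. It is not how Borg or Quasar work; the `Machine` type, the platform labels, and the per-platform latency predictions (assumed here to come from some form of profiling) are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Machine:
    name: str
    platform: str    # hypothetical label, e.g. "cpu", "fpga", "asic"
    free_slots: int  # remaining capacity on this machine


def place(
    machines: List[Machine],
    predicted_tail_us: Dict[str, float],
    target_tail_us: float,
) -> Optional[Machine]:
    """Pick a machine whose platform is predicted to meet the latency target.

    `predicted_tail_us` maps a platform label to the predicted tail latency
    (in microseconds) of this application on that platform. Returns None
    if no machine with spare capacity is predicted to meet the target.
    """
    feasible = [
        m for m in machines
        if m.free_slots > 0
        and m.platform in predicted_tail_us
        and predicted_tail_us[m.platform] <= target_tail_us
    ]
    if not feasible:
        return None
    # Prefer the lowest predicted latency; break ties by most free capacity.
    return min(
        feasible,
        key=lambda m: (predicted_tail_us[m.platform], -m.free_slots),
    )


if __name__ == "__main__":
    cluster = [
        Machine("a", "cpu", 4),
        Machine("b", "fpga", 1),
        Machine("c", "fpga", 0),
    ]
    profile = {"cpu": 900.0, "fpga": 200.0}
    choice = place(cluster, profile, target_tail_us=500.0)
    print(choice.name if choice else "no feasible machine")
```

This sketch only selects a machine and does not update its capacity. A real cluster manager would also have to weigh interference, cost, and whether scarce accelerators should be reserved for the applications that benefit from them most.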
One promising way forward is abstracting away the complexity that hardware heterogeneity introduces by leveraging automated, data-driven techniques in the design and management of cloud systems. Machine learning has been shown to be beneficial in many aspects of hardware and software design, and this can extend to handling the implications of heterogeneity. A separate SIGARCH blog post goes deeper into the premise and perils of this approach.
Conclusion
Unlike mobile computing, cloud systems took a long time to embrace heterogeneity because of the many benefits of homogeneous computing. Both the slowdown of technology scaling and the increasingly tight latency constraints of new cloud programming models signal that heterogeneity, in its many forms, is here to stay and will only keep increasing in scale. While the introduction of hardware heterogeneity in the cloud is beneficial, it comes with significant challenges. So far, the study of heterogeneity has primarily centered on the computer architecture level, but its many implications also require revisiting the rest of the system stack.
Bio: Christina Delimitrou is an Assistant Professor in Electrical and Computer Engineering at Cornell University. Her main interests are in computer architecture and computer systems. Specifically, she works on improving the resource efficiency of large-scale datacenters through QoS-aware scheduling and resource management techniques.
Disclaimer: This post was written by Christina Delimitrou for the SIGOPS blog. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGOPS, or their parent organization, ACM.