EuroSys '19- Proceedings of the Fourteenth EuroSys Conference 2019

Full Citation in the ACM Digital Library

SESSION: Security

Time Protection: The Missing OS Abstraction

Qian Ge
Yuval Yarom
Tom Chothia
Gernot Heiser

Timing channels enable data leakage that threatens the security of computer systems, from cloud platforms to smartphones and browsers executing untrusted third-party code. Preventing unauthorised information flow is a core duty of the operating system, however, present OSes are unable to prevent timing channels. We argue that OSes must provide time protection, the temporal equivalent of the established memory protection, for isolating security domains. We examine the requirements of time protection, present a design and its implementation in the seL4 microkernel, and evaluate efficacy and cost on x86 and Arm processors.

Why Joanie Can Encrypt: Easy Email Encryption with Easy Key Management

John S. Koh
Steven M. Bellovin
Jason Nieh

Email privacy is of crucial importance. Existing email encryption approaches are comprehensive but seldom used due to their complexity and inconvenience. We take a new approach to simplify email encryption and improve its usability by implementing receiver-controlled encryption: newly received messages are transparently downloaded and encrypted to a locally-generated key; the original message is then replaced. To avoid the problem of moving a single private key between devices, we implement per-device key pairs: only public keys need be synchronized via a simple verification step. Compromising an email account or server only provides access to encrypted emails. We implemented this scheme on several platforms, showing it works with PGP and S/MIME, is compatible with widely used mail clients and email services including Gmail, has acceptable overhead, and that users consider it intuitive and easy to use.

Conclave: secure multi-party computation on big data

Nikolaj Volgushev
Malte Schwarzkopf
Ben Getchell
Mayank Varia
Andrei Lapets
Azer Bestavros

Secure Multi-Party Computation (MPC) allows mutually distrusting parties to run joint computations without revealing private data. Current MPC algorithms scale poorly with data size, which makes MPC on "big data" prohibitively slow and inhibits its practical use.

Many relational analytics queries can maintain MPC's end-to-end security guarantee without using cryptographic MPC techniques for all operations. Conclave is a query compiler that accelerates such queries by transforming them into a combination of data-parallel, local cleartext processing and small MPC steps. When parties trust others with specific subsets of the data, Conclave applies new hybrid MPC-cleartext protocols to run additional steps outside of MPC and improve scalability further.

Our Conclave prototype generates code for cleartext processing in Python and Spark, and for secure MPC using the Sharemind and Obliv-C frameworks. Conclave scales to data sets between three and six orders of magnitude larger than state-of-the-art MPC frameworks support on their own. Thanks to its hybrid protocols and additional optimizations, Conclave also substantially outperforms SMCQL, the most similar existing system.

ConfLLVM: A Compiler for Enforcing Data Confidentiality in Low-Level Code

Ajay Brahmakshatriya
Piyus Kedia
Derrick P. McKee
Deepak Garg
Akash Lal
Aseem Rastogi
Hamed Nemati
Anmol Panda
Pratik Bhatu

We present a compiler-based scheme to protect the confidentiality of sensitive data in low-level applications (e.g. those written in C) in the presence of an active adversary. In our scheme, the programmer marks sensitive data by lightweight annotations on the top-level definitions in the source code. The compiler then uses a combination of static dataflow analysis, runtime instrumentation, and a novel taint-aware form of control-flow integrity to prevent data leaks even in the presence of low-level attacks. To reduce runtime overheads, the compiler uses a novel memory layout.

We implement our scheme within the LLVM framework and evaluate it on the standard SPEC-CPU benchmarks, and on larger, real-world applications, including the NGINX webserver and the OpenLDAP directory server. We find that the performance overheads introduced by our instrumentation are moderate (average 12% on SPEC), and the programmer effort to port the applications is minimal.

SESSION: Exploiting CPU Architecture

Per-Application Power Delivery

Akhil Guliani
Michael M. Swift

Datacenter servers are often under-provisioned for peak power consumption due to the substantial cost of providing power. When there is insufficient power for the workload, servers can lower voltage and frequency levels to reduce consumption, but at the cost of performance. Current processors provide power limiting mechanisms, but they generally apply uniformly to all CPUs on a chip. For servers running heterogeneous jobs, though, it is necessary to differentiate the power provided to different jobs. This prevents interference when a job may be throttled by another job hitting a power limit. While some recent CPUs support per-CPU power management, there are no clear policies on how to distribute power between applications. Current hardware power limiters, such as Intel's RAPL throttle the fastest core first, which harms high-priority applications.

In this work, we propose and evaluate priority-based and share-based policies to deliver differential power to applications executing on a single socket in a server. For share-based policies, we design and evaluate policies using shares of power, shares of frequency, and shares of performance. These variations have different hardware and software requirements, and different results. Our results show that power shares have the worst performance isolation, and that frequency shares are both simpler and generally perform better than performance shares.

AnDrone: Virtual Drone Computing in the Cloud

Alexander Van't Hof
Jason Nieh

With the continued proliferation of drones, unmanned aerial vehicles, additional uses for them are growing and the demand for their services is on the rise. We present AnDrone, a drone-as-a-service solution that makes drones accessible in the cloud. AnDrone pairs a cloud service with the first drone virtualization architecture. This enables a physical drone to run multiple virtual drones simultaneously in an isolated and secure manner at little additional cost, as computational costs are cheap compared to the operational and energy costs of putting a drone in the air. AnDrone virtualizes drones using a novel Linux container architecture. Android Things virtual drone containers provide a familiar user and development environment that can run existing Android apps. A real-time Linux flight controller container supports existing drone flight software and provides virtual drones with geofenced flight control. A device container transparently multiplexes access from virtual drones to a full range of drone hardware devices, including cameras and other sensors. Upon flight completion, virtual drones and their data can be uploaded to the cloud for offline access. We have implemented an AnDrone prototype based on Raspberry Pi 3 drone hardware. We demonstrate that it incurs minimal runtime performance and energy overhead, supports real-time virtual drone flight control, and runs untrusted third-party software in virtual drones in a secure manner while ensuring that the service provider maintains control of the drone hardware.

When eXtended Para - Virtualization (XPV) Meets NUMA

Bao Bui
Djob Mvondo
Boris Teabe
Kevin Jiokeng
Lavoisier Wapet
Alain Tchana
Gaël Thomas
Daniel Hagimont
Gilles Muller
Noel DePalma

This paper addresses the problem of efficiently virtualizing NUMA architectures. The major challenge comes from the fact that the hypervisor regularly reconfigures the placement of a virtual machine (VM) over the NUMA topology. However, neither guest operating systems (OSes) nor system runtime libraries (e.g., Hotspot) are designed to consider NUMA topology changes at runtime, leading end user applications to unpredictable performance. This paper presents eXtended Para-Virtualization (XPV), a new principle to efficiently virtualize a NUMA architecture. XPV consists in revisiting the interface between the hypervisor and the guest OS, and between the guest OS and system runtime libraries (SRL) so that they can dynamically take into account NUMA topology changes. The paper presents a methodology for systematically adapting legacy hypervisors, OSes, and SRLs. We have applied our approach with less than 2k line of codes in two legacy hypervisors (Xen and KVM), two legacy guest OSes (Linux and FreeBSD), and three legacy SRLs (Hotspot, TCMalloc, and jemalloc). The evaluation results showed that XPV outperforms all existing solutions by up to 304%.

Make the Most out of Last Level Cache in Intel Processors

Alireza Farshin
Amir Roozbeh
Gerald Q. Maguire, Jr.
Dejan Kostić

In modern (Intel) processors, Last Level Cache (LLC) is divided into multiple slices and an undocumented hashing algorithm (aka Complex Addressing) maps different parts of memory address space among these slices to increase the effective memory bandwidth. After a careful study of Intel's Complex Addressing, we introduce a slice-aware memory management scheme, wherein frequently used data can be accessed faster via the LLC. Using our proposed scheme, we show that a key-value store can potentially improve its average performance ~12.2% and ~11.4% for 100% & 95% GET workloads, respectively. Furthermore, we propose CacheDirector, a network I/O solution which extends Direct Data I/O (DDIO) and places the packet's header in the slice of the LLC that is closest to the relevant processing core. We implemented CacheDirector as an extension to DPDK and evaluated our proposed solution for latency-critical applications in Network Function Virtualization (NFV) systems. Evaluation results show that CacheDirector makes packet processing faster by reducing tail latencies (90-99th percentiles) by up to 119 μs (~21.5%) for optimized NFV service chains that are running at 100 Gbps. Finally, we analyze the effectiveness of slice-aware memory management to realize cache isolation.

SESSION: OS Kernel

SkyBridge: Fast and Secure Inter-Process Communication for Microkernels

Zeyu Mi
Dingji Li
Zihan Yang
Xinran Wang
Haibo Chen

Microkernels have been extensively studied over decades. However, IPC (Inter-Process Communication) is still a major factor of runtime overhead, where fine-grained isolation usually leads to excessive IPCs. The main overhead of IPC comes from the involvement of the kernel, which includes the direct cost of mode switches and address space changes, as well as indirect cost due to the pollution of processor structures.

In this paper, we present SkyBridge, a new communication facility designed and optimized for synchronous IPC in microkernels. SkyBridge requires no involvement of kernels during communication and allows a process to directly switch to the virtual address space of the target process and invoke the target function. SkyBridge retains the traditional virtual address space isolation and thus can be easily integrated into existing microkernels. The key idea of SkyBridge is to leverage a commodity hardware feature for virtualization (i.e., VMFUNC) to achieve efficient IPC. To leverage the hardware feature, SkyBridge inserts a tiny virtualization layer (Rootkernel) beneath the original microkernel (Subkernel). The Rootkernel is carefully designed to eliminate most virtualization overheads. SkyBridge also integrates a series of techniques to guarantee the security properties of IPC.

We have implemented SkyBridge on three popular open-source microkernels (seL4, Fiasco.OC, and Google Zircon). The evaluation results show that SkyBridge improves the speed of IPC by 1.49x to 19.6x for microbenchmarks. For real-world applications (e.g., SQLite3 database), SkyBridge improves the throughput by 81.9%, 1.44x and 9.59x for the three microkernels on average.

CoPart: Coordinated Partitioning of Last-Level Cache and Memory Bandwidth for Fairness-Aware Workload Consolidation on Commodity Servers

Jinsu Park
Seongbeom Park
Woongki Baek

Workload consolidation is a widely-used technique to maximize server resource utilization in cloud and datacenter computing. Recent commodity CPUs support last-level cache (LLC) and memory bandwidth partitioning functionalities that can be used to ensure the fairness of the consolidated workloads. While prior work has proposed a variety of resource partitioning techniques, it still remains unexplored to characterize the impact of LLC and memory bandwidth partitioning on the fairness of the consolidated workloads and investigate system software support to dynamically control LLC and memory bandwidth partitioning in a coordinated manner.

To bridge this gap, we present an in-depth performance and fairness characterization of LLC and memory bandwidth partitioning. Guided by the characterization results, we propose CoPart, coordinated partitioning of LLC and memory bandwidth for fairness-aware workload consolidation on commodity servers. CoPart dynamically analyzes the characteristics of the consolidated applications and allocates the LLC and memory bandwidth across the applications in a coordinated manner to improve the overall fairness. Our quantitative evaluation shows that CoPart significantly improves the fairness of the consolidated applications (e.g., 57.3% higher fairness on average than the resource allocation policy that equally allocates the resources to the consolidated applications), robustly provides high fairness across various application and system configurations, and incurs small performance overhead.

LockDoc: Trace-Based Analysis of Locking in the Linux Kernel

Alexander Lochmann
Horst Schirmeier
Hendrik Borghorst
Olaf Spinczyk

For fine-grained synchronization of application and kernel threads, the Linux kernel provides a multitude of different locking mechanisms that are being used on various individually locked data structures. Understanding which locks are required in which order for a particular member variable of a kernel data structure has become truly difficult, even for Linux-kernel experts themselves.

In this paper we introduce LockDoc -- an approach that, based on the analysis of execution traces of an instrumented Linux kernel, automatically deduces the most likely locking rule for all members of arbitrary kernel data structures. From these locking rules, LockDoc generates documentation that supports kernel developers and helps avoiding concurrency bugs. Additionally, the (very limited) existing documentation can be verified, and locking-rule violations -- potential bugs in the kernel code -- can be found.

Our results include generated locking rules for previously predominantly undocumented member variables of 11 different Linux-kernel data structures. Manually inspecting the scarce source-code documentation for five of these data structures reveals that only 53 percent of the variables with a documented locking rule are actually consistently accessed with the required locks held. This indicates possible documentation or synchronization bugs in the Linux kernel, of which one has already been confirmed by kernel experts.

Compact NUMA-aware Locks

Dave Dice
Alex Kogan

Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping the lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication. Virtually all those locks, however, are hierarchical in their nature, thus requiring space proportional to the number of sockets. The increased memory cost renders NUMA-aware locks unsuitable for systems that are conscious to space requirements of their synchronization constructs, with the Linux kernel being the chief example.

In this work, we present a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is a variant of an efficient (NUMA-oblivious) MCS lock, and inherits its performant features, such as local spinning and a single atomic instruction in the acquisition path. Unlike MCS, the new lock organizes waiting threads in two queues, one composed of threads running on the same socket as the current lock holder, and another composed of threads running on a different socket(s).

We implemented the new lock in user-space as well as integrated it in the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel. Our evaluation using both user-space and kernel benchmarks shows that the new lock has a single-thread performance of MCS, but significantly outperforms the latter under contention, achieving a similar level of performance when compared to other, state-of-the-art NUMA-aware locks that require substantially more space.

SESSION: Storage Systems

Project Almanac: A Time-Traveling Solid-State Drive

Xiaohao Wang
Yifan Yuan
You Zhou
Chance C. Coats
Jian Huang

Preserving the history of storage states is critical to ensuring system reliability and security. It facilitates system functions such as debugging, data recovery, and forensics. Existing software-based approaches like data journaling, logging, and backups not only introduce performance and storage cost, but also are vulnerable to malware attacks, as adversaries can obtain kernel privileges to terminate or destroy them.

In this paper, we present Project Almanac, which includes (1) a time-travel solid-state drive (SSD) named TimeSSD that retains a history of storage states in hardware for a window of time, and (2) a toolkit named TimeKits that provides storage-state query and rollback functions. TimeSSD tracks the history of storage states in the hardware device, without relying on explicit backups, by exploiting the property that the flash retains old copies of data when they are updated or deleted. We implement TimeSSD with a programmable SSD and develop TimeKits for several typical system applications. Experiments, with a variety of real-world case studies, demonstrate that TimeSSD can retain all the storage states for eight weeks, with negligible performance overhead, while providing the device-level time-travel property.

ShieldStore: Shielded In-memory Key-value Storage with SGX

Taehoon Kim
Joongun Park
Jaewook Woo
Seungheun Jeon
Jaehyuk Huh

The shielded computation of hardware-based trusted execution environments such as Intel Software Guard Extensions (SGX) can provide secure cloud computing on remote systems under untrusted privileged system software. However, hardware overheads for securing protected memory restrict its capacity to a modest size of several tens of megabytes, and more demands for protected memory beyond the limit cause costly demand paging. Although one of the widely used applications benefiting from the enhanced security of SGX, is the in-memory key-value store, its memory requirements are far larger than the protected memory limit. Furthermore, the main data structures commonly use fine-grained data items such as pointers and keys, which do not match well with the coarse-grained paging of the SGX memory extension technique. To overcome the memory restriction, this paper proposes a new in-memory key-value store designed for SGX with application-specific data security management. The proposed key-value store, called ShieldStore, maintains the main data structures in unprotected memory with each key-value pair individually encrypted and integrity-protected by its secure component running inside an enclave. Based on the enclave protection by SGX, ShieldStore provides secure data operations much more efficiently than the baseline SGX key-value store, achieving 8--11 times higher throughput with 1 thread, and 24--30 times higher throughput with 4 threads.

URSA: Hybrid Block Storage for Cloud-Scale Virtual Disks

Huiba Li
Yiming Zhang
Dongsheng Li
Zhiming Zhang
Shengyun Liu
Peng Huang
Zheng Qin
Kai Chen
Yongqiang Xiong

This paper presents URSA, a hybrid block store that provides virtual disks for various applications to run efficiently on cloud VMs. Trace analysis shows that the I/O patterns served by block storage have limited locality to exploit. Therefore, instead of using SSDs as a cache layer, URSA proposes an SSD-HDD-hybrid storage structure that directly stores primary replicas on SSDs and replicates backup replicas on HDDs, using journals to bridge the performance gap between SSDs and HDDs. URSA integrates the hybrid structure with designs for high reliability, scalability, and availability. Experiments show that URSA in its hybrid mode achieves almost the same performance as in its SSD-only mode (storing all replicas on SSDs), and outperforms other block stores (Ceph and Sheepdog) even in their SSD-only mode while achieving much higher CPU efficiency (performance per core). We also discuss some practical issues in our deployment.

VStore: A Data Store for Analytics on Large Videos

Tiantu Xu
Luis Materon Botelho
Felix Xiaozhu Lin

We present VStore, a data store for supporting fast, resource-efficient analytics over large archival videos. VStore manages video ingestion, storage, retrieval, and consumption. It controls video formats along the video data path. It is challenged by i) the huge combinatorial space of video format knobs; ii) the complex impacts of these knobs and their high profiling cost; iii) optimizing for multiple resource types. It explores an idea called backward derivation of configuration: in the opposite direction along the video data path, VStore passes the video quantity and quality expected by analytics backward to retrieval, to storage, and to ingestion. In this process, VStore derives an optimal set of video formats, optimizing for different resources in a progressive manner.

VStore automatically derives large, complex configurations consisting of more than one hundred knobs over tens of video formats. In response to queries, VStore selects video formats catering to the executed operators and the target accuracy. It streams video data from disks through decoder to operators. It runs queries as fast as 362x of video realtime.

SESSION: Datacenter Systems

Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints

Pulkit A. Misra
María F. Borge
Íñigo Goiri
Alvin R. Lebeck
Willy Zwaenepoel
Ricardo Bianchini

Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher priority) workloads. This paper introduces techniques for managing tail latencies in these systems, while addressing the practical challenges inherent in production datacenters (e.g., hardware heterogeneity, interference from other workloads, the need to maximize simplicity and maintainability). We implement our techniques in a scalable distributed file system (an extension of HDFS) used in production at Microsoft. Our evaluation uses 70k servers in 3 datacenters, and shows that our techniques reduce tail latency significantly for production workloads.

Wormhole: A Fast Ordered Index for In-memory Data Management

Xingbo Wu
Fan Ni
Song Jiang

In-memory data management systems, such as key-value stores, have become an essential infrastructure in today's big-data processing and cloud computing. They rely on efficient index structures to access data. While unordered indexes, such as hash tables, can perform point search with O(1) time, they cannot be used in many scenarios where range queries must be supported. Many ordered indexes, such as B+ tree and skip list, have a O(log N) lookup cost, where N is number of keys in an index. For an ordered index hosting billions of keys, it may take more than 30 key-comparisons in a lookup, which is an order of magnitude more expensive than that on a hash table. With availability of large memory and fast network in today's data centers, this O(log N) time is taking a heavy toll on applications that rely on ordered indexes.

In this paper we introduce a new ordered index structure, named Wormhole, that takes O(log L) worst-case time for looking up a key with a length of L. The low cost is achieved by simultaneously leveraging strengths of three indexing structures, namely hash table, prefix tree, and B+ tree, to orchestrate a single fast ordered index. Wormhole's range operations can be performed by a linear scan of a list after an initial lookup. This improvement of access efficiency does not come at a price of compromised space efficiency. Instead, Wormhole's index space is comparable to those of B+ tree and skip list. Experiment results show that Wormhole outperforms skip list, B+ tree, ART, and Masstree by up to 8.4x, 4.9x, 4.3x, and 6.6x in terms of key lookup throughput, respectively.

Scalable RDMA RPC on Reliable Connection with Efficient Resource Sharing

Youmin Chen
Youyou Lu
Jiwu Shu

RDMA provides extremely low latency and high bandwidth to distributed systems. Unfortunately, it fails to scale and suffers from performance degradation when transferring data to an increasing number of targets on Reliable Connection (RC). We observe that the above scalability issue has its root in the resource contention in the NIC cache, the CPU cache and the memory of each server. In this paper, we propose ScaleRPC, an efficient RPC primitive using one-sided RDMA verbs on reliable connection to provide scalable performance. To effectively alleviate the resource contention, ScaleRPC introduces 1) connection grouping to organize the network connections into groups, so as to balance the saturation and thrashing of the NIC cache; 2) virtualized mapping to enable a single message pool to be shared by different groups of connections, which reduces CPU cache misses and improve memory utilization. Such scalable connection management provides substantial performance benefits: By deploying ScaleRPC both in a distributed file system and a distributed transactional system, we observe that it achieves high scalability and respectively improves performance by up to 90% and 160% for metadata accessing and SmallBank transaction processing.

FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems

Jeffrey F. Lukman
Huan Ke
Cesar A. Stuardo
Riza O. Suminto
Daniar H. Kurniawan
Dikaimin Simon
Satria Priambada
Chen Tian
Feng Ye
Tanakorn Leesatapornwongsa
Aarti Gupta
Shan Lu
Haryadi S. Gunawi

We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability to overcome the path/state-space explosion problem in testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms: state symmetry, event independence, and parallel flips, which collectively makes our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs --- all were done without random walks or manual checkpoints.

SESSION: Networking

The Case For In-Network Computing On Demand

Yuta Tokusashi
Huynh Tu Dang
Fernando Pedone
Robert Soulé
Noa Zilberman

Programmable network hardware can run services traditionally deployed on servers, resulting in orders-of-magnitude improvements in performance. Yet, despite these performance improvements, network operators remain skeptical of in-network computing. The conventional wisdom is that the operational costs from increased power consumption outweigh any performance benefits. Unless in-network computing can justify its costs, it will be disregarded as yet another academic exercise.

In this paper, we challenge that assumption, by providing a detailed power analysis of several in-network computing use cases. Our experiments show that in-network computing can be extremely power-efficient. In fact, for a single watt, a software system on commodity CPU can be improved by a factor of x100 using an FPGA, and a factor of x1000 utilizing ASIC implementations. However, this efficiency depends on the system load. To address changing workloads, we propose in-network computing on demand, where services can be dynamically moved between servers and the network. By shifting the placement of services on-demand, data centers can optimize for both performance and power efficiency.

I Sent It: Where Does Slow Data Go to Wait?

Youngbin Im
Parisa Rahimzadeh
Brett Shouse
Shinik Park
Carlee Joe-Wong
Kyunghan Lee
Sangtae Ha

Emerging applications like virtual reality (VR), augmented reality (AR), and 360-degree video aim to exploit the unprecedentedly low latencies promised by technologies like the tactile Internet and mobile 5G networks. Yet these promises are still unrealized. In order to fulfill them, it is crucial to understand where packet delays happen, which impacts protocol performance such as throughput and latency. In this work, we empirically find that sender-side protocol stack delays can cause high end-to-end latencies, though existing solutions primarily address network delays. Unfortunately, however, current latency diagnosis tools cannot even distinguish between delays on network links and delays in the end hosts. To close this gap, we present ELEMENT, a latency diagnosis framework that decomposes end-to-end TCP latency into endhost and network delays, without requiring admin privileges at the sender or receiver.

We validate that ELEMENT achieves more than 90% accuracy in delay estimation compared to the ground truth in different production networks. To demonstrate ELEMENT's potential impact on real-world applications, we implement a relatively simple user-level library that uses ELEMENT to minimize delays. For evaluation, we integrate ELEMENT with legacy TCP applications and show that it can reduce latency by up to 10 times while maintaining throughput and fairness. We finally demonstrate that ELEMENT can significantly reduce the latency of a virtual reality application that needs extremely low latencies and high throughput.

Efficient and Safe Network Updates with Suffix Causal Consistency

Sheng Liu
Theophilus A. Benson
Michael K. Reiter

Though centrally managed by a controller, a software-defined network (SDN) can still encounter routing inconsistencies among its switches due to the non-atomic updates to their forwarding tables. In this paper, we propose a new method to rectify these inconsistencies that is inspired by causal consistency, a consistency model for shared-memory systems. Applied to SDNs, causal consistency would imply that once a packet is matched to ("reads") a forwarding rule in a switch, it can be matched in downstream switches only to rules that are equally or more up-to-date. We propose and analyze a relaxed but functionally equivalent version of this property called suffix causal consistency (SCC) and evaluate an implementation of SCC in Open vSwitch and P4 switches, in conjunction with the Ryu and P4Runtime controllers. Our results show that SCC provides greater efficiency than competing consistent-update alternatives while offering consistency that is strong enough to ensure high-level routing properties (black-hole freedom, bounded looping, etc.).

TAS: TCP Acceleration as an OS Service

Antoine Kaufmann
Tim Stamler
Simon Peter
Naveen Kr. Sharma
Arvind Krishnamurthy
Thomas Anderson

As datacenter network speeds rise, an increasing fraction of server CPU cycles is consumed by TCP packet processing, in particular for remote procedure calls (RPCs). To free server CPUs from this burden, various existing approaches have attempted to mitigate these overheads, by bypassing the OS kernel, customizing the TCP stack for an application, or by offloading packet processing to dedicated hardware. In doing so, these approaches trade security, agility, or generality for efficiency. Neither trade-off is fully desirable in the fast-evolving commodity cloud.

We present TAS, TCP acceleration as a service. TAS splits the common case of TCP processing for RPCs in the datacenter from the OS kernel and executes it as a fastpath OS service on dedicated CPUs. Doing so allows us to streamline the common case, while still supporting all of the features of a stock TCP stack, including security, agility, and generality. In particular, we remove code and data of less common cases from the fastpath, improving performance on the wide, deeply pipelined CPU architecture common in today's servers. To be workload proportional, TAS dynamically allocates the appropriate amount of CPUs to accommodate the fastpath, depending on the traffic load. TAS provides up to 90% higher throughput and 57% lower tail latency than the IX kernel bypass OS for common cloud applications, such as a key-value store and a realtime analytics framework. TAS also scales to more TCP connections, providing 2.2x higher throughput than IX with 64K connections.

SESSION: Big Data

GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs

Mugilan Mariappan
Keval Vora

Efficient streaming graph processing systems leverage incremental processing by updating computed results to reflect the change in graph structure for the latest graph snapshot. Although certain monotonic path-based algorithms produce correct results by refining intermediate values via numerical comparisons, directly reusing values that were computed before mutation does not work correctly for algorithms that require BSP semantics. Since structural mutations in streaming graphs render the intermediate results unusable, exploiting incremental computation while simultaneously providing synchronous processing guarantees is challenging.

In this paper we develop GraphBolt which incrementally processes streaming graphs while guaranteeing BSP semantics. GraphBolt incorporates dependency-driven incremental processing where it first tracks dependencies to capture how intermediate values get computed, and then uses this information to incrementally propagate the impact of change across intermediate values. To support wide variety of graph-based analytics, GraphBolt provides a generalized incremental programming model that enables development of incremental versions of complex aggregations. Our evaluation shows that GraphBolt's incremental processing eliminates redundant computations and efficiently processes streaming graphs with varying mutation rates, starting from just a single edge mutation all the way up to 1 million edge mutations at a time. Furthermore, being specialized for graph computations, GraphBolt extracts high performance compared to Differential Dataflow.

Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Minjie Wang
Chien-chin Huang
Jinyang Li

This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.

Matrix Algebra Framework for Portable, Scalable and Efficient Query Engines for RDF Graphs

Fuad Jamour
Ibrahim Abdelaziz
Yuanzhao Chen
Panos Kalnis

Existing query engines for RDF graphs follow one of two design paradigms: relational or graph-based. We explore sparse matrix algebra as a third paradigm and propose MAGiQ: a framework for implementing SPARQL query engines that are portable on various hardware architectures, scalable over thousands of compute nodes, and efficient for very large RDF datasets. MAGiQ represents the RDF graph as a sparse matrix and defines a domain-specific language of algebraic operations. SPARQL queries are translated into matrix algebra programs that are oblivious to the underlying computing infrastructure. Existing matrix algebra libraries, optimized for each particular architecture, are called to execute the program and handle the performance issues. We present three case studies of matrix algebra back-end libraries: SuiteSparse, MATLAB, and CombBLAS; we demonstrate how MAGiQ can effortlessly be ported on a variety of architectures such as Intel CPUs, NVIDIA GPUs, and Cray XC40 supercomputers. Our experiments on large-scale real and synthetic datasets show that MAGiQ performs comparably to or better than existing specialized SPARQL query engines for data-intensive queries, scales to very large computing infrastructures, and handles datasets with up to 512 billion triples.

Runtime Object Lifetime Profiler for Latency Sensitive Big Data Applications

Rodrigo Bruno
Duarte Patricio
José Simão
Luis Veiga
Paulo Ferreira

Latency sensitive services such as credit-card fraud detection and website targeted advertisement rely on Big Data platforms which run on top of memory managed runtimes, such as the Java Virtual Machine (JVM). These platforms, however, suffer from unpredictable and unacceptably high pause times due to inadequate memory management decisions (e.g., allocating objects with very different lifetimes next to each other, resulting in severe memory fragmentation). This leads to frequent and long application pause times, breaking Service Level Agreements (SLAs). This problem has been previously identified, and results show that current memory management techniques are ill-suited for applications that hold in memory massive amounts of long-lived objects (which is the case for a wide spectrum of Big Data applications).

Previous works reduce such application pauses by allocating objects in off-heap, in special allocation regions/generations, or by using ultra-low latency Garbage Collectors (GC). However, all these solutions either require a combination of programmer effort and knowledge, source code access, offline profiling (with clear negative impacts on programmer's productivity), or impose a significant impact on application throughput and/or memory to reduce application pauses.

We propose ROLP, a Runtime Object Lifetime Profiler that profiles application code at runtime and helps pretenuring GC algorithms allocating objects with similar lifetimes close to each other so that the overall fragmentation, GC effort, and application pauses are reduced. ROLP is implemented for the OpenJDK 8 and was evaluated with a recently proposed open-source pretenuring collector (NG2C). Results show long tail latencies reductions of up to 51% for Lucene, 85% for GraphChi, and 69% for Cassandra. This is achieved with negligible throughput (< 6%) and memory overhead, with no programmer effort, and no source code access.

SESSION: Distributed Systems

Keeping Master Green at Scale

Sundaram Ananthanarayanan
Masoud Saeida Ardekani
Denis Haenikel
Balaji Varadarajan
Simon Soriano
Dhaval Patel
Ali-Reza Adl-Tabatabai

Giant monolithic source-code repositories are one of the fundamental pillars of the back end infrastructure in large and fast-paced software companies. The sheer volume of everyday code changes demands a reliable and efficient change management system with three uncompromisable key requirements --- always green master, high throughput, and low commit turnaround time. Green refers to a master branch that always successfully compiles and passes all build steps, the opposite being red. A broken master (red) leads to delayed feature rollouts because a faulty code commit needs to be detected and rolled backed. Additionally, a red master has a cascading effect that hampers developer productivity--- developers might face local test/build failures, or might end up working on a codebase that will eventually be rolled back.

This paper presents the design and implementation of SubmitQueue. It guarantees an always green master branch at scale: all build steps (e.g., compilation, unit tests, UI tests) successfully execute for every commit point. SubmitQueue has been in production for over a year, and can scale to thousands of daily commits to giant monolithic repositories.

Serving Mobile Apps: A Slice at a Time

Ketan Bhardwaj
Matt Saunders
Nikita Juneja
Ada Gavrilovska

End users wanting to do more and more with mobile apps has led to explosive growth in the number of available apps. This has widened the gap between developers making apps available and end users being able to install all the apps they want on their device. To address this, Google introduced Instant Apps for Android where users can access selective app features on demand without having to download and install entire apps. But this requires developers to refactor apps and limits the apps' functionality.

In this paper, we present AppSlicer -- a solution that automates the creation, delivery, execution and cleanup of lightweight app slices from native apps. With AppSlicer, app slices are created from existing native apps, without requiring any additional developer effort. App slices run on end-user devices and correspond to arbitrary single functionality (task) that is carried out using an app. We demonstrate that app slicing is practical, it provides users with seamless access to app functionality with performance matching that of native installed apps, and better than other technologies for on-demand delivery of apps.

A Lightweight Framework for Fine-Grained Lifecycle Control of Android Applications

Yuru Shao
Ruowen Wang
Xun Chen
Ahemd M. Azab
Z. Morley Mao

The lifecycle of Android apps is dynamically managed by the system in an ad hoc manner, which leads to apps' abusing lifecycle entry points to automatically start up and gaming the priority-based memory management mechanism to evade being killed. Such apps exhibit diehard behaviors that keep them long-running in the background, resulting in excessive battery consumption and device performance degradation. Existing battery-saving features are far from being effective in restricting diehard behaviors, due to the lack of systematic, fine-grained control of app lifecycle.

In this paper, we propose the Application Lifecycle Graph (ALG), a holistic modeling of system-wide app lifecycle. We present a lightweight runtime framework that builds ALG and utilizes it to realize fine-grained lifecycle control of apps. The framework exposes APIs that provide ALG information and lifecycle control capabilities to developers and device vendors, empowering them to leverage the framework to implement rich functionalities. Evaluation results show that the proposed framework is competent and incurs low performance overhead. It introduces 4.5MB additional memory usage on average, and approximately 5% and 0.2% CPU usage during system booting and at idle state.

Teaching Rigorous Distributed Systems With Efficient Model Checking

Ellis Michael
Doug Woos
Thomas Anderson
Michael D. Ernst
Zachary Tatlock

Writing correct distributed systems code is difficult, especially for novice programmers. The inherent asynchrony and need for fault-tolerance make errors almost inevitable. Industrial-strength testing and model checking have been shown to be effective at uncovering bugs, but they come at a cost --- in both time and effort --- that is far beyond what students can afford. To address this, we have developed an efficient model checking framework and visual debugger for distributed systems, with the goal of helping students find and fix bugs in near real-time. We identify two novel techniques for reducing the search state space to more efficiently find bugs in student implementations. We report our experiences using these tools to help over two hundred students build a correct, linearizable, fault-tolerant, dynamically-sharded key--value store.

SESSION: Cloud Computing

Resource Deflation: A New Approach For Transient Resource Reclamation

Prateek Sharma
Ahmed Ali-Eldin
Prashant Shenoy

Data centers and clouds are increasingly offering low-cost computational resources in the form of transient virtual machines. Whenever demand for computational resources exceeds their availability, transient resources can reclaimed by preempting the transient VMs. Conventionally, these transient VMs are used by low-priority applications that can tolerate the disruption caused by preemptions.

In this paper we propose an alternative approach for reclaiming resources, called resource deflation. Resource deflation allows applications to dynamically shrink (and expand) in response to resource pressure, instead of being preempted outright. Deflatable VMs allow applications to continue running even under resource pressure, and increase the utility of low-priority transient resources. Deflation uses a dynamic, multi-level cascading reclamation technique that allows applications, operating systems, and hypervisors to implement their own policies for handling resource pressure. For distributed data processing, machine learning, and deep neural network training, our multi-level approach reduces the performance degradation by up to 2x compared to existing preemption-based approaches. When deflatable VMs are deployed on a cluster, our policies allow up to 1.6x utilization without the risk of preemption.

GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks

Ram Srivatsa Kannan
Lavanya Subramanian
Ashwin Raju
Jeongseob Ahn
Jason Mars
Lingjia Tang

The microservice architecture has dramatically reduced user effort in adopting and maintaining servers by providing a catalog of functions as services that can be used as building blocks to construct applications. This has enabled datacenter operators to look at managing datacenter hosting microservices quite differently from traditional infrastructures. Such a paradigm shift calls for a need to rethink resource management strategies employed in such execution environments. We observe that the visibility enabled by a microservices execution framework can be exploited to achieve high throughput and resource utilization while still meeting Service Level Agreements, especially in multi-tenant execution scenarios.

In this study, we present GrandSLAm, a microservice execution framework that improves utilization of datacenters hosting microservices. GrandSLAm estimates time of completion of requests propagating through individual microservice stages within an application. It then leverages this estimate to drive a runtime system that dynamically batches and reorders requests at each microservice in a manner where individual jobs meet their respective target latency while achieving high throughput. GrandSLAm significantly increases throughput by up to 3x compared to the our baseline, without violating SLAs for a wide range of real-world AI and ML applications.

Hourglass: Leveraging Transient Resources for Time-Constrained Graph Processing in the Cloud

Pedro Joaquim
Manuel Bravo
Luís Rodrigues
Miguel Matos

This paper addresses the key problems that emerge when one attempts to use transient resources to reduce the cost of running time-constrained jobs in the cloud. Previous works fail to address these problems and are either not able to offer significant savings or miss termination deadlines. First, the fact that transient resources can be evicted, requiring the job to be re-started (even if not from scratch) may lead provisioning policies to fall-back to expensive on-demand configurations more often than desirable, or even to miss deadlines. Second, when a job is restarted, the new configuration can be different from the previous, which might make eviction recovery costly, e.g., transferring the state of graph data between the old and new configurations. We present HOURGLASS, a system that addresses these issues by combining two novel techniques: a slack-aware provisioning strategy that selects configurations considering the remaining time before the job's termination deadline, and a fast reload mechanism to quickly recover from evictions. By switching to an on-demand configuration when (but only if) the target deadline is at risk of not being met, we are able to obtain significant cost savings while always meeting the deadlines. Our results show that, unlike previous work, HOURGLASS is able to significantly reduce the operating costs in the order of 60-70% while guaranteeing that deadlines are met.

Efficient, Consistent Distributed Computation with Predictive Treaties

Tom Magrino
Jed Liu
Nate Foster
Johannes Gehrke
Andrew C. Myers

To achieve good performance, modern applications often partition their state across multiple geographically distributed nodes. While this approach reduces latency in the common case, it can be challenging for programmers to use correctly, especially in applications that require strong consistency. We introduce predictive treaties, a mechanism that can significantly reduce distributed coordination without losing strong consistency. The central insight behind our approach is that many computations can be expressed in terms of predicates over distributed state that can be partitioned and enforced locally. Predictive treaties improve on previous work by allowing the locally enforced predicates to depend on time. Intuitively, by predicting the evolution of system state, coordination can be significantly reduced compared to static approaches. We implemented predictive treaties in a distributed system that exposes them in an intuitive programming model. We evaluate performance on several benchmarks, including TPC-C, showing that predictive treaties can significantly increase performance by orders of magnitude and can even outperform customized algorithms.

SESSION: Programming Languages and Verification

Multiverse: Compiler-Assisted Management of Dynamic Variability in Low-Level System Software

Florian Rommel
Christian Dietrich
Michael Rodin
Daniel Lohmann

System software, such as the Linux kernel, typically provides a high degree of versatility by means of static and dynamic variability. While static variability can be completely resolved at compile time, dynamic variation points come at a cost arising from extra tests and branches in the control flow. Kernel developers use it (a) only sparingly and (b) try to mitigate its overhead by run-time binary code patching, for which several problem/architecture-specific mechanisms have become part of the kernel.

We think that means for highly efficient dynamic variability should be provided by the language and compiler instead and present multiverse, an extension to the C programming language and the GNU C compiler for this purpose. Multiverse is easy to apply and targets program-global configuration switches in the form of (de-)activatable features, integer-valued configurations, and rarely-changing program modes. At run time, multiverse removes the overhead of evaluating them on every invocation. Microbenchmark results from applying multiverse to performance-critical features of the Linux kernel, cPython, the musl C-library and GNU grep show that multiverse can not only replace and unify the existing mechanisms for run-time code patching, but may in some cases even improve their performance by adding new dynamic variability options.

Grapple: A Graph System for Static Finite-State Property Checking of Large-Scale Systems Code

Zhiqiang Zuo
John Thorpe
Yifei Wang
Qiuhong Pan
Shenming Lu
Kai Wang
Guoqing Harry Xu
Linzhang Wang
Xuandong Li

Many real-world bugs in large-scale systems are related to object state that is supposed to obey a specified finite state machine (FSM). They are triggered when unexpected events occur on objects in certain states, making these objects transition in a way that violates their specifications. Detecting such FSM-related bugs with static analysis is challenging, especially in distributed systems that have large codebases.

This paper presents a single-machine, disk-based graph system, called Grapple, which was designed to conduct precise and scalable checking of finite-state properties for very large codebases. Grapple detects bugs through context-sensitive, path-sensitive alias and dataflow analyses, which are both formulated as dynamic transitive-closure computations and automatically parallelized by the system. We propose a novel path constraint encoding/decoding algorithm to attach a path constraint to a graph edge, allowing the graph engine to efficiently recover a path and compute its constraint during the computation. We have implemented Grapple and conducted a comprehensive evaluation over widely deployed distributed systems. Grapple reported a total of 376 warnings, of which only 17 are false positives. Our results also demonstrate the scalability of Grapple: it took between 51 minutes and 33 hours to finish all the analyses on a low-end desktop with 16G memory and 1T SSD space, while the traditional approaches ran out of memory in all cases.

Replayable Execution Optimized for Page Sharing for a Managed Runtime Environment

Kai-Ting Amy Wang
Rayson Ho
Peng Wu

We present Replayable Execution, a system for improving the efficiency of Function-as-a-Service (FaaS) frameworks. It takes advantage of standard kernel features to reduce memory usage and accelerate cold startup speed without changes to the OS kernel, language runtimes, and the surrounding FaaS deployment environment. Replayable Execution exploits the intensive-deflated execution characteristics of the majority of target applications. It uses checkpointing to save an image of an application, allowing this image to be shared across containers and resulting in speedy restoration at service startup. We apply Replayable Execution to a representative FaaS Java framework to create a ReplayableJVM execution, which together with benefits from deterministic execution of a warmed up runtime, offers 2X memory footprint reduction, and over 10X startup time improvement.

Deferred Runtime Pipelining for contentious multicore software transactions

Shuai Mu
Sebastian Angel
Dennis Shasha

DRP is a new concurrency control protocol for software transactional memory that achieves high throughput, even for skewed workloads that exhibit high contention. DRP builds on prior works that chop transactions into pieces to expose more concurrency opportunities, but unlike these works, DRP performs no static analyses and supports arbitrary workloads. DRP achieves a high degree of concurrency across most workloads and guarantees deadlock freedom, strict serializability, and opacity. We incorporate DRP into the software transactional objects library STO [18] and find that DRP improves STO's throughput on several STAMP benchmarks by up to 3.6x. Additionally, an in-memory multicore database implemented with our modified variant of STO outperforms databases that use OCC or transaction chopping for concurrency control. Specifically, DRP achieves 6.6x higher throughput than OCC when contention is high. Compared to transaction chopping, our DRP achieves 3.3x higher throughput when contention is medium or low. Furthermore, our implementation achieves comparable performance to OCC and transaction chopping at other contention levels.

SESSION: Systems for Machine Learning

GRNN: Low-Latency and Scalable RNN Inference on GPUs

Connor Holmes
Daniel Mawhirter
Yuxiong He
Feng Yan
Bo Wu

Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signal. However, due to the complex data dependencies and limited parallelism, current inference libraries for RNNs on GPUs produce either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.

This work demonstrates the root causes of the unsatisfactory performance of existing implementations for RNN inference on GPUs from several aspects, including poor data reuse, low on-chip resource utilization, and high synchronization overhead. We systematically address these issues and develop a GPU-based RNN inference library, called GRNN, that provides low latency, high throughput, and efficient resource utilization. GRNN minimizes global memory accesses and synchronization overhead, as well as balancing on-chip resource usage through novel data reorganization, thread mapping, and performance modeling techniques. Evaluated on extensive benchmarking and real-world applications, we show that GRNN outperforms the state-of-the-art CPU inference library by up to 17.5X and state-of-the-art GPU inference libraries by up to 9X in terms of latency reduction.

Automating Dependence-Aware Parallelization of Machine Learning Training on Distributed Shared Memory

Jinliang Wei
Garth A. Gibson
Phillip B. Gibbons
Eric P. Xing

Machine learning (ML) training is commonly parallelized using data parallelism. A fundamental limitation of data parallelism is that conflicting (concurrent) parameter accesses during ML training usually diminishes or even negates the benefits provided by additional parallel compute resources. Although it is possible to avoid conflicting parameter accesses by carefully scheduling the computation, existing systems rely on programmer manual parallelization and it remains a question when such parallelization is possible.

We present Orion, a system that automatically parallelizes serial imperative ML programs on distributed shared memory. The core of Orion is a static dependence analysis mechanism that determines when dependence-preserving parallelization is effective and maps a loop computation to an optimized distributed computation schedule. Our evaluation shows that for a number of ML applications, Orion can parallelize a serial program while preserving critical dependences and thus achieve a significantly faster convergence rate than data-parallel programs and a matching convergence rate and comparable computation throughput to state-of-the-art manual parallelizations including model-parallel programs.

Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks

Soojeong Kim
Gyeong-In Yu
Hojin Park
Sungwoo Cho
Eunji Jeong
Hyeonmin Ha
Sanha Lee
Joo Seong Jeong
Byung-Gon Chun

The employment of high-performance servers and GPU accelerators for training deep neural network models have greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to assist DL researchers to train their models in a distributed manner. Although current DL frameworks scale well for image classification models, there remain opportunities for scalable distributed training on natural language processing (NLP) models. We found that current frameworks show relatively low scalability on training NLP models due to the lack of consideration to the difference in sparsity of model parameters. In this paper, we propose Parallax, a framework that optimizes data parallel training by utilizing the sparsity of model parameters. Parallax introduces a hybrid approach that combines Parameter Server and AllReduce architectures to optimize the amount of data transfer according to the sparsity. Experiments show that Parallax built atop Tensor-Flow achieves scalable training throughput on both dense and sparse models while requiring little effort from its users. Parallax achieves up to 2.8x, 6.02x speedup for NLP models than TensorFlow and Horovod with 48 GPUs, respectively. The training speed for the image classification models is equal to Horovod and 1.53x faster than TensorFlow.

Fast Distributed Deep Learning over RDMA

Jilong Xue
Youshan Miao
Cheng Chen
Ming Wu
Lintao Zhang
Lidong Zhou

Deep learning emerges as an important new resource-intensive workload and has been successfully applied in computer vision, speech, natural language processing, and so on. Distributed deep learning is becoming a necessity to cope with growing data and model sizes. Its computation is typically characterized by a simple tensor data abstraction to model multi-dimensional matrices, a dataflow graph to model computation, and iterative executions with relatively frequent synchronizations, thereby making it substantially different from Map/Reduce style distributed big data computation.

RPC, commonly used as the communication primitive, has been adopted by popular deep learning frameworks such as TensorFlow, which uses gRPC. We show that RPC is suboptimal for distributed deep learning computation, especially on an RDMA-capable network. The tensor abstraction and dataflow graph, coupled with an RDMA network, offers the opportunity to reduce the unnecessary overhead (e.g., memory copy) without sacrificing programmability and generality. In particular, from a data access point of view, a remote machine is abstracted just as a "device" on an RDMA channel, with a simple memory interface for allocating, reading, and writing memory regions. Our graph analyzer looks at both the data flow graph and the tensors to optimize memory allocation and remote data access using this interface. The result is up to 169% improvement against an RPC implementation optimized for RDMA, leading to faster convergence in the training process.

μLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization

Youngsok Kim
Joonsung Kim
Dongju Chae
Daehyun Kim
Jangwoo Kim

Emerging mobile services heavily utilize Neural Networks (NNs) to improve user experiences. Such NN-assisted services depend on fast NN execution for high responsiveness, demanding mobile devices to minimize the NN execution latency by efficiently utilizing their underlying hardware resources. To better utilize the resources, existing mobile NN frameworks either employ various CPU-friendly optimizations (e.g., vectorization, quantization) or exploit data parallelism using heterogeneous processors such as GPUs and DSPs. However, their performance is still bounded by the performance of the single target processor, so that realtime services such as voice-driven search often fail to react to user requests in time. It is obvious that this problem will become more serious with the introduction of more demanding NN-assisted services.

In this paper, we propose μLayer, a low latency on-device inference runtime which significantly improves the latency of NN-assisted services. μLayer accelerates each NN layer by simultaneously utilizing diverse heterogeneous processors on a mobile device and by performing computations using processor-friendly quantization. Two key findings motivate our work: 1) the existing frameworks are limited by single-processor performance as they execute an NN layer using only a single processor, and 2) the CPU and the GPU on the same mobile device achieve comparable computational throughput, making cooperative acceleration highly promising. First, to accelerate an NN layer using both the CPU and the GPU at the same time, μLayer employs a layer distribution mechanism which completely removes redundant computations between the processors. Next, μLayer optimizes the per-processor performance by making the processors utilize different data types that maximize their utilization. In addition, to minimize potential latency increases due to overly aggressive workload distribution, μLayer selectively increases the distribution granularity to divergent layer paths. Our experiments using representative NNs and mobile devices show that μLayer significantly improves the speed and the energy efficiency of on-device inference by up to 69.6% and 58.1%, respectively, over the state-of-the-art NN execution mechanism.