Read Me
23rd ACM Symposium on Operating Systems Principles
Poster Abstracts
Oolong: Programming Asynchronous Distributed Applications with Triggers
Existing programming frameworks provide useful abstractions for synchronous computation involving multiple rounds, for which most of its input data is examined in each round. Asynchronous computation differs from synchronous computation in that the computation does not proceed in lockstep across many rounds. Many problems are solved by efficient asynchronous computation, e.g. single-source shortest path, asynchronous PageRank, web crawling etc. Unfortunately, such computation does not fit into existing programming frameworks which enforce global synchronization. We present Oolong, a programming framework designed to address the needs of asynchronous applications using a programming abstraction called the trigger, sections of code that are invoked when data is updated.
OS design for non-cache-coherent systems
Which operating system structures are appropriate for machines with either no cache coherence, or else a number of distinct “coherence islands”? While a clear consensus on the architecture of future multicore processors has yet to emerge, it seems that machines without system-wide cache coherence are likely. In this paper we explore the tradeoffs which characterize the OS design space for such machines, including statically-partitioned cluster-on-a-chip schemes and the recently-proposed multikernel model. Specifically, we modify the open-source Barrelfish research operating system to support non-cache-coherent hardware (in particular, the Intel Single-Chip Cloud Computer). We use the resulting system to explore the benefits and costs of managing resources across the whole machine rather than within a single coherence island, and show three examples where applications such as databases and parallel applications can obtain tangible benefits from system-wide resource management: dynamically reallocating physical memory between cores or coherence islands, coordinated scheduling of some parallel workloads, and the use of a shared buffer cache between coherence islands.
Refactoring the Web Interface
Web applications are compelling because of four key properties (the "IRON" properties) exhibited by the conventional web interface (HTML, DOM, JavaScript...). Yet these properties could be implemented, and implemented better, by a different interface. We propose such an alternative interface, and argue that it can do a better job implementing the IRON properties than today's web.
Energy-Aware Programming Utilizing the SEEP Framework and Symbolic Execution
Resource-constrained devices, such as wireless sensor nodes, smart phones, laptops, and tablet computers most notably suffer from the limited amount of energy they have available. Yet, emerging battery technologies addressing this issue were unable to induct significant improvements. This is well documented by rates of growth in the various technology areas. While clock speed, data storage, and data transfer rates have seen growth rates from factor 10^3 to 10^6 during the last two decades, battery life could merely be improved by factor 10^1. At the same time, research efforts lead to energy-aware system software exploiting the available energy resources in the most efficient way. Static and dynamic optimizations for energy-aware execution have been widely explored. Though, writing energy-efficient programs in the first place has only received limited attention. To address this, we present SEEP, a framework which exploits symbolic execution and platform-specific energy profiles to provide the basis for energy-aware programming. SEEP equips developers with the necessary knowledge to take energy demand into account during the task of writing programs.
How big Hadoop clusters break in the real world
Hadoop is among today's most widely deployed "big data" systems. Cloudera is a company offering paid Hadoop services and support. This poster proposal describes lessons from examining a sample of several hundred support tickets, from February through July of 2011. We show that misconfigurations are the dominant cause of failures. We also describe some of the design "anti-patterns" and missing platform features that contribute to the problems we observed.
Lock-free Transactional Support for Distributed Data Stores
In this paper, we introduce CrSO, a client-replicated, lock-free design for transactional support in large-scale storage systems such as HBase. CrSO uses a centralized scheme and implements snapshot isolation, a property that guarantees that all read operations of a trans- action are performed on a consistent snapshot of the data store. The centralized scheme of CrSO enables a lock- free commit algorithm that prevents unreleased locks of a failed transaction from blocking others. Moreover, being client-replicated, CrSO not only does not require any modification into the data store, but also imposes a close- to-zero overhead on data servers. The experimental results show that our implementation on a simple dual-core machine can scale up to a thousand of client servers and service up to 50K transactions per second (TPS), which is multiple times larger than the maximum achieved traffic in similar data stores. Consequently, we do not expect CrSO to be a bottleneck even for current large distributed storage systems.
Naiad: The animating spirit of rivers and streams
We report on the design and implementation of the Naiad system for distributed data-parallel computation. Naiad addresses similar workloads to existing data-parallel computation platforms such as Hadoop and Dryad, but its data-model and implementation are carefully designed to permit incremental updates and thus support efficient streaming and iterative computation.
High-performance Disk Imaging With Deduplicated Storage
Infrastructures such as clouds and network testbeds need to provide a large collection of “disk images”. These disk images require significant resources to store. On the other side, these disk images need to be distributed to physical hosts and installed quickly. To satisfy both of these requirements, we design a new storage service to improve the storage efficiency for storing disk images, with a specific goal to distribute and install them quickly. “Data deduplication” is the technique that we utilize to reduce storage consumption. We also discuss several design tradeoffs we make for a fast disk image distribution. Evaluations of our new system demonstrate that dedupli- cation saves more than 60% of disk space and by keeping the overall disk imaging pipeline balanced, our new sys- tem introduces a negligible performance overhead.
To Trim or Not to Trim: Judicious TRIMing for Solid State Drives
The TRIM command is a relatively new addition to the ATA standard. As Solid State Drives (SSDs) operate differently from hard disks, there exists a discrepancy between the operating system and device's view of storage that each individually manages. The TRIM command works to eliminate this discrepancy. In this study, we first show evidence that naive use of TRIM anactually harm SSD performance. Then, we propose a policy that allows judicious use of TRIM to reap the benefits that were intended.
Encoded Protocols: Overhead Analysis on Elections
Practical distributed systems are typically built under a crash-fault model. Recently, arbitrary faults such as bit flips have been observed surprisingly often, and have disrupted large services such as Amazon S3. We present a framework for building distributed protocols which automatically improves their fault coverage by means of an encoded processing compiler. Although encoded protocols cannot withstand attacks by malicious adversaries, they can tolerate a wide variety of non-malicious arbitrary faults. This preliminary work focuses on leader election, a fundamental primitive in distributed systems. In these protocols, a bit flips can possibly violate liveness and safety properties, for instance, by never electing a leader, or by electing more than one leader at the same time. We implement two election algorithms in our framework and experimentally analyze the transformation’s overhead on CPU utilization and election time.
Vertical Caching: Web Caching for Challenged Networks
Many network links in developing regions are low-bandwidth or intermittent. In addition to the inherent connection constraints, web browsing over these networks is an extremely painful experience due to inappropriate cache design. A single cache miss during browsing may cause a stall on the order of minutes or even hours. To improve web cache performance, we propose a new model of web caching particularly tailored to low-bandwidth and intermittent environments called vertical caching. Vertical caching extends existing caching mechanisms based on URLs to aggregates of cached pages across topics. We show that vertical caching has the potential to dramatically improve performance in these settings.
Software Side Channel Attack on Memory Deduplication
Memory deduplication merges same-content memory pages and reduces the consumption of physical memory. It is effective on environments that run many virtual machines with the same operating system. Memory deduplication, however, is subject to software side channel attacks, which disclosure memory contents. It is used to reveal the existence of an application or file on another virtual machine. Such an attack takes advantage of a difference in write access times on deduplicated memory pages that are re-created by Copy-On-Write. The vulnerability of memory deduplication has showed in [EuroSec’11] by us, but the exploits were immature. In this poster we show untouched problems and refined exploits. Furthermore we show new application used the technique, which enables secret communication between virtual machines on a processor. A secret marker on memory is used for teaching existence of VM on a processor of multi-tenant cloud computing.
Tracking Behavioral Changes in Distributed Systems using Distalyzer
Distributed systems executions are complex and hard to understand. Understanding the behavior of distributed systems is useful in finding avenues for improvements, performance benefits, identification of bugs and unexpected operations. Large distributed systems are hard to study because of their complex and numerous interaction possibilities, stemming from variable network delays and node processing delays that are characteristic to heterogeneous environments. Currently, there is no technique for analyzing these executions to identify problems that affect global system metrics, thus severely hindering the maintainability of these systems. We present this work on Distalyzer which automatically analyzes large distributed systems at their repositories, and tracks updates to the software, together with monitoring of global system metrics. This will hugely help developers in understanding impacts of bug fixes, feature changes, trend of system growth and so on, thus assisting in development and maintenance.
EV: Replicating Multithreaded Servers
This poster presents EV, a new Execute-Verify architecture that allows state machine replication to scale to multi-core servers. EV departs from the traditional agree-execute architecture of state machine replication: replicas first concurrently and nondeterministically execute requests; then they verify, agree, and converge on the state and the outputs produced by a correct replica. EV minimizes divergence through a mixer stage that applies application-specific rules to organize requests into batches of requests that are unlikely to interfere. Our evaluation suggests that EV's unique ability to combine execution independence with nondeterminism enables high-performance replication for multi-core servers while offering tolerance to a wide range of faults, including elusive concurrency bugs.
Scalable Data Middleware for Smart Grids
Smart grids promise to improve the efficiency of power grid systems and reduce green house emissions through incorporating power generation from renewable sources and shaping demands to match the supply. Renewable sources include solar or wind. Power generation from these sources is affected by weather factors that can be highly fluctuating. To ensure these energy sources can be utilized efficiently, smart grid systems often shape demand through incentive to match the supply. As a result, the whole system becomes highly dynamic and requires constant adjusting. How to adjust the system can have a great impact on the efficiency and reliability of power grid systems, which offer many opportunities for innovation. In the previous work by us and other researchers, we have identified and developed several applications can be used to optimize power grid operations.
X-ray: Root-cause Diagnosis of Performance Anomalies in Production Software
Understanding and troubleshooting performance problems in complex software systems is notoriously challenging. This challenge is compounded for software in production for several reasons. To avoid slowing down production systems, analysis and troubleshooting must incur minimal overhead. Further, performance issues in production can be both rare and non-deterministic, making the issues hard to reproduce.

However, we argue that the most important reason why troubleshooting performance in production systems is challenging is that current tools only solve half the problem. Troubleshooting a performance anomaly is essentially the process of determining why certain events, such as high latency or resource usage, happened in a system. Unfortunately, most current analysis tools, such as profilers and logging, only determine what events happened during a performance anomaly — they leave the more challenging question of why those events happened unanswered. Administrators and developers must manually infer the root cause of performance issue from the observed events based upon their expertise and knowledge of the software.

This poster will describe X-ray, a tool that uses performance summarization to determine what events occurred during a performance anomaly and also why the anomaly occured. Performance summarization first attributes performance costs such as latency and I/O utilization to fine-grained events (individual instructions and system calls). Then, it uses dynamic information flow analysis to associate each such event with a set of probable root causes such as configuration settings or specific data from input requests. The cost of each event is assigned to potential root causes weighted by the probability that the particular root cause led to the execution of that event. Finally, the per-cause costs for all events in the program execution are summed together. The end result is a list of root causes ordered by their performance costs.

Arosa: testbed resource allocation using late-binding and constraints
Resource allocation is an increasing challenge for distributed network testbeds as computational and network resources are involved. In this poster, we describe an new approach to negotiate testbed resources between clients and testbed providers: the clients specify their requests as constraints, and the providers reply with resource allocations expressed also as declarative set of constraints on resources. This gives providers more flexibility in late-binding of resources to requests, and opens up a wide design space to optimize resource allocation for efficiency, cost, utilization, or other metrics. We are now in the process of applying this idea into the implementation of a resource negotiation protocol with constraint-based tickets and leases, and exploring the optimization space for ProtoGENI (more than 1000 hosts) given a trace of network requests submitted to Emulab.
Mitigating Sybil attacks on content rating systems
Online content sharing services like Digg and YouTube serve as a powerful mechanism for users to find new and interesting content. Most of these services work by (a) allowing users to vote on or rate content and (b) providing features that present highly rated content to users. However, these sites are vulnerable to Sybil attacks, where a single person creates multiple identities and votes multiple times. Because most content receives few votes, malicious votes can quickly overwhelm the honest votes, allowing advertisements and spam to appear highly rated.

In this work we investigate a novel approach to mitigating such attacks by leveraging the structure of the social network that most services also possess. We model the problem as a multi-commodity max flow problem over the social network, ensuring that a user with multiple identities can only influence a vote as much as he could as if he had a single identity. If successful, the proposal offers a scalable approach that mitigates the impact of Sybil attacks.

OS Fundamentalism: Using XOmB for fundamental OS research
This work describes the XOmB project, a novel exokernel-inspired operating system designed for building more efficient systems and building systems more efficiently. The XOmB kernel is stateless and exploits hardware virtualization to place untrusted drivers in the applications. Several years of effort have gone into its initial development, with contributions primarily by full-time undergraduate students trading their spare time for the self-education opportunity.
InkTag: Secure Applications On An Untrusted Operating System
InkTag is a novel system architecture that gives strong safety guarantees to high-assurance processes even in the presence of a malicious operating system. The InkTag hypervisor guarantees privacy, integrity and consistency (reads return the latest write) of secure address spaces, file storage, and inter-process communication. The InkTag architecture supports user-level access control that applies flexible user-defined policies to private data. We measure our prototype which runs Apache as a high-assurance process at a performance cost of about 1.7x.
Augmenting MapReduce with Active Volunteer Resources
The migration of interactive workloads, such as desktop applications, into clouds presents significant opportunities for efficiency improvements. The bursty and interactive nature of such workloads makes it challenging to aggressively consolidate them on multi-tenant systems. In such scenarios, utilizing residual or wasted CPU cycles is particularly appealing, which helps amortize the cost of physical hardware as well as improve energy efficiency.

In this work, we propose to harvest residual CPU cycles in interactive clouds by aggregating distributed, opportunistic compute capacity with batch MapReduce jobs. To accomplish this, we leverage a hybrid cluster design consisting of a small set of dedicated batch nodes supplemented by a larger pool of volunteers, which executes and gives priority to foreground interactive workloads. Further, we outline the design of an autonomous management framework, which manipulates a hybrid cluster to achieve runtime and energy performance goals under dynamic residual resource availability. We implement our cluster design on top of Apache Hadoop, and highlight our proposed techniques to cope with the diverse performancelimiting factors in this challenging environment.

Practical Hardening of Crash-Tolerant Systems
While crashes constitute the predominant failure mode in practical distributed systems, commission faults leading to incorrect behavior do happen in practice. The effects of state corruptions are hard to predict, handle, and even diagnose, so they are are often neglected by developers. While Byzantine-fault tolerant state machine replication represents a viable solution, it has not yet been adopted by the industry. We propose a novel approach to systematically and transparently hardening crash-tolerant algorithms. We propose a novel fault model, called ASC, to represent Arbitrary State Corruptions making the whole state of faulty processes transition to an arbitrary value. Corruptions can also hit the program counter and subvert the control flow of the process. We propose a formal hardening technique to make crash-tolerant algorithms ASC-tolerant, implement it in form of a Java library, and show initial evaluation on a Paxos protocol.
Modularity Meets Batching: Towards an Experimental Platform for High-speed Software Routers
Batch processing is a well-known system implementation technique that could drastically improve the performance by minimizing the per-unit data handling cost. However, applying batching in the context of a modular software router has not been popular due to inherent packet processing path diversity. Our goal is to build a high-performance modular software router platform for network researchers with benefits from massively-parallel processors (e.g., GPUs).

We start from an already high-performing prototype, PacketShader, as a well-established starting point in favor of batch processing and GPU utilization. We implement a prototype of modular architecture by reusing the packet I/O engine and applications for GPUs from PacketShader that are optimized for batching, and investigate technical challenges to catch two rabbits: the performance and the modularity.

Policy Learning for Adaptive Allocation of Cloud Resources to Multi-tier Web Applications
Dynamic resource scaling in cloud computing enables cloud providers to offer specific response time guarantees to consumers. For Web applications, it is possible in principle to achieve low operational costs by minimizing resource allocation, automatically detecting or predicting violations of service time guarantees, and dynamically scaling allocated resources as needed to handle increased loads. However, for multi-tier Web applications in particular, it is difficult to automatically identify bottlenecks and scale the appropriate resource tier. To address this issue, in this paper, we propose a system that first identifies workload patterns for a multi-tier Web application from access logs using unsupervised machine learning, reactively identifies bottlenecks for specific workload patterns, and learns optimal resource allocation policies by monitoring access logs in real time. We demonstrate the feasibility of the approach in an experimental evaluation with a EUCALYPTUS-based testbed cloud. The experimental results show that the proposed system can enable cloud providers to offer response time guarantees for multi-tier Web applications using adaptive resource allocation without any prior knowledge of the application's resource utilization or workload patterns.
The Case for Region Serializability
It is difficult to write correct multithreaded code. This difficulty is compounded by the weak memory model provided to multithreaded applications running on commodity multicore hardware, where there is not an easily understood semantics for applications containing data races. For example, the DRF0 memory model only guarantees sequential consistency to data-race free programs, and while the Java memory model does specify behavior for racy programs, these semantics are difficult to understand. Without clear semantics, it is difficult to write, test, and debug multithreaded code running on commodity multicore hardware.

While sequential consistency is often regarded as the preferred semantics for buggy lock-based programs, it only guarantees a global ordering at the granularity of instructions, which is too fine a granularity to reason about thread interleavings — especially if there are data races. To reduce the burden of multithreaded programmer, the runtime system should serialize larger regions of code to reduce the number of thread interleavings and the compiler should guarantee it does not optimize across these region boundaries. Such a system provides region serializable semantics with respect to the original source code.

Increasing the granularity of serialization benefits multithreaded programming in multiple ways: it makes it easier for programmers and software testing tools to understand multithreaded code because there are fewer interleavings and it can make software more robust in production. Moreover, these benefits extend to all multithreaded code, including those containing data races.

Three Pieces of the MapReduce Workload Management Puzzle
MapReduce has become the de facto platform for cost-effective analytics over “Big Data”. There is an increasing number of MapReduce applications associated with live business intelligence that require completion time guarantees. There is a lack of performance models and workload analysis tools for automated performance management of such MapReduce jobs. In this work, we introduce and analyze a set of complementary mechanisms that enhance workload management decisions for processing MapReduce jobs with deadlines. The three mechanisms we consider are the following: 1) a policy for job ordering in the processing queue; 2) a mechanism for allocating a tailored number of map and reduce slots to each job with a completion time requirement; 3) a mechanism for allocating and deallocating (if necessary) spare resources in the system among the active jobs. We analyze the functionality and performance benefits of each mechanism via an extensive set of simulations over diverse workload sets. The proposed mechanisms form the integral pieces in the performance puzzle of automated workload management in MapReduce environments. We implement a novel deadline-based Hadoop scheduler that integrates all these three mechanisms and provides an efficient support for serving MapReduce jobs with deadlines. The results of our simulation study are validated through experiments on a 66-node Hadoop cluster.
Multi-scale computing over heterogeneous resources
Parallel processing of large data sets on clusters of commodity machines has become an active field of research as well as a thriving business in recent years, and task-parallel frameworks have enabled programmers to harness managed large-scale parallelism while writing mostly sequential code. The systems used to run such jobs across many machines, however, essentially treat them as a homogeneous collection, partition them into many VMs at the hypervisor level or use coarse-grained partitioning into a number of "slots" per machine. We believe that modern data centres are more complex than this: heterogeneity is a reality at multiple levels and may even be introduced deliberately.

We believe that there has been no appropriate general treatment of resource heterogeneity in compute clusters thus far. In this work, we introduce the concept of independent "resource ensembles", marking a departure from the classic master-worker design of task-parallel compute frameworks, and propose a new approach to detect and characterise resource heterogeneity within and between ensembles, enabling heterogeneity-aware scheduling decisions. In explicitly supporting heterogeneity-awareness and dynamic extension of a computation across resource ensembles, we introduce the idea of "multi-scale computation" — that is, we can exploit parallelism at many scales, from fine-grained multi-core processing with communication over an on-chip network to coarse-grained clusters of virtual machines.

Macho II: Even More Macho
Despite years of work on programming languages, programming is still slow and error-prone. In this paper we describe Macho, a system which combines a natural language parser, a database of code, and an automated debugger to write simple programs from natural language and examples of their correct execution. Adding examples to natural language makes it easier for Macho to actually generate a correct program, because it can test its candidate solutions and fix simple errors. Macho is able to synthesize basic versions of six out of nine small coreutils from short natural language descriptions based on their man pages and sample runs.
CQuest: Collaborative Neighbor Discovery in Dual-Radio Mobile Social Networks
The recent surge in the use of personal mobile communication devices have opened up new avenues for communication. While most existing applications designed to exploit this potential are infrastructure based, there is a growing trend to leverage physical proximity between end-users to enable direct peer-to-peer communication. However, the success of these applications relies on the ability to efficiently detect contact opportunities. Devices that participate in such opportunistic communication often come equipped with multiple radios, especially in pocket-switched networks. For an individual node, performing neighbor discovery can be too expensive with a high-power, long-range radio (e.g., 802.11), even with duty-cycling. On the other hand, relying only on a low-power, short-range radio for detecting neighbors results in significantly fewer available contacts. To mitigate this problem, we have developed CQuest, a novel scheme for more efficient long-range neighbor discovery that leverages the clustering of nodes as well as the radio heterogeneity of mobile devices. The basic idea is that coordination over a low-power, short-range radio can help clustered nodes distribute the load of high-power, long-range scanning. We present results from a successful implementation of the protocol on a testbed of Android G1/G2 phones that shows the feasibility of the protocol in a real network.
Elastic Replication for Scalable Consistent Services
Most of the scalable and high-performance services used in datacenters today provide relaxed consistency guarantees in order to achieve good responsiveness. One reason for this is that it is believed that expensive majority-based consensus protocols are needed in order to provide strong consistency in asynchronous and partially synchronous environments such as a datacenter or the Internet.

In this extended abstract, we briefly describe our research into building a new lightweight replication protocol that does not use majority voting and yet provides strong consistency in the presence of crash faults and imperfect failure detectors.