Distributed Systems (I)
Summarized by Steve Gribble, U.C. Berkeley


Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-Based Mail Service
Yasushi Saito, Brian N. Bershad, and Henry M. Levy (Univ. of Washington)

Yasushi, a UW student, presented this award paper. He described Porcupine, a cluster-based mail service that scales to handle very large client populations and that achieves high availability, high manageability, and graceful degradation in the face of failures. Porcupine makes extensive use of "functional homogeneity": any node in the cluster can perform any task, although tasks have affinity to certain nodes, which maintain soft-state indexes of the hard state distributed across the cluster.

Operationally, DNS round-robin is used to distribute SMTP requests across front ends; a front end consults a soft-state "user map" (replicated on each node) to determine which node manages the user. The front end contacts that node, which looks up the user in a soft-state "mail map" to determine which nodes in the cluster currently store fragments of that user's mail. If a change in the cluster is detected, a cluster-wide membership protocol is run; a side effect of this protocol is the generation of a new user map. A (parallel) scan of all disks in the cluster then generates a new mail map; this scan touches only a small portion of the on-disk state, roughly proportional to the portion of the user map redistributed after the cluster change. Porcupine also uses optimistic, eventually consistent replication: there is a small window of inconsistency after failures, but the inconsistency takes the form of extra copies of mail being stored rather than mail being lost. (However, the exact details of this inconsistency and a rigorous analysis of failure modes appeared in neither the presentation nor the paper.) Porcupine uses dynamic load balancing to decide where in the cluster new mail fragments will be stored; a "spread" constrains the number of nodes on which a single user's messages are stored, but within this (soft) constraint, the queue lengths of pending RPCs on destination nodes determine where message operations are sent.
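
To make the delivery path concrete, here is a minimal sketch in Python. The user map, mail map, spread, and pending-RPC queue lengths are the concepts from the paper; the data structures, names, and the hash function are hypothetical and only illustrate the idea, not Porcupine's actual implementation.

    import hashlib
    import random

    NODES = ["node%02d" % i for i in range(30)]   # hypothetical 30-node cluster
    USER_MAP_BUCKETS = 256                        # small fixed-size table, replicated on every node
    USER_MAP = {b: NODES[b % len(NODES)] for b in range(USER_MAP_BUCKETS)}
    MAIL_MAP = {}                                 # user -> nodes holding that user's fragments (soft state)
    SPREAD = 4                                    # soft limit on nodes per user's mailbox

    def manager_for(user):
        """Hash the user into the replicated user map to find the node that manages them."""
        bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % USER_MAP_BUCKETS
        return USER_MAP[bucket]

    def choose_storage_node(user, queue_length):
        """Pick where a new mail fragment goes: prefer nodes already holding the
        user's fragments (the spread), then take the shortest pending-RPC queue."""
        current = MAIL_MAP.setdefault(user, [])
        candidates = list(current)
        if len(candidates) < SPREAD:
            others = [n for n in NODES if n not in current]
            candidates += random.sample(others, SPREAD - len(candidates))
        target = min(candidates, key=queue_length)
        if target not in current:
            current.append(target)
        return target

    # Example: route a message for "alice" given each node's pending-RPC count.
    loads = {n: random.randint(0, 20) for n in NODES}
    print(manager_for("alice"), choose_storage_node("alice", loads.get))

The point of the sketch is that both lookups hit only replicated soft state, so any front end can route any request, and the hard state on disk is consulted only to rebuild the maps after a membership change.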

Measurements show that Porcupine's message throughput scales linearly with cluster size (the measurements go up to a 30-node cluster). With no replication, Porcupine handles about 800 messages/second on a 30-node cluster. With replication (and NVRAM to absorb log writes), Porcupine handles 400 messages/second. By comparison, sendmail+popd handles about 250 messages/second on the same cluster. Measurements also show that Porcupine handles heterogeneity in the cluster and in the workload very well: with a spread of 4 storage nodes per user, if 10% of the nodes are made faster, Porcupine gets a 25% increase in throughput.

Questions

Greg Nelson, Compaq SRC:  Q: Your performance graph showing message throughput during recovery suggests that your reconfiguration algorithm works well for 30 nodes, but how does that scale to larger cluster sizes? A: We only had 30 nodes, so we couldn't explicitly measure larger clusters. The most expensive step in reconfiguration is the disk scan, which takes 30 millisec across all nodes, and that should be independent of cluster size. So it should be just as fast for larger clusters.

Jeanna Neefe Matthews, UC Berkeley:  Q: You talked about spread, and also about not wanting voodoo constants, but it seems like spread is a voodoo constant. Is there a formula for calculating spread based on some parameters? A: The optimal spread depends on file-system algorithms, buffering strategies, etc., but the basic rule is that a spread of 2 is pretty much optimal heuristically, given a heterogeneous cluster. Q: Why is that fundamental, and not a voodoo parameter? A: Well, you only have a small number of values to choose from, so it's not really voodoo; not many choices make sense.

Gaurav Banga, Network Appliance:  Q: A detail question: on the graph that simulated replication, you got performance to increase by 100 messages/sec by using NVRAM. Did you simulate the fact that accessing NVRAM is slower than memory? Would that bring replicated performance back down? A: No, we didn't take that into account.

Andy Tanenbaum, Vrije Universiteit:  Q: You're barking up the wrong tree. One giant mail server for all the world is fundamentally wrong. The proper way is truly distributed: the sender sends mail directly to the destination user's machine, so mail is distributed all over the world; that is the correct way to make it scale worldwide. A: First, AOL has a giant mail cluster and stores users' messages in the AOL infrastructure, so they need a scalable mail server. Second, you can benefit from centralization, such as being able to access your mail from anywhere in the world. Plus, you now have a homogeneous service architecture.

Jay Lepreau, University of Utah:  Q: Users increasingly do want central mail systems, but several aspects of your architecture don't work well for that. Your workload is POP-like (i.e., retrieve all mail all of the time, deleting it from the server). Consider the Yahoo workload instead, where all mail is always kept on the servers. In that case, your spread will increase, and you will need to migrate fragments together. Also, what about mailing lists? You don't want to replicate mail for all the users on a list. A: You're right. That's all future work.

Mohit Aron, Rice University:  Q: Let's talk about load balancing in your system. Each node keeps track of all load in the system. What about oscillations and such? What if everybody gangs up on the least-loaded node? A: Add randomization. You can compute several candidate nodes and choose probabilistically among the least loaded of them, which spreads the load rather than dumping it all on one node.
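
A minimal sketch of the randomized tweak described in the answer, assuming the "sample a few candidates, take the least loaded" interpretation; the function name, sample size, and load table are hypothetical, not from the paper:

    import random

    def pick_node(load_table, sample_size=3):
        """load_table: node -> observed pending-RPC queue length (possibly stale).
        Sampling a few candidates at random before taking the minimum keeps
        senders with identical stale views from all picking the same node."""
        candidates = random.sample(list(load_table), min(sample_size, len(load_table)))
        return min(candidates, key=load_table.get)

    # Example: two equally idle nodes; different senders still spread out
    # because each draws a different random sample.
    loads = {"n1": 12, "n2": 3, "n3": 3, "n4": 7}
    print(pick_node(loads))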

Mike Rogers, University of Kentucky:  Q: You balance load at the granularity of a message. Are there advantages to making this granularity smaller (i.e., if you divide up messages)? How much effort would that be? A: We haven't thought about it, but I can't see a specific advantage. Our assumption is that mail comes from all over in a highly concurrent/parallel fashion, so spreading individual messages doesn't seem like it would help. Maybe it would if you get a gigantic message.


On the Scale and Performance of Cooperative Web Proxy Caching
Alec Wolman, Geoffrey M. Voelker, Nitin Sharma, Neal Cardwell, Anna Karlin, and Henry M. Levy (Univ. of Washington)

In this presentation, Alec, a UW student, politely explained why large-scale cooperative caching is a dead-end research area. The authors gathered simultaneous traces of the entire University of Washington campus (UW) and of Microsoft Corporation's Redmond site (MS), and used these traces to present a detailed analysis of cache hit rates under various population sizes, numbers of organizations, and cachability assumptions. The UW trace contained 7 days of traffic and 82.8 million requests, generated from 200 identifiable organizations within the campus (e.g., the music department). The average ideal hit rate (ignoring HTTP caching rules) local to an organization was 43%; adding a "perfect" cooperative cache across all organizations would increase the ideal hit rate to 69%. Including the HTTP caching rules decreases these numbers to 21% per organization and 41% for a perfect cooperative cache. However, this perfect cache serves only 22,984 clients and could easily be run on a single machine serving the entire campus. Extrapolating to larger client populations, hit rates increase only logarithmically with population, up to the maximum cachability "ceilings" of 100% for an ideal cache and 61% for an HTTP-obedient cache. A city-wide cooperative cache (i.e., one serving around 1,000,000 clients) would hit these cachability ceilings, meaning that there is no point in having cooperative caches at any larger scale.
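
A back-of-the-envelope sketch of that scaling argument: if the hit rate grows roughly logarithmically with client population but can never exceed the cachability ceiling, the curve flattens long before city scale. The constants below are hypothetical, chosen only to roughly mimic the shape of the reported numbers (about 21% per organization, 41% campus-wide, 61% ceiling for an HTTP-obedient cache), not taken from the paper's model:

    import math

    CEILING = 0.61          # fraction of requests that are cachable at all
    A, B = 0.02, 0.039      # hypothetical fit: hit_rate ~ A + B * ln(population)

    def hit_rate(population):
        return min(CEILING, A + B * math.log(population))

    for n in (100, 23_000, 1_000_000, 10_000_000):
        print(f"{n:>10,} clients -> {hit_rate(n):.0%}")

Under these assumed constants, going from one organization to the whole campus buys roughly 20 percentage points, while going from the campus to a million clients buys only another 15 or so before the ceiling takes over, which is the paper's point.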

In a second part of their study, the authors explored the benefits of cooperative caching across two large organizations by combining the UW and MS traces at a hypothetical shared cooperative cache. The MS trace (with 60,223 clients) increases the UW population by a factor of 3.6, but increases the UW hit rates by only 5.1% (for an ideal cache) and 4.2% (for an HTTP-obedient cache). The authors also show that there is a negligible effect on user-perceived document delivery latency. From these results, they conclude that without significant workload changes, scalable cooperative caching schemes are essentially irrelevant. The speaker alluded to a similar result obtained from a detailed analytical model; these results are in the paper.

Questions

Jochen Liedtke, Universität Karlsruhe:  Q: Assume you partitioned your traces not by department, but in some different way. Would you get the same results? Is there any difference when changing the populations of organizations, not just the population of all clients? A: We derived our organizations from departments. However, we also compared our organizations against random groupings: hit rates were 5-10% higher with the UW organizations than with random organizations. So it makes a slight difference, but not a huge one.

Karin Peterson, Xerox PARC:  Q: Have you run these traces at times significantly far apart? Is the amount of personalization that is creeping in making caching even worse? A: We've been tracing for about a year. Most of the high-level characteristics have been constant over the year. The most relevant characteristic is the rate of overall uncachability. That rate hasn't changed in the past year, even though request rates have roughly doubled.

Amin Vahdat, Duke University:  Q: Where do your misses come from? Is this a fair summary: about 1/2 of the requests from each person at UW aren't shared with anyone else at UW? A: The distribution of client requests to servers is very skewed: 50% of all requests go to the 800 most popular servers, even though there are over 200,000 servers in the trace. Q: Are the interests of people across your organizations so disparate that it's tough to get overlap? A: Some sites are very popular, but there is a very long tail.

David Cheriton, Stanford University:  Q: I hate to defend cooperative caching, but: I thought that people wanted to use this for organizations on the scale of Saudi Arabia and Western Europe, where the cost of a cache miss is 100x the cost of getting the page from a neighbor (due to poor connectivity to the USA). Doesn't this situation make it more beneficial to do cooperative caching? A: Proxy caching is of course beneficial. However, cooperative caching is also expensive in such environments. Q: I assume you're well-connected within Saudi Arabia, but leaving is expensive. This is exactly what you want for cooperative caching. A: We've shown the improvement in hit rate you get from cooperative caching. You should do your own cost/benefit analysis. If you think cooperative caching is free, then fine, but if it's not free, it's not a good idea.

Preston Crow, EMC Corporation:  Q: Clearly you showed some improved hit rate, and you made it concrete from the user's perspective with latency, but who cares about the users? ISPs care about bandwidth savings. A: True! In the paper, we do have graphs that show the corresponding savings in bandwidth utilization. Q: Images are more cachable, and images are larger on average. Can we conclude, therefore, that larger things are more cachable? A: I don't remember the details, but I don't believe we saw a correlation between size and cachability.

Ram Rajamony, IBM Austin Research Lab:  Q: What would happen if you compared UW to another university? In other words, what if you compared across similar organizations? A: Microsoft is less diverse than UW. For example, the hit rate for Microsoft is a bit higher than the hit rate for UW. Q: Would you get better cooperative caching with Rice and Houston than with UW and Microsoft? A: Once you're on the flat part of the hit rate curve, it just doesn't matter!