Summarized by Jeanna Neefe Matthews, U.C. Berkeley
Werner Vogels, Cornell
Werner Vogels started the session with an entertaining talk describing his study of file system usage in Windows NT 4.0. He had data on 237 million trace records from 45 NT workstations--plus cartoons, a top-ten list, and cookies!
Vogels began by explaining the goals of his study. He wanted to provide a new data point in the tradition of the 1985 BSD and 1991 Sprite file system tracing studies, on which much of the file system research of the last decade has been based. He pointed out that many of the measurements presented in these earlier studies are not statistically significant. Therefore, a major motivation for his work was to provide rigorous statistical analysis of file system trace data. Vogels also wanted to study the interactions of components within the Windows NT file I/O system, including the use of the important--but virtually undocumented--"FastIO path." He mentioned that, unlike the authors of earlier file system tracing studies, he began his analysis with nothing in particular he was trying to prove.
Vogels then administered a little quiz, giving out cookies for the best answers (or at least for the best answers from people close enough to throw cookies to). He stumped the audience with questions like, "What is the most active directory in an NT file system?" and, "What is the cache read-ahead size set by NTFS?" He concluded that everyone really needed to read the paper since he heard so many wrong answers.
The majority of the talk was structured as a list of top ten observations of NT file system usage (à la David Letterman). Some interesting facts included: i) The file size distribution on various NT file systems looks remarkably similar because executables, DLLs, and fonts are huge and dominate the distribution. ii) 95% of changes to the file system are in the WWW cache and those occur mainly in the user profile. iii) 90% of files are open for less than 10 seconds for data transfer and less than 20 milliseconds for control operations. iv) 74% of opens are for control operations versus only 26% for data access. v) 80% of newly created files are deleted within 4 seconds.
Throughout his talk, Vogels returned to two major points: the extremely heavy-tailed distribution and high variance of almost every measurement taken, and the high variability among the systems traced. As a result, he concluded that it is quite misleading to combine all the trace data and treat it as characteristic of a typical NT workload. He hypothesized that one cause of this high variability may be that, unlike for UNIX, Windows applications are developed by many people of widely varying skill levels. He gave several examples of file access patterns that can only be described as extreme inefficiencies in the application.
Werner mentioned his intention to make his traces publicly available, but said that he was not yet ready to do so.
Bill Bolosky, Microsoft Research, recalled Vogels' data point that three-quarters of files are overwritten within 0.7 milliseconds after close and pointed out that this is an order of magnitude faster than a disk can seek. Bolosky asked Vogels to comment on what is causing this strange effect (and, hopefully, to advise how to stop it). Vogels hypothesized that it might be intermediate files from a compiler, or other data written for consistency and then removed. Bolosky emphasized that it could not be for recovery purposes, since the data is obviously not making it to disk.
Margo Seltzer, Harvard University, challenged Vogels' claim that the earlier tracing studies at Berkeley had specific goals they wanted to prove. She then asked Vogels what his data suggest about the right way to build next generation file systems. Vogels said that in his observations, the NT file system actually performed quite well (e.g., cache was never full so there was no need for new caching policies). He mentioned the possibility of predicting what a given process will do next or categorizing how specific file types are typically used to optimize their handling. He said that in general, however, he had no specific advice and no particular agenda at this point.
Douglas S. Santry, Michael J. Feeley, and Norman C. Hutchinson (University of British Columbia), Alistair C. Veitch (HP Labs), and Ross W. Carton and Jacob Ofir (University of British Columbia)
Mike Feeley gave the presentation on the Elephant File System. He explained that modern file systems focus on protecting against data loss due to system or media failure. However, they stop short of protecting users from themselves! Accidental deletes or overwrites due to user or application errors always cause immediate data loss. Feeley argued that this limitation is simply an artifact of limited storage capacity and that current technology trends indicate an opportunity to remedy the situation. Specifically, disk capacity is rising at approximately 60% per year; 25-30 GB disks are already available and at the same rate 250-350 GB disks will be available in 5 years. Given these trends, Feeley concluded that disks are now large enough to store some old versions, but are still not large enough to maintain everything indefinitely. Therefore, the challenge for the Elephant File System was to decide when data can be safely deleted.
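The capacity projection above is simple compound growth. A quick check of the arithmetic, assuming the 60% annual growth rate quoted in the talk:

```python
# Project disk capacity under compound annual growth.
def project_capacity(start_gb: float, annual_growth: float, years: int) -> float:
    """Capacity after `years` of compound growth at `annual_growth` per year."""
    return start_gb * (1 + annual_growth) ** years

# The talk's figures: 25-30 GB disks today, 60% growth per year, 5 years out.
low = project_capacity(25, 0.60, 5)   # ~262 GB
high = project_capacity(30, 0.60, 5)  # ~315 GB
print(f"{low:.0f}-{high:.0f} GB")     # close to the 250-350 GB range cited
```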
The key idea in Elephant is for the system, rather than the user, to decide when storage can be reclaimed. Old versions which are sufficiently important to retain must be distinguished from old versions which can be regenerated or which have become virtually indistinguishable from other retained versions.
Elephant has two distinct strategies for versioning. First, to provide near-term reversibility, the system retains all versions for a limited time. Second, the system identifies landmark versions to retain indefinitely. Elephant uses heuristics to automatically identify the landmark versions of user-managed files, based on the observations that (1) files tend to be edited in a bursty fashion and (2) a user's ability to differentiate between two versions degrades over time. To exploit these tendencies, Elephant observes the time line of edits to a file, identifies "barrages" of edits, and retains the version before the beginning of the barrage as a landmark version. Then, over time, the system increases the minimum time between distinct barrages.
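The landmark heuristic can be sketched roughly as follows. This is a simplification; the base gap and its growth schedule here are illustrative assumptions, not Elephant's actual parameters:

```python
from datetime import datetime, timedelta

def landmark_versions(edit_times, now, base_gap=timedelta(minutes=10)):
    """Pick landmark versions from a sorted list of edit timestamps.

    Edits separated by less than the gap threshold belong to one
    "barrage"; the version at the start of each barrage is a landmark.
    The threshold grows with age, so older barrages merge together
    and fewer old versions are retained.
    """
    landmarks = []
    prev = None
    for t in edit_times:
        age = now - t
        # Illustrative schedule: the required gap doubles for every
        # week of age, coalescing older barrages into larger ones.
        gap = base_gap * (2 ** (age.days // 7))
        if prev is None or t - prev >= gap:
            landmarks.append(t)  # start of a new barrage
        prev = t
    return landmarks
```

With this sketch, a burst of edits a month ago collapses to a single landmark, while a recent edit separated from the burst still gets its own.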
Feeley reported the authors' experience with a prototype version implemented as a VFS in FreeBSD 2.2.8. This prototype stored 15 GB of data and was used for development and document preparation by 12 active researchers at HP Labs. Data on file type distribution and write traffic distribution were collected from the prototype in this environment. They classified files into five categories: source, documents, derived, archive, and temporary. For the derived and temporary files, no older versions were retained (keep-one). For the archive files, all versions were retained until they reached a certain age (keep-safe). For the source files, document files, and all unclassified files, the system retained landmark versions as identified by the heuristics described above (keep-landmark). They found that the files designated keep-landmark accounted for 62.4% of the files but only 15.2% of the bytes in the file system. Files designated keep-safe represented 3.9% of the files and 28.5% of the bytes. Perhaps more interestingly, they found that almost all of the write traffic (98.7%) was destined for files designated keep-one, so only 1.3% of the write traffic needed to participate in any version control. This clearly demonstrates the importance of per-file version control. By another metric, Feeley said that a 30-day history computed at whole-file-system granularity would require 3.4 GB, while a 30-day history in Elephant required only 0.042 GB.
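The per-file policies could be pictured as a simple classification by file type; the extension lists below are illustrative guesses, not the prototype's actual rules:

```python
import os

# Map file extensions to Elephant's retention policies.
# The extension lists are illustrative assumptions.
POLICY_BY_EXT = {
    ".c": "keep-landmark", ".h": "keep-landmark", ".tex": "keep-landmark",
    ".o": "keep-one", ".tmp": "keep-one",          # derived / temporary
    ".tar": "keep-safe", ".zip": "keep-safe",      # archive
}

def retention_policy(filename: str) -> str:
    ext = os.path.splitext(filename)[1].lower()
    # Unclassified files default to keep-landmark, as in the prototype.
    return POLICY_BY_EXT.get(ext, "keep-landmark")

print(retention_policy("paper.tex"))   # keep-landmark
print(retention_policy("build.o"))     # keep-one
```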
Greg Nelson, Compaq Research, said that he liked this paper a lot. However, he said that he would like to be able to access the old versions in more complicated ways. For example, a user might remember that 5 years ago he or she had processed a certain environment variable switch but not know the right file name or the right date. Feeley replied that in his study questions like that didn't come up, because the system was only used for a few weeks. However, he agreed that Elephant could be viewed as a historical database and that it would be good to support queries by attributes other than just filename and time.
Mohit Aron, Rice, asked why they didn't implement Elephant as a front end to a version control system at user level. Feeley answered that they wanted it to have the same interface as a file system. In addition, the performance would not have been as good. For example, they couldn't have done copy-on-write at user level.
David Mazières, Michael Kaminsky, M. Frans Kaashoek, Emmett Witchel (MIT)
Summarized by Jon Howell, Dartmouth College
Mazières presented a secure remote file system called SFS, built around the concept of self-certifying pathnames. The goal of the project was to allow users to access and share files globally while placing a low load on system administrators. To that end, the group took cues from the success of the Web by ensuring that anyone can create an SFS server, any user can access any server from any client, and any server can reference any other server.
Mazières framed the main contribution of SFS as separating key management from the file system. In SFS, public keys are embedded in every remote file path ("self-certifying pathname"); file system clients verify that the contacted server indeed holds the corresponding private key before trusting the server. Key management in the context of SFS, then, simply involves arranging for users to access the correct self-certifying pathnames. Key management may be manual, perhaps by installing on the client a human-readable symbolic link pointing to a self-certifying pathname. Alternately, key management may be more sophisticated, using shared secrets or certification authorities to retrieve self-certifying pathnames.
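A self-certifying pathname binds a server's name to a digest of its public key in the path itself. The following is a minimal sketch of the idea; the real SFS HostID construction and its encoding differ in detail, so the hash recipe below is an illustrative stand-in:

```python
import hashlib

def host_id(location: str, public_key: bytes) -> str:
    """Illustrative HostID: a hash binding a server name to its public key."""
    digest = hashlib.sha1(location.encode() + public_key).hexdigest()
    return digest[:32]

def self_certifying_path(location: str, public_key: bytes) -> str:
    # SFS paths take the form /sfs/Location:HostID. A client recomputes
    # the hash from the key the server presents and refuses to proceed
    # on a mismatch -- no out-of-band key distribution is needed.
    return f"/sfs/{location}:{host_id(location, public_key)}"

print(self_certifying_path("sfs.example.edu", b"---fake-public-key---"))
```

Because the HostID is a deterministic function of the key, anyone who hands a user a correct pathname has, in effect, done the key management.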
Jochen Liedtke of Universität Karlsruhe pointed out that since SFS is independent of key management, it is also independent of revocation, and thus can hold only a cache of public keys. If a public key changes (due to revocation), how can one find it again? Mazières suggested that the key management agent might install a symlink binding the old HostID (the root of the self-certifying path that securely identifies a server) to the new one. Liedtke then asked what one should do if the secure hash on which the HostID depends is proven weak. Mazières' response: "You'd better upgrade your software."
Karin Petersen from Xerox PARC asked how to distribute a new key: if keys are contained in links, you would have to modify each link to point to a path containing the new key. Mazières replied that this was true; perhaps there would be multiple links pointing to the same server, and users could find a link that had been updated and follow that one.
Cliff Neuman of the University of Southern California asked about location transparency. As data moves, its name changes, and so does its key. Mazières replied that we already have many names for one file: symlinks. Most users will use symlinks; when the data moves, change those. Neuman asked about pathnames embedded in files, and Mazières replied, "Yeah, change those files."
Neuman also asked if it was possible to have a policy for name remapping that depends on where you are accessing the file from (for locality). Mazières replied that all clients are configured identically, but that the user agent has all user-specific state, and is forwarded site-to-site to follow the user.