Web Caching Early Papers
This thesis examines the problem of cache organization for very large-scale distributed file systems (DFSs). Conventional DFSs, based on the client-server model, suffer from bottlenecks when the total client load exceeds the server's capacity. Previous work has suggested that hierarchical client organizations can ameliorate the problem somewhat, but at the expense of a substantial increase in client latency. An analysis of existing DFS workloads reveals that there is considerable regularity in client file access patterns and that widely shared files lend themselves especially well to caching techniques. In particular, a large proportion of "cache miss" traffic is for files that are already copied in another client's cache. If clients can share these cached files, the server's load can be reduced by a potentially large margin, making larger-scale systems possible. We introduce the notion of dynamic hierarchical caching, in which adaptive client hierarchies are constructed on a file-by-file basis. Trace-driven simulation and workload-driven runs of a prototype file system suggest that dynamic hierarchies can reduce server load substantially without the client performance penalties associated with more static schemes.
Most Distributed File Systems (DFSs) are based on a flat client-server model in which each client interacts directly with the file server for all file operations. While this model works well for relatively small systems in which the file server has adequate capacity for all its clients, it does not scale to large numbers of clients or systems in which the clients are connected to the server through low-bandwidth links. Server traffic can be reduced substantially if clients keep even a modest-sized cache of previously read files. Intuitively, the benefits of caching can be increased by organizing clients into a hierarchy, in which only a small number of machines communicate directly with the file server, providing intermediate caching services to machines below them in the hierarchy. While this potentially reduces server traffic for widely shared files, it can introduce a significant delay for clients low in the hierarchy for access to files with a low degree of sharing. This paper describes a simple method for constructing dynamic hierarchies on a file-by-file basis. The results of a trace-driven simulation of a dynamic hierarchical filesystem are presented, yielding a reduction in server traffic of a factor of more than two for shared files compared with a flat scheme and without a large increase in client access time. An algorithm to maintain cache consistency with low overhead by detecting missed cache invalidation messages is given.
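The idea of building a hierarchy per file, rather than fixing one hierarchy for all files, can be illustrated with a minimal sketch. This is not the paper's algorithm. The names `Server`, `Client`, `pick_parent` and `max_direct` are hypothetical, and the policy of redirecting a reader to the existing cache holder with the fewest children is an assumption chosen for brevity. Cache invalidation and eviction are left out entirely.

```python
class Server:
    """Hypothetical file server that tracks, per file, which clients hold a copy."""

    def __init__(self, files, max_direct=1):
        self.files = dict(files)
        self.max_direct = max_direct   # clients allowed to read each file directly
        self.direct_reads = 0          # file transfers served by the server itself
        self.holders = {}              # path -> clients with a cached copy

    def pick_parent(self, path):
        """Return a client to read from, or None if the server should serve the file."""
        holders = self.holders.setdefault(path, [])
        if len(holders) < self.max_direct:
            return None
        # Assumed policy: spread load by choosing the holder with the fewest children.
        return min(holders, key=lambda c: c.children.get(path, 0))

    def fetch(self, path):
        self.direct_reads += 1
        return self.files[path]

    def register(self, path, client):
        self.holders.setdefault(path, []).append(client)


class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.cache = {}      # path -> file contents
        self.children = {}   # path -> number of clients that read this file from us

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        parent = self.server.pick_parent(path)
        if parent is None:
            data = self.server.fetch(path)
        else:
            # Widely shared file: copy it from another client's cache.
            data = parent.cache[path]
            parent.children[path] = parent.children.get(path, 0) + 1
        self.cache[path] = data
        self.server.register(path, self)
        return data
```

In this sketch a file read by a single client is still served directly by the server, so unshared files pay no extra hop, while each additional reader of a shared file is served from another client's cache. The server still answers a small lookup for every first read, but it transfers the file contents only once.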
A WWW proxy server, proxy for short, provides access to the Web for people on closed subnets who can only access the Internet through a firewall machine. The hypertext server developed at CERN, cern_httpd, is capable of running as a proxy, providing seamless external access to HTTP, Gopher, WAIS and FTP. cern_httpd has had gateway features for a long time, but only this spring were they extended to support all the methods in the HTTP protocol used by WWW clients. Clients don't lose any functionality by going through a proxy, except special processing they may have done for non-native Web protocols such as Gopher and FTP. A brand new feature is caching performed by the proxy, resulting in shorter response times after the first document fetch. This makes proxies useful even to people who do have full Internet access and don't really need the proxy just to get out of their local subnet. This paper gives an overview of proxies and reports their current status.
This thesis describes some software designed to improve access to World-Wide Web (WWW) data on the global Internet. The tools used for retrieving WWW objects allow users to be unaware of where the data actually resides. Huge inefficiencies result when objects are repeatedly transmitted across relatively slow wide area network (WAN) connections. A solution to this problem is to install object caches at strategic places in the network. Caches are implemented on proxy servers, which act as intermediaries between local clients and remote servers. Frequently accessed objects will already be in the cache, thereby speeding delivery to clients and reducing WAN traffic.
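The behavior of an object cache on a proxy can be sketched as follows. This is an illustration only, not the software described in the thesis. The class name `CachingProxy`, the `fetch_from_origin` callback and the least-recently-used eviction policy are all assumptions, and the origin fetch is passed in as a plain function so the sketch needs no network access. Expiry and revalidation of stale objects are not modeled.

```python
from collections import OrderedDict


class CachingProxy:
    """Hypothetical proxy that answers repeated requests from a local object cache."""

    def __init__(self, fetch_from_origin, capacity=100):
        self.fetch_from_origin = fetch_from_origin  # callable: url -> object body
        self.capacity = capacity                    # maximum number of cached objects
        self.store = OrderedDict()                  # url -> body, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            # Cache hit: no WAN transfer; mark the object as recently used.
            self.store.move_to_end(url)
            self.hits += 1
            return self.store[url]
        # Cache miss: fetch across the WAN and keep a copy for later requests.
        self.misses += 1
        body = self.fetch_from_origin(url)
        self.store[url] = body
        if len(self.store) > self.capacity:
            # Assumed policy: evict the least recently used object.
            self.store.popitem(last=False)
        return body
```

Every hit corresponds to one transfer that did not cross the WAN link, so the ratio of `hits` to total requests gives a rough measure of the traffic saved.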