Despite the growth of the Internet, dissemination of popular data to a large number of users over a network remains an expensive proposition. The expense comes from the data server repeatedly sending a separate copy of a requested piece of data to every user client that needs it. Consequently, a popular server becomes a victim of its own success.
Instead, a data server could use its network's multicast capability to send each packet of data only once, addressed so that it reaches an entire group of clients. Though successful delivery is not guaranteed, this network feature can still dramatically reduce the server's network consumption.
Multicast is already available natively on the Internet's multicast backbone (MBone), Internet2, and native IPv6 networks. Multicast is supported on modern operating systems including Windows XP, Mac OS X, FreeBSD, GNU/Linux, and modern commercial flavors of Unix.
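As a rough illustration of what sending to a group looks like from the server's side, the following minimal sketch uses standard UDP sockets over IPv4. The group address, port, and function names are assumptions chosen for illustration; they are not taken from our multicast facility.

```python
import socket

# Hypothetical values for illustration only: an administratively scoped
# IPv4 multicast group and an arbitrary UDP port.
GROUP = "239.1.2.3"
PORT = 5007


def make_multicast_sender(ttl=1):
    """Create a UDP socket whose multicast packets travel at most `ttl` hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A TTL of 1 keeps multicast traffic on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def send_to_group(sock, payload, group=GROUP, port=PORT):
    """Send one datagram to the group; every subscribed client may receive it.

    UDP gives no delivery guarantee, so any reliability must be layered on top.
    """
    return sock.sendto(payload, (group, port))
```

A caller would create the socket once with `make_multicast_sender()` and then call `send_to_group(sock, b"...")` for each packet. The server transmits one copy regardless of how many clients have joined the group.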
In work with Professor Hector Garcia-Molina, we study the challenges in designing and building such a multicast data server. We must make it efficient, fast, reliable, and scalable for a variety of clients.
How should the server use its network connection to order the transmission of a large number of data requests? Clients' requests may vary in size, and may coincide with other clients' requests to varying degrees. Given a stream of client requests, the server must decide which of its requested data items to send next.
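To make the scheduling question concrete, here is a small sketch of one possible heuristic: send next the item that satisfies the most waiting clients per byte transmitted. This is an assumption chosen for illustration, not necessarily the policy studied in the dissertation, and the names `next_item` and `drain` are hypothetical.

```python
from collections import defaultdict


def next_item(pending, sizes):
    """Pick the item with the most waiting clients per byte (illustrative heuristic).

    pending: dict mapping client name -> set of item names still needed
    sizes:   dict mapping item name -> size in bytes
    Returns None when no client is waiting for anything.
    """
    waiting = defaultdict(set)
    for client, items in pending.items():
        for item in items:
            waiting[item].add(client)
    if not waiting:
        return None
    # Iterating in sorted order makes ties resolve to the alphabetically first item.
    return max(sorted(waiting), key=lambda item: len(waiting[item]) / sizes[item])


def drain(pending, sizes):
    """Return the order in which items would be multicast until all requests are met."""
    pending = {client: set(items) for client, items in pending.items()}
    order = []
    while True:
        item = next_item(pending, sizes)
        if item is None:
            break
        order.append(item)
        # One multicast transmission satisfies every client waiting on this item.
        for items in pending.values():
            items.discard(item)
    return order
```

With three clients where all of them want a 10-byte item `x`, one also wants a 10-byte item `y`, and another also wants a 5-byte item `z`, this heuristic sends `x` first, then `z`, then `y`. The sketch assumes loss-free delivery; handling lossy clients is a separate question.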
How should the server cope with its clients' very different network connections? Some clients may have more reliable (less loss-prone) connections than others; some clients may have higher-throughput connections from the server than others. The server must optimize performance for such heterogeneous clients while ensuring that every client will still receive the entirety of the data that it requests.
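One simple way to guarantee completeness despite differing loss rates is to keep retransmitting whatever some client still lacks, in the style of a data carousel. The simulation below is a sketch under that assumption, with hypothetical names and made-up loss rates; it is not the scheme developed in this work, and it ignores differences in throughput.

```python
import random


def carousel(num_packets, loss_rates, seed=0, max_rounds=1000):
    """Simulate repeated multicast rounds until every client holds every packet.

    loss_rates: dict mapping client name -> probability that a packet is lost
    Returns (rounds_used, missing), where missing maps each client to the set
    of packet numbers it still lacks (all empty on success).
    """
    rng = random.Random(seed)
    missing = {client: set(range(num_packets)) for client in loss_rates}
    rounds = 0
    while any(missing.values()) and rounds < max_rounds:
        rounds += 1
        # Resend only packets that at least one client still lacks.
        needed = sorted(set().union(*missing.values()))
        for packet in needed:
            for client, rate in loss_rates.items():
                # A client receives the packet unless its link drops it.
                if packet in missing[client] and rng.random() >= rate:
                    missing[client].discard(packet)
    return rounds, missing
```

Calling `carousel(20, {"fast": 0.0, "lossy": 0.3})` shows the cost of heterogeneity: the loss-free client finishes in the first round, while the server keeps transmitting for the lossy one.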
An in-development version of a file-sharing multicast facility is available upon request.
The dissertation was submitted to Stanford University on 24 Sep 2004, and is available online.
The Stanford WebBase is a World Wide Web hypertext repository designed to aid research and analysis. This project aims to develop a large-scale repository that cleanly and effectively supports a variety of research on World Wide Web pages, while conserving disk and main memory usage. This repository seeks to allow the building of new feature indices on Web pages, the graph analysis of a large slice of the Web, and the flexible multicast distribution of Web data (and computational workload).
Implementation work with Taher Haveliwala, Sriram Raghavan, Gary Wesley.
Crawler originally by Junghoo Cho, with work from Pranav Kantawala.
Initial multicast-tree code from Ashish Goel, Kameshwar Munagala.
A part of the Digital Library project: Professor Hector Garcia-Molina, Dr. Andreas Paepcke.
For information about this project, including how to get Web data from, and the software for, our repository, please see the project's main page.
The Stanford WebBase crawler must crawl the Web scalably, quickly, easily, and efficiently, while minimizing load on the Web servers being crawled. Users must be able to tap the WebBase distribution infrastructure easily to fetch and process large volumes of Web data for experiments or analysis. As an example, WebBase's local indexing uses the distribution facility, just as outside users would, to create compact indexes from Web crawls. We describe how we implement WebBase to meet the requirements above, and measure the performance and scalability of the resulting running system.
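One common way for a crawler to limit the load it places on any single Web server is to enforce a minimum delay between successive requests to the same host. The sketch below illustrates that idea only; the function name and the 10-second default are assumptions, and this is not a description of the WebBase crawler's actual policy.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, delay_seconds=10.0):
    """Assign each URL an earliest start time so no host is hit too often.

    Returns a list of (start_time_seconds, url) pairs sorted by start time.
    Requests to different hosts may share a start time; requests to the same
    host are spaced at least `delay_seconds` apart.
    """
    next_free = {}  # host -> earliest time the next request may start
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    # The sort is stable, so URLs with equal start times keep their input order.
    plan.sort(key=lambda entry: entry[0])
    return plan
```

Given two URLs on one host and one on another, the plan fetches one page from each host at time 0 and defers the second page on the first host until the delay has elapsed.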