Brian Babcock -- Research

Brian Babcock

babcock@cs.stanford.edu

Rocket Fuel, Inc.
1 Lagoon Drive
Suite 475
Redwood City, CA 94065 USA

This is my research page. I also have a personal page.

Research Interests

My main current research interest is in algorithms for streaming data. I participated in the STREAM (STanford stREam dataA Management) research project. I am also interested in bioinformatics and approximate query processing.

I have graduated and am now working at Adchemy.

Publications

Towards a Robust Query Optimizer: A Principled and Practical Approch
Brian Babcock and Surajit Chaudhuri
Proceedings of the ACM SIGMOD 2005 Conference on Management of Data.

In this paper, we explore how the query optimization process can be made more robust, focusing on the important subproblem of cardinality estimation. The robust cardinality estimation technique that we propose allows for a user- or application-specified trade-off between performance and predictability, and it captures multi-dimensional correlations while remaining space- and time-efficient.

Load Shedding for Aggregation Queries over Data Streams
Brian Babcock, Mayur Datar, and Rajeev Motwani
Proceedings of the 20th International Conference on Data Engineering, 2004.

"Load shedding" (dropping unprocessed tuples to reduce system load) is a strategy whereby an overloaded system can achieve graceful performance degradation. Focusing on aggregation queries, we present algorithms that determine at what points in a query plan should load shedding be performed and what amount of load should be shed at each point in order to minimize the degree of inaccuracy introduced into query answers.

[pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]

Distributed Top-K Monitoring
Brian Babcock and Chris Olston
Proceedings of the ACM SIGMOD 2003 Conference on Management of Data.

We study top-k monitoring queries, queries that continuously report the k largest values obtained from distributed data streams. In our approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a user-specified error tolerance. Distributed communication is only necessary on occasion, when constraints are violated.

[pdf] | [postscript] | [dbpubs page]

Chain: Operator Scheduling for Memory Minimization in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani.
Proceedings of the ACM SIGMOD 2003 Conference on Management of Data.

When processing bursty data streams, the choice of a query operator scheduling policy can have significant impact on peak system memory usage. We present and analyze several strategies for adaptive, load-aware scheduling of operators to minimize resource consumption during times of peak load.

[pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]

Dynamic Sample Selection for Approximate Query Processing
Brian Babcock, Surajit Chaudhuri, and Gautam Das
Proceedings of the ACM SIGMOD 2003 Conference on Management of Data.

We describe an approximate query processing technique that dynamically constructs an appropriately-biased sample for each query by combining samples selected from a family of non-uniform samples that are constructed during a pre-processing phase.

[pdf] | [PowerPoint slides]

Maintaining Variance and k-Medians over Data Stream Windows
Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan
Proceedings of the 22nd ACM Symposium on Principles of Database Systems, 2003.

We present a novel technique for solving two important and related problems in the sliding window model -- maintaining variance and maintaining k-medians clustering.

[pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]

Query Processing, Approximation, and Resource Management in a Data Stream Management System
Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma
Proceedings of the 2003 Conference on Innovative Data Systems Research.

This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language, and it copes with high data rates and query workloads by providing approximate answers when resources are limited.

[pdf] | [postscript] | [dbpubs page]

Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom
Proceedings of the 21st ACM Symposium on Principles of Database Systems, 2002.

In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

[pdf] | [postscript] | [dbpubs page]

Characterizing Memory Requirements for Queries over Continuous Data Streams
Arvind Arasu, Brian Babcock, Shivnath Babu, Jon McAlister, and Jennifer Widom
Proceedings of the 21st ACM Symposium on Principles of Database Systems, 2002.

We consider conjunctive queries with arithmetic comparisons over multiple continuous data streams. We specify an algorithm for determining whether or not a query can be evaluated using a bounded amount of memory for all possible instances of the data streams. When a query can be evaluated using bounded memory, our algorithm produces an evaluation plan based on constant-sized synopses of the data streams.

[pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]

Sampling from a Moving Window Over Streaming Data
Brian Babcock, Mayur Datar, and Rajeev Motwani
Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, 2002.

We introduce the problem of sampling from a moving window of recent items from a data stream and develop the "chain-sample" and "priority-sample" algorithms for this problem.

[pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]