This is my research page. I also have a personal page. Research InterestsMy main current research interest is in algorithms for streaming data. I participated in the STREAM (STanford stREam dataA Management) research project. I am also interested in bioinformatics and approximate query processing. I have graduated and am now working at Adchemy. Publications
Towards a Robust Query Optimizer: A Principled and Practical
Approch
In this paper, we explore how the query optimization process can be made more robust, focusing on the important subproblem of cardinality estimation. The robust cardinality estimation technique that we propose allows for a user- or application-specified trade-off between performance and predictability, and it captures multi-dimensional correlations while remaining space- and time-efficient.
Load Shedding for Aggregation Queries over Data Streams
"Load shedding" (dropping unprocessed tuples to reduce system load) is a strategy whereby an overloaded system can achieve graceful performance degradation. Focusing on aggregation queries, we present algorithms that determine at what points in a query plan should load shedding be performed and what amount of load should be shed at each point in order to minimize the degree of inaccuracy introduced into query answers. [pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]
Distributed Top-K Monitoring
We study top-k monitoring queries, queries that continuously report the k largest values obtained from distributed data streams. In our approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a user-specified error tolerance. Distributed communication is only necessary on occasion, when constraints are violated. [pdf] | [postscript] | [dbpubs page]
Chain: Operator Scheduling for Memory Minimization in Data Stream
Systems
When processing bursty data streams, the choice of a query operator scheduling policy can have significant impact on peak system memory usage. We present and analyze several strategies for adaptive, load-aware scheduling of operators to minimize resource consumption during times of peak load. [pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]
Dynamic Sample Selection for Approximate Query Processing
We describe an approximate query processing technique that dynamically constructs an appropriately-biased sample for each query by combining samples selected from a family of non-uniform samples that are constructed during a pre-processing phase.
Maintaining Variance and k-Medians over Data Stream Windows
We present a novel technique for solving two important and related problems in the sliding window model -- maintaining variance and maintaining k-medians clustering. [pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]
Query Processing, Approximation, and Resource Management in a
Data Stream Management System
This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language, and it copes with high data rates and query workloads by providing approximate answers when resources are limited. [pdf] | [postscript] | [dbpubs page]
Models and Issues in Data Stream Systems
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues. [pdf] | [postscript] | [dbpubs page]
Characterizing Memory Requirements for Queries over Continuous
Data Streams
We consider conjunctive queries with arithmetic comparisons over multiple continuous data streams. We specify an algorithm for determining whether or not a query can be evaluated using a bounded amount of memory for all possible instances of the data streams. When a query can be evaluated using bounded memory, our algorithm produces an evaluation plan based on constant-sized synopses of the data streams. [pdf] | [postscript] | [PowerPoint slides] | [dbpubs page]
Sampling from a Moving Window Over Streaming Data
We introduce the problem of sampling from a moving window of recent items from a data stream and develop the "chain-sample" and "priority-sample" algorithms for this problem. |