Interactive and Approximate Analytics


Needletail

NeedleTail: A System for Browsing

Analysts performing data exploration often browse; i.e., pose a query and then examine the details of a small number of the resulting records (independent of the size of the query result). In a typical session, analysts will start with one browsing query, examine a few of the resulting records, and then repeatedly issue new browsing queries by adding or removing predicates from their previous queries until they eventually gain a better understanding of the dataset. Unfortunately, traditional database systems are not engineered towards browsing: instead, these systems operate in an all-or-nothing manner, taking as long as it takes to return the entire set of results, however large it may be. To this end, we have developed NEEDLETAIL, a database system tailored towards an alternative database query interaction paradigm: browsing. NEEDLETAIL makes efficient use of memory to store special bitmap indexes (called “swift indexes”) that enable rapid retrieval of a small number of query result records. A key optimization challenge is ensuring that these indexes respect memory constraints while imposing as little additional retrieval overhead as possible.


Needletail

Synthesizing Queries

Many novice data analysts have trouble understanding and crafting queries in database query languages like SQL. Our goal is to simplify querying for such analysts, by guiding them to the query that they have in mind without much effort. In our ICDT’10 paper, given that an analyst has selected a subset of tuples of interest, we developed algorithms that automatically generate the most succinct and accurate query that captures that subset. We examined this problem as a tug-of-war between three forces: Query family (e.g., Conjunctive queries), Approximation (e.g., I'm fine with capturing 80% of the right tuples), Succintness (e.g., Give me a query that is the smallest) and tried to evaluate the complexity of solving the problem. In our VLDB’11 paper, we designed algorithms for interacting with the analyst (by asking as few questions as possible) to identify the predicates of interest.


Needletail

Expensive Query Optimization

User Defined Function(UDFs) are used increasingly to augment query languages with extra, application dependent functionality. UDFs tend to be expensive, either in terms of monetary cost or latency. We studied ways to efficiently evaluate selection queries involving UDFs. We provided a family of techniques for processing queries at low cost while satisfying user-specified precision and recall constraints. Our techniques are applicable to a wide variety of scenarios, such as when selection probabilities of tuples are available beforehand, when this information is available but noisy, or when no such prior information is available. We also generalize our techniques to more complex queries. On real datasets, our techniques achieve significant savings in cost of up to 80%, while incurring only a small reduction in accuracy.