Crowd-Powered Data Management

sCOOP: The Stanford–Santa Cruz Project for Cooperative Computing with Algorithms, Data and People

News

  • (Sep 2013) New Technical Report on Filtering Generalizations released.

  • (May 2013) New Technical Report on Crowd-Powered Search released.

  • (Aug 2012) New Technical Reports on Confidence, Crowd-powered Finding, and Deco Query Processing released.

  • (Jun 2012) New Technical Report on Identifying Reliable Workers released.

  • (Feb 2012) New Technical Reports on the Deco System demo and Entity Resolution released.

  • (Dec 2011) New Technical Report on Human-powered Debugging of Lineage in Large Data Pipelines released.

  • (Nov 2011) New Technical Report on Human-powered Top-1 Computation released.

  • (Nov 2011) New Technical Report on the Deco system released.

  • (Sep 2011) New Technical Report on Filtering Data with Humans released.

  • (Jun 2011) We presented the data model for our system, called Deco (for Declarative Crowdsourcing), outlined some Query Processing challenges, and surveyed our work on Crowd Algorithms at a Crowdsourcing event at UC Berkeley.

  • (Jan 2011) We presented the vision for sCOOP at CIDR 2011.

Overview

Many tasks are performed more easily and accurately by people than by current computer algorithms: understanding and analyzing images, video, text, and speech, as well as handling subjective opinions and abstract concepts.

Due to the proliferation of cheap and reliable internet connectivity, a large number of people are now online and willing to answer questions for monetary gain. There are a number of human computation (a.k.a. crowdsourcing) marketplaces  —  Mechanical Turk, oDesk, LiveOps, and others  —  that enable workers to find tasks easily.

sCOOP is a project whose broad theme is to leverage people as processing units, much like computer processes or subroutines, to achieve some global objective. A primary focus of sCOOP is to optimize this computation: while there may be many ways to orchestrate a particular task, our goal is to use as few resources (e.g., time, money) as possible while obtaining results that are as good as or better than those of unoptimized computation.

We are approaching the problem of orchestrating computing tasks that involve people from two (not necessarily mutually exclusive) directions:

Optimizing Crowd Algorithms

Here, the goal is to optimize some fundamental data processing algorithms where the unit operations are performed by people. Examples of algorithms include: sorting, clustering, classification, and categorization.

Over the last year, we worked on algorithms for max, filtering, graph search, and lineage debugging. In the Max problem, the goal is to find the best item in a given set (e.g., photos, videos, or songs), subject to a budget on the number of pairwise comparisons that may be asked of humans. In the Filtering problem, we want to determine which items in a data set satisfy a given set of properties (each verifiable by humans); the goal is a cost-optimal filtering strategy under constraints on error and time. We also considered human-assisted graph search, which applies in many domains that can exploit human intelligence, including curation of hierarchies, image segmentation and categorization, interactive search, and filter synthesis; we studied this problem along several dimensions: fixed versus unlimited budget, different graph structures, and a single versus multiple "target" nodes. Most recently, we considered using expert human input to debug data provenance.
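To make the Max problem concrete, here is a minimal sketch (not our published algorithm; see the "So Who Won?" paper for the actual techniques) of a single-elimination tournament where each pairwise comparison stands in for a question posed to a human worker, and the process stops once the comparison budget is exhausted. The function names are invented for this illustration.

```python
import random

def crowd_max(items, compare, budget):
    """Tournament-style sketch of the Max problem.

    `compare(a, b)` stands in for a human worker's pairwise judgment
    and returns the preferred item. If the comparison budget runs out
    before a single winner emerges, an arbitrary surviving candidate
    is returned.
    """
    candidates = list(items)
    used = 0
    while len(candidates) > 1 and used < budget:
        random.shuffle(candidates)  # pair items up randomly each round
        next_round = []
        for i in range(0, len(candidates) - 1, 2):
            if used >= budget:
                # Budget exhausted mid-round: carry the rest forward.
                next_round.extend(candidates[i:])
                break
            next_round.append(compare(candidates[i], candidates[i + 1]))
            used += 1
        else:
            if len(candidates) % 2 == 1:
                next_round.append(candidates[-1])  # odd item gets a bye
        candidates = next_round
    return candidates[0]
```

With a truthful comparator and a budget of at least n-1 comparisons, the tournament returns the true maximum; the interesting algorithmic questions arise precisely when workers err and budgets are tight, which is what the cited papers address.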

The Deco System: Declarative Querying of Humans, Algorithms and Databases

Here, the goal is to combine human and algorithmic computation with traditional database operations in order to perform complex tasks. This combination involves several optimization objectives: minimizing total elapsed time, minimizing the monetary cost to perform human computation (minimizing the number of questions and pricing them accordingly), and maximizing confidence in the obtained answers.

Our proposed approach views the crowdsourcing service as another database whose facts are computed by human processors. By promoting the crowdsourcing service to a first-class citizen on the same level as extensional data, we can write declarative queries that seamlessly combine information from both. The system becomes responsible for optimizing the order in which tuples are processed, the order in which tasks are scheduled, whether tasks are handled by algorithms or by the crowdsourcing service, the pricing of crowd tasks, and the seamless transfer of information between the database system and the external services. Moreover, it provides built-in mechanisms for handling uncertainty, so that the developer can explicitly control the quality of query results. This declarative approach facilitates the development of complex applications that combine knowledge from human computation, algorithmic computation, and stored data.
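The Deco paper defines the actual data model and query language; purely as an illustration of the on-demand pattern described above, here is a hypothetical Python sketch in which a query over stored tuples triggers a simulated crowd fetch only for attribute values the query actually needs. All names and data here are invented for the sketch.

```python
def ask_crowd(restaurant):
    # Stand-in for posting a task to a marketplace such as
    # Mechanical Turk and collecting a worker's answer.
    canned = {"Tofu Palace": "vegetarian", "Burger Barn": "american"}
    return canned.get(restaurant, "unknown")

# Stored ("raw") data: the cuisine attribute may be missing (None).
restaurants = [
    {"name": "Tofu Palace", "cuisine": None},
    {"name": "Burger Barn", "cuisine": "american"},
]

def query_cuisine(rows, wanted):
    """SELECT name WHERE cuisine = wanted, invoking the crowd
    only for values that are missing when the query needs them."""
    results = []
    for row in rows:
        if row["cuisine"] is None:  # fetch rule fires on demand
            row["cuisine"] = ask_crowd(row["name"])
        if row["cuisine"] == wanted:
            results.append(row["name"])
    return results
```

For example, `query_cuisine(restaurants, "vegetarian")` asks the (simulated) crowd about Tofu Palace only, leaving already-known values untouched; a real system must additionally handle pricing, scheduling, and uncertain or conflicting worker answers, as discussed above.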

Our current design and details of our initial prototype can be found in the Deco paper.

Talks

  • Data-Centric Human Computation  —  Jennifer's Overview Talk about the sCOOP Project: talk

  • Active Sampling for Entity Matching  —  This talk was given at KDD 2012: talk

  • CrowdScreen: Algorithms for Filtering Data with Humans  —  This talk was given at SIGMOD 2012: talk

  • Human-assisted Graph Search: It's Okay to Ask Questions  —  This talk was given at VLDB 2011: talk

  • Deco Data Model, Query Processing and Crowd Algorithms Talks  —  These talks were given at the Crowd-Crowd event at UC Berkeley on June 6, 2011:
    Data Model Talk
    Query Processing Talk
    Crowd Algorithms Talk

  • Answering Queries using Humans, Algorithms and Databases  —  This is the vision talk for sCOOP given at CIDR’11: talk

Relevant tech reports / publications:

  1. Optimal Crowd-Powered Rating and Filtering Algorithms, pdf
    Aditya Parameswaran, Stephen Boyd, Hector Garcia-Molina, Ashish Gupta, Neoklis Polyzotis, and Jennifer Widom
    40th International Conf. on Very Large Data Bases (VLDB), Hangzhou, China, Sep 2014

  2. DataSift: A Crowd-Powered Search Toolkit (Demo), pdf
    Aditya Parameswaran, Ming Han Teh, Hector Garcia-Molina and Jennifer Widom
    SIGMOD International Conf. on Management of Data, Snowbird, Utah, USA, Jun 2014

  3. Crowd-Powered Find Algorithms, pdf
    Anish Das Sarma, Aditya Parameswaran, Hector Garcia-Molina and Alon Halevy
    30th International Conf. on Data Engineering (ICDE), Chicago, USA, Apr 2014

  4. Finish Them!: Pricing Algorithms for Human Computation, pdf
    Yihan Gao and Aditya Parameswaran
    Technical Report, March 2014

  5. Comprehensive and Reliable Crowd Assessment Algorithms, pdf
    Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran
    Technical Report, March 2014

  6. An Expressive and Accurate Crowd-Powered Search Toolkit, pdf
    Aditya Parameswaran, Ming Han Teh, Hector Garcia-Molina and Jennifer Widom
    1st Conf. on Human Computation and Crowdsourcing (HCOMP), Palm Springs, USA, Nov 2013

  7. Human-Powered Data Management, pdf
    Aditya Parameswaran
    Doctoral Dissertation, Stanford University, Sep 2013
    (Winner of Stanford University's Arthur Samuel Best Thesis Award 2013-14)

  8. Active Sampling for Entity Matching with Guarantees, pdf
    Kedar Bellare, Suresh Iyengar, Aditya Parameswaran and Vibhor Rastogi
    ACM Transactions on Knowledge Discovery from Data (TKDD) - Special Issue on ACM SIGKDD 2012
    Volume 7(3), September 2013

  9. Evaluating the Crowd with Confidence, pdf
    Manas Joglekar, Hector Garcia-Molina and Aditya Parameswaran
    19th International Conf. on Knowledge Discovery and Data Mining (KDD), Chicago, USA, Aug 2013

  10. An Overview of the Deco System: Data Model and Query Language; Query Processing and Optimization, pdf
    Hyunjung Park, Richard Pang, Aditya Parameswaran, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom
    SIGMOD Record, Volume 41, Dec 2012

  11. Human-Powered Debugging of Large Data Pipelines, pdf
    Nilesh Dalvi, Aditya Parameswaran and Vibhor Rastogi
    26th Annual Conf. on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, USA, Dec 2012

  12. Deco: Declarative Crowdsourcing, pdf
    Aditya Parameswaran, Hyunjung Park, Hector Garcia-Molina, Neoklis Polyzotis and Jennifer Widom
    21st International Conf. on Information and Knowledge Management (CIKM), Maui, Hawaii, USA, Nov 2012

  13. Deco: A System for Declarative Crowdsourcing (Demo), pdf
    Hyunjung Park, Richard Pang, Aditya Parameswaran, Hector Garcia-Molina, Neoklis Polyzotis and Jennifer Widom
    38th International Conf. on Very Large Data Bases (VLDB), Istanbul, Turkey, Sep 2012

  14. Query Processing over Crowdsourced Data, pdf
    Hyunjung Park, Aditya Parameswaran and Jennifer Widom
    Infolab Technical Report, Aug 2012

  15. Active Sampling for Entity Matching, pdf talk
    Kedar Bellare, Suresh Iyengar, Aditya Parameswaran and Vibhor Rastogi
    18th International Conf. on Knowledge Discovery and Data Mining (KDD), Beijing, China, Aug 2012
    (Invited to: Special Issue of TKDD Journal for KDD 2012 Best Papers.)

  16. Identifying Reliable Workers Swiftly, pdf
    Aditya Ramesh, Aditya Parameswaran, Hector Garcia-Molina and Neoklis Polyzotis
    Infolab Technical Report, Jun 2012

  17. So Who Won? Dynamic Max Discovery with the Crowd, pdf
    Stephen Guo, Aditya Parameswaran and Hector Garcia-Molina
    SIGMOD International Conf. on Management of Data, Scottsdale, Arizona, USA, Jun 2012

  18. CrowdScreen: Algorithms for Filtering Data with Humans, pdf talk
    Aditya Parameswaran, Hector Garcia-Molina, Hyunjung Park, Neoklis Polyzotis, Aditya Ramesh and Jennifer Widom
    SIGMOD International Conf. on Management of Data, Scottsdale, Arizona, USA, Jun 2012

  19. Human-assisted Graph Search: It's Okay to Ask Questions, pdf talk
    Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis Polyzotis and Jennifer Widom
    37th International Conf. on Very Large Data Bases (VLDB), Seattle, USA, Sep 2011

  20. Answering Queries using Humans, Algorithms and Databases, pdf pptx
    Aditya Parameswaran and Neoklis Polyzotis
    Conference on Innovative Database Research (CIDR), Asilomar, USA, Jan 2011