Web Extraction

At a research internship at Kosmix (since acquired by Walmart) with Anand Rajaraman, I looked at the problem of generating concepts, i.e., the entities, items and ideas that users are interested in and searching for. The difficulty is that users typically do not type only their concept of choice into the search bar; they also add extra terms or other concepts. For example, a person searching for “Martin Scorsese” might type in “Martin Scorsese Departed”. We used ideas from association rule mining to decide whether a sequence of n words represents a concept, relative to the (n+1)-word sequences that contain it or the (n-1)-word sequences contained in it. We proved some theoretically desirable properties of our approach and experimentally demonstrated its effectiveness on a dataset of query logs.

The next step in this work was to find where to attach an extracted concept in a topic hierarchy. Since this task is not easily done by computers, we looked to use limited human involvement to assist the search. Finding the immediate parent of a new concept is reducible to a search problem on graphs.

Web pages that are script- or template-based prove invaluable for extracting concept metadata. For instance, it is easy to ask humans to annotate a few web pages and learn a web wrapper that extracts all metadata from a script-based website such as Yelp, Amazon, eBay and so on. (For example, restaurant phone numbers may be extracted from Yelp.) However, these web pages change often, and the wrappers learnt for them may no longer extract correct data. At a research internship at Yahoo! Research Bangalore, I looked at the wrapper maintenance and management problem with Rajeev Rastogi, Director, Yahoo! Labs Bangalore, as well as Nilesh Dalvi, Yahoo! Research, Sunnyvale.
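To make the wrapper breakage problem concrete, here is a minimal sketch in Python. It is only an illustration, not the algorithms from this work: the page markup, the two XPath expressions and the `extract` helper are all hypothetical. A wrapper that relies on element positions stops working after a small template change, while one anchored on an attribute keeps extracting the phone number.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing page (not real Yelp markup).
PAGE_V1 = """
<html><body>
  <div class="biz">
    <h1>Luigi's</h1>
    <span class="phone">555-0100</span>
  </div>
</body></html>
"""

# The same listing after an assumed template change: a banner is added
# and the listing is nested one level deeper.
PAGE_V2 = """
<html><body>
  <div class="banner">Sale!</div>
  <div class="main">
    <div class="biz">
      <h1>Luigi's</h1>
      <span class="phone">555-0100</span>
    </div>
  </div>
</body></html>
"""

# Two candidate wrappers for the phone number (both are illustrative).
BRITTLE = "./body/div[1]/span"        # depends on element positions
ROBUST = ".//span[@class='phone']"    # depends only on the class attribute


def extract(page, xpath):
    """Apply a wrapper (an XPath expression) and return the matched text, or None."""
    root = ET.fromstring(page.strip())
    node = root.find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "brittle:", extract(page, BRITTLE), "robust:", extract(page, ROBUST))
```

In this toy setting the positional wrapper returns nothing on the changed page, whereas the attribute-based one survives; choosing wrappers that survive likely changes is the kind of robustness the maintenance problem is concerned with.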
We designed efficient algorithms that output theoretically optimal robust wrappers for two different change models, and found that these wrappers perform orders of magnitude better than existing wrappers in terms of fault tolerance.

I also looked at problems in debugging large information extraction pipelines and in building better classifiers for entity resolution, both using crowdsourcing in an efficient and optimized manner.

Relevant tech reports / publications: