RAIN meeting

RAIN Meeting Schedule: 06-07 Winter & Spring

Research on Algorithms for the Internet

Spring 06-07

Time: 4-5pm Fridays; Room: Gates 498.

Apr 13, 2007 Ying Xu a discussion on WWW'07 papers

Apr 20, 2007 BAGT

Apr 27, 2007 Anish Das Sarma Detecting Near-Duplicates for Web Crawling

May 4, 2007 Mayur Datar Google News Personalization

May 11, 2007 Adam Meyerson Pricing Partially Compatible Products

May 18, 2007 Sergei Vassilvitskii Click Fraud Resistant Methods for Learning Click-Through Rates

May 25, 2007 Dilys Thomas Streaming Algorithms and Worm Fingerprinting

June 1, 2007 Neil Daswani Anatomy of Clickbot.A

Winter 06-07

Time: 2-3pm Fridays; Room: Gates 498.

Jan 26, 2007 Nicolas Lambert Asymptotically Optimal Repeated Auctions

Feb 2, 2007 Hamid Nazerzadeh Mechanism design based on CPA

Feb 9, 2007 Ashish Goel Reusable and multiple goods

Feb 16, 2007 Ying Xu Estimating Search Engine Index and Web size

Feb 23, 2007 Mukund Sundararajan Is efficiency expensive?

March 2, 2007 Paul Heymann Collaborative tagging systems

March 9, 2007 Mihaela Enachescu Multi-path Routing

March 16, 2007 Aleksandra Korolova / Dilys Thomas Clickthrough of and Bidding on keywords

The Anatomy of Clickbot.A
Neil Daswani, Google

In this talk, I will present a detailed case study of the architecture of the Clickbot.A botnet that attempted a low-noise click fraud attack against syndicated search engines. The botnet of over 100,000 machines was controlled using a HTTP-based botmaster. Google identified all clicks on its ads exhibiting Clickbot.A-like patterns and marked them as invalid. We disclose the results of our investigation of this botnet to educate the security research community and provide information regarding the novelties of the attack.

Data Stream Algorithms with Applications to Automatic Worm Fingerprinting
Dilys Thomas

We will first go over a few network worms and contemporary algorithms to detect these network worms. We will then go over some recent techniques to detect these worms using "vunlerability signatures". We will then use this to review two important problems in data stream computation -- distinct value estimation and frequent element computation. We show how these algorithms can be used for automatic worm fingerprinting.

Click Fraud Resistant Methods for Learning Click-Through Rates
Sergei Vassilvitskii

Presenting the paper "Click Fraud Resistant Methods for Learning Click-Through Rates" by N. Immorlica, K. Jain, M. Mahdian, and K. Talwar from WINE 2005.

Pricing Partially Compatible Products
Adam Meyerson, UCLA

Suppose two competing companies provide roughly equivalent suites of software products. Each version of a product has a possibly distinct price and quality. Customers would like to purchase a low- price, high-quality set of products. The issue is that products made by different companies need not be fully compatible. We reduce the customer's decision process to a minimum cut, and provide approximation and hardness results for the problem of computing the company's best-response strategy in several different models. We also introduce a range of questions for further work.
These results will appear in the ACM Conference on Electronic Commerce (EC) 2007, and are joint work with David Kempe, Nainesh Solanki, and Ramnath Chellappa.

Google News Personalization: Scalable Online Collaborative Filtering
Mayur Datar, Google

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

Detecting Near-Duplicates for Web Crawling
Anish Das Sarma

This is joint work with Gurmeet Singh Manku and Arvind Jain in Google.
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.
In my talk I will address the problem of efficiently detecting near-duplicates in the web crawl setting. First I shall review Charikar's fingerprinting technique used in our approach. Then, I shall formally define the problem and present algorithmic solutions for two versions of the problem: online queries (single new document) and batch queries (multiple new documents). Finally, I shall briefly talk about our experimental analysis of the techniques.

Clickthrough of and Bidding on keywords
Aleksandra Korolova / Dilys Thomas

Aleksandra will talk about the paper "Predicting ClickThrough Rate Using Keyword Clusters" by Regelson and Fain from a Workshop on Sponsored Search Auctions. The results in the paper are mostly experimental. Dilys will briefly talk on "how to bid on keywords".

Multi-path Routing
Mihaela Enachescu

I will present an approach for enabling multi-path routing in large networks. This is based on a project at NEC Labs, and is joint work with Ravi Kokku. Our approach is to create multiple paths between various source-destination pairs by selecting a special subset of relay nodes. We give hardness results and provide some approximation algorithms.
This work is motivated by the observation that given two end-host in the Internet there exist potentially many independent or partially overlapping paths between them. However, only one of these is usually used by the underlay routing, leading to many inefficiencies (the single path routings are usually slow to adapt in the presence of fluctuating network conditions). Multi-path routing provides a more adaptive framework, and it also raises a variety of (still open) problems for our enjoyment.

Collaborative tagging systems
Paul Heymann

Collaborative tagging systems have recently become prevalent on the web. These systems are similar to classical keyword-based systems, but harness the input of many users to label large datasets like URLs on the web, videos, and photos. In my talk, I will give a brief description of tagging systems, and then talk about two recent projects. The first is work to predict tags annotating URLs based on the page text of those URLs. The outcome of this task has important implications for web search and for the usefulness of data created by social bookmarking systems. The second is an earlier (continuing) project on tag hierarchies.

Is Efficiency Expensive?
Mukund Sundararajan

We study the simultaneous optimization of efficiency and revenue in pay-per-click keyword auctions in a Bayesian setting. Our main result is that the efficient keyword auction yields near-optimal revenue even under modest competition. In the process, we build on classical results in auction theory to prove that increasing the number of bidders by the number of slots outweighs the benefit of employing the optimal reserve price.
This is joint work with Tim Roughgarden.

Estimating Search Engine Index and Web Sizes
Ying Xu

I will talk about estimating the index size of search engines, both empirical and theoretical results. The talk is based on our CIKM paper "Estimating Corpus Size via Queries" and ICALP paper "Estimating Sum by Weighted Sampling".

Reusable and multiple goods
Ashish Goel

The talk is based on
Online auctions with re-usable goods by Mohammad Hajiaghayi, Robert Kleinberg, Mohammad Mahdian, and David Parkes, EC'05
The Sequential Auction Problem on eBay: An Empirical Analysis and a Solution by Adam I. Juda and David C. Parkes, EC'06

Mechanism design based on Cost-Per-Acquisition (CPA)
Hamid Nazerzadeh

The Internet advertising has changed dramatically by moving from Cost-Per-Impression charging mechanism to Cost-Per-Click (CPC). I am going to talk about the possibility of going one step further and using CPA instead of CPC, and some of our results about this problem.

Asymptotically Optimal Repeated Auctions for Sponsored Search
Nicolas Lambert

We investigate asymptotically optimal keyword auctions, that is, auctions which maximize revenue as the number of bidders grows. We do so under two alternative behavioral assumptions. The first explicitly models the repeated nature of keyword auctions. It introduces a novel (and, we argue, plausible) assumption on individual bidding, namely that bidders never overbid their value, and bid their actual value if shut out for long enough. Under these conditions we present a broad class of repeated auctions that are asymptotically optimal among all sequential auctions (a superset of repeated auctions). Those auctions have varying payment schemes but share the ranking method. The Google auction belongs to this class, but not the Yahoo auction, and indeed we show that the latter is not asymptotically optimal. (Nonetheless, with some additional distributional assumptions, the Yahoo auction can be shown to belong to a broad category of auctions that are asymptotically optimal among all auction mechanisms that do not rely on ad relevance.) We then look at the one-shot keyword auction, which can be taken to model repeated auctions in which relatively myopic bidders converge on the equilibrium of the full-information stage game. In this case we show that the Google auction remains asymptotically optimal and the Yahoo auction suboptimal. The distributional assumptions under which our theorems hold are quite general, and arguably hold in all real-world situations of interest. We do however show that the Google auction is not asymptotically revenue maximizing for general distributions.

Apr 13, 2007	Ying Xu	a discussion on WWW'07 papers
Apr 20, 2007	BAGT
Apr 27, 2007	Anish Das Sarma	Detecting Near-Duplicates for Web Crawling
May 4, 2007	Mayur Datar	Google News Personalization
May 11, 2007	Adam Meyerson	Pricing Partially Compatible Products
May 18, 2007	Sergei Vassilvitskii	Click Fraud Resistant Methods for Learning Click-Through Rates
May 25, 2007	Dilys Thomas	Streaming Algorithms and Worm Fingerprinting
June 1, 2007	Neil Daswani	Anatomy of Clickbot.A

Jan 26, 2007	Nicolas Lambert	Asymptotically Optimal Repeated Auctions
Feb 2, 2007	Hamid Nazerzadeh	Mechanism design based on CPA
Feb 9, 2007	Ashish Goel	Reusable and multiple goods
Feb 16, 2007	Ying Xu	Estimating Search Engine Index and Web size
Feb 23, 2007	Mukund Sundararajan	Is efficiency expensive?
March 2, 2007	Paul Heymann	Collaborative tagging systems
March 9, 2007	Mihaela Enachescu	Multi-path Routing
March 16, 2007	Aleksandra Korolova / Dilys Thomas	Clickthrough of and Bidding on keywords