Research on Algorithms for the Internet
Spring 06-07
Time: 4-5pm Fridays; Room: Gates 498.
| Apr 13, 2007 | Ying Xu | a discussion on WWW'07 papers |
| Apr 20, 2007 | BAGT | |
| Apr 27, 2007 | Anish Das Sarma | Detecting Near-Duplicates for Web Crawling |
| May 4, 2007 | Mayur Datar | Google News Personalization |
| May 11, 2007 | Adam Meyerson | Pricing Partially Compatible Products |
| May 18, 2007 | Sergei Vassilvitskii | Click Fraud Resistant Methods for Learning Click-Through Rates |
| May 25, 2007 | Dilys Thomas | Streaming Algorithms and Worm Fingerprinting |
| June 1, 2007 | Neil Daswani | Anatomy of Clickbot.A |
Winter 06-07
Time: 2-3pm Fridays; Room: Gates 498.
| Jan 26, 2007 | Nicolas Lambert | Asymptotically Optimal Repeated Auctions |
| Feb 2, 2007 | Hamid Nazerzadeh | Mechanism design based on CPA |
| Feb 9, 2007 | Ashish Goel | Reusable and multiple goods |
| Feb 16, 2007 | Ying Xu | Estimating Search Engine Index and Web size |
| Feb 23, 2007 | Mukund Sundararajan | Is efficiency expensive? |
| March 2, 2007 | Paul Heymann | Collaborative tagging systems |
| March 9, 2007 | Mihaela Enachescu | Multi-path Routing |
| March 16, 2007 | Aleksandra Korolova / Dilys Thomas | Clickthrough of and Bidding on keywords |
The Anatomy of Clickbot.A
Neil Daswani,
Google
In this talk, I will present a detailed case study of the architecture of the Clickbot.A botnet that attempted a low-noise click fraud attack against syndicated search engines. The botnet of over 100,000 machines was controlled using a HTTP-based botmaster. Google identified all clicks on its ads exhibiting Clickbot.A-like patterns and marked them as invalid. We disclose the results of our investigation of this botnet to educate the security research community and provide information regarding the novelties of the attack.
Data Stream Algorithms with Applications to Automatic Worm
Fingerprinting
Dilys Thomas
We will first go over a few network worms and contemporary algorithms to detect these network worms. We will then go over some recent techniques to detect these worms using "vunlerability signatures". We will then use this to review two important problems in data stream computation -- distinct value estimation and frequent element computation. We show how these algorithms can be used for automatic worm fingerprinting.
Click Fraud Resistant Methods for Learning Click-Through
Rates
Sergei Vassilvitskii
Presenting the paper "Click Fraud Resistant Methods for Learning Click-Through Rates" by N. Immorlica, K. Jain, M. Mahdian, and K. Talwar from WINE 2005.
Pricing Partially Compatible Products
Adam Meyerson, UCLA
Suppose two competing companies provide roughly equivalent
suites of software products. Each version of a product has a possibly
distinct price and quality. Customers would like to purchase a low-
price, high-quality set of products. The issue is that products made
by different companies need not be fully compatible. We reduce the
customer's decision process to a minimum cut, and provide
approximation and hardness results for the problem of computing the
company's best-response strategy in several different models. We also
introduce a range of questions for further work.
These results will appear in the ACM Conference on Electronic
Commerce (EC) 2007, and are joint work with David Kempe, Nainesh
Solanki, and Ramnath Chellappa.
Google News Personalization: Scalable Online Collaborative
Filtering
Mayur Datar, Google
Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.
Detecting Near-Duplicates for Web Crawling
Anish Das Sarma
This is joint work with Gurmeet Singh Manku and Arvind Jain in Google.
Near-duplicate web documents are abundant. Two such documents
differ from each other in a very small portion that displays
advertisements, for example. Such differences are irrelevant for
web search. So the quality of a web crawler increases if it can
assess whether a newly crawled web page is a near-duplicate of a
previously crawled web page or not.
In my talk I will address the problem of efficiently detecting
near-duplicates in the web crawl setting. First I shall review
Charikar's fingerprinting technique used in our approach. Then, I
shall formally define the problem and present algorithmic solutions
for two versions of the problem: online queries (single new
document) and batch queries (multiple new documents). Finally, I
shall briefly talk about our experimental analysis of the techniques.
Clickthrough of and Bidding on keywords
Aleksandra Korolova / Dilys Thomas
Aleksandra will talk about the paper "Predicting ClickThrough Rate Using Keyword Clusters" by Regelson and Fain from a Workshop on Sponsored Search Auctions. The results in the paper are mostly experimental. Dilys will briefly talk on "how to bid on keywords".
Multi-path Routing
Mihaela Enachescu
I will present an approach for enabling multi-path routing in large
networks. This is based on a project at NEC Labs, and is joint work
with Ravi Kokku. Our approach is to create multiple paths between
various source-destination pairs by selecting a special subset of
relay nodes. We give hardness results and provide some approximation
algorithms.
This work is motivated by the observation that given two end-host in
the Internet there exist potentially many independent or partially
overlapping paths between them. However, only one of these is usually
used by the underlay routing, leading to many inefficiencies (the
single path routings are usually slow to adapt in the presence of
fluctuating network conditions). Multi-path routing provides a more
adaptive framework, and it also raises a variety of (still open)
problems for our enjoyment.
Collaborative tagging systems
Paul Heymann
Collaborative tagging systems have recently become prevalent on the web. These systems are similar to classical keyword-based systems, but harness the input of many users to label large datasets like URLs on the web, videos, and photos. In my talk, I will give a brief description of tagging systems, and then talk about two recent projects. The first is work to predict tags annotating URLs based on the page text of those URLs. The outcome of this task has important implications for web search and for the usefulness of data created by social bookmarking systems. The second is an earlier (continuing) project on tag hierarchies.
Is Efficiency Expensive?
Mukund Sundararajan
We study the simultaneous optimization of efficiency and revenue in
pay-per-click keyword auctions in a Bayesian setting. Our main
result is
that the efficient keyword auction yields near-optimal revenue
even under
modest competition. In the process, we build on classical results in
auction theory to prove that increasing the number of bidders by the
number
of slots outweighs the benefit of employing the optimal reserve
price.
This is joint work with Tim Roughgarden.
Estimating Search Engine Index and Web Sizes
Ying Xu
I will talk about estimating the index size of search engines, both empirical and theoretical results. The talk is based on our CIKM paper "Estimating Corpus Size via Queries" and ICALP paper "Estimating Sum by Weighted Sampling".
Reusable and multiple goods
Ashish Goel
The talk is based on
Online auctions with re-usable goods
by Mohammad Hajiaghayi, Robert Kleinberg, Mohammad Mahdian, and David
Parkes, EC'05
The Sequential Auction Problem on eBay: An Empirical Analysis and a
Solution by Adam I. Juda and David C. Parkes, EC'06
Mechanism design based on Cost-Per-Acquisition (CPA)
Hamid Nazerzadeh
The Internet advertising has changed dramatically by moving from Cost-Per-Impression charging mechanism to Cost-Per-Click (CPC). I am going to talk about the possibility of going one step further and using CPA instead of CPC, and some of our results about this problem.
Asymptotically Optimal Repeated Auctions for Sponsored
Search
Nicolas Lambert
We investigate asymptotically optimal keyword auctions, that is, auctions which maximize revenue as the number of bidders grows. We do so under two alternative behavioral assumptions. The first explicitly models the repeated nature of keyword auctions. It introduces a novel (and, we argue, plausible) assumption on individual bidding, namely that bidders never overbid their value, and bid their actual value if shut out for long enough. Under these conditions we present a broad class of repeated auctions that are asymptotically optimal among all sequential auctions (a superset of repeated auctions). Those auctions have varying payment schemes but share the ranking method. The Google auction belongs to this class, but not the Yahoo auction, and indeed we show that the latter is not asymptotically optimal. (Nonetheless, with some additional distributional assumptions, the Yahoo auction can be shown to belong to a broad category of auctions that are asymptotically optimal among all auction mechanisms that do not rely on ad relevance.) We then look at the one-shot keyword auction, which can be taken to model repeated auctions in which relatively myopic bidders converge on the equilibrium of the full-information stage game. In this case we show that the Google auction remains asymptotically optimal and the Yahoo auction suboptimal. The distributional assumptions under which our theorems hold are quite general, and arguably hold in all real-world situations of interest. We do however show that the Google auction is not asymptotically revenue maximizing for general distributions.