Javier Sanchez
Rohit Singh
Status: Draft proposal
Last modified: 02/06/2001
Abstract
Right now, searching for documents on the Internet has several difficulties:
- To obtain meaningful information from a search engine, we need to combine a series of keywords and go through an iterative, time-consuming, and tedious process of refining the query and visiting pages until we find what we want.
- It requires a direct and active role from the user in the process of finding documents that match his/her interests.
- It usually does not exploit the fact that users of a community are looking for similar resources. Hence, the current process does not encourage peer-to-peer exchange of surfing information.
Suppose a proxy routes the web traffic of a community of users who share common interests, for example CS students and faculty. Then the proxy can be used to:
- aggregate information about the net traffic in the community served by the proxy.
- deduce patterns of usage specific to that community. Note that this and the previous usage are non-personalized and hence involve no active participation by the user in terms of annotating web pages.
- proactively suggest resources to users based on the usage patterns of other community members. Here, users might take an active part by reviewing the web pages they visit.
This could be extended to provide tailored suggestions based on the user's explicitly specified surfing/search preferences; even better would be a proxy that could deduce these preferences on its own.
User Experience and Motivation
Scenario #1: Suppose Drew is interested in ubiquitous-computing articles on the web, but she neither knows the right keywords/search engines nor has the time to do an extensive search, and hence cannot get the most relevant results using a traditional approach. However, the proxy Drew is using (say, the CS department proxy) has information about the pages visited by other users. Among these users are people who are also interested in ubiquitous computing, and the proxy knows which pages they visited on the topic. The proxy can therefore suggest to Drew those pages, which are likely to be relevant to her needs. Note that the server need not explicitly "know" what Drew is looking for and then match it. The power of this approach comes from the fact that users in a community often look for the same resources, and this information can easily be aggregated and de-personalized. Another motivation is that Drew is not only interested in what she can find now about ubiquitous computing; she would also like to learn about new sites related to that topic that she did not find today (and that were not known to the proxy either). In the traditional approach, this would require a new search or someone sending her a link. But if the proxy that monitors the CS community's traffic "knows" that Drew is interested in the topic, it can notify her about new links that were not reported to her previously.
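The de-personalized aggregation described in this scenario could be sketched roughly as follows. All names here are our own hypothetical choices, and we assume the proxy can observe which result pages users follow for a given query:

```python
from collections import Counter, defaultdict

class CommunityIndex:
    """Aggregates, per query, the pages that community members visited.
    Purely de-personalized: no user identity is stored."""

    def __init__(self):
        # query string -> Counter of url -> visit count
        self._pages_by_query = defaultdict(Counter)

    def record_visit(self, query, url):
        # Called by the proxy whenever a user follows a page for `query`.
        self._pages_by_query[query.lower()][url] += 1

    def suggest(self, query, k=5):
        # Return the k pages most often visited for this query.
        return [url for url, _ in
                self._pages_by_query[query.lower()].most_common(k)]

index = CommunityIndex()
index.record_visit("ubiquitous computing", "http://example.edu/ubicomp")
index.record_visit("ubiquitous computing", "http://example.edu/ubicomp")
index.record_visit("ubiquitous computing", "http://example.edu/other")
print(index.suggest("ubiquitous computing", k=1))
```

Because only (query, page, count) triples are kept, the index can serve suggestions without knowing anything about individual users.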
Scenario #2: In the previous scenario, there was very little active participation by Drew in building up the information that the proxy accumulates; its collection is fully automated. But we can do better. Suppose Drew ranks the pages she visits on the basis of their usefulness or relevance vis-a-vis her search. Then others, say Cameron, can take advantage of this ranking. The ranking is especially useful because it has been produced by Cameron's peers, and so it is much more likely to be relevant to her. Of course, this requires users to proactively rank the pages they visit (or otherwise review them).
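The peer ranking in this scenario could look something like the sketch below. The class and method names are hypothetical, and we assume, purely for illustration, a 1-to-5 rating scale:

```python
from collections import defaultdict

class PeerRatings:
    """Stores per-page scores contributed by community members (e.g. Drew)
    so that others (e.g. Cameron) can consult them before visiting a page."""

    def __init__(self):
        self._scores = defaultdict(list)  # url -> list of scores (1..5)

    def rate(self, url, score):
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self._scores[url].append(score)

    def average(self, url):
        # None means no peer has rated this page yet.
        scores = self._scores.get(url)
        return sum(scores) / len(scores) if scores else None

ratings = PeerRatings()
ratings.rate("http://example.edu/ubicomp", 5)
ratings.rate("http://example.edu/ubicomp", 3)
print(ratings.average("http://example.edu/ubicomp"))  # 4.0
```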
System Architecture
[Hold on, we're getting there.] Basically, there will be a client-side plug-in (like the Google Toolbar). The user will need to get a username (to enable user-specific services). This might not be necessary if the user chooses to use only de-personalized services (Scenario #1). On the proxy, there will be document classification, matching, and retrieval modules. At this stage, we have not further elaborated upon the proxy design.
Team Members
- Javier Sanchez
- Rohit Singh
Technical Challenges and Open Issues
A lot of work has been done on related topics in data mining, information retrieval, and text-document matching. We hope to be able to reuse many of the algorithms and code from these domains; in particular, we will need to borrow public-domain code for document classification, search, and retrieval.
- In Scenario #1, it is an interesting question how much time the server should spend looking for pages similar to those being accessed, and how to decide which ones are the right ones. If Drew searched for "everywhere computing", the plain-vanilla way would be to simply look up the links that other users in the community spent time on when they also searched for "everywhere computing". However, Drew might have meant "ubiquitous computing", i.e., she might not have known the right keyword(s). The proxy could therefore compare pages based on their content, but this can be *very* expensive: a search on Google might return 95 links, and it is useless to traverse all 95 and check every page in the proxy's cache for similarity to them. A middle path would be for the proxy to learn that "everywhere" might also refer to "ubiquitous" and to look for results under that entry as well. Given our meager time and resources, we will either choose the plain-vanilla way or take the middle path.
- A practical question for Scenario #1 is finding freely available document classification and retrieval software that we could use off the shelf. If you know of any, please mail us.
- In Scenario #2, a technical challenge will be to perform the searches (based on users' preferences) efficiently in order to suggest links to users. This pertains to the example where Drew wants to be kept updated when new pages relating to ubiquitous computing are added to the proxy's cache. We will need to make this process efficient (i.e., run it as a batch job) while at the same time not introducing too much latency in updating the user.
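The "middle path" described in the first challenge, learning that "everywhere" might also refer to "ubiquitous", could be sketched as a simple query-expansion step. The synonym table here is hand-seeded and hypothetical; the idea is that a real proxy would learn such associations from the community's traffic:

```python
# Hypothetical hand-seeded synonym table; a real proxy might learn these
# associations from co-occurring queries in community traffic.
SYNONYMS = {
    "everywhere": {"ubiquitous", "pervasive"},
    "ubiquitous": {"everywhere", "pervasive"},
}

def expand_query(query):
    """Return the original query plus variants with each term replaced
    by its known synonyms: the 'middle path' between exact lookup and
    full content comparison."""
    terms = query.lower().split()
    variants = {query.lower()}
    for i, term in enumerate(terms):
        for syn in SYNONYMS.get(term, ()):
            variants.add(" ".join(terms[:i] + [syn] + terms[i + 1:]))
    return variants

print(sorted(expand_query("everywhere computing")))
# ['everywhere computing', 'pervasive computing', 'ubiquitous computing']
```

Each variant would then be looked up in the community index exactly as the original query would be, which keeps the cost close to the plain-vanilla approach.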
We could also try to make the proxy intelligent, so that it figures out the user's interests by itself and suggests new and updated pages related to her fields of interest.
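The batch job discussed above might look roughly like the following. The data shapes (a profile map and a list of new repository entries) are assumptions on our part, not a settled design:

```python
def run_alert_batch(profiles, new_entries):
    """One pass of the periodic batch job: match each user's watched
    queries against entries added to the repository since the last run.

    profiles:    {user: set of watched query strings}
    new_entries: list of (query, url) pairs newly added to the repository
    Returns {user: [urls to notify the user about]}.
    """
    alerts = {}
    for user, watched in profiles.items():
        hits = [url for query, url in new_entries if query in watched]
        if hits:
            alerts[user] = hits
    return alerts

profiles = {"drew": {"ubiquitous computing"}}
new_entries = [("ubiquitous computing", "http://example.edu/new-ubicomp"),
               ("web composition", "http://example.edu/webcomp")]
print(run_alert_batch(profiles, new_entries))
# {'drew': ['http://example.edu/new-ubicomp']}
```

Running this once every few hours over only the entries added since the previous run keeps the cost low while bounding the notification latency.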
Demo
The demonstration will include two main deliverables:
- The client, which will be a toolbar (like the Google Toolbar) that allows access to the proxy.
- The proxy, which will implement the following functionality:
  - Act as a bridge between the user and the Internet, for HTTP traffic only.
  - Maintain a repository of pages viewed by community members, possibly indexed by the related search query (if any).
  - Suggest other "relevant" pages to a user based on the pages viewed by other community members in relation to similar queries.
  - Perform user authentication to support user profiles.
  - Manage a user-profile database that explicitly stores the queries for which the user wants to be notified when the proxy's repository is updated, e.g. "ubiquitous computing", "information retrieval", "web composition".
  - Match the above-mentioned queries to (possibly new) entries in the repository and generate suitable alerts.
  - Allow a user to "vote" for a page, or otherwise write simple reviews of it, and store these votes/reviews matched to the corresponding links/pages.
  - Allow other users to look at the ratings/reviews of links they are about to visit.
  - [This might have huge implementation implications, so we are not sure we'll get to it.] Perform searches on the proxy's repository based on these page rankings.
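As a rough, non-committal sketch of that last (uncertain) item, a repository search ordered by community votes could be as simple as the following. The data layout here is hypothetical:

```python
def search_repository(repository, votes, query):
    """Search the proxy's repository for pages indexed under `query`,
    ordered by community vote totals (highest first).

    repository: {query: [urls]}   votes: {url: total vote count}
    """
    candidates = repository.get(query, [])
    return sorted(candidates, key=lambda url: votes.get(url, 0), reverse=True)

repository = {"ubiquitous computing": ["http://a.example", "http://b.example"]}
votes = {"http://b.example": 7, "http://a.example": 2}
print(search_repository(repository, votes, "ubiquitous computing"))
# ['http://b.example', 'http://a.example']
```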