Javier Sanchez
Rohit Singh
Status: Draft proposal
Last modified: 02/06/2001
Abstract
Right now, searching for documents on the Internet has several difficulties:
- To obtain meaningful information from a search engine, we need to combine a series of keywords and go through an iterative, time-consuming, and tedious process of refining the query and visiting pages until we find what we want.
- It requires a direct and active role from the user in the process of finding documents that match his/her interests.
- It usually does not exploit the fact that users of a community are looking for similar resources. Hence, the current process does not encourage peer-to-peer exchange of surfing information.
Suppose a proxy routes the web traffic of a community of users who share common interests, for example CS students and faculty. Then the proxy can be used to:
- aggregate information about the net traffic in the community served by the proxy.
- deduce patterns of usage specific to that community. Note that this and the previous usage are non-personalized and hence involve no active participation by the user in terms of annotating web pages.
- proactively suggest resources to users based on the usage patterns of other community members. Here, users might take an active part by reviewing the web pages they visit.
This could be extended to provide tailored suggestions based on the user's explicitly specified surfing/search preferences; even better would be a proxy that could deduce these preferences on its own.
User Experience and Motivation
Scenario #1: Suppose Drew is interested in ubiquitous-computing articles on the web, but she neither knows the right keywords/search engines nor has the time to do an extensive search, and hence cannot get the most relevant results using a traditional approach. However, the proxy Drew is using (say, the CS department proxy) has information about the pages visited by other users. Among these users are people who are also interested in ubiquitous computing, and the proxy knows which pages they visited on the topic. The proxy can therefore suggest to Drew those pages, which are likely to be relevant to her needs. Note that the server need not explicitly "know" what Drew is looking for and then match it. The power of this approach comes from the fact that users in a community often look for the same resources, and this information can easily be aggregated and de-personalized. Another motivation is that Drew is not only interested in what she can find now about ubiquitous computing; she would also like to learn about new sites related to that topic that she did not find today (and that were not known to the proxy either). In the traditional approach, this would require a new search or someone sending her a link. But if the proxy that monitors the CS community's traffic "knows" that Drew is interested in the topic, it can notify her about new links that were not reported to her previously.
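The de-personalized aggregation described in this scenario could be sketched roughly as follows. All names here are our own hypothetical choices, and we assume the proxy can observe which result pages users follow for a given query:

```python
from collections import Counter, defaultdict

class CommunityIndex:
    """Aggregates, per query, the pages that community members visited.
    Purely de-personalized: no user identity is stored."""

    def __init__(self):
        # query string -> Counter of url -> visit count
        self._pages_by_query = defaultdict(Counter)

    def record_visit(self, query, url):
        # Called by the proxy whenever a user follows a page for `query`.
        self._pages_by_query[query.lower()][url] += 1

    def suggest(self, query, k=5):
        # Return the k pages most often visited for this query.
        return [url for url, _ in
                self._pages_by_query[query.lower()].most_common(k)]

index = CommunityIndex()
index.record_visit("ubiquitous computing", "http://example.edu/ubicomp")
index.record_visit("ubiquitous computing", "http://example.edu/ubicomp")
index.record_visit("ubiquitous computing", "http://example.edu/other")
print(index.suggest("ubiquitous computing", k=1))
```

Because only (query, page, count) triples are kept, the index can serve suggestions without knowing anything about individual users.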
Scenario #2: In the previous scenario, there was very little active participation by Drew in building up the information that the proxy accumulates; its collection is fully automated. But we can do better. Suppose Drew ranks the pages she visits on the basis of their usefulness or relevance vis-a-vis her search. Then others, say Cameron, can take advantage of this ranking. The ranking is especially useful because it has been produced by Cameron's peers, and so it is much more likely to be relevant to her. Of course, this requires users to proactively rank the pages they visit (or otherwise review them).
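The peer ranking in this scenario could look something like the sketch below. The class and method names are hypothetical, and we assume, purely for illustration, a 1-to-5 rating scale:

```python
from collections import defaultdict

class PeerRatings:
    """Stores per-page scores contributed by community members (e.g. Drew)
    so that others (e.g. Cameron) can consult them before visiting a page."""

    def __init__(self):
        self._scores = defaultdict(list)  # url -> list of scores (1..5)

    def rate(self, url, score):
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self._scores[url].append(score)

    def average(self, url):
        # None means no peer has rated this page yet.
        scores = self._scores.get(url)
        return sum(scores) / len(scores) if scores else None

ratings = PeerRatings()
ratings.rate("http://example.edu/ubicomp", 5)
ratings.rate("http://example.edu/ubicomp", 3)
print(ratings.average("http://example.edu/ubicomp"))  # 4.0
```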
System Architecture
[Hold on, we're getting there.] Basically, there will be a client-side plug-in (like the Google Toolbar). The user will need to get a username (to enable user-specific services). This might not be necessary if the user chooses to use only de-personalized services (Scenario #1). On the proxy, there will be document classification, matching, and retrieval modules. At this stage, we have not further elaborated upon the proxy design.
Team Members
- Javier Sanchez
- Rohit Singh
Technical Challenges and Open Issues
A lot of work has been done on related topics in data mining, information retrieval, and text-document matching. We hope to be able to reuse many of the algorithms and code from these domains; in particular, we will need to borrow public-domain code for document classification, search, and retrieval.
- In Scenario #1, it is an interesting question how much time the server should spend looking for pages similar to those being accessed, and how to decide which ones are the right ones. If Drew searched for "everywhere computing", the plain-vanilla way would be to simply look up the links that other users in the community spent time on when they also searched for "everywhere computing". However, Drew might have meant "ubiquitous computing", i.e., she might not have known the right keyword(s). The proxy could therefore compare pages based on their content, but this can be *very* expensive: a search on Google might return 95 links, and it is useless to traverse all 95 and check every page in the proxy's cache for similarity to them. A middle path would be for the proxy to learn that "everywhere" might also refer to "ubiquitous" and to look for results under that entry as well. Given our meager time and resources, we will either choose the plain-vanilla way or take the middle path.
- A practical question for Scenario #1 is finding freely available document classification and retrieval software that we could use off the shelf. If you know of any, please mail us.
- In Scenario #2, a technical challenge will be to perform the searches (based on users' preferences) efficiently in order to suggest links to users. This pertains to the example where Drew wants to be kept updated when new pages relating to ubiquitous computing are added to the proxy's cache. We will need to make this process efficient (i.e., run it as a batch job) while at the same time not introducing too much latency in updating the user.
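The "middle path" described in the first challenge, learning that "everywhere" might also refer to "ubiquitous", could be sketched as a simple query-expansion step. The synonym table here is hand-seeded and hypothetical; the idea is that a real proxy would learn such associations from the community's traffic:

```python
# Hypothetical hand-seeded synonym table; a real proxy might learn these
# associations from co-occurring queries in community traffic.
SYNONYMS = {
    "everywhere": {"ubiquitous", "pervasive"},
    "ubiquitous": {"everywhere", "pervasive"},
}

def expand_query(query):
    """Return the original query plus variants with each term replaced
    by its known synonyms: the 'middle path' between exact lookup and
    full content comparison."""
    terms = query.lower().split()
    variants = {query.lower()}
    for i, term in enumerate(terms):
        for syn in SYNONYMS.get(term, ()):
            variants.add(" ".join(terms[:i] + [syn] + terms[i + 1:]))
    return variants

print(sorted(expand_query("everywhere computing")))
# ['everywhere computing', 'pervasive computing', 'ubiquitous computing']
```

Each variant would then be looked up in the community index exactly as the original query would be, which keeps the cost close to the plain-vanilla approach.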
We could also try to make the proxy intelligent, so that it figures out the user's interests by itself and suggests new and updated pages related to her fields of interest.
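The batch job discussed above might look roughly like the following. The data shapes (a profile map and a list of new repository entries) are assumptions on our part, not a settled design:

```python
def run_alert_batch(profiles, new_entries):
    """One pass of the periodic batch job: match each user's watched
    queries against entries added to the repository since the last run.

    profiles:    {user: set of watched query strings}
    new_entries: list of (query, url) pairs newly added to the repository
    Returns {user: [urls to notify the user about]}.
    """
    alerts = {}
    for user, watched in profiles.items():
        hits = [url for query, url in new_entries if query in watched]
        if hits:
            alerts[user] = hits
    return alerts

profiles = {"drew": {"ubiquitous computing"}}
new_entries = [("ubiquitous computing", "http://example.edu/new-ubicomp"),
               ("web composition", "http://example.edu/webcomp")]
print(run_alert_batch(profiles, new_entries))
# {'drew': ['http://example.edu/new-ubicomp']}
```

Running this once every few hours over only the entries added since the previous run keeps the cost low while bounding the notification latency.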
Demo
The demonstration will include two main deliverables:
- The client, which will be a toolbar (like the Google Toolbar) that allows access to the proxy.
- The proxy, which will implement the following functionality:
  - Act as a bridge between the user and the Internet, for HTTP traffic only.
  - Maintain a repository of pages viewed by community members, possibly indexed by the related search query (if any).
  - Suggest other "relevant" pages to a user based on the pages viewed by other community members in relation to similar queries.
  - Perform user authentication to support user profiles.
  - Manage a user-profile database that explicitly stores the queries for which the user wants to be notified when the proxy's repository is updated, e.g. "ubiquitous computing", "information retrieval", "web composition".
  - Match the above-mentioned queries to (possibly new) entries in the repository and generate suitable alerts.
  - Allow a user to "vote" for a page, or otherwise write simple reviews of it, and store these votes/reviews matched to the corresponding links/pages.
  - Allow other users to look at the ratings/reviews of links they are about to visit.
  - [This might have huge implementation implications, so we are not sure we'll get to it.] Perform searches on the proxy's repository based on these page rankings.
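As a rough, non-committal sketch of that last (uncertain) item, a repository search ordered by community votes could be as simple as the following. The data layout here is hypothetical:

```python
def search_repository(repository, votes, query):
    """Search the proxy's repository for pages indexed under `query`,
    ordered by community vote totals (highest first).

    repository: {query: [urls]}   votes: {url: total vote count}
    """
    candidates = repository.get(query, [])
    return sorted(candidates, key=lambda url: votes.get(url, 0), reverse=True)

repository = {"ubiquitous computing": ["http://a.example", "http://b.example"]}
votes = {"http://b.example": 7, "http://a.example": 2}
print(search_repository(repository, votes, "ubiquitous computing"))
# ['http://b.example', 'http://a.example']
```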