INDEXER:
A Learning Companion For Document Browsing.

Abstract

When people access the web, their activities fall into two broad categories: they are either searching for specific information or browsing (Marko & Yoav 1995). There have been several efforts to support both browsing and searching. Searching, though narrower than browsing, can be time-consuming given the current exponential growth in the volume of information accessible over the Internet. This is a proposal to develop a system that helps narrow the domain of a search. The system "learns" to identify words that the user will find interesting. It first presents the user with a list of keywords from the document in the master window, which the user evaluates as RELEVANT or NOT-RELEVANT. These judgments serve as the standard for the learning system, which maintains a "user preferences" file. The system then creates a clickable list of "interesting words" in a slave window. The user can open documents one after another and read the context relevant to them by clicking on a keyword (chosen by the trained system) in the slave window. The goal of this system, with its summarizer and indexer, is to improve browsing within a document that is itself the result of a search. It thus increases the value of the HTML document as a whole.

Definitions:

KEYWORD

A word to which the system assigns a high priority, based on position,
frequency and other factors.

INTERESTING WORD

A narrower category of words: those which the system, after training,
will present as a clickable index in the slave window.

A Real World Analogy Of The Problem:

Architectural Components:

Netscape 1.1N
Perl
A custom proxy server: It acts as an intermediate server which provides a path through which all browser requests pass.
An INDEXER program
A SUMMARIZER program by Joe Felder, which will be included in the package.
A master client (Netscape window in which the document will be displayed).
An INDEXER slave client (Netscape window with clickable keywords).
A SUMMARIZER slave client (Netscape window in which the TOC will be displayed).

Solution Approach:

The Blue Print For Indexing:

 1.   Obtain training document (Display in master window).

 2.   Identify individual text words.

 3.   Use stop list to delete common words.

 4.   Use suffix stripping algorithms.

 5.   Have the user mark each retrieved word as relevant (score=1)
      or non-relevant (score=0).

 6.   Compute term weights of relevant words using prescribed formula.

 7.   Place words and weights in user preferences file (u.p.f.).

 8.   Obtain new document and repeat steps 2-4.

 9.   Place the keywords in a document vector.

10.   Find similarity coefficient of u.p.f. and document vector.

11.   Find weights of newly found terms.

12.   Reformulate contents of u.p.f. using relevance feedback formula.

13.   Create a clickable index of "interesting words" which are
      the contents of u.p.f. (Display in slave window)

14.   Return to step 8
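
Steps 2-4 of the blueprint could be sketched roughly as follows. This is only an illustration, written in Python for brevity: the stop list and the suffix list are hypothetical placeholders, and the crude suffix stripper stands in for a real suffix-stripping algorithm such as Porter's.

```python
import re

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "are"}

# Hypothetical suffix list, checked in this order. A crude stand-in
# for a real suffix-stripping algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Steps 2-4: split into words, drop stop words, strip suffixes."""
    words = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]
```

The returned list keeps repeated terms, so it can be used later to count term frequencies.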

Learning and Relevance Feedback:

     w(i,j) = t(i,j)*log(N/d(i))  


          where w(i,j) = weight of ith word in jth document
                t(i,j) = ith term frequency in jth document
                N      = number of documents evaluated
                d(i)   = number of documents in which word i appears
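
As a minimal sketch, the weight formula above could be computed as shown here. The function name and the data layout (one list of terms per document) are assumptions made for illustration, and the natural logarithm is assumed because the formula does not state a base.

```python
import math
from collections import Counter


def term_weights(documents):
    """Return one {term: weight} dict per document.

    documents is a list of term lists, one list per evaluated document.
    """
    n_docs = len(documents)                  # N
    doc_freq = Counter()                     # d(i)
    for terms in documents:
        doc_freq.update(set(terms))          # count each term once per document

    weights = []
    for terms in documents:
        term_freq = Counter(terms)           # t(i,j)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

Note that a word appearing in every evaluated document gets weight 0, since log(N/N) = 0; the formula favours words that are frequent in one document but rare across the others.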

The preference vector P is the vector whose elements are the weights of the relevant words, placed in order.
              P = w(i,j) for all i, j
          
Once the system is trained to satisfaction, a new document is retrieved and its keywords are determined. The weights of the new words, d(i,j), computed with the formula above, are placed in a document vector D. (Here d(i,j) denotes a weight and is distinct from the document count d(i) used in the formula.)
              D =  d(i,j) for all i,j
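
Step 10 of the blueprint compares the preference vector with the document vector. The proposal does not fix the exact similarity coefficient, so the sketch below assumes the cosine measure, a common choice for weighted term vectors. Storing each vector as a {term: weight} dictionary is also an assumption.

```python
import math


def similarity(preferences, document):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * document[t] for t, w in preferences.items() if t in document)
    norm_p = math.sqrt(sum(w * w for w in preferences.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0 or norm_d == 0:
        return 0.0  # an empty or all-zero vector matches nothing
    return dot / (norm_p * norm_d)
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (identical direction).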

The similarity coefficient of P and D is then computed (step 10) and the contents of the u.p.f. are reformulated using relevance feedback (step 12), so that the system adjusts its parameters and makes a more precise list of "interesting" words. The system undergoes training until it can identify the "interesting" words on its own for the rest of the documents. It then displays a clickable index of "interesting" words for each of the documents.
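
The relevance feedback formula of step 12 is not spelled out in this proposal. As one possible illustration, the sketch below assumes a Rocchio-style update applied per word: weights of words judged relevant are added to the preferences, and weights of words judged non-relevant are subtracted. The constants beta and gamma, the function names and the dictionary layout are all hypothetical.

```python
def update_preferences(preferences, document, judgments, beta=0.5, gamma=0.25):
    """Return a new {term: weight} preference vector.

    preferences and document map terms to weights; judgments maps a term
    to True (relevant, score=1) or False (non-relevant, score=0).
    beta and gamma are hypothetical tuning constants.
    """
    updated = dict(preferences)
    for term, relevant in judgments.items():
        weight = document.get(term, 0.0)
        step = beta if relevant else -gamma
        updated[term] = updated.get(term, 0.0) + step * weight
    # Keep only positively weighted terms: these are the "interesting words".
    return {term: w for term, w in updated.items() if w > 0}


def interesting_words(preferences, limit=10):
    """Terms ordered from highest to lowest weight, for the clickable index."""
    ranked = sorted(preferences.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]
```

Words the user did not judge keep their current weight, and a word whose weight falls to zero or below drops out of the index.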
