2. Automatically Derived Collection Overviews

Modern Information Retrieval
Chapter 10: User Interfaces and Visualization

Scatter/Gather

Many attempts to display overview information have focused on automatically extracting the most common general themes that occur within the collection. These themes are derived using unsupervised analysis methods, usually variants of document clustering. Clustering organizes documents into groups based on their similarity to one another; the centroids of the clusters determine the themes in the collection.
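As a rough illustration of how cluster centroids can yield themes, the following minimal sketch clusters a toy collection and reads off the highest-weighted terms of each centroid. The spherical k-means variant, the hand-picked seed documents, the raw term-frequency weighting, and all function names are assumptions made for this example; they are not taken from any of the systems discussed in this section.

```python
import math
from collections import Counter

def unit_vector(tokens):
    """Term-frequency vector of a token list, scaled to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def dot(u, v):
    """Dot product of sparse vectors (cosine similarity for unit vectors)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {term: w / norm for term, w in total.items()}

def cluster(vectors, seed_indices, iterations=10):
    """Spherical k-means; the seeds are indices of the initial centroids."""
    centroids = [vectors[i] for i in seed_indices]
    assignment = []
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        assignment = [
            max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
            for vec in vectors
        ]
        # Recompute each centroid from its current members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

def top_terms(cent, n=3):
    """Highest-weighted centroid terms; ties are broken alphabetically."""
    ranked = sorted(cent.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Invented toy collection: two documents per topic.
documents = [
    "neural network learning model".split(),
    "learning model training neural".split(),
    "court law legal ruling".split(),
    "legal court appeal law".split(),
]
vectors = [unit_vector(d) for d in documents]
assignment, centroids = cluster(vectors, seed_indices=[0, 2])
themes = [top_terms(c) for c in centroids]
```

Here the theme of each cluster is simply the list of terms that carry the most weight in its centroid, which is one common way of labeling clusters.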

The Scatter/Gather browsing paradigm [cutting92, cutting93] clusters documents into topically coherent groups and presents descriptive textual summaries to the user. The summaries consist of topical terms that characterize each cluster generally, and a set of typical titles that hint at the contents of the cluster. Informed by the summaries, the user may select a subset of clusters that seem to be of most interest and recluster their contents. Thus the user can examine the contents of each subcollection at progressively finer levels of detail. The reclustering is computed on the fly; different themes are produced depending on the documents contained in the subcollection to which clustering is applied. The choice of clustering algorithm influences what clusters are produced, but no single algorithm has been shown to be markedly better than the others when producing the same number of clusters [willett88].
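The scatter, select, and recluster cycle described above can be sketched as follows. This is only an illustration of the interaction loop: the nearest-seed grouping with Jaccard overlap is a simple stand-in for a real clustering algorithm, and the function names (`scatter`, `summarize`, `gather`) and the toy documents are assumptions made for this example rather than the method of the cited papers.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def scatter(docs, k):
    """Group docs (title -> term set) around k mutually dissimilar seeds."""
    titles = sorted(docs)
    seeds = [titles[0]]
    while len(seeds) < min(k, len(titles)):
        rest = [t for t in titles if t not in seeds]
        # Next seed: the document least similar to the seeds chosen so far.
        seeds.append(min(
            rest, key=lambda t: max(jaccard(docs[t], docs[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for t in titles:
        nearest = max(seeds, key=lambda s: jaccard(docs[t], docs[s]))
        clusters[nearest].append(t)
    return list(clusters.values())

def summarize(docs, cluster, n_terms=3, n_titles=2):
    """Topical terms plus a few typical titles for one cluster."""
    counts = Counter(term for t in cluster for term in docs[t])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"terms": [w for w, _ in ranked[:n_terms]],
            "titles": cluster[:n_titles],
            "size": len(cluster)}

def gather(docs, clusters, chosen):
    """Merge the chosen clusters into a smaller subcollection."""
    return {t: docs[t] for i in chosen for t in clusters[i]}

# Invented toy collection.
docs = {
    "Mars rover landing": {"space", "mars", "rover"},
    "Mars orbit mission": {"space", "mars", "orbit"},
    "Moon base plans": {"space", "moon", "base"},
    "Moon rock samples": {"space", "moon", "rock"},
    "Election results": {"politics", "election", "vote"},
    "Senate vote": {"politics", "senate", "vote"},
}

first = scatter(docs, 2)                  # politics vs. space
subcollection = gather(docs, first, [1])  # the user picks the space cluster
second = scatter(subcollection, 2)        # space splits into Mars vs. Moon
summaries = [summarize(subcollection, c) for c in second]
```

The second call to `scatter` sees only the gathered documents, so the themes it produces (Mars versus Moon) differ from those of the first pass (politics versus space), which mirrors the dependence of themes on the subcollection being clustered.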

A user study [pirolli96] showed that the use of Scatter/Gather on a large text collection successfully conveys some of the content and structure of the corpus. However, that study also showed that Scatter/Gather without a search facility was less effective than a standard similarity search for finding relevant documents for a query. That is, subjects allowed only to navigate, not to search over, a hierarchical structure of clusters covering the entire collection were less able to find documents relevant to the supplied query than subjects allowed to write queries and scan through retrieval results.

It is possible to integrate Scatter/Gather with conventional search technology by applying clustering to the results of a query to organize the retrieved documents (see the figure below). An offline experiment [hearst96e] suggests that clustering may be more effective if used in this manner. The study found that documents relevant to the query tend to fall mainly into one or two out of five clusters, if the clusters are generated from the top-ranked documents retrieved in response to the query. The study also showed that precision and recall were higher within the best cluster than within the retrieval results as a whole. The implication is that a user might save time by looking at the contents of the cluster with the highest proportion of relevant documents while avoiding those clusters with mainly non-relevant documents. Thus, clustering of retrieval results may be useful for directing users to a subset of the retrieval results that contains a large proportion of the relevant documents.

**Figure:** Display of Scatter/Gather clustering retrieval results [cutting92].
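The kind of comparison made in the offline experiment can be illustrated with a small calculation: the precision of each cluster against the precision of the retrieval results as a whole. The document identifiers, cluster memberships, and relevance judgments below are invented for the example and do not reproduce the data of the cited study.

```python
def precision(doc_ids, relevant):
    """Fraction of the given documents that are judged relevant."""
    return sum(d in relevant for d in doc_ids) / len(doc_ids)

# Hypothetical top-20 retrieval results, already grouped into five clusters.
clusters = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
]
relevant = {1, 2, 3, 5, 6, 13}  # hypothetical relevance judgments

retrieved = [d for cluster in clusters for d in cluster]
overall = precision(retrieved, relevant)
per_cluster = [precision(cluster, relevant) for cluster in clusters]
best = max(range(len(clusters)), key=lambda i: per_cluster[i])

# Share of all relevant documents that falls into the two richest clusters.
hits = sorted((sum(d in relevant for d in c) for c in clusters), reverse=True)
top_two_share = sum(hits[:2]) / len(relevant)
```

In this made-up case the best cluster has a precision of 0.75 against 0.3 for the full result list, and five of the six relevant documents sit in two of the five clusters, so a user who opens the best cluster first reads far fewer non-relevant documents.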

General themes do seem to arise from document clustering, but the themes are highly dependent on the makeup of the documents within the clusters [hearst96e, hearst98a]. The unsupervised nature of clustering can result in a display of topics at varying levels of description. For example, clustering a collection of documents about computer science might result in clusters containing documents about artificial intelligence, computer theory, computer graphics, computer architecture, programming languages, government, and legal issues. The latter two themes are more general than the others, because they are about topics outside the general scope of computer science. Thus clustering can result in the juxtaposition of very different levels of description within a single display.

Scatter/Gather shows a textual representation of document clusters. Researchers have developed several approaches to map documents from their high-dimensional representation in document space into a 2D representation in which each document is represented as a small glyph or icon on a map or within an abstract 2D space. The functions for transforming the data into the lower-dimensional space differ, but the net effect is that each document is placed at one point in a scatter-plot-like representation of the space. Users are meant to detect themes or clusters in the arrangement of the glyphs. Systems employing such graphical displays include BEAD [chalmers92], the Galaxy of News [rennison94], and ThemeScapes [wise95]. The ThemeScapes view imposes a three-dimensional representation on the results of clustering (see the figure below). The layout makes use of "negative space" to help emphasize the areas of concentration where the clusters occur. Other systems display inter-document similarity hierarchically [maarek94, allen93], while still others display retrieved documents in networks based on inter-document similarity [fowler91, thompson89].

**Figure:** A three-dimensional overview based on document clustering [wise95].
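To make the idea of placing each document at one point in a 2D space concrete, the sketch below projects term-count vectors onto their first two principal components, computed by power iteration. Principal component analysis is used here only as a generic, well-known dimensionality reduction; it is an assumption of this example and not the layout function of BEAD, the Galaxy of News, or ThemeScapes. The vocabulary and counts are invented.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def principal_axis(rows, rng, iterations=100):
    """Direction of greatest variance of centered rows, by power iteration."""
    dims = len(rows[0])
    axis = [rng.random() + 0.1 for _ in range(dims)]
    for _ in range(iterations):
        scores = [dot(r, axis) for r in rows]
        new = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dims)]
        norm = math.sqrt(dot(new, new))
        if norm == 0.0:  # no variance left along any direction
            break
        axis = [v / norm for v in new]
    return axis

def project_2d(rows, seed=0):
    """Map each row to (x, y) coordinates on the top two principal axes."""
    rng = random.Random(seed)
    centered = center(rows)
    first = principal_axis(centered, rng)
    # Remove the first component so the second axis is orthogonal to it.
    residual = [[x - dot(r, first) * a for x, a in zip(r, first)]
                for r in centered]
    second = principal_axis(residual, rng)
    return [(dot(r, first), dot(r, second)) for r in centered]

# Invented term counts over the vocabulary [ai, learning, graphics, rendering].
vectors = [
    [3, 2, 0, 0],  # AI document
    [2, 3, 0, 0],  # AI document
    [0, 0, 3, 2],  # graphics document
    [0, 0, 2, 3],  # graphics document
]
points = project_2d(vectors)
```

Plotting `points` as glyphs would place the two AI documents near each other and far from the two graphics documents, which is the visual grouping that users of such displays are expected to notice.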

Kohonen feature maps

Kohonen's feature map algorithm has been used to create maps that graphically characterize the overall content of a document collection or subcollection [lin91, hchen98] (see the figure below). The regions of the 2D map vary in size and shape according to how frequently documents assigned to the corresponding themes occur within the collection. Regions are characterized by single words or phrases, and adjacency of regions is meant to reflect semantic relatedness of the themes within the collection. Moving the cursor over a document region causes the titles of the documents most strongly associated with that region to be displayed in a pop-up window. Documents can be associated with more than one region.

**Figure:** A two-dimensional overview created using a Kohonen feature map learning algorithm on Web pages having to do with the topic Entertainment [hchen98].
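A minimal sketch of a Kohonen-style feature map is given below to show the basic mechanism: each grid unit holds a weight vector, every document pulls its best-matching unit and that unit's grid neighbors toward itself, and the number of documents won by a unit gives a rough measure of region size. The grid size, learning rate, Gaussian neighborhood, linear decay schedule, and toy vectors are all assumptions chosen for illustration; they are not the settings of the cited systems, and this simplified version assigns each document to a single unit only.

```python
import math
import random
from collections import Counter

def best_unit(weights, vec):
    """Grid position whose weight vector is closest to vec."""
    return min(weights,
               key=lambda node: sum((w - x) ** 2
                                    for w, x in zip(weights[node], vec)))

def train_map(data, rows, cols, epochs=50, seed=0):
    """Train a rows x cols feature map on equal-length vectors."""
    rng = random.Random(seed)
    dims = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dims)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        rate = 0.5 * remaining                           # decaying step size
        radius = max(rows, cols) / 2 * remaining + 0.5   # shrinking neighborhood
        for vec in data:
            winner = best_unit(weights, vec)
            for node, w in weights.items():
                grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
                influence = math.exp(-grid_d2 / (2 * radius * radius))
                # Move the unit toward the document; nearer units move more.
                for j in range(dims):
                    w[j] += rate * influence * (vec[j] - w[j])
    return weights

# Invented term weights over three terms for five documents.
documents = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 0.9],
]
weights = train_map(documents, rows=2, cols=2)

# Region size: how many documents each grid unit wins.
region_sizes = Counter(best_unit(weights, d) for d in documents)
```

Because neighboring units are updated together, units that are adjacent on the grid tend to end up with similar weight vectors, which is why adjacency on the finished map is read as relatedness of themes.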