Modern Information Retrieval Chapter 10: User Interfaces and Visualization
Scatter/Gather
Many attempts to display overview information have focused on automatically extracting the most common general themes that occur within the collection. These themes are derived via unsupervised analysis methods, usually variants of document clustering. Clustering organizes documents into groups based on their similarity to one another; the centroids of the clusters determine the themes in the collection.
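As a rough illustration of how a centroid can stand for a theme, the following sketch averages the term counts of the documents in each group and reads off the highest-weighted terms. The group names, the toy documents, and the helper names `centroid` and `top_terms` are invented for this example; real systems typically use weighted terms rather than raw counts.

```python
from collections import Counter

def centroid(docs):
    """Average term-count vector of a group of tokenized documents."""
    total = Counter()
    for tokens in docs:
        total.update(tokens)
    return {term: count / len(docs) for term, count in total.items()}

def top_terms(cent, k=3):
    """The k heaviest terms of a centroid; ties are broken alphabetically."""
    ranked = sorted(cent.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical groups, assumed to have been produced by some clustering step.
groups = {
    "A": [["neural", "network", "learning"], ["learning", "agent", "neural"]],
    "B": [["court", "law", "patent"], ["law", "court", "software"]],
}
themes = {name: top_terms(centroid(docs), k=2) for name, docs in groups.items()}
print(themes)
```

The terms that dominate each centroid are the ones shared by most documents in the group, which is why they can serve as a short description of the group's theme.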
The Scatter/Gather browsing paradigm [#!cutting92!#,#!cutting93!#] clusters documents into topically coherent groups and presents descriptive textual summaries to the user. The summaries consist of topical terms that characterize each cluster generally, and a set of typical titles that hint at the contents of the cluster. Informed by the summaries, the user may select a subset of clusters that seem to be of most interest, and recluster their contents. Thus the user can examine the contents of each subcollection at progressively finer granularity of detail. The reclustering is computed on the fly; different themes are produced depending on the documents contained in the subcollection to which clustering is applied. The choice of clustering algorithm influences what clusters are produced, but no single algorithm has been shown to be clearly better than the others when producing the same number of clusters [#!willett88!#].
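The scatter, summarize, gather, and recluster cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: a simple k-means-style procedure with cosine similarity stands in for whatever clustering algorithm a real system uses, document ids stand in for titles, and the function names `scatter`, `summarize`, and `gather` as well as the toy documents are invented for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def scatter(docs, k, iters=10):
    """Cluster docs (id -> token list) into at most k groups of ids."""
    ids = sorted(docs)
    vecs = {i: Counter(docs[i]) for i in ids}
    seeds = [ids[0]]  # deterministic seeding: start with the first id ...
    while len(seeds) < min(k, len(ids)):
        rest = [i for i in ids if i not in seeds]
        # ... then add the document least similar to the seeds chosen so far
        seeds.append(min(rest, key=lambda i: max(cosine(vecs[i], vecs[s]) for s in seeds)))
    centroids = [dict(vecs[s]) for s in seeds]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for i in ids:  # assign each document to its most similar centroid
            best = max(range(len(centroids)), key=lambda c: cosine(vecs[i], centroids[c]))
            clusters[best].append(i)
        centroids = [mean([vecs[i] for i in c]) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def summarize(docs, cluster, n_terms=3, n_titles=2):
    """Topical terms plus the ids of the documents closest to the centroid."""
    vecs = {i: Counter(docs[i]) for i in cluster}
    cent = mean(list(vecs.values()))
    terms = sorted(cent, key=lambda t: (-cent[t], t))[:n_terms]
    titles = sorted(cluster, key=lambda i: -cosine(vecs[i], cent))[:n_titles]
    return terms, titles

def gather(docs, chosen_clusters):
    """Merge the selected clusters into a new, smaller subcollection."""
    return {i: docs[i] for c in chosen_clusters for i in c}

docs = {
    "d1": ["star", "galaxy", "telescope"],
    "d2": ["star", "planet", "orbit"],
    "d3": ["galaxy", "telescope", "orbit"],
    "d4": ["court", "law", "judge"],
    "d5": ["law", "patent", "court"],
    "d6": ["judge", "patent", "trial"],
}
top = scatter(docs, k=2)                       # first scatter over everything
for cluster in top:
    print(cluster, summarize(docs, cluster))
finer = scatter(gather(docs, [top[0]]), k=2)   # user picks cluster 0, recluster
print(finer)
```

Because the second call to `scatter` sees only the gathered documents, the groups it produces reflect distinctions within that subcollection rather than within the collection as a whole.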
A user study [#!pirolli96!#] showed that the use of Scatter/Gather on a large text collection successfully conveys some of the content and structure of the corpus. However, that study also showed that Scatter/Gather without a search facility was less effective than a standard similarity search for finding relevant documents for a query. That is, subjects allowed only to navigate, not to search over, a hierarchical structure of clusters covering the entire collection were less able to find documents relevant to the supplied query than subjects allowed to write queries and scan through retrieval results.
It is possible to integrate Scatter/Gather with conventional search
technology by applying clustering on the results of a query to
organize the retrieved documents (see Figure ).
An offline experiment [#!hearst96e!#]
suggests that clustering may be more effective if used in this manner.
The study found that documents relevant to the query tend to fall
mainly into one or two out of five clusters, if the clusters are
generated from the top-ranked documents retrieved in response to the
query. The study also showed that precision and recall were higher
within the best cluster than within the retrieval results as a whole.
The implication is that a user might save time by looking at the
contents of the cluster with the highest proportion of relevant
documents and at the same time avoiding those clusters with mainly
non-relevant documents. Thus, clustering of retrieval results may be
useful for helping direct users to a subset of the retrieval results
that contains a large proportion of the relevant documents.
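The comparison between the best cluster and the retrieval results as a whole amounts to simple arithmetic, sketched below. The clusters and relevance judgments are invented purely to show the computation and are not data from the cited study; the helper name `precision` is likewise an assumption.

```python
def precision(doc_ids, relevant):
    """Fraction of the given documents that are relevant."""
    return len(set(doc_ids) & relevant) / len(doc_ids)

# Hypothetical clustering of ten retrieved documents and made-up judgments.
clusters = [
    ["d1", "d2", "d3", "d4"],
    ["d5", "d6", "d7"],
    ["d8", "d9", "d10"],
]
relevant = {"d1", "d2", "d4", "d6"}

retrieved = [d for c in clusters for d in c]
overall = precision(retrieved, relevant)
per_cluster = [precision(c, relevant) for c in clusters]
best = max(range(len(clusters)), key=lambda i: per_cluster[i])
# Share of all relevant retrieved documents that sit in the best cluster.
share_in_best = len(set(clusters[best]) & relevant) / len(relevant)

print(overall, per_cluster, best, share_in_best)
```

In this toy case a user who reads only the best cluster scans four documents instead of ten and still sees three of the four relevant ones.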
General themes do seem to arise from document clustering, but the
themes are highly dependent on the makeup of the documents within the
clusters [#!hearst96e!#,#!hearst98a!#]. The unsupervised nature of
clustering can result in a display of topics at varying levels of
description. For example, clustering a collection of documents about
computer science might result in clusters containing documents about
artificial intelligence, computer theory, computer graphics, computer
architecture, programming languages, government, and legal issues.
The latter two themes are more general than the others, because they
are about topics outside the general scope of computer science.
Thus clustering can result in the juxtaposition of
very different levels of description within a single display.
Scatter/Gather shows a textual representation of document clusters.
Researchers have developed several approaches to map documents from
their high dimensional representation in document space into a 2D
representation in which each document is represented as a small glyph
or icon on a map or within an abstract 2D space. The functions for
transforming the data into the lower dimensional space differ, but the
net effect is that each document is placed at one point in a
scatter-plot-like representation of the space. Users are meant to
detect themes or clusters in the arrangement of the glyphs. Systems
employing such graphical displays include BEAD [#!chalmers92!#], the
Galaxy of News [#!rennison94!#], and ThemeScapes [#!wise95!#]. The
ThemeScapes view imposes a three-dimensional representation on the
results of clustering (see Figure ). The layout
makes use of 'negative space' to help emphasize the areas of
concentration where the clusters occur. Other systems display
inter-document similarity hierarchically [#!maarek94!#,#!allen93!#], while
still others display retrieved documents in networks based on
inter-document similarity [#!fowler91!#,#!thompson89!#].
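To make the idea of mapping documents from a high dimensional space onto a 2D display concrete, here is a minimal sketch that uses principal component analysis computed by power iteration. This is a swapped-in technique chosen for brevity; it is not claimed to be the transformation used by any of the systems cited above, and the function name `project_2d` and the toy documents are assumptions.

```python
import math

def project_2d(vectors, iters=200):
    """Project row vectors onto their first two principal axes."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    def scatter_times(w):
        # (X^T X) w, computed without forming the d x d matrix
        scores = [sum(x[j] * w[j] for j in range(d)) for x in X]
        return [sum(scores[i] * X[i][j] for i in range(n)) for j in range(d)]

    def normalize(w):
        norm = math.sqrt(sum(c * c for c in w))
        return [c / norm for c in w]

    axes = []
    for _axis in range(2):
        w = normalize([float(j + 1) for j in range(d)])  # fixed start vector
        for _step in range(iters):
            w = scatter_times(w)
            for a in axes:  # deflation: stay orthogonal to earlier axes
                dot = sum(wc * ac for wc, ac in zip(w, a))
                w = [wc - dot * ac for wc, ac in zip(w, a)]
            w = normalize(w)
        axes.append(w)
    return [tuple(sum(x[j] * a[j] for j in range(d)) for a in axes) for x in X]

# Four toy documents: two about astronomy, two about law.
docs = [["star", "galaxy"], ["star", "galaxy", "orbit"],
        ["court", "law"], ["court", "law", "judge"]]
vocab = sorted({t for doc in docs for t in doc})
vectors = [[float(doc.count(t)) for t in vocab] for doc in docs]

points = project_2d(vectors)
for doc, (x, y) in zip(docs, points):
    print(f"{x:+.2f} {y:+.2f}  {' '.join(doc)}")
```

Plotting each point as a glyph would place the two astronomy documents near each other and far from the two law documents, which is the kind of visual grouping users are expected to pick out.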
Kohonen feature maps
Kohonen's feature map algorithm has been used to create maps that
graphically characterize the overall content of a document collection
or subcollection [#!lin91!#,#!hchen98!#] (see Figure ).
The regions of the 2D map vary in size and shape corresponding to how
frequently documents assigned to the corresponding themes occur within
the collection. Regions are characterized by single words or phrases,
and adjacency of regions is meant to reflect semantic relatedness of
the themes within the collection. A cursor moved over a document
region causes the titles of the documents most strongly associated
with that region to be displayed in a pop-up window. Documents can be
associated with more than one region.
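A very small self-organizing map in the spirit of Kohonen's algorithm can illustrate how region sizes and labels might arise. This sketch makes several simplifying assumptions that are not taken from the cited systems: a tiny grid, a linearly decaying learning rate and neighborhood radius, assignment of each document to exactly one best-matching cell (whereas the maps described above allow more than one region per document), and labeling a cell by its heaviest term. All names and data are invented.

```python
import math
import random
from collections import Counter

def best_unit(weights, v):
    """Grid cell whose weight vector is closest to v (squared Euclidean)."""
    return min(weights, key=lambda cell: sum((a - b) ** 2
                                             for a, b in zip(weights[cell], v)))

def train_som(vectors, rows, cols, epochs=50, seed=0):
    rng = random.Random(seed)
    d = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(d)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                   # decays towards 0
        rate = 0.5 * frac
        radius = max(rows, cols) / 2.0 * frac + 0.5
        for v in vectors:
            bmu = best_unit(weights, v)
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2 * radius * radius))
                for j in range(d):                    # pull cell towards v
                    w[j] += rate * influence * (v[j] - w[j])
    return weights

vocab = ["court", "galaxy", "law", "star"]
vectors = [
    [0.0, 1.0, 0.0, 1.0],  # astronomy
    [0.0, 1.0, 0.0, 1.0],  # astronomy (same terms as the first document)
    [1.0, 0.0, 1.0, 0.0],  # law
    [1.0, 0.0, 0.0, 0.0],  # law
]
weights = train_som(vectors, rows=2, cols=2)

# Region size = number of documents whose best-matching cell is that region.
regions = Counter(best_unit(weights, v) for v in vectors)
for cell, size in sorted(regions.items()):
    label = vocab[max(range(len(vocab)), key=lambda j: weights[cell][j])]
    print(cell, size, label)
```

Cells that attract many documents correspond to larger regions on the displayed map, and neighboring cells end up with similar weight vectors because each update also moves the cells around the best-matching one.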