Modern Information Retrieval
Chapter 10: User Interfaces and Visualization
Yet another problem with Boolean queries is that their strict interpretation tends to yield result sets that are either too large, because the user includes many terms in a disjunct, or are empty, because the user conjoins terms in an effort to reduce the result set. This problem occurs in large part because the user does not know the contents of the collection or the role of terms within the collection.
A common strategy for dealing with this problem, employed in systems with command-line-based interfaces like DIALOG's, is to create a series of short queries, view the number of documents returned for each, and combine those queries that produce a reasonable number of results. For example, in DIALOG, each query produces a resulting set of documents that is assigned an identifying name. Rather than returning a list of titles themselves, DIALOG shows the set number with a listing of the number of matched documents. Titles can be shown by specifying the set number and issuing a command to show the titles. Document sets that are not empty can be referred to by a set name and combined with AND operations to produce new sets. If this set in turn is too small, the user can back up and try a different combination of sets, and this process is repeated in pursuit of producing a reasonably sized document set.
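The set-building process just described can be illustrated with a small sketch. It is a toy model only: the class `SetSession`, its method names, and the sample documents are hypothetical and do not reproduce DIALOG's actual command language. The point is simply that each query reports a set name and a count, and that earlier sets can be intersected to form new ones.

```python
class SetSession:
    """Toy model of a set-based search session (not DIALOG's real commands)."""

    def __init__(self, docs):
        self.docs = docs    # doc id -> set of lowercase terms
        self.sets = {}      # set name -> set of matching doc ids

    def _store(self, ids):
        name = f"S{len(self.sets) + 1}"
        self.sets[name] = ids
        return name, len(ids)   # report only the set name and its size

    def select(self, term):
        """Run a one-term query and store the result as a new named set."""
        return self._store({d for d, terms in self.docs.items() if term in terms})

    def combine_and(self, *names):
        """Intersect previously created sets into a new named set."""
        return self._store(set.intersection(*(self.sets[n] for n in names)))


# Hypothetical three-document collection.
sample_docs = {
    1: {"osteoporosis", "drugs", "prevention"},
    2: {"osteoporosis", "diet"},
    3: {"drugs", "prevention"},
}
session = SetSession(sample_docs)
s1 = session.select("osteoporosis")   # ('S1', 2)
s2 = session.select("drugs")          # ('S2', 2)
s3 = session.combine_and("S1", "S2")  # ('S3', 1)
```

If the combined set turns out to be too small, the earlier sets are still stored under their names, so the user can try a different combination without rerunning the individual queries.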
This kind of query formulation is often called a faceted query, to indicate that the user's query is divided into topics or facets, each of which should be present in the retrieved documents [meadow89, harter86]. For example, a query on drugs for the prevention of osteoporosis might consist of three facets, indicated by the disjuncts
(osteoporosis OR 'bone loss')
(drugs OR pharmaceuticals)
(prevention OR cure)
This query implies that the user would like to view documents that contain all three topics.
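Read strictly, the faceted query is a conjunction of the three disjuncts. A minimal sketch of that interpretation follows; the function name `matches_all_facets` is hypothetical, and plain substring matching stands in for whatever term matching a real system would use.

```python
# Each facet is a set of alternative terms (a disjunct).
facets = [
    {"osteoporosis", "bone loss"},
    {"drugs", "pharmaceuticals"},
    {"prevention", "cure"},
]

def matches_all_facets(text, facets):
    """True if the text contains at least one term from every facet.

    Substring matching is a simplification made for this sketch.
    """
    text = text.lower()
    return all(any(term in text for term in facet) for facet in facets)

hit = matches_all_facets("New drugs for the prevention of bone loss", facets)
miss = matches_all_facets("Bone loss and diet", facets)
```

Here `hit` is `True` because every facet contributes a term, while `miss` is `False` because the second and third facets are absent.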
One technique for imposing an ordering on the results of Boolean queries is known as post-coordinate or quorum-level ranking [salton89, Ch. 8]. In this approach, documents are ranked according to the number of distinct query terms they contain, not the number of times any one term occurs. So given a query consisting of 'cats', 'dogs', 'fish', and 'mice', the system would rank a document with at least one instance each of 'cats', 'dogs', and 'fish' higher than a document containing 30 occurrences of 'cats' but no occurrences of the other terms.
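The cats-and-dogs example can be worked through in a few lines. This is a sketch under simple assumptions: documents are given as lists of tokens, and the function name `quorum_rank` and the two sample documents are made up for illustration.

```python
def quorum_rank(query_terms, docs):
    """Order doc ids by how many distinct query terms each document contains."""
    def level(doc_id):
        # Converting to sets discards term frequency on purpose.
        return len(set(query_terms) & set(docs[doc_id]))
    return sorted(docs, key=level, reverse=True)

pet_docs = {
    "a": ["cats"] * 30,             # 30 occurrences of a single query term
    "b": ["cats", "dogs", "fish"],  # one occurrence each of three query terms
}
ranking = quorum_rank(["cats", "dogs", "fish", "mice"], pet_docs)  # ['b', 'a']
```

Document "b" matches three of the four query terms and so precedes document "a", which matches only one term despite repeating it 30 times.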
Combining faceted queries with quorum ranking yields a situation intermediate between full Boolean syntax and free-form natural language queries. An interface for specifying this kind of interaction can consist of a list of entry lines. The user enters one topic per entry line, where each topic consists of a list of semantically related terms that are combined in a disjunct. Documents that contain at least one term from each facet are ranked higher than documents containing terms only from one or a few facets. This helps ensure that documents which contain discussions of several of the user's topics are ranked higher than those that contain only one topic. By only requiring that one term from each facet be matched, the user can specify the same concept in several different ways in the hopes of increasing the likelihood of a match. If combined with graphical feedback about which subsets of terms matched the document, the user can see the results of a quorum ranking by topic rather than by word. Section describes the TileBars interface which provides this type of feedback.
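The combination of facets and quorum ranking might be sketched as follows. The helper names `facet_hits` and `rank_by_facets` are hypothetical, documents are assumed to be pre-segmented into terms and phrases, and the per-facet list of booleans is only a crude stand-in for the graphical feedback mentioned above.

```python
def facet_hits(doc_terms, facets):
    """Return one boolean per facet: does the document contain any of its terms?"""
    terms = set(doc_terms)
    return [bool(terms & set(facet)) for facet in facets]

def rank_by_facets(docs, facets):
    """Rank doc ids by the number of facets they cover, most first."""
    return sorted(docs, key=lambda d: sum(facet_hits(docs[d], facets)), reverse=True)

# One entry line per topic; the terms on a line form a disjunct.
entry_lines = [
    ["osteoporosis", "bone loss"],
    ["drugs", "pharmaceuticals"],
    ["prevention", "cure"],
]
facet_docs = {
    "d3": ["osteoporosis"],
    "d2": ["osteoporosis", "drugs"],
    "d1": ["bone loss", "pharmaceuticals", "cure"],
}
order = rank_by_facets(facet_docs, entry_lines)     # ['d1', 'd2', 'd3']
hits_d2 = facet_hits(facet_docs["d2"], entry_lines)  # [True, True, False]
```

Note that "d1" ranks first even though it shares no literal term with "d2" or "d3": it covers all three facets through alternative terms, which is the benefit of letting the user state each concept in several ways.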
This idea can be extended yet another step by allowing users to weight each facet. More likely to be readily usable, however, is a default weighting in which the facet listed highest is assigned the most weight, the second facet is assigned less weight, and so on, according to some distribution over weights.
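The text leaves the weight distribution open. As one possible illustration, the sketch below assumes a geometric decay, normalized to sum to 1; the decay factor of 0.5 and the function names `default_weights` and `weighted_score` are assumptions made for this example, not part of any described system.

```python
def default_weights(n, decay=0.5):
    """Decreasing weights for n facets in listed order (assumed geometric decay)."""
    raw = [decay ** i for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_score(hits, weights):
    """Sum the weights of the facets a document matched."""
    return sum(w for hit, w in zip(hits, weights) if hit)

weights = default_weights(3)
first_only = weighted_score([True, False, False], weights)
last_two = weighted_score([False, True, True], weights)
```

With this particular decay, a document matching only the first facet scores higher than one matching the second and third facets together. Whether that behavior is desirable depends on the distribution chosen, which is one reason the choice of default matters.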