Modern Information Retrieval Chapter 1: Introduction
At this point, we are ready to detail our view of the retrieval process. Such a process is interpreted in terms of component subprocesses whose study yields many of the chapters in this book.
To describe the retrieval process, we use a simple and generic software architecture as shown in Figure . Before the retrieval process can even be initiated, it is necessary to define the text database. This is usually done by the manager of the database, who specifies the following: (a) the documents to be used, (b) the operations to be performed on the text, and (c) the text model (i.e., the text structure and what elements can be retrieved). The text operations transform the original documents and generate a logical view of them.
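As a rough illustration of how text operations might produce a logical view, the sketch below lowercases a document, splits it into tokens, and drops a few common words. The function name `text_operations` and the tiny stopword list are assumptions made for this example only; the specific operations a real system applies are chosen by the database manager.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}

def text_operations(text):
    """Lowercase the text, extract alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

logical_view = text_operations("The Index of the Text is built by the DB Manager.")
print(logical_view)  # ['index', 'text', 'built', 'by', 'db', 'manager']
```

The resulting list of terms is one possible logical view of the document; richer views could also keep structure such as titles or sections.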
Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text. An index is a critical data structure because it allows fast searching over large volumes of data. Different index structures might be used, but the most popular one is the inverted file, as indicated in Figure . The resources (time and storage space) spent on defining the text database and building the index are amortized by querying the retrieval system many times.
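To make the idea of an inverted file concrete, here is a minimal sketch that maps each term to the list of documents containing it. The function name `build_inverted_index`, the toy documents, and the whitespace tokenization are assumptions for illustration; production indexes also store positions, frequencies, and compressed postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of text",
}
index = build_inverted_index(docs)
print(index["systems"])    # [1, 2]
print(index["retrieval"])  # [1, 3]
```

Building the index costs one pass over every document, but each later lookup is a single dictionary access, which is the amortization the text describes.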
Once the document database is indexed, the retrieval process can be initiated. The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text. Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated. The query is then processed to obtain the retrieved documents. Fast query processing is made possible by the index structure previously built.
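The following sketch shows one way these steps could fit together: the user need passes through the same normalization as the documents, and the resulting terms are looked up in the index. The names `normalize`, `build_index`, and `process_query` are hypothetical, and treating the query as a conjunction of terms is an assumption of this example, not something the text prescribes.

```python
from collections import defaultdict

def normalize(text):
    # Stand-in for the text operations: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Build a term -> set of document ids mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

def process_query(user_need, index):
    """Return ids of documents containing every query term."""
    terms = normalize(user_need)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {1: "Fast query processing", 2: "Query operations", 3: "Index structure"}
index = build_index(docs)
print(sorted(process_query("QUERY processing", index)))  # [1]
```

Because both sides use `normalize`, the uppercase "QUERY" typed by the user still matches the indexed term "query".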
Before being sent to the user, the retrieved documents are ranked according to a likelihood of relevance. The user then examines the set of ranked documents in search of useful information. At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle. In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need.
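The ranking and feedback cycle can be sketched as follows. This is a deliberately naive stand-in: documents are scored by counting query-term occurrences, and feedback simply appends every term from the selected documents to the query. Both choices, along with the names `rank` and `feedback` and the toy collection, are assumptions for illustration and not the ranking or feedback methods a real system would use.

```python
from collections import Counter

def rank(query_terms, docs):
    """Order document ids by how many query-term occurrences they contain."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[t] for t in query_terms)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def feedback(query_terms, selected_ids, docs):
    """Expand the query with every term from the user-selected documents."""
    expanded = list(query_terms)
    for doc_id in selected_ids:
        for term in docs[doc_id].lower().split():
            if term not in expanded:
                expanded.append(term)
    return expanded

docs = {
    1: "jaguar car engine",
    2: "jaguar cat habitat",
    3: "wild cat habitat",
}
query = ["jaguar"]
print(rank(query, docs))            # [1, 2, 3]
query = feedback(query, [2], docs)  # user marks document 2 as relevant
print(query)                        # ['jaguar', 'cat', 'habitat']
print(rank(query, docs))            # [2, 3, 1]
```

After feedback, the document about cars drops to the bottom, which suggests the modified query is closer to what this hypothetical user wanted.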
The small numbers outside the lower right corner of various boxes in Figure indicate the chapters in this book which discuss the respective subprocesses in detail. A brief introduction to each of these chapters can be found in section .
Consider now the user interfaces available with current information retrieval systems (including Web search engines and Web browsers). We first notice that the user almost never declares his information need. Instead, he is required to provide a direct representation for the query that the system will execute. Since most users have no knowledge of text and query operations, the query they provide is frequently inadequate. Therefore, it is not surprising to observe that poorly formulated queries lead to poor retrieval (as happens so often on the Web).