An ever-increasing amount of data is being produced from various data
sources, and this data must be organised effectively if we hope to search through it.
Traditional information retrieval approaches search through all available data in a particular
collection in order to find the most suitable results; however, for particularly large
collections this may be extremely time-consuming.
Our proposed solution to this problem is to search only a limited portion of the
collection at query time, in order to speed up the retrieval process, while limiting
the loss in retrieval efficacy (in terms of accuracy of results). To do this, we first
identify the most “important” documents within the collection, and then sort the
documents in the collection in order of their “importance”. We can then limit the
amount of information to search through by eliminating the documents of lesser
importance, which should make the search more efficient while limiting any loss in
retrieval accuracy.
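As a minimal sketch of this idea, and not the implementation evaluated in this thesis, the pruning step can be illustrated in Python as follows. The function names, the keep_fraction parameter, the importance scores and the simple term-count scoring are all illustrative assumptions.

```python
def prune_by_importance(importance, keep_fraction):
    """Return the IDs of the most important documents, most important first.

    `importance` maps a document ID to a query-independent score
    (hypothetical values; any static measure could be substituted).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort by descending importance; break ties by document ID.
    ranked = sorted(importance, key=lambda d: (-importance[d], d))
    keep = max(1, int(len(ranked) * keep_fraction)) if ranked else 0
    return ranked[:keep]


def search(query_terms, documents, doc_ids):
    """Rank only the documents in `doc_ids` by a toy term-count score."""
    scores = {}
    for doc_id in doc_ids:
        tokens = documents[doc_id].lower().split()
        score = sum(tokens.count(term.lower()) for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy collection with assumed importance scores.
documents = {
    "d1": "terabyte retrieval track",
    "d2": "retrieval of cats",
    "d3": "cats and dogs",
    "d4": "retrieval retrieval spam",
}
importance = {"d1": 0.9, "d2": 0.7, "d3": 0.4, "d4": 0.1}

# Search only the most important half of the collection.
kept = prune_by_importance(importance, keep_fraction=0.5)
pruned_results = search(["retrieval"], documents, kept)
```

In this toy example only half of the documents are scored at query time. The low-importance document d4 is never examined, which illustrates both the saving in work and the way results can change when the importance measure does not reflect what a query actually needs.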
In this thesis we investigate several query-independent measures that may
indicate the importance of a document in a collection. The more accurately a measure
identifies important documents, the more effectively we can eliminate documents
from the retrieval process, improving the query throughput of the system while
maintaining a high level of accuracy in the returned results. The effectiveness of these
approaches is evaluated using the datasets provided by the Terabyte Track at the Text
REtrieval Conference (TREC).