Document-Oriented Pruning of the Inverted Index in Information Retrieval Systems

Zheng, L; Cox, IJ; (2009) Document-Oriented Pruning of the Inverted Index in Information Retrieval Systems. In: 2009 INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS WORKSHOPS: WAINA, VOLS 1 AND 2. (pp. 697 - 702). IEEE

Searching very large collections can be costly in both computation and storage. To reduce this cost, recent research has focused on reducing the size (pruning) of the inverted index. The inverted index represents a table, the rows and columns of which are terms in the lexicon and documents in the collection, respectively. A non-zero entry in the table, known as a posting, indicates that the corresponding document contains the term. Previous research on static index pruning was either (i) posting-oriented, in which less important postings are removed from the table, or (ii) term-oriented, in which less important terms are removed from the table. In this paper we investigate a new, document-oriented pruning strategy that removes entire columns of the table, i.e. removes less important documents from the collection. Three methods for estimating the importance of a document are proposed. Methods 1 and 2 depend on the score function of the retrieval system (e.g. Okapi BM25), while Method 3 is independent of the retrieval system. Experiments compare the three proposed methods with Carmel et al.'s posting-oriented approach, using both the FT and LA Times collections and both ordinary and difficult queries. Based on mean average precision and precision at 10, the results show that Method 3 generally performs best on the FT collection for pruned indexes down to 35% of the original size. However, for more severe pruning, Carmel et al.'s algorithm is better. For the LA Times collection, the performance of Method 3 and that of Carmel et al. are reversed. This variation in performance across collections has not been previously reported.
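
As a rough illustration of the table view described in the abstract, the following minimal Python sketch builds a toy inverted index and then prunes it document by document, dropping every posting that belongs to a low-importance document. It is not the paper's implementation: the function names, the `keep_fraction` parameter, and the hand-assigned importance scores are all hypothetical, and the abstract does not say how Methods 1 to 3 compute importance.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def prune_documents(index, importance, keep_fraction):
    """Document-oriented pruning: remove all postings of the least important documents.

    `importance` maps doc id -> score (higher means more important).
    `keep_fraction` is the share of documents to retain (hypothetical parameter).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    kept = set(ranked[:n_keep])

    pruned = {}
    for term, postings in index.items():
        remaining = postings & kept
        if remaining:  # a term whose postings all vanish leaves the index too
            pruned[term] = remaining
    return pruned


docs = {
    "d1": "index pruning reduces index size",
    "d2": "the weather is fine",
    "d3": "pruning the inverted index",
}

# Hand-assigned placeholder scores, NOT one of the paper's three methods.
importance = {"d1": 0.9, "d2": 0.1, "d3": 0.6}

index = build_inverted_index(docs)
pruned = prune_documents(index, importance, keep_fraction=2 / 3)
```

In this toy setting, removing a document deletes a whole column of the table at once, which is what distinguishes the document-oriented strategy from removing individual postings or whole terms (rows).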

Type:Proceedings paper
Title:Document-Oriented Pruning of the Inverted Index in Information Retrieval Systems
Event:23rd International Conference on Advanced Information Networking and Applications Workshops
Location:Bradford, ENGLAND
Dates:2009-05-26 - 2009-05-29
