Lossy index compression.
Doctoral thesis, UCL (University College London).
This thesis primarily investigates lossy compression of an inverted index. Two approaches of lossy compression are studied in detail, i.e. (i) term frequency quantization, and (ii) document pruning. In addition, a technique for document pruning, i.e. the entropy-based method, is applied to re-rank retrieved documents as query-independent knowledge. Based on the quantization theory, we examine how the number of quantization levels for coding the term frequencies affects retrieval performance. Three methods are then proposed for the purpose of reducing the quantization distortion, including (i) a non-uniform quantizer; (ii) an iterative technique; and (iii) term-specific quantizers. Experiments based on standard TREC test sets demonstrate that nearly no degradation of retrieval performance can be achieved by allocating only 2 or 3 bits for the quantized term frequency values. This is comparable to lossless coding techniques such as unary, γ and δ-codes. Furthermore, if lossless coding is applied to the quantized term frequency values, then around 26% (or 12%) savings can be achieved over lossless coding alone, with less than 2.5% (or no measurable) degradation in retrieval performance. Prior work on index pruning considered posting pruning and term pruning. In this thesis, an alternative pruning approach, i.e. document pruning, is investigated, in which unimportant documents are removed from the document collection. Four algorithms for scoring document importance are described, two of which are dependent on the score function of the retrieval system, while the other two are independent of the retrieval system. Experimental results suggest that document pruning is comparable to existing pruning approaches, such as posting pruning. Note that document pruning affects the global statistics of the indexed collection. We therefore examine whether retrieval performance is superior based on statistics derived from the full or the pruned collection. Our results indicate that keeping statistics derived from the full collection performs slightly better. Document pruning scores documents and then discards those that fall outside a threshold. An alternative is to re-rank documents based on these scores. The entropy-based score, which is independent of the retrieval system, provides a query-independent knowledge of document specificity, analogous to PageRank. We investigate the utility of document specificity in the context of Intranet search, where hypertext information is sparse or absent. Our results are comparable to the previous algorithm that induced a graph link structure based on the measure of similarity between documents. However, a further analysis indicates that our method is superior on computational complexity.
|Title:||Lossy index compression|
|Additional information:||Permission for digitisation not received|
|UCL classification:||UCL > School of BEAMS > Faculty of Engineering Science > Computer Science|
Archive Staff Only