eprintid: 10057861
rev_number: 25
eprint_status: archive
userid: 608
dir: disk0/10/05/78/61
datestamp: 2018-10-09 09:46:03
lastmod: 2021-09-17 23:01:12
status_changed: 2019-02-04 11:34:35
type: article
metadata_visibility: show
creators_name: Lipani, A
creators_name: Roelleke, T
creators_name: Lupu, M
creators_name: Hanbury, A
title: A systematic approach to normalization in probabilistic models
ispublished: pub
divisions: UCL
divisions: B04
divisions: C05
divisions: F44
keywords: Verboseness hypothesis, TF normalization, Smoothing
note: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
abstract: Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the chosen TF quantification function and by its TF normalization. The former defines how independent the occurrences of multiple terms are, while the latter mitigates the a priori probability of observing a high term frequency in a document (an estimate usually based on document length). New test collections from different domains (e.g. medical, legal) give evidence that not only document length but also the verboseness of documents should be explicitly considered. We therefore propose and investigate a systematic combination of document verboseness and length. To theoretically justify this combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections, across a well-defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes yielding statistically significantly better results, at no additional computational cost.
date: 2018-12
date_type: published
official_url: https://doi.org/10.1007/s10791-018-9334-1
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 1588980
doi: 10.1007/s10791-018-9334-1
lyricists_name: Lipani, Aldo
lyricists_id: ALIPA33
actors_name: Stacey, Thomas
actors_id: TSSTA20
actors_role: owner
full_text_status: public
publication: Information Retrieval Journal
volume: 21
number: 6
pagerange: 565-566
issn: 1573-7659
citation: Lipani, A; Roelleke, T; Lupu, M; Hanbury, A; (2018) A systematic approach to normalization in probabilistic models. Information Retrieval Journal, 21 (6) pp. 565-566. 10.1007/s10791-018-9334-1 <https://doi.org/10.1007/s10791-018-9334-1>. Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10057861/1/Lipani2018_Article_ASystematicApproachToNormaliza.pdf