A Concept Language Model for Ad-hoc Retrieval

We propose an extension to language models for information retrieval. Typically, language models estimate the probability of a document generating the query, where the query is treated as a set of independent search terms. We extend this approach by considering the concepts implied by both the query and the words in the document. The model combines the probability of the document generating the concept embodied by the query with the traditional language model probability of the document generating the query terms. We use a word embedding space to express concepts. The similarity between two vectors in this space is estimated using a weighted cosine similarity. The weighting significantly enhances the discrimination between vectors. We evaluate our model on benchmark datasets (TREC 6-8) and empirically demonstrate that it outperforms state-of-the-art baselines.


INTRODUCTION
A core task in information retrieval is to judge whether documents are relevant to a given query. The traditional Query Likelihood (QL) language model assumes that both queries and documents are bags of words, and ranks documents according to the likelihood of observing the query given the document's language model [5]. When a query term is not present in a document, smoothing strategies based on collection-wide term statistics are applied. Improvements to these smoothing strategies incorporate topic modelling techniques. In early topic models, document-specific language models were produced by projecting both queries and documents into the same latent semantic space, so that different words that are semantically close can be easily identified [2,8]. However, these approaches rely on a predefined number of topics.
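As an illustration of the QL ranking described above, the following minimal Python sketch scores a document by its log query likelihood. It assumes Dirichlet smoothing, which is one common collection-based strategy and the one adopted later in this paper. The function name `ql_log_score`, its arguments and the toy data are hypothetical and are not taken from the original implementation.

```python
import math
from collections import Counter


def ql_log_score(query_terms, doc_terms, collection_tf, collection_len, mu=1500.0):
    """Log query likelihood of a document under Dirichlet smoothing.

    collection_tf maps each term to its frequency in the whole collection,
    and collection_len is the total number of tokens in the collection.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        # Background (collection) probability of the term.
        p_coll = collection_tf.get(t, 0) / collection_len
        # Dirichlet-smoothed estimate of p(t | d).
        p = (doc_tf[t] + mu * p_coll) / (doc_len + mu)
        if p == 0.0:
            # The term never occurs in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection of two short documents (illustrative only).
docs = {"d1": ["cat", "sat", "mat"], "d2": ["dog", "sat", "log"]}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())
query = ["cat", "sat"]
ranking = sorted(
    docs,
    key=lambda d: ql_log_score(query, docs[d], collection_tf, collection_len),
    reverse=True,
)
```

Because "cat" occurs only in `d1`, that document is ranked first, while `d2` still receives a non-zero probability for "cat" through the collection statistics.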
More recently, distributed representations of words, or word embeddings, have been used to capture various latent language characteristics, such as syntax, topics, semantics and spelling [3]. Language models have incorporated word embeddings in order to improve retrieval precision [1,4,6,9]. In this paper, we propose a Concept Language Model (CLM) that uses word embeddings. The model considers both (i) the probability of the concept embodied by the query, $c_q$, given the concept(s) embodied in the document, $c_d$, and (ii) the traditional language model probability of the document $d$ generating the query $q$. This latter probability also incorporates a word embedding that serves the function of term expansion.

CONCEPT LANGUAGE MODEL
At the highest level, CLM is formulated as in Eq. 1, where the probability of a document $d$ generating a query $q$ is a weighted combination of (i) the probability that the concept $c_q$ embodied by $q$ is generated by the concept(s) $c_d$ of the document, and (ii) the probability $\hat{p}(t|d)$ of the document generating the individual terms $t$. The parameter β controls the relative weight of the two components. We now describe each component in detail. Note that the hat symbol on $\hat{p}$ is used to stress that the model is estimated.

A query $q$ consists of a number of terms, $t_1, \ldots, t_n$. These terms have associated concepts $c_{t_1}, \ldots, c_{t_n}$. Assuming independence of terms and concepts, the traditional QL model considers the probability of a document generating each term. If the term is absent from the document, smoothing based on the collection statistics is used. More recently, by incorporating the concept implied by the query term and inferred from the embedding space, a form of term (query) expansion has also been achieved [1,4,6,9].

We assume that the concept(s) embodied within a document is the sum of the concepts expressed by the individual words in the document. Thus, the probability of $d$ generating term $t$ is based not only on the empirical frequency of $t$ in $d$, but also on the probability $p(c_t|c_d)$ that the corresponding concept $c_t$ is generated by the document concept $c_d$. Under the word independence assumption, this latter probability is approximated by the normalized sum of the individual probabilities $\hat{p}(c_t|c_w)$ of each concept $c_w$, implied by each word $w$ in $d$, generating concept $c_t$. Thus, using Dirichlet smoothing, we obtain the estimate $\hat{p}(t|d)$, where $tf_d$ and $tf_D$ are the term frequencies in the document and collection respectively. The product of the individual probabilities of a document generating each term provides a final probability of a document generating the query. We supplement this with the first term in Eq. 1.
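The weighted combination described above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the text says that β balances the two components but not which component it multiplies, so the sketch assumes β weights the concept-level probability and (1 - β) the term-level product. The function name `clm_score` and its arguments are hypothetical.

```python
import math


def clm_score(concept_prob, term_probs, beta=0.7):
    """Combine a concept-level and a term-level probability.

    concept_prob: estimated probability of the query concept given the
        document concept(s).
    term_probs: estimated probabilities of the document generating each
        individual query term.
    ASSUMPTION: beta multiplies the concept component and (1 - beta) the
    term component; the source only states that beta sets their relative
    weight.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # Term independence: the query-level probability is a product over terms.
    term_likelihood = math.prod(term_probs)
    return beta * concept_prob + (1.0 - beta) * term_likelihood
```

With β = 0 the score reduces to the term-level query likelihood, and with β = 1 it depends on the concept component alone. For long queries the term product can underflow, so a practical implementation would probably accumulate it in log space.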
We assume that the concept embodied by the query is the product of the concepts embodied by the individual terms, i.e. $c_q$ is equivalent to $c_{t_1} \times \ldots \times c_{t_n}$. This is simply a generalization of term independence to concept independence. Similarly, we assume that the concept(s) embodied by a document is the sum of the concepts embodied by the individual words.

To estimate the probability $\hat{p}(c_t|c_w)$, we built an embedding space as discussed shortly. The probability $\hat{p}(c_t|c_w)$ is then given by

$$\hat{p}(c_t|c_w) = \frac{sim(c_t, c_w)}{\sum_{w' \in N_t} sim(c_t, c_{w'})}$$

where the denominator serves to normalise the probability value between 0 and 1, and $N_t$ is a neighbourhood of the closest words to $c_t$. The similarity function is

$$sim(c_t, c_w) = \frac{\cos(c_t, c_w)}{\theta^{r(c_t, c_w)}}$$

where $\cos(c_t, c_w)$ is the cosine similarity between the two vectors $c_t$ and $c_w$ in the embedding space, transformed to the interval [0, 1] via $(x + 1)/2$ to avoid negative sub-scores. The denominator is a decay function, the purpose of which is described next.

The cosine function is often used to capture the semantic relationship between embedding vectors, but it cannot be directly used in ad-hoc retrieval. This is because, as shown in Figure 1, there are no substantial differences between the cosine similarity of the most similar term and that of the 50th term. To enhance discrimination, we define a monotonic decay function. Given $t$ and the collection vocabulary $V$, we rank all words in $V$ based on their cosine similarities to $t$. We denote the rank of $c_w$ with respect to $c_t$ as $r(c_t, c_w)$. The term θ > 1 is a constant decay factor. In practice, we only consider the 50 nearest neighbours of query term $t$ in $V$ and denote them as $N_t$. Since we have a fixed vocabulary, most computations, such as cosine similarities, can be pre-computed. As the red line in Figure 1 shows, once the decay is applied the similarities between $t$ and the words in $V$ can be easily discriminated. Note that we also tried a sigmoid function, but it provided worse performance.
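The rank-based decay and the neighbourhood normalisation described above can be sketched as follows. This is a minimal illustration, not the original implementation. It assumes that ranks start at 1 for the closest word, that the query term itself is excluded from its own neighbourhood, and that all vectors are non-zero; the source does not state any of these. The names `cosine01`, `decayed_neighbourhood` and `concept_prob` and the toy vectors are hypothetical.

```python
import math


def cosine01(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1] via (x + 1) / 2.

    Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (dot / norm + 1.0) / 2.0


def decayed_neighbourhood(term, vectors, theta=3.0, k=50):
    """Return {word: sim} for the k nearest neighbours of `term`.

    sim = cosine01 / theta ** rank, with rank 1 for the closest word
    (ASSUMPTION: the starting rank is not given in the source).
    """
    cos = {w: cosine01(vectors[term], vec) for w, vec in vectors.items() if w != term}
    ranked = sorted(cos, key=cos.get, reverse=True)[:k]
    return {w: cos[w] / theta ** r for r, w in enumerate(ranked, start=1)}


def concept_prob(word, sims):
    """Normalise a decayed similarity over the neighbourhood."""
    total = sum(sims.values())
    return sims.get(word, 0.0) / total if total > 0 else 0.0


# Toy 2-dimensional embeddings (illustrative only).
vectors = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "truck": [0.7, 0.3],
    "banana": [0.0, 1.0],
}
sims = decayed_neighbourhood("car", vectors, theta=3.0, k=50)
probs = {w: concept_prob(w, sims) for w in sims}
```

Since the vocabulary is fixed, the dictionary returned by `decayed_neighbourhood` could be computed once per term and cached, in line with the pre-computation noted above. Dividing by θ raised to the rank separates neighbours whose raw cosine values are nearly identical.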

EXPERIMENTS
To evaluate CLM, we use the TREC collection (Disks 4-5) from the ad-hoc retrieval tracks of TREC 6, 7 and 8, which together provide 150 queries. To build the index of the collection, we apply tokenization, stemming, and stop-word removal. We evaluate performance using MAP and precision at 10 and 20. Statistical significance of observed differences between two models is assessed using a two-tailed paired t-test, with the significance level set to p < 0.05.
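For reference, the evaluation measures named above can be computed as in the following sketch, which uses the standard definitions of precision at k and (mean) average precision. The function names and the toy run are hypothetical; in practice a tool such as trec_eval would normally be used.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Sum of the precision values at the rank of each relevant document
    retrieved, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: {query_id: ranked doc ids}; qrels: {query_id: set of relevant ids}."""
    return sum(average_precision(runs[q], qrels[q]) for q in runs) / len(runs)


# Toy run with two queries (illustrative only).
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
map_score = mean_average_precision(runs, qrels)
```

The per-query average precision values are also the paired observations that a two-tailed paired t-test between two models would compare.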
CLM is compared to five baselines: the traditional QL model [5], LLM [8], BM25 [7], GLM [1] and EQE [9]. Word embeddings are trained using word2vec [3] on the TREC collections. The skip-gram model with negative sampling is employed. We set the window size to 10 and the dimension of the embedding vectors to 300. The Dirichlet smoothing parameter µ is set to 1,500, the decay factor θ is set to 3, and the weighting parameter β is set to 0.7. Note that optimal parameters are chosen via 2-fold cross validation.

The performance of CLM and the baselines is shown in Table 1. In terms of MAP, CLM statistically significantly outperforms all the baselines on all the datasets. In addition, as can be seen in the table, CLM, which integrates a decay function, outperforms the models that do not use one, i.e. GLM and EQE. This demonstrates the effectiveness of the proposed decay function in word embedding language models. CLM achieves an improvement of 7.60% over GLM and 5.98% over EQE in terms of MAP.
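The 2-fold cross validation used to choose parameters can be sketched as below. The source does not describe how the folds were formed or which candidate values were searched, so the split into two halves, the candidate grid and the `evaluate` callback are all assumptions made for illustration.

```python
def two_fold_select(query_ids, candidates, evaluate):
    """Select parameters on one fold and score them on the other.

    evaluate(params, query_ids) -> float is a hypothetical callback that
    runs retrieval with `params` and returns a metric such as MAP over the
    given queries.
    ASSUMPTION: folds are the first and second halves of `query_ids`.
    """
    half = len(query_ids) // 2
    folds = [query_ids[:half], query_ids[half:]]
    results = []
    for i in (0, 1):
        train, test = folds[1 - i], folds[i]
        # Pick the candidate with the best metric on the training fold.
        best = max(candidates, key=lambda p: evaluate(p, train))
        # Report its metric on the held-out fold.
        results.append((best, evaluate(best, test)))
    return results


# Toy metric that peaks at 0.7 regardless of the queries (illustrative only).
def toy_evaluate(beta, query_ids):
    return 1.0 - abs(beta - 0.7)


selected = two_fold_select(list(range(150)), [0.1, 0.3, 0.5, 0.7, 0.9], toy_evaluate)
```

Each entry of `selected` pairs the value chosen on one fold with its score on the other, so the reported metric never comes from the queries used for tuning.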

DISCUSSION AND CONCLUSION
In this paper, we proposed a concept language model using word embeddings. The model estimates the probability of a document generating the individual terms in the query, and the probability of the document's concepts generating the query concept. These two probabilities are weighted and summed. The CLM requires estimating the probability that a concept $c_w$ implied by word $w$ in the document generates a concept $c_t$ implied by a term $t$ in the query. This probability is estimated based on the cosine similarity between the two vectors $c_w$ and $c_t$ in the embedding space, normalized by a decay function that aims to make the probabilities more discriminative. The CLM was evaluated on the ad-hoc tasks of TREC 6, 7 and 8, and was shown to significantly outperform state-of-the-art baselines according to MAP. Future work will focus on learning task-specific embedding vectors. We will also consider combining our model with pseudo-relevance feedback.