    Non-Compositional Term Dependence for Information Retrieval

    Modelling term dependence in IR aims to identify co-occurring terms that depend on each other too heavily to be treated as a bag of words, and to adapt indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, on the assumption that (a) if terms co-occur often enough in some corpus, they are semantically dependent; and (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: how frequently terms co-occur can be unrelated to the strength of their semantic dependence. For example, "red tape" might be less frequent overall than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as the phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and the strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection [21] to IR. Our approach, integrated into ranking using Markov Random Fields [31], yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR.
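
    The frequency-based assumption described above can be illustrated with a minimal sketch. It is not the method of the paper: the toy corpus, the function names and the `min_count` threshold are all hypothetical, chosen only to show how a raw co-occurrence threshold can keep "tape measure" while missing the less frequent but non-compositional "red tape".

```python
from collections import Counter


def adjacent_pair_counts(docs):
    """Count how often each ordered pair of adjacent terms occurs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens.
        counts.update(zip(tokens, tokens[1:]))
    return counts


def frequency_dependent_pairs(docs, min_count=2):
    """Keep pairs whose count reaches a (hypothetical) frequency threshold."""
    counts = adjacent_pair_counts(docs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy corpus, invented for illustration only.
toy_corpus = [
    "use a tape measure to check the width",
    "the tape measure is in the drawer",
    "a tape measure helps here",
    "too much red tape slows the project",
]

pairs = frequency_dependent_pairs(toy_corpus, min_count=2)
print(pairs)
```

    In this toy setting the threshold also admits an uninformative pair such as "a tape", which is one reason frequency alone is a weak signal of semantic dependence.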

    A Comparison of Retrieval Models using Term Dependencies

    Incorporating Query Term Dependencies in Language Models for Document Retrieval

    Introduction

    Recent advances in Information Retrieval are based on using Statistical Language Models (SLM) to represent documents and to evaluate their relevance to user queries [6, 3, 4]. Language Modeling (LM) has been explored in many natural language tasks, including machine translation and speech recognition [1]. In the LM approach to document retrieval, each document, D, is viewed as having its own language model, M_D. Given a query, Q, documents are ranked by the probability, P(Q|M_D), of their language model generating the query. While the LM approach to information retrieval has been motivated from different perspectives [3, 4], most experiments have used smoothed unigram language models, which assume term independence, to estimate document language models. N-gram, and specifically bigram, language models that capture the context provided by the previous word(s) perform better than unigram models [7]. Biterm language models [8] that ignore the word order constraint in bigram lang