
    A comparative study of probabilistic and language models for information retrieval

    Language models for information retrieval have received much attention in recent years, with many claims being made about their performance. However, previous studies evaluating the language modelling approach used different query sets and heterogeneous collections, which makes the reported results difficult to compare. This research is a broad-based study that evaluates language models on a variety of search tasks: topic finding, named-page finding and topic distillation. The standard Text REtrieval Conference (TREC) methodology is used to compare language models to the probabilistic Okapi BM25 system. Using consistent parameter choices, we compare the results of different language models on three search tasks, multiple query sets and three text collections. For ad hoc retrieval, the Dirichlet smoothing method was found to be significantly better than Okapi BM25, but for named-page finding Okapi BM25 was more effective than the language modelling methods. Optimal smoothing parameters for each method were found to depend on the collection and the query set. Longer queries required more aggressive smoothing, but the language modelling approaches were more effective with them than with shorter queries. The choice of smoothing method was also found to have a significant effect on the performance of language models for information retrieval.
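
    As a point of reference, the two scoring functions compared above can be sketched in a few lines of Python. This is a minimal illustration with toy collection statistics; the parameter values (mu, k1, b) are conventional defaults, not the settings tuned in the study.

        import math

        def dirichlet_score(tf, doc_len, coll_prob, mu=2000.0):
            # Dirichlet-smoothed query likelihood for one term:
            # p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)
            return math.log((tf + mu * coll_prob) / (doc_len + mu))

        def bm25_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
            # Standard Okapi BM25 term weight with idf and length normalization.
            idf = math.log((num_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
            return idf * (tf * (k1 + 1.0)) / norm

        # Toy statistics: term occurs 3 times in a 500-word document and in
        # 1,000 of 100,000 documents; collection probability 1e-4.
        print(dirichlet_score(3, 500, 1e-4))
        print(bm25_score(3, 500, 420.0, 1000, 100000))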

    Language Models and Smoothing Methods for Information Retrieval

    Najeeb A. Abdulmutalib. Abstract of the dissertation: Designing an effective retrieval model that can rank documents accurately for a given query has been a central problem in information retrieval for several decades. An optimal retrieval model that is both effective and efficient, and that can learn from feedback information over time, is needed. Language models are a new generation of retrieval models and have been applied over the last ten years to many different information retrieval problems. Compared with traditional models such as the vector space model, they can be more easily adapted to model non-traditional and complex retrieval problems, and empirically they tend to achieve comparable or better performance. Developing new language models is currently an active research area in information retrieval. In the first stage of this thesis we present a new language model based on an odds formula, which explicitly incorporates document length as a parameter. To address the problem of data sparsity, where there is rarely enough data to accurately estimate the parameters of a language model, smoothing gives a way to combine less specific but more accurate information with more specific but noisier data. We introduce a new smoothing method called exponential smoothing, which can be combined with most language models. We present experimental results for various language models and smoothing methods on a collection with large document length variation, and show that our new methods compare favourably with the best approaches known so far.
    We also discuss the effect of the collection on the retrieval function, investigating the performance of well-known models on two variant collections. In the second stage we extend the model from flat text retrieval to XML retrieval, since there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Compared to traditional information retrieval, where whole documents are usually indexed and retrieved as single complete units, retrieval from XML documents creates additional challenges: by exploiting the logical document structure, XML allows for more focussed retrieval that identifies elements, rather than whole documents, as answers to user queries. Finally, we show how smoothing plays a role very similar to that of the idf function: besides its obvious role, smoothing also improves the accuracy of the estimated language model. The within-document frequency and the collection frequency of a term both influence the probability of relevance, which led us to a new class of smoothing functions based on numeric prediction, which we call empirical smoothing. Its retrieval quality outperforms that of other smoothing methods.
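
    The dissertation's odds-based formula and its exponential smoothing method are not reproduced here, but the general shape of length-sensitive smoothing can be sketched as follows. This is an illustrative Python fragment only; the particular length-dependent mixing weight is a hypothetical choice for exposition, not the method defined in the thesis.

        import math

        def smoothed_prob(tf, doc_len, coll_prob, lam_fn):
            # Interpolate the document's maximum-likelihood estimate with the
            # collection model, letting the mixing weight depend on length.
            lam = lam_fn(doc_len)
            mle = tf / doc_len if doc_len > 0 else 0.0
            return lam * mle + (1.0 - lam) * coll_prob

        # One possible length-dependent weight (hypothetical): longer documents
        # yield a better-estimated MLE, so they receive a higher weight.
        lam = lambda n: 1.0 - math.exp(-n / 1000.0)
        print(smoothed_prob(3, 500, 1e-4, lam))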

    Discrete language models for video retrieval

    Finding relevant video content is important for producers of television news, documentaries and commercials. As digital video collections become more widely available, content-based video retrieval tools will likely grow in importance for an even wider group of users. In this thesis we investigate language modelling approaches, which have been the focus of recent attention within the text information retrieval community, for the video search task. Language models are smoothed discrete generative probability distributions, generally over text, and provide a neat information retrieval formalism that we believe is equally applicable to traditional visual features. We propose to model colour, edge and texture histogram-based features directly with discrete language models, an approach that is compatible with further traditional visual feature representations. We provide a comprehensive and robust empirical study of smoothing methods, hierarchical semantic and physical structures, and fusion methods for this language modelling approach to video retrieval. The advantage of our approach is that it provides a consistent, effective and relatively efficient model for video retrieval.
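
    A minimal sketch of the core idea, treating histogram bins as discrete "visual terms" scored by a smoothed generative model (Python; the bin counts and the Jelinek-Mercer weight are illustrative assumptions, not the thesis's configuration):

        import math

        def score_shot(query_bins, shot_counts, coll_counts, lam=0.5):
            # Query likelihood over histogram bins: each bin index acts like a
            # term, smoothed with the collection-wide bin distribution.
            shot_total = sum(shot_counts.values())
            coll_total = sum(coll_counts.values())
            s = 0.0
            for b in query_bins:
                p_shot = shot_counts.get(b, 0) / shot_total
                p_coll = coll_counts.get(b, 0) / coll_total
                s += math.log(lam * p_shot + (1.0 - lam) * p_coll)
            return s

        # Toy data: colour-histogram bin counts for one shot and the collection.
        shot = {0: 12, 3: 5, 7: 2}
        coll = {0: 900, 1: 400, 3: 300, 7: 150}
        print(score_shot([0, 3, 7], shot, coll))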

    ISCAS in English-Chinese CLIR at NTCIR-5

    We participated in the Chinese single-language information retrieval (SLIR) C-C task and the English-Chinese cross-language information retrieval (CLIR) E-C task at NTCIR-5. Our project concentrates on two aspects of CLIR research: 1) we test various IR models, especially language models, for Chinese SLIR using the training corpus provided by the NTCIR organizer, and study different smoothing methods for Chinese SLIR; 2) our E-C CLIR task is based on the dictionary-based translation approach, and a new context-based translation algorithm using a web corpus is proposed to solve the out-of-vocabulary (OOV) problem in CLIR.
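
    The dictionary-based translation step can be pictured as follows (illustrative Python; the toy dictionary and the pass-through handling of OOV terms are assumptions for exposition, not the actual ISCAS resources):

        def translate_query(terms, dictionary):
            # Replace each source term by all of its dictionary translations;
            # OOV terms are passed through for later web-based translation.
            translated, oov = [], []
            for t in terms:
                if t in dictionary:
                    translated.extend(dictionary[t])
                else:
                    oov.append(t)
            return translated, oov

        toy_dict = {"retrieval": ["检玢"], "language": ["èŻ­èš€"]}
        print(translate_query(["language", "retrieval", "NTCIR"], toy_dict))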

    Dating Texts without Explicit Temporal Cues

    This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous work, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence-based techniques, and several smoothing methods for each. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.
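
    The document-likelihood variant over a discretized timeline can be sketched in a few lines (illustrative Python; the period granularity, toy vocabularies and smoothing weight are assumptions, not the paper's configuration):

        import math

        def date_document(doc_terms, period_models, coll_model, lam=0.9):
            # Score the document under each period's language model and
            # return the period with the highest log-likelihood.
            best, best_score = None, float("-inf")
            for period, model in period_models.items():
                s = sum(math.log(lam * model.get(t, 0.0)
                                 + (1.0 - lam) * coll_model[t])
                        for t in doc_terms)
                if s > best_score:
                    best, best_score = period, s
            return best

        # Toy models over a two-word vocabulary for two periods.
        periods = {1800: {"steam": 0.7, "telegraph": 0.3},
                   1900: {"steam": 0.2, "telegraph": 0.8}}
        coll = {"steam": 0.45, "telegraph": 0.55}
        print(date_document(["telegraph", "telegraph", "steam"], periods, coll))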

    Language Models

    Contains fulltext: 227630.pdf (preprint version) (Open Access)

    Topic based language models for ad hoc information retrieval

    We propose a topic-based approach to language modelling for ad hoc Information Retrieval (IR). Many smoothed estimators used for the multinomial query model in IR rely upon estimated background collection probabilities. In this paper, we propose a topic-based language modelling approach that uses a more informative prior based on the topical content of a document. In our experiments, the proposed model provides IR performance comparable to the standard models, but when combined in a two-stage language model it outperforms all other estimated models.
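
    One way to picture a topic-based prior inside a two-stage estimator (illustrative Python; this is the generic two-stage smoothing form with hypothetical parameters, not necessarily the paper's exact estimator):

        def two_stage_prob(tf, doc_len, topic_prob, coll_prob, mu=2000.0, lam=0.7):
            # Stage 1: Dirichlet-smooth the document model using a topic-based
            # prior in place of the plain collection model.
            p_doc = (tf + mu * topic_prob) / (doc_len + mu)
            # Stage 2: interpolate with the collection model to absorb
            # query-side noise (the standard two-stage smoothing form).
            return lam * p_doc + (1.0 - lam) * coll_prob

        # Toy case: the document's topic makes the term twice as likely
        # as it is in the collection at large.
        print(two_stage_prob(3, 500, 2e-4, 1e-4))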

    Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches

    Term frequency normalization is a serious issue because document lengths vary. Generally, documents become long for two different reasons: verbosity and multi-topicality. Verbosity means that the same topic is repeatedly mentioned, so term frequencies are inflated relative to a well-summarized document. Multi-topicality means that a document broadly discusses multiple topics rather than a single one. Although these document characteristics should be handled differently, all previous methods of term frequency normalization have ignored the distinction and used a simplified length-driven approach that decreases term frequency based only on document length, causing unreasonable penalization. To address this problem, we propose a novel TF normalization method that takes a partially axiomatic approach. We first formulate two formal constraints that a retrieval model should satisfy for documents with verbose and multi-topical characteristics, respectively. Then we modify language modeling approaches to better satisfy these two constraints and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries and substantially improves MAP (Mean Average Precision) for verbose queries.
    Comment: 8 pages, conference paper, published in ECIR '0
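
    A toy illustration of separating verbosity from topical breadth (Python; the decomposition of document length into verbosity times scope is a common heuristic used here for exposition, not necessarily the paper's constraint formulation):

        def verbosity_scope(term_counts):
            # Decompose document length into verbosity x scope:
            # scope = number of distinct terms, verbosity = mean term frequency.
            scope = len(term_counts)
            length = sum(term_counts.values())
            return length / scope, scope

        def normalized_tf(tf, term_counts):
            # Penalize repetition (verbosity) but not topical breadth (scope),
            # unlike plain length normalization, which penalizes both.
            verbosity, _ = verbosity_scope(term_counts)
            return tf / verbosity

        doc = {"ir": 6, "model": 3, "smoothing": 3}  # verbose, narrow document
        print(normalized_tf(6, doc))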
