A comparative study of probabilistic and language models for information retrieval
Language models for information retrieval have received much attention in recent years, with many claims being made about their performance. However, previous studies evaluating the language modelling approach for information retrieval used different query sets and heterogeneous collections, which makes reported results difficult to compare. This research is a broad-based study that evaluates language models against a variety of search tasks: topic finding, named-page finding and topic distillation. The standard Text REtrieval Conference (TREC) methodology is used to compare language models to the probabilistic Okapi BM25 system. Using consistent parameter choices, we compare results of different language models on three different search tasks, multiple query sets and three different text collections. For ad hoc retrieval, the Dirichlet smoothing method was found to be significantly better than Okapi BM25, but for named-page finding Okapi BM25 was more effective than the language modelling methods. Optimal smoothing parameters for each method were found to be dependent on the collection and the query set. For longer queries, the language modelling approaches required more aggressive smoothing, but they were found to be more effective than with shorter queries. The choice of smoothing method was also found to have a significant effect on the performance of language models for information retrieval.
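The two ranking functions being compared can be sketched as follows, using the standard Dirichlet query-likelihood and Okapi BM25 formulas; function names, data structures and toy parameter values are illustrative, not taken from the study:

```python
import math

def dirichlet_score(query, doc, collection, mu=2000.0):
    """Query likelihood with Dirichlet-prior smoothing.
    doc and collection are term -> count dictionaries."""
    dlen = sum(doc.values())
    clen = sum(collection.values())
    score = 0.0
    for t in query:
        p_c = collection.get(t, 0) / clen  # background probability
        if p_c == 0:
            continue  # term unseen in the collection: skip in this sketch
        score += math.log((doc.get(t, 0) + mu * p_c) / (dlen + mu))
    return score

def bm25_score(query, doc, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 with the standard idf formulation."""
    dlen = sum(doc.values())
    score = 0.0
    for t in query:
        df = doc_freq.get(t, 0)
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.get(t, 0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dlen / avgdl))
    return score
```

Both functions return rank-comparable scores per query, which is all TREC-style evaluation needs; the study's finding is that the best choice of mu (and of smoothing method) depends on the collection, query set and task.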
Language Models and Smoothing Methods for Information Retrieval
(Sprachmodelle und GlĂ€ttungsmethoden fĂŒr Information Retrieval)
Najeeb A. Abdulmutalib
Summary of the Dissertation
Retrieval models form the theoretical foundation of effective information retrieval methods. Statistical language models are a new class of retrieval models that has been studied in research for about ten years. Unlike other models, they can be adapted more easily to specific tasks and often deliver better retrieval results.
This dissertation first presents a new statistical language model that explicitly takes document length into account. Because observation data are sparse, smoothing methods play an important role in language models. For this, too, we introduce a new method called 'exponential smoothing'. An experimental comparison with competing approaches shows that our new methods deliver superior results, particularly on collections with strongly varying document lengths.
In a second step we extend our approach to XML retrieval, which deals with hierarchically structured documents and where focused retrieval is meant to find the smallest possible document parts that completely answer the query. Here, too, an experimental comparison with other approaches demonstrates the quality of our newly developed methods.
The third part of the thesis is concerned with the comparison of language models and the classical tf*idf weighting. Besides a better understanding of existing smoothing methods, this approach leads us to the development of the 'empirical smoothing' technique. The retrieval experiments carried out with it show improvements over other smoothing methods.
Language Models and Smoothing Methods for Information Retrieval
Najeeb A. Abdulmutalib
Abstract of the Dissertation
Designing an effective retrieval model that can rank documents accurately for a given query has been a central problem in information retrieval for several decades. An optimal retrieval model that is both effective and efficient and that can learn from feedback information over time is needed. Language models are a new generation of retrieval models and have been applied over the last ten years to solve many different information retrieval problems. Compared with traditional models such as the vector space model, they can be more easily adapted to model non-traditional and complex retrieval problems, and empirically they tend to achieve comparable or better performance than the traditional models. Developing new language models is currently an active research area in information retrieval.
In the first stage of this thesis we present a new language model based on an odds formula, which explicitly incorporates document length as a parameter.
To address the problem of data sparsity where there is rarely enough data to accurately estimate the parameters of a language model, smoothing gives a way to combine less specific, more accurate information with more specific, but noisier data. We introduce a new smoothing method called exponential smoothing, which can be combined with most language models. We present experimental results for various language models and smoothing methods on a collection with large document length variation, and show that our new methods compare favourably with the best approaches known so far.
We discuss the effect of the collection on the retrieval function, investigating the performance of well-known models and comparing results obtained on two variant collections.
In the second stage we extend the current model from flat text retrieval to XML retrieval since there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Compared to traditional information retrieval, where whole documents are usually indexed and retrieved as single complete units, information retrieval from XML documents creates additional retrieval challenges. By exploiting the logical document structure, XML allows for more focussed retrieval that identifies elements rather than documents as answers to user queries.
Finally we show how smoothing plays a role very similar to that of the idf function: besides its obvious role, smoothing also improves the accuracy of the estimated language model. The within-document frequency and the collection frequency of a term actually influence the probability of relevance, which led us to a new class of smoothing functions based on numeric prediction, which we call empirical smoothing. Its retrieval quality outperforms that of other smoothing methods.
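The idf-like behaviour of smoothing can be made concrete with Jelinek-Mercer smoothing, whose query log-likelihood decomposes exactly into a matched-term, tf*idf-like part plus a document-independent constant (a known identity; the thesis's empirical smoothing itself is not reproduced here, and the function names are illustrative):

```python
import math

def jm_loglik(query, doc, coll, lam=0.5):
    """Jelinek-Mercer smoothed query log-likelihood.
    Query terms are assumed to occur somewhere in the collection."""
    dlen = sum(doc.values())
    clen = sum(coll.values())
    return sum(math.log(lam * doc.get(t, 0) / dlen
                        + (1 - lam) * coll[t] / clen)
               for t in query)

def jm_tf_idf_form(query, doc, coll, lam=0.5):
    """The same score rewritten as a matched-term weight (large when a term is
    frequent in the document but rare in the collection, i.e. tf*idf-like)
    plus a constant that does not depend on the document."""
    dlen = sum(doc.values())
    clen = sum(coll.values())
    matched = sum(math.log(1 + lam * (doc.get(t, 0) / dlen)
                           / ((1 - lam) * (coll[t] / clen)))
                  for t in query if doc.get(t, 0) > 0)
    const = sum(math.log((1 - lam) * coll[t] / clen) for t in query)
    return matched + const
```

Since the constant is the same for every document, only the matched-term part affects ranking, which is where the idf-like effect of the background model shows up.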
Discrete language models for video retrieval
Finding relevant video content is important for producers of television news, documentaries and commercials. As digital video collections become more widely available, content-based video retrieval tools will likely grow in importance for an even wider group of users. In this thesis we investigate language modelling approaches, which have been the focus of recent attention within the text information retrieval community, for the video search task. Language models are smoothed discrete generative probability distributions, generally over text, and provide a neat information retrieval formalism that we believe is as applicable to traditional visual features as to text. We propose to model colour, edge and texture histogram-based features directly with discrete language models, and this approach is compatible with further traditional visual feature representations. We provide a comprehensive and robust empirical study of smoothing methods, hierarchical semantic and physical structures, and fusion methods for this language modelling approach to video retrieval. The advantage of our approach is that it provides a consistent, effective and relatively efficient model for video retrieval.
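The idea of scoring histogram features with a discrete language model can be sketched as follows, treating each histogram bin as a discrete "visual term". Jelinek-Mercer smoothing stands in here for the several smoothing methods the thesis studies, and all names and parameters are illustrative assumptions:

```python
import math

def histogram_to_counts(hist):
    """Treat each bin of a colour/edge/texture histogram as a 'visual term'
    whose count is the bin value."""
    return {i: c for i, c in enumerate(hist) if c > 0}

def shot_loglik(query_counts, shot_counts, coll_counts, lam=0.7):
    """Jelinek-Mercer smoothed log-likelihood of the query histogram under a
    shot's visual language model, backed off to the collection model."""
    slen = sum(shot_counts.values())
    clen = sum(coll_counts.values())
    score = 0.0
    for term, qc in query_counts.items():
        p = (lam * shot_counts.get(term, 0) / slen
             + (1 - lam) * coll_counts.get(term, 0) / clen)
        if p > 0:
            score += qc * math.log(p)
    return score
```

Because the visual terms are discrete, exactly the same estimation and smoothing machinery used for text applies unchanged, which is what makes fusion of text and visual evidence straightforward in this framework.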
ISCAS in English-Chinese CLIR at NTCIR-5
We participated in the Chinese single-language information retrieval (SLIR) C-C task and the English-Chinese cross-language information retrieval (CLIR) E-C task at NTCIR-5. Our project concentrates on two aspects of CLIR research: 1) we test various IR models, especially language models, for Chinese SLIR using the training corpus provided by the NTCIR organizer, and study different smoothing methods for Chinese SLIR; 2) our E-C CLIR task is based on the dictionary-based translation approach, and a new context-based translation algorithm using a web corpus is proposed to solve the out-of-vocabulary (OOV) problem in CLIR.
Dating Texts without Explicit Temporal Cues
This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence-based techniques and several smoothing methods for both of them. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.
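The document-likelihood variant over a discretized timeline can be sketched as follows; the bin size, the Dirichlet smoothing against a global background model, and all function names are illustrative assumptions rather than the paper's exact configuration:

```python
import math
from collections import Counter

def period_models(dated_docs, bin_size=50):
    """Build a unigram count model per time bin from (year, tokens) pairs."""
    bins = {}
    for year, tokens in dated_docs:
        bins.setdefault(year // bin_size, Counter()).update(tokens)
    return bins

def predict_year(tokens, bins, bin_size=50, mu=100.0):
    """Document-likelihood dating: pick the time bin whose language model
    assigns the text the highest (Dirichlet-smoothed) log-likelihood,
    and return the mid-point of that bin."""
    background = Counter()
    for counts in bins.values():
        background.update(counts)
    blen = sum(background.values())
    best_bin, best = None, float("-inf")
    for b, counts in bins.items():
        plen = sum(counts.values())
        ll = sum(math.log((counts[t] + mu * background[t] / blen) / (plen + mu))
                 for t in tokens if background[t] > 0)
        if ll > best:
            best_bin, best = b, ll
    return best_bin * bin_size + bin_size // 2
```

A divergence-based variant would instead compare the text's own term distribution against each bin's model (e.g. via KL divergence), but the discretize-then-score structure is the same.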
Language Models
Contains fulltext: 227630.pdf (preprint version, Open Access)
Topic based language models for ad hoc information retrieval
We propose a topic-based approach to language modelling for ad-hoc Information Retrieval (IR). Many smoothed estimators used for the multinomial query model in IR rely upon the estimated background collection probabilities. In this paper, we propose a topic-based language modelling approach that uses a more informative prior based on the topical content of a document. In our experiments, the proposed model provides IR performance comparable to the standard models, but when combined in a two-stage language model, it outperforms all other estimated models.
Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches
Term frequency normalization is a serious issue since document lengths vary widely. Generally, documents become long for two different reasons: verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned using terms related to it, so term frequency is higher than in a well-summarized document. Second, multi-topicality means that a document contains a broad discussion of multiple topics rather than a single topic. Although these document characteristics should be handled differently, all previous term frequency normalization methods have ignored the difference and used a simplified length-driven approach that decreases term frequency according only to the length of a document, causing unreasonable penalization. To attack this problem, we propose a novel TF normalization method, a type of partially-axiomatic approach. We first formulate two formal constraints that a retrieval model should satisfy for documents with verbose and multi-topical characteristics, respectively. Then we modify language modeling approaches to better satisfy these two constraints and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries and substantially improves MAP (Mean Average Precision) for verbose queries.
Comment: 8 pages, conference paper, published in ECIR '0
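One simple way to see the distinction the constraints capture is to normalize term frequency by a verbosity ratio (occurrences per distinct term) instead of raw document length; this is a sketch of the intuition only, not the paper's constraint-derived formulation, and the names are illustrative:

```python
def verbosity(doc):
    """Average repetitions per distinct term: high for verbose documents,
    near 1 for documents that are long because they cover many topics."""
    return sum(doc.values()) / len(doc)

def normalized_tf(term, doc):
    """Scale term frequency down by verbosity only, so a document that is
    long due to multi-topicality is not penalized as if it were repetitive."""
    return doc.get(term, 0) / verbosity(doc)
```

A purely length-driven normalizer would divide both kinds of documents by the same length, over-penalizing the multi-topical one; separating the two effects is what the paper's two constraints formalize.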