123,850 research outputs found

    Variations on language modeling for information retrieval

    Get PDF
    Search engine technology builds on theoretical and empirical research results in the area of information retrieval (IR). This dissertation contributes to the field of language modeling (LM) for IR, which views both queries and documents as instances of a unigram language model and defines the matching function between a query and each document as the probability that the query terms are generated by the document language model.
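    In symbols, the matching function described here is the standard unigram query-likelihood rule, where \theta_D denotes the document's language model:

        P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid \theta_D), \qquad Q = q_1 q_2 \ldots q_n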

    Language Modeling Approaches to Information Retrieval

    Get PDF
    This article surveys recent research in the area of language modeling (sometimes called statistical language modeling) approaches to information retrieval. Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. The underlying assumption of language modeling is that human language generation is a random process; the goal is to model that process via a generative statistical model. In this article, we discuss current research in the application of language modeling to information retrieval, the role of semantics in the language modeling framework, cluster-based language models, the use of language modeling for XML retrieval, and future trends.
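    A minimal sketch of the generative view described above: score a document by the log-probability that its unigram model generates the query, with Jelinek-Mercer interpolation as one standard smoothing choice (the fixed weight of 0.5 is an assumption, not from the survey):

    import math
    from collections import Counter

    def query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
        # Log-probability that the document's unigram language model
        # generates the query, interpolated with the collection model.
        # Assumes doc_terms is non-empty and every query term occurs
        # somewhere in the collection, so the smoothed probability is > 0.
        doc_counts = Counter(doc_terms)
        score = 0.0
        for t in query_terms:
            p_doc = doc_counts[t] / len(doc_terms)
            p_coll = coll_counts[t] / coll_len
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        return score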

    Towards an Information Retrieval Theory of Everything

    Get PDF
    I present three well-known probabilistic models of information retrieval in tutorial style: the binary independence probabilistic model, the language modeling approach, and Google's PageRank. Although all three models are based on probability theory, they are very different in nature. Each model seems well-suited for solving certain information retrieval problems, but not so useful for solving others. Essentially, each model solves part of a bigger puzzle, and a unified view of these models might be a first step towards an Information Retrieval Theory of Everything.
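    Of the three, PageRank is the most self-contained to illustrate; a minimal power-iteration sketch (the damping factor, tolerance, and no-dangling-nodes assumption are conventional choices, not taken from this text):

    def pagerank(links, d=0.85, tol=1e-8):
        # links: dict mapping each page to the list of pages it links to.
        # Assumes every page has at least one outgoing link.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        while True:
            new_rank = {p: (1 - d) / n
                           + d * sum(rank[q] / len(links[q])
                                     for q in pages if p in links[q])
                        for p in pages}
            if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
                return new_rank
            rank = new_rank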

    A probabilistic justification for using tf.idf term weighting in information retrieval

    Get PDF
    This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing are used in this paper to formulate a probabilistic justification for using tf.idf term weighting. The paper shows that the new probabilistic interpretation of tf.idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination-level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
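    For context, a minimal sketch of one common tf.idf variant; the probabilistically justified weighting the paper actually derives differs in detail:

    import math

    def tfidf_weight(term, doc_terms, all_docs):
        # One common tf.idf variant: raw term frequency in the document
        # times log inverse document frequency over the collection.
        tf = doc_terms.count(term)
        df = sum(1 for d in all_docs if term in d)
        idf = math.log(len(all_docs) / df) if df else 0.0
        return tf * idf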

    Investigation of the Lambda Parameter for Language Modeling Based Persian Retrieval

    Get PDF
    Language modeling is one of the most powerful methods in information retrieval. Many language modeling based retrieval systems have been developed and tested on English collections, so evaluating language modeling on collections in other languages is an interesting research issue. In this study, four different language modeling methods proposed by Hiemstra [1] have been evaluated on a large Persian collection from a news archive. Furthermore, we study two approaches proposed for tuning the lambda parameter in these methods. Experimental results show that the performance of language models on Persian text improves after lambda tuning; more specifically, the Witten-Bell method provides the best results.
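    The lambda here is the weight interpolating the document model with the collection model; a minimal sketch contrasting a fixed lambda with a Witten-Bell style document-dependent estimate (a common formulation, assumed here; the paper follows Hiemstra's [1] exact variants):

    from collections import Counter

    def witten_bell_lambda(doc_terms):
        # Document-dependent interpolation weight, Witten-Bell style:
        # lambda_d = |d| / (|d| + number of distinct terms in d).
        return len(doc_terms) / (len(doc_terms) + len(set(doc_terms)))

    def smoothed_prob(term, doc_terms, p_collection, lam=None):
        # P(term | d) interpolated with the collection model; a fixed lam
        # gives plain linear interpolation, lam=None uses the Witten-Bell
        # estimate above. Assumes doc_terms is non-empty.
        if lam is None:
            lam = witten_bell_lambda(doc_terms)
        p_doc = Counter(doc_terms)[term] / len(doc_terms)
        return lam * p_doc + (1 - lam) * p_collection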

    Knowledge-based Query Expansion in Real-Time Microblog Search

    Full text link
    Since the length of microblog texts, such as tweets, is strictly limited to 140 characters, traditional information retrieval techniques suffer severely from the vocabulary mismatch problem and cannot yield good performance in the microblogosphere. To address this critical challenge, we propose a new language modeling approach for microblog retrieval that infers various types of context information. In particular, we expand the query using knowledge terms derived from Freebase so that the expanded query better reflects users' search intent. In addition, to further satisfy users' real-time information needs, we incorporate temporal evidence into the expansion method, which boosts recent tweets in the retrieval results for a given topic. Experimental results on two official TREC Twitter corpora demonstrate the significant superiority of our approach over baseline methods.
    Comment: 9 pages, 9 figures
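    A hedged sketch of the two ingredients, knowledge-term expansion plus recency boosting; the expansion weight and decay rate are illustrative stand-ins, not the paper's estimation method:

    import math

    def expand_query(query_terms, knowledge_terms, expansion_weight=0.3):
        # Mix the original query with knowledge-derived expansion terms
        # (e.g., entity names pulled from a knowledge base such as Freebase).
        weights = {t: 1.0 for t in query_terms}
        for t in knowledge_terms:
            weights[t] = weights.get(t, 0.0) + expansion_weight
        return weights

    def recency_boost(score, doc_time, query_time, decay=1e-5):
        # Exponentially discount a tweet's retrieval score by its age
        # in seconds; the decay rate is an assumed stand-in.
        return score * math.exp(-decay * (query_time - doc_time))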

    A database approach to information retrieval:The remarkable relationship between language models and region models

    Get PDF
    In this report, we unify two quite distinct approaches to information retrieval: region models and language models. Region models were developed for structured document retrieval. They provide a well-defined behaviour as well as a simple query language that allows application developers to rapidly develop applications. Language models are particularly useful to reason about the ranking of search results, and for developing new ranking approaches. The unified model allows application developers to define complex language modeling approaches as logical queries on a textual database. We show a remarkable one-to-one relationship between region queries and the language models they represent for a wide variety of applications: simple ad-hoc search, cross-language retrieval, video retrieval, and web search.
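    To picture the relationship on the simplest case, a hypothetical sketch (the operators are illustrative, not the report's actual region algebra): a containment-style region query selects the document regions holding a term, and counting occurrences inside a region yields the statistics a unigram language model ranks with:

    def matching_regions(doc_regions, term_positions):
        # Hypothetical containment operator: document regions (start, end)
        # over a tokenized textual database that contain at least one
        # occurrence position of a term.
        return [(s, e) for (s, e) in doc_regions
                if any(s <= p < e for p in term_positions)]

    def term_frequency(region, term_positions):
        # Counts inside a region: the term frequency (and region length)
        # a unigram language model needs for ranking.
        s, e = region
        return sum(1 for p in term_positions if s <= p < e)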

    Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

    Full text link
    Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tends to select a large number of tokens that have limited effect on passage retrieval (e.g., stop-words and punctuation). Noticing that term importance weights can provide valuable information for passage retrieval, we propose an alternative retrieval-oriented masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, capturing this straightforward yet essential signal to facilitate language model pre-training. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that ROM enables term importance information to aid language model pre-training, achieving better performance on multiple passage retrieval benchmarks.
    Comment: Search LM part of the "AliceMind SLM + HLAR" method in the MS MARCO Passage Ranking Leaderboard submission
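    A minimal sketch of importance-weighted masking as described; the source of the importance weights (for example, IDF-like scores) and the 15% masking budget are assumptions, not the paper's exact recipe:

    import random

    def rom_mask(tokens, importance, mask_rate=0.15, mask_token="[MASK]"):
        # Mask tokens with probability proportional to a precomputed
        # importance weight, keeping the expected fraction of masked
        # tokens near mask_rate. Assumes importance covers all tokens.
        total = sum(importance[t] for t in tokens)
        budget = mask_rate * len(tokens)
        return [mask_token
                if random.random() < min(1.0, budget * importance[t] / total)
                else t
                for t in tokens]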