    Knowledge-based Query Expansion in Real-Time Microblog Search

    Full text link
    Since the length of microblog texts such as tweets is strictly limited to 140 characters, traditional Information Retrieval techniques suffer severely from the vocabulary mismatch problem and cannot yield good performance in the microblogosphere. To address this critical challenge, in this paper we propose a new language modeling approach for microblog retrieval that infers various types of context information. In particular, we expand the query using knowledge terms derived from Freebase so that the expanded query better reflects the user's search intent. In addition, to further satisfy users' real-time information needs, we incorporate temporal evidence into the expansion method, which boosts recent tweets in the retrieval results with respect to a given topic. Experimental results on two official TREC Twitter corpora demonstrate the significant superiority of our approach over baseline methods.
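
    The abstract combines two ingredients: expanding the query with knowledge terms and boosting recent tweets. The sketch below shows one way these could fit together; the interpolation weight, the exponential recency decay, and the example terms are illustrative assumptions, not the authors' model.

        import math
        from collections import Counter

        def expand_query(query_terms, knowledge_terms, alpha=0.7):
            # Interpolate original query terms with knowledge-derived terms
            # (alpha is an assumed mixing weight, not taken from the paper).
            weights = Counter()
            for t in query_terms:
                weights[t] += alpha / len(query_terms)
            for t in knowledge_terms:
                weights[t] += (1 - alpha) / len(knowledge_terms)
            return weights

        def score_tweet(tweet_terms, age_hours, query_weights, decay=0.01):
            # Weighted term overlap, multiplied by an exponential recency prior.
            overlap = sum(w for t, w in query_weights.items() if t in tweet_terms)
            return overlap * math.exp(-decay * age_hours)

        # Hypothetical query plus knowledge terms standing in for Freebase output.
        q = expand_query(["obama", "visit"], ["president", "white", "house"])
        print(score_tweet({"obama", "president", "speech"}, age_hours=3.0, query_weights=q))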

    Dense Text Retrieval based on Pretrained Language Models: A Survey

    Full text link
    Text retrieval is a long-standing research topic in information seeking, where a system is required to return relevant information resources in response to users' natural language queries. From classic retrieval methods to learning-based ranking functions, the underlying retrieval models have continually evolved alongside ongoing technical innovation. To design effective retrieval models, a key question is how to learn text representations and model relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches by leveraging their excellent modeling capacity. With powerful PLMs, we can effectively learn the representations of queries and texts in a latent representation space and further construct a semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is referred to as dense retrieval, since it employs dense vectors (a.k.a. embeddings) to represent the texts. Considering the rapid progress on dense retrieval, in this survey we systematically review the recent advances in PLM-based dense retrieval. Different from previous surveys on dense retrieval, we take a new perspective and organize the related work by four major aspects, namely architecture, training, indexing, and integration, and summarize the mainstream techniques for each aspect. We thoroughly survey the literature and include 300+ related reference papers on dense retrieval. To support the survey, we create a website providing useful resources and release a code repository and toolkit for implementing dense retrieval models. This survey aims to provide a comprehensive, practical reference on the major progress in dense text retrieval.
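
    As a concrete illustration of the dense retrieval setup the survey covers, the sketch below encodes a query and a few documents into dense vectors with a bi-encoder and ranks by dot product. It assumes the sentence-transformers package, and the model name is only an example; the survey itself discusses many architectures and training schemes.

        import numpy as np
        from sentence_transformers import SentenceTransformer

        # An off-the-shelf bi-encoder (example model, not one prescribed by the survey).
        model = SentenceTransformer("all-MiniLM-L6-v2")

        docs = [
            "Dense retrieval represents queries and documents as embeddings.",
            "BM25 is a classic sparse lexical retrieval function.",
            "Pretrained language models improve text representation learning.",
        ]
        doc_vecs = model.encode(docs, normalize_embeddings=True)   # one dense vector per document

        query = "how do pretrained language models help retrieval?"
        q_vec = model.encode([query], normalize_embeddings=True)[0]

        scores = doc_vecs @ q_vec          # cosine similarity via normalized dot product
        ranking = np.argsort(-scores)      # document indices, best first
        for i in ranking:
            print(f"{scores[i]:.3f}  {docs[i]}")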

    Bias-variance analysis in estimating true query model for information retrieval

    Get PDF
    The estimation of the query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimate is expected to be not only effective, in terms of high mean retrieval performance over all queries, but also stable, in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, namely the bias-variance tradeoff, a fundamental concept in statistics. We formulate notions of bias and variance with respect to both retrieval performance and the estimation quality of query models. We then investigate several estimated query models, analyzing when and why the bias-variance tradeoff occurs and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections has been conducted to systematically evaluate our bias-variance analysis. Our approach and results can potentially form an analysis framework and a novel evaluation strategy for query language modeling.
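
    A toy illustration of the bias-variance decomposition applied to per-query performance gaps is sketched below; the average-precision numbers are made up, and the paper's actual formulation, which concerns query-model estimation, may differ in its details.

        import numpy as np

        # Hypothetical per-query average precision for an estimated query model
        # and for an oracle ("true") query model.
        ap_estimated = np.array([0.32, 0.45, 0.28, 0.61, 0.50])
        ap_true      = np.array([0.40, 0.52, 0.35, 0.70, 0.58])

        errors = ap_true - ap_estimated
        bias = errors.mean()              # systematic shortfall over all queries
        variance = errors.var()           # instability of the gap across queries
        mse = (errors ** 2).mean()        # equals bias**2 + variance

        print(f"bias={bias:.3f}  variance={variance:.4f}  mse={mse:.4f}")
        print(f"bias^2 + variance = {bias**2 + variance:.4f}")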

    The Quantum Challenge in Concept Theory and Natural Language Processing

    Full text link
    The mathematical formalism of quantum theory has been successfully used in human cognition to model decision processes and to deliver representations of human knowledge. As such, quantum-cognition-inspired tools have improved technologies for Natural Language Processing and Information Retrieval. In this paper, we give an overview of the quantum cognition approach developed by our Brussels team over the last two decades, specifically our identification of quantum structures in human concepts and language and our modeling of data from psychological and corpus-text-based experiments. We discuss our quantum-theoretic framework for concepts and their conjunctions/disjunctions in a Fock-Hilbert space structure, which adequately models a large amount of data collected on concept combinations. Inspired by this modeling, we put forward elements of a quantum, contextual, meaning-based approach to information technologies in which 'entities of meaning' are inversely reconstructed from texts, which are regarded as traces of these entities' states.
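
    The following is a generic numerical illustration of how a superposed concept state yields an interference term in membership probabilities; it is not the authors' Fock-space model, and the vectors and projector below are arbitrary assumptions.

        import numpy as np

        rng = np.random.default_rng(0)

        def unit(v):
            return v / np.linalg.norm(v)

        dim = 8
        A = unit(rng.normal(size=dim))   # state vector for concept A (arbitrary)
        B = unit(rng.normal(size=dim))   # state vector for concept B (arbitrary)

        # Projector standing in for "membership in some exemplar category".
        basis = np.linalg.qr(rng.normal(size=(dim, 3)))[0]   # orthonormal columns
        M = basis @ basis.T

        def membership(state):
            # Membership weight as the expectation value <psi|M|psi>.
            return float(state @ M @ state)

        C = unit(A + B)                                      # superposed "combined concept" state
        classical_avg = 0.5 * (membership(A) + membership(B))
        interference = membership(C) - classical_avg         # deviation from the classical average

        print(f"mu(A)={membership(A):.3f}  mu(B)={membership(B):.3f}")
        print(f"mu(superposition)={membership(C):.3f}  interference={interference:+.3f}")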

    Automatic query expansion: A structural linguistic perspective

    Get PDF
    A user’s query is considered to be an imprecise description of their information need. Automatic query expansion is the process of reformulating the original query with the goal of improving retrieval effectiveness. Many successful query expansion techniques ignore information about the dependencies that exist between words in natural language. More recent approaches, however, have demonstrated that explicitly modeling associations between terms yields significant improvements in retrieval effectiveness over approaches that ignore these dependencies. State-of-the-art dependency-based approaches have been shown to model primarily syntagmatic associations, which capture the likelihood that two terms co-occur more often than would be expected by chance. Structural linguistics, however, relies on both syntagmatic and paradigmatic associations to deduce the meaning of a word. Given the success of dependency-based approaches and the reliance on word meanings in the query formulation process, we argue that modeling both syntagmatic and paradigmatic information in the query expansion process will improve retrieval effectiveness. This article develops and evaluates a new query expansion technique based on a formal, corpus-based model of word meaning that captures both syntagmatic and paradigmatic associations. We demonstrate that when sufficient statistical information exists, as in the case of longer queries, including paradigmatic information alone provides significant improvements in retrieval effectiveness across a wide variety of data sets. More generally, when our new query expansion approach is applied to large-scale web retrieval, it demonstrates significant improvements in retrieval effectiveness over a strong baseline system based on a commercial search engine.
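
    To make the distinction concrete, the sketch below derives syntagmatic neighbours (terms that co-occur directly) and paradigmatic neighbours (terms with similar co-occurrence profiles) from a tiny hypothetical corpus; it illustrates the two association types, not the article's corpus-based semantic-space model.

        import numpy as np

        # Hypothetical mini-corpus.
        corpus = [
            "the doctor treated the patient in the hospital",
            "the physician examined the patient at the clinic",
            "the nurse helped the doctor in the hospital",
            "the mechanic repaired the car in the garage",
        ]

        vocab = sorted({w for s in corpus for w in s.split()})
        index = {w: i for i, w in enumerate(vocab)}
        cooc = np.zeros((len(vocab), len(vocab)))

        for sentence in corpus:                      # sentence-level co-occurrence counts
            words = sentence.split()
            for w in words:
                for v in words:
                    if w != v:
                        cooc[index[w], index[v]] += 1

        def syntagmatic(term, k=3):
            # Terms that co-occur with `term` (first-order association).
            row = cooc[index[term]]
            return [vocab[i] for i in np.argsort(-row)[:k] if row[i] > 0]

        def paradigmatic(term, k=3):
            # Terms whose co-occurrence profiles resemble `term`'s (second-order association).
            target = cooc[index[term]]
            sims = cooc @ target / (np.linalg.norm(cooc, axis=1) * np.linalg.norm(target) + 1e-9)
            sims[index[term]] = -1.0                 # exclude the term itself
            return [vocab[i] for i in np.argsort(-sims)[:k]]

        print("syntagmatic('doctor') :", syntagmatic("doctor"))
        print("paradigmatic('doctor'):", paradigmatic("doctor"))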

    Probabilistic collaborative filtering with negative cross entropy

    Full text link
    This is the author's version of the work, posted for personal use and not for redistribution; the definitive Version of Record was published in RecSys '13, Proceedings of the 7th ACM Conference on Recommender Systems, http://dx.doi.org/10.1145/2507157.2507191. Relevance-Based Language Models are an effective IR approach that explicitly introduces the concept of relevance into the statistical Language Modelling framework of Information Retrieval. These models have been shown to achieve state-of-the-art retrieval performance in the pseudo-relevance feedback task. In this paper we propose a novel adaptation of this language modeling approach to rating-based Collaborative Filtering. In a memory-based setting, we apply the model to the formation of user neighbourhoods and the generation of recommendations based on such neighbourhoods. We report experimental results in which our method outperforms other standard memory-based algorithms in terms of ranking precision. This work was funded by the Secretaría de Estado de Investigación, Desarrollo e Innovación of the Spanish Government under projects TIN2012-33867 and TIN2011-28538-C02.
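
    A rough sketch of the general idea, in the spirit of neighbourhood-based recommendation with probability distributions over items, is shown below; the rating matrix, similarity measure, and smoothing are assumptions and do not reproduce the paper's relevance-model estimator.

        import numpy as np

        # Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated).
        R = np.array([
            [5, 4, 0, 1, 0],
            [4, 5, 1, 0, 0],
            [1, 0, 5, 4, 5],
            [0, 1, 4, 5, 4],
        ], dtype=float)

        def item_dist(ratings, smoothing=0.1):
            # Turn a user's ratings into a smoothed probability distribution over items.
            p = ratings + smoothing
            return p / p.sum()

        def recommend(u, k=2):
            # Score items for user u by mixing the item distributions of the k most
            # similar users, weighted by cosine similarity.
            sims = R @ R[u] / (np.linalg.norm(R, axis=1) * np.linalg.norm(R[u]) + 1e-9)
            sims[u] = -1.0                                     # exclude the user themselves
            neighbours = np.argsort(-sims)[:k]
            scores = sum(sims[v] * item_dist(R[v]) for v in neighbours)
            scores[R[u] > 0] = 0.0                             # do not re-recommend rated items
            return np.argsort(-scores)

        print("recommended item indices for user 0:", recommend(0))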

    Enhancing Information Retrieval Through Concept-Based Language Modeling and Semantic Smoothing.

    Get PDF
    Traditionally, many information retrieval models assume that terms occur in documents independently. Although these models have shown good performance, the word independence assumption is unrealistic from a natural language point of view, which considers terms to be related to each other. This assumption therefore leads to two well-known problems in information retrieval (IR), namely polysemy (term mismatch) and synonymy. In language models, these issues have been addressed by considering dependencies such as bigrams, phrasal concepts, or word relationships, but such models are estimated using simple n-gram or concept counting. In this paper, we address the polysemy and synonymy mismatch problems with a concept-based language modeling approach that combines ontological concepts from external resources with collocations frequently found in the document collection. In addition, the concept-based model is enriched with subconcepts and semantic relationships through a semantic smoothing technique so as to perform semantic matching. Experiments carried out on TREC collections show that our model achieves significant improvements over a single word-based model and the Markov Random Field model (using a Markov classifier).
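
    The sketch below illustrates the flavour of semantic smoothing: a document's maximum-likelihood term distribution is interpolated with probabilities propagated through concepts the document covers, so related terms that never occur in the document still receive probability mass. The concept tables, weights, and mixing parameter are made up for illustration and are not the paper's estimator.

        from collections import Counter

        doc = "heart attack symptoms include chest pain and shortness of breath".split()
        counts = Counter(doc)
        total = sum(counts.values())

        # p(word | concept): hypothetical mapping from concepts to related terms.
        concept_terms = {
            "myocardial_infarction": {"heart": 0.3, "attack": 0.3, "cardiac": 0.2, "infarction": 0.2},
            "pain":                  {"pain": 0.5, "ache": 0.3, "discomfort": 0.2},
        }
        # p(concept | document): hypothetical concept weights detected in this document.
        concept_in_doc = {"myocardial_infarction": 0.6, "pain": 0.4}

        def p_w_given_d(word, lam=0.4):
            # Interpolate the maximum-likelihood estimate with concept-based evidence.
            ml = counts[word] / total
            semantic = sum(concept_terms[c].get(word, 0.0) * w for c, w in concept_in_doc.items())
            return (1 - lam) * ml + lam * semantic

        for w in ["heart", "cardiac", "ache", "breath"]:
            print(f"p({w!r} | d) = {p_w_given_d(w):.3f}")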

    Combining compound and single terms under language model framework

    Get PDF
    Most existing Information Retrieval models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information in IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.
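
    The sketch below illustrates the general idea of selecting and weighting document bigrams rather than treating them uniformly: bigrams are scored with pointwise mutual information, filtered by a threshold, and interpolated with a unigram model. The scoring criterion, threshold, and mixing weight are assumptions, not the paper's exact method.

        import math
        from collections import Counter

        # Hypothetical mini-document.
        doc = ("information retrieval models often assume term independence but "
               "information retrieval benefits from compound terms").split()

        unigrams = Counter(doc)
        bigrams = Counter(zip(doc, doc[1:]))
        n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

        def pmi(bigram):
            # Pointwise mutual information between the two words of a bigram.
            w1, w2 = bigram
            p_xy = bigrams[bigram] / n_bi
            p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
            return math.log(p_xy / (p_x * p_y))

        # Keep only bigrams whose PMI suggests a real association (threshold is illustrative).
        selected = {bg: pmi(bg) for bg in bigrams if pmi(bg) > 0.5}

        def p_term(term, lam=0.3):
            # Mix the unigram estimate with weighted evidence from selected bigrams.
            uni = unigrams[term] / n_uni
            bi = sum(score for (w1, w2), score in selected.items() if term in (w1, w2))
            bi_total = sum(selected.values()) or 1.0
            return (1 - lam) * uni + lam * bi / bi_total

        print(sorted(selected, key=selected.get, reverse=True)[:3])
        print(f"p('retrieval') = {p_term('retrieval'):.3f}")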