Search CORE

26,343 research outputs found

Recommended from our members

Learning to Diversify Web Search Results with a Document Repulsion Model

Author: Li Jingfei
Song Dawei
Wang Benyou
Wu Yue
Zhang Peng
Publication venue: 'Elsevier BV'
Publication date: 01/10/2017
Field of study

Search diversification (also called diversity search), is an important approach to tackling the query ambiguity problem in information retrieval. It aims to diversify the search results that are originally ranked according to their probabilities of relevance to a given query, by re-ranking them to cover as many as possible different aspects (or subtopics) of the query. Most existing diversity search models heuristically balance the relevance ranking and the diversity ranking, yet lacking an efficient learning mechanism to reach an optimized parameter setting. To address this problem, we propose a learning-to-diversify approach which can directly optimize the search diversification performance (in term of any effectiveness metric). We first extend the ranking function of a widely used learning-to-rank framework, i.e., LambdaMART, so that the extended ranking function can correlate relevance and diversity indicators. Furthermore, we develop an effective learning algorithm, namely Document Repulsion Model (DRM), to train the ranking function based on a Document Repulsion Theory (DRT). DRT assumes that two result documents covering similar query aspects (i.e., subtopics) should be mutually repulsive, for the purpose of search diversification. Accordingly, the proposed DRM exerts a repulsion force between each pair of similar documents in the learning process, and includes the diversity effectiveness metric to be optimized as part of the loss function. Although there have been existing learning based diversity search methods, they often involve an iterative sequential selection process in the ranking process, which is computationally complex and time consuming for training, while our proposed learning strategy can largely reduce the time cost. Extensive experiments are conducted on the TREC diversity track data (2009, 2010 and 2011). The results demonstrate that our model significantly outperforms a number of baselines in terms of effectiveness and robustness. Further, an efficiency analysis shows that the proposed DRM has a lower computational complexity than the state of the art learning-to-diversify methods

Open Research Online (The Open University)

Rhetorical relations for information retrieval

Author: Larsen Birger
Lioma Christina
Lu Wei
Publication venue
Publication date: 05/04/2017
Field of study

Typically, every part in most coherent text has some plausible reason for its presence, some function that it performs to the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this socalled discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query. Empirical evaluation of different versions of our model on TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (> 10% in mean average precision over a state-of-the-art baseline)

arXiv.org e-Print Archive

CiteSeerX

A Vertical PRF Architecture for Microblog Search

Author: Arguello J.
Demeester Thomas
Lin Jimmy
Massoudi Kamran
Milad Shokouhi
Rosa Kevin Dela
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/10/2018
Field of study

In microblog retrieval, query expansion can be essential to obtain good search results due to the short size of queries and posts. Since information in microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance feedback (PRF) with an external corpus has a higher chance of retrieving more relevant documents and improving ranking. In this paper, we focus on the research question:how can we reduce the query expansion computational cost while maintaining the same retrieval precision as standard PRF? Therefore, we propose to accelerate the query expansion step of pseudo-relevance feedback. The hypothesis is that using an expansion corpus organized into verticals for expanding the query, will lead to a more efficient query expansion process and improved retrieval effectiveness. Thus, the proposed query expansion method uses a distributed search architecture and resource selection algorithms to provide an efficient query expansion process. Experiments on the TREC Microblog datasets show that the proposed approach can match or outperform standard PRF in MAP and NDCG@30, with a computational cost that is three orders of magnitude lower.Comment: To appear in ICTIR 201

arXiv.org e-Print Archive

Crossref

Improving Entity Retrieval on Structured Data

Author: Dietze Stefan
Fetahu Besnik
Gadiraju Ujwal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/03/2017
Field of study

The increasing amount of data on the Web, in particular of Linked Data, has led to a diverse landscape of datasets, which make entity retrieval a challenging task. Explicit cross-dataset links, for instance to indicate co-references or related entities can significantly improve entity retrieval. However, only a small fraction of entities are interlinked through explicit statements. In this paper, we propose a two-fold entity retrieval approach. In a first, offline preprocessing step, we cluster entities based on the \emph{x--means} and \emph{spectral} clustering algorithms. In the second step, we propose an optimized retrieval model which takes advantage of our precomputed clusters. For a given set of entities retrieved by the BM25F retrieval approach and a given user query, we further expand the result set with relevant entities by considering features of the queries, entities and the precomputed clusters. Finally, we re-rank the expanded result set with respect to the relevance to the query. We perform a thorough experimental evaluation on the Billions Triple Challenge (BTC12) dataset. The proposed approach shows significant improvements compared to the baseline and state of the art approaches

arXiv.org e-Print Archive

CiteSeerX

Modeling Temporal Evidence from External Collections

Author: Craveiro Olga
Guo Weiwei
Lin Jimmy
Lin Jimmy
Metzler Donald
O'Connor Brendan
Shokouhi Milad
Xu Tan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/12/2018
Field of study

Newsworthy events are broadcast through multiple mediums and prompt the crowds to produce comments on social media. In this paper, we propose to leverage on this behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. In this approach, we build on two major novelties. First, we mine temporal evidences from hundreds of external sources into topic-based external collections to improve the robustness of the detection of relevant time periods. Second, we propose a formal retrieval model that generalizes the use of the temporal dimension across different aspects of the retrieval process. In particular, we show that temporal evidence of external collections can be used to (i) infer a topic's temporal relevance, (ii) select the query expansion terms, and (iii) re-rank the final results for improved precision. Experiments with TREC Microblog collections show that the proposed time-aware retrieval model makes an effective and extensive use of the temporal dimension to improve search results over the most recent temporal models. Interestingly, we observe a strong correlation between precision and the temporal distribution of retrieved and relevant documents.Comment: To appear in WSDM 201

arXiv.org e-Print Archive

Crossref

Index ordering by query-independent measures

Author: Alan F. Smeaton
Amento
Anh
Anh
Anh
Baeza-Yates
Broder
Büttcher
Chakrabarti
Fagni
Ferguson
Garcia
Joachims
Joachims
Kleinberg
Moffat
Ntoulas
Park
Paul Ferguson
Persin
Plachouras
Robertson
Vapnik
Wang
Witten
Xue
Zhai
Zhang
Zipf
Publication venue: 'Elsevier BV'
Publication date: 01/05/2012
Field of study

Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to only search a limited amount of the collection at query-time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to firstly identify the most “important” documents within the collection, and sort documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced

Crossref

Irish Universities

DCU Online Research Access Service

Recommending Items in Social Tagging Systems Using Tag and Time Information

Author: Kowald Dominik
Lacic Emanuel
Parra Denis
Seitlinger Paul
Trattner Christoph
Publication venue
Publication date: 30/06/2014
Field of study

In this work we present a novel item recommendation approach that aims at improving Collaborative Filtering (CF) in social tagging systems using the information about tags and time. Our algorithm follows a two-step approach, where in the first step a potentially interesting candidate item-set is found using user-based CF and in the second step this candidate item-set is ranked using item-based CF. Within this ranking step we integrate the information of tag usage and time using the Base-Level Learning (BLL) equation coming from human memory theory that is used to determine the reuse-probability of words and tags using a power-law forgetting function. As the results of our extensive evaluation conducted on data-sets gathered from three social tagging systems (BibSonomy, CiteULike and MovieLens) show, the usage of tag-based and time information via the BLL equation also helps to improve the ranking and recommendation process of items and thus, can be used to realize an effective item recommender that outperforms two alternative algorithms which also exploit time and tag-based information.Comment: 6 pages, 2 tables, 9 figure

arXiv.org e-Print Archive

CiteSeerX

Application and evaluation of multi-dimensional diversity

Author: Halvey M.
Jose J.M.
Leelanupab T.
Publication venue
Publication date: 01/01/2009
Field of study

Traditional information retrieval (IR) systems mostly focus on finding documents relevant to queries without considering other documents in the search results. This approach works quite well in general cases; however, this also means that the set of returned documents in a result list can be very similar to each other. This can be an undesired system property from a user's perspective. The creation of IR systems that support the search result diversification present many challenges, indeed current evaluation measures and methodologies are still unclear with regards to specific search domains and dimensions of diversity. In this paper, we highlight various issues in relation to image search diversification for the ImageClef 2009 collection and tasks. Furthermore, we discuss the problem of defining clusters/subtopics by mixing diversity dimensions regardless of which dimension is important in relation to information need or circumstances. We also introduce possible applications and evaluation metrics for diversity based retrieval

CiteSeerX

Enlighten