Learning to Diversify Web Search Results with a Document Repulsion Model
Search diversification (also called diversity search) is an important approach to tackling the query ambiguity problem in information retrieval. It aims to diversify search results that are originally ranked by their probability of relevance to a given query, re-ranking them to cover as many different aspects (or subtopics) of the query as possible. Most existing diversity search models heuristically balance relevance ranking and diversity ranking, yet lack an efficient learning mechanism for reaching an optimized parameter setting. To address this problem, we propose a learning-to-diversify approach that can directly optimize search diversification performance (in terms of any effectiveness metric). We first extend the ranking function of a widely used learning-to-rank framework, LambdaMART, so that the extended ranking function can correlate relevance and diversity indicators. Furthermore, we develop an effective learning algorithm, the Document Repulsion Model (DRM), to train the ranking function based on a Document Repulsion Theory (DRT). DRT assumes that two result documents covering similar query aspects (i.e., subtopics) should be mutually repulsive, for the purpose of search diversification. Accordingly, the proposed DRM exerts a repulsion force between each pair of similar documents during learning, and includes the diversity effectiveness metric to be optimized as part of the loss function. Existing learning-based diversity search methods often involve an iterative sequential selection process during ranking, which is computationally complex and time-consuming to train, whereas our proposed learning strategy can largely reduce the time cost. Extensive experiments are conducted on the TREC diversity track data (2009, 2010 and 2011). The results demonstrate that our model significantly outperforms a number of baselines in terms of effectiveness and robustness.
Further, an efficiency analysis shows that the proposed DRM has a lower computational complexity than the state-of-the-art learning-to-diversify methods.
Rhetorical relations for information retrieval
Typically, every part in most coherent text has some plausible reason for its
presence, some function that it performs to the overall semantics of the text.
Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts
of a text are linked to each other. Knowledge about this so-called discourse
structure has been applied successfully to several natural language processing
tasks. This work studies the use of rhetorical relations for Information
Retrieval (IR): Is there a correlation between certain rhetorical relations and
retrieval performance? Can knowledge about a document's rhetorical relations be
useful to IR? We present a language model modification that considers
rhetorical relations when estimating the relevance of a document to a query.
Empirical evaluation of different versions of our model on TREC settings shows
that certain rhetorical relations can benefit retrieval effectiveness notably
(> 10% in mean average precision over a state-of-the-art baseline)
A Vertical PRF Architecture for Microblog Search
In microblog retrieval, query expansion can be essential to obtain good
search results due to the short size of queries and posts. Since information in
microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance
feedback (PRF) with an external corpus has a higher chance of retrieving more
relevant documents and improving ranking. In this paper, we focus on the
research question: how can we reduce the computational cost of query expansion
while maintaining the same retrieval precision as standard PRF? Therefore, we
propose to accelerate the query expansion step of pseudo-relevance feedback.
The hypothesis is that using an expansion corpus organized into verticals for
expanding the query will lead to a more efficient query expansion process and
improved retrieval effectiveness. Thus, the proposed query expansion method
uses a distributed search architecture and resource selection algorithms to
provide an efficient query expansion process. Experiments on the TREC Microblog
datasets show that the proposed approach can match or outperform standard PRF
in MAP and NDCG@30, with a computational cost that is three orders of magnitude
lower. Comment: To appear in ICTIR 201
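For orientation, the "standard PRF" baseline that the abstract above compares against can be sketched roughly as follows. This is a minimal illustration only: the toy term-count scoring, the function names `rank` and `expand_query`, and the parameters `top_k` and `n_terms` are assumptions made here, and the sketch does not reproduce the paper's vertical architecture or its resource selection algorithms.

```python
from collections import Counter


def rank(docs, query_terms):
    """Rank document ids by total query-term occurrences (toy scoring, assumed)."""
    scores = {
        doc_id: sum(text.lower().split().count(t) for t in query_terms)
        for doc_id, text in docs.items()
    }
    matching = [d for d in scores if scores[d] > 0]
    # Highest score first; ties broken by document id for determinism.
    return sorted(matching, key=lambda d: (-scores[d], d))


def expand_query(docs, query, top_k=2, n_terms=2):
    """Pseudo-relevance feedback: treat the top_k results as relevant and
    append their n_terms most frequent non-query words to the query."""
    terms = query.lower().split()
    feedback_docs = rank(docs, terms)[:top_k]
    counts = Counter(
        word
        for doc_id in feedback_docs
        for word in docs[doc_id].lower().split()
        if word not in terms
    )
    # Most frequent first; ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return terms + [word for word, _ in ordered[:n_terms]]
```

In this sketch the cost of expansion grows with the size of the corpus that `rank` has to scan, which is the step the abstract proposes to accelerate by restricting expansion to selected verticals.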
Improving Entity Retrieval on Structured Data
The increasing amount of data on the Web, in particular of Linked Data, has
led to a diverse landscape of datasets, which makes entity retrieval a
challenging task. Explicit cross-dataset links, for instance to indicate
co-references or related entities, can significantly improve entity retrieval.
However, only a small fraction of entities are interlinked through explicit
statements. In this paper, we propose a two-fold entity retrieval approach. In
a first, offline preprocessing step, we cluster entities based on the
x-means and spectral clustering algorithms. In the second step,
we propose an optimized retrieval model which takes advantage of our
precomputed clusters. For a given set of entities retrieved by the BM25F
retrieval approach and a given user query, we further expand the result set
with relevant entities by considering features of the queries, entities and the
precomputed clusters. Finally, we re-rank the expanded result set with respect
to the relevance to the query. We perform a thorough experimental evaluation on
the Billion Triple Challenge (BTC12) dataset. The proposed approach shows
significant improvements compared to the baseline and state-of-the-art
approaches.
Modeling Temporal Evidence from External Collections
Newsworthy events are broadcast through multiple mediums and prompt the
crowds to produce comments on social media. In this paper, we propose to
leverage these behavioral dynamics to estimate the most relevant time periods
for an event (i.e., query). Recent advances have shown how to improve the
estimation of the temporal relevance of such topics. Our approach builds
on two major novelties. First, we mine temporal evidence from hundreds of
external sources into topic-based external collections to improve the
robustness of the detection of relevant time periods. Second, we propose a
formal retrieval model that generalizes the use of the temporal dimension
across different aspects of the retrieval process. In particular, we show that
temporal evidence of external collections can be used to (i) infer a topic's
temporal relevance, (ii) select the query expansion terms, and (iii) re-rank
the final results for improved precision. Experiments with TREC Microblog
collections show that the proposed time-aware retrieval model makes an
effective and extensive use of the temporal dimension to improve search results
over the most recent temporal models. Interestingly, we observe a strong
correlation between precision and the temporal distribution of retrieved and
relevant documents. Comment: To appear in WSDM 201
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming.
A solution to this problem is to only search a limited amount of the collection at query-time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to firstly identify the most “important” documents within the collection, and sort documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced
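The idea described above, sorting each inverted list by a query-independent importance score and reading only its head at query time, can be sketched as follows. This is a minimal illustration under stated assumptions: the `importance` values, the `max_postings` cutoff, and the toy term-match scoring are hypothetical and are not the measures or the scoring function used in the paper.

```python
from collections import defaultdict


def build_index(docs, importance):
    """Build an inverted index whose posting lists are sorted by a static,
    query-independent importance score, most important document first."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    for term in index:
        index[term].sort(key=lambda d: (-importance[d], d))
    return dict(index)


def search(index, importance, query, max_postings):
    """Score documents while examining at most max_postings entries per term.

    Returns the ranked document ids and the number of postings examined."""
    scores = defaultdict(float)
    examined = 0
    for term in query.lower().split():
        # Only the head of the list is read; less important documents are skipped.
        for doc_id in index.get(term, [])[:max_postings]:
            scores[doc_id] += 1.0  # toy term-match score (assumption)
            examined += 1
    ranked = sorted(scores, key=lambda d: (-scores[d], -importance[d], d))
    return ranked, examined
```

Lowering `max_postings` reduces the number of postings examined, at the risk of missing relevant but low-importance documents; the abstract reports that this trade-off was favourable on the TREC Terabyte collection.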
Recommending Items in Social Tagging Systems Using Tag and Time Information
In this work we present a novel item recommendation approach that aims at
improving Collaborative Filtering (CF) in social tagging systems using the
information about tags and time. Our algorithm follows a two-step approach,
where in the first step a potentially interesting candidate item-set is found
using user-based CF and in the second step this candidate item-set is ranked
using item-based CF. Within this ranking step we integrate the information of
tag usage and time using the Base-Level Learning (BLL) equation coming from
human memory theory that is used to determine the reuse-probability of words
and tags using a power-law forgetting function.
As the results of our extensive evaluation conducted on data-sets gathered
from three social tagging systems (BibSonomy, CiteULike and MovieLens) show,
the use of tag-based and time information via the BLL equation also helps to
improve the ranking and recommendation of items and can thus be used
to realize an effective item recommender that outperforms two alternative
algorithms which also exploit time and tag-based information. Comment: 6 pages, 2 tables, 9 figures
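The Base-Level Learning equation mentioned in the abstract above is commonly written as B = ln(sum over past uses j of (t_now - t_j)^(-d)), where d is a decay exponent. A minimal sketch of that power-law forgetting function follows. The function name, the default `decay=0.5`, and the handling of empty histories are assumptions made here; how the paper combines this value with item-based CF is not shown.

```python
import math


def base_level_activation(timestamps, now, decay=0.5):
    """Base-level activation of a tag from the times of its past uses.

    Each past use contributes (now - t) ** -decay, so recent and frequent
    uses yield a higher activation. Uses at or after `now` are ignored."""
    ages = [now - t for t in timestamps if now - t > 0]
    if not ages:
        return float("-inf")  # never used before: lowest possible activation
    return math.log(sum(age ** -decay for age in ages))
```

For example, a tag used once very recently scores higher than a tag used once long ago, and additional past uses raise the activation further.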
Application and evaluation of multi-dimensional diversity
Traditional information retrieval (IR) systems mostly focus on finding documents relevant to queries without considering other documents in the search results. This approach works quite well in general cases; however, it also means that the documents returned in a result list can be very similar to each other. This can be an undesired system property from a user's perspective. The creation of IR systems that support search result diversification presents many challenges; indeed, current evaluation measures and methodologies are still unclear with regard to specific search domains and dimensions of diversity. In this paper, we highlight various issues in relation to image search diversification for the ImageCLEF 2009 collection and tasks. Furthermore, we discuss the problem of defining clusters/subtopics by mixing diversity dimensions regardless of which dimension is important in relation to information need or circumstances. We also introduce possible applications and evaluation metrics for diversity-based retrieval.