6,468 research outputs found
Efficient & Effective Selective Query Rewriting with Efficiency Predictions
To enhance effectiveness, a user's query can be rewritten internally by the search engine in many ways, for example by applying proximity, or by expanding the query with related terms. However, approaches that benefit effectiveness often have a negative impact on efficiency, which has impacts upon the user satisfaction, if the query is excessively slow. In this paper, we propose a novel framework for using the predicted execution time of various query rewritings to select between alternatives on a per-query basis, in a manner that ensures both effectiveness and efficiency. In particular, we propose the prediction of the execution time of ephemeral (e.g., proximity) posting lists generated from uni-gram inverted index posting lists, which are used in establishing the permissible query rewriting alternatives that may execute in the allowed time. Experiments examining both the effectiveness and efficiency of the proposed approach demonstrate that a 49% decrease in mean response time (and 62% decrease in 95th-percentile response time) can be attained without significantly hindering the effectiveness of the search engine
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming.
A solution to this problem is to only search a limited amount of the collection at query-time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to firstly identify the most âimportantâ documents within the collection, and sort documents within inverted file lists in order of this âimportanceâ. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced
Diversification Based Static Index Pruning - Application to Temporal Collections
Nowadays, web archives preserve the history of large portions of the web. As
medias are shifting from printed to digital editions, accessing these huge
information sources is drawing increasingly more attention from national and
international institutions, as well as from the research community. These
collections are intrinsically big, leading to index files that do not fit into
the memory and an increase query response time. Decreasing the index size is a
direct way to decrease this query response time.
Static index pruning methods reduce the size of indexes by removing a part of
the postings. In the context of web archives, it is necessary to remove
postings while preserving the temporal diversity of the archive. None of the
existing pruning approaches take (temporal) diversification into account.
In this paper, we propose a diversification-based static index pruning
method. It differs from the existing pruning approaches by integrating
diversification within the pruning context. We aim at pruning the index while
preserving retrieval effectiveness and diversity by pruning while maximizing a
given IR evaluation metric like DCG. We show how to apply this approach in the
context of web archives. Finally, we show on two collections that search
effectiveness in temporal collections after pruning can be improved using our
approach rather than diversity oblivious approaches
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.Comment: 37 page
Upper Bound Approximations for BlockMaxWand
BlockMaxWand is a recent advance on the Wand dynamic pruning
technique, which allows efficient retrieval without any effectiveness
degradation to rank K. However, while BMW uses docid-sorted indices,
it relies on recording the upper bound of the term weighting
model scores for each block of postings in the inverted index. Such
a requirement can be disadvantageous in situations such as when
an index must be updated. In this work, we examine the appropriateness
of upper-bound approximation â which have previously
been shown suitable for Wandâ in providing efficient retrieval for
BMW. Experiments on the ClueWeb12 category B13 corpus using
5000 queries from a real search engineâs query log demonstrate that
BMW still provides benefits w.r.t. Wand when approximate upper
bounds are used, and that, if approximations on upper bounds are
tight, BMW with approximate upper bounds can provide efficiency
gains w.r.t.Wand with exact upper bounds, in particular for queries
of short to medium length
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article - Copyright @ 2010 Sage PublicationsPrevious work has shown that distance-similarity visualisation or âspatialisationâ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or âcluster growingâ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods, for geodesic distance estimation, are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion
- âŠ