Probabilistic retrieval models - relationships, context-specific application, selection and implementation
PhD thesis. Retrieval models are the core components of information retrieval systems; they determine the document
and query representations, as well as the document ranking schemes. TF-IDF, the binary
independence retrieval (BIR) model and language modelling (LM) are three of the most influential
contemporary models due to their stability and performance. The BIR model and LM
have probabilistic theory as their basis, whereas TF-IDF is viewed as a heuristic model whose
theoretical justification has long fascinated researchers.
This thesis first investigates the parallel derivation of the BIR model, LM and the Poisson model
with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between
the BIR model and LM, and derives TF-IDF from the probabilistic framework.
Then, the thesis presents the probabilistic logical modelling of the retrieval models. Various
ways to estimate and aggregate probabilities, and alternative implementations for non-probabilistic
operators, are demonstrated. Typical models have been implemented.
The next contribution concerns the use of context-specific frequencies, i.e., the frequencies
counted based on assorted element types or within different text scopes. The hypothesis
is that they can help to rank the elements in structured document retrieval. The thesis applies
context-specific frequencies to the term weighting schemes in these models, and the outcome is a
generalised retrieval model with regard to both element and document ranking.
The retrieval models behave differently on the same query set: for some queries one model
performs better; for other queries another model is superior. Therefore, one idea to improve the
overall performance of a retrieval system is to choose for each query the model that is likely
to perform best. This thesis proposes and empirically explores a model selection method
based on the correlation between query features and query performance, which contributes to the
methodology of dynamically choosing a model.
In summary, this thesis contributes a study of probabilistic models and their relationships,
the probabilistic logical modelling of retrieval models, the usage and effect of context-specific
frequencies in models, and the selection of retrieval models.
Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems
Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where the index is stored entirely on disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation. © Springer-Verlag Berlin Heidelberg 2006
Parallel text retrieval on PC clusters
The inverted index partitioning problem is investigated for parallel text retrieval
systems. The objective is to perform efficient query processing on an inverted
index distributed across a PC cluster. Alternative strategies are considered and
evaluated for inverted index partitioning, where index entries are distributed according
to their document-ids or term-ids. The performance of both partitioning
schemes depends on the total number of disk accesses and the total volume of
communication in the system. In document-id partitioning, the total volume of
communication is naturally minimal, whereas the total number of disk accesses
may be larger than in term-id partitioning. On the other hand, in term-id
partitioning the total number of disk accesses already equals the lower
bound achieved by the sequential algorithm, although the total communication volume
may be quite large. Previous studies apply these partitioning
schemes in a round-robin fashion and compare their performance by simulation.
In this work, a parallel text retrieval system is designed and implemented
on a PC cluster. We adopted hypergraph-theoretical partitioning models and
carried out a performance comparison of round-robin and hypergraph-theoretical
partitioning schemes on our parallel text retrieval system. We also designed and
implemented a query interface and a user interface for our system.
Çatal, Aytül. M.S.
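The two round-robin schemes described in this abstract can be illustrated with a minimal sketch. The toy index, the function names and the `processors_touched` measure are assumptions made for illustration; they are not taken from the thesis, and the hypergraph-based models are not shown.

```python
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, term_frequency) postings.
# The data and all names below are illustrative assumptions.
INDEX = {
    "parallel": [(0, 2), (1, 1), (3, 1)],
    "text": [(0, 1), (2, 3)],
    "retrieval": [(1, 2), (2, 1), (3, 4)],
}


def partition_by_term(index, num_procs):
    """Term-id partitioning: a term's whole posting list goes to one processor."""
    parts = [dict() for _ in range(num_procs)]
    for term_id, term in enumerate(sorted(index)):
        parts[term_id % num_procs][term] = list(index[term])  # round-robin on term id
    return parts


def partition_by_document(index, num_procs):
    """Document-id partitioning: each posting goes to the processor owning its document."""
    parts = [defaultdict(list) for _ in range(num_procs)]
    for term, postings in index.items():
        for doc_id, tf in postings:
            parts[doc_id % num_procs][term].append((doc_id, tf))  # round-robin on doc id
    return [dict(p) for p in parts]


def processors_touched(parts, query_terms):
    """Count processors that must read at least one posting list for the query."""
    return sum(1 for p in parts if any(t in p for t in query_terms))


term_parts = partition_by_term(INDEX, 2)
doc_parts = partition_by_document(INDEX, 2)
print(processors_touched(term_parts, ["retrieval"]))
print(processors_touched(doc_parts, ["retrieval"]))
```

In this toy setting a single-term query reads one posting list under term-id partitioning but one partial list per processor under document-id partitioning, which mirrors the disk-access trade-off the abstract describes. Communication volume is not modelled here.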
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words representation. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data"
We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
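The sequential-scanning idea can be sketched in a few lines of single-process Python. This is not the code from the project above; the toy documents, the queries, the term-count score and all function names are assumptions chosen only to show the map, shuffle and reduce roles.

```python
from collections import defaultdict
import heapq

# Hypothetical toy collection and queries; not data from the paper.
DOCS = {
    "d1": "cheap flights to berlin",
    "d2": "berlin city guide and berlin hotels",
    "d3": "map reduce for large scale data",
}
QUERIES = {"q1": ["berlin"], "q2": ["map", "reduce"]}


def map_doc(doc_id, text, queries):
    """Map step: scan one document and emit (query_id, (score, doc_id)) pairs."""
    tokens = text.split()
    for query_id, terms in queries.items():
        # Simple term-count score, used here only as a placeholder ranking function.
        score = sum(tokens.count(t) for t in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(scored_docs, k=10):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return heapq.nlargest(k, scored_docs)


def run(docs, queries, k=10):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, text in docs.items():
        for query_id, pair in map_doc(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {q: reduce_query(pairs, k) for q, pairs in grouped.items()}


results = run(DOCS, QUERIES)
print(results)
```

Because every document is scanned for every query, a new ranking function only requires changing the score line in `map_doc`; no index has to be rebuilt.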
Transitive probabilistic CLIR models.
Transitive translation could be a useful technique to increase the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness of up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.
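One common way to make a probabilistic translation model transitive is to marginalise over words in a pivot language, P(t | s) = Σ_p P(p | s) · P(t | p). The abstract does not state which setups the paper uses, so the following is only a sketch under that assumption; the language pairs, probabilities and the `compose` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy translation tables P(target | source); values are illustrative only.
NL_TO_EN = {"fiets": {"bicycle": 0.7, "bike": 0.3}}
EN_TO_FR = {
    "bicycle": {"bicyclette": 0.6, "vélo": 0.4},
    "bike": {"vélo": 1.0},
}


def compose(src_to_pivot, pivot_to_tgt):
    """Compose two translation tables by summing over pivot-language words."""
    composed = {}
    for src, pivots in src_to_pivot.items():
        probs = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot words with no known translation contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                probs[tgt] += p_pivot * p_tgt
        composed[src] = dict(probs)
    return composed


NL_TO_FR = compose(NL_TO_EN, EN_TO_FR)
print(NL_TO_FR["fiets"])
```

The composed table can then be used wherever a direct translation model would be, for example to weight translated query terms.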
Disambiguation strategies for cross-language information retrieval
This paper gives an overview of tools and methods for Cross-Language Information Retrieval (CLIR) developed within the Twenty-One project. The tools and methods are evaluated with the TREC CLIR task document collection using Dutch queries on the English document base. The main issue addressed here is an evaluation of two approaches to disambiguation. The underlying question is whether a lot of effort should be put into finding the correct translation for each query term before searching, or whether searching with more than one possible translation leads to better results. The experimental study suggests that the quality of search methods is more important than the quality of disambiguation methods. Good retrieval methods are able to disambiguate translated queries implicitly during searching.