Neural Methods for Answer Passage Retrieval over Sparse Collections
Recent advances in machine learning have allowed information retrieval (IR) techniques to advance beyond the stage of handcrafting domain-specific features. Specifically, deep neural models incorporate varying levels of features to learn whether a document answers the information need of a query. However, these neural models rely on a large number of parameters to successfully learn a relation between a query and a relevant document. This reliance on many parameters, combined with optimization methods that make small updates, necessitates numerous samples for the neural model to converge on an effective relevance function. This presents a significant obstacle in IR, as relevance judgements are often sparse or noisy and come with a large class imbalance. This is especially true for short text retrieval, where there is often only one relevant passage. The problem is exacerbated when training artificial neural networks, as excessive negative sampling can result in poor performance. Thus, we propose approaching this task through multiple avenues and examining their effectiveness on a non-factoid question answering (QA) task. We first propose learning local embeddings specific to the relevance information of the collection to improve the performance of an upstream neural model. In doing so, we find significantly improved results over standard pre-trained embeddings, despite developing the embeddings only on a small collection that would not be sufficient for a full language model. Leveraging this local representation, and inspired by recent work in machine translation, we introduce a hybrid embedding-based model that incorporates pre-trained embeddings while dynamically constructing local representations from character embeddings.
The hybrid approach relies on pre-trained embeddings to achieve an effective retrieval model, and continually adjusts its character-level abstraction to fit a local representation. We next explore methods to adapt neural models to multiple IR collections, reducing the collection-specific training required and alleviating the need to retrain a neural model's parameters for each new subdomain of a collection. First, we propose an adversarial retrieval model that achieves state-of-the-art performance on out-of-subdomain queries while maintaining in-domain performance. Second, we establish an informed negative sampling approach using a reinforcement learning agent. The agent is trained to directly maximize the performance of a neural IR model on a predefined IR metric by choosing which ranking function to sample negative documents from. This policy-based sampling exposes the neural model to more of a collection and results in a more consistent neural retrieval model over multiple training instances. Lastly, we move towards a universal retrieval function. We first introduce a probe-based inspection of neural relevance models through the lens of standard natural language processing tasks and establish that while seemingly similar QA collections require the same basic abstract information, the final layers that determine relevance differ significantly. We then introduce Universal Retrieval Functions, a method to incorporate new collections using a library of previously trained linear relevance models and a common neural representation.
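The policy-based negative sampling described above can be sketched as a multi-armed bandit that learns which ranking function yields the most useful negatives. The class name, the ε-greedy policy, and the reward signal (the change in a predefined IR metric such as MAP) are illustrative assumptions, not the thesis's actual agent.

```python
import random

class EpsilonGreedySampler:
    """Illustrative bandit agent: each arm is a ranking function that
    negatives can be sampled from; the reward is the observed change in
    a predefined IR metric after training on those negatives."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = arms            # e.g. ["bm25", "random", "neural"]
        self.epsilon = epsilon      # exploration rate
        self.counts = [0] * len(arms)
        self.values = [0.0] * len(arms)

    def select(self):
        """Pick an arm: explore with probability epsilon, else exploit."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        """Incrementally average the rewards observed for this arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a training loop, `select()` chooses the ranking function to draw negative documents from for the next batch, and `update()` feeds back the metric delta, so sampling gradually concentrates on the most informative source of negatives.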
Performance comparison of language models for information retrieval
Vector Space Model (VSM), Statistical Language Model (SLM), and Inference Network are three distinguished language models. Instead of evaluating their performance directly, we assess the retrieval strategies founded on them using the known measures of precision and recall. We also propose Sort Order Rationality (SOR) to enable further performance comparison among different language models. All models are tested on a standard test collection. Three important conclusions are attained: (1) the IR model combining the statistical language modeling and inference network approaches is better than the model founded on the statistical language modeling approach alone, and also better than the model based on the vector space approach; (2) the performance of the IR model based on VSM is similar to that based on SLM; (3) the Dirichlet priors method is often a better option for smoothing a statistical language model. In some respects, these conclusions provide an experimental basis for constructing an efficient information retrieval system.
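The precision and recall measures used for the comparison can be computed as below; this is the standard set-based definition, with hypothetical document IDs.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved result list
    against a set of relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three judged relevant gives precision 0.5 and recall 2/3.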
Concept learning and information inferencing on a high-dimensional semantic space
How to automatically capture a significant portion of relevant background knowledge and keep it up to date has been a challenging problem in current research on logic-based information retrieval. This paper addresses this problem by investigating various information inference mechanisms based on a high-dimensional semantic space constructed from a text corpus using the Hyperspace Analogue to Language (HAL) model. Additionally, the Singular Value Decomposition (SVD) algorithm is considered as an alternative way to enhance the quality of the HAL matrix as well as a mechanism for inferring implicit associations. The different characteristics of these inference mechanisms are demonstrated using examples from the Reuters-21578 collection. Our hope is that the techniques discussed in this paper provide a basis for logic-based IR to progress to large-scale applications.
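A HAL space of the kind described above can be sketched as a windowed co-occurrence matrix. The linearly decaying weights (window − distance + 1) follow the usual HAL formulation; the window size and tokenisation here are assumptions.

```python
from collections import defaultdict

def hal_matrix(tokens, window=5):
    """Build a HAL co-occurrence matrix: each word accumulates weighted
    counts of the words preceding it within the window, with closer
    words weighted more heavily (window - distance + 1)."""
    m = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            m[word][tokens[j]] += window - dist + 1
    return m
```

A word's row and column vectors can then be concatenated to form its semantic vector, which SVD can further smooth, as the abstract suggests.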
Adaptation of machine translation for multilingual information retrieval in the medical domain
Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR.
Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound
splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets.
Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this
particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results.
Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
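The BM25 retrieval model used in the IR experiments above scores a document for a query as follows. This is a simplified sketch of the standard Okapi BM25 formula with common defaults k1 = 1.2 and b = 0.75, not Lucene's exact implementation.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.
    doc_freqs maps each term to the number of documents containing it;
    avgdl is the average document length in the collection."""
    score = 0.0
    dl = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

Because only the query terms that occur in the document contribute, a poorly translated query that misses the document's vocabulary scores zero, which is why translation quality matters for retrieval at all.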
Incorporation of Contextual Retrieval and Data Fusion Approach Towards Improving The Retrieval Precision.
Generally, the functionality of information retrieval (IR) can be divided into two components: one deals with search and retrieval, while the other concerns subject or content analysis. In the search and retrieval part, an IR system presents a ranked list of relevant documents in response to the user-submitted query, which serves as the representation of the user's information need. The ranked list orders documents by their estimated probability of relevance to the query, placing the most relevant document at the top position and so forth. However, queries are often formulated as simplified short words, such as "Java". Such words cannot precisely summarise the user's information need and its context, e.g. "java, programming language" or "java, the island". Consequently, the user's information need is not satisfied, because the most relevant document is not positioned accordingly or too many marginally relevant documents are presented in the ranked list. Moreover, such simplified queries make the context difficult to extract, and in recent years there has been much research interest in contextual retrieval. Like classical IR, contextual retrieval retrieves relevant documents, but it combines the query, the user's context, and search technology into a single framework. Furthermore, in contextual retrieval, the user's context is exploited to identify the documents that are relevant at the time the request occurs.
On the other hand, in order to match queries against document representations, different IR schemes apply different relevance calculations. As a result, retrieval precision often differs across IR schemes, with dissimilar lists of relevant documents presented for the same submitted query. A data fusion approach is therefore implemented in IR to overcome this complication by combining multiple sources of results. The implementation of data fusion in IR involves merging the retrieval results of different IR schemes into a single unified ranked list that is intended to present relevant documents with high precision.
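The merging step described above can be sketched with CombSUM, one of the classic data fusion methods: each scheme's scores are min-max normalised and summed per document. The function and run layout here are illustrative, not the thesis's exact procedure.

```python
def combsum(run_a, run_b):
    """CombSUM fusion of two retrieval runs (dicts of doc -> score):
    min-max normalise each run, sum scores per document, and return
    document IDs sorted by the fused score."""
    def normalise(run):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {doc: (s - lo) / span for doc, s in run.items()}
    a, b = normalise(run_a), normalise(run_b)
    fused = {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}
    return sorted(fused, key=fused.get, reverse=True)
```

A document ranked well by both schemes rises above documents favoured by only one, which is the effect the unified ranked list is after; CombMNZ additionally multiplies each fused score by the number of runs that retrieved the document.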
This study presents an approach that incorporates contextual retrieval and data fusion using a one-keyword query to improve retrieval precision. Methods to identify user context fall into four categories: relevance feedback, user profiles, word-sense disambiguation, and knowledge engineering. To extract user context and to model contextual retrieval, this study implements a term-weighting scheme based on the user profile and knowledge engineering approaches for the Watson scheme, and on the word-sense disambiguation approach for the WordSieve scheme. Five randomly selected documents are submitted to these schemes, and the extracted user context is used to expand the initial query for the retrieval process.
In addition, the feasibility of adopting a data fusion approach was assessed in this study by testing two preconditions, the efficacy and dissimilarity tests for the IR scheme candidates, as there is a possibility that the precision improvement may not be accomplished. Two queries, Java and Jaguar, expanded using the user context extracted by Watson and WordSieve, are submitted, and more than ten thousand documents are collected as the data collection for the experiment. Performance is evaluated using three assessments: the precision-recall graph, precision evaluation based on document rank, and mean average precision. The data fusion experiment based on contextual retrieval results reveals a significant improvement in retrieval precision: measured by mean average precision, the lowest gain over the basic IR scheme is approximately thirty-seven percent, with a ten percent improvement over Watson and a fifteen percent improvement over WordSieve.
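Mean average precision, the headline measure in the evaluation above, is the mean over queries of the average precision of each ranked list. A minimal sketch with hypothetical document IDs:

```python
def average_precision(ranking, relevant):
    """Sum of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgements):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, j) for r, j in zip(rankings, judgements)]
    return sum(aps) / len(aps) if aps else 0.0
```

Because each relevant document's contribution is divided by its rank, pushing relevant documents toward the top of the fused list directly raises the score.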
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR system that combines query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords in its special phonogram alphabet.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system's performance.
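The word-by-word compound translation with a transliteration fallback can be sketched as below. The romanised Japanese words, the dictionary contents, and the transliteration callback are all hypothetical stand-ins; a real system would map the phonogram spelling to candidate English forms and resolve ambiguity probabilistically, as the abstract describes.

```python
def translate_compound(base_words, base_dict, transliterate):
    """Translate a compound term word-by-word using a base-word
    dictionary; words unlisted in the dictionary are handed to a
    transliteration function instead."""
    return " ".join(base_dict.get(w) or transliterate(w) for w in base_words)
```

For instance, with a toy dictionary mapping "jouhou" to "information" and "kensaku" to "retrieval", the compound ["jouhou", "kensaku"] translates to "information retrieval", while an unlisted word falls through to the transliteration callback.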