Neural Methods for Answer Passage Retrieval over Sparse Collections
Recent advances in machine learning have allowed information retrieval (IR) techniques to advance beyond the stage of handcrafting domain-specific features. Specifically, deep neural models incorporate varying levels of features to learn whether a document answers the information need of a query. However, these neural models rely on a large number of parameters to successfully learn a relation between a query and a relevant document. This reliance on many parameters, combined with optimization methods that make small updates, necessitates numerous samples for the neural model to converge on an effective relevance function. This presents a significant obstacle in IR, as relevance judgements are often sparse or noisy and come with a large class imbalance. This is especially true for short text retrieval, where there is often only one relevant passage. The problem is exacerbated when training artificial neural networks, as excessive negative sampling can result in poor performance. Thus, we propose approaching this task through multiple avenues and examining their effectiveness on a non-factoid question answering (QA) task. We first propose learning local embeddings specific to the relevance information of the collection to improve the performance of an upstream neural model. In doing so, we find significantly improved results over standard pre-trained embeddings, despite developing the embeddings only on a small collection that would not be sufficient for a full language model. Leveraging this local representation, and inspired by recent work in machine translation, we introduce a hybrid embedding-based model that incorporates pre-trained embeddings while dynamically constructing local representations from character embeddings.
The hybrid approach relies on pre-trained embeddings to achieve an effective retrieval model, and continually adjusts its character-level abstraction to fit a local representation. We next explore methods to adapt neural models to multiple IR collections, reducing the collection-specific training required and alleviating the need to retrain a neural model's parameters for each new subdomain of a collection. First, we propose an adversarial retrieval model that achieves state-of-the-art performance on out-of-subdomain queries while maintaining in-domain performance. Second, we establish an informed negative sampling approach using a reinforcement learning agent. The agent is trained to directly maximize the performance of a neural IR model on a predefined IR metric by choosing which ranking function to sample negative documents from. This policy-based sampling exposes the neural model to more of a collection and results in a more consistent neural retrieval model over multiple training instances. Lastly, we move towards a universal retrieval function. We first introduce a probe-based inspection of neural relevance models through the lens of standard natural language processing tasks and establish that while seemingly similar QA collections require the same basic abstract information, the final layers that determine relevance differ significantly. We then introduce Universal Retrieval Functions, a method to incorporate new collections using a library of previously trained linear relevance models and a common neural representation.
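The policy-based negative sampling described above can be sketched as a multi-armed bandit that learns which ranking function yields the most useful negatives. The class name, the ε-greedy policy, and the reward signal (the change in a predefined IR metric such as MAP) are illustrative assumptions, not the thesis's actual agent.

```python
import random

class EpsilonGreedySampler:
    """Illustrative bandit agent: each arm is a ranking function that
    negatives can be sampled from; the reward is the observed change in
    a predefined IR metric after training on those negatives."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = arms            # e.g. ["bm25", "random", "neural"]
        self.epsilon = epsilon      # exploration rate
        self.counts = [0] * len(arms)
        self.values = [0.0] * len(arms)

    def select(self):
        """Pick an arm: explore with probability epsilon, else exploit."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        """Incrementally average the rewards observed for this arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a training loop, `select()` chooses the ranking function to draw negative documents from for the next batch, and `update()` feeds back the metric delta, so sampling gradually concentrates on the most informative source of negatives.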
Performance comparison of language models for information retrieval
Vector Space Model (VSM), Statistical Language Model (SLM), and Inference Network are three distinguished language models. Instead of evaluating their performance directly, we assess the retrieval strategies founded on them using the known measures of precision and recall. We also propose Sort Order Rationality (SOR) to enable further performance comparison among different language models. All models are tested on a standard test collection. Three important conclusions are attained: (1) the IR model combining the statistical language modeling and inference network approaches is better than the model founded on the statistical language modeling approach alone, and also better than the model based on the vector space approach; (2) the performance of the IR model based on VSM is similar to that based on SLM; (3) the Dirichlet priors method is often a better option for smoothing a statistical language model. In some respects, these conclusions provide an experimental basis for constructing an efficient information retrieval system.
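The precision and recall measures used for the comparison can be computed as below; this is the standard set-based definition, with hypothetical document IDs.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved result list
    against a set of relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three judged relevant gives precision 0.5 and recall 2/3.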
Concept learning and information inferencing on a high-dimensional semantic space
How to automatically capture a significant portion of relevant background knowledge and keep it up to date has been a challenging problem in current research on logic-based information retrieval. This paper addresses this problem by investigating various information inference mechanisms based on a high-dimensional semantic space constructed from a text corpus using the Hyperspace Analogue to Language (HAL) model. Additionally, the Singular Value Decomposition (SVD) algorithm is considered as an alternative way to enhance the quality of the HAL matrix as well as a mechanism for inferring implicit associations. The different characteristics of these inference mechanisms are demonstrated using examples from the Reuters-21578 collection. Our hope is that the techniques discussed in this paper provide a basis for logic-based IR to progress to large-scale applications.
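A HAL space of the kind described above can be sketched as a windowed co-occurrence matrix. The linearly decaying weights (window − distance + 1) follow the usual HAL formulation; the window size and tokenisation here are assumptions.

```python
from collections import defaultdict

def hal_matrix(tokens, window=5):
    """Build a HAL co-occurrence matrix: each word accumulates weighted
    counts of the words preceding it within the window, with closer
    words weighted more heavily (window - distance + 1)."""
    m = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            m[word][tokens[j]] += window - dist + 1
    return m
```

A word's row and column vectors can then be concatenated to form its semantic vector, which SVD can further smooth, as the abstract suggests.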
Adaptation of machine translation for multilingual information retrieval in the medical domain
Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR.
Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound
splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets.
Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this
particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results.
Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
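The BM25 retrieval model used in the IR experiments above scores a document for a query as follows. This is a simplified sketch of the standard Okapi BM25 formula with common defaults k1 = 1.2 and b = 0.75, not Lucene's exact implementation.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.
    doc_freqs maps each term to the number of documents containing it;
    avgdl is the average document length in the collection."""
    score = 0.0
    dl = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

Because only the query terms that occur in the document contribute, a poorly translated query that misses the document's vocabulary scores zero, which is why translation quality matters for retrieval at all.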
Incorporation of Contextual Retrieval and Data Fusion Approach Towards Improving The Retrieval Precision.
Generally, the functionality of information retrieval (IR) can be divided into two components: one deals with search and retrieval, while the other concerns subject or content analysis. In the search and retrieval part, an IR system presents a ranked list of relevant documents in response to the user-submitted query, which serves as the representation of the user's information need. The ranked list orders documents by their estimated probability of relevance to the query, placing the most relevant document at the top position and so forth. However, queries are often formulated as simplified short words, such as "Java". Such words cannot precisely summarise the user's information need and its context, e.g. "java, programming language" or "java, the island". Consequently, the user's information need is not satisfied, because the most relevant document is not positioned accordingly or too many marginally relevant documents are presented in the ranked list. Moreover, such simplified queries make the context difficult to extract, and in recent years there has been much research interest in contextual retrieval. Like classical IR, contextual retrieval retrieves relevant documents, but it combines the query, the user's context, and search technology into a single framework. Furthermore, in contextual retrieval, the user's context is exploited to identify the documents that are relevant at the time the request occurs.
On the other hand, in order to match queries against document representations, different IR schemes apply different relevance calculations. As a result, retrieval precision often differs across IR schemes, with dissimilar lists of relevant documents presented for the same submitted query. A data fusion approach is therefore implemented in IR to overcome this complication by combining multiple sources of results. The implementation of data fusion in IR involves merging the retrieval results of different IR schemes into a single unified ranked list that is intended to present relevant documents with high precision.
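The merging step described above can be sketched with CombSUM, one of the classic data fusion methods: each scheme's scores are min-max normalised and summed per document. The function and run layout here are illustrative, not the thesis's exact procedure.

```python
def combsum(run_a, run_b):
    """CombSUM fusion of two retrieval runs (dicts of doc -> score):
    min-max normalise each run, sum scores per document, and return
    document IDs sorted by the fused score."""
    def normalise(run):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {doc: (s - lo) / span for doc, s in run.items()}
    a, b = normalise(run_a), normalise(run_b)
    fused = {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}
    return sorted(fused, key=fused.get, reverse=True)
```

A document ranked well by both schemes rises above documents favoured by only one, which is the effect the unified ranked list is after; CombMNZ additionally multiplies each fused score by the number of runs that retrieved the document.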
This study presents an approach that incorporates contextual retrieval and data fusion using a one-keyword query to improve retrieval precision. Methods to identify user context fall into four categories: relevance feedback, user profiles, word-sense disambiguation, and knowledge engineering. To extract user context and to model contextual retrieval, this study implements a term-weighting scheme based on the user profile and knowledge engineering approaches for the Watson scheme, and on the word-sense disambiguation approach for the WordSieve scheme. Five randomly selected documents are submitted to these schemes, and the extracted user context is used to expand the initial query for the retrieval process.
In addition, the feasibility of adopting a data fusion approach was assessed in this study by testing two preconditions, the efficacy and dissimilarity tests for the IR scheme candidates, as there is a possibility that the precision improvement may not be accomplished. Two queries, Java and Jaguar, expanded using the user context extracted by Watson and WordSieve, are submitted, and more than ten thousand documents are collected as the data collection for the experiment. Performance is evaluated using three assessments: the precision-recall graph, precision evaluation based on document rank, and mean average precision. The data fusion experiment based on contextual retrieval results reveals a significant improvement in retrieval precision: measured by mean average precision, the lowest gain over the basic IR scheme is approximately thirty-seven percent, with a ten percent improvement over Watson and a fifteen percent improvement over WordSieve.
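Mean average precision, the headline measure in the evaluation above, is the mean over queries of the average precision of each ranked list. A minimal sketch with hypothetical document IDs:

```python
def average_precision(ranking, relevant):
    """Sum of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgements):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, j) for r, j in zip(rankings, judgements)]
    return sum(aps) / len(aps) if aps else 0.0
```

Because each relevant document's contribution is divided by its rank, pushing relevant documents toward the top of the fused list directly raises the score.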
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR system that combines query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords in its special phonogram alphabet.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system's performance.
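The word-by-word compound translation with a transliteration fallback can be sketched as below. The romanised Japanese words, the dictionary contents, and the transliteration callback are all hypothetical stand-ins; a real system would map the phonogram spelling to candidate English forms and resolve ambiguity probabilistically, as the abstract describes.

```python
def translate_compound(base_words, base_dict, transliterate):
    """Translate a compound term word-by-word using a base-word
    dictionary; words unlisted in the dictionary are handed to a
    transliteration function instead."""
    return " ".join(base_dict.get(w) or transliterate(w) for w in base_words)
```

For instance, with a toy dictionary mapping "jouhou" to "information" and "kensaku" to "retrieval", the compound ["jouhou", "kensaku"] translates to "information retrieval", while an unlisted word falls through to the transliteration callback.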