
    Performance comparison of language models for information retrieval

    Vector Space Model (VSM), Statistical Language Model (SLM) and Inference Network are three well-known language models. Instead of evaluating their performance directly, we evaluate the information retrieval strategies built on them using the standard measures of precision and recall. In addition, we propose Sort Order Rationality (SOR) as a further measure for comparing the performance of different language models. All models are tested on a standard test collection. Three conclusions are reached: (1) The IR model that combines the statistical language modeling and inference network approaches performs better than the model built on statistical language modeling alone, and also better than the model based on vector space modeling. (2) The performance of the IR model based on VSM is similar to that of the model based on SLM. (3) The Dirichlet priors method is often the better option for smoothing a statistical language model. These conclusions provide some experimental basis for constructing an efficient information retrieval system.
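
To make conclusion (3) concrete, here is a minimal sketch of query-likelihood scoring with Dirichlet prior smoothing. It is not the paper's implementation: the function names, the toy documents and the default `mu=2000.0` are assumptions chosen for illustration. Each query term's probability is estimated as (tf in document + mu * P(term | collection)) / (document length + mu).

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query, doc, coll_tf, coll_len, mu=2000.0):
    """Log query likelihood of `doc` under Dirichlet prior smoothing.

    `mu` is the smoothing parameter; 2000.0 is an illustrative default,
    not a value taken from the abstract.
    """
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len  # Counter yields 0 for unseen terms
        if p_coll == 0.0:
            continue  # term absent from the whole collection: skip it
        score += math.log((doc_tf[term] + mu * p_coll) / (len(doc) + mu))
    return score


def rank(query, docs, mu=2000.0):
    """Return document names ordered by decreasing smoothed likelihood."""
    coll_tf = Counter(term for doc in docs.values() for term in doc)
    coll_len = sum(coll_tf.values())
    scores = {
        name: dirichlet_log_likelihood(query, doc, coll_tf, coll_len, mu)
        for name, doc in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": ["language", "model", "retrieval", "model"],
    "d2": ["vector", "space", "retrieval", "system"],
}
ranking = rank(["language", "model"], docs, mu=10.0)
```

With a small `mu` the document's own term counts dominate; with a large `mu` the estimate moves toward the collection-wide distribution.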

    Concept learning and information inferencing on a high-dimensional semantic space

    How to automatically capture a significant portion of relevant background knowledge and keep it up to date has been a challenging problem in current research on logic-based information retrieval. This paper addresses this problem by investigating various information inference mechanisms based on a high-dimensional semantic space constructed from a text corpus using the Hyperspace Analogue to Language (HAL) model. Additionally, the Singular Value Decomposition (SVD) algorithm is considered both as an alternative way to enhance the quality of the HAL matrix and as a mechanism for inferring implicit associations. The different characteristics of these inference mechanisms are demonstrated using examples from the Reuters-21578 collection. Our hope is that the techniques discussed in this paper provide a basis for logic-based IR to progress to large-scale applications.
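
As a rough illustration of how a HAL-style semantic space is built, the sketch below slides a fixed-size window over a token stream and adds a weight of `window - distance + 1` for every word that precedes the current word inside the window. This is the commonly described HAL weighting, not necessarily the exact variant used in the paper. The function name, the window size and the example sentence are assumptions, and the SVD step is not shown.

```python
from collections import defaultdict


def build_hal(tokens, window=5):
    """Build a HAL-style co-occurrence table.

    hal[word][context] accumulates a weight for each `context` word that
    appears before `word` within `window` positions; closer words get
    larger weights (window - distance + 1).
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # ran off the start of the text
            hal[word][tokens[j]] += window - distance + 1
    # Convert to plain dicts so lookups no longer create empty entries.
    return {word: dict(contexts) for word, contexts in hal.items()}


# Hypothetical example sentence, for illustration only.
tokens = ["the", "horse", "raced", "past", "the", "barn"]
hal = build_hal(tokens, window=2)
```

Each row of the resulting table can be read as a high-dimensional vector for one word; a dense matrix built from these rows is what a dimensionality reduction such as SVD would operate on.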

    Adaptation of machine translation for multilingual information retrieval in the medical domain

    Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.

    Incorporation of Contextual Retrieval and Data Fusion Approach Towards Improving The Retrieval Precision.

    Broadly, the functionality of information retrieval (IR) can be divided into two parts: one deals with search and retrieval, while the other concerns subject or content analysis. In the search and retrieval part, an IR system presents a ranked list of relevant documents in response to the query the user submits as a representation of their information need. The ranked list reflects the probability that each document is relevant to the query, with the most relevant document at the top. However, queries are often formulated as short, simplified words, such as "Java". Such words cannot precisely summarise the user's information need and its context, i.e. "java, programming language" or "java, the island". Consequently, the user's information need is not satisfied, because the most relevant document is not positioned accordingly or too many relevant documents are presented in the ranked list. Moreover, context is not easily extracted from such a simplified query, and in recent years there has been much research interest in contextual retrieval. Like IR, contextual retrieval retrieves relevant documents, but it combines the query, the user context and search technology in a single framework. Furthermore, in contextual retrieval the user's context is exploited to identify the relevant documents that are useful at the time the request occurs. On the other hand, different IR schemes are applied to calculate this probability when matching queries against document representations. As a result, retrieval precision often differs between IR schemes, which present dissimilar lists of relevant documents for the same query. Data fusion is therefore applied in IR to overcome this problem by combining multiple sources of results. Implementing data fusion in IR involves merging the retrieval results from different IR schemes into a single unified ranked list that is expected to contain highly relevant documents. This study presents an approach that incorporates contextual retrieval and data fusion, using a one-keyword query, to improve retrieval precision. The methods for identifying user context fall into four approaches: relevance feedback, user profiles, word-sense disambiguation and knowledge engineering. To extract user context and to model contextual retrieval, this study implements a term-weighting scheme based on the user-profile and knowledge-engineering approaches for the Watson scheme and on the word-sense disambiguation approach for the WordSieve scheme. Five randomly selected documents are submitted to these schemes, and the extracted user context is used to expand the initial query for the retrieval process. In addition, because the precision improvement may not be achieved, the feasibility of adopting a data fusion approach was assessed by testing two preconditions: the efficacy and dissimilarity tests for the candidate IR schemes. Two queries, Java and Jaguar, expanded with the user context extracted by Watson and WordSieve, are submitted, and more than ten thousand documents are collected as the data collection for the experiment. Performance is evaluated using three assessments: the precision-recall graph, precision evaluation based on document rank, and mean average precision. The data fusion experiment based on contextual retrieval results reveals a significant improvement in retrieval precision: based on the mean average precision calculation, the lowest gain over the basic IR scheme is approximately thirty-seven percent, with a ten percent improvement over Watson and a fifteen percent improvement over WordSieve.
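
The abstract does not say which fusion rule the study uses, so the sketch below substitutes CombSUM with min-max score normalization, a common baseline for merging ranked lists, and pairs it with the average precision computation that underlies mean average precision. All function names, document identifiers and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one scheme's scores to [0, 1] so schemes are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several {doc: score} runs by summing their normalized scores.

    CombSUM is an assumed stand-in; the study's actual fusion rule is
    not given in the abstract.
    """
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Highest fused score first; ties broken by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical scores from two IR schemes for the same query.
run_a = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
run_b = {"d2": 0.9, "d4": 0.5, "d1": 0.1}
fused_ranking = comb_sum([run_a, run_b])
ap = average_precision(fused_ranking, relevant={"d2", "d4"})
```

Mean average precision is the mean of `average_precision` over all test queries. Comparing it for each single scheme against the fused ranking is the kind of comparison the abstract reports.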

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has recently become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system that combines query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords using its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance.