    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. The chapters show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.

    Multi-word expression-sensitive word alignment

    This paper presents a new word alignment method which incorporates knowledge about Bilingual Multi-Word Expressions (BMWEs). Our method of word alignment first extracts such BMWEs in a bidirectional way for a given corpus and then starts conventional word alignment, considering the properties of BMWEs in their grouping as well as their alignment links. We give partial annotation of alignment links as prior knowledge to the word alignment process; by replacing the maximum likelihood estimate in the M-step of the IBM Models with the Maximum A Posteriori (MAP) estimate, prior knowledge about BMWEs is embedded in the prior in this MAP estimate. In our experiments, we saw an improvement of 0.77 BLEU points absolute in JP–EN. Except for one case, our method gave better results than the method using only BMWEs grouping. Even though this paper does not directly address the issues in Cross-Lingual Information Retrieval (CLIR), it discusses an approach of direct relevance to the field. This approach could be viewed as the opposite of current trends in CLIR on semantic space that incorporate a notion of order in the bag-of-words model (e.g. co-occurrences).
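The difference between a maximum likelihood and a MAP M-step can be illustrated with a minimal sketch. It assumes a Dirichlet prior whose parameters enter as additive pseudo-counts on word pairs; the abstract does not state the form of the prior, so this choice, the function name `m_step`, and the toy counts are all hypothetical and not the paper's implementation.

```python
from collections import defaultdict


def m_step(expected_counts, pseudo_counts=None):
    """Re-estimate translation probabilities t(f | e) from expected counts.

    expected_counts: dict mapping (f, e) -> expected count from the E-step.
    pseudo_counts:   optional dict mapping (f, e) -> non-negative pseudo-count
                     (alpha - 1 under an assumed Dirichlet prior). This is a
                     hypothetical way to encode prior alignment links.
    With pseudo_counts=None the result is the maximum likelihood estimate.
    """
    pseudo_counts = pseudo_counts or {}
    combined = {}
    totals = defaultdict(float)
    for pair in set(expected_counts) | set(pseudo_counts):
        value = expected_counts.get(pair, 0.0) + pseudo_counts.get(pair, 0.0)
        combined[pair] = value
        totals[pair[1]] += value  # normalise per source word e
    return {
        pair: value / totals[pair[1]]
        for pair, value in combined.items()
        if totals[pair[1]] > 0
    }


# Toy expected counts: the E-step could not decide between two candidates.
counts = {("maison", "house"): 1.0, ("la", "house"): 1.0}

mle = m_step(counts)
# Prior knowledge (hypothetical): "maison" and "house" are known to be linked.
map_estimate = m_step(counts, {("maison", "house"): 2.0})
```

In this toy case the maximum likelihood estimate splits the probability mass evenly, while the pseudo-counts shift it toward the pair favoured by the prior.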

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task.
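The contrast between precision-oriented and recall-oriented evaluation can be illustrated with a small sketch that compares average precision with recall at a cutoff. The document identifiers and the two rankings are made up for illustration, and the functions below implement the standard textbook definitions rather than the metric proposed in the thesis.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant document
    return total / len(relevant) if relevant else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


# Hypothetical topic with four relevant documents.
relevant = {"d1", "d2", "d3", "d4"}

# Run A finds two relevant documents at the very top and misses the rest.
run_a = ["d1", "d2", "x1", "x2", "x3", "x4", "x5"]
# Run B ranks all four relevant documents, but lower down.
run_b = ["x1", "x2", "x3", "d1", "d2", "d3", "d4"]

ap_a, ap_b = average_precision(run_a, relevant), average_precision(run_b, relevant)
rec_a, rec_b = recall_at_k(run_a, relevant, 7), recall_at_k(run_b, relevant, 7)
```

Here run A scores higher on average precision even though it misses half of the relevant documents, while run B retrieves all of them within the cutoff. A searcher willing to read the whole list, as in patent search, would be better served by run B.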

    Experiments on domain adaptation for patent machine translation in the PLuTO project

    The PLuTO project (Patent Language Translations Online) aims to provide a rapid solution for the online retrieval and translation of patent documents through the integration of a number of existing state-of-the-art components provided by the project partners. The paper presents some of the experiments on patent domain adaptation of the Machine Translation (MT) systems used in the PLuTO project. The experiments use the International Patent Classification for domain adaptation and are focused on the English–French language pair.

    Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

    We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.

    Cross-language Information Retrieval

    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures.

    Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting

    This thesis discusses how to incorporate linguistic knowledge into an SMT system. Although one important category of linguistic knowledge is that obtained by a constituent / dependency parser, a POS / super tagger, and a morphological analyser, linguistic knowledge here includes larger domains than this: Multi-Word Expressions, Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named-entities, coreferences, and transliterations. The first discussion is about word alignment, where we propose an MWE-sensitive word aligner. The second discussion is about the smoothing methods for a language model and a translation model, where we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that it can capture both frequent and linguistically important phenomena. This offers one way to address the problem that statistical models often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.