
    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today's smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was the early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. The chapters show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.
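
    The observation that some documents matter more than others is what motivates graded-relevance evaluation. As a minimal illustration only (the summary names no specific metric; nDCG is a standard example of the idea, and the relevance grades below are made up):

        import math

        def dcg(gains):
            # Discounted cumulative gain: documents at later ranks contribute less.
            return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

        def ndcg(gains):
            # Normalize against the ideal ordering, so 1.0 means a perfect ranking.
            best = dcg(sorted(gains, reverse=True))
            return dcg(gains) / best if best > 0 else 0.0

        # Graded relevance of a system's ranked results:
        # 3 = highly relevant, 1 = marginally relevant, 0 = not relevant.
        print(ndcg([3, 0, 1, 2, 0]))  # about 0.92: good, but the grade-2 document is ranked too low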

    Document-level machine translation: ensuring translational consistency of non-local phenomena

    In this thesis, we study the automatic translation of documents by taking into account cross-sentence phenomena. This document-level information is typically ignored by most standard state-of-the-art Machine Translation (MT) systems, which focus on translating texts by processing each of their sentences in isolation. Translating each sentence without looking at its surrounding context can lead to certain types of translation errors, such as inconsistent translations for the same word or for elements in a coreference chain. We introduce methods to attend to document-level phenomena in order to avoid those errors, and thus reach translations that properly convey the original meaning.

    Our research starts by identifying the translation errors related to such document-level phenomena that commonly appear in the output of state-of-the-art Statistical Machine Translation (SMT) systems. For two of those errors, namely inconsistent word translations as well as gender and number disagreements among words, we design simple yet effective post-processing techniques to tackle and correct them. Since these techniques are applied a posteriori, they can access the whole source and target documents, and hence they are able to perform a global analysis and improve the coherence and consistency of the translation. Nevertheless, since such a two-pass decoding strategy is not optimal in terms of efficiency, we also focus on introducing context-awareness during the decoding process itself. To this end, we enhance a document-oriented SMT system with distributional semantic information in the form of bilingual and monolingual word embeddings. In particular, these embeddings are used as Semantic Space Language Models (SSLMs) and as a novel feature function. The goal of the former is to promote word translations that are semantically close to their preceding context, whereas the latter promotes the lexical choice that is closest to its surrounding context for those words that have varying translations throughout the document. In both cases, the context extends beyond sentence boundaries.

    Recently, the MT community has transitioned to the neural paradigm. The final step of our research proposes an extension of the decoding process for a Neural Machine Translation (NMT) framework, independent of the model architecture, by shallow-fusing the information from a neural translation model with the context semantics enclosed in the previously studied SSLMs. The aim of this modification is to introduce the benefits of context information into the decoding process of NMT systems as well, and to obtain an additional validation of the techniques we explored. The automatic evaluation of our approaches does not reflect significant variations. This is expected, since most automatic metrics are neither context- nor semantic-aware, and because the phenomena we tackle are rare, leading to few modifications with respect to the baseline translations. On the other hand, manual evaluations demonstrate the positive impact of our approaches, since human evaluators tend to prefer the translations produced by our document-aware systems. The changes introduced by our enhanced systems are therefore important, since they relate to how humans perceive translation quality for long texts.
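
    As a rough sketch of the shallow-fusion step described above (the thesis's exact SSLM formulation, vocabulary handling, and interpolation weight are not given here, so the function names, the cosine-based score, and the weight lam are all illustrative assumptions):

        import numpy as np

        def sslm_scores(context_ids, embeddings):
            # Score every vocabulary word by cosine similarity to the mean
            # embedding of the preceding target-side context; the context may
            # span sentence boundaries, as in the document-level setting above.
            context = embeddings[context_ids].mean(axis=0)
            context /= np.linalg.norm(context) + 1e-8
            vocab = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
            return vocab @ context  # shape: (vocab_size,)

        def fused_next_token(nmt_log_probs, context_ids, embeddings, lam=0.2):
            # Shallow fusion: interpolate the NMT model's log-probabilities
            # with the context-semantic score and pick the best candidate.
            fused = nmt_log_probs + lam * sslm_scores(context_ids, embeddings)
            return int(np.argmax(fused))

        # Toy usage: 10-word vocabulary, 4-dimensional embeddings, fake NMT output.
        rng = np.random.default_rng(0)
        emb = rng.normal(size=(10, 4))
        log_p = np.log(rng.dirichlet(np.ones(10)))
        print(fused_next_token(log_p, context_ids=[2, 5, 7], embeddings=emb))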

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues. It focuses on emerging approaches for language learning, understanding, production, and grounding, carried out interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
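
    A minimal sketch of the contrast the abstract draws, using gensim (the toy corpus and all hyperparameters are made-up stand-ins; the paper's dependency-based embeddings additionally require arbitrary syntactic contexts, e.g. word2vecf-style training over dependency arcs, which plain gensim does not cover):

        from gensim.models import FastText, Word2Vec

        # Toy stand-in for the limited monolingual data (the paper caps it at
        # 1 million sentences per language).
        corpus = [
            ["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "slept", "on", "the", "rug"],
        ]

        # Standard WORD2VEC baseline: every word is an atomic symbol.
        w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)

        # fastText adds sub-word (character n-gram) information, the feature the
        # paper finds crucial: even unseen words get vectors composed from n-grams.
        ft = FastText(corpus, vector_size=50, window=5, min_count=1, min_n=3, max_n=6)

        print(w2v.wv.most_similar("cat", topn=2))
        print(ft.wv["cats"][:5])  # "cats" never occurs in the corpus, yet gets a vector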