
    User requirement elicitation for cross-language information retrieval

    Who are the users of a cross-language retrieval system? Under what circumstances do they need to perform such multi-language searches? How will the task and the context of use affect successful interaction with the system? Answers to these questions were explored in a user study performed as part of the design stages of Clarity, an EU-funded project on cross-language information retrieval. The findings resulted in a rethink of the planned user interface and a consequent expansion of the set of services offered. This paper reports on the methodology and techniques used for the elicitation of user requirements as well as how these were in turn transformed into new design solutions.

    Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search

    Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance.

    Multilingual Information Access: Practices and Perceptions of Bi/multilingual Academic Users

    The research reported in this dissertation explored linguistic determinants in online information searching, and examined to what extent bi/multilingual academic users utilize Multilingual Information Access (MLIA) tools and what impact these have on their information searching behavior. The aim of the study was three-pronged: to provide tangible data that can support recommendations for the effective user-centered design of Multilingual Information Retrieval (MLIR) systems; to provide a user-centered evaluation of existing MLIA tools, and to offer the basis of a framework for Library & Information Science (LIS) professionals in teaching information literacy and library skills for bi/multilingual academic users. In the first phase of the study, 250 bi/multilingual students participated in a web survey that investigated their language choices while searching for information on the internet and electronic databases. 31 of these participants took part in the second phase which involved a controlled lab-based user experiment and post experiment questionnaire that investigated their use of MLIA tools on Google and WorldCat and their opinions of these tools. In the third phase, 19 students participated in focus groups discussions and 6 librarians were interviewed to find out their perspectives on multilingual information literacy. Results showed that though machine translation has alleviated some of the linguistic related challenges in online information searching, language barriers do still exist for some users especially at the query formulation stage. Captures from the experiment revealed great diversity in the way MLIA tools were utilized while the focus group discussions and interviews revealed a general lack of awareness by both librarians and students of the tools that could help enhance and promote multilingual information literacy. 
The study highlights the roles of both IR system designers and LIS professionals in enhancing and promoting multilingual information access and literacy: user-centered design and user modeling were found to be key aspects in the development of more effective multilingual information retrieval (MLIR) systems. The study also highlights the distinction between being multilingually information literate and being multilingual information literate. Suitable models for instruction for bi/multilingual academic users point towards Specialized Information Literacy Instruction (SILI) and Personalized Information Literacy Instruction (PILI).

    CLIR teknikak baliabide urriko hizkuntzetarako (CLIR techniques for low-resource languages)

    152 p. When developing a cross-language information retrieval system, translating the query is the most widely used approach to overcoming the language barrier. The most successful query translation strategies rely on machine translation systems or parallel corpora, but these resources are scarce in low-resource language scenarios. In such situations, a query translation strategy based on more readily available resources would be more appropriate. In this thesis we aim to show that the main such resource can be a bilingual dictionary, complemented by comparable corpora and query sessions.
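
As a rough illustration of dictionary-based query translation, the sketch below replaces each query term with all of its dictionary translations and passes unknown terms through unchanged. The function name `translate_query` and the two-entry Basque-to-English dictionary are hypothetical, invented for this example. The thesis's own strategy, including how comparable corpora and query sessions are used to complement the dictionary, is not reproduced here.

```python
from typing import Dict, List


def translate_query(query: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Translate a query term by term with a bilingual dictionary.

    Every translation candidate is kept (no disambiguation), and terms
    missing from the dictionary are passed through untranslated.
    """
    translated: List[str] = []
    for term in query:
        translated.extend(dictionary.get(term, [term]))
    return translated


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {
    "hizkuntza": ["language", "tongue"],
    "liburu": ["book"],
}

print(translate_query(["hizkuntza", "liburu", "CLIR"], toy_dictionary))
# prints ['language', 'tongue', 'book', 'CLIR']
```

Keeping every candidate is the simplest choice. Selecting among candidates is where additional resources would come in, and that step is left out of this sketch.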

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task
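
To make the contrast between precision-oriented and recall-oriented evaluation concrete, the sketch below computes average precision (the per-query quantity that mean average precision averages) and recall at a cutoff. The function names and the two toy rankings are assumptions made for this illustration. This is the standard textbook definition of both measures, not the evaluation technique proposed in the thesis.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def recall_at(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


relevant = {"r1", "r2"}
system_a = ["r1", "x", "y", "z"]   # one relevant document, at rank 1
system_b = ["x", "r1", "y", "r2"]  # both relevant documents, ranked lower

print(average_precision(system_a, relevant), recall_at(system_a, relevant, 4))
# prints 0.5 0.5
print(average_precision(system_b, relevant), recall_at(system_b, relevant, 4))
# prints 0.5 1.0
```

In this toy case both systems receive the same average precision, yet only the second retrieves every relevant document, which is the outcome a patent searcher cares about.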

    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one

    Adaptation of machine translation for multilingual information retrieval in the medical domain

    Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. 
Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions
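
For readers unfamiliar with the BM25 retrieval model mentioned in this abstract, the following is a minimal sketch of one common formulation of the scoring function over pre-tokenized documents. The function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration. The abstract does not state the parameters used, and Lucene's actual implementation differs in details such as how document lengths are stored.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with a textbook BM25 variant."""
    n = len(docs)
    if n == 0:
        return []
    # Guard against a collection of empty documents.
    avgdl = sum(len(d) for d in docs) / n or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency, always positive.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, for illustration only.
docs = [
    ["heart", "attack", "symptoms"],
    ["heart", "disease"],
    ["knee", "pain"],
]
scores = bm25_scores(["heart", "attack"], docs)
print(scores)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing and scores zero. Query expansion with several translation variants, as described in the abstract, would amount to passing a longer `query` list.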

    Cross-view Embeddings for Information Retrieval

    In this dissertation, we deal with the cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space and CAE provides the state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore the methods to evaluate the quality of autoencoder generated representations of text data and analyse its architectural properties. For this, we propose two metrics based on reconstruction capabilities of the autoencoders: structure preservation index (SPI) and similarity accumulation index (SAI). 
We also introduce a concept of critical bottleneck dimensionality (CBD) below which the structural information is lost and present analyses linking CBD and language perplexity.

Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
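
The character n-gram feature space mentioned in this abstract can be sketched as follows. The function name `char_ngrams`, the `^` and `$` boundary markers, the choice of bigrams, and the two romanized spelling variants are all assumptions made for illustration. Only the feature extraction step is shown, not the cross-view autoencoder that the dissertation trains on top of such features.

```python
from collections import Counter


def char_ngrams(term: str, n: int = 2) -> Counter:
    """Count the character n-grams of a term, with boundary markers added."""
    padded = f"^{term}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


# Two hypothetical romanized spellings of the same word.
a = char_ngrams("pyar")
b = char_ngrams("pyaar")

# Multiset intersection: n-grams present in both spellings.
shared = a & b
print(sorted(shared))
# prints ['^p', 'ar', 'py', 'r$', 'ya']
```

Because spelling variants share most of their n-grams, they land close together in this feature space even though they do not match as whole strings.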