User requirement elicitation for cross-language information retrieval
Who are the users of a cross-language retrieval system? Under what circumstances do they need to perform such multi-language searches? How will the task and the context of use affect successful interaction with the system? Answers to these questions were explored in a user study performed as part of the design stages of Clarity, an EU-funded project on cross-language information retrieval. The findings resulted in a rethink of the planned user interface and a consequent expansion of the set of services offered. This paper reports on the methodology and techniques used for the elicitation of user requirements as well as how these were in turn transformed into new design solutions
Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search
Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance
Multilingual Information Access: Practices and Perceptions of Bi/multilingual Academic Users
The research reported in this dissertation explored linguistic determinants in online information searching, and examined to what extent bi/multilingual academic users utilize Multilingual Information Access (MLIA) tools and what impact these have on their information searching behavior.
The aim of the study was three-pronged: to provide tangible data that can support recommendations for the effective user-centered design of Multilingual Information Retrieval (MLIR) systems; to provide a user-centered evaluation of existing MLIA tools, and to offer the basis of a framework for Library & Information Science (LIS) professionals in teaching information literacy and library skills for bi/multilingual academic users.
In the first phase of the study, 250 bi/multilingual students participated in a web survey that investigated their language choices while searching for information on the internet and electronic databases. 31 of these participants took part in the second phase, which involved a controlled lab-based user experiment and a post-experiment questionnaire that investigated their use of MLIA tools on Google and WorldCat and their opinions of these tools. In the third phase, 19 students participated in focus group discussions and 6 librarians were interviewed to find out their perspectives on multilingual information literacy.
Results showed that though machine translation has alleviated some of the language-related challenges in online information searching, language barriers do still exist for some users, especially at the query formulation stage. Captures from the experiment revealed great diversity in the way MLIA tools were utilized, while the focus group discussions and interviews revealed a general lack of awareness by both librarians and students of the tools that could help enhance and promote multilingual information literacy.
The study highlights the roles of both IR system designers and LIS professionals in enhancing and promoting multilingual information access and literacy: user-centered design and user modeling were found to be key aspects in the development of more effective multilingual information retrieval (MLIR) systems. The study also highlights the distinction between being multilingually information literate and being multilingual information literate. Suitable models for instruction for bi/multilingual academic users point towards Specialized Information Literacy Instruction (SILI) and Personalized Information Literacy Instruction (PILI)
CLIR teknikak baliabide urriko hizkuntzetarako (CLIR techniques for low-resource languages)
152 p. When developing a cross-language information retrieval system, query translation is the most widely used approach to dealing with the language barrier. The most successful query translation strategies rely on machine translation systems or parallel corpora, but these resources are scarce in low-resource language scenarios. In such situations, a query translation strategy based on more readily available resources would be more appropriate. In this thesis we aim to show that the main such resources can be a bilingual dictionary, complemented by comparable corpora and query sessions
Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study
Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task
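The contrast between precision-oriented and recall-oriented evaluation described above can be made concrete with a minimal sketch. It is not the evaluation method proposed in the thesis; it only computes textbook average precision (the per-query quantity averaged in mean average precision) and recall at a cutoff. The document identifiers and both rankings are hypothetical, and document identifiers are assumed to be unique within a ranking.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list; MAP is the mean of this over queries."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of this relevant document
    return total / len(relevant)


def recall_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


# Hypothetical query with four relevant documents (r1..r4); n* are non-relevant.
relevant = {"r1", "r2", "r3", "r4"}
# System A puts two relevant documents first but never finds the other two.
ranking_a = ["r1", "r2", "n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
# System B finds all four, but only at ranks 7 to 10.
ranking_b = ["n1", "n2", "n3", "n4", "n5", "n6", "r1", "r2", "r3", "r4"]

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(name, average_precision(ranking, relevant), recall_at_k(ranking, relevant, 10))
```

In this toy case average precision favours system A, while a user prepared to read the whole list, as in patent or legal search, is better served by system B, which retrieves every relevant document.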
Evaluating Information Retrieval and Access Tasks
This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one
Adaptation of machine translation for multilingual information retrieval in the medical domain
Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve effectiveness of cross-lingual IR.
Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project and IR effectiveness is tested on the CLEF eHealth 2013 data sets.
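As a rough illustration of the IR side described above, the sketch below scores documents with a textbook BM25 formula and builds an expanded query from several translation variants. It is an assumption-laden toy, not the authors' pipeline and not Lucene's implementation: the function names `bm25_scores` and `expand_query`, the tiny corpus, and the translation variants are all hypothetical, and the parameters `k1=1.2` and `b=0.75` are simply common defaults.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query_terms: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Textbook BM25 score of every tokenized document for the given query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def expand_query(variants: Sequence[str]) -> List[str]:
    """Union of the tokens of all translation variants, in first-seen order."""
    terms: List[str] = []
    for variant in variants:
        for token in variant.lower().split():
            if token not in terms:
                terms.append(token)
    return terms


# Hypothetical tokenized documents and translation variants of one source query.
docs = [
    ["heart", "attack", "symptoms"],
    ["myocardial", "infarction", "treatment"],
    ["knee", "pain", "treatment"],
]
variants = ["heart attack", "myocardial infarction"]

single = bm25_scores(variants[0].split(), docs)    # best translation only
expanded = bm25_scores(expand_query(variants), docs)  # all variants combined
print(single)
print(expanded)
```

With only the first variant, the second document gets a score of zero even though it is on topic; the expanded query reaches it because the alternative translation contributes matching terms.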
Results. The search query translation results achieved in our experiments are outstanding – our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of the IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results.
Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. Especially the intelligent training data selection proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source language side. Translation quality, however, does not appear to correlate with the IR performance – better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions
Cross-view Embeddings for Information Retrieval
In this dissertation, we deal with the cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space and CAE provides the state-of-the-art performance.
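The character n-gram representation mentioned above can be sketched as follows. This shows only the feature extraction step, not the cross-view autoencoder itself; the function names, the `^`/`$` boundary markers, and the example spellings are hypothetical choices made for illustration.

```python
from collections import Counter


def char_ngrams(term: str, n: int = 2, pad: bool = True) -> Counter:
    """Bag of character n-grams of a term, optionally with boundary markers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    text = f"^{term}$" if pad else term
    # Python strings are sequences of Unicode code points, so this works for any script.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def jaccard(a: Counter, b: Counter) -> float:
    """Overlap between two n-gram bags, ignoring counts."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


# Two hypothetical Roman-script spellings of the same word.
features_1 = char_ngrams("pyar")
features_2 = char_ngrams("pyaar")
print(sorted(features_1))
print(jaccard(features_1, features_2))
```

Spelling variants in the same script share most of their n-grams, which is what keeps the feature space small. Variants written in different scripts share none, so surface overlap alone cannot relate them; relating them is the job the abstract assigns to the learned abstract space.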
We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses.
In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the strong baselines for English-to-Spanish and English-to-Hindi translation tasks.
Finally, we explore methods to evaluate the quality of autoencoder-generated representations of text data and analyse the autoencoders' architectural properties. For this, we propose two metrics based on reconstruction capabilities of the autoencoders: structure preservation index (SPI) and similarity accumulation index (SAI). We also introduce a concept of critical bottleneck dimensionality (CBD) below which the structural information is lost and present analyses linking CBD and language perplexity.
Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
Lost and Found in Translation: Cross-Lingual Question Answering with Result Translation
Using cross-lingual question answering (CLQA), users can find information in languages that they do not know. In this thesis, we consider the broader problem of CLQA with result translation, where answers retrieved by a CLQA system must be translated back to the user's language by a machine translation (MT) system. This task is challenging because answers must be both relevant to the question and adequately translated in order to be correct. In this work, we show that integrating the MT closely with cross-lingual retrieval can improve result relevance and we further demonstrate that automatically correcting errors in the MT output can improve the adequacy of translated results. To understand the task better, we undertake detailed error analyses examining the impact of MT errors on CLQA with result translation. We identify which MT errors are most detrimental to the task and how different cross-lingual information retrieval (CLIR) systems respond to different kinds of MT errors. We describe two main types of CLQA errors caused by MT errors: lost in retrieval errors, where relevant results are not returned, and lost in translation errors, where relevant results are perceived irrelevant due to inadequate MT. To address the lost in retrieval errors, we introduce two novel models for cross-lingual information retrieval that combine complementary source-language and target-language information from MT. We show empirically that these hybrid, bilingual models outperform both monolingual models and a prior hybrid model. Even once relevant results are retrieved, if they are not translated adequately, users will not understand that they are relevant. Rather than improving a specific MT system, we take a more general approach that can be applied to the output of any MT system. 
Our adequacy-oriented automatic post-editors (APEs) use resources from the CLQA context and information from the MT system to automatically detect and correct phrase-level errors in MT at query time, focusing on the errors that are most likely to impact CLQA: deleted or missing content words and mistranslated named entities. Human evaluations show that these adequacy-oriented APEs can successfully adapt task-agnostic MT systems to the needs of the CLQA task. Since there is no existing test data for translingual QA or IR tasks, we create a translingual information retrieval (TLIR) evaluation corpus. Furthermore, we develop an analysis framework for isolating the impact of MT errors on CLIR and on result understanding, as well as evaluating the whole TLIR task. We use the TLIR corpus to carry out a task-embedded MT evaluation, which shows that our CLIR models address lost in retrieval errors, resulting in higher TLIR recall; and that the APEs successfully correct many lost in translation errors, leading to more adequately translated results