34 research outputs found
Evaluating Information Retrieval and Access Tasks
This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today's smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. The chapters show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.
Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study
Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task
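As a rough illustration of the contrast drawn in this abstract, the sketch below computes the standard textbook definitions of average precision (the per-query quantity behind mean average precision) and recall at a cutoff. It is not the evaluation method developed in the thesis; the function names and document IDs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: the mean of the precision values
    observed at the rank of each relevant document that was retrieved,
    divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


# Hypothetical ranking for a single query.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d4", "d9"}
ap = average_precision(ranking, relevant)
recall = recall_at_k(ranking, relevant, k=4)
```

Average precision rewards placing relevant documents early in the ranking, whereas recall at a deep cutoff reflects how many relevant documents a user would eventually find, which is closer to the concern of a patent searcher.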
Document Meta-Information as Weak Supervision for Machine Translation
Data-driven machine translation has advanced considerably since the first pioneering work
in the 1990s with recent systems claiming human parity on sentence translation for high-resource tasks. However, performance degrades for low-resource domains with no available
sentence-parallel training data. Machine translation systems also rarely incorporate the
document context beyond the sentence level, ignoring knowledge which is essential for
some situations. In this thesis, we aim to address the two issues mentioned above by
examining ways to incorporate document-level meta-information into data-driven machine
translation. Examples of document meta-information include document authorship and
categorization information, as well as cross-lingual correspondences between documents,
such as hyperlinks or citations between documents. As this meta-information is much more
coarse-grained than reference translations, it constitutes a source of weak supervision for
machine translation. We present four cumulatively conducted case studies where we devise
and evaluate methods to exploit these sources of weak supervision both in low-resource
scenarios where no task-appropriate supervision from parallel data exists, and in a full
supervision scenario where weak supervision from document meta-information is used to
supplement supervision from sentence-level reference translations. All case studies show
improved translation quality when incorporating document meta-information
Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications
Translation and cross-lingual access to information are key technologies in a global economy. Even though the quality of machine translation (MT) output is still far from the level of human translations, many real-world applications have emerged, for which MT can be employed. Machine translation supports human translators in computer-assisted translation (CAT), providing the opportunity to improve translation systems based on human interaction and feedback. Besides, many tasks that involve natural language processing operate in a cross-lingual setting, where there is no need for perfectly fluent translations and the transfer of meaning can be modeled by employing MT technology. This thesis describes cumulative work in the field of cross-lingual natural language processing in a user-oriented setting. A common denominator of the presented approaches is their anchoring in an alignment between texts in two different languages to quantify the similarity of their content
Translation-based Ranking in Cross-Language Information Retrieval
The volume of user-generated, multilingual textual data available today creates a need for information processing
systems in which cross-linguality, i.e., the ability to work on more than one
language, is fully integrated into the underlying models. In the particular
context of Information Retrieval (IR), this amounts to ranking and retrieving relevant
documents from a large repository in language A, given a user's information
need expressed in a query in language B. This kind of application is commonly
termed a Cross-Language Information Retrieval (CLIR) system. Such
CLIR systems typically involve a translation component of varying complexity,
which is responsible for translating the user input into the document
language. Using query translations from modern, phrase-based Statistical
Machine Translation (SMT) systems, and subsequently retrieving monolingually
is thus a straightforward choice. However, little work has been devoted to
integrating such SMT models into CLIR, or even to jointly modeling translation
and retrieval.
In this thesis, I focus on the shared aspect of ranking in translation-based
CLIR: both translation and retrieval models induce rankings over a set of
candidate structures through assignment of scores. The subject of this thesis
is to exploit this commonality in three different ranking tasks: (1) "Mate-ranking" refers to the
task of mining comparable data for SMT domain adaptation through translation-based
CLIR. "Cross-lingual mates" are direct or close translations of the query.
I will show that such a CLIR system is able to find
in-domain comparable data from noisy user-generated corpora and improves
in-domain translation performance of an SMT system. Conversely, the CLIR system
itself relies on a translation model that is tailored for retrieval. This
leads to the second direction of research, in which I develop two ways to
optimize an SMT model for retrieval, namely (2) by SMT parameter optimization
towards a retrieval objective ("translation ranking"), and (3) by presenting
a joint model of translation and retrieval for "document ranking". The latter
abandons the common architecture of modeling both components separately. The
former task refers to optimizing for preference of
translation candidates that work well for retrieval. In the core task of "document ranking" for CLIR, I present a model that directly ranks documents using an SMT decoder. I present substantial improvements
over state-of-the-art translation-based CLIR baseline systems, indicating that
a joint model of translation and retrieval is a promising direction of
research in the field of CLIR
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied on a collection of millions of documents to improve access to public knowledge
Mixed-Language Arabic-English Information Retrieval
This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is essential first to suppress the impact of most of the problems that are caused by the mixed-language feature in both queries and documents and that bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of language, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) would likely be overweighted and skew the impact of the technical terms (mostly those in English), due to the high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries.
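The weighting skew described in this abstract can be illustrated with the plain textbook IDF formula, log(N / df), computed separately per collection. This is only a sketch of the problem, not the re-weighted IDF proposed in the thesis; the collection sizes and document frequencies are invented for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Textbook inverse document frequency: log(N / df)."""
    if doc_freq <= 0 or num_docs <= 0:
        raise ValueError("doc_freq and num_docs must be positive")
    return math.log(num_docs / doc_freq)


# Hypothetical counts: a technical English term that is frequent in a
# scientific English collection, and a non-technical Arabic term that is
# rare in the Arabic collection.
english_docs, arabic_docs = 1000, 1000
technical_df = 500       # assumed document frequency of the English term
non_technical_df = 10    # assumed document frequency of the Arabic term

technical_weight = idf(technical_df, english_docs)
non_technical_weight = idf(non_technical_df, arabic_docs)

# The non-technical term receives the larger weight, so in a mixed query
# it can dominate the ranking over the technical term.
skewed = non_technical_weight > technical_weight
```

With these assumed counts the technical term gets a weight of log(2) and the non-technical term a weight of log(100), which is the kind of imbalance a re-weighting scheme would need to moderate.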
From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval
(IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its
shortcomings and strengths. We present a framework for further research, identifying five major
problem areas: understanding measures, performance analysis, making underlying assumptions
explicit, identifying application features determining performance, and the development of prediction
models describing the relationship between assumptions, features and resulting performance.
Matching Meaning for Cross-Language Information Retrieval
Cross-language information retrieval concerns the problem of finding information in one language in response to search requests expressed in another language. The explosive growth of the World Wide Web, with access to information in many languages, has provided a substantial impetus for research on this important problem. In recent years, significant advances in cross-language retrieval effectiveness have resulted from the application of statistical techniques to estimate accurate translation probabilities for individual terms from automated analysis of human-prepared translations. With few exceptions, however, those results have been obtained by applying evidence about the meaning of terms to translation in one direction at a time (e.g., by translating the queries into the document language).
This dissertation introduces a more general framework for the use of translation probability in cross-language information retrieval based on the notion that information retrieval is dependent fundamentally upon matching what the searcher means with what the document author meant. This perspective yields a simple computational formulation that provides a natural way of combining what have been known traditionally as query and document translation. When combined with the use of synonym sets as a computational model of meaning, cross-language search results are obtained using English queries that approximate a strong monolingual baseline for both French and Chinese documents. Two well-known techniques (structured queries and probabilistic structured queries) are also shown to be special cases of this model under restrictive assumptions.
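To make the idea of using term translation probabilities concrete, the sketch below shows probabilistic structured queries as they are commonly formulated: the term frequency of a query term in a document is estimated as the probability-weighted sum of the frequencies of its translations. It covers only this one-directional special case, not the dissertation's more general meaning-matching model; the translation table and counts are invented.

```python
def psq_term_frequency(translation_probs, doc_term_counts):
    """Estimated frequency of a query-language term in a document:
    sum over translations f of p(f | e) * tf(f, document)."""
    return sum(
        prob * doc_term_counts.get(target_term, 0)
        for target_term, prob in translation_probs.items()
    )


# Hypothetical translation probabilities for the English query term "bank"
# into French, and hypothetical term counts for one French document.
bank_translations = {"banque": 0.7, "rive": 0.3}
french_doc_counts = {"banque": 2, "rive": 1, "argent": 3}

estimated_tf = psq_term_frequency(bank_translations, french_doc_counts)
```

The estimated frequency can then be plugged into a monolingual ranking function in place of an observed term frequency; document frequency can be estimated in the same weighted fashion.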