1,775 research outputs found
Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search
Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance.
Introduction to the special issue on cross-language algorithms and applications
With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.
Reordering in statistical machine translation
PhD thesis. Machine translation is a challenging task whose difficulties arise from several characteristics of natural language. The main focus of this work is on reordering as one of the major problems in MT and statistical MT (SMT), which is the method investigated in this research. The reordering problem in SMT originates from the fact that not all the words in a sentence can be consecutively translated. This means words must be skipped and translated out of their order in the source sentence to produce a fluent and grammatically correct sentence in the target language. The main reason that reordering is needed is the fundamental word order differences between languages. Therefore, reordering becomes a more dominant issue the more structurally different the source and target languages are.
The aim of this thesis is to study the reordering phenomenon by proposing new methods of dealing with reordering in SMT decoders and evaluating the effectiveness of the methods and the importance of reordering in the context of natural language processing tasks. In other words, we propose novel ways of performing the decoding to improve the reordering capabilities of the SMT decoder and in addition we explore the effect of improving the reordering on the quality of specific NLP tasks, namely named entity recognition and cross-lingual text association. Meanwhile, we go beyond reordering in text association and present a method to perform cross-lingual text fragment alignment, based on models of divergence from randomness.
The main contribution of this thesis is a novel method named dynamic distortion, which is designed to improve the ability of the phrase-based decoder in performing reordering by adjusting the distortion parameter based on the translation context. The model employs a discriminative reordering model, which combines several features, including lexical and syntactic ones, to predict the necessary distortion limit for each sentence and each hypothesis expansion. The discriminative reordering model is also integrated into the decoder as an extra feature. The method achieves substantial improvements over the baseline without an increase in decoding time by avoiding reordering in unnecessary positions.
Another novel method is also presented to extend the phrase-based decoder to dynamically chunk, reorder, and apply phrase translations in tandem. Words inside the chunks are moved together to enable the decoder to make long-distance reorderings to capture the word order differences between languages with different sentence structures.
Another aspect of this work is the task-based evaluation of the reordering methods and other translation algorithms used in the phrase-based SMT systems. With more successful SMT systems, performing multi-lingual and cross-lingual tasks through translating becomes more feasible. We have devised a method to evaluate the performance of state-of-the-art named entity recognisers on the text translated by an SMT decoder. Specifically, we investigated the effect of word reordering and incorporating reordering models in improving the quality of named entity extraction.
In addition to empirically investigating the effect of translation in the context of cross-lingual document association, we have described a text fragment alignment algorithm to find sections of two documents in different languages that are related in content. The algorithm uses similarity measures based on divergence from randomness and word-based translation models to perform text fragment alignment on a collection of documents in two different languages.
All the methods proposed in this thesis are extensively empirically examined. We have tested all the algorithms on common translation collections used in different evaluation campaigns. Well known automatic evaluation metrics are used to compare the suggested methods to a state-of-the-art baseline and results are analysed and discussed.
Cross-language Information Retrieval
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions.
Comment: 49 pages, 0 figures
Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video
Recent years have seen significant efforts in the area of Cross Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has, to date, been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual UGC metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors associated with the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the associated user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC. Our experimental investigation shows that different sources of evidence, e.g. the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata field has a varying robustness to query expansion (QE) and hence can have a negative impact on the CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion and shows how this technique can be effective for improving CLIR effectiveness for UGC content.
Are You Finding the Right Person? A Name Translation System Towards Web 2.0
In a multilingual world, information available in global information systems is increasing rapidly. Searching for proper names in a foreign language becomes an important task in multilingual search and knowledge discovery. However, these names are the most difficult to handle because they are often unknown words that cannot be found in a translation dictionary, and even human experts cannot handle the variation generated during translation. Furthermore, existing research on name translation has focused on translation algorithms, while user experience during name translation and name search is often ignored. With Web technology moving towards Web 2.0 and creating platforms that allow easier distributed collaboration and information sharing, we seek methods to incorporate Web 2.0 technologies into a name translation system. In this research, we review challenges in name translation and propose an interactive name translation and search system: NameTran. This system takes English names and translates them into Chinese using a hybrid of a Hidden Markov Model-based (HMM-based) transliteration approach and a web mining approach. Evaluation results showed that web mining consistently boosted the performance of a pure HMM approach. Our system achieved top-1 accuracy of 0.64 and top-8 accuracy of 0.96. To cope with changing popularity and variation in name translations, we demonstrated the feasibility of allowing users to rank translations, with the new ranking serving as feedback to the original trained HMM model. We believe that such user input will significantly improve system usability.
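The top-1 and top-8 figures reported above are conventionally read as top-k accuracy: the fraction of test names whose reference translation appears among the first k ranked candidates. The following is a minimal sketch of that metric only. The function name `top_k_accuracy` and the placeholder candidate strings are illustrative assumptions and are not taken from NameTran.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   gold: Sequence[str],
                   k: int) -> float:
    """Fraction of items whose gold answer is among the first k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, answer in zip(ranked, gold)
               if answer in candidates[:k])
    return hits / len(gold)


# Hypothetical ranked candidate lists for three names (placeholders, not real data).
ranked_candidates = [
    ["t1", "t2", "t3"],   # gold "t1" is ranked first
    ["u2", "u1", "u3"],   # gold "u1" is ranked second
    ["v3", "v2", "v4"],   # gold "v1" is not among the candidates
]
gold_translations = ["t1", "u1", "v1"]

top1 = top_k_accuracy(ranked_candidates, gold_translations, k=1)
top2 = top_k_accuracy(ranked_candidates, gold_translations, k=2)
```

Because a larger k can only admit more hits, top-k accuracy never decreases as k grows, which is consistent with the top-8 figure being higher than the top-1 figure.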
Building Knowledge Management System for Researching Terrorist Groups on the Web
Nowadays, terrorist organizations have found a cost-effective resource to advance their causes by posting high-impact Web sites on the Internet. This alternate side of the Web is referred to as the “Dark Web.” While counterterrorism researchers seek to obtain and analyze information from the Dark Web, several problems prevent effective and efficient knowledge discovery: the dynamic and hidden character of terrorist Web sites, information overload, and language barrier problems. This study proposes an intelligent knowledge management system to support the discovery and analysis of multilingual terrorist-created Web data. We developed a systematic approach to identify, collect and store up-to-date multilingual terrorist Web data. We also propose to build an intelligent Web-based knowledge portal integrated with advanced text and Web mining techniques such as summarization, categorization and cross-lingual retrieval to facilitate knowledge discovery from Dark Web resources. We believe our knowledge portal provides counterterrorism research communities with valuable datasets and tools for knowledge discovery and sharing.
Towards effective cross-lingual search of user-generated internet speech
The very rapid growth in user-generated social spoken content on online platforms is creating new challenges for Spoken Content Retrieval (SCR) technologies. There are many potential choices for how to design a robust SCR framework for user-generated speech (UGS) content, but the current lack of detailed investigation means that there is a lack of understanding of the specific challenges, and little or no guidance available to inform these choices. This thesis investigates the challenges of effective SCR for UGS content, and proposes novel SCR methods that are designed to cope with the challenges of UGS content. The work presented in this thesis can be divided into three areas of contribution as follows.
The first contribution of this work is critiquing the issues and challenges that influence the effectiveness of searching UGS content in both mono-lingual and cross-lingual settings.
The second contribution is to develop an effective Query Expansion (QE) method for UGS. This research reports that the variation in the length, quality and structure of the relevant documents encountered in UGS content can harm the effectiveness of QE techniques across different queries. Seeking to address this issue, this work examines the utilisation of Query Performance Prediction (QPP) techniques for improving QE in UGS, and presents a novel framework specifically designed for predicting the effectiveness of QE.
Thirdly, this work extends the utilisation of QPP in UGS search to improve cross-lingual search for UGS by predicting the translation effectiveness. The thesis proposes novel methods to estimate the quality of translation for cross-lingual UGS search. An empirical evaluation demonstrates the quality of the proposed method on alternative translation outputs extracted from several Machine Translation (MT) systems developed for this task. The research then shows how this framework can be integrated in cross-lingual UGS search to find relevant translations for improved retrieval performance.
Mixed-Language Arabic-English Information Retrieval
This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is essential first to suppress the impact of the problems that are caused by the mixed-language feature in both queries and documents and which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in a non-English language) would likely be overweighted and skew the impact of technical terms (mostly those in English) due to the high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries.
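The remark that high document frequencies lead to low weights follows from the textbook inverse document frequency, idf(t) = log(N / df(t)), where N is the collection size and df(t) is the number of documents containing term t. The sketch below uses only that standard formula. It is not the re-weighted IDF proposed in the thesis, and the collection sizes and document frequencies are made-up values chosen purely for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be in the range 1..num_docs")
    return math.log(num_docs / doc_freq)


# Hypothetical numbers: a technical English term that is frequent in a large
# English collection, and a non-technical Arabic term that is rare in a
# smaller Arabic collection.
english_technical_idf = idf(num_docs=1_000_000, doc_freq=50_000)
arabic_nontechnical_idf = idf(num_docs=100_000, doc_freq=50)

# When each term is weighted within its own collection, the rare Arabic term
# receives the larger weight and can dominate a mixed-language query.
arabic_dominates = arabic_nontechnical_idf > english_technical_idf
```

Under these assumed numbers the English term gets a weight of log(20) and the Arabic term a weight of log(2000), which illustrates the kind of imbalance the abstract says the re-weighting is meant to moderate.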
Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation
With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information.
Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR when the search takes place in a multi-lingual collection.
Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language.
Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output.
In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs.
Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.