1,475 research outputs found

    Introduction to the special issue on cross-language algorithms and applications

    Get PDF
    With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and efective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.Postprint (published version

    Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

    Full text link
    While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data set that are suitable to train ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our model is evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance.Comment: ECIR 2020 (short

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Get PDF
    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.Comment: 37 page

    Which user interaction for cross-language information retrieval? Design issues and reflections

    Get PDF
    A novel and complex form of information access is cross-language information retrieval: searching for texts written in foreign languages based on native language queries. Although the underlying technology for achieving such a search is relatively well understood, the appropriate interface design is not. The authors present three user evaluations undertaken during the iterative design of Clarity, a cross-language retrieval system for low-density languages, and shows how the user-interaction design evolved depending on the results of usability tests. The first test was instrumental to identify weaknesses in both functionalities and interface; the second was run to determine if query translation should be shown or not; the final was a global assessment and focused on user satisfaction criteria. Lessons were learned at every stage of the process leading to a much more informed view of what a cross-language retrieval system should offer to users

    Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video

    Get PDF
    Recent years have seen signicant eorts in the area of Cross Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has to date, been no signicant investigation of CLVR for the rapidly growing archives of informal user generated (UGC) content. Key differences between such UGC and professionally produced content are the nature and structure of the textual UGC metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval eectiveness may not only suer from translation errors common to all CLIR tasks, but also recognition errors associated with the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video and with the informality and inconsistency of the associated user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness of such noisy UGC content. Our experimental investigation shows that dierent sources of evidence, e.g. the content from dierent elds of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata eld has a varying robustness to query expansion (QE) and hence can have a negative impact on the CLIR eectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion and shows how this technique can be effective for improving CLIR effectiveness for UGC content

    Cross-lingual Question Answering with QED

    Get PDF
    We present improvements and modifications of the QED open-domain question answering system developed for TREC-2003 to make it cross-lingual for participation in the CrossLinguistic Evaluation Forum (CLEF) Question Answering Track 2004 for the source languages French and German and the target language English. We use rule-based question translation extended with surface pattern-oriented pre- and post-processing rules for question reformulation to create and English query from its French or German original. Our system uses deep processing for the question and answers, which requires efficient and radical prior search space pruning. For answering factoid questions, we report an accuracy of 16% (German to English) and 20% (French to English), respectively

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    DCU's experiments in NTCIR-8 IR4QA task

    Get PDF
    We describe DCU's participation in the NTCIR-8 IR4QA task [16]. This task is a cross-language information retrieval(CLIR) task from English to Simplified Chinese which seeks to provide relevant documents for later cross language question answering (CLQA) tasks. For the IR4QA task, we submitted 5 official runs including two monolingual runs and three CLIR runs. For the monolingual retrieval we tested two information retrieval models. The results show that the KL-Divergence language model method performs better than the Okapi BM25 model for the Simplified Chinese retrieval task. This agrees with our previous CLIR experimental results at NTCIR-5. For the CLIR task, we compare query translation and document translation methods. In the query translation based runs, we tested a method for query expansion from external resource (QEE) before query translation. Our result for this run is slightly lower than the run without QEE. Our results show that the document translation method achieves 68.24% MAP performance compared to our best query translation run. For the document translation method, we found that the main issue is the lack of named entity translation in the documents since we do not have a suitable parallel corpus for training data for the statistical machine translation system. Our best CLIR run comes from the combination of query translation using Google translate and the KL-Divergence language model retrieval method. It achieves 79.94% MAP relative to our best monolingual run
    corecore