1,510 research outputs found

    MIRACLE Retrieval Experiments with East Asian Languages

    Get PDF
    This paper describes the participation of MIRACLE in NTCIR 2005 CLIR task. Although our group has a strong background and long expertise in Computational Linguistics and Information Retrieval applied to European languages and using Latin and Cyrillic alphabets, this was our first attempt on East Asian languages. Our main goal was to study the particularities and distinctive characteristics of Japanese, Chinese and Korean, specially focusing on the similarities and differences with European languages, and carry out research on CLIR tasks which include those languages. The basic idea behind our participation in NTCIR is to test if the same familiar linguisticbased techniques may also applicable to East Asian languages, and study the necessary adaptations

    Introduction to the special issue on cross-language algorithms and applications

    Get PDF
    With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and efective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.Postprint (published version

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Full text link
    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection is available online

    A Comparison of Different Machine Transliteration Models

    Full text link
    Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance

    بناء أداة تفاعلية متعددة اللغات لاسترجاع المعلومات

    Get PDF
    The growing requirement on the Internet have made users access to the information expressed in a language other than their own , which led to Cross lingual information retrieval (CLIR) .CLIR is established as a major topic in Information Retrieval (IR). One approach to CLIR uses different methods of translation to translate queries to documents and indexes in other languages. As queries submitted to search engines suffer lack of untranslatable query keys (i.e., words that the dictionary is missing) and translation ambiguity, which means difficulty in choosing between alternatives of translation. Our approach in this thesis is to build and develop the software tool (MORTAJA-IR-TOOL) , a new tool for retrieving information using programming JAVA language with JDK 1.6. This tool has many features, which is develop multiple systematic languages system to be use as a basis for translation when using CLIR, as well as the process of stemming the words entered in the query process as a stage preceding the translation process. The evaluation of the proposed methodology translator of the query comparing it with the basic translation that uses readable dictionary automatically the percentage of improvement is 8.96%. The evaluation of the impact of the process of stemming the words entered in the query on the quality of the output process in the retrieval of matched data in other process the rate of improvement is 4.14%. Finally the rated output of the merger between the use of stemming methodology proposed and translation process (MORTAJA-IR-TOOL) which concluded that the proportion of advanced in the process of improvement in data rate of retrieval is 15.86%. Keywords: Cross lingual information retrieval, CLIR, Information Retrieval, IR, Translation, stemming.الاحتياجات المتنامية على شبكة الإنترنت جعلت المستخدمين لهم حق الوصول إلى المعلومات بلغة غير لغتهم الاصلية، مما يقودنا الى مصطلح عبور اللغات لاسترجاع المعلومات (CLIR). CLIR أنشئت كموضوع رئيسي في "استرجاع المعلومات" (IR). نهج واحد ل CLIR يستخدم أساليب مختلفة للترجمة ومنها لترجمة الاستعلامات وترجمة الوثائق والفهارس في لغات أخرى. الاستفسارات والاستعلامات المقدمة لمحركات البحث تعاني من عدم وجود ترجمه لمفاتيح الاستعلام (أي أن العبارة مفقودة من القاموس) وايضا تعاني من غموض الترجمة، مما يعني صعوبة في الاختيار بين بدائل الترجمة. في نهجنا في هذه الاطروحة تم بناء وتطوير الأداة البرمجية (MORTAJA-IR-TOOL) أداة جديدة لاسترجاع المعلومات باستخدام لغة البرمجة JAVA مع JDK 1.6، وتمتلك هذه الأداة العديد من الميزات، حيث تم تطوير منظومة منهجية متعددة اللغات لاستخدامها كأساس للترجمة عند استخدام CLIR، وكذلك عملية تجذير للكلمات المدخلة في عملية الاستعلام كمرحلة تسبق عملية الترجمة. وتم تقييم الترجمة المنهجية المقترحة للاستعلام ومقارنتها مع الترجمة الأساسية التي تستخدم قاموس مقروء اليا كأساس للترجمة في تجربة تركز على المستخدم وكانت نسبة التحسين 8.96% , وكذلك يتم تقييم مدى تأثير عملية تجذير الكلمات المدخلة في عملية الاستعلام على جودة المخرجات في عملية استرجاع البيانات المتطابقة باللغة الاخرى وكانت نسبة التحسين 4.14% , وفي النهاية تم تقييم ناتج عملية الدمج بين استخدام التجذير والترجمة المنهجية المقترحة (MORTAJA-IR-TOOL) والتي خلصت الى نسبة متقدمة في عملية التحسين في نسبة البيانات المرجعة وكانت 15.86%

    Knowledge Management and Cultural Heritage Repositories. Cross-Lingual Information Retrieval Strategies

    Get PDF
    In the last years important initiatives, like the development of the European Library and Europeana, aim to increase the availability of cultural content from various types of providers and institutions. The accessibility to these resources requires the development of environments which allow both to manage multilingual complexity and to preserve the semantic interoperability. The creation of Natural Language Processing (NLP) applications is finalized to the achievement of CrossLingual Information Retrieval (CLIR). This paper presents an ongoing research on language processing based on the LexiconGrammar (LG) approach with the goal of improving knowledge management in the Cultural Heritage repositories. The proposed framework aims to guarantee interoperability between multilingual systems in order to overcome crucial issues like cross-language and cross-collection retrieval. Indeed, the LG methodology tries to overcome the shortcomings of statistical approaches as in Google Translate or Bing by Microsoft concerning Multi-Word Unit (MWU) processing in queries, where the lack of linguistic context represents a serious obstacle to disambiguation. In particular, translations concerning specific domains, as it is has been widely recognized, is unambiguous since the meanings of terms are mono-referential and the type of relation that links a given term to its equivalent in a foreign language is biunivocal, i.e. a one-to-one coupling which causes this relation to be exclusive and reversible. Ontologies are used in CLIR and are considered by several scholars a promising research area to improve the effectiveness of Information Extraction (IE) techniques particularly for technical-domain queries. Therefore, we present a methodological framework which allows to map both the data and the metadata among the language-specific ont

    A survey of cross-lingual word embedding models

    Get PDF
    Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.</jats:p
    corecore