156 research outputs found

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system that combines query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords using its special phonograms. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance.
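
The word-by-word compound translation with probabilistic disambiguation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the romanized tokens, the dictionary entries, the probabilities, and the function name `translate_compound` are all hypothetical, and a simple product of translation and bigram probabilities stands in for whatever probabilistic method the paper uses.

```python
from itertools import product

# Hypothetical toy base-word dictionary:
# source base word -> {candidate translation: P(candidate | source word)}
BASE_DICT = {
    "jouhou": {"information": 0.9, "intelligence": 0.1},
    "kensaku": {"retrieval": 0.6, "search": 0.4},
}

# Hypothetical target-language bigram probabilities P(next | previous)
BIGRAM = {
    ("information", "retrieval"): 0.5,
    ("information", "search"): 0.2,
    ("intelligence", "retrieval"): 0.01,
    ("intelligence", "search"): 0.05,
}


def translate_compound(base_words, floor=1e-6):
    """Translate a compound word-by-word and keep the most probable sequence.

    Every combination of candidate translations is scored by the product of
    its translation probabilities and target-language bigram probabilities.
    Unseen bigrams receive a small floor probability.
    """
    options = [BASE_DICT[w].items() for w in base_words]
    best, best_score = None, 0.0
    for combo in product(*options):
        words = [target for target, _ in combo]
        score = 1.0
        for _, prob in combo:
            score *= prob
        for prev, nxt in zip(words, words[1:]):
            score *= BIGRAM.get((prev, nxt), floor)
        if score > best_score:
            best, best_score = words, score
    return best


result = translate_compound(["jouhou", "kensaku"])
```

A base word missing from `BASE_DICT` raises a `KeyError` in this sketch; in the system described above, such words would instead be passed to the transliteration method.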

    PRIME: A System for Multi-lingual Patent Retrieval

    Given the growing number of patents filed in multiple countries, users are interested in retrieving patents across languages. We propose a multi-lingual patent retrieval system, which translates a user query into the target language, searches a multilingual database for patents relevant to the query, and improves the browsing efficiency by way of machine translation and clustering. Our system also extracts new translations from patent families consisting of comparable patents, to enhance the translation dictionary.

    Applying Machine Translation to Two-Stage Cross-Language Information Retrieval

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, needs a translation of queries and/or documents, so as to standardize both of them into a common representation. For this purpose, the use of machine translation is an effective approach. However, the computational cost of translating large-scale document collections is prohibitive. To resolve this problem, we propose a two-stage CLIR method. First, we translate a given query into the document language, and retrieve a limited number of foreign documents. Second, we machine translate only those documents into the user language, and re-rank them based on the translation result. We also show the effectiveness of our method by way of experiments using Japanese queries and English technical documents. Comment: 13 pages, 1 Postscript figure
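
The two-stage method in this abstract can be outlined as a short sketch. The translators here are toy dictionary lookups and the relevance score is a plain term-overlap count; both are assumptions made for illustration, as are all names (`two_stage_clir`, `overlap_score`, `Q2D`, `D2Q`) and the toy data. A real system would plug in a machine translation component and a proper retrieval model.

```python
def overlap_score(query_terms, doc_terms):
    """Toy relevance score: number of query terms that occur in the document."""
    doc_set = set(doc_terms)
    return sum(1 for term in query_terms if term in doc_set)


def two_stage_clir(query, docs, translate_query, translate_doc, k=2):
    """Sketch of two-stage CLIR over a dict of doc_id -> list of terms.

    Stage 1: translate the query into the document language and keep
             only the top-k documents.
    Stage 2: translate just those k documents into the user language
             and re-rank them against the original query.
    """
    translated_query = translate_query(query)
    first_pass = sorted(
        docs,
        key=lambda d: overlap_score(translated_query, docs[d]),
        reverse=True,
    )[:k]
    return sorted(
        first_pass,
        key=lambda d: overlap_score(query, translate_doc(docs[d])),
        reverse=True,
    )


# Hypothetical toy lexicons standing in for machine translation.
Q2D = {"kikai": "machine", "honyaku": "translation"}
D2Q = {"machine": "kikai", "translation": "honyaku"}

docs = {
    "d1": ["machine", "learning"],
    "d2": ["machine", "translation", "system"],
    "d3": ["weather", "report"],
}

ranking = two_stage_clir(
    ["kikai", "honyaku"],
    docs,
    translate_query=lambda q: [Q2D.get(t, t) for t in q],
    translate_doc=lambda d: [D2Q.get(t, t) for t in d],
    k=2,
)
```

The point of the design is that `translate_doc` runs only on the `k` documents surviving stage 1, so the expensive document translation never touches the full collection.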

    Kannada and Telugu Native Languages to English Cross Language Information Retrieval

    One of the crucial challenges in cross-lingual information retrieval is the retrieval of relevant information for a query expressed in a native language. While retrieval of relevant documents is slightly easier, analysing the relevance of the retrieved documents and the presentation of the results to the users are non-trivial tasks. To accomplish the above task, we present our Kannada-English and Telugu-English CLIR systems as part of the Ad-Hoc Bilingual task. We take a query translation based approach using bilingual dictionaries. When a query word is not found in the dictionary, it is transliterated using a simple rule-based approach which utilizes the corpus to return the ‘k’ closest English transliterations of the given Kannada/Telugu word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative page-rank style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Finally, we conduct experiments on these translated queries using a Kannada/Telugu document collection and a set of English queries to report the improvements; the performance achieved for each task is presented, and a statistical analysis of these results is given.
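
The iterative page-rank style disambiguation mentioned in this abstract can be approximated by the sketch below. It is one plausible reading, not the authors' algorithm: candidates for different query words are linked by co-occurrence weights, scores are propagated with a damping factor, and the top-scoring candidate per query word is kept. The function name, the damping value, and the toy co-occurrence counts are all assumptions.

```python
def disambiguate(candidates, cooc, iterations=20, damping=0.85):
    """Pick one translation per query word with a PageRank-style iteration.

    candidates: list (one entry per query word) of candidate translations.
    cooc: dict mapping (term_a, term_b) to a co-occurrence weight.
    """
    # Each node is (query word index, candidate term).
    nodes = [(i, c) for i, cands in enumerate(candidates) for c in cands]

    def weight(a, b):
        # Candidates of the same query word compete, so they are not linked.
        if a[0] == b[0]:
            return 0.0
        return cooc.get((a[1], b[1]), cooc.get((b[1], a[1]), 0.0))

    out_total = {n: sum(weight(n, m) for m in nodes) for n in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(
                score[m] * weight(m, n) / out_total[m]
                for m in nodes
                if out_total[m] > 0
            )
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        score = updated

    return [
        max(cands, key=lambda c: score[(i, c)])
        for i, cands in enumerate(candidates)
    ]


# Hypothetical candidates for a two-word query and toy co-occurrence counts.
candidates = [["bank", "shore"], ["money", "currency"]]
cooc = {("bank", "money"): 8, ("bank", "currency"): 5, ("shore", "money"): 1}

chosen = disambiguate(candidates, cooc)
```

In this toy example, "bank" and "money" reinforce each other through their strong co-occurrence link, so they outrank "shore" and "currency".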

    A comparative study of online translation services for cross language Information retrieval

    Technical advances and increasing availability mean that Machine Translation (MT) is now widely used for the translation of search queries in multilingual search tasks. A number of free-to-use high-quality online MT systems are now available and, although imperfect in their translation behaviour, are found to produce good performance in Cross-Language Information Retrieval (CLIR) applications. Users of these MT systems in CLIR tasks generally assume that they all behave similarly in CLIR applications, and the choice of MT system is often made on the basis of convenience. We present a set of experiments which compare the impact of applying two of the best known online systems, Google and Bing translation, for query translation across multiple language pairs and for two very different CLIR tasks. Our experiments show that the MT systems perform differently on average for different tasks and language pairs, but more significantly for different individual queries. We examine the differing translation behaviour of these tools and seek to draw conclusions in terms of their suitability for use in different settings.

    Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration

    Machine Translation for Indian languages is an emerging research area. Transliteration is one of the modules we design when building a translation system. Transliteration means mapping source language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming-assisted transliteration. We have shown that much of the content in Gujarati gets transliterated while being processed for translation into Hindi.

    Arabic Information Retrieval: A Relevancy Assessment Survey

    This paper presents research in Arabic Information Retrieval (IR). It surveys the impact of statistical and morphological analysis of Arabic text in improving Arabic IR relevancy. We investigated the contributions of Stemming, Indexing, Query Expansion, Text Summarization (TS), Text Translation, and Named Entity Recognition (NER) in enhancing the relevancy of Arabic IR. Our survey emphasizes the quantitative relevancy measurements provided in the surveyed publications. The paper shows that researchers achieved significant enhancements, especially in building accurate stemmers, with accuracy reaching 97%, and in measuring the impact of different indexing strategies. Query expansion and Text Translation showed a positive effect on relevancy. However, other tasks such as NER and TS still need more research to establish their impact on Arabic IR.

    Analyst-Focused Arabic Information Retrieval

    An English-Arabic Cross-Language Information Retrieval Environment was created in which the analyst can query an Arabic database in English and retrieve a set of relevant Arabic documents. The retrieved Arabic documents are automatically translated into English to facilitate readability by the English-only analyst. Proper names of people, places, and organizations are extracted from the retrieved documents and transliterated from Arabic into English. They are presented to the analyst and serve to provide a brief summarization of the retrieved document search query in English. Cross-Language Information Retrieval (CLIR), itself a desideratum in the ARDA workshop, is a special case of Information Retrieval where retrieval is not restricted to the language of the query but queries in one language retrieve documents in other language(s) (Oard and Diekema, 1998). The Arabic that is used in the system is called Modern Standard Arabic (MSA). MSA is the formal Arabic that is used throughout the Arab world in news and broadcast media, and is the lingua franca of the Arab world. MSA has an estimated 200 million speakers living in Iraq, the Arabian Peninsula, the Levant, Egypt, and Northern Africa.