156 research outputs found
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system, where we combine a query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords based on its special phonogram.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which corresponds words
unlisted in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve the system performance
PRIME: A System for Multi-lingual Patent Retrieval
Given the growing number of patents filed in multiple countries, users are
interested in retrieving patents across languages. We propose a multi-lingual
patent retrieval system, which translates a user query into the target
language, searches a multilingual database for patents relevant to the query,
and improves the browsing efficiency by way of machine translation and
clustering. Our system also extracts new translations from patent families
consisting of comparable patents, to enhance the translation dictionary
Applying Machine Translation to Two-Stage Cross-Language Information Retrieval
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, needs a translation of queries and/or documents, so as
to standardize both of them into a common representation. For this purpose, the
use of machine translation is an effective approach. However, computational
cost is prohibitive in translating large-scale document collections. To resolve
this problem, we propose a two-stage CLIR method. First, we translate a given
query into the document language, and retrieve a limited number of foreign
documents. Second, we machine translate only those documents into the user
language, and re-rank them based on the translation result. We also show the
effectiveness of our method by way of experiments using Japanese queries and
English technical documents.Comment: 13 pages, 1 Postscript figur
Kannada and Telugu Native Languages to English Cross Language Information Retrieval
One of the crucial challenges in cross lingual
information retrieval is the retrieval of relevant information for
a query expressed in as native language. While retrieval of
relevant documents is slightly easier, analysing the relevance of
the retrieved documents and the presentation of the results to
the users are non-trivial tasks. To accomplish the above task,
we present our Kannada English and Telugu English CLIR
systems as part of Ad-Hoc Bilingual task. We take a query
translation based approach using bi-lingual dictionaries. When
a query words not found in the dictionary then the words are
transliterated using a simple rule based approach which
utilizes the corpus to return the ‘k’ closest English
transliterations of the given Kannada/Telugu word. The
resulting multiple translation/transliteration choices for each
query word are disambiguated using an iterative page-rank
style algorithm which, based on term-term co-occurrence
statistics, produces the final translated query. Finally we
conduct experiments on these translated query using a
Kannada/Telugu document collection and a set of English
queries to report the improvements, performance achieved for
each task is to be presented and statistical analysis of these
results are given
A comparative study of online translation services for cross language Information retrieval
Technical advances and its increasing availability, mean that Machine Translation (MT) is now widely used for the translation of search queries in multilingual search tasks. A number of free-to-use high-quality online MT systems are now available and, although imperfect in their translation behaviour, are found to produce good performance in CrossLanguage Information Retrieval (CLIR) applications. Users
of these MT systems in CLIR tasks generally assume that they all behave similarly in CLIR applications, and the choice of MT system is often made on the basis of convenience. We present a set of experiments which compare the impact of
applying two of the best known online systems, Google and Bing translation, for query translation across multiple language pairs and for two very different CLIR tasks. Our experiments show that the MT systems perform differently on average for different tasks and language pairs, but more significantly for different individual queries. We examine the differing translation behaviour of these tools and seek
to draw conclusions in terms of their suitability for use in different settings
Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration
Machine Translation for Indian languages is an emerging research area. Transliteration is one such module that we design while designing a translation system. Transliteration means mapping of source language text into the target language. Simple mapping decreases the efficiency of overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming assisted transliteration.We have shown that much of the content in Gujarati gets transliterated while being processed for translation to Hindi language
Arabic Information Retrieval: A Relevancy Assessment Survey
The paper presents a research in Arabic Information Retrieval (IR). It surveys the impact of statistical and morphological analysis of Arabic text in improving Arabic IR relevancy. We investigated the contributions of Stemming, Indexing, Query Expansion, Text Summarization (TS), Text Translation, and Named Entity Recognition (NER) in enhancing the relevancy of Arabic IR. Our survey emphasizing on the quantitative relevancy measurements provided in the surveyed publications. The paper shows that the researchers achieved significant enhancements especially in building accurate stemmers, with accuracy reaches 97%, and in measuring the impact of different indexing strategies. Query expansion and Text Translation showed positive relevancy effect. However, other tasks such as NER and TS still need more research to realize their impact on Arabic IR
Analyst-Focused Arabic Information Retrieval
An English-Arabic Cross-Language Information Retrieval Environment was created in which the analyst can query an Arabic database in English and retrieve a set of relevant Arabic documents. The retrieved Arabic documents are automatically translated into English to facilitate readability by the English-only analyst. Proper names of people, places, and organizations are extracted from the retrieved documents and transliterated from Arabic into English. They are presented to the analyst and serve to provide a brief summarization of the retrieved document search query in English. Cross-Language Information Retrieval (CLIR), itself a desideratum in the ARDA workshop, is a special case of Information Retrieval where retrieval is not restricted to the language of the query but queries in one language retrieve documents in other language(s) (Oard and Diekema, 1998).
The Arabic that is used in the system is called Modern Standard Arabic (MSA). MSA is the formal Arabic that is used throughout the Arab world in news and broadcast media, and the lingua franca of the Arab. MSA has an estimated 200 million speakers living in Iraq, the Arabian Peninsula, the Levant, Egypt, and Northern Africa
- …