114,995 research outputs found
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.Comment: 37 page
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system, where we combine a query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords based on its special phonogram.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which corresponds words
unlisted in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve the system performance
Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create create a simple, languageindependent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light
term conation step and useful in case of few language-specific resources. For English, the corpusbased
stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR.
Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best, for Bengali and Marathi, overlapping 3-grams obtain the best result, and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from
selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness
compared to using a fixed number of terms for different languages
Investigating cross-language speech retrieval for a spontaneous conversational speech collection
Cross-language retrieval of spontaneous speech combines the challenges of working with noisy automated transcription and language translation. The CLEF 2005 Cross-Language Speech Retrieval (CL-SR) task provides a standard test collection to investigate these challenges. We show that we can improve retrieval performance: by careful selection of the term weighting scheme; by decomposing automated transcripts into
phonetic substrings to help ameliorate transcription
errors; and by combining automatic transcriptions with manually-assigned metadata. We further show that topic translation with online machine translation resources
yields effective CL-SR
User experiments with the Eurovision cross-language image retrieval system
In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval.
The system is evaluated by multilingual users for two search tasks with the system configured in
English and five other languages. To our knowledge this is the first published set of user
experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual
search engine using little knowledge of any language other than English, (2) categorizing images
assists the user's search, and (3) there are differences in the way users search between the proposed
search tasks. Based on the two search tasks and user feedback, we describe important aspects of
any CL image retrieval system
Transitive probabilistic CLIR models.
Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness\ud
up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator
Applying Machine Translation to Two-Stage Cross-Language Information Retrieval
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, needs a translation of queries and/or documents, so as
to standardize both of them into a common representation. For this purpose, the
use of machine translation is an effective approach. However, computational
cost is prohibitive in translating large-scale document collections. To resolve
this problem, we propose a two-stage CLIR method. First, we translate a given
query into the document language, and retrieve a limited number of foreign
documents. Second, we machine translate only those documents into the user
language, and re-rank them based on the translation result. We also show the
effectiveness of our method by way of experiments using Japanese queries and
English technical documents.Comment: 13 pages, 1 Postscript figur
GeoCLEF 2007: the CLEF 2007 cross-language geographic information retrieval track overview
GeoCLEF ran as a regular track for the second time within the Cross
Language Evaluation Forum (CLEF) 2007. The purpose of GeoCLEF is to test
and evaluate cross-language geographic information retrieval (GIR): retrieval
for topics with a geographic specification. GeoCLEF 2007 consisted of two sub
tasks. A search task ran for the third time and a query classification task was
organized for the first. For the GeoCLEF 2007 search task, twenty-five search
topics were defined by the organizing groups for searching English, German,
Portuguese and Spanish document collections. All topics were translated into
English, Indonesian, Portuguese, Spanish and German. Several topics in 2007
were geographically challenging. Thirteen groups submitted 108 runs. The
groups used a variety of approaches. For the classification task, a query log
from a search engine was provided and the groups needed to identify the
queries with a geographic scope and the geographic components within the
local queries
Searching and organizing images across languages
With the continual growth of users on the Web
from a wide range of countries, supporting
such users in their search of cultural heritage
collections will grow in importance. In the
next few years, the growth areas of Internet
users will come from the Indian sub-continent
and China. Consequently, if holders of cultural
heritage collections wish their content to be
viewable by the full range of users coming to
the Internet, the range of languages that they
need to support will have to grow. This paper
will present recent work conducted at the
University of Sheffield (and now being
implemented in BRICKS) on how to use
automatic translation to provide search and
organisation facilities for a historical image
search engine. The system allows users to
search for images in seven different languages,
providing means for the user to examine
translated image captions and browse retrieved
images organised by categories written in their
native language
- …