Sheffield University CLEF 2000 submission - bilingual track: German to English
We investigated dictionary-based cross-language information
retrieval using lexical triangulation. Lexical triangulation combines the results
of different transitive translations. Transitive translation uses a pivot language
to translate between two languages when no direct translation resource is
available. We took German queries and translated them via Spanish or Dutch
into English. We compared the results of retrieval experiments using these
queries, with other versions created by combining the transitive translations or
created by direct translation. Direct dictionary translation of a query introduces
considerable ambiguity that damages retrieval, an average precision 79% below
monolingual in this research. Transitive translation introduces more ambiguity,
giving results worse than 88% below direct translation. We have shown that
lexical triangulation between two transitive translations can eliminate much of
the additional ambiguity introduced by transitive translation.
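
The triangulation idea in the abstract above can be illustrated with a minimal sketch. It assumes that combining two transitive translations means keeping only the English terms reached through both pivot languages; the dictionaries, their entries, and the fallback to the union when the routes share nothing are hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Lexicon = Dict[str, List[str]]

# Hypothetical toy dictionaries (illustrative entries only).
DE_ES: Lexicon = {"herbst": ["otoño"]}
ES_EN: Lexicon = {"otoño": ["autumn", "fall"]}
DE_NL: Lexicon = {"herbst": ["herfst", "najaar"]}
NL_EN: Lexicon = {"herfst": ["autumn"], "najaar": ["autumn", "late season"]}


def transitive(term: str, src_to_pivot: Lexicon, pivot_to_tgt: Lexicon) -> Set[str]:
    """Translate a term into the target language through one pivot language."""
    targets: Set[str] = set()
    for pivot_term in src_to_pivot.get(term, []):
        targets.update(pivot_to_tgt.get(pivot_term, []))
    return targets


def triangulate(term: str, routes: List[Tuple[Lexicon, Lexicon]]) -> Set[str]:
    """Keep only the translations that every non-empty pivot route agrees on.

    Assumption: if the routes share no translation, fall back to their union
    so that the query term is not dropped entirely.
    """
    candidates = [transitive(term, first, second) for first, second in routes]
    candidates = [c for c in candidates if c]
    if not candidates:
        return set()
    common = set.intersection(*candidates)
    return common if common else set.union(*candidates)


routes = [(DE_ES, ES_EN), (DE_NL, NL_EN)]
print(transitive("herbst", DE_ES, ES_EN))
print(transitive("herbst", DE_NL, NL_EN))
print(triangulate("herbst", routes))
```

In this toy case each pivot route adds one spurious candidate, and the intersection leaves only the translation both routes support.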
Cross-lingual document retrieval categorisation and navigation based on distributed services
The widespread use of the Internet across countries has increased the need for access to document collections
that are often written in languages different from a user’s native language. In this paper we describe Clarity, a
Cross Language Information Retrieval (CLIR) system for English, Finnish, Swedish, Latvian and Lithuanian.
Clarity is a fully-fledged retrieval system that supports the user during the whole process of query formulation,
text retrieval and document browsing. We address four of the major aspects of Clarity: (i) the user-driven
methodology that formed the basis for the iterative design cycle and framework in the project, (ii) the system
architecture that was developed to support the interaction and coordination of Clarity’s distributed services, (iii)
the data resources and methods for query translation, and (iv) the support for Baltic languages. Clarity is an
example of a distributed CLIR system built with minimal translation resources and, to our knowledge, the only
such system that currently supports Baltic languages.
Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections.
The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR.
Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
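
Two of the sub-word indexing units named in the abstract above, overlapping n-grams and k-prefixes, can be sketched as follows. This is an illustrative sketch only: the function names, the whitespace tokenisation, and the treatment of words shorter than the unit length are assumptions, not the paper's implementation.

```python
from typing import List


def overlapping_ngrams(word: str, n: int = 3) -> List[str]:
    """Return all overlapping character n-grams of a word.

    Assumption: words no longer than n are kept whole as a single index term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word: str, k: int = 4) -> str:
    """Return the first k characters of a word (the whole word if shorter)."""
    return word[:k]


def index_terms(text: str, unit: str = "3gram") -> List[str]:
    """Turn a text into index terms using one type of sub-word unit."""
    terms: List[str] = []
    for word in text.lower().split():  # naive whitespace tokenisation
        if unit == "3gram":
            terms.extend(overlapping_ngrams(word, 3))
        elif unit == "4prefix":
            terms.append(k_prefix(word, 4))
        else:
            terms.append(word)  # plain word indexing
    return terms


print(index_terms("information retrieval", "3gram"))
print(index_terms("information retrieval", "4prefix"))
```

The n-gram variant produces several shorter index terms per word form, while the prefix variant produces exactly one term per word.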
BIKE: Bilingual Keyphrase Experiments
This paper presents a novel strategy for translating lists
of keyphrases. Typical keyphrase lists appear in
scientific articles, information retrieval systems and
web page meta-data. Our system combines a statistical
translation model trained on a bilingual corpus of
scientific papers with sense-focused look-up in a large
bilingual terminological resource. For the latter,
we developed a novel technique that benefits from viewing
the keyphrase list as contextual help for sense
disambiguation. The optimal combination of modules was
discovered by a genetic algorithm. Our work applies to
the French / English language pair.
User experiments with the Eurovision cross-language image retrieval system
In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval.
The system is evaluated by multilingual users for two search tasks with the system configured in
English and five other languages. To our knowledge this is the first published set of user
experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual
search engine using little knowledge of any language other than English, (2) categorizing images
assists the user's search, and (3) there are differences in the way users search between the proposed
search tasks. Based on the two search tasks and user feedback, we describe important aspects of
any CL image retrieval system.
Classifying Amharic News Text Using Self-Organizing Maps
The paper addresses using artificial neural networks for classification of Amharic news items. Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy. It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field.
The experiments investigated document clustering around user queries using Self-Organizing Maps, an unsupervised learning neural network strategy. The best ANN model showed a precision of 60.0% when trying to cluster unseen data, and a 69.5% precision when trying to classify it.
ANNOTATION MODEL FOR LOANWORDS IN INDONESIAN CORPUS: A LOCAL GRAMMAR FRAMEWORK
There is a considerable number of loanwords in the Indonesian language, as it has been,
and continues to be, in contact with other languages. The contact takes place via different
media; one of them is the machine-readable medium. As information in different languages
can be obtained with a mouse click these days, the contact becomes more and more intense. This
paper aims at proposing an annotation model and lexical resource for loanwords in
Indonesian. The lexical resource is applied to a corpus by corpus processing software called
UNITEX. This software works under the local grammar framework.
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
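
One common way to embed a translation model in a bag-of-words retrieval model is to let each query term match document terms through translation weights. The sketch below illustrates that general idea only and is not claimed to be the integration used in the paper; the weight table `TRANS`, the `floor` smoothing constant, and the function name `score` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical translation weights linking a source-language query term
# to target-language document terms (toy values, not learned from data).
TRANS: Dict[str, Dict[str, float]] = {
    "maison": {"house": 0.8, "home": 0.2},
    "chat": {"cat": 1.0},
}


def score(query: List[str], doc_tokens: List[str],
          trans: Dict[str, Dict[str, float]], floor: float = 1e-6) -> float:
    """Log-likelihood style score of a document for a source-language query.

    Each query term contributes the translation-weighted relative frequency
    of its candidate translations in the document. `floor` is an assumed
    smoothing constant that keeps the logarithm finite when nothing matches.
    """
    tf = Counter(doc_tokens)
    length = len(doc_tokens) or 1  # avoid division by zero for empty documents
    total = 0.0
    for q in query:
        p = sum(w * tf[t] / length for t, w in trans.get(q, {}).items())
        total += math.log(p + floor)
    return total


doc_a = "the house is a home".split()
doc_b = "the cat sat".split()
print(score(["maison"], doc_a, TRANS), score(["maison"], doc_b, TRANS))
```

Because the score sums over all candidate translations rather than choosing one, no explicit disambiguation step is needed before retrieval.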