User experiments with the Eurovision cross-language image retrieval system
In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval.
The system is evaluated by multilingual users for two search tasks with the system configured in
English and five other languages. To our knowledge this is the first published set of user
experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual
search engine using little knowledge of any language other than English, (2) categorizing images
assists the user's search, and (3) there are differences in the way users search between the proposed
search tasks. Based on the two search tasks and user feedback, we describe important aspects of
any CL image retrieval system
Applying digital content management to support localisation
The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how these technologies can support localisation and localised content before concluding with some impressions of future directions in DCM
Searching and organizing images across languages
With the continual growth of users on the Web
from a wide range of countries, supporting
such users in their search of cultural heritage
collections will grow in importance. In the
next few years, the growth areas of Internet
users will come from the Indian sub-continent
and China. Consequently, if holders of cultural
heritage collections wish their content to be
viewable by the full range of users coming to
the Internet, the range of languages that they
need to support will have to grow. This paper
will present recent work conducted at the
University of Sheffield (and now being
implemented in BRICKS) on how to use
automatic translation to provide search and
organisation facilities for a historical image
search engine. The system allows users to
search for images in seven different languages,
providing means for the user to examine
translated image captions and browse retrieved
images organised by categories written in their
native language
Transitive probabilistic CLIR models.
Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness
up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator
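As a rough illustration of the idea of transitive translation (not the models evaluated in the paper), translation probabilities through a pivot language can be composed by summing over pivot words, P(t|s) = sum over p of P(p|s) * P(t|p). The function name `compose` and the toy Dutch, English, and French tables below are invented for this sketch.

```python
from collections import defaultdict


def compose(src_to_pivot, pivot_to_tgt):
    """Compose two probabilistic dictionaries through a pivot language.

    Both arguments map a word to a dict of {translation: probability}.
    """
    src_to_tgt = {}
    for s, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for p, prob_p_given_s in pivots.items():
            # Pivot words without any target translation contribute nothing.
            for t, prob_t_given_p in pivot_to_tgt.get(p, {}).items():
                scores[t] += prob_p_given_s * prob_t_given_p
        src_to_tgt[s] = dict(scores)
    return src_to_tgt


# Hypothetical toy tables with made-up probabilities.
nl_en = {"bank": {"bank": 0.7, "couch": 0.3}}
en_fr = {"bank": {"banque": 0.9, "rive": 0.1}, "couch": {"divan": 1.0}}
nl_fr = compose(nl_en, en_fr)
```

The composed table spreads probability mass over every target word reachable through the pivot, which is one reason transitive setups tend to be noisier than direct translation.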
Evaluation of MIRACLE approach results for CLEF 2003
This paper describes the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) approach and results for the mono, bi and multilingual Cross Language Evaluation Forum tasks. The approach is based on the combination of linguistic and statistical techniques to perform indexing and retrieval tasks
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Domain-specific query translation for multilingual access to digital libraries
Accurate high-coverage translation is a vital component of reliable cross language information access (CLIR) systems. This is particularly true of access to archives such as Digital Libraries which are often specific to certain domains. While general machine translation (MT) has been shown to be effective for CLIR tasks in information retrieval evaluation workshops, it is not well suited to specialized tasks where domain specific translations are required. We demonstrate that effective query translation
in the domain of cultural heritage (CH) can be achieved by augmenting a standard MT system with domain-specific phrase dictionaries automatically mined from the online Wikipedia. Experiments using our hybrid translation system with sample query logs from users of CH websites demonstrate a large improvement in the accuracy of domain specific phrase detection and translation
DCU@FIRE2010: term conflation, blind relevance feedback, and cross-language IR with manual and automatic query translation
For the first participation of Dublin City University (DCU)
in the FIRE 2010 evaluation campaign, information retrieval
(IR) experiments on English, Bengali, Hindi, and Marathi
documents were performed to investigate term conflation
(different stemming approaches and indexing word prefixes),
blind relevance feedback, and manual and automatic query
translation. The experiments are based on BM25 and on
language modeling (LM) for IR. Results show that term conflation always improves mean average precision (MAP)
compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments indexing 5-prefixes outperforms our corpus-based stemmer; in Hindi,
the corpus-based stemmer achieves a higher MAP. For Bengali, the LM retrieval model achieves a much higher MAP
than BM25 (0.4944 vs. 0.4526). In all experiments using
BM25, blind relevance feedback yields considerably higher
MAP in comparison to experiments without it. Bilingual IR experiments (English→Bengali and English→Hindi) are
based on query translations obtained from native speakers
and the Google translate web service. For the automatically
translated queries, MAP is slightly (but not significantly)
lower compared to experiments with manual query translations. The bilingual English→Bengali (English→Hindi)
experiments achieve 81.7%-83.3% (78.0%-80.6%) of the best
corresponding monolingual experiments
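One of the conflation strategies mentioned above, indexing word prefixes, can be sketched in a few lines. This is a minimal illustration, not the DCU implementation: the helper name `prefix_conflate` is hypothetical, and the sketch truncates by Unicode code point, which for Indic scripts may split a base character from its combining marks.

```python
def prefix_conflate(tokens, n=5):
    """Map each token to its first n characters (n=5 gives "5-prefixes").

    Tokens shorter than n are left unchanged.
    """
    return [tok[:n] for tok in tokens]


# English tokens are used only to keep the example easy to read.
index_terms = prefix_conflate(["retrieval", "retrieving", "retrieve", "ir"])
```

Word forms that share their first five characters collapse to a single index term, so a query term matches documents containing any of them without a language-specific stemmer.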
Termhood-based Comparability Metrics of Comparable Corpus in Special Domain
Cross-Language Information Retrieval (CLIR) and machine translation (MT)
resources, such as dictionaries and parallel corpora, are scarce and hard to
come by for special domains. Moreover, these resources are limited to a few
languages, such as English, French, and Spanish. Obtaining comparable corpora
automatically for such domains could therefore be an effective answer to this
problem. Comparable corpora, whose subcorpora are not translations of each
other, can be easily obtained from the web. Building and using comparable
corpora is therefore often a more feasible option in multilingual information
processing. Comparability metrics are one of the key issues in building and
using comparable corpora. Currently, there is no widely accepted definition or
measurement method for corpus comparability. In fact, different definitions or
metrics of comparability may be given to suit various natural language
processing tasks. A new comparability metric, namely a termhood-based metric
oriented to the task of bilingual terminology extraction, is proposed in this
paper. In this method, words are ranked by termhood rather than frequency, and
the cosine similarity calculated from the ranking lists of word termhood is
used as the comparability measure. Experimental results show that the
termhood-based metric performs better than traditional frequency-based metrics
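The cosine step can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how termhood is scored or how terms in the two languages are aligned, so the sketch assumes each subcorpus has already been reduced to a dict of termhood scores over a shared vocabulary. The function name `cosine` and the toy scores are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(term, 0.0) for term, score in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as not comparable
    return dot / (norm_u * norm_v)


# Hypothetical termhood scores for two subcorpora (made-up values).
corpus_a = {"retrieval": 0.9, "corpus": 0.6, "query": 0.4}
corpus_b = {"retrieval": 0.8, "corpus": 0.5, "stemmer": 0.3}
comparability = cosine(corpus_a, corpus_b)
```

A frequency-based variant would use the same function with raw term counts in place of termhood scores, which makes the two metrics easy to compare side by side.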