8 research outputs found

DCU and UTA at ImageCLEFPhoto 2007

Author: Adamek Tomasz
Airio Eija
Jones Gareth J.F.
Järvelin Anni
Wilkins Peter
Publication date: 01/09/2007

Dublin City University (DCU) and University of Tampere (UTA) participated in the ImageCLEF 2007 photographic ad-hoc retrieval task with several monolingual and bilingual runs. Our approach was language independent: text retrieval based on fuzzy s-gram query translation was combined with visual retrieval. Data fusion between text and image content was performed using unsupervised query-time weight generation approaches. Our baseline was a combination of dictionary-based query translation and visual retrieval, which achieved the best result. The best mixed-modality runs using fuzzy s-gram translation achieved on average around 83% of the performance of the baseline. Performance was more similar when only the top-rank precision levels P10 and P20 were considered. This suggests that fuzzy s-gram query translation combined with visual retrieval is a cheap alternative for cross-lingual image retrieval where only a small number of relevant items are required. Both sets of results emphasize the merit of our query-time weight generation schemes for data fusion: the fused runs exhibit marked performance increases over single modalities, and this is achieved without the use of any prior training data.
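As a rough illustration of the fuzzy s-gram matching mentioned in this abstract, the sketch below pairs a source-language query word with the most similar word in a target-language vocabulary. The abstract does not give the skip distances, the gram classes, or the similarity measure, so the choices here (skips 0 and 1, grams tagged by skip distance, Jaccard overlap) and the names `sgrams`, `sgram_similarity`, and `best_match` are assumptions made for illustration only.

```python
def sgrams(word, skips=(0, 1)):
    """Return character pairs formed by skipping `skip` characters.

    Each gram is tagged with its skip distance, so pairs from different
    skip distances never match each other (an assumption of this sketch).
    """
    grams = set()
    for skip in skips:
        step = skip + 1
        for i in range(len(word) - step):
            grams.add((skip, word[i] + word[i + step]))
    return grams


def sgram_similarity(a, b, skips=(0, 1)):
    """Jaccard overlap of the two words' s-gram sets (0.0 if both are empty)."""
    ga, gb = sgrams(a, skips), sgrams(b, skips)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def best_match(query_word, vocabulary, skips=(0, 1)):
    """Pick the vocabulary word most similar to the query word."""
    return max(vocabulary, key=lambda w: sgram_similarity(query_word, w, skips))


if __name__ == "__main__":
    # Toy vocabulary, invented for this example.
    vocab = ["retrieve", "image", "translation"]
    print(best_match("retrieval", vocab))
```

Because matching works on character pairs rather than dictionary entries, a sketch like this needs no translation resource, which is in line with the "cheap alternative" framing of the abstract.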

Irish Universities

DCU Online Research Access Service

Word normalization and decompounding in mono- and bilingual IR

Author: Airio Eija
Publication date: 01/01/2006

Trepo - Institutional Repository of Tampere University

Transitive dictionary translation challenges direct dictionary translation in CLIR

Author: Airio Eija
Järvelin Kalervo
Lehtokangas Raija
Publication venue: 'Elsevier BV'
Publication date: 01/01/2004

Trepo - Institutional Repository of Tampere University

Empirical Methods for Compound Splitting

Author: Knight Kevin
Koehn Philipp
Publication date: 01/01/2003

Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task. Comment: 8 pages, 2 figures. Published at EACL 200
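To make the idea of learning splits from a monolingual corpus more concrete, here is a minimal sketch of one frequency-based heuristic: enumerate every way to cut a word into known parts and keep the split whose parts have the highest geometric mean of corpus counts. The abstract does not spell out the splitting rules, so this heuristic, the minimum part length, the toy counts, and the names `candidate_splits` and `split_compound` are all assumptions; linking elements such as the German "s" are not handled.

```python
from math import prod


def candidate_splits(word, counts, min_len=3):
    """Yield every way to cut `word` into parts that all occur in `counts`."""
    if word in counts:
        yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in counts:
            for tail in candidate_splits(word[i:], counts, min_len):
                yield [head] + tail


def geometric_mean(parts, counts):
    """Geometric mean of the corpus counts of the parts."""
    return prod(counts[p] for p in parts) ** (1.0 / len(parts))


def split_compound(word, counts, min_len=3):
    """Return the best-scoring split, or the word itself if nothing applies."""
    return max(
        candidate_splits(word, counts, min_len),
        key=lambda parts: geometric_mean(parts, counts),
        default=[word],
    )


if __name__ == "__main__":
    # Toy corpus counts, invented for this example.
    counts = {"hausboot": 1, "haus": 100, "boot": 50}
    print(split_compound("hausboot", counts))
```

A rare compound made of frequent parts is split, while a word that is itself frequent tends to stay whole, since the unsplit word competes as a candidate with its own count.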

arXiv.org e-Print Archive

CiteSeerX

Crossref

Edinburgh Research Explorer

Experiments with transitive dictionary translation and pseudo-relevance feedback using graded relevance assessments

Author: Järvelin Kalervo
Keskustalo Heikki
Lehtokangas Raija
Publication venue: 'Wiley'
Publication date: 01/01/2008

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Effective techniques for Indonesian text retrieval

Author: Asian J
Publication venue: RMIT University
Publication date: 01/01/2007

The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information focuses on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks.
Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
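For readers unfamiliar with the Needleman-Wunsch algorithm named in this abstract, the sketch below computes the standard global alignment score between two sequences with dynamic programming. It is the textbook algorithm, not the thesis's adaptation: the scoring values (match +1, mismatch -1, gap -1) and the use of word lists as the sequences are assumptions for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences `a` and `b`.

    Works on strings or on lists of tokens; only the previous row of the
    dynamic programming table is kept in memory.
    """
    # Row 0: aligning an empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with an empty prefix of `b` costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Toy token sequences, invented for this example.
    doc_a = "jakarta 2007 world bank report".split()
    doc_b = "jakarta 2007 bank dunia report".split()
    print(needleman_wunsch(doc_a, doc_b))
```

Tokens that survive translation unchanged, such as names and numbers, contribute matches in the same order in both documents, which gives some intuition for why a content-based alignment score could separate parallel from non-parallel pairs without translating them.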

RMIT Research Repository

Inquiries into words, constraints and contexts : Festschrift in the honour of Kimmo Koskenniemi on his 60th birthday

Author: Arppe Antti
Carlson Lauri
Linden Krister
Piitulainen Jussi Olavi
Suominen Mickael
Vainio Martti
Westerlund Hanna
Yli-Jyrä Anssi Mikael
Publication venue: CSLI publications
Publication date: 01/01/2005

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Utaclir @ CLEF 2001 — Effects of Compound Splitting and N-Gram Techniques

Author: A.M. Robertson
C. Jacquemin
G. Grefenstette
J. N. Levi
J. Zhou
J. Zobel
S. W. Haas
T. Gadd
T. Hedlund
U. Pfeifer
Publication venue: 'Springer Science and Business Media LLC'

Crossref