
    DCU and UTA at ImageCLEFPhoto 2007

    Dublin City University (DCU) and University of Tampere (UTA) participated in the ImageCLEF 2007 photographic ad-hoc retrieval task with several monolingual and bilingual runs. Our approach was language independent: text retrieval based on fuzzy s-gram query translation was combined with visual retrieval. Data fusion between text and image content was performed using unsupervised query-time weight generation approaches. Our baseline was a combination of dictionary-based query translation and visual retrieval, which achieved the best result. The best mixed-modality runs using fuzzy s-gram translation achieved on average around 83% of the performance of the baseline. Performance was more similar when only the top-rank precision levels P10 and P20 were considered. This suggests that fuzzy s-gram query translation combined with visual retrieval is a cheap alternative for cross-lingual image retrieval where only a small number of relevant items are required. Both sets of results emphasize the merit of our query-time weight generation schemes for data fusion: the fused runs exhibit marked performance increases over single modalities, and this is achieved without the use of any prior training data.
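
The abstract above does not spell out how fuzzy s-gram matching works, so the following is only a minimal sketch of the general idea: a source-language query word is matched to the most similar target-language index word by comparing sets of character pairs formed with and without skipped characters. The skip classes `((0,), (1, 2))`, the Jaccard-style score, and the function names `s_grams`, `s_gram_similarity`, and `best_match` are all assumptions made for illustration, not details taken from the paper.

```python
def s_grams(word, classes=((0,), (1, 2))):
    """Return the set of skip-digrams of `word`, tagged by skip class.

    Each class groups skip distances; a skip of 0 gives ordinary
    adjacent digrams. The default classes are an assumption.
    """
    grams = set()
    for class_id, skips in enumerate(classes):
        for skip in skips:
            step = skip + 1
            for i in range(len(word) - step):
                grams.add((class_id, word[i] + word[i + step]))
    return grams


def s_gram_similarity(a, b, classes=((0,), (1, 2))):
    """Jaccard overlap of the two words' s-gram sets (assumed measure)."""
    a, b = a.lower(), b.lower()
    ga, gb = s_grams(a, classes), s_grams(b, classes)
    if not ga and not gb:
        # Words too short to yield any gram: fall back to exact match.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


def best_match(source_word, target_vocabulary):
    """Pick the target word most similar to the source word."""
    return max(target_vocabulary,
               key=lambda w: s_gram_similarity(source_word, w))
```

For example, `best_match("musik", ["music", "mosaic", "museum"])` selects `"music"`, which is how an untranslated query word could be mapped onto a spelling variant in the target index without a dictionary.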

    Empirical Methods for Compound Splitting

    Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task. Comment: 8 pages, 2 figures. Published at EACL 200
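
The abstract does not describe the splitting rules themselves. As a hedged illustration of what a monolingual, corpus-driven splitter can look like, the sketch below scores every candidate split by the geometric mean of the corpus frequencies of its parts and keeps the best one. The scoring rule, the filler letters `"s"` and `"es"`, the minimum part length, and the toy counts are assumptions chosen for the example, not figures or rules from the paper.

```python
from math import prod


def candidate_splits(word, min_len=3, fillers=("", "s", "es")):
    """Yield every split of `word` into parts of at least `min_len` letters.

    A filler such as German "s" may be dropped from the end of a
    non-final part. The unsplit word is always yielded first.
    """
    yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        for filler in fillers:
            if filler and not head.endswith(filler):
                continue
            stem = head[:len(head) - len(filler)] if filler else head
            if len(stem) < min_len:
                continue
            for tail in candidate_splits(rest, min_len, fillers):
                yield [stem] + tail


def best_split(word, counts, min_len=3):
    """Return the split whose parts have the highest geometric-mean count."""
    def score(parts):
        return prod(counts.get(p, 0) for p in parts) ** (1 / len(parts))
    # max() keeps the first best candidate, so ties favour the unsplit word.
    return max(candidate_splits(word.lower(), min_len), key=score)


# Toy corpus counts, invented for illustration only.
counts = {"aktionsplan": 2, "aktion": 900, "plan": 700, "akt": 200, "ion": 5}
split = best_split("Aktionsplan", counts)
```

With these toy counts, `split` is `["aktion", "plan"]`: the rare full compound and the implausible `akt` + `ion` + `plan` reading both score lower than the two frequent parts. A word with no known parts is returned unsplit.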

    Effective techniques for Indonesian text retrieval

    The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on information finding focuses on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to search of Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks.
    Well-curated parallel corpora are rare and, for many languages such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms for other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
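
The thesis abstract names the Needleman-Wunsch algorithm but gives no details of how it is adapted to documents. The sketch below is the textbook global-alignment score applied to two token sequences, which may be how shared tokens such as names and numbers could signal a parallel pair; the scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the whitespace tokenisation, and the function name `needleman_wunsch` are assumptions for illustration and are not taken from the thesis.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences `a` and `b`.

    Only the previous row of the dynamic-programming table is kept,
    so memory use is proportional to len(b).
    """
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Hypothetical untranslated token sequences from two candidate documents.
indonesian = "Jakarta 2005 presiden Susilo".split()
english = "Jakarta 2005 president Susilo".split()
score = needleman_wunsch(indonesian, english)
```

Here three tokens match and one is substituted, so `score` is 2. A higher score relative to document length suggests a closer pair; how the thesis turns such scores into its separation value is not stated in the abstract.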