5 research outputs found

Statistical transliteration for English-Arabic cross language information retrieval

Author: Leah S. Larkey
Nasreen Abduljaleel
Publication date: 01/01/2003

Out of vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In the present study, we present a simple statistical technique to train an English to Arabic transliteration model from pairs of names. We call this a selected n-gram model because a two-stage training procedure first learns which n-gram segments should be added to the unigram inventory for the source language, and then a second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically-trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration either of OOV named entities or of all OOV words is an effective approach for cross language IR.

CiteSeerX

What's in a Name?: Proper Names in Arabic Cross Language Information Retrieval

Author: Leah S. Larkey
Margaret Connell
Nasreen Abduljaleel

Proper names are problematic for cross language information retrieval. Standard bilingual dictionaries typically have poor coverage of proper names. On the other hand, IR tasks involving news corpora, like TDT and TREC cross language IR, have proper names at their core. In this study, we demonstrate the importance of proper names in one such task, the TREC 2002 (Arabic-English) cross language track, by showing that performance degrades a tremendous amount when the bilingual lexicons do not have proper names. We then examine several different sources of proper name translations from English to Arabic, both static and generative (transliteration), and explore their effectiveness in the context of the TREC 2002 cross language IR task. We support a conclusion that a combination of static translation resources plus transliteration provides a successful solution.

CiteSeerX

Hindi CLIR in thirty days

Author: Margaret E Connell
Leah S Larkey
Nasreen Abduljaleel
Publication date: 01/01/2003

CiteSeerX

Hindi CLIR in thirty days

Author: Leah S. Larkey
Margaret E. Connell
Nasreen Abduljaleel
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref