Search CORE

3,625 research outputs found

A Comparison of Different Machine Transliteration Models

Author: Choi K.
Isahara H.
Oh J.
Publication venue: 'AI Access Foundation'
Publication date: 06/10/2011
Field of study

Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance

arXiv.org e-Print Archive

Crossref

Miracle’s 2005 Approach to Monolingual Information Retrieval

Author: González Cristóbal José Carlos
Goñi Menoyo José Miguel
Villena Román Julio
Publication venue: E.T.S.I. Telecomunicación (UPM)
Publication date: 01/01/2005
Field of study

This paper presents the 2005 Miracle’s team approach to Monolingual Information Retrieval. The goal for the experiments in this year was twofold: continue testing the effect of combination approaches on information retrieval tasks, and improving our basic processing and indexing tools, adapting them to new languages with strange encoding schemes. The starting point was a set of basic components: stemming, transforming, filtering, proper nouns extracting, paragraph extracting, and pseudo-relevance feedback. Some of these basic components were used in different combinations and order of application for document indexing and for query processing. Second order combinations were also tested, by averaging or selective combination of the documents retrieved by different approaches for a particular query

Archivo Digital UPM

An automatically built named entity lexicon for Arabic

Author: Attia Mohammed
Monachini Monica
Toral Antonio
Tounsi Lamia
van Genabith Josef
Publication venue: European Language Resources Association
Publication date: 01/01/2010
Field of study

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold

CiteSeerX

Irish Universities

DCU Online Research Access Service

MIRACLE’s hybrid approach to bilingual and monolingual Information Retrieval

Author: Alonso Sánchez Javier
García Serrano Ana
González Cristóbal José Carlos
Goñi Menoyo José Miguel
Martínez Fernández José Luis
Martínez Fernández Paloma
Pablo Sánchez César de
Villena Román Julio
Publication venue: E.T.S.I. Telecomunicación (UPM)
Publication date: 01/01/2004
Field of study

The main goal of the bilingual and monolingual participation of the MIRACLE team at CLEF 2004 was testing the effect of combination approaches to information retrieval. The starting point is a set of basic components: stemming, transformation, filtering, generation of n-grams, weighting and relevance feedback. Some of these basic components are used in different combinations and order of application for document indexing and for query processing. Besides this, a second order combination is done, mainly by averaging or by selective combination of the documents retrieved by different approaches for a particular query

Archivo Digital UPM

Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings

Author: Lefever Els
Singh Pranaydeep
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2020
Field of study

Ghent University Academic Bibliography