6 research outputs found

    A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

    Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may affect the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (char-ngrams, outlinks and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).

    The work of the first author was in the framework of the Tacardi research project (TIN2012-38523-C02-00). The work of the fourth author was in the framework of the DIANA-Applications (TIN2012-38603-C02-01) and WIQ-EI IRSES (FP7 Marie Curie No. 269180) research projects.

    Barrón Cedeño, LA.; Paramita, ML.; Clough, P.; Rosso, P. (2014). A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles. In Advances in Information Retrieval. Springer Verlag (Germany). 424-429. https://doi.org/10.1007/978-3-319-06028-6_36
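
As an illustration of two of the language-independent measures named in the abstract above, the following is a minimal sketch of character n-gram similarity and word count ratio. The abstract does not give the paper's exact formulas, so everything here is an assumption made for illustration: cosine similarity over character trigram counts, a min/max length ratio, and the function names `char_ngrams`, `cosine` and `word_count_ratio`.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (assumed preprocessing:
    lowercasing and collapsing runs of whitespace)."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def word_count_ratio(text_a: str, text_b: str) -> float:
    """Length of the shorter text divided by the longer one, in
    whitespace-separated tokens (an assumed definition)."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)
```

Both functions return a value in [0, 1], so their scores could be averaged or fed to a learned combination. How the paper actually combines its models is not stated in the abstract.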

    Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles

    Wikipedia articles in different languages have been mined to support various tasks, such as Cross-Language Information Retrieval (CLIR) and Statistical Machine Translation (SMT). Articles on the same topic in different languages are often connected by inter-language links, which can be used to identify similar or comparable content. In this work, we investigate the correlation between similarity measures utilising language-independent and language-dependent features and the respective human judgments. A collection of 800 Wikipedia article pairs from 8 different language pairs was assembled and judged for similarity by two assessors. We report the development of this corpus and the inter-assessor agreement across the languages. Results show that similarity measured using language-independent features is comparable to that obtained by translating non-English documents. In both cases the correlation with human judgments is low and also depends on the language pair. The results and corpus generated from this work also provide insights into the measurement of cross-language similarity.
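
The comparison between automatic similarity scores and human judgments described above can be sketched as a correlation computation. The abstract does not say which coefficient the authors used. Pearson's r is assumed here purely for illustration, and the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between automatic scores and human judgments
    (assumed coefficient; the abstract does not name one)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Computing the coefficient separately for each language pair would show whether the correlation varies by pair, which is what the abstract reports.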

    Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles

    Extracting bilingual terms from the Web

    In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages except Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    Bilingual Lexicography: Basque-German (Hiztegigintza elebiduna: Euskara-Alemana)

    506 p., 44 p.

    In this thesis, we present the research we have carried out over the last five years. Bilingual lexicography of Basque and German is the topic shared by all the lines of work pursued. The aim was to create a new dictionary, specifically a Basque-German electronic dictionary. We report on the steps taken towards that aim, which we would place within Historical Lexicography, Metalexicography, Computational Linguistics and Applied Lexicography. First, we approach the topic from a diachronic perspective. We examine the works that exist to date for the German-Basque combination, among them three works from the 19th century and one dictionary each from 1968, 1999 and 2007. In the second part of the thesis, we discuss the state of the art in electronic lexicography for Basque and for German, also taking into account some works that predate the electronic era. We examine samples from several dictionaries, in paper as well as electronic format, and compare them with one another. Starting from the criteria developed in the second part, we move on to a concrete proposal in the third part: we set about structuring EuDeLex, a bilingual electronic dictionary linking German and Basque, proposing a macrostructure and a microstructure, both as an XML structure and as a dictionary in publication format. The fourth part deals with procedures for populating the EuDeLex dictionary with German-Basque translation equivalents. In collaboration with colleagues in computational linguistics, we set up a range of methods for obtaining pairs of German-Basque translation equivalents, and we use the previously hand-crafted EuDeLex dictionary data to evaluate the adequacy of Basque-German bilingual glossaries produced by semi-automatic and automatic methods. We carry out computational experiments on, among others, Basque-German parallel corpora, the NLP resources WordNet and EDBL, and the Wikipedia encyclopedias in both languages. We conclude that the methods applied help to reduce manual work effectively in the dictionary production process.