Character-level and syntax-level models for low-resource and multilingual natural language processing
There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, "cross-lingual bridges" can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world's languages.
This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter.
In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia's Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
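The graph-based label propagation idea in the fifth work can be sketched as follows. This is a minimal, generic propagation loop, not the thesis's actual algorithm: the graph, edge weights, seed labels, and tag set below are all illustrative.

```python
# Minimal sketch of graph-based label propagation for POS projection.
# Seeds (e.g. words labelled via projection from a high-resource language)
# are clamped; other nodes absorb their neighbours' label distributions.
from collections import defaultdict

def propagate(edges, seeds, tags, iters=10):
    """edges: {node: [(neighbour, weight), ...]}; seeds: {node: tag}."""
    # Initialise label distributions: seeds are one-hot, others uniform.
    dist = {n: {t: (1.0 if seeds.get(n) == t else 0.0) if n in seeds
                else 1.0 / len(tags) for t in tags}
            for n in edges}
    for _ in range(iters):
        new = {}
        for n, nbrs in edges.items():
            if n in seeds:                      # keep seed labels clamped
                new[n] = dist[n]
                continue
            acc = defaultdict(float)
            total = sum(w for _, w in nbrs) or 1.0
            for m, w in nbrs:
                for t, p in dist[m].items():
                    acc[t] += w * p / total     # weighted neighbour average
            z = sum(acc.values()) or 1.0
            new[n] = {t: acc[t] / z for t in tags}
        dist = new
    return {n: max(d, key=d.get) for n, d in dist.items()}
```

On a toy chain a–b–c with only "a" seeded as NOUN, the label spreads along the edges, which is the intuition behind projecting tags from annotated to unannotated words.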
A Word Sense Disambiguation Model for Amharic Words using Semi-Supervised Learning Paradigm
The main objective of this research was to design a WSD (word sense disambiguation) prototype model for Amharic words using a semi-supervised learning method to extract training sets, which minimizes the amount of human intervention required and can produce a considerable improvement in learning accuracy. Due to the unavailability of an Amharic WordNet, only five words were selected: atena, derese, tenesa, bela and ale. A separate data set was prepared for each of the five ambiguous words for the development of this Amharic WSD prototype. The final classification task was done on the fully labelled training set using the AdaBoost, bagging, and ADTree classification algorithms in the WEKA package. Keywords: ambiguity, bootstrapping, word sense disambiguation
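The bootstrapping step this abstract relies on (growing a labelled training set from a few seeds before the final supervised classification) can be sketched roughly as below. The collocation-count scoring, thresholds, and example contexts are toy assumptions, not the paper's actual features.

```python
# Hedged sketch of bootstrapping (self-training) for WSD: score unlabelled
# contexts against collocation profiles of each sense, and move only the
# confidently scored ones into the labelled set, then repeat.
from collections import Counter

def bootstrap(seeds, unlabeled, rounds=3, threshold=2):
    """seeds: {sense: [context word lists]}; unlabeled: list of contexts."""
    labelled = {s: [list(c) for c in ctxs] for s, ctxs in seeds.items()}
    pool = [list(c) for c in unlabeled]
    for _ in range(rounds):
        # Build a collocation profile per sense from the current labelled set.
        profile = {s: Counter(w for c in ctxs for w in c)
                   for s, ctxs in labelled.items()}
        remaining = []
        for ctx in pool:
            scores = {s: sum(profile[s][w] for w in ctx) for s in profile}
            best = max(scores, key=scores.get)
            others = (max(v for s, v in scores.items() if s != best)
                      if len(scores) > 1 else 0)
            if scores[best] - others >= threshold:  # confident: add to training set
                labelled[best].append(ctx)
            else:                                   # ambiguous: keep for later rounds
                remaining.append(ctx)
        pool = remaining
    return labelled, pool
```

The margin test (`scores[best] - others`) is what keeps human intervention low: only contexts the current model is sure about are auto-labelled, and genuinely ambiguous ones stay unlabelled.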
Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR
We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, which we follow with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR which does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When used to generate pseudo-labels for semi-supervised training, we obtain WERs that range from 32.5% to just 1.9% absolute worse than the equivalent fully supervised models trained on the same data. Comment: Submitted to Interspeech 202
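The pseudo-labelling step mentioned above follows a common pattern: a seed model decodes untranscribed utterances, and only confident hypotheses become training targets. A toy illustration of that filtering pattern, where `DummyModel` and the `decode()` interface are entirely hypothetical:

```python
# Toy illustration of the pseudo-labelling step for semi-supervised training.
# A real system would decode audio with a seed acoustic model; here a stub
# model stands in so the selection logic itself is runnable.
class DummyModel:
    """Stand-in for a seed ASR model; returns (hypothesis, confidence)."""
    def __init__(self, table):
        self.table = table

    def decode(self, utt):
        return self.table[utt]

def pseudo_label(model, utterances, confidence=0.9):
    """Keep (utterance, hypothesis) pairs whose confidence clears the bar."""
    pairs = []
    for utt in utterances:
        hyp, score = model.decode(utt)  # 1-best hypothesis plus confidence
        if score >= confidence:
            pairs.append((utt, hyp))    # confident output becomes a training pair
    return pairs
```

Low-confidence utterances are simply discarded rather than mislabelled, which is what makes retraining on the surviving pairs safe.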
A Comparative Review of Machine Learning for Arabic Named Entity Recognition
Arabic Named Entity Recognition (ANER) systems aim to identify and classify Arabic named entities (NEs) within Arabic text. Other important tasks in Arabic Natural Language Processing (NLP) depend on ANER, such as machine translation, question answering, and information extraction. In general, ANER systems can be classified into three main approaches, namely rule-based, machine-learning, or hybrid systems. In this paper, we focus on research progress in machine-learning (ML) ANER and compare systems in terms of linguistic resources, entity types, domains, methods, and performance. We also highlight the challenges of processing Arabic NEs with ML systems.