Projecting named entity tags from a resource rich language to a resource poor language
Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented.
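The string-similarity step mentioned in this abstract can be illustrated with a minimal sketch of a Dice coefficient over character bigrams. This is an assumption about the general technique only: the abstract does not give the exact scoring formula, the domain-specific rules, or the lexicon format, so the function names (`char_bigrams`, `dice`, `best_match`) and the sample lexicon are hypothetical.

```python
from collections import Counter


def char_bigrams(word: str) -> Counter:
    """Return the multiset of adjacent character pairs in a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters; fall back to exact match.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return 2.0 * overlap / total


def best_match(word: str, lexicon: list[str]) -> tuple[str, float]:
    """Return the lexicon entry most similar to `word`, with its score.

    Assumes a non-empty lexicon (hypothetical flat list of lexemes).
    """
    return max(((entry, dice(word, entry)) for entry in lexicon),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    lexicon = ["Malaysia", "Indonesia", "Filipina"]  # illustrative entries only
    print(best_match("Indonesia", lexicon))
    print(dice("night", "nacht"))
```

For example, `dice("night", "nacht")` shares one bigram (`ht`) out of eight in total and so scores 0.25. A projection system of the kind described would presumably combine such scores with its domain rules before assigning a tag.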
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
Building natural language processing systems for non-standardized and
low-resource languages is a difficult challenge. The recent success of large-scale
multilingual pretrained language models provides new modeling tools to tackle
this. In this work, we study the ability of multilingual language models to
process an unseen dialect. We take user-generated North African Arabic as our
case study, a resource-poor dialectal variety of Arabic with frequent
code-mixing with French and written in Arabizi, a non-standardized
transliteration of Arabic to Latin script. Focusing on two tasks,
part-of-speech tagging and dependency parsing, we show in zero-shot and
unsupervised adaptation scenarios that multilingual language models are able to
transfer to such an unseen dialect, specifically in two extreme cases: (i)
across scripts, using Modern Standard Arabic as a source language, and (ii)
from a distantly related language, unseen during pretraining, namely Maltese.
Our results constitute the first successful transfer experiments on this
dialect, thus paving the way for the development of an NLP ecosystem for
resource-scarce, non-standardized and highly variable vernacular languages.
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on finding approaches that allow data from multiple languages to be used to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research work on non-native and code-switching speech.
Cross-generational linguistic variation in the Canberra Vietnamese heritage language community: A corpus-centred investigation
This dissertation investigates cross-generational linguistic differences in the Canberra Vietnamese bilingual community, with a particular focus on Vietnamese as the heritage language. Specifically, it documents the vernacular and considers key aspects of this data from different theoretical perspectives. Its main contribution is an insight into a rarely studied heritage language variety in a contact community that has never been examined.
The dissertation consists of five core chapters, organised into two parts. In the first part (Chapters 2–3), I describe how I documented the vernacular and created the Canberra Vietnamese English Corpus (CanVEC), an original corpus compiled specifically for this study that is also the first to be freely available for research purposes. The corpus consists of over ten hours of spontaneous speech produced by 45 Vietnamese-English bilingual speakers across two generations living in Canberra. In the second part of the study (Chapters 4–6), I put the corpus to use and investigate aspects of the cross-generational differences in Vietnamese as the heritage language in this community.
In particular, I first probe the Vietnamese heritage language via its participation in the code-switching discourse (Chapter 4). In doing so, I focus on the applicability of the Matrix Language Framework (MLF) (Myers-Scotton, 1993, 2002) and its associated Matrix Language (ML) Turnover Hypothesis (Myers-Scotton, 1998) to the code-switching data in CanVEC. Since support for this prominent model has mainly come from language pairs that have different clausal word order or vastly different inventories of inflectional morphology, Vietnamese-English as a pair in which both languages are SVO and essentially isolating offers a tantalising testing ground for its application. Results show that the universal claims of this model do not hold so straightforwardly. CanVEC data challenges several assumptions of the MLF, with the model ultimately only being able to account for around half of the CanVEC code-switching data. I further demonstrate that even when the ML is putatively identifiable and a cross-generational ML ‘turnover’ is quantitatively observed, the predictions do not reflect the direction of structural influence that we see in CanVEC. The MLF approach therefore sheds only limited light on cross-generational language shift and variation in this community.
Given that null elements emerge as a distinct area of difficulty in Chapter 4, I take this aspect as the focal point for the next part of the investigation (Chapter 5), where I use the variationist approach (Labov, 1972 et seq.) to explore three cases where null and overt realisation alternates in Vietnamese: subjects, objects, and copulas. In doing so, I move away from the bilingual portion of CanVEC to examine the monolingual heritage Vietnamese subset directly. Results show that Vietnamese null subjects vary significantly across generations, while null objects and copulas remain stable in terms of use. As speakers also overwhelmingly prefer overt forms over null forms (∼70:30) across all three variables of interest, I next appeal to the generative interface-oriented approach (Sorace & Filiaci, 2006 et seq.) to examine the distribution of overt subjects, objects, and copulas (Chapter 6). These results converge with what was found for null forms: cross-generational effects were observed for pronominal subjects, but not pronominal objects and copulas. This finding also supports the importance of a distinction drawn in previous works between internal (syntax-semantics) and external (syntax-discourse/pragmatics) interface phenomena, with the latter being seemingly more susceptible to change.
Ultimately, this dissertation highlights the empirical and theoretical value of studying rarely considered contact varieties, while deploying an integrated approach that acknowledges the multi-faceted complexity of the contact communities where these varieties are spoken.
Cambridge Trust International Scholarshi