202 research outputs found

    Corpus-Based Approaches to Igbo Diacritic Restoration

    Get PDF
    With natural language processing (NLP), researchers aim to get the computer to identify and understand the patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntaxes, pragmatics and phonology, which needs to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But these research works focus more on well resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese etc. Over 95% of the world’s 7000 languages are low-resourced for NLP i.e. they have little or no data, tools, and techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages. Focusing on Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches, the standard n-gram model, the classification models and the embedding models were proposed. The standard n-gram models use a sequence of previous words to the target stripped word as key predictors of the correct variants. For the classification models, a window of words on both sides of the target stripped word were use. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variant vectors. The processes and techniques involved in projecting embeddings from a model trained with English texts to an Igbo embedding space and the creation of intrinsic evaluation tasks to validate the models were also discussed. A comparative analysis of the results indicate that all the approaches significantly improved on the baseline performance which uses the unigram model. The details of the processed involved in building the models as well as the possible directions for future work are discussed in this work

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing

    Get PDF

    Conversational Arabic Automatic Speech Recognition

    Get PDF
    Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are considered generally to be mother-tongues in those regions. CA has limited textual resource because it exists only as a spoken language and without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed that has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure where words can be created from attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues for the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage, and insufficient training data for written CA on language modelling is investigated; second, obtaining better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions result from each direction. Using data-driven classes from decomposed text are shown to reduce out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task

    Self-supervised learning in natural language processing

    Get PDF
    Most natural language processing (NLP) learning algorithms require labeled data. While this is given for a select number of (mostly English) tasks, the availability of labeled data is sparse or non-existent for the vast majority of use-cases. To alleviate this, unsupervised learning and a wide array of data augmentation techniques have been developed (Hedderich et al., 2021a). However, unsupervised learning often requires massive amounts of unlabeled data and also fails to perform in difficult (low-resource) data settings, i.e., if there is an increased distance between the source and target data distributions (Kim et al., 2020). This distributional distance can be the case if there is a domain drift or large linguistic distance between the source and target data. Unsupervised learning in itself does not exploit the highly informative (labeled) supervisory signals hidden in unlabeled data. In this dissertation, we show that by combining the right unsupervised auxiliary task (e.g., sentence pair extraction) with an appropriate primary task (e.g., machine translation), self-supervised learning can exploit these hidden supervisory signals more efficiently than purely unsupervised approaches, while functioning on less labeled data than supervised approaches. Our self-supervised learning approach can be used to learn NLP tasks in an efficient manner, even when the amount of training data is sparse or the data comes with strong differences in its underlying distribution, e.g., stemming from unrelated languages. For our general approach, we applied unsupervised learning as an auxiliary task to learn a supervised primary task. Concretely, we have focused on the auxiliary task of sentence pair extraction for sequence-to-sequence primary tasks (i.e., machine translation and style transfer) as well as language modeling, clustering, subspace learning and knowledge integration for primary classification tasks (i.e., hate speech detection and sentiment analysis). For sequence-to-sequence tasks, we show that self-supervised neural machine translation (NMT) achieves competitive results on high-resource language pairs in comparison to unsupervised NMT while requiring less data. Further combining self-supervised NMT with unsupervised NMT-inspired augmentation techniques makes the learning of low-resource (similar, distant and unrelated) language pairs possible. Further, using our self-supervised approach, we show how style transfer can be learned without the need for parallel data, generating stylistic rephrasings of highest overall performance on all tested tasks. For sequence-to-label tasks, we underline the benefit of auxiliary task-based augmentation over primary task augmentation. An auxiliary task that showed to be especially beneficial to the primary task performance was subspace learning, which led to impressive gains in (cross-lingual) zero-shot classification performance on similar or distant target tasks, also on similar, distant and unrelated languages.Die meisten Lernalgorithmen der Computerlingistik (CL) benötigen gelabelte Daten. Diese sind zwar für eine Auswahl an (hautpsächlich Englischen) Aufgaben verfügbar, für den Großteil aller Anwendungsfälle sind gelabelte Daten jedoch nur spärrlich bis gar nicht vorhanden. Um dem gegenzusteuern, wurde eine große Auswahl an Techniken entwickelt, welche sich das unüberwachte Lernen oder Datenaugmentierung zu eigen machen (Hedderich et al., 2021a). Unüberwachtes Lernen benötigt jedoch massive Mengen an ungelabelten Daten und versagt, wenn es mit schwierigen (resourcenarmen) Datensituationen konfrontiert wird, d.h. wenn eine größere Distanz zwischen der Quellen- und Zieldatendistributionen vorhanden ist (Kim et al., 2020). Eine distributionelle Distanz kann zum Beispiel der Fall sein, wenn ein Domänenunterschied oder eine größere sprachliche Distanz zwischen der Quellenund Zieldaten besteht. Unüberwachtes Lernen selbst nutzt die hochinformativen (gelabelten) Überwachungssignale, welche sich in ungelabelte Daten verstecken, nicht aus. In dieser Dissertation zeigen wir, dass selbstüberwachtes Lernen, durch die Kombination der richtigen unüberwachten Hilfsaufgabe (z.B. Satzpaarextraktion) mit einer passenden Hauptaufgabe (z.B. maschinelle Übersetzung), diese versteckten Überwachsungssignale effizienter ausnutzen kann als pure unüberwachte Lernalgorithmen, und dabei auch noch weniger gelabelte Daten benötigen als überwachte Lernalgorithmen. Unser selbstüberwachter Lernansatz erlaubt es uns, CL Aufgaben effizient zu lernen, selbst wenn die Trainingsdatenmenge spärrlich ist oder die Daten mit starken distributionellen Differenzen einher gehen, z.B. weil die Daten von zwei nicht verwandten Sprachen stammen. Im Generellen haben wir unüberwachtes Lernen als Hilfsaufgabe angewandt um eine überwachte Hauptaufgabe zu erlernen. Konkret haben wir uns auf Satzpaarextraktion als Hilfsaufgabe für Sequenz-zu-Sequenz Hauptaufgaben (z.B. maschinelle Übersetzung und Stilübertragung) konzentriert sowohl als auch Sprachmodelierung, Clustern, Teilraumlernen und Wissensintegration zum erlernen von Klassifikationsaufgaben (z.B. Hassredenidentifikation und Sentimentanalyse). Für Sequenz-zu-Sequenz Aufgaben zeigen wir, dass selbstüberwachte maschinelle Übersetzung (MÜ) im Vergleich zur unüberwachten MÜ wettbewerbsfähige Ergebnisse auf resourcenreichen Sprachpaaren erreicht und währenddessen weniger Daten zum Lernen benötigt. Wenn selbstüberwachte MÜ mit Augmentationstechniken, inspiriert durch unüberwachte MÜ, kombiniert wird, wird auch das Lernen von resourcenarmen (ähnlichen, entfernt verwandten und nicht verwandten) Sprachpaaren möglich. Außerdem zeigen wir, wie unser selbsüberwachter Lernansatz es ermöglicht Stilübertragung ohne parallele Daten zu erlernen und dabei stylistische Umformulierungen von höchster Qualität auf allen geprüften Aufgaben zu erlangen. Für Sequenz-zu-Label Aufgaben unterstreichen wir den Vorteil, welchen hilfsaufgabenseitige Augmentierung über hauptaufgabenseitige Augmentierung hat. Eine Hilfsaufgabe welche sich als besonders hilfreich für die Qualität der Hauptaufgabe herausstellte ist das Teilraumlernen, welches zu beeindruckenden Leistungssteigerungen für (sprachübergreifende) zero-shot Klassifikation ähnlicher und entfernter Zielaufgaben (auch für ähnliche, entfernt verwandte und nicht verwandte Sprachen) führt

    MULTILINGUAL TEXT READING IN NATURAL SCENE IMAGES

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    ‘The Erasures, the Silences Where There Should Have Been Evidence’: Dismantling Archived History and Dwelling Experience Within the Works of Hilary Mantel

    Get PDF
    The thesis critically examines Hilary Mantel’s work and its interpretation of the past through literary depictions of dwellings and archives. This thesis explores the concept of the dwelling-as-archive which has the overlapping roles and functions of a domestic dwelling and a historical archive. This thesis investigates this concept as it appears throughout Hilary Mantel’s works. These dwelling-as-archives are also in-between places where memory and historical meaning slip or are hidden. This thesis claims that Mantel’s historical fiction writing is an affective response that questions the ways characters struggle to preserve historical documents in both dwellings and archives. The thesis’ historico-materialist focus on Mantel’s works and dwellings-as-archives argues for a re-historicisation of phenomenology which the thesis defines as a phenomenological approach grounded in historical awareness. Characters attempt to reconstruct an affective experience of the past through domestic objects and archives. The dwelling-as-archives are conceptualised as containers of memory. However, their status as in-between places mean archived objects are frequently hidden, slip, or are waiting to be discovered to re-experience the past. The thesis explores Mantel’s poetic dissembling of the borders between absence/presence, belonging/rootlessness, and private/public that helps expose possible lost historical meaning. The various sites of dwelling examined in this thesis are haunted by memory and the dead, with the inhabitants of such dwellings causing the decay of architectural structures and family homelife. This thesis then examines instances of inscription, intertextuality, and historiographic metafiction in a select number of Mantel’s novels to reveal the archive as an unstable product of familial, national, and institutional historical consciousness. The dwelling-as-archive is a hinterland through which Mantel’s imagination can reconstruct the missing gaps she finds in the historical record. This thesis contributes to a growing critical field of studies on Mantel’s works which includes Eileen Pollard and Ginette Carpenter’s Hilary Mantel: Contemporary Critical Perspectives (2018) and Lucy Arnold’s reading of spectrality and intertextuality in Reading Mantel: Haunted Decades (2019). Recent commercial and critical responses to the Wolf Hall Trilogy often do not recognise a literary career spanning twenty-five years from 1985. This thesis then investigates Mantel's specific challenges when (re)interpreting the past through her imagination and research, with the dwelling-as-archive often a background character in her character’s experience of dwelling and history

    Meaning refinement to improve cross-lingual information retrieval

    Get PDF
    Magdeburg, Univ., Fak. für Informatik, Diss., 2012von Farag Ahme

    An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach

    Get PDF
    This thesis brings together an overview of Text Readability (TR) about Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It will present our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations about enhancing the TR and TS models. Reading is one of the most vital tasks that provide language input for communication and comprehension skills. It is proved that the use of long sentences, connected sentences, embedded phrases, passive voices, non- standard word orders, and infrequent words can increase the text difficulty for people with low literacy levels, as well as second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. The accuracy of the 3-way CEFR (The Common European Framework of Reference for Languages Proficiency Levels) classification is F-1 of 0.80 and 0.75 for Arabic-Bert and XLM-R classification, respectively and 0.71 Spearman correlation for the regression task. At the same time, the binary difficulty classifier reaches F-1 0.94 and F-1 0.98 for the sentence-pair semantic similarity classifier. TS is an NLP task aiming to reduce the linguistic complexity of the text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented using two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore while combining Arabic- BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97. To reiterate, this research demonstrated the effectiveness of the implementation of a corpus-based method combined with extracting extensive linguistic features via the latest NLP techniques. It provided insights which can be of use in various Arabic corpus studies and NLP tasks such as translation for educational purposes
    corecore