10 research outputs found
Efficient Convolutional Neural Networks for Diacritic Restoration
Diacritic restoration has gained importance with the growing need for
machines to understand written texts. The task is typically modeled as a
sequence labeling problem and currently Bidirectional Long Short Term Memory
(BiLSTM) models provide state-of-the-art results. Recently, Bai et al. (2018)
show the advantages of Temporal Convolutional Neural Networks (TCN) over
Recurrent Neural Networks (RNN) for sequence modeling in terms of performance
and computational resources. As diacritic restoration benefits from both
previous as well as subsequent timesteps, we further apply and evaluate a
variant of TCN, Acausal TCN (A-TCN), which incorporates context from both
directions (previous and future) rather than strictly incorporating previous
context as in the case of TCN. A-TCN yields significant improvement over TCN
for diacritization in three different languages: Arabic, Yoruba, and
Vietnamese. Furthermore, A-TCN and BiLSTM have comparable performance, making
A-TCN an efficient alternative over BiLSTM since convolutions can be trained in
parallel. A-TCN is significantly faster than BiLSTM at inference time
(270%-334% improvement in the amount of text diacritized per minute).Comment: accepted in EMNLP 201
Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation
In this work, we present several deep learning models for the automatic
diacritization of Arabic text. Our models are built using two main approaches,
viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN),
with several enhancements such as 100-hot encoding, embeddings, Conditional
Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested
on the only freely available benchmark dataset and the results show that our
models are either better or on par with other models, which require
language-dependent post-processing steps, unlike ours. Moreover, we show that
diacritics in Arabic can be used to enhance the models of NLP tasks such as
Machine Translation (MT) by proposing the Translation over Diacritization (ToD)
approach.Comment: 18 pages, 17 figures, 14 table
IDAT@FIRE2019: Overview of the Track on Irony Detection in Arabic Tweets
[EN] This overview paper describes the first shared task on irony
detection for the Arabic language. The task consists of a binary classification of tweets as ironic or not using a dataset composed of 5,030
Arabic tweets about different political issues and events related to the
Middle East and the Maghreb. Tweets in our dataset are written in
Modern Standard Arabic but also in different Arabic language varieties
including Egypt, Gulf, Levantine and Maghrebi dialects. Eighteen teams
registered to the task among which ten submitted their runs. The methods of participants ranged from feature-based to neural networks using
either classical machine learning techniques or ensemble methods. The
best performing system achieved F-score value of 0.844, showing that
classical feature-based models outperform the neural ones.This publication was made possible by NPRP grant 9-175-1-033 from the Qatar
National Research Fund (a member of Qatar Foundation). The findings achieved
herein are solely the responsibility of the last author. The work of Paolo Rosso
was also partially funded by Generalitat Valenciana under grant PROMETEO/2019/121.Ghanem, B.; Karoui, J.; Benamara, F.; Moriceau, V.; Rosso, P. (2019). IDAT@FIRE2019: Overview of the Track on Irony Detection in Arabic Tweets. CEUR-WS.org. 380-390. http://hdl.handle.net/10251/180744S38039
Multi-dialect Arabic broadcast speech recognition
Dialectal Arabic speech research suffers from the lack of labelled resources and
standardised orthography. There are three main challenges in dialectal Arabic
speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training
robust dialectal speech recognition models from limited labelled data and (iii)
evaluating speech recognition for dialects with no orthographic rules. This thesis
is concerned with the following three contributions:
Arabic Dialect Identification: We are mainly dealing with Arabic speech
without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently
diverse to the extent that one can argue that they are different languages
rather than dialects of the same language. We have two contributions:
First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected
from Al Jazeera TV channel. We obtained utterance level dialect labels for 57
hours of high-quality consisting of four major varieties of dialectal Arabic (DA),
comprised of Egyptian, Levantine, Gulf or Arabic peninsula, North African or
Moroccan from almost 1,000 hours. Second, we build an Arabic dialect identification
(ADI) system. We explored two main groups of features, namely acoustic
features and linguistic features. For the linguistic features, we look at a wide
range of features, addressing words, characters and phonemes. With respect to
acoustic features, we look at raw features such as mel-frequency cepstral coefficients
combined with shifted delta cepstra (MFCC-SDC), bottleneck features and
the i-vector as a latent variable. We studied both generative and discriminative
classifiers, in addition to deep learning approaches, namely deep neural network
(DNN) and convolutional neural network (CNN). In our work, we propose Arabic
as a five class dialect challenge comprising of the previously mentioned four
dialects as well as modern standard Arabic.
Arabic Speech Recognition: We introduce our effort in building Arabic automatic
speech recognition (ASR) and we create an open research community
to advance it. This section has two main goals: First, creating a framework for
Arabic ASR that is publicly available for research. We address our effort in building
two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast
news using more than 1,200 hours of speech and 130M words of text collected
from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre
data with limited non-orthographic speech collected from YouTube, with special
attention paid to transfer learning. Second, building a robust Arabic ASR system
and reporting a competitive word error rate (WER) to use it as a potential
benchmark to advance the state of the art in Arabic ASR. Our overall system is
a combination of five acoustic models (AM): unidirectional long short term memory
(LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN),
TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers
followed by BLSTM layers (TDNN-BLSTM). The AM is trained using purely
sequence trained neural networks lattice-free maximum mutual information (LFMMI).
The generated lattices are rescored using a four-gram language model
(LM) and a recurrent neural network with maximum entropy (RNNME) LM.
Our official WER is 13%, which has the lowest WER reported on this task.
Evaluation: The third part of the thesis addresses our effort in evaluating dialectal
speech with no orthographic rules. Our methods learn from multiple
transcribers and align the speech hypothesis to overcome the non-orthographic
aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU
score used in machine translation (MT). We have also automated this process
by learning different spelling variants from Twitter data. We mine automatically
from a huge collection of tweets in an unsupervised fashion to build more than
11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal
WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with
no reference transcription using decoding and language features. We show that
our word error rate estimation is robust for many scenarios with and without the
decoding features
Conversational Arabic Automatic Speech Recognition
Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are considered generally to be mother-tongues in those regions. CA has limited textual resource because it exists only as a spoken language and without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed that has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure where words can be created from attaching affixes to a word.
In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates.
This thesis investigates the main issues for the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage, and insufficient training data for written CA on language modelling is investigated; second, obtaining better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions result from each direction. Using data-driven classes from decomposed text are shown to reduce out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task
Twitter Analysis to Predict the Satisfaction of Saudi Telecommunication Companies’ Customers
The flexibility in mobile communications allows customers to quickly switch from one service provider to
another, making customer churn one of the most critical challenges for the data and voice telecommunication
service industry. In 2019, the percentage of post-paid telecommunication customers in Saudi Arabia
decreased; this represents a great deal of customer dissatisfaction and subsequent corporate fiscal losses.
Many studies correlate customer satisfaction with customer churn. The Telecom companies have depended
on historical customer data to measure customer churn. However, historical data does not reveal current
customer satisfaction or future likeliness to switch between telecom companies. Current methods of analysing
churn rates are inadequate and faced some issues, particularly in the Saudi market.
This research was conducted to realize the relationship between customer satisfaction and customer churn
and how to use social media mining to measure customer satisfaction and predict customer churn.
This research conducted a systematic review to address the churn prediction models problems and their
relation to Arabic Sentiment Analysis. The findings show that the current churn models lack integrating
structural data frameworks with real-time analytics to target customers in real-time. In addition, the findings
show that the specific issues in the existing churn prediction models in Saudi Arabia relate to the Arabic
language itself, its complexity, and lack of resources.
As a result, I have constructed the first gold standard corpus of Saudi tweets related to telecom companies,
comprising 20,000 manually annotated tweets. It has been generated as a dialect sentiment lexicon extracted
from a larger Twitter dataset collected by me to capture text characteristics in social media. I developed a
new ASA prediction model for telecommunication that fills the detected gaps in the ASA literature and fits
the telecommunication field. The proposed model proved its effectiveness for Arabic sentiment analysis and
churn prediction. This is the first work using Twitter mining to predict potential customer loss (churn) in
Saudi telecom companies, which has not been attempted before. Different fields, such as education, have
different features, making applying the proposed model is interesting because it based on text-mining
Self-supervised learning in natural language processing
Most natural language processing (NLP) learning algorithms require labeled data. While this is given for a select number of (mostly English) tasks, the availability of labeled data is sparse or non-existent for the vast majority of use-cases. To alleviate this, unsupervised learning and a wide array of data augmentation techniques have been developed (Hedderich et al., 2021a). However, unsupervised learning often requires massive amounts of unlabeled data and also fails to perform in difficult (low-resource) data settings, i.e., if there is an increased distance between the source and target data distributions (Kim et al., 2020). This distributional distance can be the case if there is a domain drift or large linguistic distance between the source and target data. Unsupervised learning in itself does not exploit the highly informative (labeled) supervisory signals hidden in unlabeled data. In this dissertation, we show that by combining the right unsupervised auxiliary task (e.g., sentence pair extraction) with an appropriate primary task (e.g., machine translation), self-supervised learning can exploit these hidden supervisory signals more efficiently than purely unsupervised approaches, while functioning on less labeled data than supervised approaches. Our self-supervised learning approach can be used to learn NLP tasks in an efficient manner, even when the amount of training data is sparse or the data comes with strong differences in its underlying distribution, e.g., stemming from unrelated languages. For our general approach, we applied unsupervised learning as an auxiliary task to learn a supervised primary task. Concretely, we have focused on the auxiliary task of sentence pair extraction for sequence-to-sequence primary tasks (i.e., machine translation and style transfer) as well as language modeling, clustering, subspace learning and knowledge integration for primary classification tasks (i.e., hate speech detection and sentiment analysis). For sequence-to-sequence tasks, we show that self-supervised neural machine translation (NMT) achieves competitive results on high-resource language pairs in comparison to unsupervised NMT while requiring less data. Further combining self-supervised NMT with unsupervised NMT-inspired augmentation techniques makes the learning of low-resource (similar, distant and unrelated) language pairs possible. Further, using our self-supervised approach, we show how style transfer can be learned without the need for parallel data, generating stylistic rephrasings of highest overall performance on all tested tasks. For sequence-to-label tasks, we underline the benefit of auxiliary task-based augmentation over primary task augmentation. An auxiliary task that showed to be especially beneficial to the primary task performance was subspace learning, which led to impressive gains in (cross-lingual) zero-shot classification performance on similar or distant target tasks, also on similar, distant and unrelated languages.Die meisten Lernalgorithmen der Computerlingistik (CL) benötigen gelabelte Daten. Diese sind zwar für eine Auswahl an (hautpsächlich Englischen) Aufgaben verfügbar, für den Großteil aller Anwendungsfälle sind gelabelte Daten jedoch nur spärrlich bis gar nicht vorhanden. Um dem gegenzusteuern, wurde eine große Auswahl an Techniken entwickelt, welche sich das unüberwachte Lernen oder Datenaugmentierung zu eigen machen (Hedderich et al., 2021a). Unüberwachtes Lernen benötigt jedoch massive Mengen an ungelabelten Daten und versagt, wenn es mit schwierigen (resourcenarmen) Datensituationen konfrontiert wird, d.h. wenn eine größere Distanz zwischen der Quellen- und Zieldatendistributionen vorhanden ist (Kim et al., 2020). Eine distributionelle Distanz kann zum Beispiel der Fall sein, wenn ein Domänenunterschied oder eine größere sprachliche Distanz zwischen der Quellenund Zieldaten besteht. Unüberwachtes Lernen selbst nutzt die hochinformativen (gelabelten) Überwachungssignale, welche sich in ungelabelte Daten verstecken, nicht aus. In dieser Dissertation zeigen wir, dass selbstüberwachtes Lernen, durch die Kombination der richtigen unüberwachten Hilfsaufgabe (z.B. Satzpaarextraktion) mit einer passenden Hauptaufgabe (z.B. maschinelle Übersetzung), diese versteckten Überwachsungssignale effizienter ausnutzen kann als pure unüberwachte Lernalgorithmen, und dabei auch noch weniger gelabelte Daten benötigen als überwachte Lernalgorithmen. Unser selbstüberwachter Lernansatz erlaubt es uns, CL Aufgaben effizient zu lernen, selbst wenn die Trainingsdatenmenge spärrlich ist oder die Daten mit starken distributionellen Differenzen einher gehen, z.B. weil die Daten von zwei nicht verwandten Sprachen stammen. Im Generellen haben wir unüberwachtes Lernen als Hilfsaufgabe angewandt um eine überwachte Hauptaufgabe zu erlernen. Konkret haben wir uns auf Satzpaarextraktion als Hilfsaufgabe für Sequenz-zu-Sequenz Hauptaufgaben (z.B. maschinelle Übersetzung und Stilübertragung) konzentriert sowohl als auch Sprachmodelierung, Clustern, Teilraumlernen und Wissensintegration zum erlernen von Klassifikationsaufgaben (z.B. Hassredenidentifikation und Sentimentanalyse). Für Sequenz-zu-Sequenz Aufgaben zeigen wir, dass selbstüberwachte maschinelle Übersetzung (MÜ) im Vergleich zur unüberwachten MÜ wettbewerbsfähige Ergebnisse auf resourcenreichen Sprachpaaren erreicht und währenddessen weniger Daten zum Lernen benötigt. Wenn selbstüberwachte MÜ mit Augmentationstechniken, inspiriert durch unüberwachte MÜ, kombiniert wird, wird auch das Lernen von resourcenarmen (ähnlichen, entfernt verwandten und nicht verwandten) Sprachpaaren möglich. Außerdem zeigen wir, wie unser selbsüberwachter Lernansatz es ermöglicht Stilübertragung ohne parallele Daten zu erlernen und dabei stylistische Umformulierungen von höchster Qualität auf allen geprüften Aufgaben zu erlangen. Für Sequenz-zu-Label Aufgaben unterstreichen wir den Vorteil, welchen hilfsaufgabenseitige Augmentierung über hauptaufgabenseitige Augmentierung hat. Eine Hilfsaufgabe welche sich als besonders hilfreich für die Qualität der Hauptaufgabe herausstellte ist das Teilraumlernen, welches zu beeindruckenden Leistungssteigerungen für (sprachübergreifende) zero-shot Klassifikation ähnlicher und entfernter Zielaufgaben (auch für ähnliche, entfernt verwandte und nicht verwandte Sprachen) führt
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal