Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in
Computer-Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2)
learning and speech therapy applications. Existing MDD methods that rely on
analysing phonemes can only detect categorical errors for phonemes with an
adequate amount of training data to be modelled. Given the unpredictable nature
of the pronunciation errors of non-native or disordered speakers and the
scarcity of training datasets, it is infeasible to model all types of
mispronunciation.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners from different native languages. The proposed speech attribute MDD
method was further compared to traditional phoneme-level MDD and achieved a
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) across all speech attributes.
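The attribute-based decomposition the paper builds on can be sketched as follows; the attribute inventory and the phoneme-to-attribute mapping below are illustrative assumptions, not the paper's exact feature set:

```python
# Map phonemes to articulatory speech attributes (illustrative subset).
# Attributes are not mutually exclusive, which is why a multi-label CTC
# setup trains one label sequence per attribute rather than one per phoneme.
ATTRIBUTES = ("voiced", "nasal", "fricative", "bilabial")

PHONEME_ATTRIBUTES = {
    "b": {"voiced", "bilabial"},
    "p": {"bilabial"},
    "m": {"voiced", "nasal", "bilabial"},
    "s": {"fricative"},
    "z": {"voiced", "fricative"},
}

def attribute_targets(phoneme_seq):
    """Turn one phoneme sequence into one present/absent label sequence
    per attribute, as a multi-label CTC target construction would require."""
    return {
        attr: ["+" if attr in PHONEME_ATTRIBUTES[p] else "-" for p in phoneme_seq]
        for attr in ATTRIBUTES
    }

targets = attribute_targets(["m", "s", "b"])
print(targets["voiced"])  # ['+', '-', '+']
print(targets["nasal"])   # ['+', '-', '-']
```

Because each attribute stream is binary and shared across many phonemes, an error on a rare phoneme can still be detected through its (well-trained) attribute components.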
Artificial Neural Network (ANN) in a Small Dataset to determine Neutrality in the Pronunciation of English as a Foreign Language in Filipino Call Center Agents
Artificial Neural Networks (ANNs) have continued to be efficient models in solving classification problems. In this paper, we explore the use of an ANN with a small dataset to accurately classify whether Filipino call center agents’ pronunciations are neutral or not based on their employer’s standards. Isolated utterances of the ten most commonly used words in the call center were recorded from eleven agents, creating a dataset of 110 utterances. Two learning specialists were consulted to establish ground truths, and Cohen’s Kappa was computed as 0.82, validating the reliability of the dataset. The first thirteen Mel-Frequency Cepstral Coefficients (MFCCs) were then extracted from each word and an ANN was trained with ten-fold stratified cross-validation. Experimental results on the model recorded a classification accuracy of 89.60%, supported by an overall F-score of 0.92.
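The Cohen's Kappa figure quoted above is straightforward to compute for any pair of annotators; the two label lists below are invented for illustration and do not reproduce the paper's data:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - chance) / (1 - chance)

# Two specialists labelling 10 utterances as neutral (1) or not (0):
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
print(round(cohens_kappa(a, b), 2))  # → 0.78
```

Values above roughly 0.8, as in the paper, are conventionally read as strong agreement, which is what justifies treating the specialists' labels as ground truth.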
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, it should be possible to extract individual-view
information. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
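The three combination strategies can be contrasted in a minimal numpy sketch; the grader outputs, representation sizes, and weights below are invented stand-ins, not the thesis's trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins for three single-view graders (e.g. pronunciation,
# rhythm, text): each yields a holistic-score prediction and a fixed-size
# intermediate representation.
preds = np.array([4.5, 5.0, 4.0])   # per-grader score predictions
reprs = rng.normal(size=(3, 8))     # per-grader intermediate features

# 1) Prediction averaging: simplest combination, no parameters to learn.
avg_score = preds.mean()

# 2) Concatenation: stack the representations and apply a (here random)
#    linear map; in practice this map would be trained on holistic scores.
w_concat = rng.normal(size=24)
concat_score = reprs.reshape(-1) @ w_concat

# 3) Attention: score each representation against a query, softmax the
#    scores, pool with the resulting weights, then map to a score.
query = rng.normal(size=8)
logits = reprs @ query
weights = np.exp(logits) / np.exp(logits).sum()
pooled = weights @ reprs
w_out = rng.normal(size=8)
attn_score = pooled @ w_out

print(avg_score)  # 4.5
```

The trade-off the thesis probes is visible even at this scale: averaging ignores the representations entirely, while concatenation and attention let the combiner weight views differently per candidate at the cost of extra trained parameters.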
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level.
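The accent-versus-lexical distinction rests on finding phone substitutions a speaker makes consistently; a minimal sketch of that idea (with invented alignment pairs and an assumed consistency threshold, not the thesis's actual detector) might look like:

```python
from collections import Counter

# Illustrative (canonical phone, realized phone) pairs from aligned
# transcriptions of one speaker's utterances.
alignments = [
    ("th", "t"), ("th", "t"), ("th", "th"),
    ("v", "w"), ("v", "w"), ("v", "v"), ("th", "t"),
]

def accent_tendencies(pairs, min_rate=0.5):
    """Flag substitutions the speaker makes in at least min_rate of the
    opportunities: a systematic (accent) pattern, as opposed to a one-off
    (lexical) error tied to a single word."""
    totals = Counter(canonical for canonical, _ in pairs)
    subs = Counter((c, r) for c, r in pairs if c != r)
    return {
        (c, r): count / totals[c]
        for (c, r), count in subs.items()
        if count / totals[c] >= min_rate
    }

print(accent_tendencies(alignments))
# {('th', 't'): 0.75, ('v', 'w'): 0.666...}: both patterns are consistent
```

Substitutions below the threshold would instead be candidates for lexical errors, to be checked per word rather than per speaker.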
Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab
Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process.
The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values.
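The analysis-by-synthesis loop of the first method can be sketched with a toy genetic algorithm; here the "synthesizer" is an identity stand-in rather than VocalTractLab, so the example only illustrates the selection/mutation loop and the cosine-distance objective, not the real gestural-score optimization:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-in for "synthesize with VTL and extract acoustic features":
# here the parameters ARE the features, so the optimum is the target itself.
def synthesize_features(params):
    return params

target = rng.normal(size=6)      # features of the "natural" utterance
pop = rng.normal(size=(30, 6))   # initial population of parameter vectors

for generation in range(200):
    fitness = np.array(
        [cosine_distance(synthesize_features(p), target) for p in pop]
    )
    order = np.argsort(fitness)
    parents = pop[order[:10]]                        # selection: keep the best
    children = parents[rng.integers(0, 10, size=20)]
    children = children + rng.normal(scale=0.05, size=children.shape)  # mutation
    pop = np.vstack([parents, children])             # elitist next generation

best = pop[np.argmin([cosine_distance(p, target) for p in pop])]
print(cosine_distance(best, target))  # small residual distance
```

In the thesis the candidate vectors are VTL gestural-score parameters, regularized to stay in physiologically reasonable ranges, and the features come from actually synthesizing and analysing audio.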
The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. The neural network regression models were trained, which used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and the vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features.
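The trajectory smoothness regularization mentioned above can be illustrated with a minimal sketch; the exact loss in the thesis may differ, and this is just a first-difference penalty on a frames-by-parameters trajectory matrix:

```python
import numpy as np

def smoothness_loss(trajectory):
    """Penalize frame-to-frame jumps in a predicted articulatory
    trajectory (frames x parameters): sum of squared first differences."""
    return float(np.sum(np.diff(trajectory, axis=0) ** 2))

# Two toy trajectories for one articulatory parameter pair over 3 frames:
smooth = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
jerky  = np.array([[0.0, 0.0], [0.9, 0.0], [0.0, 0.0]])
print(smoothness_loss(smooth))  # ≈ 0.02
print(smoothness_loss(jerky))   # ≈ 1.62: heavily penalized
```

Added to the regression objective with a small weight, such a term steers the network toward the gradual articulator movements that are physiologically plausible, which is the effect the thesis reports.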
The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories better match VTL's articulatory preferences, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to utterances of other languages.
An automated lexical stress classification tool for assessing dysprosody in childhood apraxia of speech
Childhood apraxia of speech (CAS) commonly affects the production of lexical stress contrast in polysyllabic words. Automated classification tools have the potential to increase reliability and efficiency in measuring lexical stress. Here, factors affecting the accuracy of a custom-built deep neural network (DNN)-based classification tool are evaluated. Sixteen children with typical development (TD) and 26 with CAS produced 50 polysyllabic words. Words with strong–weak (SW, e.g., dinosaur) or weak–strong (WS, e.g., banana) stress were fed to the classification tool, and the accuracy was measured (a) against expert judgment, (b) by speaker group, and (c) with/without prior knowledge of phonemic errors in the sample. The influence of segmental features and participant factors on tool accuracy was analysed. Linear mixed modelling showed a significant interaction between group and stress type, surviving adjustment for age and CAS severity. For TD children, agreement for SW and WS words was >80%; for CAS speech, agreement was higher for SW (>80%) than for WS words (~60%). Prior knowledge of segmental errors conferred no clear advantage. Automatic lexical stress classification shows promise for identifying errors in children's speech at diagnosis or with treatment-related change, but accuracy for WS words in apraxic speech needs improvement. Further training of the algorithms using larger sets of labelled data containing impaired speech and WS words may increase accuracy.
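The SW/WS decision itself reduces to comparing syllable prominence; in the paper a DNN learns this from acoustic features, whereas the deliberately minimal sketch below assumes a single scalar prominence score per syllable is already available:

```python
def stress_pattern(syllable_prominence):
    """Classify a two-syllable window as strong-weak (SW) or weak-strong (WS)
    by comparing one prominence score per syllable (e.g. combined duration
    and intensity -- the scoring itself is an illustrative assumption)."""
    first, second = syllable_prominence[0], syllable_prominence[1]
    return "SW" if first > second else "WS"

print(stress_pattern([0.9, 0.4]))  # SW  (e.g. DI-no-saur)
print(stress_pattern([0.3, 0.8]))  # WS  (e.g. ba-NA-na)
```

The paper's WS difficulty fits this framing: in apraxic speech the prominence cues themselves are distorted, so the margin between the two syllables' scores shrinks and the decision becomes unreliable.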
Dealing with linguistic mismatches for automatic speech recognition
Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data is still a significant challenge that remains unsolved. Under the monolingual environment, it is well-known that the performance of ASR systems degrades significantly when presented with the speech from speakers with different accents, dialects, and speaking styles than those encountered during system training. Under the multi-lingual environment, ASR systems trained on a source language achieve even worse performance when tested on another target language because of mismatches in terms of the number of phonemes, lexical ambiguity, and power of phonotactic constraints provided by phone-level n-grams.
In order to address these linguistic mismatches in current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories from acoustics and articulatory phonetics that show the potential to transfer across a dialect continuum, from local dialects to another standardized language, are revisited. Experiments demonstrate the potential of acoustic correlates in the vicinity of landmarks to help bridge mismatches across different local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on the connectionist temporal classification loss and propose to link the training of acoustics and accent together, in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also boosted the accuracy of accent identification in comparison to separately trained models.
Mobile Platform with Dynamic Optimization of the Pattern in Education in Colleges Through the Perspective of Network Informatization
The combination of mobile learning platforms and network informatization offers numerous benefits to learners, educators, and institutions. Learners can take control of their learning journey, accessing educational materials at their convenience and engaging in collaborative learning activities with peers from diverse backgrounds. This paper aims to explore the integration of mobile learning platforms and network informatization, examining their impact on educational practices, learner engagement, and the overall learning experience. Network informatization is assessed and monitored with Dynamic Programming Optimization (DPO) to compute the features in reverse osmosis in English education. The attributes and features of the English language are computed and estimated for periodic information updates within the system. The DPO process is implemented along with a Mamdani fuzzy set for the estimation of features of English education in colleges and universities. The processed information is updated in the mobile learning platform for the computation of English-language features, and classification is performed with a deep learning model. Simulation analysis showed that the constructed model is effective for the estimation and computation of features and patterns in English language teaching in colleges and universities.
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are also actively applied to broadcasting and multimedia processing. A great deal of research has been conducted across a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years attempts have been made to improve image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies for media creation, processing, and editing, and for creating scenarios, are very important areas of research in multimedia processing and engineering. This book contains a collection of topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including computer vision, speech/sound/text processing, and content analysis/information mining.
Data Augmentation Techniques for Robust Audio Analysis
Having large amounts of training data is necessary for the ever more popular neural networks to perform reliably. Data augmentation, i.e. the act of creating additional training data by performing label-preserving transformations for existing training data, is an efficient solution for this problem. While increasing the amount of data, introducing variations to the data via the transformations also has the power to make machine learning models more robust in real life conditions with noisy environments and mismatches between the training and test data.
In this thesis, data augmentation techniques in audio analysis are reviewed, and a tool for audio data augmentation (TADA) is presented. TADA is capable of performing three audio data augmentation techniques, which are convolution with mobile device microphone impulse responses, convolution with room impulse responses, and addition of background noises. TADA is evaluated by using it in a pronunciation error classification task, where typical pronunciation errors of Finnish people uttering English words are classified. All the techniques are tested first individually and then also in combination.
The experiments are executed with both original and augmented data. In all experiments, using TADA improves the performance of the classifier when compared to training with only original data. Robustness against unseen devices and rooms also improves. Additional gain from combined augmentation starts to saturate only after augmenting the training data to 30 times the original amount. Based on the positive impact of TADA on the classification task, data augmentation with convolutional and additive noises is found to be an effective combination for increasing robustness against environmental distortions and channel effects.
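The two convolutional techniques and the additive-noise technique described above can be sketched in a few lines of numpy; the toy impulse response and the SNR handling here are simplified assumptions, not TADA's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def augment(signal, impulse_response, noise, snr_db):
    """Augment a clean signal: convolve with an impulse response (standing in
    for a room or microphone channel), then add background noise scaled to a
    target signal-to-noise ratio in dB."""
    channel = np.convolve(signal, impulse_response)[: len(signal)]
    sig_power = np.mean(channel ** 2)
    noise_power = np.mean(noise[: len(channel)] ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return channel + scale * noise[: len(channel)]

clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s, 220 Hz tone
ir = np.array([1.0, 0.0, 0.3, 0.0, 0.1])                    # toy impulse response
noise = rng.normal(size=16000)
noisy = augment(clean, ir, noise, snr_db=10.0)
print(noisy.shape)  # (16000,)
```

Chaining a device impulse response, a room impulse response, and additive noise in this way, with labels left untouched, is exactly the label-preserving transformation the thesis evaluates.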