133 research outputs found
Arabic digits speech recognition and speaker identification in noisy environment using a hybrid model of VQ and GMM
This paper presents an automatic speaker identification and speech recognition system for Arabic digits in noisy environments. The proposed system identifies a speaker after the speaker's voice has been saved in the database and noise has been added. Mel-frequency cepstral coefficients (MFCC) are used for feature extraction in a program built on the Matlab platform, and vector quantization (VQ) is used to generate the codebooks. Gaussian mixture modelling (GMM) algorithms are used to generate templates for feature matching. We propose a system based on the MFCC-GMM and MFCC-VQ approaches on the one hand and on the hybrid MFCC-VQ-GMM approach on the other for speaker modeling. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels to test the system in a noisy environment. The proposed system gives good results in recognition rate.
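The noise-injection step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' Matlab code: the function names, the synthetic sine-wave "speech", the 16 kHz sample count and the 10 dB target are all assumptions chosen for illustration. The noise variance is derived from the signal power and the requested SNR.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at the given SNR (dB)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) actually obtained, given the clean and noisy signals."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a clean utterance: one second of a 440 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_white_gaussian_noise(clean, snr_db=10)
print(round(measured_snr_db(clean, noisy), 1))
```

Repeating the call with different `snr_db` values would give the "several SNR levels" at which a recognizer could then be evaluated.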
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Master of Science thesis
Presently, speech recognition is gaining worldwide popularity in applications like Google Voice, speech-to-text reporting (speech-to-text transcription, video captioning, real-time transcription), hands-free computing, and video games. Research has been done for several years and many speech recognizers have been built. However, most speech recognizers fail to recognize speech accurately. Consider the well-known application Google Voice, which helps users search the web using voice. Though Google Voice does a good job of transcribing spoken words, it does not accurately recognize words spoken with different accents. Given that several accents are evolving around the world, it is essential to train the speech recognizer to recognize accented speech. Accent classification is defined as the problem of classifying the accents in a given language. This thesis explores various methods to identify accents. We introduce a new concept of clustering windows of a speech signal and learn a distance metric, using a specific distance measure over phonetic strings, to classify the accents. A language structure is incorporated to learn this distance metric. We also show how kernel approximation algorithms help in learning a distance metric.
Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer-
Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning
or speech therapy applications. Existing MDD methods relying on analysing
phonemes can only detect categorical errors of phonemes that have an adequate
amount of training data to be modelled. With the unpredictable nature of the
pronunciation errors of non-native or disordered speakers and the scarcity of
training datasets, it is unfeasible to model all types of mispronunciations.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners from different native languages. The proposed speech attribute MDD
method was further compared to the traditional phoneme-level MDD and achieved a
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) over all speech attributes compared to the
phoneme-level equivalent
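For readers unfamiliar with the metrics named in this abstract, the following sketch computes FAR and FRR under the definitions commonly used in MDD work: FAR is the fraction of mispronounced items that the system accepts as correct, and FRR is the fraction of correctly pronounced items that the system flags as errors. The function name and the toy labels are hypothetical, and the paper's exact scoring protocol may differ.

```python
def far_frr(is_mispronounced, flagged_as_error):
    """Compute (FAR, FRR) from parallel lists of booleans.

    is_mispronounced[i]  -- ground truth: item i was mispronounced
    flagged_as_error[i]  -- system output: item i was flagged as an error
    """
    pairs = list(zip(is_mispronounced, flagged_as_error))
    n_mispronounced = sum(1 for truth, _ in pairs if truth)
    n_correct = len(pairs) - n_mispronounced
    if n_mispronounced == 0 or n_correct == 0:
        raise ValueError("need at least one mispronounced and one correct item")
    # False acceptance: a mispronunciation that the system did not flag.
    false_accepts = sum(1 for truth, flag in pairs if truth and not flag)
    # False rejection: a correct pronunciation that the system flagged.
    false_rejects = sum(1 for truth, flag in pairs if not truth and flag)
    return false_accepts / n_mispronounced, false_rejects / n_correct


# Toy data (assumed): 4 mispronounced items, 6 correctly pronounced items.
truth = [True, True, True, True, False, False, False, False, False, False]
flagged = [True, True, True, False, False, False, False, False, True, False]
far, frr = far_frr(truth, flagged)
print(far, frr)
```

In the attribute-level setting described above, the same computation would be repeated per speech attribute rather than per phoneme.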
Exploring the impact of data poisoning attacks on machine learning model reliability
Recent years have seen the widespread adoption of Artificial Intelligence techniques in several domains, including healthcare, justice, assisted driving and Natural Language Processing (NLP) based applications (e.g., fake news detection). These are just a few examples of domains that are particularly critical and sensitive to the reliability of the adopted machine learning systems. Accordingly, several Artificial Intelligence approaches have been adopted to support easy and reliable solutions aimed at improving early diagnosis, personalized treatment, remote patient monitoring and decision-making, with a consequent reduction of healthcare costs. Recent studies have shown that these techniques are vulnerable to attacks by adversaries at several phases of artificial intelligence development. Poisoned data sets are the most common attack on the reliability of Artificial Intelligence approaches. Noise, for example, can have a significant impact on the overall performance of a machine learning model. This study discusses the strength of the impact of noise on classification algorithms. In detail, the reliability of several machine learning techniques in correctly distinguishing pathological from healthy voices when analysing poisoned data was evaluated. Voice samples selected from an available database widely used in research, the Saarbruecken Voice Database, were processed and analysed to evaluate the resilience and classification accuracy of these techniques. All analyses are evaluated in terms of accuracy, specificity, sensitivity, F1-score and ROC area.
Automatic Dialect and Accent Recognition and its Application to Speech Recognition
A fundamental challenge for current research on speech science and technology is understanding and modeling individual variation in spoken language. Individuals have their own speaking styles, depending on many factors, such as their dialect and accent as well as their socioeconomic background. These individual differences typically introduce modeling difficulties for large-scale speaker-independent systems designed to process input from any variant of a given language. This dissertation focuses on automatically identifying the dialect or accent of a speaker given a sample of their speech, and demonstrates how such a technology can be employed to improve Automatic Speech Recognition (ASR). In this thesis, we describe a variety of approaches that make use of multiple streams of information in the acoustic signal to build a system that recognizes the regional dialect and accent of a speaker. In particular, we examine frame-based acoustic, phonetic, and phonotactic features, as well as high-level prosodic features, comparing generative and discriminative modeling techniques. We first analyze the effectiveness of approaches to language identification that have been successfully employed by that community, applying them here to dialect identification. We next show how we can improve upon these techniques. Finally, we introduce several novel modeling approaches -- Discriminative Phonotactics and kernel-based methods. We test our best performing approach on four broad Arabic dialects, ten Arabic sub-dialects, American English vs. Indian English accents, American English Southern vs. Non-Southern, American dialects at the state level plus Canada, and three Portuguese dialects. Our experiments demonstrate that our novel approach, which relies on the hypothesis that certain phones are realized differently across dialects, achieves new state-of-the-art performance on most dialect recognition tasks. 
This approach achieves an Equal Error Rate (EER) of 4% for four broad Arabic dialects, an EER of 6.3% for American vs. Indian English accents, 14.6% for American English Southern vs. Non-Southern dialects, and 7.9% for three Portuguese dialects. Our framework can also be used to automatically extract linguistic knowledge, specifically the context-dependent phonetic cues that may distinguish one dialect from another. We illustrate the efficacy of our approach by demonstrating the correlation of our results with the geographical proximity of the various dialects. As a final measure of the utility of our studies, we also show that it is possible to improve ASR. Employing our dialect identification system prior to ASR to identify the Levantine Arabic dialect in mixed speech of a variety of dialects allows us to optimize the engine's language model and use Levantine-specific acoustic models where appropriate. This procedure improves the Word Error Rate (WER) for Levantine by 4.6% absolute (9.3% relative). In addition, we demonstrate in this thesis that, using a linguistically motivated pronunciation modeling approach, we can improve the WER of a state-of-the-art ASR system by 2.2% absolute and 11.5% relative on Modern Standard Arabic.
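The WER figures quoted in this abstract distinguish absolute from relative improvement. The sketch below shows how WER is conventionally computed (word-level edit distance divided by reference length) and how the two kinds of reduction relate. The function names, example sentences and the 50%/45% figures are illustrative assumptions, not values from the dissertation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer, improved_wer):
    """Return (absolute, relative) WER reduction."""
    absolute = baseline_wer - improved_wer
    return absolute, absolute / baseline_wer


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Assumed figures: going from 50% to 45% WER is 5 points absolute, 10% relative.
print(wer_reduction(0.50, 0.45))
```

Because relative reduction is the absolute reduction divided by the baseline, the same absolute gain counts for more in relative terms when the baseline WER is already low.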
Spoken Word and Speaker Recognition Using MFCC and Multiple Recurrent Neural Networks
Identification of spoken words and speakers has featured in many studies. A persistent obstacle is the pronunciation of a particular word, and noise makes words harder to identify. Furthermore, every person has different pronunciation habits, which are influenced by several variables such as amplitude, frequency, tempo, and rhythm. This study proposes identifying spoken sounds by using specific word input to determine the patterns of the speaker and the spoken word using Mel-frequency Cepstral Coefficients (MFCC) and multiple Recurrent Neural Networks (RNN). The MFCC coefficients are used as input features for identifying spoken words and speakers using RNN and Long Short-Term Memory (LSTM). The multiple RNNs handle spoken word and speaker in parallel. The multiple-RNN model achieves an accuracy of 87.74%, while single RNNs achieve 80.58%, using Adam on new data. To check the robustness of the model, the experiment applied K-fold cross-validation to the spoken-word and speaker datasets, with an average accuracy of 86.07%, which indicates that the model is able to learn from the dataset without being affected by the order or selection of test data.
Multi-dialect Arabic broadcast speech recognition
Dialectal Arabic speech research suffers from the lack of labelled resources and
standardised orthography. There are three main challenges in dialectal Arabic
speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training
robust dialectal speech recognition models from limited labelled data and (iii)
evaluating speech recognition for dialects with no orthographic rules. This thesis
is concerned with the following three contributions:
Arabic Dialect Identification: We are mainly dealing with Arabic speech
without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently
diverse to the extent that one can argue that they are different languages
rather than dialects of the same language. We have two contributions:
First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected
from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57
hours of high-quality speech, drawn from almost 1,000 hours and covering four major
varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf (Arabian Peninsula)
and North African (Moroccan). Second, we build an Arabic dialect identification
(ADI) system. We explored two main groups of features, namely acoustic
features and linguistic features. For the linguistic features, we look at a wide
range of features, addressing words, characters and phonemes. With respect to
acoustic features, we look at raw features such as mel-frequency cepstral coefficients
combined with shifted delta cepstra (MFCC-SDC), bottleneck features and
the i-vector as a latent variable. We studied both generative and discriminative
classifiers, in addition to deep learning approaches, namely deep neural network
(DNN) and convolutional neural network (CNN). In our work, we propose Arabic
as a five-class dialect challenge comprising the previously mentioned four
dialects as well as Modern Standard Arabic.
Arabic Speech Recognition: We introduce our effort in building Arabic automatic
speech recognition (ASR) and we create an open research community
to advance it. This section has two main goals: First, creating a framework for
Arabic ASR that is publicly available for research. We address our effort in building
two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast
news using more than 1,200 hours of speech and 130M words of text collected
from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre
data with limited non-orthographic speech collected from YouTube, with special
attention paid to transfer learning. Second, building a robust Arabic ASR system
and reporting a competitive word error rate (WER) to use it as a potential
benchmark to advance the state of the art in Arabic ASR. Our overall system is
a combination of five acoustic models (AM): unidirectional long short term memory
(LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN),
TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers
followed by BLSTM layers (TDNN-BLSTM). The AMs are trained as purely
sequence-trained neural networks using lattice-free maximum mutual information (LF-MMI).
The generated lattices are rescored using a four-gram language model
(LM) and a recurrent neural network with maximum entropy (RNNME) LM.
Our official WER is 13%, which is the lowest WER reported on this task.
Evaluation: The third part of the thesis addresses our effort in evaluating dialectal
speech with no orthographic rules. Our methods learn from multiple
transcribers and align the speech hypothesis to overcome the non-orthographic
aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU
score used in machine translation (MT). We have also automated this process
by learning different spelling variants from Twitter data. We mine automatically
from a huge collection of tweets in an unsupervised fashion to build more than
11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal
WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with
no reference transcription using decoding and language features. We show that
our word error rate estimation is robust for many scenarios with and without the
decoding features
Towards automatic assessment of spontaneous spoken English
With increasing global demand for learning English as a second language, there has been considerable interest in
methods of automatic assessment of spoken language proficiency for use in interactive electronic learning tools as
well as for grading candidates for formal qualifications. This paper presents an automatic system to address the
assessment of spontaneous spoken language. Prompts or questions requiring spontaneous speech responses elicit
more natural speech which better reflects a learner’s proficiency level than read speech. In addition to the challenges
of highly variable non-native learner speech and noisy real-world recording conditions, this requires any automatic
system to handle disfluent, non-grammatical, spontaneous speech with the underlying text unknown. To handle these,
a strong deep learning based speech recognition system is applied in combination with a Gaussian Process (GP)
grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy
in the automatic grader. The proposed system is shown to predict grades at a similar level to the original examiner
graders on real candidate entries. Interpolation with the examiner grades further boosts performance. The ability to
reject poorly estimated grades is also important and measures are proposed to evaluate the performance of rejection
schemes. The GP variance is used to decide which automatic grades should be rejected. Back-off to an expert grader
for the least confident grades gives gains.
Cambridge Assessment Englis