360 research outputs found
A hybrid neural network based speech recognition system for pervasive environments
One of the major drawbacks to using speech as the input to a pervasive environment is the need to balance accuracy against high processing overheads. This paper presents an Arabic speech recognition system (called UbiqRec), which addresses this issue by providing a natural and intuitive way of communicating within ubiquitous environments while balancing processing time, memory and recognition accuracy. A hybrid approach has been used which incorporates spectrographic information, singular value decomposition, concurrent self-organizing maps (CSOM) and pitch contours for Arabic phoneme recognition. The approach employs a separate self-organizing map (SOM) for each Arabic phoneme, joined in parallel to form a CSOM. The performance results confirm that with suitable preprocessing of the data, including extraction of distinct power spectral densities (PSD) and singular value decomposition, the training time for the CSOM was reduced by 89%. The empirical results also showed that overall recognition accuracy did not fall below 91%.
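The preprocessing step the abstract describes, compressing power spectral densities with singular value decomposition before training, can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the frame count, bin count, and retained rank are assumptions.

```python
import numpy as np

def reduce_features(psd_frames, k=8):
    """Compress a matrix of per-frame power spectral densities
    (frames x frequency bins) to k dimensions via truncated SVD.
    A sketch of the kind of preprocessing the abstract describes;
    shapes and rank here are illustrative assumptions."""
    U, s, Vt = np.linalg.svd(psd_frames, full_matrices=False)
    return U[:, :k] * s[:k]  # k-dimensional feature per frame

rng = np.random.default_rng(0)
psd = rng.random((100, 129))           # 100 frames, 129 PSD bins (assumed)
compressed = reduce_features(psd, k=8)
print(compressed.shape)                # (100, 8)
```

Training each per-phoneme SOM on these low-rank features rather than raw spectra is what would account for the reported reduction in training time.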
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) has recently become an important challenge
for deep learning (DL). It requires large-scale training datasets and high
computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data come
from the same domain, with the same input feature space and data distribution
characteristics. This assumption, however, does not hold in some real-world
artificial intelligence (AI) applications. There are also situations where real
data are challenging, expensive, or rare to gather, so that the data
requirements of DL models cannot be met. Deep transfer learning (DTL) has been
introduced to overcome these issues: it helps develop high-performing models
from real datasets that are small or slightly different from, but related to,
the training data. This paper presents a comprehensive survey of DTL-based ASR
frameworks to shed light on the latest developments and to help academics and
professionals understand current challenges. Specifically, after presenting the
DTL background, a well-designed taxonomy is adopted to organize the state of
the art. A critical analysis is then conducted to identify the limitations and
advantages of each framework. A comparative study then highlights the current
challenges before deriving opportunities for future research.
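The core DTL idea the survey covers, reusing a model trained on one domain when target-domain data are scarce, can be sketched minimally: freeze a pretrained feature extractor and train only a small new head on the target data. Everything here is a toy stand-in (the "pretrained" extractor is a fixed random projection, the dataset is synthetic), not any framework from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained, frozen feature extractor (hypothetical: in a
# real DTL setup this would be the lower layers of a source-domain ASR
# model, kept fixed while only the new head is trained).
W_frozen = rng.standard_normal((20, 5))
def extract(x):
    return np.tanh(x @ W_frozen)  # frozen features: never updated

# Small target-domain dataset (toy, two classes).
X = rng.standard_normal((40, 20))
y = (X[:, 0] > 0).astype(float)
F = extract(X)

# Train only a new logistic-regression head on the frozen features.
w, b = np.zeros(5), 0.0
losses = []
for _ in range(500):
    p = 1 / (1 + np.exp(-(F @ w + b)))
    losses.append(-np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
    w -= 0.5 * (F.T @ (p - y) / len(y))  # cross-entropy gradient step
    b -= 0.5 * np.mean(p - y)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because only the head's few parameters are updated, such a setup can fit a small target dataset without the large-scale data the full model would need, which is the motivation the survey describes.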
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
Central Kurdish Automatic Speech Recognition using Deep Learning
Automatic Speech Recognition (ASR), an active field of speech processing, is nowadays used in real applications implemented with a variety of techniques, among which artificial neural networks are the most popular. Increasing performance and making these systems robust to noise are among the current challenges. This paper addresses the development of an ASR system for the Central Kurdish language (CKB) using transfer learning with Deep Neural Networks (DNNs). Mel-Frequency Cepstral Coefficients (MFCCs) are used to extract features from the speech signal, and a Long Short-Term Memory (LSTM) network with a Connectionist Temporal Classification (CTC) output layer forms the Acoustic Model (AM), trained on the AsoSoft CKB speech dataset. In addition, an N-gram language model is built on a large collected text dataset of about 300 million tokens. The text corpus is also used to extract a dynamic lexicon model that contains over 2.5 million CKB words. The obtained results show that the DNN improves on classical statistical models. The proposed method achieves a 0.22% word error rate by combining transfer learning and language model adaptation, which is superior to the best previously reported result for CKB.
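A CTC output layer, as used in this acoustic model, emits one label distribution per frame; the standard way to turn the frame-wise best path into a transcript is to merge repeated labels and drop blanks. The sketch below shows that greedy (best-path) decoding step with a toy alphabet; it illustrates the general CTC convention, not the paper's actual implementation.

```python
import numpy as np

BLANK = 0  # CTC blank index (a common convention; assumed here)

def ctc_greedy_decode(logits, id_to_char):
    """Collapse the frame-wise argmax path: merge repeated labels,
    then remove blanks. A standard CTC best-path decoder sketch."""
    path = np.argmax(logits, axis=1)   # best label per frame
    out, prev = [], BLANK
    for p in path:
        if p != prev and p != BLANK:   # new non-blank label starts
            out.append(id_to_char[p])
        prev = p
    return "".join(out)

# Toy frame-wise scores over {blank, 'a', 'b'} (illustrative only).
logits = np.array([
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # 'b'
])
print(ctc_greedy_decode(logits, {1: "a", 2: "b"}))  # -> ab
```

The blank label is what lets CTC distinguish a genuine double letter (label, blank, same label) from one label held over several frames, which is why the blank is removed only after repeats are merged.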
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that allow data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.
Understanding the phonetics of neutralisation: a variability-field account of vowel/zero alternations in a Hijazi dialect of Arabic
This thesis throws new light on issues debated in the experimental literature on
neutralisation. They concern the extent of phonetic merger (the completeness
question) and the empirical validity of the phonetic effect (the genuineness
question). Regarding the completeness question, I present acoustic and perceptual
analyses of vowel/zero alternations in Bedouin Hijazi Arabic (BHA) that appear to
result in neutralisation. The phonology of these alternations exemplifies two
neutralisation scenarios bearing on the completeness question. Until now, these
scenarios have been investigated separately within small-scale studies. Here I look
more closely at both, testing hypotheses involving the acoustics-perception
relation and the phonetics-phonology relation.
I then discuss the genuineness question from an experimental and statistical
perspective. Experimentally, I devise a paradigm that manipulates important
variables claimed to influence the phonetics of neutralisation. Statistically, I reanalyse
neutralisation data reported in the literature from Turkish and Polish. I
apply different pre-analysis procedures which, I argue, can partly explain the
mixed results in the literature.
My inquiry into these issues leads me to challenge some of the discipline’s
accepted standards for characterising the phonetics of neutralisation. My
assessment draws on insights from different research fields including statistics,
cognition, neurology, and psychophysics. I suggest alternative measures that are
both cognitively and phonetically more plausible. I implement these within a new
model of lexical representation and phonetic processing, the Variability Field
Model (VFM). In the VFM, phonetic data are examined as intervals based on
just-noticeable differences (jnds) rather than as single data points. This
allows for a deeper understanding
of phonetic variability. The model combines prototypical and episodic schemes
and integrates linguistic, paralinguistic, and extra-linguistic effects. The thesis also
offers a VFM-based analysis of a set of neutralisation data from BHA.
In striving for a better understanding of the phonetics of neutralisation, the thesis
raises important issues pertaining to the way we approach phonetic questions,
generate and analyse data, and interpret and evaluate findings.
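The interval idea at the heart of the VFM can be illustrated in miniature: treat each measurement as a jnd-wide interval and count two values as distinct only if their intervals do not overlap. This is a deliberately simplified illustration of the general principle; the thesis's actual model is richer, and the durations and jnd value below are hypothetical.

```python
def as_interval(x, jnd):
    """Represent a phonetic measurement as a jnd-wide interval
    rather than a single point (illustrating the general idea only)."""
    return (x - jnd / 2, x + jnd / 2)

def perceptually_distinct(a, b, jnd):
    """Two measurements count as distinct only if their
    jnd-based intervals do not overlap."""
    lo_a, hi_a = as_interval(a, jnd)
    lo_b, hi_b = as_interval(b, jnd)
    return hi_a < lo_b or hi_b < lo_a

# Hypothetical vowel durations (ms) with an assumed 10 ms jnd:
print(perceptually_distinct(84.0, 90.0, 10.0))   # False: intervals overlap
print(perceptually_distinct(84.0, 100.0, 10.0))  # True: clearly separated
```

Under such a scheme, a small acoustic difference between two categories need not count as incomplete neutralisation if it falls within a single jnd, which is the kind of reassessment of standard measures the abstract describes.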