
    A hybrid neural network based speech recognition system for pervasive environments

    One of the major drawbacks to using speech as the input to any pervasive environment is the need to balance recognition accuracy against high processing overheads. This paper presents an Arabic speech recognition system, UbiqRec, which addresses this issue by providing a natural and intuitive way of communicating within ubiquitous environments while balancing processing time, memory, and recognition accuracy. A hybrid approach is used that combines spectrographic information, singular value decomposition, concurrent self-organizing maps (CSOM), and pitch contours for Arabic phoneme recognition. The approach trains a separate self-organizing map (SOM) for each Arabic phoneme and joins them in parallel to form a CSOM. The performance results confirm that with suitable preprocessing of the data, including extraction of distinct power spectral densities (PSD) and singular value decomposition, the training time for the CSOM was reduced by 89%. The empirical results also show that overall recognition accuracy did not fall below 91%.
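    The SVD-based preprocessing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the random stand-in for PSD values, and the choice of k = 8 components are all assumptions made for the example.

```python
import numpy as np

def reduce_psd_features(psd_matrix, k=8):
    """Compress a (frames x freq-bins) PSD matrix to k singular components.

    Hypothetical sketch of the abstract's idea: truncated SVD reduces the
    dimensionality of spectrographic features before SOM training, which
    is what shrinks the training time.
    """
    # full_matrices=False gives the economy-size decomposition
    U, s, Vt = np.linalg.svd(psd_matrix, full_matrices=False)
    # Keep only the k strongest components: (frames x k) compact features
    return U[:, :k] * s[:k]

# Toy example: 100 frames, 129 frequency bins of stand-in PSD values
rng = np.random.default_rng(0)
psd = np.abs(rng.standard_normal((100, 129))) ** 2
compact = reduce_psd_features(psd, k=8)
print(compact.shape)  # (100, 8)
```

    Each frame is thus represented by 8 numbers instead of 129 before being fed to the per-phoneme SOMs.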

    Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

    Automatic speech recognition (ASR) has recently become an important challenge when using deep learning (DL): it requires large-scale training datasets and high computational and storage resources. Moreover, DL techniques, and machine learning (ML) approaches in general, assume that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, does not hold in some real-world artificial intelligence (AI) applications. There are also situations where gathering real data is challenging, expensive, or the events of interest occur rarely, so the data requirements of DL models cannot be met. Deep transfer learning (DTL) has been introduced to overcome these issues; it helps develop high-performing models from real datasets that are small or slightly different from, but related to, the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and to help academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to organize the state of the art. A critical analysis is then conducted to identify the limitations and advantages of each framework. Finally, a comparative study highlights the current challenges before deriving opportunities for future research.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Central Kurdish Automatic Speech Recognition using Deep Learning

    Automatic Speech Recognition (ASR), an active field of speech processing, is nowadays used in real applications implemented with various techniques, among which artificial neural networks are the most popular. Increasing performance and making these systems robust to noise are among the current challenges. This paper addresses the development of an ASR system for the Central Kurdish language (CKB) using transfer learning with Deep Neural Networks (DNNs). Mel-Frequency Cepstral Coefficients (MFCCs) for extracting features from the speech signal, combined with a Long Short-Term Memory (LSTM) network and a Connectionist Temporal Classification (CTC) output layer, are used to create an Acoustic Model (AM) on the AsoSoft CKB speech dataset. We also use an N-gram language model built on a large collected text dataset of about 300 million tokens. The text corpus is further used to extract a dynamic lexicon model that contains over 2.5 million CKB words. The obtained results show that the DNN improves on classical statistical models. The proposed method achieves a 0.22% word error rate by combining transfer learning and language model adaptation, which is superior to the best previously reported result for CKB.
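    The CTC output layer mentioned above emits one label per frame, including a special "blank" symbol, and the final transcript is recovered by collapsing that sequence. A minimal, self-contained sketch of the standard collapse rule (the function name and the choice of 0 as the blank index are assumptions for illustration, not details from the paper):

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse per-frame label indices the CTC way:
    first merge consecutive repeats, then drop blanks.
    Blanks separate genuine repeated phonemes."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# blank(0) between the two 3s keeps them as two phonemes;
# the repeated 5s merge into one.
print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

    In a full system the per-frame labels come from the LSTM's softmax outputs (greedy decoding), and the N-gram language model re-scores candidate collapses rather than taking the single best path.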

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling, and language modeling. On the application side, the thesis also includes research on non-native and code-switching speech.

    Understanding the phonetics of neutralisation: a variability-field account of vowel/zero alternations in a Hijazi dialect of Arabic

    This thesis throws new light on issues debated in the experimental literature on neutralisation. They concern the extent of phonetic merger (the completeness question) and the empirical validity of the phonetic effect (the genuineness question). Regarding the completeness question, I present acoustic and perceptual analyses of vowel/zero alternations in Bedouin Hijazi Arabic (BHA) that appear to result in neutralisation. The phonology of these alternations exemplifies two neutralisation scenarios bearing on the completeness question. Until now, these scenarios have been investigated separately within small-scale studies. Here I look more closely at both, testing hypotheses involving the acoustics-perception relation and the phonetics-phonology relation. I then discuss the genuineness question from an experimental and statistical perspective. Experimentally, I devise a paradigm that manipulates important variables claimed to influence the phonetics of neutralisation. Statistically, I reanalyse neutralisation data reported in the literature from Turkish and Polish. I apply different pre-analysis procedures which, I argue, can partly explain the mixed results in the literature. My inquiry into these issues leads me to challenge some of the discipline’s accepted standards for characterising the phonetics of neutralisation. My assessment draws on insights from different research fields including statistics, cognition, neurology, and psychophysics. I suggest alternative measures that are both cognitively and phonetically more plausible. I implement these within a new model of lexical representation and phonetic processing, the Variability Field Model (VFM). According to VFM, phonetic data are examined as jnd-based intervals rather than as single data points. This allows for a deeper understanding of phonetic variability. The model combines prototypical and episodic schemes and integrates linguistic, paralinguistic, and extra-linguistic effects. 
The thesis also offers a VFM-based analysis of a set of neutralisation data from BHA. In striving for a better understanding of the phonetics of neutralisation, the thesis raises important issues pertaining to the way we approach phonetic questions, generate and analyse data, and interpret and evaluate findings.