635 research outputs found

    Arabic digits speech recognition and speaker identification in noisy environment using a hybrid model of VQ and GMM

    Get PDF
    This paper presents automatic speaker identification and speech recognition for Arabic digits in a noisy environment. The proposed system identifies a speaker after his voice has been stored in the database and noise has been added. Mel-frequency cepstral coefficients (MFCC) are used for feature extraction in a program built on the Matlab platform, and vector quantization (VQ) is used to generate the codebooks. Gaussian mixture modelling (GMM) algorithms generate the templates used for feature matching. We propose systems based on the MFCC-GMM and MFCC-VQ approaches on the one hand, and on the hybrid MFCC-VQ-GMM approach on the other, for speaker modeling. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels to test the system in a noisy environment. The proposed system achieves good recognition rates.
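    The noise-mixing step this abstract describes (white Gaussian noise added to clean speech at a chosen SNR) can be sketched as follows; the function name and pure-Python implementation are illustrative, not taken from the paper:

    ```python
    import math
    import random

    def add_noise_at_snr(signal, snr_db):
        """Add white Gaussian noise to a clean signal at a target SNR (dB).

        The noise power is scaled so that
        10 * log10(P_signal / P_noise) == snr_db.
        """
        p_signal = sum(s * s for s in signal) / len(signal)
        p_noise = p_signal / (10 ** (snr_db / 10.0))
        sigma = math.sqrt(p_noise)
        return [s + random.gauss(0.0, sigma) for s in signal]

    # Example: a 100 Hz tone sampled at 8 kHz, corrupted at 10 dB SNR.
    clean = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
    noisy = add_noise_at_snr(clean, snr_db=10.0)
    ```

    Sweeping `snr_db` over several values reproduces the kind of test condition the paper uses to probe robustness.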

    Saudi Accented Arabic Voice Bank

    Get PDF
    The aim of this paper is to present an Arabic speech database that represents Arabic native speakers from all the cities of Saudi Arabia. The database is called the Saudi Accented Arabic Voice Bank (SAAVB). Preparing the prompt sheets, selecting the right speakers and transcribing their speech were some of the challenges the project team faced; the procedures that met these challenges are highlighted. SAAVB consists of 1033 speakers speaking Modern Standard Arabic with a Saudi accent. The SAAVB content is analyzed and the results are illustrated. The content was verified internally and externally by IBM Cairo and can be used to train speech engines such as automatic speech recognition and speaker verification systems.

    Automatic Gender Detection Based on Characteristics of Vocal Folds for Mobile Healthcare System

    Get PDF
    Automatic gender detection may be useful in a mobile healthcare system. For example, some pathologies, such as vocal fold cysts, occur mainly in female patients. If an automatic gender-detection method is embedded into the system, it is easier for a healthcare professional to assess the patient and prescribe appropriate medication. In the human voice production system, the contribution of the vocal folds is vital. Vocal fold length is gender dependent: a male speaker has longer vocal folds than a female speaker. Because of the longer vocal folds, a male voice is heavier and carries more intensity. Based on this idea, a new time-domain acoustic feature for automatic gender detection is proposed in this paper. The proposed feature measures voice intensity by calculating the area under a modified voice contour to differentiate between males and females. Two different databases are used to show that the proposed feature is independent of text, spoken language, dialect region, recording system, and environment. The obtained accuracies for clean and noisy speech are 98.27% and 96.55%, respectively.
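    The idea of an area-under-the-contour intensity feature can be sketched in a few lines. This is a hypothetical reading of the abstract, using a simple per-frame mean-absolute-amplitude contour and an arbitrary decision threshold, not the paper's actual "modified voice contour":

    ```python
    def contour_area(samples, frame_len=160):
        """Approximate the area under a voice contour built from
        per-frame mean absolute amplitude (a stand-in for the paper's
        'modified voice contour')."""
        contour = []
        for i in range(0, len(samples) - frame_len + 1, frame_len):
            frame = samples[i:i + frame_len]
            contour.append(sum(abs(x) for x in frame) / frame_len)
        # Trapezoidal area under the contour (frame index as the x-axis).
        return sum((contour[i] + contour[i + 1]) / 2.0
                   for i in range(len(contour) - 1))

    def classify_gender(samples, threshold):
        """Per the abstract's premise, male voices carry more intensity,
        hence a larger area under the contour."""
        return "male" if contour_area(samples) > threshold else "female"
    ```

    In practice the threshold would be learned from labelled training data rather than fixed by hand.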

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Recent Trends in Computational Intelligence

    Get PDF
    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired technologies such as swarm intelligence, as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book discusses the use of CI for optimally solving various applications, demonstrating its wide reach and relevance. Combining optimization methods and data mining strategies yields a strong and reliable prediction tool for handling real-life applications.

    Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN

    Full text link
    Call centers have huge amounts of audio data that can yield valuable business insights, and transcribing phone calls manually is a tedious task. An effective automatic speech recognition (ASR) system can accurately transcribe these calls, making call history searchable by context and content, enabling automatic call monitoring, and improving quality of service through keyword search and sentiment analysis. ASR for call centers requires extra robustness, as telephonic environments are generally noisy. Moreover, many low-resourced languages on the verge of extinction could be preserved with the help of automatic speech recognition technology. Urdu is the 10th most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it remains a resource-constrained language in ASR. Regional call-center conversations are conducted in the local language with a mix of English numbers and technical terms, generally causing a "code-switching" problem. Hence, this paper describes an implementation framework for a resource-efficient automatic speech recognition / speech-to-text system in a noisy call-center environment, using a chain hybrid HMM and CNN-TDNN for code-switched Urdu. The hybrid HMM-DNN approach allowed us to exploit the advantages of neural networks with less labelled data. Adding a CNN to the TDNN has been shown to work better in noisy environments, because the CNN's additional frequency dimension captures extra information from noisy speech, improving accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content, drawing on Urdu as well as commonly used words from other languages, primarily English, and achieved a WER of 5.2% in both noisy and clean environments, on isolated words or numbers as well as on continuous spontaneous speech. Comment: 32 pages, 19 figures, 2 tables, preprint
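    The 5.2% figure above is a word error rate (WER). For reference, WER is conventionally computed as a word-level Levenshtein edit distance normalized by the reference length; this is a standard sketch, not the authors' evaluation code:

    ```python
    def wer(reference, hypothesis):
        """Word error rate: (substitutions + deletions + insertions) / N,
        computed as Levenshtein distance over word tokens."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j].
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)
    ```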

    Constructing and Norming Arabic Screening Tool of Auditory Processing Disorders: Evaluation in a Group of Children at Risk for Learning Disability

    Get PDF
    The purposes of this study were to develop and provide normative data for an Arabic tool for screening children with auditory processing disorders: an Arabic version of the Adaptive Auditory Speech Test (AAST) in quiet, for screening peripheral hearing in dB SPL; an Arabic AAST in binaural noise, for screening temporal interaction deficits (listening to speech in binaural noise, in dB SNR); and the teetaatoo test, with five subtests, for screening the ability to identify Modern Standard Arabic phonemes. Participants were 338 children aged 5 to 7 years (138 males, 200 females; mean age = 6.08 years, standard deviation = 0.8) recruited from a regular nursery school, the Baroot Summer Club in Beni-Suef, Egypt. Based on the calculated AAST-in-quiet norms and on a meeting with the children's nursery-school teachers, 129 children were retained with no hearing loss and negative histories of neurological disorders, head trauma or surgery, dizziness, and attention deficit disorder/attention deficit hyperactivity disorder. These 129 children were screened for listening in binaural noise using the Arabic AAST in binaural noise; the remaining 94 children (35 children could not complete the testing) were then screened for phoneme identification using the five teetaatoo subtests. For the AAST in quiet, 21 to 33 dB SPL is the normal range of peripheral hearing. For the AAST in binaural noise, there are three different norms: -9 to -13 dB SNR is the normal range for children aged 5 years, -10 to -13 dB SNR for children aged 6 years, and -10 to -14 dB SNR for children aged 7 years.
    Finally, for the five teetaatoo subtests, the normal percentages of correct answers are: >85% for Cons-A, >62% for Cons-B1, >76% for Cons-B2, >63% for Cons-B3, and >84% for Vow-A. According to these norms, 23 children (17.8% of the whole sample, N = 129) with a normal speech recognition threshold scored abnormally on speech listening in binaural noise (AAST in bin-noise) or on at least one teetaatoo subtest, and were considered at risk for learning disability because of their scores on a SIFTER.
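    The age-specific AAST-in-noise norms reported above can be encoded as a simple lookup. The function below is an illustrative reading of the published ranges, not the study's scoring software:

    ```python
    # Age-dependent normal ranges for the Arabic AAST in binaural noise
    # (dB SNR), as reported in the study; tuples are (best, worst) bounds.
    AAST_NOISE_NORMS = {5: (-13, -9), 6: (-13, -10), 7: (-14, -10)}

    def within_norm(age_years, score_db_snr):
        """True if a child's binaural-noise threshold falls inside the
        age-appropriate normal range."""
        lo, hi = AAST_NOISE_NORMS[age_years]
        return lo <= score_db_snr <= hi
    ```

    A score falling outside the age-appropriate range would flag the child for further assessment.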

    A categorization of robust speech processing datasets

    Get PDF
    Speech and audio signal processing research is a tale of data collection efforts and evaluation campaigns. While large datasets for automatic speech recognition (ASR) in clean environments with various speaking styles are available, the landscape is not as picture-perfect when it comes to robust ASR in realistic environments, much less so for evaluation of source separation and speech enhancement methods. Many data collection efforts have been conducted, moving along towards more and more realistic conditions, each making different compromises between mostly antagonistic factors: financial and human cost; amount of collected data; availability and quality of annotations and ground truth; naturalness of mixing conditions; naturalness of speech content and speaking style; naturalness of the background noise; etc. In order to better understand what directions need to be explored to build datasets that best support the development and evaluation of algorithms for recognition, separation or localization that can be used in real-world applications, we present here a study of existing datasets in terms of their key attributes.

    Faked Speech Detection with Zero Knowledge

    Full text link
    Audio is one of the most common modes of human communication, but at the same time it can easily be misused to deceive people. With the AI revolution, the related technologies are now accessible to almost everyone, making it simple for criminals to commit crimes and forgeries. In this work, we introduce a neural network method to develop a classifier that blindly classifies an input audio clip as real or mimicked; 'blindly' refers to the ability to detect mimicked audio without references or real sources. The proposed model was trained on a set of important features extracted from a large dataset of audio clips, yielding a classifier that was tested on the same set of features from different audio clips. The data was extracted from two raw datasets composed especially for this work: an all-English dataset and a mixed (Arabic plus English) dataset. These datasets have been made available in raw form through GitHub for the use of the research community at https://github.com/SaSs7/Dataset. For comparison, the audio clips were also classified by human inspection, with native speakers as subjects. The ensuing results were interesting and exhibited formidable accuracy. Comment: 14 pages, 4 figures (6 if you count subfigures), 2 tables