606 research outputs found

    Evaluation of preprocessors for neural network speaker verification


    Study of Speaker Verification Methods

    Speaker verification is the process of accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is claimed. In speaker verification, a person makes an identity claim. There are two main stages in this technique: feature extraction and feature matching. Feature extraction is the process of extracting useful data that can later be used to represent the speaker. Feature matching identifies the unknown speaker by comparing the features extracted from the voice with the enrolled voices of known speakers.
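
    To make the two stages concrete, here is a minimal sketch of the pipeline this abstract describes: MFCC-based feature extraction followed by a cosine-similarity feature match against an enrolled template. It assumes librosa is available; the mean-MFCC "embedding", the file names, and the 0.85 threshold are illustrative choices, not details from the paper.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

def extract_features(wav_path, n_mfcc=13):
    """Feature extraction: represent an utterance by its mean MFCC vector."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.mean(axis=1)

def verify(claim_wav, enrolled_template, threshold=0.85):
    """Feature matching: accept the claim if the test utterance is close
    enough (cosine similarity) to the claimed speaker's enrolled reference."""
    test = extract_features(claim_wav)
    score = np.dot(test, enrolled_template) / (
        np.linalg.norm(test) * np.linalg.norm(enrolled_template))
    return score >= threshold  # True = accept, False = reject

# enrolled = extract_features("enrolment_utterance.wav")  # hypothetical file
# accepted = verify("claimed_utterance.wav", enrolled)
```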

    Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech

    In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and a GMM-UBM-based SV system. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV system has a 0.4% EER. When the system is tested with synthetic speech generated from speaker models derived from the WSJ corpus, 90% of the matched claims are accepted. This result suggests a possible vulnerability of SV systems to synthetic speech. In order to detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), the dynamic time warping (DTW) distance of mel-frequency cepstral coefficients (MFCCs), and the previously proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers.
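
    The 0.4% EER quoted above is the operating point at which the false-acceptance and false-rejection rates are equal. Below is a small sketch of how an EER can be computed from verification scores; the Gaussian score distributions are synthetic stand-ins, not data from the paper.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Return the error rate at the threshold where false acceptance
    (impostors passed) and false rejection (genuine claims refused) meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_far, best_frr = 1.0, 0.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # false acceptance rate
        frr = np.mean(genuine_scores < t)    # false rejection rate
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2.0

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)    # scores for true-speaker trials
impostor = rng.normal(-2.0, 1.0, 1000)  # scores for impostor trials
print(f"EER: {compute_eer(genuine, impostor):.2%}")
```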

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    The effects of child language development on the performance of automatic speech recognition

    In comparison to adults', children's ASR appears to be more challenging and yields inferior results. It has been suggested that to address this issue, linguistic understanding of children's speech development needs to be employed to provide either a solution or an explanation. The present work aims to explore the influence of phonological effects associated with language acquisition (PEALA) in children's ASR and to investigate whether they can be detected in systematic patterns of ASR phone confusion errors or evidenced in systematic patterns of acoustic feature structure. Findings from speech development research are used as the framework upon which a set of predictable error patterns is defined, guiding the analysis of the experimental results reported. Several ASR experiments are conducted involving both children's and adults' speech. ASR phone confusion matrices are extracted and analysed using a statistical significance test proposed for the purposes of this work. A mathematical model is introduced to interpret the emerging results. Additionally, bottleneck features and i-vectors representing the acoustic features in one of the systems developed are extracted and visualised using linear discriminant analysis (LDA). A qualitative analysis is conducted with reference to patterns that can be predicted through PEALA.
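
    A phone confusion matrix of the kind analysed here tabulates how often each reference phone is recognised as each other phone. The sketch below shows one way to build such a matrix from aligned reference/hypothesis phone pairs; the example pairs (/r/ realised as /w/, /s/ as /th/) are invented illustrations of substitutions PEALA would predict, not results from this work.

```python
from collections import Counter

def confusion_matrix(aligned_pairs):
    """Count (reference_phone, recognised_phone) pairs produced by
    aligning the ASR hypothesis against the reference transcription."""
    counts = Counter(aligned_pairs)
    phones = sorted({p for pair in aligned_pairs for p in pair})
    return {ref: {hyp: counts[(ref, hyp)] for hyp in phones} for ref in phones}

# Toy aligned pairs; in practice these come from forced alignment of ASR output.
pairs = [("r", "w"), ("r", "w"), ("r", "r"), ("s", "th"), ("s", "s")]
print(confusion_matrix(pairs)["r"])  # {'r': 1, 's': 0, 'th': 0, 'w': 2}
```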

    Asynchronous factorisation of speaker and background with feature transforms in speech recognition

    This paper presents a novel approach to separating the effects of speaker and background conditions through feature-transform-based adaptation for Automatic Speech Recognition (ASR). So far, factorisation has been shown to yield improvements in the case of utterance-synchronous environments. In this paper we show successful separation of conditions asynchronous with speech, such as background music. Our work takes account of the asynchronous nature of the background by estimating condition-specific Constrained Maximum Likelihood Linear Regression (CMLLR) transforms. In addition, speaker adaptation is performed, allowing speaker and background effects to be factorised. Equally, background transforms are used asynchronously in the decoding process, using a modified Hidden Markov Model (HMM) topology which applies the optimal transform for each frame. Experimental results are presented on the WSJCAM0 corpus of British English speech, modified to contain controlled sections of background music. This addition of music degrades the baseline Word Error Rate (WER) from 10.1% to 26.4%. While synchronous factorisation with CMLLR transforms provides a 28% relative improvement in WER over the baseline, our asynchronous approach increases this reduction to 33%.
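
    The asynchronous idea is that each frame is decoded under the affine (CMLLR-style) feature transform of its own background condition, rather than a single transform per utterance. The sketch below illustrates the per-frame transform selection; the toy transforms and frame-level condition labels are placeholders, since in the paper the choice of transform is made inside a modified HMM topology during decoding.

```python
import numpy as np

def apply_asynchronous_transforms(features, conditions, transforms):
    """features: (T, d) feature frames; conditions: length-T condition labels;
    transforms: {label: (A, b)} affine transforms, one per background condition."""
    out = np.empty_like(features)
    for t, (x, label) in enumerate(zip(features, conditions)):
        A, b = transforms[label]
        out[t] = A @ x + b  # each frame gets its own condition's transform
    return out

d = 13  # e.g. MFCC dimensionality
transforms = {
    "clean": (np.eye(d), np.zeros(d)),             # identity: leave clean frames alone
    "music": (0.9 * np.eye(d), 0.1 * np.ones(d)),  # toy background-music transform
}
feats = np.random.randn(6, d)
conds = ["clean", "clean", "music", "music", "music", "clean"]
print(apply_asynchronous_transforms(feats, conds, transforms).shape)  # (6, 13)
```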

    A Framework For Enhancing Speaker Age And Gender Classification By Using A New Feature Set And Deep Neural Network Architectures

    Speaker age and gender classification is one of the most challenging problems in speech processing. With developing technologies, identifying a speaker's age and gender has become a necessity for speaker verification and identification systems, with applications such as identifying suspects in criminal cases, improving human-machine interaction, and adapting music for people waiting in a queue. Although many studies have focused on feature extraction and classifier design, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design an in-depth classifier. Age and gender information is concealed in a speaker's speech, which is affected by many factors such as background noise, speech content, and phonetic divergence. In this work, different methods are proposed to enhance speaker age and gender classification using deep neural networks (DNNs) as both feature extractor and classifier. First, a model for generating new features from a DNN is proposed. The proposed method uses the Hidden Markov Model Toolkit (HTK) to find tied-state triphones for all utterances, which are used as labels for the output layer of the DNN. The DNN with a bottleneck layer is trained in an unsupervised manner to initialise the weights between layers, then trained and tuned in a supervised manner to generate transformed mel-frequency cepstral coefficients (T-MFCCs). Second, a shared class labels method is introduced among misclassified classes to regularize the weights in the DNN. Third, DNN-based speaker models using the SDC feature set are proposed. The speaker-aware model can capture the characteristics of speaker age and gender more effectively than a model that represents a group of speakers. In addition, the AGender-Tune system is proposed to classify speaker age and gender by jointly fine-tuning two DNN models: the first pre-trained to classify speaker age, and the second pre-trained to classify speaker gender. Moreover, the new T-MFCC feature set is used as the input of a fusion of two systems, a DNN-based class model and a DNN-based speaker model; utilizing the T-MFCCs as input and fusing the final score with the score of the DNN-based class model enhanced the classification accuracies. Finally, the DNN-based speaker models are embedded into the AGender-Tune system to exploit the advantages of each method for better speaker age and gender classification. The experimental results on a public challenging database showed the effectiveness of the proposed methods for enhancing speaker age and gender classification, achieving the state of the art on this database.
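
    A rough sketch of the bottleneck feature extractor described above: a DNN trained to predict tied-state triphone labels, whose narrow hidden layer is read out after training as the transformed features (the T-MFCCs). It is written in PyTorch; the layer sizes, the number of triphone targets, and the input dimensionality are assumptions, and the unsupervised pre-training stage is omitted.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """DNN classifier over tied-state triphone labels with a narrow
    bottleneck layer whose activations serve as transformed features."""
    def __init__(self, n_input=39, bottleneck=40, n_triphone_states=2000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_input, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck), nn.ReLU(),  # the bottleneck layer
        )
        self.head = nn.Sequential(
            nn.Linear(bottleneck, 1024), nn.ReLU(),
            nn.Linear(1024, n_triphone_states),  # tied-state triphone targets
        )

    def forward(self, x):
        return self.head(self.front(x))

    def transformed_features(self, x):
        """After supervised training, drop the classifier head and use the
        bottleneck activations as the new feature set."""
        with torch.no_grad():
            return self.front(x)

model = BottleneckDNN()
frames = torch.randn(8, 39)  # a batch of MFCC frames (39-dim assumed)
print(model.transformed_features(frames).shape)  # torch.Size([8, 40])
```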