267 research outputs found

    Evaluation of preprocessors for neural network speaker verification

    Get PDF

    Speech and crosstalk detection in multichannel audio

    Get PDF
    The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, speech from both the microphone's wearer (local speech) and from other participants (crosstalk) is received. The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features were considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, "fundamentalness," and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation

    Text-Independent Automatic Speaker Identification Using Partitioned Neural Networks

    Get PDF
    This dissertation introduces a binary partitioned approach to statistical pattern classification which is applied to talker identification using neural networks. In recent years artificial neural networks have been shown to work exceptionally well for small but difficult pattern classification tasks. However, their application to large tasks (i.e., having more than ten to 20 categories) is limited by a dramatic increase in required training time. The time required to train a single network to perform N-way classification is nearly proportional to the exponential of N. In contrast, the binary partitioned approach requires training times on the order of N2. Besides partitioning, other related issues were investigated such as acoustic feature selection for speaker identification and neural network optimization. The binary partitioned approach was used to develop an automatic speaker identification system for 120 male and 130 female speakers of a standard speech data base. The system performs with 100% accuracy in a text-independent mode when trained with about nine to 14 seconds of speech and tested with six to eight seconds of speech

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DSYARTHRIC SPEECH RECOGNITION

    Get PDF
    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech is required, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate more dysarthric speech with a broader range. To evaluate their effectiveness for synthesis of training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on the dysarthric ASR systems

    Semi-continuous hidden Markov models for automatic speaker verification

    Get PDF

    Identification of Non-Linguistic Speech Features

    Get PDF
    Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction. With 2s of speech, the language can be identified with better than 99 % accuracy. Error in sex-identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5 % on TIMIT (168 speakers) and 99.2 % on BREF (65 speakers), were obtained with one utterance per speaker, and 100 % with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers had the same identification accuracies obtained with supervised adaptation

    Single-Microphone Speech Enhancement and Separation Using Deep Learning

    Get PDF
    The cocktail party problem comprises the challenging task of understanding a speech signal in a complex acoustic environment, where multiple speakers and background noise signals simultaneously interfere with the speech signal of interest. A signal processing algorithm that can effectively increase the speech intelligibility and quality of speech signals in such complicated acoustic situations is highly desirable. Especially for applications involving mobile communication devices and hearing assistive devices. Due to the re-emergence of machine learning techniques, today, known as deep learning, the challenges involved with such algorithms might be overcome. In this PhD thesis, we study and develop deep learning-based techniques for two sub-disciplines of the cocktail party problem: single-microphone speech enhancement and single-microphone multi-talker speech separation. Specifically, we conduct in-depth empirical analysis of the generalizability capability of modern deep learning-based single-microphone speech enhancement algorithms. We show that performance of such algorithms is closely linked to the training data, and good generalizability can be achieved with carefully designed training data. Furthermore, we propose uPIT, a deep learning-based algorithm for single-microphone speech separation and we report state-of-the-art results on a speaker-independent multi-talker speech separation task. Additionally, we show that uPIT works well for joint speech separation and enhancement without explicit prior knowledge about the noise type or number of speakers. Finally, we show that deep learning-based speech enhancement algorithms designed to minimize the classical short-time spectral amplitude mean squared error leads to enhanced speech signals which are essentially optimal in terms of STOI, a state-of-the-art speech intelligibility estimator.Comment: PhD Thesis. 233 page

    Single-Microphone Speech Enhancement and Separation Using Deep Learning

    Get PDF
    corecore