
    Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation

    Whispering is a natural, unphonated, secondary aspect of speech communication for most people. However, it is the primary mechanism of communication for some speakers who have impaired voice production mechanisms, such as partial laryngectomees, as well as for those prescribed voice rest, which often follows surgery or damage to the larynx. Unlike most people, who choose when to whisper and when not to, these speakers may have little choice but to rely on whispers for much of their daily vocal interaction. Even though most speakers will whisper at times, and some speakers can only whisper, the majority of today’s computational speech technology systems assume or require phonated speech. This article considers conversion of whispers into natural-sounding phonated speech as a noninvasive prosthetic aid for people with voice impairments who can only whisper. As a by-product, the technique is also useful for unimpaired speakers who choose to whisper. Speech reconstruction systems can be classified into those requiring training and those that do not. Among the latter, a recent parametric reconstruction framework is explored and then enhanced through a refined estimation of plausible pitch from weighted formant differences. The improved reconstruction framework, with the proposed formant-derived artificial pitch modulation, is validated through subjective and objective comparison tests against state-of-the-art alternatives.
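
    The enhancement hinges on mapping formant behaviour to a plausible pitch contour. Below is a minimal Python sketch of that idea, assuming per-frame formant tracks F1-F3 are already available; the base pitch, weights, and gain are illustrative placeholders rather than the coefficients derived in the paper.

    import numpy as np

    def plausible_pitch(formants, base_f0=120.0, weights=(0.6, 0.3, 0.1), gain=0.05):
        """Derive an artificial pitch contour from frame-wise formant movement.

        formants: (n_frames, 3) array of F1-F3 tracks in Hz (assumed given).
        The per-frame F0 is a base pitch modulated by a weighted sum of
        relative formant deviations; an illustrative mapping, not the
        published estimator.
        """
        w = np.asarray(weights)
        means = formants.mean(axis=0)            # long-term formant averages
        deviation = (formants - means) / means   # relative formant movement
        modulation = gain * (deviation @ w)      # weighted formant differences
        return base_f0 * (1.0 + modulation)      # plausible F0 per frame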

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The proceedings of the MAVEBA Workshop, held biennially, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images as a support to clinical diagnosis and the classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published collecting selected papers from the conference.

    Enhancement in Speaker Identification through Feature Fusion using Advanced Dilated Convolution Neural Network

    There are various challenges in identifying speakers accurately. The extraction of discriminative features is a vital task for accurate speaker identification. Nowadays, speaker identification is widely investigated using deep learning. Complex and noisy speech data affect the performance of Mel Frequency Cepstral Coefficients (MFCCs); hence, MFCCs fail to represent speaker characteristics accurately. In this work, a novel text-independent speaker identification system is developed to enhance performance through the fusion of log-mel spectrum and excitation features. The excitation information arises from the vibration of the vocal folds and is represented using the Linear Prediction (LP) residual. The features extracted from the excitation are the residual phase, sharpness, Energy of Excitation (EoE), and Strength of Excitation (SoE). The extracted features are processed with a dilated convolutional neural network (dilated CNN) to perform the identification task. Extensive evaluation showed that the fusion of excitation features gives better results than existing methods. The accuracy reaches 94.12% for 11 complex classes and 91.34% for 80 speakers, and the Equal Error Rate (EER) is reduced to 1.16% for the proposed model. The proposed model is tested on the LibriSpeech corpus using MATLAB R2021b, outperforming existing baseline models and achieving an accuracy improvement of 1.34% over the baseline system.
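
    The two feature streams being fused can be sketched briefly in Python (the paper itself used MATLAB). In this hedged sketch, the sampling rate, frame length, LP order, and mel settings are illustrative assumptions, and the dilated CNN classifier is not reproduced.

    import numpy as np
    import librosa
    import scipy.signal

    def excitation_and_logmel(path, order=12, frame_len=400, hop=160):
        """Extract an LP residual (excitation signal) and a log-mel spectrogram.

        Defaults assume 16 kHz audio with 25 ms frames and a 10 ms hop;
        these are common choices, not the paper's exact configuration.
        """
        y, sr = librosa.load(path, sr=16000)

        # Frame-wise LP analysis: inverse filtering each frame with its own
        # A(z) leaves the prediction residual, which carries the excitation.
        residual = np.zeros_like(y)
        for start in range(0, len(y) - frame_len + 1, frame_len):
            frame = y[start:start + frame_len]
            a = librosa.lpc(frame, order=order)   # A(z) coefficients
            residual[start:start + frame_len] = scipy.signal.lfilter(a, [1.0], frame)

        # Log-mel spectrogram of the raw speech, to be fused with the
        # excitation-derived features downstream.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                             hop_length=hop, n_mels=40)
        return residual, librosa.power_to_db(mel)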

    Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system

    Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to improve quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as excitation model inputs differ from the original inputs on which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if the excitation itself is predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve performance measured in MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
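
    The second modification condenses to a few lines. The sketch below uses plain LPC inverse filtering in Python as a stand-in for the paper's glottal inverse filtering step; the frame list and predicted filter coefficients are assumed to come from the acoustic model.

    from scipy.signal import lfilter

    def reestimate_glottal_targets(speech_frames, predicted_lpc):
        """Re-derive excitation-model targets from predicted vocal tract filters.

        speech_frames: iterable of 1-D numpy arrays (natural speech frames).
        predicted_lpc: matching LPC coefficient arrays a = [1, a1, ..., ap]
        output by the acoustic model. Inverse filtering with the predicted
        A(z), rather than the filter estimated from natural speech, keeps
        the excitation targets consistent with the synthesis-time filters.
        """
        return [lfilter(a, [1.0], frame)          # e[n] = A(z) applied to s[n]
                for frame, a in zip(speech_frames, predicted_lpc)]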

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering


    An investigation into glottal waveform based speech coding

    Coding of voiced speech by extraction of the glottal waveform has shown promise in improving the efficiency of speech coding systems. This thesis describes an investigation into the performance of such a system. The effect of reverberation on the radiation impedance at the lips is shown to be negligible under normal conditions. The accuracy of the Image Method for adding artificial reverberation to anechoic speech recordings is also established. A new algorithm for Glottal Closure Instant detection, Pre-emphasised Maximum Likelihood Epoch Detection (PMLED), is proposed. The algorithm is tested on natural speech and is shown to be both accurate and robust. Two techniques for glottal waveform estimation, Closed Phase Inverse Filtering (CPIF) and Iterative Adaptive Inverse Filtering (IAIF), are compared. In tandem with an LF model fitting procedure, both techniques display a high degree of accuracy; however, IAIF is found to be slightly more robust. Based on these results, a Glottal Excited Linear Predictive (GELP) coding system for voiced speech is proposed and tested. Using a differential LF parameter quantisation scheme, the system achieves speech quality similar to that of U.S. Federal Standard 1016 CELP at a lower mean bit rate while incurring no extra delay.
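
    Of the two estimation techniques compared, IAIF admits a compact outline. The Python sketch below shows a simplified two-round IAIF on a single frame; the LP orders and leaky-integration constant are typical textbook values, not the thesis's exact settings.

    import librosa
    from scipy.signal import lfilter

    def iaif(frame, p_vt=18, p_gl=4):
        """Simplified Iterative Adaptive Inverse Filtering for one speech frame.

        Each round estimates and removes the glottal contribution, models the
        vocal tract by LPC, inverse filters the speech, and integrates with a
        leaky integrator 1 / (1 - 0.99 z^-1) to cancel lip radiation.
        """
        # Round 1: crude glottal model (1st order), then vocal tract estimate.
        g1 = librosa.lpc(frame, order=1)
        vt1 = librosa.lpc(lfilter(g1, [1.0], frame), order=p_vt)
        glottal1 = lfilter([1.0], [1.0, -0.99],
                           lfilter(vt1, [1.0], frame))

        # Round 2: refine with a higher-order glottal model.
        g2 = librosa.lpc(glottal1, order=p_gl)
        vt2 = librosa.lpc(lfilter(g2, [1.0], frame), order=p_vt)
        return lfilter([1.0], [1.0, -0.99],
                       lfilter(vt2, [1.0], frame))   # estimated glottal flow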

    The automatic detection of heart failure using speech signals

    Heart failure (HF) is a major global health concern and is increasing in prevalence. It affects the larynx and breathing, and thereby the quality of speech. In this article, we propose an approach for the automatic detection of people with HF using the speech signal. The proposed method explores mel-frequency cepstral coefficient (MFCC) features, glottal features, and their combination to distinguish HF from healthy speech. The glottal features were extracted from the voice source signal estimated using glottal inverse filtering. Four machine learning algorithms, namely support vector machine, Extra Tree, AdaBoost, and feed-forward neural network (FFNN), were trained separately on the individual features and their combination. It was observed that the MFCC features yielded higher classification accuracies than the glottal features. Furthermore, the complementary nature of the glottal features was investigated by combining them with the MFCC features. Our results show that the FFNN classifier trained on a reduced set of glottal + MFCC features achieved the best overall performance in both speaker-dependent and speaker-independent scenarios.
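
    The classification stage is straightforward to outline. Below is a hedged scikit-learn sketch of training the feed-forward network on fused features; feature extraction is assumed to have happened upstream, and the network size and cross-validation setup are illustrative placeholders.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    def train_hf_detector(mfcc_feats, glottal_feats, labels):
        """Train an HF detector on fused MFCC + glottal features.

        mfcc_feats, glottal_feats: (n_utterances, d) per-utterance feature
        matrices; labels: 0 = healthy, 1 = heart failure. The classifier
        size is illustrative, not the paper's configuration.
        """
        X = np.hstack([mfcc_feats, glottal_feats])   # simple feature fusion
        clf = make_pipeline(StandardScaler(),
                            MLPClassifier(hidden_layer_sizes=(64, 32),
                                          max_iter=500, random_state=0))
        print("5-fold CV accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
        return clf.fit(X, labels)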