
    Optimization Techniques for a Physical Model of Human Vocalisation

    We present an unsupervised approach to optimizing and evaluating the synthesis of non-speech audio effects from a speech production model. We use the Pink Trombone synthesizer as a case study of a simplified production model of the vocal tract, targeting non-speech human audio signals (yawns). We selected and optimized the control parameters of the synthesizer to minimize the difference between real and generated audio. We validated the most common optimization techniques reported in the literature as well as a specifically designed neural network, and we evaluated several popular quality metrics as error functions, including both objective quality metrics and subjective-equivalent metrics. We compared the results in terms of total error and computational demand. Results show that genetic and swarm optimizers outperform least-squares algorithms at the cost of slower execution, and that specific combinations of optimizers and audio representations yield significantly different results. The proposed methodology could be used to benchmark other physical models and audio types. Comment: Accepted to DAFx 202
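
    A minimal sketch of the optimization loop described above: a derivative-free, genetic-style optimizer (SciPy's differential evolution) searches the control-parameter space to minimize an MFCC-space error between target and generated audio. The synthesizer here is a toy two-sinusoid stand-in, not the Pink Trombone, and all parameter names and ranges are illustrative assumptions.

    import numpy as np
    import librosa
    from scipy.optimize import differential_evolution

    SR = 22050

    def synthesize(params, dur=0.5):
        # Toy stand-in for the vocal-tract synthesizer: two "formant"
        # sinusoids whose frequencies and amplitudes act as control params.
        t = np.linspace(0.0, dur, int(SR * dur), endpoint=False)
        f1, f2, a1, a2 = params
        return a1 * np.sin(2 * np.pi * f1 * t) + a2 * np.sin(2 * np.pi * f2 * t)

    def mfcc_error(params, target):
        # Error function: mean squared distance in MFCC space, one of
        # several audio representations one could plug in here.
        m_gen = librosa.feature.mfcc(y=synthesize(params), sr=SR, n_mfcc=13)
        m_tgt = librosa.feature.mfcc(y=target, sr=SR, n_mfcc=13)
        return float(np.mean((m_gen - m_tgt) ** 2))

    # Self-contained target rendered from known parameters; in the paper
    # the target is a recorded yawn.
    target = synthesize(np.array([300.0, 900.0, 0.8, 0.4]))
    bounds = [(100, 600), (600, 1500), (0, 1), (0, 1)]
    result = differential_evolution(mfcc_error, bounds, args=(target,),
                                    maxiter=30, seed=0)
    print("recovered parameters:", result.x, "error:", result.fun)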

    Estimation of Subglottal Pressure, Vocal Fold Collision Pressure, and Intrinsic Laryngeal Muscle Activation From Neck-Surface Vibration Using a Neural Network Framework and a Voice Production Model

    The ambulatory assessment of vocal function can be significantly enhanced by access to physiologically based features that describe underlying pathophysiological mechanisms in individuals with voice disorders. This type of enhancement can improve methods for the prevention, diagnosis, and treatment of behaviorally based voice disorders. Unfortunately, the direct measurement of important vocal features such as subglottal pressure, vocal fold collision pressure, and laryngeal muscle activation is impractical in laboratory and ambulatory settings. In this study, we introduce a method to estimate these features during phonation from a neck-surface vibration signal through a framework that integrates a physiologically relevant model of voice production and machine learning tools. The signal from a neck-surface accelerometer is first processed using subglottal impedance-based inverse filtering to yield an estimate of the unsteady glottal airflow. Seven aerodynamic and acoustic features are extracted from the neck-surface accelerometer signal and an optional microphone signal. A neural network architecture is selected to provide a mapping between the seven input features and subglottal pressure, vocal fold collision pressure, and cricothyroid and thyroarytenoid muscle activation. This non-linear mapping is trained solely on 13,000 Monte Carlo simulations of a voice production model that utilizes a symmetric triangular body-cover model of the vocal folds. The performance of the method was compared against laboratory data from synchronous recordings of oral airflow, intraoral pressure, microphone, and neck-surface vibration in 79 vocally healthy female participants uttering consecutive /pæ/ syllable strings at comfortable, loud, and soft levels. The mean absolute error and root-mean-square error for estimating the mean subglottal pressure were 191 Pa (1.95 cm H2O) and 243 Pa (2.48 cm H2O), respectively, which are comparable with previous studies but with the key advantages of not requiring subject-specific training and yielding more output measures. The validation of vocal fold collision pressure and laryngeal muscle activation was performed with synthetic values as reference. These initial results provide valuable insight for further vocal fold model refinement and constitute a proof of concept that the proposed machine learning method is a feasible option for providing physiologically relevant measures for laboratory and ambulatory assessment of vocal function.
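
    A minimal sketch of the feature-to-physiology mapping stage, assuming a small feed-forward network in scikit-learn; the random arrays below are placeholders standing in for the 13,000 Monte Carlo simulations of the voice production model.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(13000, 7))   # 7 aerodynamic/acoustic input features
    Y = rng.normal(size=(13000, 4))   # subglottal pressure, collision
                                      # pressure, CT and TA muscle activation
                                      # (placeholders for model simulations)

    scaler = StandardScaler().fit(X)
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                       random_state=0)
    net.fit(scaler.transform(X), Y)

    x_new = rng.normal(size=(1, 7))   # features from a neck-surface recording
    print(net.predict(scaler.transform(x_new)))  # estimated physiology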

    High Security and Capacity of Image Steganography for Hiding Human Speech Based on Spatial and Cepstral Domains

    A new technique for hiding a speech signal clip inside a digital color image is proposed in this paper to improve steganography security and loading capacity. The suggested image steganography technique operates in both the spatial and cepstral domains, where Mel-frequency cepstral coefficients (MFCCs) are adopted as highly efficient features of the speech signal. The presented technique improves image steganography in two ways. First, it increases hiding capacity by embedding the MFCC features and pitch values extracted from the speech signal inside the cover color image, rather than directly hiding all samples of the digitized speech signal. Second, it improves data security by hiding the secret data (the MFCC features) anywhere in the host image rather than using straightforward least-significant-bit substitution on the cover image. At the receiving side, the proposed approach recovers these hidden features and uses them to reconstruct the speech waveform by inverting the MFCC extraction steps, recovering an approximate vocal tract response and combining it with a pitch-based excitation signal. The results show a peak signal-to-noise ratio of 52.4 dB for the stego-image, reflecting very good quality, and a reduction of the embedded data to about 6%-25% of the original. In addition, the results show a speech reconstruction quality of about 94.24% correlation with the original speech signal.
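
    A minimal sketch of the embedding side of such a scheme, under two assumptions labeled in the comments: the MFCC payload is quantized to 8-bit values, and a key-seeded pseudo-random generator scatters the payload bits across pixel positions (rather than a fixed LSB order). The function name and quantization are illustrative, not the paper's implementation.

    import numpy as np
    import librosa

    def embed(cover, speech, sr, key):
        # 1. Extract MFCCs as a compact stand-in for the raw speech samples,
        #    then quantize them to 8 bits (assumed quantization; the min/max
        #    scale and the key must be shared with the receiver).
        mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13)
        lo, hi = mfcc.min(), mfcc.max()
        payload = np.round(255 * (mfcc - lo) / (hi - lo)).astype(np.uint8).ravel()

        # 2. Scatter the payload bits over key-selected pixel LSBs anywhere
        #    in the host image, as the abstract describes.
        stego = cover.copy().ravel()
        rng = np.random.default_rng(key)
        positions = rng.choice(stego.size, size=payload.size * 8, replace=False)
        stego[positions] = (stego[positions] & 0xFE) | np.unpackbits(payload)
        return stego.reshape(cover.shape), (lo, hi)

    # usage: stego_img, scale = embed(cover_rgb_uint8, speech, sr, key=1234)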

    Singing voice resynthesis using concatenative-based techniques

    Doctoral thesis. Informatics Engineering. Faculdade de Engenharia. Universidade do Porto. 201

    Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation

    Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network, based on a small set of representative seed samples. From this seeding set, a larger training set was generated to provide richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map given acoustic data onto articulatory target parameters, which can then be used to identify their distribution based on linguistic contexts. The model was evaluated on its effectiveness in mapping acoustics to articulation and on the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision.
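
    A minimal sketch of the acoustic-to-target inverse mapping, assuming a PyTorch bidirectional LSTM; the feature and target dimensions below are illustrative placeholders, not the paper's.

    import torch
    import torch.nn as nn

    class AcousticToTarget(nn.Module):
        def __init__(self, n_acoustic=13, n_targets=10, hidden=128):
            super().__init__()
            self.blstm = nn.LSTM(n_acoustic, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_targets)

        def forward(self, x):            # x: (batch, frames, n_acoustic)
            h, _ = self.blstm(x)         # h: (batch, frames, 2 * hidden)
            return self.head(h)          # per-frame articulatory targets

    model = AcousticToTarget()
    frames = torch.randn(4, 200, 13)     # e.g. MFCC frames from seed samples
    targets = model(frames)              # (4, 200, 10) target parameters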

    Spoken Word Recognition Using Hidden Markov Model

    The main aim of this project is to develop an isolated spoken word recognition system using hidden Markov models (HMMs) with good accuracy across the full frequency range of the human voice. Ten different words were recorded by different speakers, male and female, and the results are compared across different feature extraction methods. Earlier work covered recognition of seven short utterances using HMMs with only one feature extraction method. The spoken word recognition system is divided into two major blocks. The first covers recording the database and extracting features from the recorded signals; Mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients, and fundamental frequency are used as feature extraction methods. To obtain MFCCs, the signal goes through pre-emphasis, framing, windowing, the fast Fourier transform, a Mel filter bank, and finally the discrete cosine transform, whereas linear frequency cepstral coefficients do not use the Mel scale (a sketch of these steps follows below). The second block covers the HMMs used for modeling and recognizing the spoken words. All training samples are clustered using the K-means algorithm, and Gaussian mixtures parameterized by mean, variance, and weight serve as the model parameters. The Baum-Welch algorithm is used to train on the samples and re-estimate the parameters. Finally, the Viterbi algorithm finds the state sequence that best matches the given spoken utterance. All simulations were done in MATLAB on the Microsoft Windows 7 operating system.
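
    A from-scratch sketch of the MFCC steps listed above (pre-emphasis, framing, windowing, FFT, Mel filter bank, discrete cosine transform), written in Python rather than the project's MATLAB; the frame sizes and filter counts are common defaults, not values from the project.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr, frame_len=400, hop=160, n_filters=26, n_ceps=13):
        # Pre-emphasis boosts high frequencies.
        signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # Framing into overlapping frames, then a Hamming window.
        n_frames = 1 + (len(signal) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = signal[idx] * np.hamming(frame_len)
        # Power spectrum via the FFT.
        power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512
        # Triangular Mel filter bank (filter edges equally spaced in Mel).
        mel = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_filters + 2)
        hz = 700 * (10 ** (mel / 2595) - 1)
        bins = np.floor((512 + 1) * hz / sr).astype(int)
        fbank = np.zeros((n_filters, 257))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        # Log filter-bank energies, then DCT to decorrelate.
        logmel = np.log(power @ fbank.T + 1e-10)
        return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

    sr = 16000
    t = np.arange(sr) / sr                      # one second of a 440 Hz tone
    print(mfcc(np.sin(2 * np.pi * 440 * t), sr).shape)   # (frames, 13)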

    Intelligibility Evaluation of Pathological Speech through Multigranularity Feature Extraction and Optimization

    Pathological speech usually refers to speech distortion resulting from illness or other biological insults. The assessment of pathological speech plays an important role in assisting experts, while automatic evaluation of speech intelligibility is difficult because such speech is usually nonstationary and mutational. In this paper, we carry out an independent innovation in feature extraction and reduction, and we describe a multigranularity combined feature scheme that is optimized by a hierarchical visual method. A novel method of generating the feature set based on the S-transform and chaotic analysis is proposed; the set comprises basic acoustic features (BAFS, 430 dimensions), local spectral characteristics (MSCC, 84 Mel S-transform cepstral coefficients), and chaotic features (12 dimensions). Finally, radar charts and the F-score are used to optimize the features through hierarchical visual fusion. The feature set could be reduced from 526 dimensions to 96 on the NKI-CCRT corpus and to 104 on the SVD corpus. The experimental results show that the new features, classified with a support vector machine (SVM), give the best performance, with a recognition rate of 84.4% on the NKI-CCRT corpus and 78.7% on the SVD corpus. The proposed method thus proves effective and reliable for pathological speech intelligibility evaluation.
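
    A minimal sketch of F-score feature ranking, one of the optimization tools named above, assuming the standard two-class F-score definition; the feature matrix and labels are random placeholders standing in for the 526-dimensional multigranularity features.

    import numpy as np

    def f_scores(X, y):
        # Two-class F-score per feature: between-class separation of the
        # means over the within-class sample variances.
        pos, neg = X[y == 1], X[y == 0]
        num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
        den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
        return num / (den + 1e-12)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 526))    # placeholder multigranularity features
    y = rng.integers(0, 2, size=200)   # placeholder intelligibility labels
    keep = np.argsort(f_scores(X, y))[::-1][:96]   # keep the top 96 dims
    X_reduced = X[:, keep]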

    A Review of the Assessment Methods of Voice Disorders in the Context of Parkinson's Disease

    In recent years, significant progress has been witnessed in research dedicated to the treatment of disabilities. This is particularly true for neurological diseases, which generally affect the system that controls the execution of learned motor patterns. In addition to its importance for communication with the outside world and interaction with others, the voice is a reflection of our personality, moods, and emotions. It provides information on health status, physical condition, intentions, age, and even social environment. It is also a working tool for many, and an important element of life for all. Patients with Parkinson's disease (PD) are numerous, and they suffer from hypokinetic dysarthria, which manifests in all aspects of speech production: respiration, phonation, articulation, nasalization, and prosody. This paper provides a review of methods for the assessment of speech disorders in the context of PD and discusses their limitations.