7,857 research outputs found

    Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

    Vocal tract configurations play a vital role in generating distinguishable speech sounds by modulating the airflow and creating different resonant cavities in speech production. They contain abundant information that can be used to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, such as Long-term Recurrent Convolutional Network (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and for various classification tasks is discussed. Interestingly, the results show a marked difference in model performance for speech classification compared with generic sequence or video classification tasks. Comment: To appear in the INTERSPEECH 2018 Proceedings

    Evaluation of the neo-glottal closure based on the source description in esophageal voice

    The characteristics of esophageal voice make its study by traditional acoustic means limited and complicated. These limitations are even stronger when working with patients who lack the minimal skills to control the required technique. Nevertheless, the speech therapist needs to know the performance and mechanics developed by the patient in producing esophageal voice, as the specific techniques required in this case are not as universal and well known as those for normal voicing. Each patient develops different strategies for producing esophageal voice because of the anatomical changes affecting the crico-pharyngeal sphincter (CPS) and the functional losses resulting from surgery. It is therefore of fundamental relevance that practitioners can count on new instruments to evaluate esophageal voice quality, which in turn could help to enhance the CPS dynamics. The present work describes the voice of four patients after laryngectomy, using data obtained from the study of the neo-glottal wave profile. Results obtained from analyzing the open-close phases and the tension of the muscular body of the CPS are shown.

    A Hybrid Parameterization Technique for Speaker Identification

    Classical parameterization techniques for Speaker Identification encode the power spectral density of raw speech, without discriminating between articulatory features produced by vocal tract dynamics (acoustic-phonetics) and glottal source biometry. This paper presents a study to separate voiced fragments of speech into vocal and glottal components, dominated respectively by the vocal tract transfer function, estimated adaptively to track the acoustic-phonetic sequence of the message, and by the glottal characteristics of the speaker and the phonation gesture. The separation methodology is based on Joint Process Estimation under the hypothesis that the vocal and glottal spectral distributions are uncorrelated. Its application to voiced speech is presented in the time and frequency domains. The parameterization methodology is also described. Speaker Identification experiments conducted on 245 speakers are shown, comparing different parameterization strategies. The results confirm the better performance of decoupled parameterization compared with approaches based on plain speech parameterization.

    Palate-referenced Articulatory Features for Acoustic-to-Articulator Inversion

    The selection of effective articulatory features is an important component of tasks such as acoustic-to-articulator inversion and articulatory synthesis. Although it is common to use direct articulatory sensor measurements as feature variables, this approach fails to incorporate important physiological information such as palate height and shape, and thus is not as representative of the vocal tract cross section as desired. We introduce a set of articulator feature variables that are palate-referenced and normalized with respect to the articulatory working space in order to improve the quality of the vocal tract representation. These features include normalized horizontal positions plus the normalized palatal height of two midsagittal tongue sensors and one lateral tongue sensor, as well as normalized lip separation and lip protrusion. The quality of the feature representation is evaluated subjectively by comparing the variances and vowel separation in the working space, and quantitatively through measurement of acoustic-to-articulator inversion error. Results indicate that the palate-referenced features have reduced variance, increased separation between vowel spaces, and substantially lower inversion error than direct sensor measures.

    Testing the assumptions of linear prediction analysis in normal vowels

    This paper develops an improved surrogate data test to show experimental evidence, for all the simple vowels of US English and for both male and female speakers, that Gaussian linear prediction analysis, a ubiquitous technique in current speech technologies, cannot be used to extract all the dynamical structure of real speech time series. The test provides robust evidence undermining the validity of these linear techniques, supporting the assumptions of dynamical nonlinearity and/or non-Gaussianity common to more recent, complex efforts at dynamical modelling of speech time series. However, an additional finding is that the classical assumptions cannot be ruled out entirely, and plausible evidence is given to explain the success of the linear Gaussian theory as a weak approximation to the true, nonlinear/non-Gaussian dynamics. This supports the use of appropriate hybrid linear/nonlinear/non-Gaussian modelling. With a calibrated calculation of the test statistic and a particular choice of experimental protocol, some of the known systematic problems of the method of surrogate data testing are circumvented, yielding results that support the conclusions to a high level of significance.
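
For readers unfamiliar with the linear prediction analysis that the abstract above puts to the test, the following is a minimal sketch of one common formulation: the autocorrelation method solved with the Levinson-Durbin recursion. The abstract does not say which formulation the paper uses, so this choice is an assumption, and the function names (`autocorrelation`, `levinson_durbin`, `synth_ar2`) and the synthetic second-order signal are hypothetical, chosen only for illustration.

```python
import random


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for a linear predictor.

    Returns (a, err) where x[n] is approximated by
    sum(a[k] * x[n - 1 - k] for k in range(order)) and err is the
    estimated variance of the prediction residual.
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stepping from order m-1 to order m.
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


def synth_ar2(n, seed=0):
    """Hypothetical test signal: x[n] = 1.3 x[n-1] - 0.4 x[n-2] + e[n],
    with e[n] drawn as unit-variance Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(1.3 * x[-1] - 0.4 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


if __name__ == "__main__":
    signal = synth_ar2(20000)
    coeffs, residual_var = levinson_durbin(autocorrelation(signal, 2), 2)
    print("predictor coefficients:", coeffs)
    print("residual variance:", residual_var)
```

On a signal that really is linear and Gaussian, as in this synthetic case, the predictor should recover coefficients close to those used to generate it. Any structure a linear Gaussian model cannot capture would remain in the prediction residual.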

    Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection

    Background: Voice disorders affect patients profoundly, and acoustic tools can potentially measure voice function objectively. Disordered sustained vowels exhibit wide-ranging phenomena, from nearly periodic to highly complex, aperiodic vibrations, and increased "breathiness". Modelling and surrogate data studies have shown significant nonlinear and non-Gaussian random properties in these sounds. Nonetheless, existing tools are limited to analysing voices displaying near periodicity, and do not account for this inherent biophysical nonlinearity and non-Gaussian randomness, often using linear signal processing methods insensitive to these properties. They do not directly measure the two main biophysical symptoms of disorder: complex nonlinear aperiodicity, and turbulent, aeroacoustic, non-Gaussian randomness. Often these tools cannot be applied to more severe disordered voices, limiting their clinical usefulness.

Methods: This paper introduces two new tools to speech analysis: recurrence and fractal scaling, which overcome the range limitations of existing tools by addressing directly these two symptoms of disorder, together reproducing a "hoarseness" diagram. A simple bootstrapped classifier then uses these two features to distinguish normal from disordered voices.

Results: On a large database of subjects with a wide variety of voice disorders, these new techniques can distinguish normal from disordered cases, using quadratic discriminant analysis, with an overall correct classification performance of 91.8% plus or minus 2.0%. The true positive classification performance is 95.4% plus or minus 3.2%, and the true negative performance is 91.5% plus or minus 2.3% (95% confidence). This is shown to outperform all combinations of the most popular classical tools.

Conclusions: Given the very large number of arbitrary parameters and computational complexity of existing techniques, these new techniques are far simpler and yet achieve clinically useful classification performance using only a basic classification technique. They do so by exploiting the inherent nonlinearity and turbulent randomness in disordered voice signals. They are widely applicable to the whole range of disordered voice phenomena by design. These new measures could therefore be used for a variety of practical clinical purposes.

    Bio-inspired broad-class phonetic labelling

    Recent studies have shown that the correct labeling of phonetic classes may help current Automatic Speech Recognition (ASR) when combined with classical parsing automata based on Hidden Markov Models (HMM). This paper describes a method for Phonetic Class Labeling (PCL) based on bio-inspired speech processing. The methodology is based on the automatic detection of formants and formant trajectories after a careful separation of the vocal and glottal components of speech, and on the operation of CF (Characteristic Frequency) neurons in the cochlear nucleus and cortical complex of the human auditory apparatus. Examples of phonetic class labeling are given and the applicability of the method to Speech Processing is discussed.