784 research outputs found

    Psychophysical and signal-processing aspects of speech representation


    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years and aims to stimulate contacts between specialists active in research and industrial developments in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.

    DISSOCIABLE MECHANISMS OF CONCURRENT SPEECH IDENTIFICATION IN NOISE AT CORTICAL AND SUBCORTICAL LEVELS.

    When two vowels with different fundamental frequencies (F0s) are presented concurrently, listeners often hear two voices producing different vowels on different pitches. Parsing of this simultaneous speech can also be affected by the signal-to-noise ratio (SNR) in the auditory scene. The extraction and interaction of F0 and SNR cues may occur at multiple levels of the auditory system. The major aims of this dissertation are to elucidate the neural mechanisms and time course of concurrent speech perception in clean and in degraded listening conditions, and its behavioral correlates. In two complementary experiments, electrical brain activity (EEG) was recorded at cortical (EEG Study #1) and subcortical (FFR Study #2) levels while participants heard double-vowel stimuli whose fundamental frequencies (F0s) differed by zero or four semitones (STs), presented in either clean or noise-degraded (+5 dB SNR) conditions. Behaviorally, listeners were more accurate in identifying both vowels for larger F0 separations (i.e., 4 ST; with pitch cues), and this F0-benefit was more pronounced at more favorable SNRs. Time-frequency analysis of cortical EEG oscillations (i.e., brain rhythms) revealed a dynamic time course for concurrent speech processing that depended on both extrinsic (SNR) and intrinsic (pitch) acoustic factors. Early high-frequency activity reflected pre-perceptual encoding of acoustic features (~200 ms) and the quality (i.e., SNR) of the speech signal (~250-350 ms), whereas later-evolving low-frequency rhythms (~400-500 ms) reflected post-perceptual, cognitive operations that covaried with listening effort and task demands. Analysis of subcortical responses indicated that, while FFRs provided a high-fidelity representation of double-vowel stimuli and the spectro-temporal nonlinear properties of the peripheral auditory system, FFR activity largely reflected the neural encoding of stimulus features (exogenous coding) rather than perceptual outcomes, although timbre (F1) could predict the speed in noise conditions. Taken together, the results of this dissertation suggest that subcortical auditory processing reflects mostly exogenous (acoustic) feature encoding, in stark contrast to cortical activity, which reflects perceptual and cognitive aspects of concurrent speech perception. By studying multiple brain indices underlying an identical task, these studies provide a more comprehensive window into the hierarchy of brain mechanisms and the time course of concurrent speech processing.
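    The two stimulus manipulations named in this abstract, a four-semitone F0 separation and a +5 dB SNR, come down to simple arithmetic. The following is only a minimal sketch of that arithmetic, not the dissertation's stimulus-generation code. The 100 Hz base F0, the 8 kHz sampling rate, the five-harmonic complexes used as stand-ins for vowels (no formant shaping), and all function names are assumptions made for illustration.

```python
import math
import random


def shift_semitones(f0_hz, semitones):
    """Frequency lying `semitones` equal-tempered semitones above f0_hz."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def harmonic_complex(f0_hz, n_harmonics, fs_hz, dur_s):
    """Sum of equal-amplitude harmonics: a crude vowel stand-in (no formants)."""
    n_samples = int(fs_hz * dur_s)
    return [
        sum(math.sin(2.0 * math.pi * k * f0_hz * i / fs_hz)
            for k in range(1, n_harmonics + 1))
        for i in range(n_samples)
    ]


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain g such that 20*log10(rms(signal) / rms(g * noise)) equals snr_db."""
    return rms(signal) / (rms(noise) * 10.0 ** (snr_db / 20.0))


# Hypothetical parameters, chosen only for illustration.
FS_HZ = 8000
F0_LOW_HZ = 100.0

# A 4-semitone separation multiplies F0 by 2**(4/12), roughly 1.26.
f0_high_hz = shift_semitones(F0_LOW_HZ, 4)

vowel_a = harmonic_complex(F0_LOW_HZ, 5, FS_HZ, 0.05)
vowel_b = harmonic_complex(f0_high_hz, 5, FS_HZ, 0.05)
double_vowel = [a + b for a, b in zip(vowel_a, vowel_b)]

# Gaussian noise scaled so that the mixture sits at +5 dB SNR.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in double_vowel]
gain = noise_gain_for_snr(double_vowel, noise, 5.0)
noisy_double_vowel = [s + gain * n for s, n in zip(double_vowel, noise)]
```

    Setting the semitone argument to 0 gives the same-F0 condition, in which both components share one pitch.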

    Articulatory Feature Detection based on Cognitive Speech Perception

    Cognitive Speech Perception is a field of growing interest, as advances in the cognitive sciences over recent decades have provided better descriptions of the neural processes involved in sound processing by the Auditory System and the Auditory Cortex. This knowledge may be applied to design new bio-inspired paradigms for processing speech sounds in the Speech Sciences, especially in Articulatory Phonetics, but also in other areas such as Emotion Detection and Speaker Characterization. The present paper reviews some basic facts already established in Speech Perception and the corresponding paradigms under which they may be used to design new algorithms that detect Articulatory (Phonetic) Features in speech sounds, which may later be used in Speech Labelling, Phonetic Characterization or similar tasks.

    Features of hearing: applications of machine learning to uncover the building blocks of hearing

    Recent advances in machine learning have instigated a renewed interest in using machine learning approaches to better understand human sensory processing. This line of research is particularly interesting for speech research since speech comprehension is uniquely human, which complicates obtaining detailed neural recordings. In this thesis, I explore how machine learning can be used to uncover new knowledge about the auditory system, with a focus on discovering robust auditory features. The resulting increased understanding of the noise robustness of human hearing may help to better assist those with hearing loss and improve Automatic Speech Recognition (ASR) systems. First, I show how computational neuroscience and machine learning can be combined to generate hypotheses about auditory features. I introduce a neural feature detection model with a modest number of parameters that is compatible with auditory physiology. By testing feature detector variants in a speech classification task, I confirm the importance of both well-studied and lesser-known auditory features. Second, I investigate whether ASR software is a good candidate model of the human auditory system. By comparing several state-of-the-art ASR systems to the results from humans on a range of psychometric experiments, I show that these ASR systems diverge markedly from humans in at least some psychometric tests. This implies that none of these systems act as a strong proxy for human speech recognition, although some may be useful when asking more narrowly defined questions. For neuroscientists, this thesis exemplifies how machine learning can be used to generate new hypotheses about human hearing, while also highlighting the caveats of investigating systems that may work fundamentally differently from the human brain. For machine learning engineers, I point to tangible directions for improving ASR systems. 
To motivate the continued cross-fertilization between these fields, a toolbox that allows researchers to assess new ASR systems has been released.

    Speech Decomposition and Enhancement

    The goal of this study is to investigate the roles of steady-state speech sounds and transitions between these sounds in the intelligibility of speech. The motivation for this approach is that the auditory system may be particularly sensitive to time-varying frequency edges, which in speech are produced primarily by transitions between vowels and consonants and within vowels. The possibility that selectively amplifying these edges may enhance speech intelligibility is examined. Computer algorithms to decompose speech into two different components were developed. One component, defined as the tonal component, was intended to predominantly include formant activity. The second component, defined as the non-tonal component, was intended to predominantly include transitions between and within formants. The approach to the decomposition is to use a set of time-varying filters whose center frequencies and bandwidths are controlled to identify the strongest formant components in speech. Each center frequency and bandwidth is estimated based on FM and AM information of each formant component. The tonal component is composed of the sum of the filter outputs. The non-tonal component is defined as the difference between the original speech signal and the tonal component. The relative energy and intelligibility of the tonal and non-tonal components were compared to the original speech. Psychoacoustic growth functions were used to assess the intelligibility. Most of the speech energy was in the tonal component, but this component had significantly lower maximum word recognition than either the original speech or the non-tonal component. The non-tonal component averaged 2% of the original speech energy, but had almost the same maximum word recognition as the original speech. The non-tonal component was amplified and recombined with the original speech to generate enhanced speech. The energy of the enhanced speech was adjusted to be equal to that of the original speech, and the intelligibility of the enhanced speech was compared to the original speech in background noise. The enhanced speech showed higher recognition scores at lower SNRs, and the differences were significant. The original and enhanced speech showed similar recognition scores at higher SNRs. These results suggest that amplification of transient information can enhance speech in noise and that this enhancement method is more effective under severe noise conditions.
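    The recombination steps described in this abstract (tonal component as the sum of the filter outputs, non-tonal component as the residual, amplified residual added back, energy matched to the original) can be sketched in a few lines. This is only an illustration of that bookkeeping under stated assumptions, not the study's implementation. The time-varying formant-tracking filters are not implemented here, so `filter_outputs` is a placeholder list of already-filtered signals, and the gain of 4.0 and all function names are hypothetical.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def split_components(original, filter_outputs):
    """Return (tonal, non_tonal).

    tonal     = sample-wise sum of the filter outputs
    non_tonal = original - tonal
    """
    tonal = [sum(samples) for samples in zip(*filter_outputs)]
    non_tonal = [o - t for o, t in zip(original, tonal)]
    return tonal, non_tonal


def enhance(original, non_tonal, gain):
    """Add the amplified non-tonal component to the original signal, then
    rescale so the result has the same energy as the original."""
    boosted = [o + gain * n for o, n in zip(original, non_tonal)]
    boosted_energy = energy(boosted)
    if boosted_energy == 0.0:
        raise ValueError("boosted signal has zero energy; cannot normalise")
    scale = math.sqrt(energy(original) / boosted_energy)
    return [scale * b for b in boosted]


# Toy numbers standing in for a speech frame and two formant-filter outputs.
original = [1.0, -2.0, 3.0, -1.0, 0.5, 2.0]
filter_outputs = [
    [0.5, -1.0, 1.5, -0.5, 0.25, 1.0],
    [0.25, -0.5, 1.0, -0.25, 0.0, 0.5],
]

tonal, non_tonal = split_components(original, filter_outputs)
enhanced = enhance(original, non_tonal, gain=4.0)  # gain value is an assumption
```

    Because the output is rescaled to the original energy, any intelligibility difference between `original` and `enhanced` in this scheme comes from the changed balance between components rather than from overall level.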

    Characterization of Arabic sibilant consonants

    The aim of this study is to develop an automatic speech recognition system that classifies sibilant Arabic consonants into two groups: alveolar consonants and post-alveolar consonants. The proposed method uses the energy distribution in a consonant-vowel syllable as an acoustic cue. Applying this method to our own corpus reveals that the amount of energy in a vocal signal is a very important parameter in the characterization of Arabic sibilant consonants. For consonant classification, the accuracy achieved in identifying consonants as alveolar or post-alveolar is 100%. For post-alveolar consonants, the rate is 96%, and for alveolar consonants, the rate is over 94%. Our classification technique outperformed existing algorithms based on support vector machines and neural networks in terms of classification rate.

    Techniques for the enhancement of linear predictive speech coding in adverse conditions


    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) arose in 1999 from the strongly felt need to share know-how, objectives and results between areas that until then seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial issues have grown and spread to other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years, always in Firenze, Italy. This edition celebrates twenty years of uninterrupted and successful research in the field of voice analysis.