52 research outputs found

    Physiological and psychoacoustical correlates of perceiving natural and modified speech


    Probabilistic models of contextual effects in Auditory Pitch Perception

    Perception was recognised by Helmholtz as an inferential process whereby learned expectations about the environment combine with sensory experience to give rise to percepts. Expectations are flexible, built from past experiences over multiple time-scales. What is the nature of perceptual expectations? How are they learned? How do they affect perception? These are the questions I address in this thesis. I focus on two important yet simple perceptual attributes of sounds whose perception is widely regarded as effortless and automatic: pitch and frequency.

    In a first study, I propose a definition of pitch as the solution to a computational goal. Pitch is a fundamental and salient perceptual attribute of many behaviourally important sounds, including speech and music. The effortless nature of its perception has led to the search for a direct physical correlate of pitch and for mechanisms to extract pitch from peripheral neural responses. I propose instead that pitch is the outcome of a probabilistic inference of an underlying periodicity in sounds, given a learned statistical prior over naturally pitch-evoking sounds, explaining a wide range of psychophysical results with a single model.

    In two further psychophysical studies I examine how, and at what time-scales, recent sensory history affects the perception of frequency shifts and pitch shifts. (1) When subjects are presented with ambiguous pitch shifts (using octave-ambiguous Shepard tone pairs), I show that sensory history is used to resolve the ambiguity in a way that reflects expectations of spectro-temporal continuity of auditory scenes. (2) In delayed two-tone frequency discrimination tasks, I explore the contraction bias: when asked to report which of two tones separated by a brief silence is higher, subjects behave as though they hear the earlier tone 'contracted' in frequency towards a combination of recently presented stimulus frequencies and the mean of the overall distribution of tones used in the experiment. I propose that expectations - the statistical learning of the sampled stimulus distribution - are built online and combined with sensory evidence in a statistically optimal fashion.

    The models derived in this thesis embody the concept of perception as unconscious inference. The results support the view that even apparently primitive acoustic percepts may derive from subtle statistical inference, suggesting that such inferential processes operate at all levels across our sensory systems.
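    The contraction bias described above admits a compact Bayesian reading: if the memory of the first tone is a noisy Gaussian observation and the prior is a Gaussian centred on recently heard frequencies, the optimal estimate is a precision-weighted average that pulls the remembered tone towards the prior mean. The sketch below illustrates only that computation; the parameter values and the running-mean prior are illustrative assumptions, not the thesis's fitted model.

```python
import numpy as np

def posterior_frequency(observed_f, sigma_sensory, prior_mean, sigma_prior):
    """Combine a noisy frequency observation with a Gaussian prior.

    Standard Gaussian product: the posterior mean is a precision-weighted
    average of the observation and the prior mean, so the remembered tone
    is 'contracted' towards recently heard frequencies.
    """
    w = sigma_prior**2 / (sigma_prior**2 + sigma_sensory**2)  # weight on the observation
    return w * observed_f + (1.0 - w) * prior_mean

# Hypothetical example: the prior is the running mean of stimuli heard so far.
rng = np.random.default_rng(0)
stimuli = rng.uniform(1000.0, 2000.0, size=50)   # Hz, sampled stimulus distribution
prior_mean = stimuli.mean()                       # learned online in the real model
f1_true = 1200.0                                  # first tone of a trial
f1_remembered = posterior_frequency(f1_true, sigma_sensory=80.0,
                                    prior_mean=prior_mean, sigma_prior=200.0)
print(f1_remembered)  # lies between 1200 Hz and the prior mean: contraction bias
```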

    EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals

    The general objective of this work is the design, implementation, improvement and evaluation of a system that uses surface electromyographic (EMG) signals to directly synthesize an audible speech output: EMG-to-speech.
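    As a rough illustration of what a direct EMG-to-speech mapping can look like, the sketch below regresses per-frame EMG feature vectors onto mel-spectrogram frames that a vocoder could then render as audio. The network, feature dimensions and shapes are hypothetical stand-ins, not the architecture evaluated in this work.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 8 EMG channels, 5 time-domain features per channel per frame,
# 80-bin mel-spectrogram frames as the acoustic target for a neural vocoder.
N_EMG_FEATURES = 8 * 5
N_MEL_BINS = 80

class EmgToSpectrogram(nn.Module):
    """Frame-wise regression from EMG feature vectors to mel-spectrogram frames."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EMG_FEATURES, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, N_MEL_BINS),
        )

    def forward(self, emg_frames):        # (batch, time, N_EMG_FEATURES)
        return self.net(emg_frames)       # (batch, time, N_MEL_BINS)

model = EmgToSpectrogram()
emg = torch.randn(2, 100, N_EMG_FEATURES)   # two dummy utterances, 100 frames each
mel = model(emg)                             # predicted frames; a vocoder would render audio
loss = nn.functional.mse_loss(mel, torch.randn_like(mel))  # placeholder training target
```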

    The sound of rolling objects: perception of size and speed


    The Effects of the Interaural Parameters of the Background Noise on Dichotic Pitch Detection


    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks whose purpose is to extract one or several target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used in speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning.

    In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
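    One recurring design in this literature combines the elements listed above (acoustic features, visual features, a fusion stage, and a mask-based training target). The sketch below is a toy version of that pattern, assuming magnitude-spectrogram inputs and pre-extracted visual embeddings already synchronised to the acoustic frame rate; layer sizes and names are illustrative, not taken from any specific surveyed system.

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Toy audio-visual fusion: concatenate per-frame audio and visual embeddings
    and predict a time-frequency mask for the target speaker."""
    def __init__(self, n_freq=257, n_visual=128, hidden=256):
        super().__init__()
        self.audio_enc = nn.Linear(n_freq, hidden)     # stand-in for an audio encoder
        self.visual_enc = nn.Linear(n_visual, hidden)  # stand-in for a lip/face encoder
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec_mag, visual_feats):
        # noisy_spec_mag: (batch, time, n_freq); visual_feats: (batch, time, n_visual)
        a = torch.relu(self.audio_enc(noisy_spec_mag))
        v = torch.relu(self.visual_enc(visual_feats))
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        m = self.mask(fused)                  # mask values in [0, 1]
        return m * noisy_spec_mag             # enhanced magnitude estimate

model = AVMaskEstimator()
enhanced = model(torch.rand(1, 50, 257), torch.rand(1, 50, 128))
```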

    Sound Object Recognition

    Humans are constantly exposed to a variety of acoustic stimuli, ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition.

    We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems in noise robustness at both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with its ability to analyze temporal modulations, was used to capture musical sound objects. Using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and to obtain good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings.

    Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module that is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We show that such an adaptation mechanism substantially improves the performance of the system at detecting the target source in the presence of various distracting background sources.
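    The multi-resolution idea can be approximated, very loosely, by filtering a spectrogram with a bank of two-dimensional Gabor kernels tuned to different temporal rates and spectral scales and summarising each filter's output energy. The sketch below shows only that approximation; it is not the cortical model used in the thesis, and the rate and scale values are arbitrary placeholders.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(rate, scale, size=31):
    """2-D Gabor kernel: 'rate' modulates along time, 'scale' along frequency."""
    t = np.arange(size) - size // 2
    f = t[:, None]                                    # frequency axis (rows)
    envelope = np.exp(-(f**2 + t**2) / (2.0 * (size / 6.0) ** 2))
    carrier = np.cos(2 * np.pi * (rate * t + scale * f))
    return envelope * carrier

def modulation_features(log_spectrogram, rates=(0.02, 0.05, 0.1), scales=(0.02, 0.05, 0.1)):
    """Multi-resolution summary: filter the spectrogram with a bank of Gabor
    kernels and keep the mean energy of each filter output."""
    feats = []
    for r in rates:
        for s in scales:
            out = fftconvolve(log_spectrogram, gabor_kernel(r, s), mode="same")
            feats.append(np.mean(out ** 2))
    return np.array(feats)                            # one energy per (rate, scale) pair

# Hypothetical input: a (frequency bins x time frames) log-magnitude spectrogram.
spec = np.random.rand(64, 200)
print(modulation_features(spec).shape)                # (9,)
```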