
    Adverse conditions improve distinguishability of auditory, motor and perceptuo-motor theories of speech perception: an exploratory Bayesian modeling study

    Get PDF
    Special Issue: Speech Recognition in Adverse Conditions. In this paper, we put forward a computational framework for the comparison of motor, auditory, and perceptuo-motor theories of speech communication. We first recall the basic arguments of these three sets of theories, as applied to speech perception or to speech production. We then present a unifying Bayesian model able to express each theory in a probabilistic way. Focusing on speech perception, we demonstrate that under two hypotheses regarding communication noise and inter-speaker variability, which together provide perfect conditions for speech communication, motor and auditory theories are indistinguishable. We then degrade each hypothesis in turn to study the distinguishability of the different theories in "adverse" conditions. We first present simulations of a simplified implementation of the model with mono-dimensional sensory and motor variables, and then consider a simulation of the human vocal tract providing more realistic auditory and articulatory variables. The simulation results allow us to emphasise the respective roles of motor and auditory knowledge in various adverse conditions of speech perception, and to suggest guidelines for future studies aiming to assess the role of motor knowledge in speech perception.
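    The contrast between the two decoding routes can be made concrete with a toy model. The sketch below is not the paper's implementation; the discrete distributions, the Gaussian articulatory-to-acoustic map, and the way channel noise enters are all illustrative assumptions. It builds a phoneme-to-motor-to-sensory chain and decodes either with a sensory model learned through the channel ("auditory" route) or by inverting the listener's own, nearly noise-free production knowledge ("motor" route): the two posteriors coincide when the channel is clean and diverge as channel noise grows.

```python
import numpy as np

# Toy dimensions: a few phoneme categories, discretised motor and sensory axes
# (in the spirit of the paper's "mono-dimensional" simplified implementation).
n_phonemes, n_motor, n_sensory = 3, 15, 15

# Production knowledge P(M | O): each phoneme prefers a region of motor space.
centers = np.array([3, 7, 11])
m_axis = np.arange(n_motor)
p_m_given_o = np.exp(-0.5 * (m_axis[None, :] - centers[:, None]) ** 2)
p_m_given_o /= p_m_given_o.sum(axis=1, keepdims=True)

p_o = np.full(n_phonemes, 1.0 / n_phonemes)         # flat phoneme prior P(O)

def sensory_map(noise_std):
    """Articulatory-to-acoustic mapping P(S | M) with Gaussian channel noise."""
    s_axis = np.arange(n_sensory)
    p = np.exp(-0.5 * ((s_axis[None, :] - m_axis[:, None]) / noise_std) ** 2)
    return p / p.sum(axis=1, keepdims=True)

def posterior(p_s_given_m_model):
    """P(O | S) for a decoder whose internal sensory model is the given P(S | M)."""
    p_s_given_o = p_m_given_o @ p_s_given_m_model    # marginalise over M
    post = p_s_given_o * p_o[:, None]
    return post / post.sum(axis=0, keepdims=True)    # one column per sensory value S

# "Motor" listener: inverts its own, essentially noise-free production knowledge.
motor_model = sensory_map(noise_std=0.3)
for channel_noise in (0.3, 2.0):                     # perfect vs. adverse channel
    # "Auditory" listener: its sensory distributions were learned through the channel.
    auditory_model = sensory_map(noise_std=channel_noise)
    aud, mot = posterior(auditory_model), posterior(motor_model)
    print(f"channel noise {channel_noise}: "
          f"max |auditory - motor| posterior gap = {np.abs(aud - mot).max():.3f}")
```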

    The role of auditory information in audiovisual speech integration

    Get PDF
    Communication between two people involves collecting and integrating information from different senses. An example in speech perception is when a listener relies on auditory input to hear spoken words and on visual input to read lips, making it easier to communicate in a noisy environment. Listeners are able to use visual cues to fill in missing auditory information when the auditory signal has been compromised in some way (e.g., hearing loss or a noisy environment). Interestingly, listeners integrate auditory and visual information during the perception of speech even when one of those senses proves to be more than sufficient. Grant and Seitz (1998) found a great deal of variability in listeners' performance on auditory-visual speech perception tasks. These discoveries have posed a number of questions about why and how multi-sensory integration occurs. Research in "optimal integration" suggests that listener, talker, or acoustic characteristics may influence auditory-visual integration. The present study focused on characteristics of the auditory signal that might promote auditory-visual integration, specifically whether removal of information from the signal would produce greater use of the visual input and thus greater integration. CVC syllables from 5 talkers were degraded by selectively removing spectral fine structure while maintaining the temporal envelope characteristics of the waveform. The resulting stimuli were output through 2-, 4-, 6-, and 8-channel bandpass filters. Results for 10 normal-hearing listeners showed auditory-visual integration in all conditions, but the amount of integration did not vary across the different auditory signal manipulations. In addition, substantial across-talker differences were observed in auditory intelligibility in the 2-channel condition. Interestingly, the degree of audiovisual integration produced by different talkers was unrelated to auditory intelligibility. Implications of these results for our understanding of the processes underlying auditory-visual integration are discussed. Advisor: Janet M. Weisenberger. Arts and Sciences Collegiate Undergraduate Scholarship; Social and Behavioral Sciences Undergraduate Research Scholarship.
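    The degradation described here is essentially channel vocoding: keep each band's temporal envelope, discard its spectral fine structure. The sketch below is a generic noise vocoder, not the study's exact signal processing; the filter edges, envelope cutoff, and noise carriers are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_channels, f_lo=100.0, f_hi=7000.0, env_cut=50.0):
    """Return an n_channels noise-vocoded version of `signal` (1-D float array)."""
    rng = np.random.default_rng(0)
    # Log-spaced band edges between f_lo and f_hi.
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    env_sos = butter(4, env_cut, btype="low", fs=fs, output="sos")
    out = np.zeros_like(signal)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)
        # Temporal envelope of this band (Hilbert magnitude, then low-pass).
        env = np.clip(sosfiltfilt(env_sos, np.abs(hilbert(band))), 0.0, None)
        # Replace the fine structure with band-limited noise carrying that envelope.
        carrier = sosfiltfilt(band_sos, rng.standard_normal(len(signal)))
        out += env * carrier
    # Match overall RMS to the input.
    return out * (np.sqrt(np.mean(signal**2)) / (np.sqrt(np.mean(out**2)) + 1e-12))

# e.g. vocoded = noise_vocode(x, fs=22050, n_channels=4)   # 2, 4, 6 or 8 channels
```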

    Multi-Level Audio-Visual Interactions in Speech and Language Perception

    Get PDF
    That we perceive our environment as a unified scene rather than individual streams of auditory, visual, and other sensory information has recently provided motivation to move past the long-held tradition of studying these systems separately. Although they are each unique in their transduction organs, neural pathways, and primary cortical areas, the senses are ultimately merged in a meaningful way which allows us to navigate the multisensory world. Investigating how the senses are merged has become an increasingly wide field of research in recent decades, with the introduction and increased availability of neuroimaging techniques. Areas of study range from multisensory object perception to cross-modal attention, multisensory interactions, and integration. This thesis focuses on audio-visual speech perception, with a special focus on facilitatory effects of visual information on auditory processing. When visual information is concordant with auditory information, it provides an advantage that is measurable in behavioral response times and evoked auditory fields (Chapter 3) and in increased entrainment to multisensory periodic stimuli reflected by steady-state responses (Chapter 4). When the audio-visual information is incongruent, the two streams can often, but not always, combine to form a third, physically absent percept (known as the McGurk effect). This effect is investigated (Chapter 5) using real-word stimuli. McGurk percepts were not robustly elicited for a majority of stimulus types, but the patterns of responses suggest that the physical and lexical properties of the auditory and visual stimuli may affect the likelihood of obtaining the illusion. Together, these experiments add to the growing body of knowledge suggesting that audio-visual interactions occur at multiple stages of processing.

    Learning weakly supervised multimodal phoneme embeddings

    Full text link
    Recent works have explored deep architectures for learning multimodal speech representations (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lip movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists of applying a Siamese network to the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at training time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonological features which are visually discernible (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.
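    A minimal sketch of the mono-task setting may help: a Siamese encoder applied to the concatenation of audio and lip-feature vectors, trained with lexical same/different side information. The architecture, feature dimensions, and contrastive loss below are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, VISUAL_DIM, EMBED_DIM = 40, 20, 100   # hypothetical feature sizes

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + VISUAL_DIM, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )

    def forward(self, audio, visual):
        # Mono-task input: concatenate the two modalities before embedding.
        return self.net(torch.cat([audio, visual], dim=-1))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Pull 'same word' pairs together, push 'different word' pairs apart."""
    dist = F.pairwise_distance(emb_a, emb_b)
    return torch.mean(same * dist.pow(2) +
                      (1 - same) * F.relu(margin - dist).pow(2))

# One hypothetical training step on a batch of frame pairs.
encoder = SiameseEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
a1, v1 = torch.randn(32, AUDIO_DIM), torch.randn(32, VISUAL_DIM)
a2, v2 = torch.randn(32, AUDIO_DIM), torch.randn(32, VISUAL_DIM)
same = torch.randint(0, 2, (32,)).float()        # 1 = same word, 0 = different
loss = contrastive_loss(encoder(a1, v1), encoder(a2, v2), same)
opt.zero_grad(); loss.backward(); opt.step()
```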

    Examining the McGurk illusion using high-field 7 Tesla functional MRI

    Get PDF
    In natural communication, speech perception is profoundly influenced by observable mouth movements. The additional visual information can greatly facilitate intelligibility, but incongruent visual information may also lead to novel percepts that match neither the auditory nor the visual information, as evidenced by the McGurk effect. Recent models of audiovisual (AV) speech perception accentuate the role of speech motor areas and of the integrative brain sites in the vicinity of the superior temporal sulcus (STS). In this event-related 7 Tesla fMRI study we used three naturally spoken syllable pairs with matching AV information and one syllable pair designed to elicit the McGurk illusion. The data analysis focused on brain sites involved in the processing and fusing of AV speech and on sites engaged in the analysis of auditory and visual differences within AV-presented speech. Successful fusion of AV speech is related to activity within the STS of both hemispheres. Our data support and extend the audio-visual-motor model of speech perception by dissociating areas involved in perceptual fusion from areas more generally related to the processing of AV incongruence.

    Audio-visual speech processing system for Polish applicable to human-computer interaction

    Get PDF
    This paper describes an audio-visual speech recognition system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems, with three main areas: audio feature extraction, visual feature extraction and, subsequently, audiovisual speech integration. We present MFCC features for the audio stream with a standard HMM modeling technique, then describe appearance- and shape-based visual features. Subsequently we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult audio environments. With an Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), we improve system accuracy, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio drops to 0 dB.
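    As a rough illustration of the feature-concatenation (early integration) option compared here against model fusion, the sketch below brings per-frame MFCC vectors and AAM-derived visual vectors to a common frame rate and stacks them into one observation vector for the recogniser; the frame rates and dimensions are assumptions, not the system's actual configuration.

```python
import numpy as np

def concatenate_features(mfcc, mfcc_rate, visual, visual_rate):
    """mfcc: (T_a, D_a) at mfcc_rate Hz; visual: (T_v, D_v) at visual_rate Hz."""
    t_audio = np.arange(len(mfcc)) / mfcc_rate
    t_video = np.arange(len(visual)) / visual_rate
    # Upsample the (slower) visual stream to the audio frame times, per dimension.
    visual_up = np.column_stack([
        np.interp(t_audio, t_video, visual[:, d]) for d in range(visual.shape[1])
    ])
    return np.hstack([mfcc, visual_up])           # (T_a, D_a + D_v) observations

# e.g. 13 MFCCs at 100 frames/s combined with 10 AAM parameters at 25 frames/s:
obs = concatenate_features(np.random.randn(300, 13), 100.0,
                           np.random.randn(75, 10), 25.0)
print(obs.shape)   # (300, 23)
```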