
    Multi-Level Audio-Visual Interactions in Speech and Language Perception

    That we perceive our environment as a unified scene rather than as individual streams of auditory, visual, and other sensory information has recently provided motivation to move past the long-held tradition of studying these systems separately. Although the senses each have unique transduction organs, neural pathways, and primary cortical areas, they are ultimately merged in a meaningful way that allows us to navigate the multisensory world. Investigating how the senses are merged has become an increasingly broad field of research in recent decades, with the introduction and increased availability of neuroimaging techniques. Areas of study range from multisensory object perception to cross-modal attention, multisensory interactions, and integration. This thesis focuses on audio-visual speech perception, with particular attention to facilitatory effects of visual information on auditory processing. When visual information is concordant with auditory information, it provides an advantage that is measurable in behavioral response times and evoked auditory fields (Chapter 3) and in increased entrainment to multisensory periodic stimuli, reflected by steady-state responses (Chapter 4). When the audio-visual information is incongruent, the two streams can often, but not always, combine to form a third, physically absent percept (known as the McGurk effect). This effect is investigated (Chapter 5) using real-word stimuli. McGurk percepts were not robustly elicited for a majority of stimulus types, but patterns of responses suggest that the physical and lexical properties of the auditory and visual stimuli may affect the likelihood of obtaining the illusion. Together, these experiments add to the growing body of knowledge suggesting that audio-visual interactions occur at multiple stages of processing.
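
    As a rough illustration of the steady-state measure mentioned above, the Python sketch below (assuming NumPy) computes the spectral amplitude of a recorded signal at a periodic stimulation frequency; the sampling rate, tag frequency, and synthetic data are placeholders rather than values from the thesis.

        import numpy as np

        # Hypothetical parameters: a 1.5 Hz stimulation rate and a 1000 Hz
        # recording, both stand-ins rather than values from the thesis.
        fs = 1000.0          # sampling rate (Hz)
        tag_freq = 1.5       # assumed periodic stimulation frequency (Hz)
        duration = 20.0      # seconds of synthetic sensor data

        t = np.arange(0, duration, 1.0 / fs)
        # Synthetic trace: an entrained component at the tag frequency plus noise.
        signal = 0.5 * np.sin(2 * np.pi * tag_freq * t) + np.random.randn(t.size)

        # Amplitude spectrum; entrainment shows up as a peak at the tag frequency.
        spectrum = np.abs(np.fft.rfft(signal)) / t.size
        freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
        ssr_amplitude = spectrum[np.argmin(np.abs(freqs - tag_freq))]
        print(f"Amplitude at {tag_freq} Hz: {ssr_amplitude:.3f}")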

    Visual and Auditory Characteristics of Talkers in Multimodal Integration

    3rd place at the 2009 Denman Undergraduate Research Forum.
    In perceiving speech, there are three elements of the interaction that can affect how the signal is interpreted: the talker, the signal (both visual and auditory), and the listener. Each of these elements inherently contains substantial variability, which will, in turn, affect the audio-visual speech percept. Since the work of McGurk in the 1970s, which showed that speech perception is a multimodal process that incorporates both auditory and visual cues, there have been numerous investigations of the impact of these elements on multimodal integration of speech. The impact of talker characteristics on audio-visual integration has received the least attention to date. A recent study by Andrews (2007) provided an initial look at talker characteristics. In her study, audiovisual integration produced by 14 talkers was examined, and substantial differences across talkers were found in both auditory and audiovisual intelligibility. However, the talker characteristics that promoted audiovisual integration were not specifically identified. The present study began to address this question by analyzing audiovisual integration performance using two types of reduced-information speech syllables produced by five talkers. In one reduction, fine-structure information was replaced with band-limited noise but the temporal envelope was retained, and in the other, the syllables were reduced to a set of three sine waves that followed the formant structure of the syllable (sine-wave speech). Syllables were presented under audio-visual conditions to 10 listeners. Results indicated substantial across-talker differences, with the pattern of talker differences not affected by the type of reduction of the auditory signal. Analysis of confusion matrices provided directions for further analysis of specific auditory and visual speech tokens.
    College of the Arts and Sciences Undergraduate Scholarship; Social and Behavioral Sciences Undergraduate Research Scholarship. No embargo.
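
    The sine-wave reduction described above can be illustrated with a short Python sketch. It assumes the three formant tracks have already been estimated elsewhere (e.g., by LPC analysis) and shows only the resynthesis step, with a made-up frame rate and example values rather than the study's actual stimuli.

        import numpy as np

        def sine_wave_speech(formant_tracks, fs=16000, hop=0.01):
            """Resynthesize sine-wave speech from frame-wise formant frequencies.

            formant_tracks: array of shape (n_frames, 3) with F1-F3 in Hz per
            analysis frame (hop seconds apart); how the tracks are estimated
            is outside this sketch.
            """
            n_frames = formant_tracks.shape[0]
            n_samples = int(n_frames * hop * fs)
            t_frames = np.arange(n_frames) * hop
            t = np.arange(n_samples) / fs
            out = np.zeros(n_samples)
            for k in range(formant_tracks.shape[1]):
                # Interpolate each formant track to audio rate, then integrate
                # frequency to phase so the sinusoid follows the moving formant.
                freq = np.interp(t, t_frames, formant_tracks[:, k])
                phase = 2 * np.pi * np.cumsum(freq) / fs
                out += np.sin(phase)
            return out / formant_tracks.shape[1]

        # Hypothetical usage: a flat 500/1500/2500 Hz "syllable" half a second long.
        tracks = np.tile([500.0, 1500.0, 2500.0], (50, 1))
        audio = sine_wave_speech(tracks)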

    The role of auditory information in audiovisual speech integration

    Communication between two people involves collecting and integrating information from different senses. An example in speech perception is when a listener relies on auditory input to hear spoken words and on visual input to read lips, making it easier to communicate in a noisy environment. Listeners are able to use visual cues to fill in missing auditory information when the auditory signal has been compromised in some way (e.g., hearing loss or a noisy environment). Interestingly, listeners integrate auditory and visual information during the perception of speech even when one of those senses proves to be more than sufficient. Grant and Seitz (1998) found a great deal of variability in listeners' performance on auditory-visual speech perception tasks. These discoveries have posed a number of questions about why and how multi-sensory integration occurs. Research in “optimal integration” suggests the possibility that listener, talker, or acoustic characteristics may influence auditory-visual integration. The present study focused on characteristics of the auditory signal that might promote auditory-visual integration, specifically whether removing information from the signal would produce greater use of the visual input and thus greater integration. CVC syllables from 5 talkers were degraded by selectively removing spectral fine structure while maintaining the temporal envelope characteristics of the waveform. The resulting stimuli were output through 2-, 4-, 6-, and 8-channel bandpass filters. Results for 10 normal-hearing listeners showed auditory-visual integration for all conditions, but the amount of integration did not vary across the different auditory signal manipulations. In addition, substantial across-talker differences were observed in auditory intelligibility in the 2-channel condition. Interestingly, the degree of audiovisual integration produced by different talkers was unrelated to auditory intelligibility. Implications of these results for our understanding of the processes underlying auditory-visual integration are discussed.
    Advisor: Janet M. Weisenberger. Arts and Sciences Collegiate Undergraduate Scholarship; Social and Behavioral Sciences Undergraduate Research Scholarship.
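
    The envelope-preserving degradation described above is essentially channel vocoding. The following Python sketch (assuming NumPy and SciPy) shows one conventional way to implement it; the band edges, filter order, and synthetic input are illustrative assumptions, not the processing actually used in the study.

        import numpy as np
        from scipy.signal import butter, sosfiltfilt, hilbert

        def noise_vocode(x, fs, n_channels=4, lo=100.0, hi=6000.0):
            """Replace spectral fine structure with band-limited noise while
            keeping each channel's temporal envelope."""
            edges = np.geomspace(lo, hi, n_channels + 1)   # log-spaced band edges
            noise = np.random.randn(len(x))
            out = np.zeros(len(x))
            for low, high in zip(edges[:-1], edges[1:]):
                sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
                band = sosfiltfilt(sos, x)
                envelope = np.abs(hilbert(band))           # temporal envelope
                carrier = sosfiltfilt(sos, noise)          # band-limited noise carrier
                out += envelope * carrier
            return out / np.max(np.abs(out))

        # Hypothetical usage with a synthetic 1 s signal at 16 kHz; the study's
        # conditions would correspond to n_channels of 2, 4, 6, and 8.
        fs = 16000
        x = np.random.randn(fs)
        vocoded = noise_vocode(x, fs, n_channels=4)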

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
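
    The correlation and 2–7 Hz modulation findings above can be made concrete with a minimal Python sketch: band-limit a mouth-area time series and an acoustic envelope to 2–7 Hz and correlate them. The sampling rate and the synthetic signals are placeholders; the study's measurements would come from video-tracked mouth area and the speech amplitude envelope resampled to a common rate.

        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        fs = 100.0                      # common sampling rate (Hz), a placeholder
        t = np.arange(0, 10, 1 / fs)
        # Synthetic stand-ins for mouth-opening area and acoustic envelope,
        # each dominated by a ~4 Hz (syllabic-rate) modulation plus noise.
        mouth_area = 1 + np.sin(2 * np.pi * 4 * t) + 0.3 * np.random.randn(t.size)
        audio_env = 1 + np.sin(2 * np.pi * 4 * t + 0.5) + 0.3 * np.random.randn(t.size)

        # Restrict both signals to the 2-7 Hz range discussed above, then correlate.
        sos = butter(4, [2.0, 7.0], btype="bandpass", fs=fs, output="sos")
        area_band = sosfiltfilt(sos, mouth_area)
        env_band = sosfiltfilt(sos, audio_env)
        r = np.corrcoef(area_band, env_band)[0, 1]
        print(f"Mouth-area / envelope correlation (2-7 Hz): {r:.2f}")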

    Interactions between Auditory and Visual Semantic Stimulus Classes: Evidence for Common Processing Networks for Speech and Body Actions

    Incongruencies between auditory and visual signals negatively affect human performance and cause selective activation in neuroimaging studies; therefore, they are increasingly used to probe audiovisual integration mechanisms. An open question is whether the increased BOLD response reflects computational demands in integrating mismatching low-level signals or reflects simultaneous unimodal conceptual representations of the competing signals. To address this question, we explore the effect of semantic congruency within and across three signal categories (speech, body actions, and unfamiliar patterns) for signals with matched low-level statistics. In a localizer experiment, unimodal (auditory and visual) and bimodal stimuli were used to identify ROIs. All three semantic categories cause overlapping activation patterns. We find no evidence for areas that show greater BOLD response to bimodal stimuli than predicted by the sum of the two unimodal responses. Conjunction analysis of the unimodal responses in each category identifies a network including posterior temporal, inferior frontal, and premotor areas. Semantic congruency effects are measured in the main experiment. We find that incongruent combinations of two meaningful stimuli (speech and body actions), but not combinations of meaningful with meaningless stimuli, lead to increased BOLD response in the posterior STS (pSTS) bilaterally, the left SMA, the inferior frontal gyrus, the inferior parietal lobule, and the anterior insula. These interactions are not seen in premotor areas. Our findings are consistent with the hypothesis that pSTS and frontal areas form a recognition network that combines sensory categorical representations (in pSTS) with action hypothesis generation in inferior frontal gyrus/premotor areas. We argue that the same neural networks process speech and body actions.
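
    The additivity test mentioned above (bimodal response versus the sum of the unimodal responses) can be sketched in Python as follows; the per-subject ROI betas are made up for illustration, and this is not the paper's analysis pipeline.

        import numpy as np
        from scipy.stats import ttest_rel

        # Hypothetical per-subject ROI betas (arbitrary units) for auditory-only,
        # visual-only, and audiovisual conditions.
        rng = np.random.default_rng(0)
        beta_a = rng.normal(1.0, 0.3, size=15)
        beta_v = rng.normal(0.8, 0.3, size=15)
        beta_av = rng.normal(1.7, 0.3, size=15)

        # Superadditivity criterion: is the bimodal response reliably larger
        # than the sum of the two unimodal responses (AV > A + V)?
        t_stat, p_val = ttest_rel(beta_av, beta_a + beta_v, alternative="greater")
        print(f"t = {t_stat:.2f}, one-sided p = {p_val:.3f}")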

    Development of audiovisual comprehension skills in prelingually deaf children with cochlear implants

    Objective: The present study investigated the development of audiovisual comprehension skills in prelingually deaf children who received cochlear implants. Design: We analyzed results obtained with the Common Phrases (Robbins et al., 1995) test of sentence comprehension from 80 prelingually deaf children with cochlear implants who were enrolled in a longitudinal study, from pre-implantation to 5 years after implantation. Results: The results revealed that prelingually deaf children with cochlear implants performed better under audiovisual (AV) presentation compared with auditory-alone (A-alone) or visual-alone (V-alone) conditions. AV sentence comprehension skills were found to be strongly correlated with several clinical outcome measures of speech perception, speech intelligibility, and language. Finally, pre-implantation V-alone performance on the Common Phrases test was strongly correlated with 3-year post-implantation performance on clinical outcome measures of speech perception, speech intelligibility, and language skills. Conclusions: The results suggest that lipreading skills and AV speech perception reflect a common source of variance associated with the development of phonological processing skills that is shared among a wide range of speech and language outcome measures.

    About Face: Seeing the Talker Improves Spoken Word Recognition But Increases Listening Effort

    It is widely accepted that seeing a talker improves a listener’s ability to understand what the talker is saying in background noise (e.g., Erber, 1969; Sumby & Pollack, 1954). The literature is mixed, however, regarding the influence of the visual modality on the listening effort required to recognize speech (e.g., Fraser, Gagné, Alepins, & Dubois, 2010; Sommers & Phelps, 2016). Here, we present data showing that even when the visual modality robustly benefits recognition, processing audiovisual speech can still result in greater cognitive load than processing speech in the auditory modality alone. Using a dual-task paradigm, we show that the costs associated with audiovisual speech processing are more pronounced in easy listening conditions, in which speech can be recognized at high rates in the auditory modality alone; indeed, effort did not differ between audiovisual and audio-only conditions when the background noise was presented at a more difficult level. Further, we show that though these effects replicate with different stimuli and participants, they do not emerge when effort is assessed with a recall paradigm rather than a dual-task paradigm. Together, these results suggest that the widely cited audiovisual recognition benefit may come at a cost under more favorable listening conditions, and they add to the growing body of research suggesting that various measures of effort may not be tapping into the same underlying construct (Strand et al., 2018).
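
    One common way to quantify listening effort in a dual-task paradigm is the slowdown on the secondary task relative to a single-task baseline. The Python sketch below uses made-up reaction times rather than the study's data and simply compares that dual-task cost between audio-only and audiovisual conditions.

        import numpy as np
        from scipy.stats import ttest_rel

        # Made-up secondary-task reaction times (ms) per participant: a single-task
        # baseline and the same task performed during A-only or AV speech recognition.
        rng = np.random.default_rng(1)
        rt_baseline = rng.normal(450, 40, size=20)
        rt_audio_only = rt_baseline + rng.normal(60, 25, size=20)
        rt_audiovisual = rt_baseline + rng.normal(85, 25, size=20)

        # Dual-task cost: slowdown relative to baseline; a larger cost indicates
        # greater listening effort.
        cost_a = rt_audio_only - rt_baseline
        cost_av = rt_audiovisual - rt_baseline
        t_stat, p_val = ttest_rel(cost_av, cost_a)
        print(f"Mean cost A-only: {cost_a.mean():.0f} ms, AV: {cost_av.mean():.0f} ms")
        print(f"Paired t-test on costs: t = {t_stat:.2f}, p = {p_val:.3f}")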