13,579 research outputs found

    Increased sub-clinical levels of autistic traits are associated with reduced multisensory integration of audiovisual speech

    Recent studies suggest that sub-clinical levels of autistic symptoms may be related to reduced processing of artificial audiovisual stimuli. It is unclear whether these findings extend to more natural stimuli such as audiovisual speech. The current study examined the relationship between autistic traits, measured by the Autism Spectrum Quotient, and audiovisual speech processing in a large non-clinical population, using a battery of experimental tasks assessing audiovisual perceptual binding, visual enhancement of speech embedded in noise, and audiovisual temporal processing. Several associations were found between autistic traits and audiovisual speech processing. Increased autistic-like imagination was related to reduced perceptual binding as measured by the McGurk illusion. Increased overall autistic symptomatology was associated with reduced visual enhancement of speech intelligibility in noise. Participants reporting increased levels of rigid and restricted behaviour were more likely to bind audiovisual speech stimuli over longer temporal intervals, while an increased tendency to focus on local aspects of sensory inputs was related to a narrower temporal binding window. These findings demonstrate that increased levels of autistic traits may be related to alterations in audiovisual speech processing, and are consistent with the notion of a spectrum of autistic traits that extends to the general population.
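    The associations reported above are correlational: a trait score (such as an AQ subscale) is related to a perceptual measure across participants. The sketch below illustrates that kind of analysis in Python; the data, variable names, and the choice of a rank-based correlation are illustrative assumptions, not the study's actual analysis.

```python
# Illustrative sketch (not the study's analysis code): relating a hypothetical
# AQ subscale score to McGurk susceptibility across participants with a
# rank-based correlation.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_participants = 100

# Hypothetical per-participant measures.
aq_imagination = rng.integers(0, 11, size=n_participants)   # AQ "imagination" subscale score
mcgurk_rate = rng.uniform(0.0, 1.0, size=n_participants)    # proportion of fused (illusory) percepts

rho, p_value = spearmanr(aq_imagination, mcgurk_rate)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```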

    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea by proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of relying on hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over an audio-only voice activity detection (VAD) baseline implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when evaluated using the sensors of a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high-definition camera and a close-talking microphone). Comment: Submitted to Speech Communication.
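    To make the bimodal idea concrete, here is a minimal PyTorch sketch of a recurrent model with separate acoustic and visual streams fused before a frame-level speech/non-speech classifier. The layer types, sizes, and feature dimensions are assumptions for illustration and are not the paper's architecture.

```python
# Minimal sketch of a bimodal recurrent model for frame-level speech activity
# detection: separate acoustic and visual recurrent streams are fused before
# classification. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BimodalRNNSAD(nn.Module):
    def __init__(self, audio_dim=40, video_dim=64, hidden=128):
        super().__init__()
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.video_rnn = nn.LSTM(video_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # speech vs. non-speech per frame

    def forward(self, audio, video):
        a, _ = self.audio_rnn(audio)            # (batch, frames, hidden)
        v, _ = self.video_rnn(video)            # (batch, frames, hidden)
        fused = torch.cat([a, v], dim=-1)       # late fusion of the two streams
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)

# Example forward pass on random tensors standing in for synchronized features.
model = BimodalRNNSAD()
audio = torch.randn(2, 100, 40)    # e.g., log-mel frames
video = torch.randn(2, 100, 64)    # e.g., mouth-region embeddings
speech_prob = model(audio, video)  # (2, 100) frame-wise speech probabilities
```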

    About Face: Seeing the Talker Improves Spoken Word Recognition But Increases Listening Effort

    It is widely accepted that seeing a talker improves a listener’s ability to understand what a talker is saying in background noise (e.g., Erber, 1969; Sumby & Pollack, 1954). The literature is mixed, however, regarding the influence of the visual modality on the listening effort required to recognize speech (e.g., Fraser, Gagné, Alepins, & Dubois, 2010; Sommers & Phelps, 2016). Here, we present data showing that even when the visual modality robustly benefits recognition, processing audiovisual speech can still result in greater cognitive load than processing speech in the auditory modality alone. Using a dual-task paradigm, we show that the costs associated with audiovisual speech processing are more pronounced in easy listening conditions, in which speech can be recognized at high rates in the auditory modality alone; indeed, effort did not differ between audiovisual and audio-only conditions when the background noise was presented at a more difficult level. Further, we show that although these effects replicate with different stimuli and participants, they do not emerge when effort is assessed with a recall paradigm rather than a dual-task paradigm. Together, these results suggest that the widely cited audiovisual recognition benefit may come at a cost under more favorable listening conditions, and add to the growing body of research suggesting that various measures of effort may not be tapping into the same underlying construct (Strand et al., 2018).
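    In a dual-task paradigm of the kind described, effort is typically indexed by how much a concurrent secondary task slows down relative to a baseline. The snippet below sketches that computation with made-up reaction times; the numbers and condition labels are purely illustrative.

```python
# Sketch of how listening effort is often operationalized in a dual-task
# paradigm: slower secondary-task responses imply greater effort devoted to
# the primary speech task. All numbers are made up for illustration.
import numpy as np

# Hypothetical secondary-task reaction times (ms) per trial.
rt_baseline    = np.array([420, 435, 410, 450])  # secondary task alone
rt_audio_only  = np.array([510, 495, 530, 505])  # while recognizing audio-only speech
rt_audiovisual = np.array([560, 545, 575, 550])  # while recognizing audiovisual speech

effort_audio = rt_audio_only.mean() - rt_baseline.mean()
effort_av    = rt_audiovisual.mean() - rt_baseline.mean()
print(f"Dual-task cost, audio-only:  {effort_audio:.0f} ms")
print(f"Dual-task cost, audiovisual: {effort_av:.0f} ms")
```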

    Impact of Noise and Working Memory on Speech Processing in Adults With and Without ADHD

    Auditory processing of speech is influenced by internal (i.e., attention, working memory) and external factors (i.e., background noise, visual information). This study examined the interplay among these factors in individuals with and without ADHD. All participants completed a listening-in-noise task, two working memory capacity tasks, and two short-term memory tasks. The listening-in-noise task had both an auditory and an audiovisual condition. Participants included 38 young adults without ADHD and 25 young adults with ADHD, all between the ages of 18 and 35. Results indicated that diagnosis, modality, and signal-to-noise ratio all had main effects on a person's ability to process speech in noise. In addition, the interaction between the diagnosis of ADHD, the presence of visual cues, and the level of noise had an effect on a person's ability to process speech in noise. In fact, young adults with ADHD benefited less from visual information during noise than young adults without ADHD, an effect influenced by working memory abilities. These speech processing results are discussed in relation to theoretical models of stochastic resonance and working memory capacity. Implications for speech-language pathologists and educators are also discussed.
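    The reported main effects and the diagnosis × modality × noise interaction suggest a factorial, repeated-measures design. The sketch below shows one common way to model such data in Python with a mixed-effects model (random intercept per participant); the data frame, column names, and SNR levels are stand-ins, not the study's actual data or analysis.

```python
# Sketch of modeling accuracy as a function of diagnosis x modality x SNR with
# a random intercept per participant. The data are randomly generated
# placeholders; only the group sizes (38 + 25) follow the abstract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(63), 12)  # 63 participants x 12 conditions each
df = pd.DataFrame({
    "subject": subjects,
    "diagnosis": np.where(subjects < 25, "ADHD", "control"),
    "modality": np.tile(np.repeat(["audio", "audiovisual"], 6), 63),
    "snr": np.tile([-12, -8, -4, 0, 4, 8], 126),          # assumed SNR levels (dB)
    "accuracy": rng.uniform(0.3, 1.0, size=subjects.size),
})

model = smf.mixedlm("accuracy ~ diagnosis * modality * snr", df, groups=df["subject"])
print(model.fit().summary())
```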

    Deficits in audiovisual speech perception in normal aging emerge at the level of whole-word recognition.

    Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provides striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio (SNR). For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as SNR decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals.
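    Multisensory gain in studies like this one is often normalized by how much room for improvement the auditory-only score leaves. The function below sketches one common formulation of that metric; whether this exact formula matches the study's analysis is an assumption.

```python
# Sketch of one common way to quantify audiovisual (multisensory) gain from
# proportion-correct scores, normalized by the room for improvement over the
# auditory-only score. The study's exact metric may differ.
def multisensory_gain(audio_only: float, audiovisual: float) -> float:
    """Normalized gain: (AV - A) / (1 - A), with scores as proportions correct."""
    if audio_only >= 1.0:
        return 0.0  # no room for visual enhancement at ceiling
    return (audiovisual - audio_only) / (1.0 - audio_only)

# Hypothetical word-recognition scores at one SNR.
print(multisensory_gain(audio_only=0.40, audiovisual=0.70))  # 0.5
```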

    Audiovisual Speech-In-Noise (SIN) Performance of Young Adults with ADHD

    Adolescents with Attention-deficit/hyperactivity disorder (ADHD) have difficulty processing speech with background noise due to reduced inhibitory control and working memory capacity (WMC). This paper presents a pilot study of an audiovisual Speech-In-Noise (SIN) task for young adults with ADHD compared to age-matched controls, using eye-tracking measures. The audiovisual SIN task comprises six levels of background babble, each accompanied by visual cues. A significant difference between the ADHD and neurotypical (NT) groups was observed at a 15 dB signal-to-noise ratio (SNR). These results contribute to the literature on young adults with ADHD. Comment: To be published in the Symposium on Eye Tracking Research and Applications (ETRA '20 Short Papers); 6 pages, 3 figures, 2 tables.
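    Stimuli for a SIN task of this kind are typically constructed by scaling a babble track against the speech so that the mixture hits a target SNR in dB. The helper below is a generic recipe for that step, not the stimulus-generation code used in the study; the synthetic signals are placeholders.

```python
# Generic sketch: mix speech with babble noise at a target SNR (in dB) by
# scaling the babble relative to the speech RMS.
import numpy as np

def mix_at_snr(speech: np.ndarray, babble: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `babble` so the speech-to-babble power ratio equals `snr_db`, then mix."""
    babble = babble[: len(speech)]
    speech_rms = np.sqrt(np.mean(speech ** 2))
    babble_rms = np.sqrt(np.mean(babble ** 2))
    target_babble_rms = speech_rms / (10 ** (snr_db / 20))
    return speech + babble * (target_babble_rms / babble_rms)

# Example with synthetic signals standing in for recorded audio (1 s at 16 kHz).
rng = np.random.default_rng(2)
speech = rng.standard_normal(16000)
babble = rng.standard_normal(16000)
mixed_15db = mix_at_snr(speech, babble, snr_db=15.0)
```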

    The shadow of a doubt? Evidence for perceptuo-motor linkage during auditory and audiovisual close-shadowing

    One classical argument in favor of a functional role of the motor system in speech perception comes from the close-shadowing task, in which a subject has to identify and repeat an auditory speech stimulus as quickly as possible. The fact that close shadowing can occur very rapidly, and much faster than manual identification of the speech target, is taken to suggest that perceptually induced speech representations are already shaped in a motor-compatible format. Another argument is provided by audiovisual interactions, which are often interpreted within a multisensory-motor framework. In this study, we attempted to combine these two paradigms by testing whether the visual modality could speed motor responses in a close-shadowing task. To this end, both oral and manual responses were evaluated during the perception of auditory and audiovisual speech stimuli, either clear or embedded in white noise. Overall, oral responses were faster than manual ones, but they were also less accurate in noise, which suggests that the motor representations evoked by the speech input may be coarse at a first processing stage. In the presence of acoustic noise, the audiovisual modality led to both faster and more accurate responses than the auditory modality. No interaction was observed, however, between modality and response type. Altogether, these results are interpreted within a two-stage sensory-motor framework, in which the auditory and visual streams are integrated together, and with internally generated motor representations, before a final decision becomes available.

    Changes in Audiovisual Word Perception During Mid-Childhood: An ERP Study

    Throughout the school-age years, speech perception is an important skill that often relies on the child’s ability to combine both auditory and visual information from the speaker. In order to better understand the development of multisensory speech perception during mid-childhood, we analyzed audiovisual word perception in three groups of participants: 8-9-year-olds, 11-12-year-olds, and adults. Participants matched visually perceived articulatory movements with corresponding auditory words. In “congruent” trials, the auditory word matched the subsequently presented silent visual articulation. In “incongruent” trials, the words differed in the initial phoneme. From this task, we evaluated two specific neural components, the N400 and the Late Positive Complex (LPC), which index the phoneme and whole-word levels of audiovisual processing, respectively. The results of this experiment were then related to a real-life behavioral speech perception skill, namely, listening to speech in noise. Our results suggest that while the LPC becomes adult-like by the age of 11 or 12, the N400 is not fully matured until later in development. In addition, the relation of the LPC to listening to speech in noise is stronger earlier in childhood, while the relation of the N400 is stronger during the later school years and adulthood. Overall, we show that audiovisual processes related to the whole-word level mature earlier in childhood than processes related to the phonological level.
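    ERP components such as the N400 and LPC are usually quantified by averaging the EEG epochs for a condition and taking the mean amplitude within a component-specific time window. The sketch below illustrates that step; the sampling rate, epoch length, and time windows are assumed values for illustration, not the study's parameters.

```python
# Sketch of quantifying an ERP component: average epochs into an ERP, then take
# the mean amplitude inside a component-specific time window. Parameters are
# illustrative assumptions; the data are random stand-ins.
import numpy as np

def mean_amplitude(epochs: np.ndarray, times: np.ndarray, window: tuple) -> float:
    """epochs: (n_trials, n_samples) at one electrode; times in seconds."""
    erp = epochs.mean(axis=0)                       # average over trials -> ERP
    mask = (times >= window[0]) & (times <= window[1])
    return float(erp[mask].mean())

fs = 500                                            # Hz, assumed sampling rate
times = np.arange(-0.2, 1.0, 1 / fs)                # epoch from -200 ms to 1000 ms
epochs = np.random.default_rng(3).standard_normal((80, times.size))

n400 = mean_amplitude(epochs, times, window=(0.30, 0.50))  # ~300-500 ms (assumed)
lpc  = mean_amplitude(epochs, times, window=(0.60, 0.90))  # ~600-900 ms (assumed)
```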

    The Neural Substrates of Multisensory Speech Perception

    Comprehending speech is one of the most important human behaviors, but we are only beginning to understand how the brain accomplishes this difficult task. One key to speech perception seems to be that the brain integrates the independent sources of information available in the auditory and visual modalities in a process known as multisensory integration. This allows speech perception to be accurate even in environments in which one modality or the other is ambiguous in the context of noise. Previous electrophysiological and functional magnetic resonance imaging (fMRI) experiments have implicated the posterior superior temporal sulcus (STS) in auditory-visual integration of both speech and non-speech stimuli. While prior imaging studies have found increases in STS activity for audiovisual speech compared with unisensory auditory or visual speech, these studies do not provide a clear mechanism for how the STS communicates with early sensory areas to integrate the two streams of information into a coherent audiovisual percept. Furthermore, it is currently unknown whether activity within the STS is directly correlated with the strength of audiovisual perception. In order to better understand the cortical mechanisms that underlie audiovisual speech perception, we first studied STS activity and connectivity during the perception of speech with auditory and visual components of varying intelligibility. By studying fMRI activity during these noisy audiovisual speech stimuli, we found that STS connectivity with auditory and visual cortical areas mirrored perception: when the information from one modality is unreliable and noisy, the STS interacts less with the cortex processing that modality and more with the cortex processing the reliable information. We next characterized the role of STS activity during a striking audiovisual speech illusion, the McGurk effect, to determine whether activity within the STS predicts how strongly a person integrates auditory and visual speech information. Subjects with greater susceptibility to the McGurk effect exhibited stronger fMRI activation of the STS during perception of McGurk syllables, implying a direct correlation between the strength of audiovisual speech integration and activity within the multisensory STS.
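    The connectivity logic described above can be illustrated with a very simple measure: correlating an STS region-of-interest time course with auditory- and visual-cortex time courses and comparing the correlations across noise conditions. The sketch below uses random arrays as stand-in time courses and plain Pearson correlation; the study's actual connectivity analysis may well be more sophisticated.

```python
# Simple ROI-to-ROI connectivity sketch: correlate an STS time course with
# auditory- and visual-cortex time courses. The arrays are random stand-ins
# for extracted fMRI time courses.
import numpy as np

def roi_connectivity(sts_ts: np.ndarray, other_ts: np.ndarray) -> float:
    """Pearson correlation between two ROI time courses (1-D arrays)."""
    return float(np.corrcoef(sts_ts, other_ts)[0, 1])

rng = np.random.default_rng(4)
sts, auditory, visual = rng.standard_normal((3, 200))  # stand-in time courses

print("STS-auditory:", roi_connectivity(sts, auditory))
print("STS-visual:  ", roi_connectivity(sts, visual))
```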

    Audiovisual speech perception in cochlear implant patients

    Hearing with a cochlear implant (CI) is very different from a normal-hearing (NH) experience, as the CI can only provide limited auditory input. Nevertheless, the central auditory system is capable of learning how to interpret such limited auditory input so that it can extract meaningful information within a few months after implant switch-on. The capacity of the auditory cortex to adapt to new auditory stimuli is an example of intra-modal plasticity: changes within a sensory cortical region as a result of altered statistics of the respective sensory input. However, hearing deprivation before implantation and restoration of hearing capacities after implantation can also induce cross-modal plasticity: changes within a sensory cortical region as a result of altered statistics of a different sensory input. Thereby, a preserved cortical region can, for example, support a deprived cortical region, as in the case of CI users, who have been shown to exhibit cross-modal visual-cortex activation for purely auditory stimuli. Before implantation, during the period of hearing deprivation, CI users typically rely on additional visual cues such as lip movements for understanding speech. Therefore, it has been suggested that CI users show a pronounced binding of the auditory and visual systems, which may allow them to integrate auditory and visual speech information more efficiently. The projects included in this thesis investigate auditory and, in particular, audiovisual speech processing in CI users. Four event-related potential (ERP) studies approach the matter from different perspectives, each with a distinct focus. The first project investigates how audiovisually presented syllables are processed by CI users with bilateral hearing loss compared to NH controls. Previous ERP studies employing non-linguistic stimuli, and studies using different neuroimaging techniques, have found distinct audiovisual interactions in CI users. However, the precise timecourse of cross-modal visual-cortex recruitment and enhanced audiovisual interaction for speech-related stimuli is unknown. With our ERP study we fill this gap, presenting differences in the timecourse of audiovisual interactions as well as in cortical source configurations between CI users and NH controls. The second study focuses on auditory processing in single-sided deaf (SSD) CI users. SSD CI patients experience a maximally asymmetric hearing condition, as they have a CI on one ear and a contralateral NH ear. Despite the intact ear, several behavioural studies have demonstrated a variety of beneficial effects of restoring binaural hearing, but only a few ERP studies have investigated auditory processing in SSD CI users. Our study investigates whether the side of implantation affects auditory processing and whether auditory processing via the NH ear of SSD CI users works similarly to that in NH controls. Given the distinct hearing conditions of SSD CI users, the question arises whether there are any quantifiable differences between CI users with unilateral and bilateral hearing loss. In general, ERP studies on SSD CI users are rather scarce, and there is no study on audiovisual processing in particular. Furthermore, there are no reports on the lip-reading abilities of SSD CI users. To this end, in the third project we extend the first study by including SSD CI users as a third experimental group.
The study discusses both differences and similarities between CI users with bilateral hearing loss, CI users with unilateral hearing loss, and NH controls, and provides, for the first time, insights into audiovisual interactions in SSD CI users. The fourth project investigates the influence of background noise on audiovisual interactions in CI users and whether a noise-reduction algorithm can modulate these interactions. It is known that, in environments with competing background noise, listeners generally rely more strongly on visual cues for understanding speech, and that such situations are particularly difficult for CI users. As shown in previous auditory behavioural studies, the recently introduced noise-reduction algorithm "ForwardFocus" can be a useful aid in such cases. However, whether the algorithm is also beneficial in audiovisual conditions, and whether its use has a measurable effect on cortical processing, have not yet been investigated. In this ERP study, we address these questions with an auditory and audiovisual syllable discrimination task. Taken together, the projects included in this thesis contribute to a better understanding of auditory and especially audiovisual speech processing in CI users, revealing distinct processing strategies employed to overcome the limited input provided by a CI. The results have clinical implications, as they suggest that clinical hearing assessments, which are currently purely auditory, should be extended to audiovisual assessments. Furthermore, they imply that rehabilitation including audiovisual training methods may be beneficial for all CI user groups for quickly achieving the most effective CI implantation outcome.
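    A contrast that frequently appears in ERP work on audiovisual interactions, and could in principle apply to data like these, is the additive model: compare the audiovisual response with the sum of the unimodal responses (AV vs. A + V). The sketch below shows that point-wise contrast; treating it as the analysis used in this thesis is an assumption, and the arrays are random stand-ins for real ERPs.

```python
# Additive-model sketch for audiovisual interactions in ERPs: nonzero values of
# AV - (A + V) suggest non-additive (integrative) processing. Arrays are random
# placeholders for condition-averaged ERPs at one electrode.
import numpy as np

def av_interaction(erp_av: np.ndarray, erp_a: np.ndarray, erp_v: np.ndarray) -> np.ndarray:
    """Point-wise interaction term AV - (A + V)."""
    return erp_av - (erp_a + erp_v)

rng = np.random.default_rng(5)
erp_av, erp_a, erp_v = rng.standard_normal((3, 600))
interaction = av_interaction(erp_av, erp_a, erp_v)
```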