898 research outputs found

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks: their purpose is to extract one target speech signal or several target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used in speech enhancement and speech separation systems. To fuse acoustic and visual information efficiently, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The steady stream of new techniques for extracting features and fusing multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
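Among the training targets such surveys catalogue, time-frequency masks are a common choice. As a minimal illustrative sketch (not taken from the paper; the function names and toy values are ours), the ideal ratio mask (IRM) can be computed from the target and interference spectrograms and applied to the mixture like this:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask (IRM), a common training target for
    deep-learning speech enhancement: per time-frequency bin,
    the fraction of energy belonging to the target speech."""
    return speech_mag ** 2 / (speech_mag ** 2 + noise_mag ** 2 + eps)

def apply_mask(mixture_mag, mask):
    """Enhanced magnitude spectrogram: mask applied bin-wise."""
    return mixture_mag * mask

# Toy 2x2 "spectrograms" (frequency x time magnitudes).
speech = np.array([[3.0, 0.0], [1.0, 2.0]])
noise = np.array([[1.0, 2.0], [1.0, 0.0]])
mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask(speech + noise, mask)
```

In a real system a network would predict `mask` from the noisy (and visual) input; here the oracle mask is computed directly to show the target definition.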

    Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

    This thesis presents a novel two-stage multimodal speech enhancement system that uses both visual and audio information to filter speech, and explores the extension of this system with fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context-aware multimodal system. The design of the proposed cognitively inspired framework is scalable: the techniques used in individual parts of the system can be upgraded, and there is scope for the initial framework presented here to be expanded. In the proposed system, the concept of single-modality two-stage filtering is extended to include the visual modality. Noisy speech information received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state-of-the-art Semi Adaptive Appearance Models (SAAM) based lip tracking approach. This pre-processed speech is then enhanced further by audio-only beamforming using a state-of-the-art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments, evaluated using speech sentences from different speakers in the GRID corpus and a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests, show that this initial system delivers very encouraging results when filtering speech mixtures in difficult reverberant environments. Some limitations of this initial framework are identified, and the extension of the multimodal system is explored through the development of a fuzzy-logic-based framework and a proof-of-concept demonstration.
Results show that this proposed autonomous, adaptive, and context-aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work.
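The first-stage Wiener filter can be sketched abstractly: per frequency bin, the gain is the ratio of estimated speech power to total power. In the thesis the speech statistics are derived from visual features via GMR; in this hedged sketch they are simply given arrays:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, eps=1e-8):
    """Per-bin Wiener gain: estimated speech power divided by
    total power. In a visually derived variant, speech_psd would
    come from a visual-to-audio regression (e.g. GMR); here both
    PSD estimates are just supplied directly."""
    return speech_psd / (speech_psd + noise_psd + eps)

speech_psd = np.array([4.0, 1.0, 0.0])  # estimated speech power per bin
noise_psd = np.array([1.0, 1.0, 2.0])   # estimated noise power per bin
gain = wiener_gain(speech_psd, noise_psd)  # attenuation in [0, 1] per bin
```

Bins dominated by speech pass nearly unchanged, while speech-free bins are attenuated toward zero; the second-stage beamformer would then operate on the filtered signal.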

    Audio-Visual Speech Enhancement Based on Deep Learning


    Neural entrainment to continuous speech and language processing in the early years of life

    This thesis aimed to explore the neural mechanisms of language processing in infants under 12 months of age using EEG measures of speech processing. More specifically, I wanted to investigate whether infants are able to engage in auditory neural tracking of continuous speech and how this processing can be modulated by infant attention and different linguistic environments. Limited research has investigated this phenomenon of neural tracking in infants and the potential effects it may have on later language development. Experiment 1 set the groundwork for the thesis by establishing a reliable method to measure cortical entrainment in 36 infants to the amplitude envelope of continuous speech. The results demonstrated that infants entrain to speech much as adults do. Additionally, infants show a reliable elicitation of the Acoustic Change Complex (ACC). Follow-up language assessments were conducted with these infants approximately two years later; however, coherence did not significantly predict later language outcomes. The aim of Experiment 2 was to discover how neural entrainment can be modulated by infant attention. Twenty infants were measured on their ability to selectively attend to a target speaker in the presence of a distractor of matching acoustic intensity. Coherence values were found for the target, the distractor, and the dual signal (target and distractor together), suggesting that infant attention may fluctuate between the two speech signals, leading infants to entrain to both simultaneously. However, the results were not clear-cut, so Experiment 3 expanded on Experiment 2: EEG was recorded from 30 infants who listened to speech with no acoustic interference and to speech-in-noise at a signal-to-noise ratio of 10 dB. Additionally, it was investigated whether bilingualism has any effect on this process.
Similar coherence values were observed when infants listened to speech in both conditions (quiet and noise), suggesting that infants successfully inhibited the disruptive effects of the masker. No effects of bilingualism on neural entrainment were present. The fourth study was intended to continue investigating infant auditory neural entrainment under more varied levels of background noise; however, due to the COVID-19 pandemic, all testing was moved online. For Experiment 4 we therefore developed a piece of online software (the memory card game) that could be used remotely. Seventy-three children ranging from 4 to 12 years old participated in the online experiment, which explored how the demands of a speech recognition task interact with masker type and language, and how this changes with age during childhood. Results showed that performance on the memory card game improved with age but was not affected by masker type or language background. This improvement with age is most likely a result of improved speech perception capabilities. Overall, this thesis provides a reliable methodology for measuring neural entrainment in infants and a greater understanding of the mechanisms of speech processing in infancy and beyond.
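The entrainment measure throughout is coherence between the EEG and the speech amplitude envelope. A minimal numpy stand-in is sketched below (Welch-averaged magnitude-squared coherence over synthetic signals; the toy parameters are illustrative, not the thesis's analysis settings):

```python
import numpy as np

def mscohere(x, y, fs, nperseg):
    """Magnitude-squared coherence via Welch averaging over
    non-overlapping Hann-windowed segments."""
    win = np.hanning(nperseg)
    nseg = len(x) // nperseg
    sxx = syy = sxy = 0.0
    for k in range(nseg):
        seg = slice(k * nperseg, (k + 1) * nperseg)
        fx = np.fft.rfft(win * x[seg])
        fy = np.fft.rfft(win * y[seg])
        sxx = sxx + np.abs(fx) ** 2
        syy = syy + np.abs(fy) ** 2
        sxy = sxy + fx * np.conj(fy)
    freqs = np.fft.rfftfreq(nperseg, 1.0 / fs)
    return freqs, np.abs(sxy) ** 2 / (sxx * syy + 1e-30)

fs = 100.0
t = np.arange(0, 8, 1 / fs)
rng = np.random.default_rng(1)
envelope = np.sin(2 * np.pi * 4 * t)  # 4 Hz (syllable-rate) modulation
# Toy "EEG": an attenuated copy of the envelope plus neural noise.
eeg = 0.5 * envelope + 0.5 * rng.standard_normal(t.size)
freqs, cxy = mscohere(envelope, eeg, fs, nperseg=200)
c_at_4hz = cxy[np.argmin(np.abs(freqs - 4.0))]
```

High coherence at the modulation rate, as here, is the signature that the (toy) neural signal tracks the envelope.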

    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, which have been exploited mainly for speech enhancement and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and rehabilitation of persons who have undergone laryngectomy surgery. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterised as appearance-based and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical-flow-based approaches to visual feature extraction, which capture mouth motion in an image sequence. The motivation for using motion features is that human lip-reading relies on the temporal dynamics of mouth motion. The first approach is based on extracting features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the large variation in speaking rate, each utterance is normalised using a simple linear interpolation method.
In the second approach, four directional motion history images (DMHIs) based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; DMHIs resolve this overwriting. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and the Zernike and Hu moments, separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed lip-reading methods. The experimental results demonstrate the promising performance of the optical-flow-based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. The thesis also proposes a video-based ad hoc temporal segmentation method for isolated utterances, used to detect the start and end frames of an utterance in an image sequence. The technique is based on pairwise pixel comparison, and its efficiency was tested on the available dataset with short pauses between utterances.
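The standard MHI update rule, and the directional split that DMHIs add, can be sketched as follows (the threshold and template names are illustrative; the thesis's exact formulation may differ):

```python
import numpy as np

def update_mhi(mhi, motion, tau):
    """One motion-history-image step: moving pixels are stamped
    with the current duration tau; all others decay by 1 toward 0,
    so recent motion appears brighter than older motion."""
    return np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))

def update_dmhi(templates, flow_u, flow_v, tau, thresh=0.5):
    """Directional MHIs: one template per flow direction, so
    opposite motions no longer overwrite each other (the MHI
    self-occlusion issue the thesis notes)."""
    masks = {
        "right": flow_u > thresh, "left": flow_u < -thresh,
        "down": flow_v > thresh, "up": flow_v < -thresh,
    }
    return {d: update_mhi(templates[d], m, tau) for d, m in masks.items()}

# Plain MHI over two frames of a 2x2 toy image.
mhi = np.zeros((2, 2))
mhi = update_mhi(mhi, np.array([[True, False], [False, False]]), tau=5)
mhi = update_mhi(mhi, np.array([[False, False], [False, True]]), tau=5)

# Directional templates from one toy optical-flow field.
templates = {d: np.zeros((2, 2)) for d in ("right", "left", "down", "up")}
flow_u = np.array([[1.0, 0.0], [0.0, -1.0]])  # rightward and leftward motion
flow_v = np.zeros((2, 2))
dmhi = update_dmhi(templates, flow_u, flow_v, tau=5)
```

Moment descriptors (Zernike or Hu) would then be computed on each of the four templates rather than on a single overwritten history image.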

    The use of acoustic cues in phonetic perception: Effects of spectral degradation, limited bandwidth and background noise

    Hearing impairment, cochlear implantation, background noise and other auditory degradations result in the loss or distortion of sound information thought to be critical to speech perception. In many cases, listeners can still identify speech sounds despite degradations, but understanding of how this is accomplished is incomplete. Experiments presented here tested the hypothesis that listeners would utilize acoustic-phonetic cues differently if one or more cues were degraded by hearing impairment or simulated hearing impairment. Results supported this hypothesis for various listening conditions that are directly relevant for clinical populations. Analysis included mixed-effects logistic modeling of contributions of individual acoustic cues for various contrasts. Listeners with cochlear implants (CIs) or normal-hearing (NH) listeners in CI simulations showed increased use of acoustic cues in the temporal domain and decreased use of cues in the spectral domain for the tense/lax vowel contrast and the word-final fricative voicing contrast. For the word-initial stop voicing contrast, NH listeners made less use of voice-onset time and greater use of voice pitch in conditions that simulated high-frequency hearing impairment and/or masking noise; influence of these cues was further modulated by consonant place of articulation. A pair of experiments measured phonetic context effects for the "s/sh" contrast, replicating previously observed effects for NH listeners and generalizing them to CI listeners as well, despite known deficiencies in spectral resolution for CI listeners. For NH listeners in CI simulations, these context effects were absent or negligible. Audio-visual delivery of this experiment revealed enhanced influence of visual lip-rounding cues for CI listeners and NH listeners in CI simulations. Additionally, CI listeners demonstrated that visual cues to gender influence phonetic perception in a manner consistent with gender-related voice acoustics. 
All of these results suggest that listeners are able to accommodate challenging listening situations by capitalizing on the natural (multimodal) covariance in speech signals. Additionally, these results imply that there are potential differences in speech perception between NH listeners and listeners with hearing impairment that would be overlooked by traditional word recognition or consonant confusion matrix analysis.
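The cue-weighting idea behind the logistic modelling can be illustrated with a toy fixed-effects model (the cue names follow the stop-voicing contrast above, but all weights below are invented for illustration, not fitted values from the study):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_voiced(vot_ms, f0_hz, weights):
    """Probability of a 'voiced' stop response as a logistic
    function of two acoustic cues: voice-onset time (VOT) and
    onset voice pitch (F0, centred at 120 Hz). The relative size
    of the weights is the cue-weighting such models estimate."""
    b0, b_vot, b_f0 = weights
    return sigmoid(b0 + b_vot * vot_ms + b_f0 * (f0_hz - 120.0))

# Invented weight sets: an NH-like listener dominated by VOT, and
# a simulated-impairment listener relying more on pitch.
w_nh = (6.0, -0.30, -0.02)
w_sim = (1.5, -0.05, -0.08)
```

Comparing the two weight sets reproduces the qualitative pattern described above: the VOT cue shifts responses strongly for the NH-like weights, while the pitch cue matters more under the simulated-impairment weights.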

    Electrophysiologic assessment of (central) auditory processing disorder in children with non-syndromic cleft lip and/or palate

    Session 5aPP - Psychological and Physiological Acoustics: Auditory Function, Mechanisms, and Models (Poster Session). Cleft of the lip and/or palate is a common congenital craniofacial malformation worldwide, particularly non-syndromic cleft lip and/or palate (NSCL/P). Though middle ear deficits in this population have been universally noted in numerous studies, other auditory problems, including inner ear deficits or cortical dysfunction, are rarely reported. A higher prevalence of educational problems has been noted in children with NSCL/P compared to craniofacially normal children. These high-level cognitive difficulties cannot be entirely attributed to peripheral hearing loss. Recently it has been suggested that children with NSCL/P may be more prone to abnormalities in the auditory cortex. The aim of the present study was to investigate whether school-age children with NSCL/P have a higher prevalence of indications of (central) auditory processing disorder [(C)APD] compared to normal age-matched controls when assessed using auditory event-related potential (ERP) techniques. School children (6 to 15 years) with NSCL/P and normal controls matched for age and gender were recruited. Auditory ERP recordings included the auditory brainstem response and late event-related potentials, including the P1-N1-P2 complex and P300 waveforms. Initial findings from the present study are presented, and their implications for further research in this area and for clinical intervention are outlined. © 2012 Acoustical Society of America.

    Functional Brain Differences Predict Challenging Auditory Speech Comprehension in Older Adults

    Older adults often experience communication difficulties, including poorer comprehension of auditory speech when it contains complex sentence structures or occurs in noisy environments. Previous work has linked comprehension of challenging speech to cognitive abilities and the engagement of domain-general cognitive resources, such as the cingulo-opercular and frontoparietal brain networks. However, the degree to which these networks can support comprehension remains unclear. Furthermore, how hearing loss may be related to the cognitive resources recruited during challenging speech comprehension is unknown. This dissertation investigated how hearing, cognitive performance, and functional brain networks contribute to challenging auditory speech comprehension in older adults. Experiment 1 characterized how age and hearing loss modulate resting-state functional connectivity between Heschl’s gyrus and several sensory and cognitive brain networks. The results indicate that older adults exhibit decreased functional connectivity between Heschl’s gyrus and sensory and attention networks compared to younger adults. Within older adults, greater hearing loss was associated with increased functional connectivity between right Heschl’s gyrus and the cingulo-opercular and language networks. Experiments 2 and 3 investigated how hearing, working memory, attentional control, and fMRI measures predict comprehension of complex sentence structures and speech in noisy environments. Experiment 2 utilized resting-state functional magnetic resonance imaging (fMRI) and behavioral measures of working memory and attentional control. Experiment 3 used activation-based fMRI to examine the brain regions recruited in response to sentences with both complex structures and noisy background environments as a function of hearing and cognitive abilities.
The results suggest that working memory abilities and the functionality of the frontoparietal and language networks support the comprehension of speech in multi-speaker environments. Conversely, attentional control and the cingulo-opercular network were shown to support comprehension of complex sentence structures. Hearing loss was shown to decrease activation within right Heschl’s gyrus in response to all sentence conditions and to increase activation within frontoparietal and cingulo-opercular regions. Hearing loss was also associated with poorer sentence comprehension in energetic, but not informational, masking. Together, these three experiments identify the unique contributions of cognition and brain networks that support challenging auditory speech comprehension in older adults, further probing how hearing loss affects these relationships.
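Resting-state functional connectivity of the kind analysed in Experiment 1 is commonly computed as the Pearson correlation between regional time series; a small illustrative sketch with synthetic signals (not study data) follows:

```python
import numpy as np

def connectivity_matrix(ts):
    """Functional connectivity as the Pearson correlation between
    each pair of region time series (one row per region)."""
    return np.corrcoef(ts)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 60.0, 300)
shared = np.sin(0.2 * np.pi * t)  # common slow fluctuation (0.1 Hz)
# Two "connected" regions share the slow signal; a third does not.
roi_a = shared + 0.3 * rng.standard_normal(t.size)
roi_b = shared + 0.3 * rng.standard_normal(t.size)
roi_c = rng.standard_normal(t.size)
fc = connectivity_matrix(np.vstack([roi_a, roi_b, roi_c]))
```

Group analyses would then compare entries of `fc` (for example, a Heschl's gyrus row against network regions) across age or hearing-loss groups.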