
    Seeing a talking face matters to infants, children and adults: behavioural and neurophysiological studies

    Everyday conversations typically occur face-to-face. Over and above auditory information, visual information from a speaker's face (e.g., lips, eyebrows) contributes to speech perception and comprehension. The facilitation that visual speech cues bring, termed the visual speech benefit, is experienced by infants, children and adults. Even so, studies of speech perception have largely focused on auditory-only speech, leaving a relative paucity of research on the visual speech benefit. Central to this thesis are the behavioural and neurophysiological manifestations of the visual speech benefit. As the visual speech benefit assumes that a listener is attending to a speaker's talking face, the investigations also consider the possible modulating effects of gaze behaviour. Three investigations were conducted. Collectively, these studies demonstrate that visual speech information facilitates speech perception, which has implications for individuals who do not have clear access to the auditory speech signal. The results, for instance the enhancement of 5-month-olds' cortical tracking by visual speech cues and the effect of idiosyncratic differences in gaze behaviour on speech processing, expand knowledge of auditory-visual speech processing and provide a firm basis for new directions in this burgeoning and important area of research.

    Towards a Multimodal Silent Speech Interface for European Portuguese

    Automatic Speech Recognition (ASR) in the presence of environmental noise is still a hard problem to tackle in speech science (Ng et al., 2000). Another problem well described in the literature concerns elderly speech production. Studies (Helfrich, 1979) have shown evidence of a slower speech rate, more breaks, more speech errors and a lower speech volume when comparing elderly speech with that of teenagers or adults at the acoustic level. This makes elderly speech hard to recognize using currently available stochastic ASR technology. To tackle these two problems in the context of ASR for Human-Computer Interaction, a novel Silent Speech Interface (SSI) in European Portuguese (EP) is envisioned.

    Neural entrainment to continuous speech and language processing in the early years of life

    This thesis aimed to explore the neural mechanisms of language processing in infants under 12 months of age by using EEG measures of speech processing. More specifically, I wanted to investigate whether infants are able to engage in auditory neural tracking of continuous speech and how this processing can be modulated by infant attention and different linguistic environments. Limited research has investigated this phenomenon of neural tracking in infants and the potential effects it may have on later language development. Experiment 1 set the groundwork for the thesis by establishing a reliable method to measure 36 infants' cortical entrainment to the amplitude envelope of continuous speech. The results demonstrated that infants entrain to speech much as adults do. Additionally, infants showed reliable elicitation of the Acoustic Change Complex (ACC). Follow-up language assessments were conducted with these infants approximately two years later; however, coherence did not significantly predict later language outcomes. The aim of Experiment 2 was to discover how neural entrainment can be modulated by infant attention. Twenty infants were measured on their ability to selectively attend to a target speaker in the presence of a distractor of matching acoustic intensity. Coherence values were found for the target, the distractor and the dual signal (target and distractor together), suggesting that infant attention may fluctuate between the two speech signals, leading infants to entrain to both simultaneously. However, the results were not clear-cut, so Experiment 3 expanded on Experiment 2: EEG was recorded from 30 infants who listened to speech with no acoustic interference and to speech-in-noise with a signal-to-noise ratio of 10 dB. It was also investigated whether bilingualism has any effect on this process. Similar coherence values were observed when infants listened to speech in both conditions (quiet and noise), suggesting that infants successfully inhibited the disruptive effects of the masker; no effects of bilingualism on neural entrainment were present. The fourth study was intended to continue investigating infant auditory neural entrainment under more varied levels of background noise. However, due to the COVID-19 pandemic, all testing was moved online, so for Experiment 4 we developed a piece of online software (a memory card game) that could be used remotely. Seventy-three children ranging from 4 to 12 years old participated in the online experiment, which explored how the demands of a speech recognition task interact with masker type and language, and how this changes with age during childhood. Results showed that performance on the memory card game improved with age but was not affected by masker type or language background; this improvement with age is most likely a result of improved speech perception capabilities. Overall, this thesis provides a reliable methodology for measuring neural entrainment in infants and a greater understanding of the mechanisms of speech processing in infancy and beyond.
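
    To make the entrainment measure concrete, the following is a minimal Python sketch of one common way to quantify it: magnitude-squared coherence between an EEG channel and the low-pass-filtered amplitude envelope of continuous speech. The signals, sampling rate, filter cut-off and frequency band below are illustrative placeholders, not the thesis's actual data or pipeline.

    import numpy as np
    from scipy.signal import hilbert, coherence, butter, filtfilt

    fs = 250.0                                    # assumed common sampling rate (Hz)
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(int(60 * fs))    # stand-in for a speech waveform
    eeg = rng.standard_normal(int(60 * fs))       # stand-in for one EEG channel

    # Amplitude envelope of the speech via the Hilbert transform, low-pass filtered
    # to the slow modulation range that envelope-tracking analyses usually target.
    envelope = np.abs(hilbert(speech))
    b, a = butter(4, 10.0 / (fs / 2), btype="low")
    envelope = filtfilt(b, a, envelope)

    # Magnitude-squared coherence between the EEG channel and the speech envelope.
    freqs, coh = coherence(eeg, envelope, fs=fs, nperseg=int(4 * fs))

    # Average coherence in an assumed 1-8 Hz band of interest.
    band = (freqs >= 1.0) & (freqs <= 8.0)
    print(f"Mean 1-8 Hz coherence: {coh[band].mean():.3f}")

    With real recordings, the same coherence values could then be compared across conditions (quiet versus noise, target versus distractor) as described above.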

    Cortical dynamics of auditory-visual speech: a forward model of multisensory integration

    In noisy settings, seeing the interlocutor's face helps to disambiguate what is being said. For this to happen, the brain must integrate auditory and visual information. Three major problems are (1) bringing together separate sensory streams of information, (2) extracting auditory and visual speech information, and (3) identifying this information as a unified auditory-visual percept. In this dissertation, a new representational framework for auditory-visual (AV) speech integration is offered. The experimental work (psychophysics and electrophysiology (EEG)) suggests specific neural mechanisms for solving problems (1), (2), and (3) that are consistent with a (forward) 'analysis-by-synthesis' view of AV speech integration. In Chapter I, multisensory perception and integration are reviewed, and a unified conceptual framework serves as background for the study of AV speech integration. In Chapter II, psychophysical testing of the perception of desynchronized AV speech inputs shows the existence of a ~250 ms temporal window of integration for AV speech. In Chapter III, an EEG study shows that visual speech modulates the neural processing of auditory speech at an early stage. Two functionally independent modulations are (i) a ~250 ms amplitude reduction of auditory evoked potentials (AEPs) and (ii) a systematic temporal facilitation of the same AEPs as a function of the saliency of visual speech. In Chapter IV, an EEG study of desynchronized AV speech inputs shows that (i) fine-grained (gamma, ~25 ms) and (ii) coarse-grained (theta, ~250 ms) neural mechanisms simultaneously mediate the processing of AV speech. In Chapter V, a new illusory effect is proposed, in which non-speech visual signals modify the perceptual quality of auditory objects. EEG results show very different patterns of activation compared to those observed in AV speech integration, and an MEG experiment is subsequently proposed to test hypotheses on the origins of these differences. In Chapter VI, the 'analysis-by-synthesis' model of AV speech integration is contrasted with major speech theories. From a cognitive neuroscience perspective, the 'analysis-by-synthesis' model is argued to offer the most sensible representational system for AV speech integration. This thesis shows that AV speech integration results from both the statistical nature of stimulation and the inherent predictive capabilities of the nervous system.
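
    As an illustration of how a temporal window of integration can be summarized from psychophysics on desynchronized AV speech, here is a small Python sketch that fits a Gaussian to the proportion of fused ("simultaneous") responses across audio-visual lags and reports the window width. The lag values, response rates and the choice of a Gaussian fit are assumptions for illustration, not the dissertation's actual data or analysis.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical proportions of fused percepts at each audio-visual lag (ms);
    # negative lags mean the audio leads, positive lags mean the video leads.
    lags_ms = np.array([-400, -300, -200, -100, 0, 100, 200, 300, 400], dtype=float)
    p_fused = np.array([0.10, 0.25, 0.60, 0.85, 0.95, 0.90, 0.70, 0.35, 0.15])

    def gaussian(lag, peak, center, sigma):
        """Proportion of fused percepts as a function of AV lag (ms)."""
        return peak * np.exp(-((lag - center) ** 2) / (2 * sigma ** 2))

    (peak, center, sigma), _ = curve_fit(gaussian, lags_ms, p_fused, p0=[1.0, 0.0, 150.0])

    # Full width at half maximum as one summary of the integration window.
    fwhm = 2.355 * abs(sigma)
    print(f"Window centre: {center:.0f} ms, width (FWHM): {fwhm:.0f} ms")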

    Unsupervised mining of audiovisually consistent segments in videos with application to structure analysis

    In this paper, a multimodal event mining technique is proposed to discover repeating video segments exhibiting audio and visual consistency in a totally unsupervised manner. The mining strategy first exploits independent audio and visual cluster analyses to provide segments that are consistent in both their visual and audio modalities, and thus likely correspond to a unique underlying event. A subsequent modeling stage using discriminative models enables accurate detection of the underlying event throughout the video. Event mining is applied to unsupervised video structure analysis, using simple heuristics on the occurrence patterns of the discovered events to select those relevant to the video structure. Results on TV programs, ranging from news to talk shows and games, show that structurally relevant events are discovered with precision ranging from 87% to 98% and recall from 59% to 94%.
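
    The core idea of the mining strategy, independent audio and visual clustering followed by a search for stretches where both modalities stay consistent, can be sketched in a few lines of Python. The per-second features, the use of k-means and the minimum segment length below are placeholders; the paper's actual descriptors and its discriminative modeling stage are not reproduced here.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    n_seconds = 600
    audio_feats = rng.standard_normal((n_seconds, 20))    # stand-in for per-second audio features
    visual_feats = rng.standard_normal((n_seconds, 64))   # stand-in for per-second visual features

    # Cluster each modality independently.
    audio_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(audio_feats)
    visual_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(visual_feats)

    # Keep maximal runs where the (audio cluster, visual cluster) pair stays constant:
    # candidate segments that are consistent in both modalities.
    pairs = list(zip(audio_labels, visual_labels))
    segments, start = [], 0
    for t in range(1, n_seconds + 1):
        if t == n_seconds or pairs[t] != pairs[start]:
            if t - start >= 5:                 # assumed minimum length of 5 seconds
                segments.append((start, t, pairs[start]))
            start = t

    print(f"Found {len(segments)} audiovisually consistent candidate segments")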

    Exploiting visual saliency for assessing the impact of car commercials upon viewers

    Content-based video indexing and retrieval (CBVIR) is a lively area of research which focuses on automating the indexing, retrieval and management of videos. This area has a wide spectrum of promising applications, among which assessing the impact of audiovisual productions emerges as a particularly interesting and motivating one. In this paper we present a computational model capable of predicting the impact (i.e. positive or negative) of car advertisement videos upon viewers by using a set of visual saliency descriptors. Visual saliency provides information about the parts of an image perceived as most important, which are instinctively targeted by humans when looking at a picture or watching a video. For this reason we propose to exploit visual information, introducing it as a new feature which reflects high-level semantics objectively, to improve the video impact categorization results. The suggested saliency descriptors are inspired by the mechanisms that underlie the attentional abilities of the human visual system and are organized into seven distinct families according to different measurements over the identified salient areas in the video frames, namely population, size, location, geometry, orientation, movement and photographic composition. The proposed approach starts by computing saliency maps for all the video frames, where two different visual saliency detection frameworks have been considered and evaluated: the popular graph-based visual saliency (GBVS) algorithm, and a state-of-the-art DNN-based approach. This work has been partially supported by the National Grants RTC-2016-5305-7 and TEC2014-53390-P of the Spanish Ministry of Economy and Competitiveness.
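
    To illustrate what some of the simpler descriptor families (population, size and location) might look like once a per-frame saliency map is available, here is a rough Python sketch. The random saliency map, the 95th-percentile threshold and the specific statistics are assumptions for illustration; in the paper the maps would come from GBVS or a DNN-based detector, and many more descriptors (geometry, orientation, movement, photographic composition) are used.

    import numpy as np
    from scipy import ndimage

    rng = np.random.default_rng(2)
    height, width = 360, 640
    saliency = rng.random((height, width))        # stand-in for one frame's saliency map

    # Threshold the map to isolate salient regions and label connected components.
    mask = saliency > np.percentile(saliency, 95)
    labels, n_regions = ndimage.label(mask)
    idx = list(range(1, n_regions + 1))

    population = n_regions                                     # number of salient blobs
    sizes = ndimage.sum(mask, labels, idx)                     # pixels per blob
    mean_size_px = sizes.mean()                                # mean blob size in pixels
    centroids = ndimage.center_of_mass(mask, labels, idx)      # blob locations
    mean_offset = np.mean([abs(r / height - 0.5) + abs(c / width - 0.5) for r, c in centroids])

    print(f"population={population}, mean size={mean_size_px:.1f} px, "
          f"mean centroid offset from frame centre={mean_offset:.3f}")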