
    Introducing Crossmodal Biometrics: Person Identification from Distinct Audio & Visual Streams

    Person identification using audio or visual biometrics is a well-studied problem in pattern recognition. In the standard scenario, both training and testing are performed on the same modality. However, there are situations where this condition does not hold, i.e., training and testing must be performed on different modalities. This could arise, for example, in covert surveillance. Is there any person-specific information common to both the audio and visual (video-only) modalities that could be exploited to identify a person in such a constrained setting? In this work, we investigate this question in a principled way and propose a framework which performs this task consistently better than chance, suggesting that such crossmodal biometric information does exist.
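
    The abstract does not describe the framework itself, so the following is only a minimal sketch of one common approach to crossmodal matching (not the paper's method): project paired audio and visual features into a shared space with canonical correlation analysis (CCA), then identify a visual probe by similarity to audio-only enrolment templates. All feature dimensions and data below are hypothetical placeholders.

    # Illustrative sketch only; not the paper's actual framework.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    n_train, d_audio, d_visual = 200, 40, 60              # hypothetical feature sizes
    audio_feats = rng.normal(size=(n_train, d_audio))     # paired training features,
    visual_feats = rng.normal(size=(n_train, d_visual))   # same persons in the same order

    cca = CCA(n_components=10)
    cca.fit(audio_feats, visual_feats)                    # learn the shared space

    def identify(visual_probe, audio_gallery, labels):
        """Match a visual probe against audio-only enrolment templates."""
        a_proj, v_proj = cca.transform(audio_gallery, visual_probe[None, :])
        a_proj /= np.linalg.norm(a_proj, axis=1, keepdims=True)
        v_proj /= np.linalg.norm(v_proj, axis=1, keepdims=True)
        return labels[np.argmax(a_proj @ v_proj.T)]       # nearest neighbour by cosine similarity

    gallery_labels = np.arange(n_train)                   # one enrolment template per person
    print(identify(visual_feats[0], audio_feats, gallery_labels))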

    Cortical Dynamics of Auditory-Visual Speech: A Forward Model of Multisensory Integration

    In noisy settings, seeing the interlocutor's face helps to disambiguate what is being said. For this to happen, the brain must integrate auditory and visual information. Three major problems are (1) bringing together separate sensory streams of information, (2) extracting auditory and visual speech information, and (3) identifying this information as a unified auditory-visual percept. In this dissertation, a new representational framework for auditory-visual (AV) speech integration is offered. The experimental work (psychophysics and electrophysiology (EEG)) suggests specific neural mechanisms for solving problems (1), (2), and (3) that are consistent with a (forward) 'analysis-by-synthesis' view of AV speech integration. In Chapter I, multisensory perception and integration are reviewed, and a unified conceptual framework serves as background for the study of AV speech integration. In Chapter II, psychophysical experiments testing the perception of desynchronized AV speech inputs show the existence of a ~250 ms temporal window of integration for AV speech. In Chapter III, an EEG study shows that visual speech modulates the neural processing of auditory speech at an early stage. Two functionally independent modulations are (i) a ~250 ms amplitude reduction of auditory evoked potentials (AEPs) and (ii) a systematic temporal facilitation of the same AEPs as a function of the saliency of visual speech. In Chapter IV, an EEG study of desynchronized AV speech inputs shows that (i) fine-grained (gamma, ~25 ms) and (ii) coarse-grained (theta, ~250 ms) neural mechanisms simultaneously mediate the processing of AV speech. In Chapter V, a new illusory effect is proposed, in which non-speech visual signals modify the perceptual quality of auditory objects. EEG results show very different patterns of activation from those observed in AV speech integration, and an MEG experiment is subsequently proposed to test hypotheses on the origins of these differences. In Chapter VI, the 'analysis-by-synthesis' model of AV speech integration is contrasted with major speech theories. From a cognitive neuroscience perspective, the 'analysis-by-synthesis' model is argued to offer the most sensible representational system for AV speech integration. This thesis shows that AV speech integration results from both the statistical nature of the stimulation and the inherent predictive capabilities of the nervous system.
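
    As a hedged illustration of how a temporal window of integration of roughly 250 ms can be estimated (the dissertation's own analysis is not reproduced here), one common approach fits a Gaussian to the proportion of fused responses as a function of audiovisual asynchrony and reads off the window width. The data points below are invented for the example.

    # Toy example: estimating a temporal window of AV integration from asynchrony judgements.
    import numpy as np
    from scipy.optimize import curve_fit

    soa_ms = np.array([-300, -200, -100, 0, 100, 200, 300])         # audio leads at negative SOAs
    p_fused = np.array([0.15, 0.40, 0.80, 0.95, 0.85, 0.55, 0.20])  # made-up response proportions

    def gaussian(soa, amp, mu, sigma):
        return amp * np.exp(-0.5 * ((soa - mu) / sigma) ** 2)

    (amp, mu, sigma), _ = curve_fit(gaussian, soa_ms, p_fused, p0=[1.0, 0.0, 100.0])
    fwhm = 2.355 * abs(sigma)     # full width at half maximum, in ms
    print(f"window centre = {mu:.0f} ms, width (FWHM) = {fwhm:.0f} ms")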

    The integration of paralinguistic information from the face and the voice

    We live in a world which bombards us with a huge amount of sensory information, even if we are not always aware of it. To successfully navigate, function and ultimately survive in our environment we use all of the cues available to us. Furthermore, we actually combine this information: doing so not only allows us to construct a richer percept of the objects around us, but also increases the reliability of our decisions and sensory estimates. However, at odds with our naturally multisensory awareness of our surroundings, the literature addressing unisensory processes has always far exceeded that which examines the multimodal nature of perception. Arguably the most salient and relevant stimuli in our environment are other people. Our species is not designed to operate alone, and so we have evolved to be especially skilled in the things which enable effective social interaction: engaging in conversation, but equally recognising a family member, or understanding the current emotional state of a friend and adjusting our behaviour appropriately. In particular, the face and the voice both provide us with a wealth of highly relevant social information, linguistic but also non-linguistic. In line with work conducted in other fields of multisensory perception, research on face and voice perception has mainly concentrated on each of these modalities independently, particularly face perception. Furthermore, the work that has addressed the integration of these two sources has by and large concentrated on the audiovisual nature of speech perception. The work in this thesis is based on a theoretical model of voice perception which not only proposed a serial processing pathway of vocal information, but also emphasised the similarities between face and voice processing, suggesting that this information may interact. Significantly, these interactions were not confined to speech processing, but rather encompassed all forms of information processing, whether linguistic or paralinguistic. Therefore, in this thesis, I concentrate on the interactions between, and integration of, paralinguistic information from the face and voice. In Chapter 3 we conducted a general investigation of neural face-voice integration. A number of studies have attempted to identify the cerebral regions in which information from the face and voice combines; however, in addition to a large number of regions being proposed as integration sites, it is not known whether these regions are selective in the binding of these socially relevant stimuli. We first identified regions in the bilateral superior temporal sulcus (STS) which showed an increased response to person-related information, whether faces, voices, or faces and voices combined, in comparison to information from objects. A subsection of this region in the right posterior superior temporal sulcus (pSTS) also produced a significantly stronger response to audiovisual as compared to unimodal information. We therefore propose this as a potential people-selective, integrative region. Furthermore, a large portion of the right pSTS was also observed to be people-selective and heteromodal: that is, both auditory and visual information provoked a significant response above baseline. These results underline the importance of the STS region in social communication. Chapter 4 moved on to study the audiovisual perception of gender. Using a set of novel stimuli, which were not only dynamic but also morphed in both modalities, we investigated whether different combinations of gender information in the face and voice could affect participants' perception of gender. We found that participants indeed combined both sources of information when categorising gender, with their decisions reflecting information contained in both modalities. However, this combination was not entirely equal: in this experiment, gender information from the voice appeared to dominate over that from the face, exerting a stronger modulating effect on categorisation. This result was supported by findings from conditions which directed attention, where we observed that participants were able to ignore face but not voice information, and by reaction time results, where latencies were generally a reflection of the voice morph. Overall, these results support interactions between face and voice in gender perception, but demonstrate that (due to a number of probable factors) one modality can exert more influence than the other. Finally, in Chapter 5 we investigated the proposed interactions between affective content in the face and voice. Specifically, we used a 'continuous carry-over' design, again in conjunction with dynamic, morphed stimuli, which allowed us to investigate not only 'direct' effects of different sets of audiovisual stimuli (e.g., congruent, incongruent), but also adaptation effects (in particular, the effect of emotion expressed in one modality upon the response to emotion expressed in the other modality). In parallel with the behavioural results, which showed that the crossmodal context affected the time taken to categorise emotion, we observed a significant crossmodal effect in the right pSTS which was independent of any within-modality adaptation. We propose that this result provides strong evidence that this region may be composed of genuinely multisensory neurons, as opposed to two interdigitated sets of neurons each responsive to information from one modality or the other. Furthermore, an analysis investigating stimulus congruence showed that the degree of incongruence modulated activity across the right STS, further indicating that the neural response in this region can be altered depending on the particular combination of affective information contained within the face and voice. Overall, both the behavioural and cerebral results from this study suggested that participants integrated emotion from the face and voice.
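
    As a rough, hypothetical illustration of how the relative weighting of face and voice cues in gender categorisation could be quantified (this is not the analysis reported in the thesis), one can fit a logistic model of the categorisation responses on the morph level of each modality and compare the fitted weights. All data below are simulated.

    # Simulated example: comparing face and voice cue weights in gender categorisation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n_trials = 500
    face_morph = rng.uniform(0, 1, n_trials)    # 0 = clearly female face, 1 = clearly male face
    voice_morph = rng.uniform(0, 1, n_trials)   # same scale for the voice morph
    # Simulate a voice-dominant observer (weights chosen purely for illustration).
    logit = 1.0 * (face_morph - 0.5) + 4.0 * (voice_morph - 0.5)
    responded_male = rng.uniform(size=n_trials) < 1 / (1 + np.exp(-logit))

    X = np.column_stack([face_morph, voice_morph])
    model = LogisticRegression().fit(X, responded_male)
    w_face, w_voice = model.coef_[0]
    print(f"face weight = {w_face:.2f}, voice weight = {w_voice:.2f}")  # voice weight comes out larger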

    Neural reflections of meaning in gesture, language, and action


    The anatomical substrates of feature integration during object processing.

    Objects can be identified from a number of perceptual attributes, including visual, auditory and tactile sensory input. The integration of these perceptual attributes constitutes our semantic knowledge of an object representation. This research uses functional neuroimaging to investigate the brain areas that integrate perceptual features into an object representation, and how these regions are modulated by stimulus- and task-specific features. A series of experiments is reported that utilises different types of perceptual integration, both within and across sensory modalities. These include 1) the integration of visual form with colour, 2) the integration of visual and auditory object features, and 3) the integration of visual and tactile abstract shapes. Across these experiments I have also manipulated additional factors, including the meaning of the perceptual information (meaningful objects versus meaningless shapes), the verbal or non-verbal nature of the perceptual inputs (e.g. spoken words versus environmental sounds) and the congruency of crossmodal inputs. These experiments have identified a network of brain regions both common to, and selective for, different types of object feature integration. For instance, I have identified a common bilateral network involved in the integration and association of crossmodal audiovisual objects and intra-modal auditory or visual object pairs. However, I have also determined that activation in response to the same concepts can be modulated by the type of stimulus input (verbal versus nonverbal), the timing of those inputs (simultaneous versus sequential presentation), and the congruency of stimulus pairs (congruent versus incongruent). Taken together, the results from these experiments demonstrate modulations of neuronal activation by different object attributes at multiple levels of the object processing hierarchy, from early sensory processing through to stored object representations. Critically, these differential effects have been observed even with the same conceptual stimuli. Together, these findings highlight the need for a model of object feature processing that can account for the functional demands that elicit these anatomical differences.
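
    The abstract does not spell out how integration regions were defined, so the following is only a generic sketch of two criteria commonly used in fMRI studies of multisensory integration: the max criterion (the audiovisual response exceeds the stronger unimodal response) and superadditivity (the audiovisual response exceeds the sum of the unimodal responses). The GLM betas below are random placeholders, not data from this thesis.

    # Generic sketch of common multisensory-integration criteria applied to per-voxel GLM betas.
    import numpy as np

    rng = np.random.default_rng(2)
    n_voxels = 1000
    beta_a = rng.normal(1.0, 0.5, n_voxels)     # hypothetical auditory-condition betas
    beta_v = rng.normal(1.0, 0.5, n_voxels)     # hypothetical visual-condition betas
    beta_av = rng.normal(1.8, 0.5, n_voxels)    # hypothetical audiovisual-condition betas

    max_criterion = beta_av > np.maximum(beta_a, beta_v)    # AV > max(A, V)
    superadditive = beta_av > beta_a + beta_v               # AV > A + V
    print(f"{max_criterion.mean():.1%} of voxels pass the max criterion, "
          f"{superadditive.mean():.1%} are superadditive")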

    Multi-Level Audio-Visual Interactions in Speech and Language Perception

    That we perceive our environment as a unified scene rather than as individual streams of auditory, visual, and other sensory information has recently provided motivation to move past the long-held tradition of studying these systems separately. Although the senses are each unique in their transduction organs, neural pathways, and primary cortical areas, they are ultimately merged in a meaningful way which allows us to navigate the multisensory world. Investigating how the senses are merged has become an increasingly wide field of research in recent decades, with the introduction and increased availability of neuroimaging techniques. Areas of study range from multisensory object perception to cross-modal attention, multisensory interactions, and integration. This thesis focuses on audio-visual speech perception, with particular emphasis on the facilitatory effects of visual information on auditory processing. When visual information is concordant with auditory information, it provides an advantage that is measurable in behavioral response times and auditory evoked fields (Chapter 3) and in increased entrainment to multisensory periodic stimuli, reflected by steady-state responses (Chapter 4). When the audio-visual information is incongruent, the two inputs can often, but not always, combine to form a third percept that is not physically present (known as the McGurk effect). This effect is investigated using real-word stimuli (Chapter 5). McGurk percepts were not robustly elicited for a majority of stimulus types, but the patterns of responses suggest that the physical and lexical properties of the auditory and visual stimuli may affect the likelihood of obtaining the illusion. Together, these experiments add to the growing body of knowledge suggesting that audio-visual interactions occur at multiple stages of processing.
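
    As a hedged illustration of the kind of steady-state measure mentioned above (the thesis's exact analysis pipeline is not given in the abstract), entrainment to a periodic stimulus is often quantified as spectral power at the stimulation frequency relative to neighbouring frequency bins. The signal below is a synthetic stand-in for a single MEG/EEG channel.

    # Synthetic example: steady-state response strength at the stimulation frequency.
    import numpy as np

    fs = 1000.0                      # sampling rate in Hz (assumed)
    f_stim = 3.0                     # hypothetical stimulation frequency in Hz
    t = np.arange(0, 10, 1 / fs)     # 10 s of data
    noise = np.random.default_rng(3).normal(0, 1, t.size)
    signal = 0.5 * np.sin(2 * np.pi * f_stim * t) + noise   # toy "channel" with entrainment

    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(t.size, d=1 / fs)
    idx = np.argmin(np.abs(freqs - f_stim))                 # bin closest to f_stim
    neighbours = np.r_[power[idx - 12:idx - 2], power[idx + 3:idx + 13]]
    snr = power[idx] / neighbours.mean()                    # power at f_stim vs. neighbouring bins
    print(f"steady-state SNR at {f_stim} Hz: {snr:.1f}")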

    Neurocognitive mechanisms of audiovisual speech perception

    Face-to-face communication involves both hearing and seeing speech. Heard and seen speech inputs interact during audiovisual speech perception. Specifically, seeing the speaker's mouth and lip movements improves the identification of acoustic speech stimuli, especially in noisy conditions. In addition, visual speech may even change the auditory percept; this occurs when mismatching auditory speech is dubbed onto visual articulation. Research on the brain mechanisms of audiovisual perception aims at revealing where, when and how inputs from different modalities interact. In this thesis, functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG) and behavioral methods were used to study the neurocognitive mechanisms of audiovisual speech perception. The results suggest that interactions during audiovisual and visual speech perception affect auditory speech processing at early levels of the processing hierarchy. The results also suggest that auditory and visual speech inputs interact in motor cortical areas involved in speech production. Some of these regions are part of the "mirror neuron system" (MNS), a specialized primate cerebral system that couples two fundamental processes: motor action execution and perception. It is suggested that this action-perception coupling mechanism might be involved in the audiovisual integration of speech.