An Object-Based Interpretation of Audiovisual Processing
Visual cues help listeners follow conversation in a complex acoustic environment. Many audiovisual research studies focus on how sensory cues are combined to optimize perception, either in terms of minimizing the uncertainty in the sensory estimate or maximizing intelligibility, particularly in speech understanding. From an auditory perception perspective, a fundamental question that has not been fully addressed is how visual information aids the ability to select and focus on one auditory object in the presence of competing sounds in a busy auditory scene. In this chapter, audiovisual integration is presented from an object-based attention viewpoint. In particular, it is argued that a stricter delineation of the concepts of multisensory integration versus binding would facilitate a deeper understanding of the nature of how information is combined across senses. Furthermore, using an object-based theoretical framework to distinguish binding as a distinct form of multisensory integration generates testable hypotheses with behavioral predictions that can account for different aspects of multisensory interactions. In this chapter, classic multisensory illusion paradigms are revisited and discussed in the context of multisensory binding. The chapter also describes multisensory experiments that focus on addressing how visual stimuli help listeners parse complex auditory scenes. Finally, it concludes with a discussion of the potential mechanisms by which audiovisual processing might resolve competition between concurrent sounds in order to solve the cocktail party problem.
Egocentric Auditory Attention Localization in Conversations
In a noisy conversation environment such as a dinner party, people often
exhibit selective auditory attention, or the ability to focus on a particular
speaker while tuning out others. Recognizing who somebody is listening to in a
conversation is essential for developing technologies that can understand
social behavior and devices that can augment human hearing by amplifying
particular sound sources. The computer vision and audio research communities
have made great strides towards recognizing sound sources and speakers in
scenes. In this work, we take a step further by focusing on the problem of
localizing auditory attention targets in egocentric video, or detecting who in
a camera wearer's field of view they are listening to. To tackle the new and
challenging Selective Auditory Attention Localization problem, we propose an
end-to-end deep learning approach that uses egocentric video and multichannel
audio to predict the heatmap of the camera wearer's auditory attention. Our
approach leverages spatiotemporal audiovisual features and holistic reasoning
about the scene to make predictions, and outperforms a set of baselines on a
challenging multi-speaker conversation dataset. Project page:
https://fkryan.github.io/saa
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in
multimedia search engines, we have identified and analyzed gaps within European research effort during our second year.
In this period we focused on three directions, notably technological issues, user-centred issues and use-cases, and socio-economic
and legal aspects. These were assessed by two central studies: firstly, a concerted vision of the functional breakdown
of a generic multimedia search engine, and secondly, representative use-case descriptions with the related discussion on
requirements for technological challenges. Both studies have been carried out in cooperation and consultation with the
community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our
Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as
National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core
technological gaps that involve research challenges, and "enablers", which are not necessarily technical research
challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal
challenges.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we inventory the impact and legal consequences of these technical advances and point out future directions of research.
The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
Dissociable neural correlates of multisensory coherence and selective attention
Previous work has demonstrated that performance in an auditory selective attention task can be enhanced or impaired, depending on whether a task-irrelevant visual stimulus is temporally coherent with a target auditory stream or with a competing distractor. However, it remains unclear how audiovisual (AV) temporal coherence and auditory selective attention interact at the neurophysiological level. Here, we measured neural activity using electroencephalography (EEG) while human participants (men and women) performed an auditory selective attention task, detecting deviants in a target audio stream. The amplitude envelope of the two competing auditory streams changed independently, while the radius of a visual disc was manipulated to control the audiovisual coherence. Analysis of the neural responses to the sound envelope demonstrated that auditory responses were enhanced independently of the attentional condition: both target and masker stream responses were enhanced when temporally coherent with the visual stimulus. In contrast, attention enhanced the event-related response (ERP) evoked by the transient deviants, independently of AV coherence. Finally, in an exploratory analysis, we identified a spatiotemporal component of ERP, in which temporal coherence enhanced the deviant-evoked responses only in the unattended stream. These results provide evidence for dissociable neural signatures of bottom-up (coherence) and top-down (attention) effects in AV object formation.
Significance Statement: Temporal coherence between auditory stimuli and task-irrelevant visual stimuli can enhance behavioral performance in auditory selective attention tasks. However, how audiovisual temporal coherence and attention interact at the neural level has not been established. Here, we measured EEG during a behavioral task designed to independently manipulate AV coherence and auditory selective attention. While some auditory features (sound envelope) could be coherent with visual stimuli, other features (timbre) were independent of visual stimuli. We find that audiovisual integration can be observed independently of attention for sound envelopes temporally coherent with visual stimuli, while the neural responses to unexpected timbre changes are most strongly modulated by attention. Our results provide evidence for dissociable neural mechanisms of bottom-up (coherence) and top-down (attention) effects on AV object formation.
Multimedia information technology and the annotation of video
The state of the art in multimedia information technology has not progressed to the point where a single solution is available to meet all reasonable needs of documentalists and users of video archives. In general, we do not have an optimistic view of the usability of new technology in this domain, but digitization and digital power can be expected to cause a small revolution in the area of video archiving. The volume of data leads to two views of the future: on the pessimistic side, overload of data will cause lack of annotation capacity, and on the optimistic side, there will be enough data from which to learn selected concepts that can be deployed to support automatic annotation. At the threshold of this interesting era, we make an attempt to describe the state of the art in technology. We sample the progress in text, sound, and image processing, as well as in machine learning.
Binaural sound source localisation using a Bayesian-network-based blackboard system and hypothesis-driven feedback
An essential aspect of Auditory Scene Analysis is the localisation of sound sources in relation to the
position of the listener in the surrounding environment. The human auditory system is capable of
precisely locating and separating different sound sources, even in noisy and reverberant environments,
whereas mimicking this ability by computational means is still a challenging task. In this work, we
investigate a Bayesian-network-based approach in the context of binaural sound source localisation.
We extend existing solutions towards a Bayesian network based blackboard system that includes expert
knowledge inspired by insights into the human auditory system. In order to improve estimation
of source positions and reduce uncertainty caused by front-back ambiguities, hypothesis-driven feedback
is used. This is accomplished by triggering head movements based on inference results provided
by the Bayesian network. We evaluate the performance of our approach in comparison to existing
solutions in a sound-source localisation task within a virtual acoustic environment.
A Visionary Approach to Listening: Determining The Role Of Vision In Auditory Scene Analysis
To recognize and understand the auditory environment, the listener must first separate sounds that arise from different sources and capture each event. This process is known as auditory scene analysis. The aim of this thesis is to investigate whether and how visual information can influence auditory scene analysis. The thesis consists of four chapters. Firstly, I reviewed the literature to give a clear framework about the impact of visual information on the analysis of complex acoustic environments. In chapter II, I examined psychophysically whether temporal coherence between auditory and visual stimuli was sufficient to promote auditory stream segregation in a mixture. I have found that listeners were better able to report brief deviants in an amplitude modulated target stream when a visual stimulus changed in size in a temporally coherent manner than when the visual stream was coherent with the non-target auditory stream. This work demonstrates that temporal coherence between auditory and visual features can influence the way people analyse an auditory scene. In chapter III, the integration of auditory and visual features in auditory cortex was examined by recording neuronal responses in awake and anaesthetised ferret auditory cortex in response to the modified stimuli used in Chapter II. I demonstrated that temporal coherence between auditory and visual stimuli enhances the neural representation of a sound and influences which sound a neuron represents in a sound mixture. Visual stimuli elicited reliable changes in the phase of the local field potential which provides mechanistic insight into this finding. Together these findings provide evidence that early cross modal integration underlies the behavioural effects in chapter II. 
Finally, in chapter IV, I investigated whether training can influence the ability of listeners to utilize visual cues for auditory stream analysis and showed that this ability improved by training listeners to detect auditory-visual temporal coherence.