Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
This paper addresses the problem of localizing audio sources using binaural
measurements. We propose a supervised formulation that simultaneously localizes
multiple sources at different locations. The approach is intrinsically
efficient because, contrary to prior work, it relies neither on source
separation, nor on monaural segregation. The method starts with a training
stage that establishes a locally-linear Gaussian regression model between the
directional coordinates of all the sources and the auditory features extracted
from binaural measurements. While fixed-length wide-spectrum sounds (white
noise) are used for training to reliably estimate the model parameters, we show
that the testing (localization) can be extended to variable-length
sparse-spectrum sounds (such as speech), thus enabling a wide range of
realistic applications. Indeed, we demonstrate that the method can be used for
audio-visual fusion, namely to map speech signals onto images and hence to
spatially align the audio and visual modalities, thus making it possible to
discriminate between speaking and non-speaking faces. We release a novel corpus
of real-room recordings that allows quantitative evaluation of the
co-localization method in the presence of one or two sound sources. Experiments
demonstrate increased accuracy and speed relative to several state-of-the-art
methods.
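At its core, the training stage fits a mixture of local linear maps: a Gaussian
mixture partitions the binaural feature space, and each component carries its
own linear predictor of source direction. A minimal Python sketch of this kind
of locally-linear regression follows, assuming precomputed feature vectors and
known training directions; the component count K and the use of scikit-learn
estimators are illustrative assumptions, not the paper's exact formulation.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.linear_model import LinearRegression

    K = 16  # number of local linear regions (assumption)

    def fit(features, directions):
        """features: (N, D) binaural features; directions: (N, 2) angles."""
        gmm = GaussianMixture(n_components=K, covariance_type="full")
        gmm.fit(features)
        resp = gmm.predict_proba(features)  # (N, K) responsibilities
        # One responsibility-weighted linear expert per mixture component.
        experts = [LinearRegression().fit(features, directions,
                                          sample_weight=resp[:, k])
                   for k in range(K)]
        return gmm, experts

    def predict(gmm, experts, features):
        """Responsibility-weighted mixture of local linear predictions."""
        resp = gmm.predict_proba(features)                 # (M, K)
        preds = np.stack([e.predict(features) for e in experts], axis=1)
        return (resp[:, :, None] * preds).sum(axis=1)      # (M, 2)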
Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation
Binaural stereo audio is recorded by imitating the way the human ear receives
sound, which provides people with an immersive listening experience. Existing
approaches leverage autoencoders and directly exploit visual spatial
information to synthesize binaural stereo, resulting in a limited
representation of visual guidance. For the first time, we propose a visually
guided generative adversarial approach for generating binaural stereo audio
from mono audio. Specifically, we develop a Stereo Audio Generation Model
(SAGM), which uses shared spatio-temporal visual information to guide the
generator and the discriminator separately. The shared visual information is
updated alternately during adversarial training, allowing the generator and
the discriminator to exchange their respective guided knowledge through the
shared visual representation. The proposed method learns bidirectional
complementary visual information, which facilitates the expression of visual
guidance in generation. In addition, spatial perception is a crucial attribute
of binaural stereo audio, and thus the evaluation of stereo spatial perception
is essential. However, previous metrics failed to measure the spatial
perception of audio. To this end, a metric to measure the spatial perception of
audio is proposed for the first time. The proposed metric is capable of
measuring the magnitude and direction of spatial perception in the temporal
dimension. Given its function, it can to some extent substitute for demanding
user studies. The proposed method achieves state-of-the-art performance on two
datasets and five evaluation metrics. Qualitative experiments and user studies
demonstrate that the method generates spatially realistic stereo audio.
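As a rough illustration of the conditioning scheme, the PyTorch sketch below
feeds a shared visual embedding to both the generator and the discriminator.
All layer sizes, the frame-level spectrogram interface, and the module names
are illustrative assumptions, not the paper's SAGM architecture.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        def __init__(self, audio_dim=512, visual_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(audio_dim + visual_dim, 1024), nn.ReLU(),
                nn.Linear(1024, 2 * audio_dim),  # left/right spectrogram frame
            )

        def forward(self, mono, visual):
            # Mono audio frame plus shared visual embedding -> stereo frame.
            return self.net(torch.cat([mono, visual], dim=-1))

    class Discriminator(nn.Module):
        def __init__(self, audio_dim=512, visual_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * audio_dim + visual_dim, 512), nn.ReLU(),
                nn.Linear(512, 1),  # real/fake logit for a stereo frame
            )

        def forward(self, stereo, visual):
            # The same visual embedding also conditions the discriminator.
            return self.net(torch.cat([stereo, visual], dim=-1))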
Inhibiting the inhibition
The precedence effect describes the phenomenon whereby echoes are spatially fused to the location of an initial sound by selectively suppressing the directional information of lagging sounds (echo suppression). Echo suppression is a prerequisite for faithful sound localization in natural environments but can break down depending on the behavioral context. To date, the neural mechanisms that suppress echo directional information without suppressing the perception of echoes themselves are not understood. We performed in vivo recordings of neurons of the dorsal nucleus of the lateral lemniscus (DNLL), a GABAergic brainstem nucleus that targets the auditory midbrain, in Mongolian gerbils, and show that these DNLL neurons exhibit inhibition that persists tens of milliseconds beyond the stimulus offset, so-called persistent inhibition (PI). Using in vitro recordings, we demonstrate that PI stems from GABAergic projections from the opposite DNLL. Furthermore, these recordings show that PI is attributable to intrinsic features of this GABAergic innervation. Implementing these physiological findings in a neuronal model of the auditory brainstem demonstrates that, at the circuit level, PI enhances the responsiveness of auditory midbrain cells to lagging sounds. Moreover, the model revealed that such response enhancement is a sufficient cue for an ideal observer to identify echoes and to exhibit echo suppression, which agrees closely with the percepts of human subjects.
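A toy rate model makes the proposed circuit concrete: persistent inhibition
triggered by the leading sound suppresses the DNLL for tens of milliseconds,
so the midbrain response to the lagging sound is disinhibited and therefore
enhanced. The time constants, weights, and rectified-rate dynamics below are
illustrative assumptions, not the parameters of the published model.

    import numpy as np

    dt = 0.1                                 # ms
    t = np.arange(0.0, 100.0, dt)
    lead, lag = 10.0, 40.0                   # onsets of lead and echo (ms)

    drive = np.zeros_like(t)                 # excitation to DNLL and midbrain
    drive[(t >= lead) & (t < lead + 3)] = 1.0
    drive[(t >= lag) & (t < lag + 3)] = 1.0

    # Persistent inhibition (PI) from the opposite DNLL: triggered by the
    # lead, decaying over tens of milliseconds (assumed tau of 30 ms).
    delay, tau_pi = 3.0, 30.0
    pi = np.where(t >= lead + delay,
                  np.exp(-(t - lead - delay) / tau_pi), 0.0)

    dnll = np.clip(drive - pi, 0.0, None)        # DNLL output, cut by PI
    ic = np.clip(drive - 0.8 * dnll, 0.0, None)  # midbrain: drive - inhibition

    lead_resp = ic[(t >= lead) & (t < lead + 3)].max()
    lag_resp = ic[(t >= lag) & (t < lag + 3)].max()
    # The lag response exceeds the lead response: the enhancement cue an
    # ideal observer could use to identify echoes.
    print(f"midbrain response to lead: {lead_resp:.2f}, to lag: {lag_resp:.2f}")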
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
Humans can robustly recognize and localize objects by integrating visual and
auditory cues. While machines are now able to do the same with images, less
work has been done with sounds. This work develops an approach for dense
semantic labelling of sound-making objects, purely based on binaural sounds. We
propose a novel sensor setup and record a new audio-visual dataset of street
scenes with eight professional binaural microphones and a 360 degree camera.
The co-existence of visual and audio cues is leveraged for supervision
transfer. In particular, we employ a cross-modal distillation framework that
consists of a vision 'teacher' method and a sound 'student' method: the student
method is trained to generate the same results as the teacher method.
This way, the auditory system can be trained without using human annotations.
We also propose two auxiliary tasks, namely a) a novel task on Spatial Sound
Super-resolution to increase the spatial resolution of sounds, and b) dense
depth prediction of the scene. We then formulate the three tasks into one
end-to-end trainable multi-tasking network aiming to boost the overall
performance. Experimental results on the dataset show that 1) our method
achieves promising results for semantic prediction and the two auxiliary tasks;
2) the three tasks are mutually beneficial, as training them together achieves
the best performance; and 3) the number and orientations of microphones are
both important. The data and code will be released to facilitate research in
this new direction. Project page:
https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
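The supervision transfer can be sketched as standard knowledge distillation
across modalities: a frozen vision teacher produces per-pixel semantic logits,
and the sound student is trained to match them. The temperature-scaled KL loss
below is a common choice and an assumption here, not necessarily the loss used
in the released code.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        """KL divergence between softened teacher and student predictions."""
        p_teacher = F.softmax(teacher_logits / T, dim=1)
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        return F.kl_div(log_p_student, p_teacher,
                        reduction="batchmean") * T * T

    # Training step (teacher frozen, so no human annotations are needed):
    # with torch.no_grad():
    #     teacher_logits = vision_teacher(image_360)   # placeholder models
    # student_logits = sound_student(binaural_spectrograms)
    # loss = distillation_loss(student_logits, teacher_logits)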
BatVision: Learning to See 3D Spatial Layout with Two Ears
Many species have evolved advanced non-visual perception while artificial
systems fall behind. Radar and ultrasound complement camera-based vision but
they are often too costly and complex to set up for very limited information
gain. In nature, sound is used effectively by bats, dolphins, whales, and
humans for navigation and communication. However, it is unclear how to best
harness sound for machine perception. Inspired by bats' echolocation mechanism,
we design a low-cost BatVision system that is capable of seeing the 3D spatial
layout of space ahead by just listening with two ears. Our system emits short
chirps from a speaker and records returning echoes through microphones set in a
pair of artificial human pinnae. During training, we additionally use a stereo
camera to capture color images for calculating scene depths. We train a model
to predict depth maps and even grayscale images from the sound alone. During
testing, our trained BatVision provides surprisingly good predictions of 2D
visual scenes from two 1D audio signals. Such a sound to vision system would
benefit robot navigation and machine vision, especially in low-light or
no-light conditions. Our code and data are publicly available.
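One plausible realization of such a sound-to-vision mapping is an
encoder-decoder that consumes spectrograms of the two ear channels and emits a
depth map, supervised by the stereo camera during training. The architecture
below is an illustrative sketch, not the published BatVision network.

    import torch
    import torch.nn as nn

    class EchoToDepth(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(  # input: (B, 2, 128, 128) spectrograms
                nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32x32
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16x16
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # depth map
            )

        def forward(self, echo_spectrograms):
            return self.decoder(self.encoder(echo_spectrograms))

    # Training signal comes from stereo-camera depth, e.g.:
    # loss = nn.functional.l1_loss(model(echoes), camera_depth)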
Musical notes classification with Neuromorphic Auditory System using FPGA and a Convolutional Spiking Network
In this paper, we explore the capabilities of a sound
classification system that combines a novel FPGA cochlear
model implementation with a bio-inspired technique based on a
trained convolutional spiking network. The neuromorphic
auditory system that is used in this work produces a form of
representation that is analogous to the spike outputs of the
biological cochlea. The auditory system has been developed using
a set of spike-based processing building blocks in the frequency
domain. They form a set of band-pass filters in the spike domain
that split the audio information into 128 frequency channels, 64
for each of the two audio sources. Address Event Representation
(AER) is used to connect the auditory system to the
convolutional spiking network. A convolutional spiking network
layer is developed and trained on a computer to detect two kinds
of sound: artificial pure tones in the presence of white noise
and electronic musical notes. After the training process, the
presented system is able to distinguish the different sounds in
real time, even in the presence of white noise.
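To make the AER interface concrete, the sketch below classifies a spike stream
from the 64-channel cochlea model with a simple rate-based linear readout,
standing in for the convolutional spiking layer; the (timestamp, channel)
event format and the offline-trained weights are assumptions.

    import numpy as np

    N_CHANNELS = 64  # spike channels per audio source

    def rate_features(events, t_window_ms=100.0):
        """events: iterable of (timestamp_ms, channel) AER tuples."""
        counts = np.zeros(N_CHANNELS)
        for ts, ch in events:
            if ts < t_window_ms:
                counts[ch] += 1
        return counts / t_window_ms      # spikes per ms, per channel

    def classify(events, weights, bias):
        """Linear readout over cochlear spike rates, trained offline.
        weights: (n_classes, N_CHANNELS); bias: (n_classes,)."""
        scores = weights @ rate_features(events) + bias
        return int(np.argmax(scores))    # predicted tone/note index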