670 research outputs found
Polyphonic Sound Event Detection by using Capsule Neural Networks
Artificial sound event detection (SED) has the aim to mimic the human ability
to perceive and understand what is happening in the surroundings. Nowadays,
Deep Learning offers valuable techniques for this goal such as Convolutional
Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has
been recently introduced in the image processing field with the intent to
overcome some of the known limitations of CNNs, specifically regarding the
scarce robustness to affine transformations (i.e., perspective, size,
orientation) and the detection of overlapped images. This motivated the authors
to employ CapsNets to deal with the polyphonic-SED task, in which multiple
sound events occur simultaneously. Specifically, we propose to exploit the
capsule units to represent a set of distinctive properties for each individual
sound event. Capsule units are connected through a so-called "dynamic routing"
that encourages learning part-whole relationships and improves the detection
performance in a polyphonic context. This paper reports extensive evaluations
carried out on three publicly available datasets, showing how the CapsNet-based
algorithm not only outperforms standard CNNs but also allows to achieve the
best results with respect to the state of the art algorithms
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
Humans can robustly recognize and localize objects by integrating visual and
auditory cues. While machines are able to do the same now with images, less
work has been done with sounds. This work develops an approach for dense
semantic labelling of sound-making objects, purely based on binaural sounds. We
propose a novel sensor setup and record a new audio-visual dataset of street
scenes with eight professional binaural microphones and a 360 degree camera.
The co-existence of visual and audio cues is leveraged for supervision
transfer. In particular, we employ a cross-modal distillation framework that
consists of a vision `teacher' method and a sound `student' method -- the
student method is trained to generate the same results as the teacher method.
This way, the auditory system can be trained without using human annotations.
We also propose two auxiliary tasks namely, a) a novel task on Spatial Sound
Super-resolution to increase the spatial resolution of sounds, and b) dense
depth prediction of the scene. We then formulate the three tasks into one
end-to-end trainable multi-tasking network aiming to boost the overall
performance. Experimental results on the dataset show that 1) our method
achieves promising results for semantic prediction and the two auxiliary tasks;
and 2) the three tasks are mutually beneficial -- training them together
achieves the best performance and 3) the number and orientations of microphones
are both important. The data and code will be released to facilitate the
research in this new direction.Comment: Project page:
https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
Recommended from our members
A database and challenge for acoustic scene classification and event detection
Sound Event Localization, Detection, and Tracking by Deep Neural Networks
In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize and further track their motion individually in an acoustic environment. This ability of humans makes them context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of social and human activities around them and enable machines to be context-aware similar to humans. Such methods can be employed to assist the hearing impaired to visualize sounds, for robot navigation, and to monitor biodiversity, the home, and cities.
A real-life acoustic scene is complex in nature, with multiple sound events that are temporally and spatially overlapping, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class, for example, a car horn can have a lot of variabilities, i.e., different cars have different horns, and within the same model of the car, the duration and the temporal structure of the horn sound is driver dependent. Performing SELDT in such overlapping and dynamic sound scenes while being robust is challenging for machines. Hence we propose to investigate the SELDT task in this thesis and use a data-driven approach using deep neural networks (DNNs).
The sound event detection (SED) task requires the detection of onset and offset time for individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED using two different DNNs, recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that using multichannel audio features improves the SED performance for overlapping sound events in comparison to traditional single-channel audio features. The proposed novel features and methods produced state-of-the-art performance for the real-life SED task and won the IEEE AASP DCASE challenge consecutively in 2016 and 2017.
Sound event localization is the task of spatially locating the position of individual sound events. Traditionally, this has been approached using parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization in complete azimuth and elevation space. In comparison to parametric methods which require the information of the number of active sources, the proposed method learns this information directly from the input data and estimates their respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios.
Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location with time, thus producing the SELDT results. This is the first DNN-based SELDT method and is shown to perform equally with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets that represent anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, a different number of overlapping sound events and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving
Binaural sound source localisation using a Bayesian-network-based blackboard system and hypothesis-driven feedback
An essential aspect of Auditory Scene Analysis is the localisation of sound sources in relation to the
position of the listener in the surrounding environment. The human auditory system is capable of
precisely locating and separating different sound sources, even in noisy and reverberant environments,
whereas mimicking this ability by computational means is still a challenging task. In this work, we
investigate a Bayesian-network-based approach in the context of binaural sound source localisation.
We extend existing solutions towards a Bayesian network based blackboard system that includes expert
knowledge inspired by insights into the human auditory system. In order to improve estimation
of source positions and reduce uncertainty caused by front-back ambiguities, hypothesis-driven feedback
is used. This is accomplished by triggering head movements based on inference results provided
by the Bayesian network. We evaluate the performance of our approach in comparison to existing
solutions in a sound-source localisation task within a virtual acoustic environment
No Link Between Speech-in-Noise Perception and Auditory Sensory Memory - Evidence From a Large Cohort of Older and Younger Listeners
A growing literature is demonstrating a link between working memory (WM) and speech-in-noise (SiN) perception. However, the nature of this correlation and which components of WM might underlie it, are being debated. We investigated how SiN reception links with auditory sensory memory (aSM) - the low-level processes that support the short-term maintenance of temporally unfolding sounds. A large sample of old (N = 199, 60-79 yo) and young (N = 149, 20-35 yo) participants was recruited online and performed a coordinate response measure-based speech-in-babble task that taps listeners' ability to track a speech target in background noise. We used two tasks to investigate implicit and explicit aSM. Both were based on tone patterns overlapping in processing time scales with speech (presentation rate of tones 20 Hz; of patterns 2 Hz). We hypothesised that a link between SiN and aSM may be particularly apparent in older listeners due to age-related reduction in both SiN reception and aSM. We confirmed impaired SiN reception in the older cohort and demonstrated reduced aSM performance in those listeners. However, SiN and aSM did not share variability. Across the two age groups, SiN performance was predicted by a binaural processing test and age. The results suggest that previously observed links between WM and SiN may relate to the executive components and other cognitive demands of the used tasks. This finding helps to constrain the search for the perceptual and cognitive factors that explain individual variability in SiN performance
- …