
    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Deep learning now offers valuable techniques for this goal, such as convolutional neural networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field with the intent to overcome some known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and to overlapping objects. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" procedure that encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
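    The "dynamic routing" the abstract refers to is the routing-by-agreement procedure of capsule networks, in which lower-level capsules send their output preferentially to the higher-level capsules that agree with their predictions. Below is a minimal NumPy sketch of that procedure for intuition only; the shapes, iteration count, and variable names are illustrative assumptions and not the paper's implementation.

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        """Capsule non-linearity: keeps vector orientation, maps its norm into (0, 1)."""
        sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
        return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

    def dynamic_routing(u_hat, n_iters=3):
        """Routing-by-agreement between two capsule layers.

        u_hat: prediction ("vote") vectors of shape (n_in, n_out, dim_out).
        Returns the higher-level capsule outputs of shape (n_out, dim_out).
        """
        n_in, n_out, _ = u_hat.shape
        b = np.zeros((n_in, n_out))                               # routing logits
        for _ in range(n_iters):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
            s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum of votes
            v = squash(s)                                          # higher-level capsule outputs
            b = b + np.einsum('ijk,jk->ij', u_hat, v)              # reward agreeing votes
        return v

    # Toy usage: 8 lower-level capsules voting for 4 event capsules of dimension 16;
    # the norm of each output vector can be read as the activity of that event.
    votes = 0.1 * np.random.randn(8, 4, 16)
    events = dynamic_routing(votes)
    print(events.shape)  # (4, 16)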

    Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

    Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can already do this with images, far less work has addressed sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision `teacher' method and a sound `student' method -- the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks, namely a) a novel task on spatial sound super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; 2) the three tasks are mutually beneficial -- training them together achieves the best performance; and 3) the number and orientations of microphones are both important. The data and code will be released to facilitate research in this new direction. Project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
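    The cross-modal distillation idea described above can be illustrated with a short, hedged sketch: a frozen vision "teacher" produces per-pixel semantic pseudo-labels from the camera frame, and an audio "student" operating on the binaural spectrograms is trained to match them with a distillation loss. The placeholder networks, shapes, and the assumption that the audio and image feature maps share the same spatial grid are simplifications for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical placeholder networks; the paper's architectures are different.
    class VisionTeacher(nn.Module):
        def __init__(self, n_classes=19):
            super().__init__()
            self.net = nn.Conv2d(3, n_classes, kernel_size=1)  # stand-in for a segmentation model
        def forward(self, img):
            return self.net(img)                               # (B, C, H, W) logits

    class AudioStudent(nn.Module):
        def __init__(self, n_mics=8, n_classes=19):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(n_mics, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, n_classes, 1),
            )
        def forward(self, spec):
            return self.net(spec)                              # (B, C, H, W) logits

    def distillation_loss(student_logits, teacher_logits, T=1.0):
        """KL divergence between softened teacher and student class distributions."""
        p_teacher = F.softmax(teacher_logits / T, dim=1)
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * T * T

    teacher, student = VisionTeacher().eval(), AudioStudent()
    img = torch.randn(2, 3, 64, 64)    # frame from the 360-degree camera
    spec = torch.randn(2, 8, 64, 64)   # spectrograms from the eight binaural channels
    with torch.no_grad():
        t_logits = teacher(img)        # pseudo-labels: no human annotation required
    loss = distillation_loss(student(spec), t_logits)
    loss.backward()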

    Sound event detection with recurrent neural networks


    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and further track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact naturally with their surroundings. Developing similar methods for machines will provide an automatic description of the social and human activities around them and enable machines to be context-aware in a similar way. Such methods can be employed to assist the hearing impaired in visualizing sounds, for robot navigation, and to monitor biodiversity, the home, and cities.

    A real-life acoustic scene is complex in nature, with multiple sound events that are temporally and spatially overlapping, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class exhibits considerable variability; for example, different cars have different horns, and even for the same car model, the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, this thesis investigates the SELDT task with a data-driven approach based on deep neural networks (DNNs).

    The sound event detection (SED) task requires detecting the onset and offset times of individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED with two different DNNs: recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that multichannel audio features improve SED performance for overlapping sound events in comparison to traditional single-channel audio features. The proposed features and methods produced state-of-the-art performance for the real-life SED task and won the IEEE AASP DCASE challenge in both 2016 and 2017.

    Sound event localization is the task of spatially locating the position of individual sound events. Traditionally, this has been approached using parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization in the complete azimuth and elevation space. In contrast to parametric methods, which require knowledge of the number of active sources, the proposed method learns this information directly from the input data and estimates the respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios.

    Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method and is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets that represent anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.
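    As a rough illustration of the CRNN family of models used for SED in the thesis, the sketch below stacks convolutional blocks that pool only along frequency (preserving the frame rate), a recurrent layer over time, and a sigmoid output so that several event classes can be active in the same frame. Layer sizes, feature dimensions, and class counts are arbitrary placeholders rather than the thesis configurations.

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        """Sketch of a convolutional recurrent network for frame-wise SED.

        Input:  (batch, n_channels, n_frames, n_mel) multichannel spectrogram features.
        Output: (batch, n_frames, n_classes) frame-wise event activity probabilities.
        """
        def __init__(self, n_channels=4, n_mel=40, n_classes=6):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(n_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d((1, 5)),   # pool frequency only, keep the frame resolution
                nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d((1, 4)),
            )
            self.gru = nn.GRU(64 * (n_mel // 20), 64, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):
            z = self.conv(x)                              # (B, 64, T, n_mel // 20)
            b, c, t, f = z.shape
            z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
            z, _ = self.gru(z)                            # temporal context across frames
            return torch.sigmoid(self.fc(z))              # multi-label: overlapping events allowed

    model = CRNN()
    feats = torch.randn(2, 4, 100, 40)   # e.g. 4-channel log-mel features, 100 frames
    print(model(feats).shape)            # torch.Size([2, 100, 6])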

    Binaural sound source localisation using a Bayesian-network-based blackboard system and hypothesis-driven feedback

    An essential aspect of Auditory Scene Analysis is the localisation of sound sources relative to the position of the listener in the surrounding environment. The human auditory system is capable of precisely locating and separating different sound sources, even in noisy and reverberant environments, whereas mimicking this ability by computational means is still a challenging task. In this work, we investigate a Bayesian-network-based approach to binaural sound source localisation. We extend existing solutions towards a Bayesian-network-based blackboard system that includes expert knowledge inspired by insights into the human auditory system. In order to improve the estimation of source positions and reduce the uncertainty caused by front-back ambiguities, hypothesis-driven feedback is used. This is accomplished by triggering head movements based on inference results provided by the Bayesian network. We evaluate the performance of our approach in comparison to existing solutions in a sound-source localisation task within a virtual acoustic environment.
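    The hypothesis-driven feedback loop can be illustrated with a toy example: interaural cues alone confuse a source at azimuth a with its front-back mirror at 180 - a, but after a triggered head rotation the two hypotheses predict different cue changes, so a Bayesian update over candidate azimuths resolves the ambiguity. The sketch below is a deliberately simplified stand-in (an idealised sine cue model on a discrete azimuth grid), not the paper's Bayesian-network blackboard system.

    import numpy as np

    azimuths = np.arange(0.0, 360.0, 5.0)                # candidate source azimuths (degrees)
    posterior = np.full(len(azimuths), 1.0 / len(azimuths))

    def lateral_cue(azimuth_deg):
        """Idealised interaural cue that depends only on the lateral angle,
        hence identical for a source and its front-back mirror image."""
        return np.sin(np.deg2rad(azimuth_deg))

    def update(posterior, observed_cue, head_yaw, sigma=0.05):
        """Bayes update of the azimuth posterior from one cue observation."""
        predicted = lateral_cue(azimuths - head_yaw)     # azimuth relative to the rotated head
        likelihood = np.exp(-0.5 * ((observed_cue - predicted) / sigma) ** 2)
        posterior = posterior * likelihood
        return posterior / posterior.sum()

    true_azimuth = 30.0                                  # ground truth, unknown to the model
    rng = np.random.default_rng(1)
    for head_yaw in (0.0, 0.0, 20.0):                    # the third observation follows a head turn
        obs = lateral_cue(true_azimuth - head_yaw) + rng.normal(0, 0.05)
        posterior = update(posterior, obs, head_yaw)
        print(f"yaw={head_yaw:5.1f}  MAP azimuth={azimuths[np.argmax(posterior)]:5.1f}")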

    No Link Between Speech-in-Noise Perception and Auditory Sensory Memory - Evidence From a Large Cohort of Older and Younger Listeners

    A growing literature demonstrates a link between working memory (WM) and speech-in-noise (SiN) perception. However, the nature of this correlation, and which components of WM might underlie it, are still debated. We investigated how SiN reception links with auditory sensory memory (aSM), the low-level processes that support the short-term maintenance of temporally unfolding sounds. A large sample of older (N = 199, aged 60-79) and younger (N = 149, aged 20-35) participants was recruited online and performed a coordinate response measure-based speech-in-babble task that taps listeners' ability to track a speech target in background noise. We used two tasks to investigate implicit and explicit aSM. Both were based on tone patterns that overlap in processing time scales with speech (presentation rate of tones: 20 Hz; of patterns: 2 Hz). We hypothesised that a link between SiN and aSM might be particularly apparent in older listeners due to age-related reductions in both SiN reception and aSM. We confirmed impaired SiN reception in the older cohort and demonstrated reduced aSM performance in those listeners. However, SiN and aSM did not share variability. Across the two age groups, SiN performance was predicted by a binaural processing test and by age. The results suggest that previously observed links between WM and SiN may relate to the executive components and other cognitive demands of the tasks used. This finding helps to constrain the search for the perceptual and cognitive factors that explain individual variability in SiN performance.
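    For a concrete sense of the time scales quoted above (tones at 20 Hz, patterns at 2 Hz), the sketch below generates a rapid sequence of 50-ms tone pips in which a 10-tone pattern repeats every 500 ms. The sampling rate, frequency pool, and pattern length are assumptions for illustration and do not reproduce the study's actual stimuli.

    import numpy as np

    fs = 44100
    tone_rate_hz, pattern_rate_hz = 20, 2
    tone_dur = 1 / tone_rate_hz                          # 50-ms tone pips
    tones_per_pattern = tone_rate_hz // pattern_rate_hz  # 10 tones per 500-ms pattern cycle
    freq_pool = np.geomspace(200, 2000, 20)              # hypothetical log-spaced frequency pool

    def tone(freq, dur=tone_dur, ramp=0.005):
        """One pure-tone pip with short raised-cosine ramps to avoid clicks."""
        t = np.arange(int(fs * dur)) / fs
        x = np.sin(2 * np.pi * freq * t)
        n = int(fs * ramp)
        env = np.ones_like(x)
        env[:n] = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
        env[-n:] = env[:n][::-1]
        return x * env

    rng = np.random.default_rng(0)
    pattern = rng.choice(freq_pool, tones_per_pattern, replace=False)  # one regular pattern
    signal = np.concatenate([tone(f) for f in np.tile(pattern, 4)])    # pattern repeats at 2 Hz
    print(signal.shape, len(signal) / fs, "seconds")                   # 4 cycles -> 2 s of audio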