
    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of the social and human activities around them and make machines context-aware in a similar way. Such methods can be employed to assist the hearing impaired in visualizing sounds, for robot navigation, and to monitor biodiversity, the home, and cities. A real-life acoustic scene is complex in nature, with multiple sound events that overlap temporally and spatially, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class can vary considerably; for example, different cars have different horns, and even for the same car model, the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, in this thesis we investigate the SELDT task with a data-driven approach based on deep neural networks (DNNs). The sound event detection (SED) task requires detecting the onset and offset times of individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED with two different DNNs: recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that multichannel audio features improve SED performance for overlapping sound events in comparison to traditional single-channel audio features. The proposed features and methods produced state-of-the-art performance on the real-life SED task and won the IEEE AASP DCASE challenge in both 2016 and 2017. Sound event localization is the task of estimating the spatial position of individual sound events. Traditionally, this has been approached with parametric methods. In this thesis, we propose a CRNN for estimating the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization over the full azimuth and elevation space. Unlike parametric methods, which require the number of active sources to be known, the proposed method learns this information directly from the input data and estimates the respective spatial locations. Furthermore, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a single CRNN. This method additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method, and it is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets covering anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.
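As an illustration of the kind of convolutional recurrent architecture described above, the following is a minimal PyTorch sketch of a CRNN with a frame-wise SED head and a DOA head. The input feature shape (multichannel spectrogram), layer sizes, and class count are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

class SELDCRNNSketch(nn.Module):
    """Minimal CRNN sketch: conv blocks -> bidirectional GRU -> SED and DOA heads."""

    def __init__(self, n_channels=4, n_mels=64, n_classes=11, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),   # pool along frequency only, preserving time resolution
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        freq_out = n_mels // 16     # frequency bins remaining after the two pooling steps
        self.gru = nn.GRU(64 * freq_out, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(2 * hidden, n_classes)      # frame-wise event activity
        self.doa_head = nn.Linear(2 * hidden, 2 * n_classes)  # azimuth/elevation per class

    def forward(self, x):           # x: (batch, channels, time, mel)
        z = self.conv(x)            # (batch, 64, time, freq_out)
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
        z, _ = self.gru(z)
        sed = torch.sigmoid(self.sed_head(z))   # per-frame event probabilities
        doa = torch.tanh(self.doa_head(z))      # normalized angle estimates per class
        return sed, doa
```

Tracking then follows from the frame-wise, class-wise DOA estimates produced over time by the joint model.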

    Binaural scene analysis : localization, detection and recognition of speakers in complex acoustic scenes

    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, referred to as auditory scene analysis, is achieved by analyzing only the waveforms reaching the two ears. Computers, however, are presently unable to compete with the performance of the human auditory system, even when a computer algorithm operating on binaural signals is confronted with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has proven extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect, and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: localization, detection of speech sources, and recognition of speaker identities. The only information assumed to be known a priori is the number of target speech sources present in the acoustic mixture. Furthermore, this work aims to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuth-dependent binaural cues, namely interaural time and level differences. Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Through this link between the localization stage and the recognition stage, realized by the speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on speaker recognition performance, a missing-data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in non-stationary background noise.
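The localization front-end rests on frame-wise interaural time and level differences. The sketch below computes a single-band ITD (via a GCC-PHAT correlation peak) and ILD per frame from a binaural signal pair; the auditory filterbank, multi-conditional training, and classifier used in the thesis are omitted, and all parameter values are illustrative assumptions.

```python
import numpy as np

def binaural_cues(left, right, fs, frame_len=1024, hop=512, max_itd=1e-3):
    """Frame-wise ITD (GCC-PHAT peak lag) and ILD (dB) from a binaural signal pair."""
    max_lag = int(max_itd * fs)            # physiologically plausible ITD range (~1 ms)
    window = np.hanning(frame_len)
    itds, ilds = [], []
    for start in range(0, len(left) - frame_len, hop):
        l = left[start:start + frame_len] * window
        r = right[start:start + frame_len] * window
        # GCC-PHAT: whitened cross-power spectrum; its inverse FFT is the correlation
        L, R = np.fft.rfft(l), np.fft.rfft(r)
        cross = L * np.conj(R)
        cross /= np.abs(cross) + 1e-12
        cc = np.fft.irfft(cross, n=frame_len)
        cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # lags -max_lag..+max_lag
        itds.append((np.argmax(cc) - max_lag) / fs)             # correlation-peak lag in s
        # ILD: energy ratio between the two ears in dB
        ilds.append(10 * np.log10((np.sum(l**2) + 1e-12) / (np.sum(r**2) + 1e-12)))
    return np.array(itds), np.array(ilds)
```

Cue vectors of this kind would then be passed to azimuth-dependent classifiers trained under multiple acoustic conditions, as described in the abstract.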

    An Impulse Detection Methodology and System with Emphasis on Weapon Fire Detection

    This dissertation proposes a methodology for detecting impulse signatures, with an algorithm developed specifically for weapon fire detection, and describes multiple systems in which the detection algorithm can operate. For detection systems to be used in practical applications, they must achieve high detection performance with minimal false alarms, be cost-effective, and make use of available hardware. Most applications require real-time processing and increased range performance, and some require detection from mobile platforms. This dissertation provides a methodology for impulse detection, demonstrated for the specific case of weapon fire detection and intended for real-world application, taking into account acceptable algorithm performance, feasible system design, and practical implementation. The proposed detection algorithm is implemented with multiple sensors, allowing spectral-waveband versatility in system design. The algorithm is also shown to operate at a variety of video frame rates, allowing practical designs using common, commercial off-the-shelf hardware. Detection, false alarm, and classification performance are reported for the different sensors and their associated wavebands. False alarms are further mitigated through an adaptive, multi-layer classification scheme, enabling potential on-the-move application. The algorithm is shown to work in real time. The proposed system, including algorithm and hardware, is described. Additional systems are proposed that complement the strengths and alleviate the weaknesses of the hardware and algorithm. Systems are proposed to mitigate saturation clutter signals and increase detection of saturated targets through the use of position, navigation, and timing sensors, acoustic sensors, and imaging sensors. Furthermore, systems are presented that increase target detection and provide additional functionality, improving the cost-effectiveness of the overall system. The resulting algorithm is shown to enable detection of weapon fire targets, while minimizing false alarms, for real-world, fieldable applications. The work demonstrates the complexity of detection algorithm and system design for practical applications in complex environments, and emphasizes that such design lies at the intersection of algorithm performance and design, hardware performance and design, and size, weight, power, cost, and processing.
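The abstract does not reproduce the detection algorithm itself, but the generic idea of flagging brief intensity spikes against a slowly varying background in a video stream can be sketched as follows. The threshold, history window, and per-pixel statistics are assumptions for illustration only and are not the dissertation's method.

```python
import numpy as np

def detect_impulses(frames, k=6.0, win=25):
    """Flag frames containing short-lived intensity spikes against a running background.

    frames : (n_frames, height, width) array of grayscale imagery
    k      : detection threshold in multiples of the local temporal standard deviation
    win    : number of past frames used to model the slowly varying background
    """
    frames = frames.astype(np.float64)
    detections = []
    for i in range(win, len(frames)):
        history = frames[i - win:i]
        mu = history.mean(axis=0)               # per-pixel background estimate
        sigma = history.std(axis=0) + 1e-6      # per-pixel temporal noise level
        spike = (frames[i] - mu) / sigma        # normalized deviation of the current frame
        # An impulse is a brief, strong positive deviation above the background
        if (spike > k).any():
            detections.append((i, int((spike > k).sum())))  # frame index, spiking pixel count
    return detections
```

In a fielded system, candidate detections of this kind would then be passed to further classification layers to suppress false alarms, as the abstract describes.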

    The significance of passive acoustic array-configurations on sperm whale range estimation when using the hyperbolic algorithm

    In cetacean monitoring for population estimation, behavioural studies or mitigation, traditional visual observations are being augmented by Passive Acoustic Monitoring (PAM) techniques that use the creatures' vocalisations for localisation. The design of hydrophone configurations is evaluated for sperm whale (Physeter macrocephalus) range estimation, to meet the requirements of current mitigation regulations for a safety zone and of behavioural research. This thesis uses the Time Difference of Arrival (TDOA) of cetacean vocalisations with a three-dimensional hyperbolic localisation algorithm. A MATLAB simulator has been developed to model array-configurations and to assess their performance in source range estimation for both homogeneous and non-homogeneous sound speed profiles (SSP). The non-homogeneous medium is modelled with a Bellhop ray-trace model, using data collected from the Gulf of Mexico. Sperm whale clicks are chosen as an exemplar of a distinctive underwater sound. The simulator is tested with a separate synthetic source generator that produces a set of TDOAs from a known source location. The performance in source range estimation of Square, Trapezium, Triangular, Shifted-pair and Y-shape geometries is tested. The Y-shape geometry, with four elements and an aperture-length of 120 m, is the most accurate, giving an error of ±10 m over slant ranges of 500 m in a homogeneous medium, and 300 m in a non-homogeneous medium. However, for towed-array deployments, the Y-shape array is sensitive to angle-positioning errors when the geometry is seriously distorted. The Shifted-pair geometry overcomes these limits, achieving an accuracy of ±30 m whether the vessel moves in a straight line or turns to port or starboard, and constitutes a recommended array-configuration for towed-array deployments. The thesis demonstrates that the number of receivers, the array-geometry and the array-aperture are important parameters to consider when designing and deploying a hydrophone array. It is shown that certain array-configurations can significantly improve the accuracy of source range estimation. Recommendations are made concerning preferred array-configurations for use with PAM systems.
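For reference, hyperbolic TDOA localisation in a homogeneous medium reduces to a nonlinear least-squares problem over range differences, since each TDOA constrains the source to a hyperboloid. The sketch below assumes straight-line propagation at a fixed sound speed; for a non-homogeneous SSP, the straight-line ranges would be replaced by travel times from a ray-trace model such as Bellhop. The function name, sound speed, and initial guess are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def tdoa_localize(hydrophones, tdoas, c=1500.0, x0=None):
    """Estimate a source position from TDOAs measured relative to the first hydrophone.

    hydrophones : (N, 3) array of receiver coordinates in metres
    tdoas       : (N-1,) time differences of arrival in seconds, receiver i+1 minus receiver 0
    c           : assumed homogeneous sound speed in m/s
    """
    hydrophones = np.asarray(hydrophones, dtype=float)
    tdoas = np.asarray(tdoas, dtype=float)
    if x0 is None:
        # Rough initial guess offset from the array centre (illustrative only)
        x0 = hydrophones.mean(axis=0) + np.array([100.0, 0.0, -50.0])

    def residuals(x):
        ranges = np.linalg.norm(hydrophones - x, axis=1)
        # Each TDOA defines a hyperboloid: range difference = c * tdoa
        return (ranges[1:] - ranges[0]) - c * tdoas

    sol = least_squares(residuals, x0)
    return sol.x
```

Array geometry enters through the hydrophone coordinates: geometries with greater spatial diversity of range differences, such as the Y-shape, constrain the solution more tightly, which is consistent with the accuracy comparison reported above.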