8 research outputs found

    Learning sound representations using trainable COPE feature extractors

    Sound analysis research has mainly focused on speech and music processing. The deployed methodologies are not suitable for the analysis of sounds with varying background noise, which in many cases have a very low signal-to-noise ratio (SNR). In this paper, we present a method for the detection of patterns of interest in audio signals. We propose novel trainable feature extractors, which we call COPE (Combination of Peaks of Energy). The structure of a COPE feature extractor is determined from a single prototype sound pattern in an automatic configuration process, which is a type of representation learning. We construct a set of COPE feature extractors, configured on a number of training patterns. We then take their responses to build feature vectors that we use in combination with a classifier to detect and classify patterns of interest in audio signals. We carried out experiments on four public data sets: MIVIA audio events, MIVIA road events, ESC-10 and TU Dortmund. The results that we achieved (recognition rates of 91.71% on MIVIA audio events, 94% on MIVIA road events, 81.25% on ESC-10 and 94.27% on TU Dortmund) demonstrate the effectiveness of the proposed method and are higher than those obtained by other existing approaches. The COPE feature extractors are highly robust to variations in SNR. Real-time performance is achieved even when a large number of features is computed. Comment: Accepted for publication in Pattern Recognition
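The abstract above evaluates robustness to variations in SNR. As a minimal sketch of what that quantity means (not the authors' evaluation code), the example below computes the SNR in decibels from the average power of a signal and a noise sequence, and rescales the noise so that a mixture reaches a chosen target SNR. The function names `snr_db` and `scale_noise_to_snr` and the toy sample values are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a sequence of samples (mean of squared values)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise)."""
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        raise ValueError("signal and noise must both have non-zero power")
    return 10.0 * math.log10(p_signal / p_noise)


def scale_noise_to_snr(
    signal: Sequence[float], noise: Sequence[float], target_db: float
) -> List[float]:
    """Return a rescaled copy of `noise` so that snr_db(signal, result) == target_db.

    Multiplying the noise by a gain g lowers the SNR by 20 * log10(g) dB,
    so the required gain is 10 ** ((current_db - target_db) / 20).
    """
    current_db = snr_db(signal, noise)
    gain = 10.0 ** ((current_db - target_db) / 20.0)
    return [gain * x for x in noise]


# Hypothetical toy data: the noise has one quarter of the signal's power.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, 0.5, -0.5]
louder_noise = scale_noise_to_snr(signal, noise, target_db=0.0)
```

Mixing `signal` with `louder_noise` sample by sample would give a test clip at 0 dB SNR, which is one common way to probe a detector at several noise levels.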

    Speech Mode Classification using the Fusion of CNNs and LSTM Networks

    Speech mode classification has not been explored as widely in the field of sound classification as other areas such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode is defined as the way or manner in which something occurs or is expressed or done, speech mode is defined as the style in which speech is delivered by a person. There are some reports on classifying speech modes, such as whispering and talking in a normal voice, using conventional methods. However, to the best of our knowledge, deep learning-based methods have not been reported in the open literature for this classification scenario. Specifically, in this work we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the use of pre-trained deep neural networks, namely AlexNet, ResNet18 and SqueezeNet. We thus compare the classification efficiency of a set of deep learning (DL)-based classifiers, and we also assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are generated from the original audio signals and used as input to the networks. Next, we compare the accuracy of the DL-based classifiers to that of a set of machine learning (ML) classifiers that take Mel-Frequency Cepstral Coefficient (MFCC) features as input. Then, after determining the most efficient sampling rate for our classification problem (i.e., 32 kHz), we study the performance of our proposed method of combining CNNs with LSTM (Long Short-Term Memory) networks. For this purpose, we use the features extracted from the deep networks of the previous step. We conclude our study by evaluating the role of sampling rate on classification accuracy by generating two sets of 2D image representations: one with 32 kHz and the other with 16 kHz sampling.
Experimental results show that, after cross-validation, the accuracy of the DL-based approaches is 15% higher than that of the ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32 kHz, whether we use transfer learning, feature-level fusion or score-level fusion (92.5%). Our proposed method using LSTMs further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.
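The abstract above mentions score-level fusion without describing how it is done. A minimal sketch of one common form, a weighted average of the per-class scores produced by several classifiers followed by an argmax, is shown below. This is an illustration under stated assumptions, not the paper's method: the function names, the two-class example, and the score values are all hypothetical.

```python
from typing import List, Optional, Sequence


def fuse_scores(
    score_sets: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-class scores from several classifiers.

    Each entry of `score_sets` holds one classifier's scores for the same
    ordered list of classes. With no weights, all classifiers count equally.
    """
    if not score_sets:
        raise ValueError("at least one score set is required")
    n_classes = len(score_sets[0])
    if any(len(scores) != n_classes for scores in score_sets):
        raise ValueError("all score sets must cover the same classes")
    if weights is None:
        weights = [1.0] * len(score_sets)
    if len(weights) != len(score_sets):
        raise ValueError("one weight per score set is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_sets)) / total
        for c in range(n_classes)
    ]


def predict(fused: Sequence[float]) -> int:
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])


# Hypothetical scores for classes [normal, whisper] from two networks,
# e.g. one fed spectrograms and one fed mel-spectrograms.
spectrogram_scores = [0.6, 0.4]
mel_scores = [0.2, 0.8]
fused = fuse_scores([spectrogram_scores, mel_scores])
label = predict(fused)
```

Feature-level fusion, by contrast, would concatenate the networks' feature vectors before a single classifier rather than combining their output scores.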

    Microphone Array Based Surveillance System

    This work explores an alternative surveillance method based on a microphone array. Most current audio-based surveillance work focuses on the detection of a single sound source. In any future real-world application, it is very plausible that multiple sound sources will occur simultaneously at some point in time. In this work, a method of detecting and localizing multiple sound sources is presented. Three state-of-the-art techniques are described that, put together, allow the system to estimate the approximate location of one or multiple sound sources, separate the sounds, and then properly identify them. Electrical Engineering

    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of the social and human activities around them and enable machines to be context-aware in a similar way to humans. Such methods can be employed to help the hearing impaired visualize sounds, for robot navigation, and to monitor biodiversity, the home, and cities. A real-life acoustic scene is complex in nature, with multiple sound events that overlap temporally and spatially, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class, for example a car horn, can vary considerably: different cars have different horns, and within the same car model the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, we investigate the SELDT task in this thesis using a data-driven approach based on deep neural networks (DNNs). The sound event detection (SED) task requires the detection of onset and offset times for individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED with two different DNNs, recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that using multichannel audio features improves SED performance for overlapping sound events in comparison to traditional single-channel audio features.
The proposed novel features and methods produced state-of-the-art performance for the real-life SED task and won the IEEE AASP DCASE challenge consecutively in 2016 and 2017. Sound event localization is the task of spatially locating the position of individual sound events. Traditionally, this has been approached using parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization in the complete azimuth and elevation space. In comparison to parametric methods, which require the number of active sources to be known, the proposed method learns this information directly from the input data and estimates the respective spatial locations of the sources. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method and is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets that represent anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, different numbers of overlapping sound events, and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.

    Detección e identificación de señales sonoras en entornos asistivos [Detection and identification of sound signals in assistive environments]

    150 pp. The main objective of the work presented in this doctoral thesis is the study and applicability of techniques for recognizing non-speech sounds, such as doorbells, running taps, alarm clocks, etc., that help improve the independence and quality of life of people with hearing impairments. In this research, recognition systems have been developed that can work in real time using professional microphones at a fixed location. These systems have been designed both to alert people with hearing problems to sounds of interest and for use in intelligent systems that rely on this information to recognize the person's activities of daily living. However, the main contribution of this thesis lies in the investigation of this type of system on mobile phones, where hardware capabilities are more limited and the conditions under which sounds are trained differ from those of validation or testing. It has been shown that, by optimizing the detection and classification algorithms, these systems can function on mobile devices in real time. The work in this field has led to the development of a functional mobile phone application that runs in real time and is designed according to accessibility guidelines to support people with hearing impairments.

    Contribution to study and implementation of a bio-inspired perception system based on visual and auditory attention

    The main goal of this research is the design of an artificial perception system that can identify relevant events or scenes in a complex environment. The work carried out during this thesis focused on the study and design of a bio-inspired perception system based on both visual and auditory saliency. The main contributions of this thesis are auditory saliency combined with the recognition of environmental sounds and noises, and visual saliency combined with the recognition of relevant objects. The auditory saliency is computed by merging information extracted from the temporal and spectral representations of the acoustic signal with a visual saliency map of its spectrogram. The visual perception system consists of two distinct mechanisms: the first is based on visual saliency methods, and the second identifies the foreground object. In addition, the originality of the proposed approach is that it can evaluate the coherence between visual and auditory observations by fusing the information extracted from the perceived visual and auditory signals. The experimental results confirmed the value of this method for scene identification in a complex environment.

    Deep Neural Networks for Sound Event Detection

    The objective of this thesis is to develop novel classification and feature learning techniques for the task of sound event detection (SED) in real-world environments. Throughout their lives, humans go through a continuous learning process of assigning meanings to sounds. Thanks to this, most humans can easily recognize the sound of thunder, a dog bark, a doorbell, birdsong, etc. In this work, we aim to develop systems that can automatically detect the sound events commonly present in our daily lives. Such systems can be utilized in, e.g., context-aware devices, acoustic surveillance, bio-acoustical and healthcare monitoring, and smart-home cities. In this thesis, we propose to apply the modern machine learning methods called deep learning to SED. The relationship between the time-frequency representations commonly used for SED (such as the mel spectrogram and the magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNNs) utilize a layered structure of units to extract features from the given sound representation input with increased abstraction at each layer. This increases the network's capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than established classifier techniques for SED such as Gaussian mixture models. In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of the sound events may cause small shifts in the frequency-domain content, and the time-domain shift results from the fact that a sound event can occur at any time in a given audio recording.
We found that convolutional neural networks (CNNs) are useful for learning shift-invariant filters that are essential for robust modeling of sound events. In addition, we show that recurrent neural networks (RNNs) are effective in modeling the long-term temporal characteristics of sound events. Finally, we combine the convolutional and recurrent layers in a single classifier called a convolutional recurrent neural network (CRNN), which emphasizes the benefits of both and provides state-of-the-art results on multiple SED benchmark datasets. Aside from learning the mappings between the time-frequency representations and the sound event labels, we show that deep learning methods can also be utilized to learn a direct mapping between the target labels and a lower-level representation such as the magnitude spectrogram or even the raw audio signal. In this thesis, we propose to integrate the feature learning capabilities of deep learning methods with empirical knowledge of human auditory perception by initializing layer weights with filterbank coefficients. This results in an optimal, ad-hoc filterbank that is obtained through gradient-based optimization of the original coefficients to improve SED performance.

    Robust Environmental Sound Recognition for Home Automation
