    A target guided subband filter for acoustic event detection in noisy environments using wavelet packets

    This paper deals with acoustic event detection (AED) for events such as screams, gunshots, and explosions in noisy environments. The main aim is to improve detection performance under adverse conditions with a very low signal-to-noise ratio (SNR). A novel filtering method combined with an energy detector is presented. The wavelet packet transform (WPT) is first used to obtain a time-frequency representation of the acoustic signals. The proposed filter then operates in the wavelet packet domain, using a priori knowledge of the target event and an estimate of the noise features to selectively suppress the background noise. It is in effect a content-aware band-pass filter that automatically passes the frequency bands that are more significant in the target than in the noise. Theoretical analysis shows that the proposed filtering method can enhance the target content while suppressing the background noise for signals with a low SNR. A condition for increasing the probability of correct detection is also derived. Experiments were carried out on a large dataset of acoustic events contaminated by different types of environmental noise and by white noise at varying SNRs. Results show that the proposed method is more robust and better adapted to noise than ordinary energy detectors, and that it can work at SNRs as low as -15 dB. A practical system for real-time processing and multi-target detection is also proposed.
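
The idea of passing only the subbands that matter more for the target than for the noise can be illustrated with a minimal sketch. It is not the paper's filter: the abstract does not give the actual gain rule, so the Haar wavelet, the binary pass/stop mask, the "energy share" comparison, and all function names below are assumptions made for illustration only.

```python
import math

def haar_wpt(signal, levels):
    """Full Haar wavelet packet decomposition into 2**levels subbands.

    Assumes len(signal) is a multiple of 2**levels. Subbands are returned
    in natural (filter-bank) order, not in increasing frequency order.
    """
    s = math.sqrt(2.0)
    bands = [list(signal)]
    for _ in range(levels):
        split = []
        for b in bands:
            approx = [(b[i] + b[i + 1]) / s for i in range(0, len(b) - 1, 2)]
            detail = [(b[i] - b[i + 1]) / s for i in range(0, len(b) - 1, 2)]
            split.extend([approx, detail])
        bands = split
    return bands

def band_energies(signal, levels):
    """Energy (sum of squared coefficients) of every subband."""
    return [sum(c * c for c in band) for band in haar_wpt(signal, levels)]

def target_guided_mask(target_template, noise_estimate, levels):
    """Assumed rule: pass a band when it holds a larger share of the
    target energy than of the noise energy."""
    t = band_energies(target_template, levels)
    n = band_energies(noise_estimate, levels)
    t_total, n_total = sum(t), sum(n)
    if t_total == 0 or n_total == 0:
        raise ValueError("template and noise estimate must have non-zero energy")
    return [ti / t_total > ni / n_total for ti, ni in zip(t, n)]

def filtered_energy(signal, mask, levels):
    """Energy detector statistic computed only over the passed bands."""
    energies = band_energies(signal, levels)
    return sum(e for e, keep in zip(energies, mask) if keep)

# Toy data (hypothetical): a fast-alternating "target" and a constant "noise".
LEVELS = 3
target = [(-1.0) ** i for i in range(64)]
noise = [1.0] * 64
mask = target_guided_mask(target, noise, LEVELS)

# A weak target buried in strong noise, and the noise alone.
mixture = [0.5 * t + 3.0 * n for t, n in zip(target, noise)]
print("passed bands:", [i for i, keep in enumerate(mask) if keep])
print("statistic with target:", filtered_energy(mixture, mask, LEVELS))
print("statistic, noise only:", filtered_energy([3.0 * n for n in noise], mask, LEVELS))
```

In this toy case the plain signal energy differs by only a few percent between the two inputs, while the masked statistic separates them clearly, which is the intuition behind combining a subband filter with an energy detector. A detection threshold on the statistic would still have to be chosen from data.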

    Automatic Environmental Sound Recognition: Performance versus Computational Cost

    In the context of the Internet of Things (IoT), sound sensing applications are required to run on embedded platforms where product pricing and form factor impose hard constraints on the available computing power. Whereas Automatic Environmental Sound Recognition (AESR) algorithms are most often developed with limited consideration for computational cost, this article investigates which AESR algorithm can make the most of a limited amount of computing power by comparing sound classification performance as a function of computational cost. Results suggest that Deep Neural Networks yield the best ratio of sound classification accuracy across a range of computational costs, while Gaussian Mixture Models offer reasonable accuracy at a consistently small cost, and Support Vector Machines stand between the two in terms of the compromise between accuracy and computational cost.

    Le projet CIRDO d'assistance aux personnes âgées isolées à domicile [The CIRDO project for assisting isolated elderly people at home]

    CIRDO RI is a project that aims to develop an assistance system to help elderly and/or dependent people continue living at home. Its objective is to keep seniors safe and to promote their autonomy in order to relieve families and caregivers. The main idea of the CIRDO project is to use both video analysis and speech recognition within the same process. To meet the constraints set in the CIRDO project, such as respect for privacy, protection of private information, and responsiveness in detecting distress situations, the proposed solutions are designed to run as background tasks, without the intervention of an operator. Beyond the technical aspect, the project also covers the psycho-social and psycho-ergonomic evaluation of these technologies by potential users and those around them (family, caregivers...), particularly with regard to their usability and acceptability. Experiments conducted in a realistic setting made it possible to record audio and video data corresponding to the use case.

    Speech Recognition of Aged Voices in the AAL Context: Detection of Distress Sentences

    By 2050, about a third of the French population will be over 65. In the context of developing technologies that help aged people live independently at home, the CIRDO project aims to integrate an ASR system into a social inclusion product designed for elderly people in order to detect distress situations. Speech recognition systems show a higher word error rate (WER) when speech is uttered by elderly speakers than when it is uttered by non-aged speakers. Two specialized French corpora, AD80 and ERES38, were recorded by aged people in this framework. They were first used to study whether a standard ASR system could be adapted to aged voices. We then examined whether the variability of the WER between speakers could be correlated with the level of dependence. Finally, we assessed the performance of distress sentence detection by a filter and demonstrated a significant drop in performance for speakers with the lowest degree of autonomy.
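
Since the study above compares speakers by word error rate, a short sketch of how that metric is usually computed may help: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are hypothetical and are not taken from the AD80 or ERES38 corpora.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical distress sentence and a flawed recognition result.
print(word_error_rate("please call my daughter", "please fall my"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error ratio rather than a bounded percentage.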

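    Réseaux Évidentiels pour la fusion de données multimodales hétérogènes (application à la détection de chutes) [Evidential Networks for the fusion of heterogeneous multimodal data (application to fall detection)]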

    This work took place within the development of a remote home healthcare monitoring application designed to detect distress situations using several types of sensors. Multi-sensor fusion can provide more accurate and reliable information than each sensor taken separately. Furthermore, the data from these heterogeneous sensors have different degrees of imperfection and trust. Among multi-sensor fusion techniques, belief methods based on Dempster-Shafer theory are currently considered the most appropriate for representing and processing imperfect information, and thus allow a more realistic modeling of the problem. Building on a graphical representation of Dempster-Shafer theory called Evidential Networks, we propose a structure for fusing heterogeneous data from multiple sensors for fall detection, in order to maximize fall detection performance and thus make the system more reliable. The non-stationarity of the sensor signals in the considered system may degrade the experimental conditions and make Evidential Networks inconsistent in their decisions. To compensate for the effects of this non-stationarity, the Evidential Networks are made to evolve over time, which led us to introduce Dynamic Evidential Networks and to evaluate them on simulated fall scenarios corresponding to various use cases.
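
As background for the fusion approach described above, the following sketch shows Dempster's rule of combination, the basic operation of Dempster-Shafer theory for merging two sources of evidence. It is only the classical two-source rule, not the Evidential Network or Dynamic Evidential Network structure of the thesis, and the sensors and mass values are invented for illustration.

```python
from itertools import product

def combine_dempster(m1, m2):
    """Combine two mass functions given as {frozenset_of_hypotheses: mass}."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so that the remaining masses sum to one.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

FALL = frozenset({"fall"})
NORMAL = frozenset({"normal"})
UNKNOWN = FALL | NORMAL  # the whole frame: "cannot tell"

# Hypothetical evidence from two sensors (values are made up).
wearable = {FALL: 0.6, UNKNOWN: 0.4}
sound = {FALL: 0.5, NORMAL: 0.2, UNKNOWN: 0.3}

fused = combine_dempster(wearable, sound)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Mass left on the whole frame expresses ignorance rather than evidence for either hypothesis, which is how this framework represents imperfect sensor information.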

    Reconnaissance des sons de l'environnement dans un contexte domotique [Recognition of environmental sounds in a home automation context]

    In many countries around the world, the number of elderly people living alone has been increasing substantially. In the last few years, a significant number of research projects on assistance to elderly people have been launched. Most of them use several modalities (video, sound, fall detection, etc.) to monitor the person's activity, to let them communicate naturally with their "smart" home, and to come to their aid as quickly as possible in case of danger. This work was carried out as part of the ANR VERSO industrial research project Sweet-Home. The goals of the project are to propose a home automation system that enables natural interaction (by voice and touch command) with the house and provides greater safety to the occupant through the detection of distress situations. Within this framework, the goal of this work is to propose solutions for the recognition of everyday sounds in a realistic context. Sound recognition will run upstream of an Automatic Speech Recognition system, whose performance therefore depends on the reliability of the separation between speech and other sounds. Furthermore, good recognition of certain sounds, complemented by other sources of information (presence detection, fall detection, etc.), would make it possible to follow the person's activities closely and thus to detect dangerous situations. We first looked at methods from the field of Speaker Recognition and Verification. In this spirit, we tested methods based on GMM and SVM. In particular, we tested the SVM-GSL kernel (SVM GMM Supervector Linear Kernel), which is used for sequence classification.
SVM-GSL is a combination of SVM and GMM that maps a sequence of vectors of arbitrary length into a single very high-dimensional vector, called a Super Vector, which is used as the input of an SVM. Experiments were carried out using a locally created database (18 sound classes, more than 1000 recordings), then using the Sweet-Home project corpus, by integrating our system into a more complete system that includes multi-channel sound detection and speech recognition. These first experiments were all performed using a single type of acoustic coefficients, MFCC. We then studied other families of coefficients in order to assess their usability for environmental sound recognition. Our motivation was to find representations that are simpler and/or more effective than MFCC. Using 15 different families of coefficients, we also experimented with two approaches for mapping a sequence of vectors into a single vector to be used with a linear SVM. In the first approach, a fixed number of statistical coefficients is computed and replaces the whole sequence of vectors. The second approach, one of the contributions of this work, uses a discretization method to find, for each feature of an acoustic vector, the best cut points that associate a given class with one or more intervals of values. The likelihood of the sequence is estimated with respect to each interval. The resulting likelihood values are used to build a single vector that replaces the sequence of acoustic vectors. The results show that some families of coefficients are indeed better suited to recognizing certain sound classes.
For most classes, the best recognition rates were observed with one or more families of coefficients other than MFCC. Moreover, some of these families are less complex, with a single feature per analysis frame compared with 16 features for MFCC.
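
The first of the two sequence-to-vector approaches described above (replacing a variable-length sequence of acoustic vectors with a fixed number of statistics) can be sketched as follows. The abstract does not say which statistics were used, so the choice of mean, standard deviation, minimum, and maximum is an assumption, as are the function name and the toy frames.

```python
import statistics

def sequence_to_stats_vector(frames):
    """Map a sequence of equal-length feature vectors to one fixed-length vector.

    For each feature, the per-frame values are summarized by four statistics
    (an assumed choice): mean, population standard deviation, minimum, maximum.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n_features = len(frames[0])
    vector = []
    for k in range(n_features):
        values = [frame[k] for frame in frames]
        vector.extend([
            statistics.fmean(values),
            statistics.pstdev(values),
            min(values),
            max(values),
        ])
    return vector

# Two toy recordings of different durations, each with 2 features per frame.
short_sound = [[0.0, 1.0], [2.0, 3.0]]
long_sound = [[0.5, 1.5], [1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [2.5, 3.5]]

print(len(sequence_to_stats_vector(short_sound)))
print(len(sequence_to_stats_vector(long_sound)))
```

Both recordings yield vectors of the same length (four values per feature) regardless of duration, which is what allows them to be fed to a classifier that expects fixed-size inputs, such as a linear SVM.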

    Neural Networks for Analysing Music and Environmental Audio

    In this thesis, we consider the analysis of music and environmental audio recordings with neural networks. Recently, neural networks have been shown to be an effective family of models for speech recognition, computer vision, natural language processing and a number of other statistical modelling problems. The composite layer-wise structure of neural networks allows for flexible model design, where prior knowledge about the domain of application can be used to inform the design and architecture of the neural network models. Additionally, it has been shown that when trained on sufficient quantities of data, neural networks can be directly applied to low-level features to learn mappings to high-level concepts like phonemes in speech and object classes in computer vision. In this thesis we investigate whether neural network models can be usefully applied to processing music and environmental audio. With regard to music signal analysis, we investigate two different problems. The first problem, automatic music transcription, aims to identify the score or the sequence of musical notes that comprise an audio recording. We also consider the problem of automatic chord transcription, where the aim is to identify the sequence of chords in a given audio recording. For both problems, we design neural network acoustic models which are applied to low-level time-frequency features in order to detect the presence of notes or chords. Our results demonstrate that the neural network acoustic models perform similarly to state-of-the-art acoustic models, without the need for any feature engineering. The networks are able to learn complex transformations from time-frequency features to the desired outputs, given sufficient amounts of training data. Additionally, we use recurrent neural networks to model the temporal structure of sequences of notes or chords, similar to language modelling in speech.
Our results demonstrate that the combination of the acoustic and language model predictions yields improved performance over the acoustic models alone. We also observe that convolutional neural networks yield better performance than other neural network architectures for acoustic modelling. For the analysis of environmental audio recordings, we consider the problem of acoustic event detection. Acoustic event detection has a similar structure to automatic music and chord transcription, where the system is required to output the correct sequence of semantic labels along with onset and offset times. We compare the performance of neural network architectures against Gaussian mixture models and support vector machines. To account for the fact that such systems are typically deployed on embedded devices, we compare performance as a function of the computational cost of each model. We evaluate the models on two large datasets of real-world recordings of baby cries and smoke alarms. Our results demonstrate that the neural networks clearly outperform the other models and are able to do so without incurring a heavy computational cost.