51 research outputs found

    Sound Event Localization, Detection, and Tracking by Deep Neural Networks

    In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize them, and track their motion individually in an acoustic environment. This ability makes humans context-aware and enables them to interact naturally with their surroundings. Developing similar methods for machines would provide an automatic description of the social and human activities around them and enable machines to be context-aware in the same way. Such methods can be employed to help the hearing impaired visualize sounds, for robot navigation, and to monitor biodiversity, homes, and cities. A real-life acoustic scene is complex in nature, with multiple sound events that overlap temporally and spatially, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class can exhibit considerable variability; for example, different cars have different horns, and even within the same car model, the duration and temporal structure of the horn sound depend on the driver. Performing SELDT robustly in such overlapping and dynamic sound scenes is challenging for machines. Hence, in this thesis we investigate the SELDT task using a data-driven approach based on deep neural networks (DNNs). The sound event detection (SED) task requires detecting the onset and offset times of individual sound events and their corresponding labels. To this end, we propose spatial and perceptual features extracted from multichannel audio for SED using two different DNNs: recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that multichannel audio features improve SED performance for overlapping sound events in comparison to traditional single-channel audio features.
The proposed novel features and methods produced state-of-the-art performance on the real-life SED task and won the IEEE AASP DCASE challenge in 2016 and 2017 consecutively. Sound event localization is the task of spatially locating individual sound events. Traditionally, this has been approached with parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method to perform localization over the complete azimuth and elevation space. Unlike parametric methods, which require knowledge of the number of active sources, the proposed method learns this information directly from the input data and estimates the sources' respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios. Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks each spatial location over time, thus producing the SELDT results. This is the first DNN-based SELDT method, and it is shown to perform on par with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets representing anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, varying numbers of overlapping sound events, and different microphone-array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving.
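The abstract does not spell out which multichannel spatial features are used; as one illustration of the kind of spatial cue such localization systems commonly exploit, the following numpy sketch computes GCC-PHAT (generalized cross-correlation with phase transform) between two microphone channels and recovers a known inter-channel delay. The signal and delay values are made up for the example and are not from the thesis.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=512):
    """GCC-PHAT between two microphone channels: cross-power spectrum
    whitened to keep only phase, a standard spatial cue for localization."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12        # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n=n_fft)
    return np.fft.fftshift(cc)            # put lag 0 at the centre

# Two channels: the second is the first delayed (circularly) by 5 samples.
rng = np.random.default_rng(0)
sig = rng.standard_normal(512)
ch1 = sig
ch2 = np.roll(sig, 5)
cc = gcc_phat(ch1, ch2, n_fft=512)
tdoa = len(cc) // 2 - np.argmax(cc)       # delay of ch2 relative to ch1: 5 samples
```

The sharp correlation peak after PHAT whitening is what makes this cue robust to the spectral content of the source.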

    Multimedia Context Awareness for Smart Mobile Environments

    Nowadays, the development of the IoT framework and the resulting huge number of smart connected devices open the door to exploiting multiple smart nodes to accomplish a variety of tasks. Multimedia context awareness, together with the concept of ambient intelligence, is tightly related to the IoT framework, and it can be applied to a large number of smart scenarios. In this thesis, the aim is to study and analyze the role of context awareness in different applications related to smart mobile environments, such as future smart spaces and connected cities. Indeed, this research work focuses on different aspects of ambient intelligence, such as audio-awareness and wireless-awareness. In particular, this thesis tackles two main research topics: the first, related to the framework of audio-awareness, concerns a multiple-observations approach to smart speaker recognition in mobile environments; the second, tied to the concept of wireless-awareness, regards Unmanned Aerial Vehicle (UAV) detection based on WiFi statistical fingerprint analysis.

    Enhancing speaker verification accuracy with deep ensemble learning and inclusion of multifaceted demographic factors

    Effective speaker identification is essential for achieving robust speaker recognition in real-world applications such as mobile devices, security, and entertainment while maintaining high accuracy. However, deep learning models trained on large datasets with diverse demographic and environmental factors may suffer from increased misclassification and longer processing times. This study proposes incorporating ethnicity and gender information as critical parameters in a deep learning model to enhance accuracy. Two convolutional neural network (CNN) models classify gender and ethnicity, followed by a Siamese deep learning model trained with these critical parameters and additional features for speaker verification. The proposed model was tested on the VoxCeleb 2 database, which includes over one million utterances from 6,112 celebrities. In an evaluation after 500 training epochs, the model achieved notable results: an equal error rate (EER) of 1.68 and a minimum decision cost function (minDCF) of 0.10. The proposed model outperforms existing deep learning models, with fewer misclassification errors and faster processing times.
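For readers unfamiliar with the reported metric, the equal error rate is the operating point at which the false-acceptance and false-rejection rates coincide. A minimal numpy sketch, using synthetic illustrative scores rather than VoxCeleb trials:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: sweep thresholds over all observed scores and return the
    error rate where false acceptance and false rejection are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Well-separated synthetic genuine and impostor scores -> low EER.
rng = np.random.default_rng(1)
tgt = rng.normal(2.5, 1.0, 1000)    # genuine-trial scores
non = rng.normal(-2.5, 1.0, 1000)   # impostor-trial scores
eer = equal_error_rate(tgt, non)
```

Production toolkits interpolate the DET curve for a more precise crossing point; this discrete sweep is enough to convey the definition.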

    Discriminative Methods for Model Optimization in Speaker Verification

    The growing need for secure authentication systems has motivated interest in effective Speaker Verification (SV) algorithms. This need for high-performance algorithms capable of achieving low error rates has opened several lines of research. In this work, we propose to investigate, from a discriminative point of view, a set of methodologies to improve the performance of state-of-the-art SV systems. In a first approach, we investigate hyper-parameter optimization that explicitly considers the trade-off between false-acceptance and false-rejection errors. This objective can be achieved by maximizing the area under the Receiver Operating Characteristic (ROC) curve. We believe that this parameter optimization should not be limited to a single operating point; a more robust strategy is to optimize the parameters to increase the area under the curve (AUC), so that all operating points are improved. We study how to optimize the parameters using the mathematical representation of the area under the ROC curve based on the Wilcoxon-Mann-Whitney (WMW) statistic, computed appropriately with the generalized probabilistic descent algorithm. We also analyze the effect on, and improvements in, metrics such as the detection error tradeoff (DET) curve, the Equal Error Rate (EER), and the minimum value of the detection cost function (minDCF). In a second approach, we investigate the speech signal as a combination of attributes carrying information about the speaker, the channel, and the noise.
Conventional verification systems train single generic models for all cases and handle variations in these attributes either through factor analysis or by not modeling them explicitly. We propose a new methodology that partitions the data space according to these characteristics and trains separate models for each partition. Partitions can be obtained according to each attribute. In this research, we show how to effectively train the models discriminatively so as to maximize the separation between them. Furthermore, designing algorithms robust to noisy conditions plays a key role in allowing SV systems to operate in real-world conditions. We propose extending our methodologies to mitigate the effects of noise in such conditions. For our first approach, when noise is present, the operating point may no longer be a single point, or it may shift unpredictably. We show that our ROC-AUC maximization methodology is more robust than conventional classifiers even when noise is not explicitly considered. Moreover, noise may occur at different signal-to-noise ratios (SNRs) and degrade system performance. It is therefore worthwhile to consider an efficient decomposition of speech signals that accounts for attributes such as SNR, noise, and channel type. We argue that, instead of addressing the problem with a unified model, decomposing the feature space into attribute-based partitions can provide better results. These attributes can represent different channels and noise conditions.
We have analyzed the potential of these methodologies, which improve on state-of-the-art system performance by reducing error while also controlling operating points and mitigating the effects of noise.
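The WMW-statistic view of the AUC mentioned in the abstract can be made concrete: the AUC equals the probability that a randomly chosen target-trial score exceeds a randomly chosen non-target score, counting ties as half. A small numpy sketch (the function name is ours, not from the thesis):

```python
import numpy as np

def auc_wmw(pos_scores, neg_scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) score pairs ranked correctly, ties count 0.5."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Perfect separation gives AUC = 1; fully overlapping scores give 0.5.
print(auc_wmw([3.0, 2.5, 2.0], [1.0, 0.5]))  # 1.0
print(auc_wmw([1.0, 0.0], [1.0, 0.0]))       # 0.5
```

Because this pairwise formulation is a sum over score pairs, it can be smoothed (e.g., replacing the step function with a sigmoid) and optimized by gradient methods, which is what makes AUC maximization tractable.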

    Signal processing techniques for robust sound event recognition

    The computational analysis of acoustic scenes is today a topic of major interest, with a growing community focused on designing machines capable of identifying and understanding the sounds produced in our environment, similar to how humans perform this task. Although these domains have not reached the industrial popularity of other related audio domains, such as speech recognition or music analysis, applications designed to identify the occurrence of sounds in a given scenario are rapidly increasing. These applications are usually limited to a set of sound classes, which must be defined beforehand. In order to train sound classification models, representative sets of sound events are recorded and used as training data. However, the acoustic conditions present during the collection of training examples may not coincide with the conditions during application testing. Background noise, overlapping sound events or weakly segmented data, among others, may substantially affect audio data, lowering the actual performance of the learned models. To avoid such situations, machine learning systems have to be designed with the ability to generalize to data collected under conditions different from the ones seen during training. Traditionally, the techniques used to carry out tasks related to the computational understanding of sound events have been inspired by similar domains such as music or speech, so the features selected to represent acoustic events come from those specific domains. Most of the contributions of this thesis are based on how such features are suitably applied for sound event recognition, proposing specific methods to adapt the features extracted both within classical recognition approaches and modern end-to-end convolutional neural networks. 
The objective of this thesis is therefore to develop novel signal processing techniques aimed at increasing the robustness of the features representing acoustic events to adverse conditions that create a mismatch between training and test conditions in model learning. To achieve this objective, we first analyze the importance of classical feature sets such as Mel-frequency cepstral coefficients (MFCCs) or the energies extracted from log-mel filterbanks, as well as the impact of noise, reverberation, and segmentation errors in diverse scenarios. We show that the performance of both classical and deep learning-based approaches is severely affected by these factors, and we propose novel signal processing techniques designed to improve their robustness by means of a non-linear transformation of feature vectors along the temporal axis. This transformation is based on the so-called event trace, which can be interpreted as an indicator of the temporal activity of the event within the feature space. Finally, we propose the use of the energy envelope as a target for event detection, which implies a change from a classification-based approach to a regression-oriented one.
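As a concrete example of the train/test mismatch discussed above, robustness experiments commonly corrupt clean audio with noise at a controlled signal-to-noise ratio. A minimal numpy sketch (the signals are illustrative; this is not code from the thesis):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB)
    relative to `clean`, then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Corrupt a 1-second 440 Hz tone with white noise at 10 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10)

# Verify the achieved SNR from the injected residual.
residual = noisy - clean
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))  # ≈ 10 dB
```

Sweeping `snr_db` over a range of values is the usual way to chart how quickly a feature set or model degrades away from its training conditions.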

    Fisher Motion Descriptor for Multiview Gait Recognition

    The goal of this paper is to identify individuals by analyzing their gait. Instead of using binary silhouettes as input data (as done in many previous works), we propose and evaluate the use of motion descriptors based on densely sampled short-term trajectories. We take advantage of state-of-the-art people detectors to define custom spatial configurations of the descriptors around the target person, obtaining a rich representation of the gait motion. The local motion features (described by the Divergence-Curl-Shear descriptor [1]) extracted from the different spatial areas of the person are combined into a single high-level gait descriptor using Fisher Vector encoding [2]. The proposed approach, coined Pyramidal Fisher Motion, is experimentally validated on the `CASIA' dataset [3] (parts B and C), the `TUM GAID' dataset [4], the `CMU MoBo' dataset [5] and the recent `AVA Multiview Gait' dataset [6]. The results show that this new approach achieves state-of-the-art results on the problem of gait recognition, enabling recognition of walking people from diverse viewpoints with single- and multiple-camera setups, wearing different clothes, carrying bags, walking at diverse speeds, and not limited to straight walking paths.
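As a rough illustration of the Fisher Vector pooling step, the numpy sketch below computes only the first-order gradients with respect to the GMM means, omitting the zeroth- and second-order terms and the power/L2 normalizations used in the full encoding. The descriptor data and GMM parameters are made up for the example.

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Simplified Fisher Vector: gradient of the GMM log-likelihood
    w.r.t. each component mean, pooled over N local descriptors."""
    K, D = means.shape
    # Soft-assign each descriptor to each diagonal-covariance Gaussian.
    log_p = np.stack([
        -0.5 * np.sum(((X - means[k]) / sigmas[k]) ** 2
                      + np.log(2 * np.pi * sigmas[k] ** 2), axis=1)
        + np.log(weights[k])
        for k in range(K)
    ], axis=1)                                     # shape (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)      # posteriors
    N = X.shape[0]
    # Per-component mean-gradient terms, concatenated into one vector.
    fv = np.concatenate([
        (gamma[:, k:k + 1] * (X - means[k]) / sigmas[k]).sum(axis=0)
        / (N * np.sqrt(weights[k]))
        for k in range(K)
    ])
    return fv                                      # length K * D

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))      # 100 local motion descriptors, dim 4
weights = np.array([0.5, 0.5])         # 2-component GMM (toy vocabulary)
means = np.zeros((2, 4))
sigmas = np.ones((2, 4))
fv = fisher_vector_means(X, weights, means, sigmas)
```

The resulting fixed-length vector (here K·D = 8) is what gets fed to a linear classifier, regardless of how many trajectories were extracted from the video.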