5 research outputs found

    Detección de voz y música en un corpus a gran escala de eventos de audio

    Full text link
    Acoustic event recognition is the ability to extract information about an occurrence from the sounds it produces. This is a basic skill of the human auditory system, but its development in computational intelligence is an active research problem addressed with machine learning models such as neural networks. This Master's Thesis studies the application of several neural network architectures to speech and music event detection over a collection of 77,936 ten-second audio segments (216 hours) obtained from the Google AudioSet dataset. These segments belong to fragments of YouTube videos. Two different approaches are proposed and compared: the first trains two separate neural networks, one for speech presence detection and one for music presence detection, while the second trains a single joint neural network that tackles both tasks simultaneously. The studied architectures include fully-connected networks, convolutional networks and LSTM (Long Short-Term Memory) networks. The report describes the organization of the dataset, the construction of the data subsets used and the design of the proposed models. Additionally, comparative results of the evaluated configurations are provided in terms of both performance and model complexity.
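
    As a rough illustration of the two approaches compared in this thesis (not its actual code), the sketch below contrasts two separate single-output detectors with a single joint network that outputs both labels. The feature dimension and layer sizes are assumptions chosen only for illustration.

    # Illustrative sketch: separate vs. joint speech/music detectors (assumed sizes).
    from tensorflow.keras import layers, models

    N_FEATURES = 128  # assumed per-segment feature dimension

    def separate_detector():
        # One binary classifier; an identical copy would be trained for music.
        inputs = layers.Input(shape=(N_FEATURES,))
        hidden = layers.Dense(64, activation="relu")(inputs)
        output = layers.Dense(1, activation="sigmoid")(hidden)  # P(speech) or P(music)
        return models.Model(inputs, output)

    def joint_detector():
        # A single network with two sigmoid outputs trained on both labels at once.
        inputs = layers.Input(shape=(N_FEATURES,))
        hidden = layers.Dense(64, activation="relu")(inputs)
        outputs = layers.Dense(2, activation="sigmoid")(hidden)  # [P(speech), P(music)]
        return models.Model(inputs, outputs)

    speech_net = separate_detector()   # approach 1: two independent models
    music_net = separate_detector()
    joint_net = joint_detector()       # approach 2: one multi-task model
    joint_net.compile(optimizer="adam", loss="binary_crossentropy")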

    Detección de música en contenidos multimedia mediante ritmo y armonía

    Full text link
    This Bachelor Thesis falls within the area of audio segmentation, focusing on the detection of musical content in audio signals. Since musicality is a more subjective property of an audio signal than, for example, the presence of speech, it is necessary to define which objective properties of the signal are relevant when deciding whether music is present or not. The features used here as traces of musicality are the presence of a rhythmic beat in the temporal evolution of the signal and the harmony (chromaticity) found in its frequency spectrum. The system developed in this thesis bases its music detection decisions on both components: rhythm and harmony. For that purpose, it uses two dedicated detectors, one per component, which can work together to detect musical content but can also run separately to segment the signal based only on rhythm or only on harmony. The rhythm detector mainly evaluates the periodicity of the local energy of the audio signal. This periodicity can be quantified by building and analyzing autocorrelation matrices, which capture the evolution of the autocorrelation function along the duration of the signal. The harmony (chromaticity) detector, in turn, starts from the chromagram, a spectral representation derived from the Short-Time Fourier Transform (STFT). Unlike a spectrogram, a chromagram accumulates the spectral components belonging to the same musical note, showing whether the spectral distribution favors certain notes over others and thus suggesting the presence of harmony. The thesis also includes performance tests on the ATVS-Radio database, which contains 25 hours of audio labeled for the presence of music and speech.
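
    The rhythm and harmony cues described above can be approximated with standard audio tooling. The following sketch is an assumption-laden illustration, not the thesis implementation: it computes the autocorrelation of the local RMS energy as a rhythm cue and a chromagram-based pitch-class concentration as a harmony cue; frame sizes and the concentration measure are arbitrary choices.

    # Illustrative sketch: rhythm and harmony cues with numpy/librosa (assumed settings).
    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex("trumpet"))  # any mono signal would do

    # Rhythm cue: autocorrelation of the local energy (RMS) envelope.
    rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
    rms = rms - rms.mean()
    acorr = np.correlate(rms, rms, mode="full")[len(rms) - 1:]
    acorr /= acorr[0] + 1e-12  # normalize; strong peaks at lag > 0 suggest a beat

    # Harmony cue: chromagram from the STFT; energy concentrated in a few
    # pitch classes suggests harmonic (musical) content.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=2048, hop_length=512)
    pitch_class_profile = chroma.mean(axis=1)
    concentration = pitch_class_profile.max() / (pitch_class_profile.mean() + 1e-12)

    print("max autocorrelation peak (lag > 0):", acorr[1:].max())
    print("pitch-class concentration:", concentration)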

    An analysis of sound event detection under acoustic degradation using multi-resolution systems

    Full text link
    The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multiresolution approach which is able to improve Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights on the benefits of this multiresolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of the performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets. This research and the APC were supported by project DSForSec (grant number RTI2018-098091-B-I00), funded by the Ministry of Science, Innovation and Universities of Spain and the European Regional Development Fund (ERDF).
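
    As a hedged illustration of the multiresolution feature extraction and the degradations discussed above (not the paper's exact configuration), the sketch below applies simple clipping and low-pass degradations and then computes Mel-spectrograms at three assumed time-frequency resolutions.

    # Illustrative sketch: degradations plus multi-resolution Mel-spectrograms (assumed values).
    import numpy as np
    import librosa
    from scipy.signal import butter, sosfilt

    y, sr = librosa.load(librosa.ex("trumpet"))

    # Two example degradations: amplitude clipping and low-pass filtering.
    y_clipped = np.clip(y, -0.1, 0.1)
    sos = butter(4, 2000, btype="low", fs=sr, output="sos")
    y_lowpass = sosfilt(sos, y)

    # Multi-resolution extraction: short windows favor temporal detail,
    # long windows favor frequency detail.
    resolutions = [
        {"n_fft": 1024, "hop_length": 256, "n_mels": 64},
        {"n_fft": 2048, "hop_length": 512, "n_mels": 128},
        {"n_fft": 4096, "hop_length": 1024, "n_mels": 128},
    ]
    for name, sig in {"clipped": y_clipped, "lowpass": y_lowpass}.items():
        for params in resolutions:
            mel = librosa.power_to_db(
                librosa.feature.melspectrogram(y=sig, sr=sr, **params)
            )
            print(name, params, "->", mel.shape)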

    Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

    Full text link
    Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first one is the training of two different neural networks, one for speech detection and another for music detection. The second approach consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We would like to highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech. This work has been supported by project “DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz” (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitiveness of Spain and FEDER.
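
    To make the hybrid convolutional-LSTM idea concrete, here is a minimal sketch with assumed input shape, filter counts and layer sizes (not the authors' model): a network that reads a mel-spectrogram and produces two sigmoid outputs, one for speech and one for music.

    # Illustrative sketch: hybrid CNN-LSTM over a mel-spectrogram (assumed dimensions).
    from tensorflow.keras import layers, models

    N_FRAMES, N_MELS = 500, 64  # assumed 10-second mel-spectrogram shape

    inputs = layers.Input(shape=(N_FRAMES, N_MELS, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)          # halve time and frequency axes
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the frequency axis so the LSTM sees one vector per time step.
    x = layers.Reshape((N_FRAMES // 4, (N_MELS // 4) * 64))(x)
    x = layers.LSTM(64)(x)                       # temporal modeling stage
    outputs = layers.Dense(2, activation="sigmoid")(x)  # [P(speech), P(music)]

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["binary_accuracy"])
    model.summary()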

    An Analysis of Sound Event Detection under Acoustic Degradation Using Multi-Resolution Systems

    No full text
    The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multiresolution approach which is able to improve Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights on the benefits of this multiresolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of the performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.