9 research outputs found

    Reducing Model Complexity for DNN Based Large-Scale Audio Classification

    Full text link
    Audio classification is the task of identifying the sound categories associated with a given audio signal. This paper presents an investigation of large-scale audio classification based on the recently released AudioSet database. AudioSet comprises 2 million audio samples from YouTube, human-annotated with 527 sound category labels. Audio classification experiments with the balanced training set and the evaluation set of AudioSet are carried out with different types of neural network models. The classification performance and model complexity of these models are compared and analyzed. While the CNN models perform better than the MLP and RNN models, their model complexity is relatively high and undesirable for practical use. We propose two strategies that aim to construct low-dimensional embedding feature extractors and hence reduce the number of model parameters. It is shown that the simplified CNN model has only 1/22 of the model parameters of the original model, with only a slight degradation in performance. Comment: Accepted by ICASSP 2018
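
    As a rough illustration of the low-dimensional embedding idea described above, the sketch below (in PyTorch, with layer shapes that are our own assumptions, not the paper's exact architecture) inserts a narrow embedding bottleneck before the classifier head, which is where dense-layer parameters would otherwise accumulate:

```python
# Minimal sketch of a low-dimensional embedding bottleneck in a CNN audio
# classifier. Layer sizes are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class BottleneckCNN(nn.Module):
    def __init__(self, n_classes=527, embed_dim=64):
        super().__init__()
        # Small convolutional front end over a log-mel spectrogram input.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Low-dimensional embedding: a narrow bottleneck keeps the dense
        # classifier head (embed_dim x n_classes weights) small.
        self.embed = nn.Linear(64, embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):                  # x: (batch, 1, mel_bins, frames)
        h = self.features(x).flatten(1)    # (batch, 64)
        z = torch.relu(self.embed(h))      # (batch, embed_dim)
        return self.head(z)                # multi-label logits, 527 classes
```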

    A Closer Look at Weak Label Learning for Audio Events

    Get PDF
    Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak labeling approaches for audio or sound event detection (AED) and the availability of large-scale weakly labeled datasets have finally opened up the possibility of large-scale AED. However, a deeper understanding of how weak labels affect the learning of sound events is still missing from the literature. In this work, we first describe a CNN-based approach for weakly supervised training of audio events. The approach follows some basic design principles desirable in a learning method relying on weakly labeled audio. We then describe important characteristics which naturally arise in weakly supervised learning of sound events, and show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and label corruption affect weakly supervised training for audio events. We also study the feasibility of directly obtaining weakly labeled data from the web, without any manual labeling, and compare it with a dataset which has been manually labeled. The analysis and understanding of these factors should be taken into account in the development of future weak label learning methods. AudioSet, a large-scale weakly labeled dataset for sound events, is used in our experiments
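
    The core weak-label mechanism can be summarized in a few lines: the network scores short segments of the clip, and a pooling function reduces the segment scores to a single clip-level prediction, so that only clip-level (weak) labels are needed for the loss. The sketch below is a hedged illustration with assumed feature shapes, not the authors' exact network:

```python
# Sketch of weakly supervised training for audio events: per-segment class
# probabilities are pooled into one clip-level prediction. Feature dimension
# and pooling choice are illustrative assumptions.
import torch
import torch.nn as nn

class WeakLabelPooling(nn.Module):
    def __init__(self, n_classes=527, feat_dim=128):
        super().__init__()
        # 1x1 convolution maps segment features to per-class scores.
        self.segment_scores = nn.Conv1d(feat_dim, n_classes, kernel_size=1)

    def forward(self, feats):  # feats: (batch, feat_dim, n_segments)
        s = torch.sigmoid(self.segment_scores(feats))  # per-segment probs
        # Mean pooling over segments yields one clip-level probability per
        # class; max pooling is a common alternative for sparse events.
        return s.mean(dim=-1)  # (batch, n_classes)

# Training then uses only the weak (clip-level) labels, e.g.:
# loss = nn.functional.binary_cross_entropy(model(feats), weak_labels)
```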

    A comparative study of preprocessing and model compression techniques in deep learning for forest sound classification

    Get PDF
    Deep-learning models play a significant role in modern software solutions, with the capability to handle complex tasks, improve accuracy, automate processes, and adapt to diverse domains, ultimately contributing to advancements in various industries. This work provides a comparative study of deep-learning techniques that can be deployed on resource-constrained edge devices. As a novel contribution, we analyze the performance of seven Convolutional Neural Network models in the context of data augmentation, feature extraction, and model compression using acoustic data. The results show that the best performers can achieve an optimal trade-off between model accuracy and size when compressed with weight and filter pruning followed by 8-bit quantization. Following the study workflow on the forest sound dataset, MobileNet-v3-small and ACDNet achieved accuracies of 87.95% and 85.64% while maintaining compact sizes of 243 KB and 484 KB, respectively. This study therefore concludes that CNNs can be optimized and compressed for deployment on resource-constrained edge devices to classify forest environment sounds
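
    The compression recipe named above, weight and filter pruning followed by 8-bit quantization, can be sketched with standard PyTorch utilities; the toy model and pruning ratios below are illustrative assumptions, not the study's configuration:

```python
# Hedged sketch of pruning followed by 8-bit quantization using standard
# PyTorch utilities. Model shapes and pruning ratios are illustrative.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv2d(1, 16, 3), nn.ReLU(),          # expects (batch, 1, 64, 64) input
    nn.Flatten(), nn.Linear(16 * 62 * 62, 10),
)

# Weight pruning: zero the 50% smallest-magnitude weights in the linear layer.
prune.l1_unstructured(model[3], name="weight", amount=0.5)
# Filter pruning: remove 25% of conv filters by L2 norm along the output dim.
prune.ln_structured(model[0], name="weight", amount=0.25, n=2, dim=0)
# Make the pruning permanent before quantizing.
prune.remove(model[3], "weight")
prune.remove(model[0], "weight")

# 8-bit dynamic quantization of the linear layers' weights.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```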

    Classification of Sound Scenes and Events in Real-World Scenarios with Deep Learning Techniques

    Get PDF
    The classification of sound events is a field of machine listening that is becoming increasingly interesting due to the large number of applications that could benefit from this technology. Unlike other fields of machine listening related to music information retrieval or speech recognition, sound event classification has a number of intrinsic problems: the polyphonic nature of most environmental sound recordings, the difference in the nature of each sound, the lack of temporal structure, and the addition of background noise and reverberation in the recording process. These problems are active fields of study for the scientific community today. However, when a machine listening solution is deployed in real environments, a number of additional problems may arise: Open-Set Recognition (OSR), Few-Shot Learning (FSL), and consideration of system runtime (low complexity). OSR is defined as the problem that appears when an artificial intelligence system has to face an unknown situation in which classes unseen during the training stage are present at inference. FSL corresponds to the problem that occurs when very few samples are available for each considered class. Finally, since these systems are normally deployed on edge devices, execution time must be taken into account, as the less time the system takes to give a response, the better the experience perceived by users. Solutions based on deep learning techniques for similar problems in the image domain have shown promising results. The most widespread solutions are those that implement Convolutional Neural Networks (CNNs). Therefore, many state-of-the-art audio systems propose converting audio signals into a two-dimensional representation that can be treated as an image. The generation of internal maps is often done by the convolutional layers of the CNNs. However, these layers have a series of limitations that must be studied in order to propose techniques for improving the resulting feature maps. To this end, novel networks have been proposed that merge two different methods, residual learning and squeeze-and-excitation techniques. The results show an improvement in the accuracy of the system with the addition of only a small number of extra parameters. On the other hand, solutions based on two-dimensional inputs can show a certain bias, since the choice of audio representation can be specific to a particular task. Therefore, a comparative study of different residual networks fed directly with the raw audio signal has been carried out. These solutions are known as end-to-end. While similar studies have been carried out in the image-domain literature, the results suggest that the best-performing residual blocks for computer vision tasks may not be the same as those for audio classification. Regarding the FSL and OSR problems, an autoencoder-based framework capable of mitigating both problems together is proposed. This solution is able to create robust representations of audio patterns from just a few samples while rejecting unwanted audio classes
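
    As a hedged sketch of how an autoencoder can address OSR and FSL together (the architecture, threshold, and prototype matching below are illustrative assumptions, not the thesis's exact framework): the autoencoder is trained to reconstruct embeddings of the known classes, inputs with high reconstruction error are rejected as unknown, and accepted inputs are matched to few-shot class prototypes in the latent space:

```python
# Illustrative autoencoder-based open-set rejection with few-shot prototype
# matching. Dimensions, threshold, and prototypes are assumptions.
import torch
import torch.nn as nn

class RejectingAutoencoder(nn.Module):
    def __init__(self, in_dim=128, code_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, code_dim))
        self.dec = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def classify_or_reject(model, x, prototypes, threshold):
    """Reject as unknown (-1) when reconstruction error is high; otherwise
    assign the nearest few-shot class prototype in the latent space."""
    recon = model(x)
    err = torch.mean((recon - x) ** 2, dim=-1)   # per-sample MSE
    code = model.enc(x)
    dists = torch.cdist(code, prototypes)        # (batch, n_known_classes)
    pred = dists.argmin(dim=-1)
    return torch.where(err > threshold, torch.full_like(pred, -1), pred)
```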