3 research outputs found

    A Hybrid Approach with Multi-channel I-Vectors and Convolutional Neural Networks for Acoustic Scene Classification

    In Acoustic Scene Classification (ASC), two major approaches have been followed. One utilizes engineered features such as mel-frequency cepstral coefficients (MFCCs), while the other uses learned features that are the outcome of an optimization algorithm. I-vectors are the result of a modeling technique that usually takes engineered features as input. It has been shown that standard MFCCs extracted from monaural audio signals lead to i-vectors that exhibit poor performance, especially on indoor acoustic scenes. At the same time, Convolutional Neural Networks (CNNs) are well known for their ability to learn features by optimizing their filters. They have been applied to ASC and have shown promising results. In this paper, we first propose a novel multi-channel i-vector extraction and scoring scheme for ASC, improving i-vector performance on indoor and outdoor scenes. Second, we propose a CNN architecture that achieves promising ASC results. Further, we show that i-vectors and CNNs capture complementary information from acoustic scenes. Finally, we propose a hybrid system for ASC using multi-channel i-vectors and CNNs by utilizing a score fusion technique. Using our method, we participated in the ASC task of the DCASE-2016 challenge. Our hybrid approach achieved first rank among 49 submissions, substantially improving on the previous state of the art.
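The abstract does not say which score fusion technique is used, so the following is only a minimal sketch of one common late-fusion scheme: each system's per-class scores are z-normalized and then combined with a weighted sum. The function names, the `weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def zscore(scores):
    """Normalize a list of per-class scores to zero mean and unit variance."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant scores
    return [(s - mean) / std for s in scores]


def fuse_scores(ivector_scores, cnn_scores, weight=0.5):
    """Weighted sum of normalized scores (assumed fusion rule, not the paper's)."""
    if len(ivector_scores) != len(cnn_scores):
        raise ValueError("both systems must score the same set of classes")
    a = zscore(ivector_scores)
    b = zscore(cnn_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def predict(ivector_scores, cnn_scores, labels, weight=0.5):
    """Return the label whose fused score is highest."""
    fused = fuse_scores(ivector_scores, cnn_scores, weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical scores for three scene classes from the two systems.
labels = ["park", "office", "bus"]
ivector = [2.0, 1.0, 0.0]
cnn = [0.1, 0.7, 0.2]
print(predict(ivector, cnn, labels))
```

Normalizing before fusing matters because the two systems generally produce scores on different scales; without it, the system with the larger numeric range would dominate the sum.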

    Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data

    Models for music artist classification have usually operated in the frequency domain, in which the input audio samples are processed by a spectral transformation. The WaveNet architecture was originally designed for speech and music generation. In this paper, we propose an end-to-end architecture in the time domain for this task. A WaveNet classifier is introduced that directly models features from the raw audio waveform. The WaveNet takes the waveform as input, and several downsampling layers follow to discriminate which artist the input belongs to. In addition, the proposed method is applied to singer identification. The best-performing model obtains an average F1 score of 0.854 on the Artist20 benchmark dataset, which is a significant improvement over related work. To show the effectiveness of the feature learning in the proposed method, the bottleneck layer of the model is visualized. Comment: 12 page
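As a rough illustration of what modeling in the time domain involves, the sketch below implements a dilated causal 1-D convolution, the building block of the original WaveNet generation model, applied directly to waveform samples. The abstract does not give the classifier's layer configuration, so the kernel, the dilation schedule, and the function names here are assumptions chosen for illustration.

```python
def causal_dilated_conv(signal, kernel, dilation):
    """y[t] = sum_k kernel[k] * signal[t - k * dilation].

    Samples before the start of the signal are treated as zero, so each
    output depends only on the current and past inputs (causality).
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Toy waveform and a two-tap kernel with dilation 2.
print(causal_dilated_conv([1, 2, 3, 4], [1, 1], 2))
# Doubling the dilation per layer grows the receptive field exponentially.
print(receptive_field(2, [1, 2, 4, 8]))
```

Stacking layers with growing dilation lets a model cover long stretches of raw audio with few parameters, which is what makes working without a spectral front end practical.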

    Signal processing techniques for robust sound event recognition

    The computational analysis of acoustic scenes is today a topic of major interest, with a growing community focused on designing machines capable of identifying and understanding the sounds produced in our environment, similar to how humans perform this task. Although these domains have not reached the industrial popularity of other related audio domains, such as speech recognition or music analysis, applications designed to identify the occurrence of sounds in a given scenario are rapidly increasing. These applications are usually limited to a set of sound classes, which must be defined beforehand. In order to train sound classification models, representative sets of sound events are recorded and used as training data. However, the acoustic conditions present during the collection of training examples may not coincide with the conditions during application testing. Background noise, overlapping sound events or weakly segmented data, among others, may substantially affect audio data, lowering the actual performance of the learned models. To avoid such situations, machine learning systems have to be designed with the ability to generalize to data collected under conditions different from the ones seen during training. Traditionally, the techniques used to carry out tasks related to the computational understanding of sound events have been inspired by similar domains such as music or speech, so the features selected to represent acoustic events come from those specific domains. Most of the contributions of this thesis are based on how such features are suitably applied for sound event recognition, proposing specific methods to adapt the features extracted both within classical recognition approaches and modern end-to-end convolutional neural networks. 
The objective of this thesis is therefore to develop novel signal processing techniques aimed at increasing the robustness of the features representing acoustic events to adverse conditions that cause a mismatch between the training and test conditions in model learning. To achieve this objective, we first analyze the importance of classical feature sets such as Mel-frequency cepstral coefficients (MFCCs) or the energies extracted from log-mel filterbanks, as well as the impact of noise, reverberation or segmentation errors in diverse scenarios. We show that the performance of both classical and deep learning-based approaches is severely affected by these factors, and we propose novel signal processing techniques designed to improve their robustness by means of a non-linear transformation of feature vectors along the temporal axis. This transformation is based on the so-called event trace, which can be interpreted as an indicator of the temporal activity of the event within the feature space. Finally, we propose the use of the energy envelope as a target for event detection, which implies a change from a classification-based approach to a regression-oriented one.
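To make the contrast between classification and regression targets concrete, the sketch below computes a frame-wise energy envelope and, for comparison, a binary activity label derived from it. The abstract does not define how the envelope is computed, so the RMS definition, the frame parameters, the threshold, and the function names are all assumptions.

```python
import math


def energy_envelope(samples, frame_len, hop):
    """Frame-wise RMS energy (assumed definition of the envelope)."""
    env = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        env.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return env


def binary_activity(envelope, threshold):
    """Classification-style target: 1 where the event is judged active."""
    return [1 if e > threshold else 0 for e in envelope]


# Toy signal: silence, a short burst, then silence again.
samples = [0, 0, 0, 0, 1, -1, 1, -1, 0, 0, 0, 0]
env = energy_envelope(samples, frame_len=4, hop=4)
print(env)                        # continuous regression target
print(binary_activity(env, 0.5))  # hard classification target
```

A regression model trained on the continuous envelope receives graded information about how strongly an event is present in each frame, whereas the binary label discards everything except the side of the threshold.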