94 research outputs found

    Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

    Full text link
    In this paper, we propose a new strategy for acoustic scene classification (ASC), namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations of sound events inside a scene. We also propose an MIL neural network model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal-scale and multi-modal natures of the sound events, respectively. The experiments were conducted on the official development set of the DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4 percentage points (68.3% vs. 58.9%) in terms of classification accuracy. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones. Comment: code URL typo, code is available at https://github.com/hackerekcah/distinct-events-asc.gi
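    A minimal sketch (not the authors' architecture) of the bag-of-instances idea described above, assuming a PyTorch setting where each recording has already been split into fixed-size instance embeddings; the max-over-instances aggregation stands in for the paper's implicit identification of distinct sound events:

```python
# Hedged MIL sketch for acoustic scene classification; dimensions,
# the instance encoder, and max pooling are illustrative assumptions.
import torch
import torch.nn as nn

class MILSceneClassifier(nn.Module):
    def __init__(self, n_instances=16, instance_dim=64, n_classes=10):
        super().__init__()
        # Instance scorer: maps each instance (a proxy for a sound event)
        # to a per-class score vector.
        self.instance_scorer = nn.Sequential(
            nn.Linear(instance_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, bags):
        # bags: (batch, n_instances, instance_dim), one bag per recording
        instance_logits = self.instance_scorer(bags)   # (B, N, C)
        # Max over instances: the most distinctive instance drives each class score.
        bag_logits, _ = instance_logits.max(dim=1)     # (B, C)
        return bag_logits, instance_logits

# Toy usage: 4 recordings, each a bag of 16 instances with 64-dim features.
model = MILSceneClassifier()
bag_logits, instance_logits = model(torch.randn(4, 16, 64))
print(bag_logits.shape)  # torch.Size([4, 10])
```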

    Deep learning techniques for computer audition

    Get PDF
    Automatically recognising audio signals plays a crucial role in the development of intelligent computer audition systems. In particular, audio signal classification, which aims to predict a label for an audio wave, has enabled many real-life applications. Considerable effort has been made to develop effective audio signal classification systems for the real world. However, several challenges in deep learning techniques for audio signal classification remain to be addressed. For instance, training a deep neural network (DNN) from scratch is time-consuming when extracting high-level deep representations. Furthermore, DNNs have not been well explained, which hinders building trust between humans and machines and developing realistic intelligent systems. Moreover, most DNNs are vulnerable to adversarial attacks, resulting in many misclassifications. To deal with these challenges, this thesis proposes and presents a set of deep-learning-based approaches for audio signal classification. In particular, to tackle the challenge of extracting high-level deep representations, transfer learning frameworks that benefit from models pre-trained on large-scale image datasets are introduced to produce effective deep spectrum representations. Furthermore, attention mechanisms at both the frame level and the time-frequency level are proposed to explain the DNNs by estimating the contributions of each frame and each time-frequency bin to the predictions, respectively. Likewise, convolutional neural networks (CNNs) with an attention mechanism at the time-frequency level are extended to atrous CNNs with attention, aiming to explain the CNNs by visualising high-resolution attention tensors. Additionally, to interpret the CNNs evaluated on multi-device datasets, the atrous CNNs with attention are trained in conditional training frameworks. Moreover, to improve the robustness of the DNNs against adversarial attacks, models are trained in adversarial training frameworks. In addition, the transferability of adversarial attacks is enhanced by a lifelong learning framework. Finally, experiments conducted on various datasets demonstrate that the presented approaches are effective in addressing these challenges.
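    As a hedged illustration of the transfer-learning idea for deep spectrum representations, the sketch below feeds a log-mel spectrogram, replicated to three channels, through an ImageNet-pretrained ResNet-18 from torchvision; the backbone choice and input shapes are assumptions, not the thesis's exact pipeline:

```python
# "Deep spectrum"-style feature extraction sketch: a spectrogram rendered
# as a 3-channel image and passed through an ImageNet-pretrained CNN.
import torch
import torchvision.models as models

def deep_spectrum_features(spectrogram: torch.Tensor) -> torch.Tensor:
    """spectrogram: (batch, mel_bins, frames), roughly normalised values."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # drop the ImageNet classifier head
    backbone.eval()
    # Repeat the single spectrogram channel to mimic an RGB image.
    x = spectrogram.unsqueeze(1).repeat(1, 3, 1, 1)
    with torch.no_grad():
        return backbone(x)              # (batch, 512) deep representations

features = deep_spectrum_features(torch.randn(2, 128, 224))
print(features.shape)  # torch.Size([2, 512])
```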

    Automatic User Preferences Selection of Smart Hearing Aid Using BioAid

    Get PDF
    Noisy environments, changes and variations in the volume of speech, and non-face-to-face conversations impair the user experience with hearing aids. Generally, a hearing aid amplifies sounds so that a hearing-impaired person can listen, converse, and actively engage in daily activities. There are now sophisticated hearing aid algorithms that operate on numerous frequency bands to not only amplify sound but also provide tuning and noise filtering to minimize background distractions. One of these is the BioAid assistive hearing system, an open-source, freely downloadable app with twenty-four tuning settings. Critically, with this device, a person with hearing loss must manually alter the settings/tuning of their hearing device whenever their surroundings and scene change in order to attain a comfortable level of hearing. This manual switching among multiple tuning settings is inconvenient and cumbersome, since the user is forced to switch to the state that best matches the scene every time the auditory environment changes. The goal of this study is to eliminate this manual switching and automate BioAid with a scene classification algorithm, so that the system automatically identifies the user-selected preferences after adequate training. The aim of acoustic scene classification is to recognize the audio signature of one of the predefined scene classes that best represents the environment in which the audio was recorded. BioAid, an open-source biologically inspired hearing aid algorithm, is used after conversion to Python. The proposed method consists of two main parts: classification of auditory scenes and selection of hearing aid tuning settings based on user experiences. The DCASE2017 dataset is utilized for scene classification. Among the many classifiers that were trained and tested, random forests achieve the highest accuracy of 99.7%. In the second part, clean speech audio from the LJ Speech dataset is combined with the scenes, and the user is asked to listen to the resulting audio and adjust the presets and subsets. A CSV file stores the presets and subsets at which the user can hear clearly against each scene. Various classifiers are trained on this dataset of user preferences. After training, clean speech audio is convolved with a scene and fed as input to the scene classifier, which predicts the scene. The predicted scene is then fed as input to the preset classifier, which predicts the user's choice of preset and subset. BioAid is automatically tuned to the predicted selection. The accuracy of the random forest in predicting presets and subsets was 100%. This approach has great potential to eliminate the tedious manual switching of hearing assistive device parameters, allowing hearing-impaired individuals to actively participate in daily life while hearing aid settings are adjusted automatically based on the acoustic scene.
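    The two-stage idea (a scene classifier followed by a preset/subset classifier, both random forests) could be prototyped roughly as below with scikit-learn; the feature dimensions, class counts, and random data are placeholders, not the actual BioAid/DCASE2017 pipeline:

```python
# Hedged two-stage sketch: scene classifier -> preset/subset classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stage 1: acoustic scene classifier (features could be, e.g., MFCC statistics).
X_scene, y_scene = rng.normal(size=(200, 40)), rng.integers(0, 15, size=200)
scene_clf = RandomForestClassifier(n_estimators=100).fit(X_scene, y_scene)

# Stage 2: preset/subset classifier trained on user preferences per scene.
# Input: predicted scene id; output: index into the 24 BioAid tuning settings.
scene_ids = rng.integers(0, 15, size=(200, 1))
y_preset = rng.integers(0, 24, size=200)
preset_clf = RandomForestClassifier(n_estimators=100).fit(scene_ids, y_preset)

# Inference: features of a new clip -> scene -> automatically chosen preset.
new_clip = rng.normal(size=(1, 40))
scene = scene_clf.predict(new_clip)
preset = preset_clf.predict(scene.reshape(-1, 1))
print(f"predicted scene {scene[0]}, apply preset/subset {preset[0]}")
```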

    Attention-based atrous convolutional neural networks: visualisation and understanding perspectives of acoustic scenes

    Get PDF
    The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, thereby reducing the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but that the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7%, significantly outperforming CNNs without dilation at 60.4%. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.
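    A minimal sketch of the two ingredients named in the abstract, dilated (atrous) convolutions without local pooling and global attention pooling over the time-frequency map; layer widths, dilation rates, and the sigmoid/softmax pooling form are assumptions, not the authors' exact model:

```python
# Hedged atrous-CNN-with-attention sketch; feature maps keep full resolution,
# so the attention tensor can be inspected at high resolution.
import torch
import torch.nn as nn

class AtrousAttentionCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.class_map = nn.Conv2d(32, n_classes, 1)  # per-bin class scores
        self.attn_map = nn.Conv2d(32, n_classes, 1)   # per-bin attention logits

    def forward(self, x):
        # x: (batch, 1, mel_bins, frames); no local pooling is applied.
        h = self.features(x)
        cla = torch.sigmoid(self.class_map(h))                           # (B, C, F, T)
        att = torch.softmax(self.attn_map(h).flatten(2), dim=-1).view_as(cla)
        return (cla * att).sum(dim=(2, 3))   # attention-weighted global pooling

model = AtrousAttentionCNN()
out = model(torch.randn(2, 1, 64, 100))
print(out.shape)  # torch.Size([2, 10])
```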

    Binaural Acoustic Scene Classification Using Wavelet Scattering, Parallel Ensemble Classifiers and Nonlinear Fusion

    Get PDF
    The analysis of ambient sounds can be very useful when developing sound-based intelligent systems. Acoustic scene classification (ASC) is defined as identifying, from a set of predefined scenes, the environment in which a sound clip was recorded. ASC has huge potential for use in urban sound event classification systems. This research presents a hybrid method built around a novel mathematical fusion step, which aims to tackle the accuracy and adaptability challenges of current state-of-the-art ASC models. The proposed method uses a stereo signal, two ensemble classifiers (random subspace), and the novel fusion step. First, a stable, invariant representation of the stereo signal is built using the Wavelet Scattering Transform (WST). For each mono channel, i.e., left and right, a separate random subspace classifier is trained on the WST representation. A novel mathematical formula for the fusion step was developed, with its parameters found using a genetic algorithm. The results on the DCASE 2017 dataset showed that the proposed method achieves higher classification accuracy (about 95%), pushing the boundaries of existing methods.
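    The overall pipeline structure could be sketched as below; the wavelet scattering coefficients are replaced by random arrays, the random subspace ensembles are approximated with scikit-learn's BaggingClassifier, and a simple linear weighted average stands in for the paper's GA-tuned nonlinear fusion formula:

```python
# Hedged sketch: per-channel random-subspace classifiers plus a fusion step.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_clips, n_feats, n_scenes = 300, 128, 15

# Stand-ins for wavelet scattering coefficients of the left/right channels.
X_left = rng.normal(size=(n_clips, n_feats))
X_right = rng.normal(size=(n_clips, n_feats))
y = rng.integers(0, n_scenes, size=n_clips)

def random_subspace(X, y):
    # Random-subspace ensemble: each tree sees a random subset of features.
    clf = BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=50,
        max_features=0.5, bootstrap=False, bootstrap_features=True,
    )
    return clf.fit(X, y)

left_clf, right_clf = random_subspace(X_left, y), random_subspace(X_right, y)

# Linear weighted-average placeholder for the paper's nonlinear fusion formula.
alpha = 0.6
p = alpha * left_clf.predict_proba(X_left) + (1 - alpha) * right_clf.predict_proba(X_right)
pred = left_clf.classes_[p.argmax(axis=1)]
print("fused training accuracy:", (pred == y).mean())
```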