67 research outputs found

    Identifiability of multivariate logistic mixture models

    Full text link
    Mixture models have been widely used in modeling of continuous observations. For the possibility to estimate the parameters of a mixture model consistently on the basis of observations from the mixture, identifiability is a necessary condition. In this study, we give some results on the identifiability of multivariate logistic mixture models

    A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification

    Full text link
    One of the biggest challenges of acoustic scene classification (ASC) is to find proper features to better represent and characterize environmental sounds. Environmental sounds generally involve more sound sources while exhibiting less structure in temporal spectral representations. However, the background of an acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting it could be characterized by distribution statistics rather than temporal details. In this work, we investigated using auditory summary statistics as the feature for ASC tasks. The inspiration comes from a recent neuroscience study, which shows the human auditory system tends to perceive sound textures through time-averaged statistics. Based on these statistics, we further proposed to use linear discriminant analysis to eliminate redundancies among these statistics while keeping the discriminative information, providing an extreme com-pact representation for acoustic scenes. Experimental results show the outstanding performance of the proposed feature over the conventional handcrafted features.Comment: Accepted as a conference paper of Interspeech 201

    Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

    Full text link
    In this paper, we propose a new strategy for acoustic scene classification (ASC) , namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations for sound events inside a scene. We also propose a MIL neural networks model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal scale and multi-modal natures of the sound events respectively. The experiments were conducted on the official development set of the DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones.Comment: code URL typo, code is available at https://github.com/hackerekcah/distinct-events-asc.gi

    Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text

    Full text link
    Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals can be encoded to acquire embeddings, and then the embeddings are concatenated together for the final classification. However, due to the influence of noise or other factors, each modality does not always tend to the same emotional category, which affects the generalization of a model. In this paper, we propose a novel regularization method via contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator to distinguish the difference between the same and different emotional pairs, we explicitly restrict the latent code of each modality to contain the same emotional information, so as to reduce the noise interference and get more discriminative representation. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show a significant improvement of 1.44\% and 1.53\% in terms of weighted accuracy (WA) and unweighted accuracy (UA) compared to the baseline system.Comment: Completed in October 2020 and submitted to ICASSP202

    Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss

    Full text link
    Deep neural network with dual-path bi-directional long short-term memory (BiLSTM) block has been proved to be very effective in sequence modeling, especially in speech separation. This work investigates how to extend dual-path BiLSTM to result in a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a cocktail party problem). TasTas introduces two simple but effective improvements, one is an iterative multi-stage refinement scheme, and the other is to correct the speech with imperfect separation through a loss of speaker identity consistency between the separated speech and original speech, to boost the performance of dual-path BiLSTM based networks. TasTas takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. Our experiments on the notable benchmark WSJ0-2mix data corpus result in 20.55dB SDR improvement, 20.35dB SI-SDR improvement, 3.69 of PESQ, and 94.86\% of ESTOI, which shows that our proposed networks can lead to big performance improvement on the speaker separation task. We have open sourced our re-implementation of the DPRNN-TasNet here (https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation), and our TasTas is realized based on this implementation of DPRNN-TasNet, it is believed that the results in this paper can be reproduced with ease.Comment: To appear in Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.0065
    • …
    corecore