Identifiability of multivariate logistic mixture models
Mixture models have been widely used in the modeling of continuous
observations. Identifiability is a necessary condition for the parameters of a
mixture model to be consistently estimable from observations of the mixture. In
this study, we give some results on the identifiability of multivariate
logistic mixture models.
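For context, the classical notion of identifiability for finite mixtures (a
standard textbook definition in the sense of Teicher, not a statement quoted
from this paper) says that two mixture representations inducing the same
distribution must coincide up to relabeling of components:

```latex
% Standard identifiability condition for finite mixtures (textbook form):
\[
\sum_{i=1}^{m} \pi_i\, f(x;\theta_i) = \sum_{j=1}^{m'} \pi_j'\, f(x;\theta_j')
\quad \text{for all } x
\;\Longrightarrow\;
m = m' \ \text{and}\ \{(\pi_i,\theta_i)\}_{i=1}^{m} = \{(\pi_j',\theta_j')\}_{j=1}^{m'}.
\]
```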
A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification
One of the biggest challenges of acoustic scene classification (ASC) is to
find suitable features that better represent and characterize environmental
sounds. Environmental sounds generally involve more sound sources while
exhibiting less structure in time-frequency representations. The background of
an acoustic scene, however, exhibits temporal homogeneity in its acoustic
properties, suggesting that it could be characterized by distribution
statistics rather than by temporal details. In this work, we investigated using
auditory summary statistics as features for ASC tasks. The inspiration comes
from a recent neuroscience study, which shows that the human auditory system
tends to perceive sound textures through time-averaged statistics. We further
proposed using linear discriminant analysis to eliminate redundancies among
these statistics while keeping the discriminative information, yielding an
extremely compact representation for acoustic scenes. Experimental results show
that the proposed feature outperforms conventional handcrafted features.
Comment: Accepted as a conference paper of Interspeech 201
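As a rough illustration of the pipeline this abstract describes, the sketch
below computes time-averaged distribution statistics of a log-mel spectrogram
and compresses them with linear discriminant analysis. It is a minimal sketch
under my own assumptions (a librosa front end and these particular per-band
moments); the paper's exact auditory model and statistic set may differ.

```python
# Hedged sketch: summary statistics of a log-mel spectrogram, then LDA.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def summary_stat_feature(wav_path, sr=22050, n_mels=64):
    """Time-averaged per-band statistics of one audio clip."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)              # (n_mels, n_frames)
    # Distribution statistics over time, discarding temporal detail.
    stats = [logmel.mean(axis=1), logmel.var(axis=1),
             skew(logmel, axis=1), kurtosis(logmel, axis=1)]
    return np.concatenate(stats)                   # (4 * n_mels,)

# With X_train of shape (n_clips, 4 * n_mels) and scene labels y_train,
# LDA removes redundant dimensions while keeping discriminative ones:
#   lda = LinearDiscriminantAnalysis()
#   X_compact = lda.fit_transform(X_train, y_train)  # (n_clips, n_classes - 1)
```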
Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events
In this paper, we propose a new strategy for acoustic scene classification
(ASC), namely recognizing acoustic scenes through identifying distinct sound
events. This differs from existing strategies, which focus on characterizing
global acoustical distributions of audio or the temporal evolution of
short-term audio features, without analysis down to the level of sound events.
To identify distinct sound events for each scene, we formulate ASC in a
multi-instance learning (MIL) framework, where each audio recording is mapped
into a bag-of-instances representation. Here, instances can be seen as
high-level representations for sound events inside a scene. We also propose an
MIL neural network model that implicitly identifies distinct instances
(i.e., sound events). Furthermore, we propose two specially designed modules
that model the multi-temporal-scale and multi-modal natures of sound events,
respectively. The experiments were conducted on the official development set of
the DCASE2018 Task1 Subtask B, and our best-performing model improves over the
official baseline by 9.4% (68.3% vs 58.9%) in terms of classification accuracy.
This study indicates that recognizing acoustic scenes by identifying distinct
sound events is effective and paves the way for future studies that combine
this strategy with previous ones.
Comment: code URL typo, code is available at
https://github.com/hackerekcah/distinct-events-asc.gi
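A minimal sketch of the bag-of-instances idea described above, assuming
instance-level scoring followed by max pooling so that the most distinctive
instance (i.e., a distinct sound event) drives the bag prediction. The layer
sizes are assumptions, and the paper's actual MIL network, pooling choice, and
multi-temporal-scale/multi-modal modules are not reproduced here.

```python
# Hedged PyTorch sketch of multi-instance learning for ASC.
import torch
import torch.nn as nn

class MILScene(nn.Module):
    def __init__(self, feat_dim=64, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.instance_clf = nn.Linear(128, n_classes)

    def forward(self, bag):                # bag: (n_instances, feat_dim)
        h = self.encoder(bag)              # instance-level embeddings
        scores = self.instance_clf(h)      # (n_instances, n_classes)
        # Max pooling: the bag label follows the single most
        # scene-distinctive instance, i.e., a distinct sound event.
        bag_logits, _ = scores.max(dim=0)  # (n_classes,)
        return bag_logits
```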
Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task and an important step towards
more natural human-computer interaction (HCI). A popular approach is multimodal
emotion recognition based on model-level fusion: the signal from each modality
is encoded into an embedding, and the embeddings are then concatenated for the
final classification. However, due to noise or other factors, the modalities do
not always point to the same emotional category, which hurts a model's
generalization. In this paper, we propose a novel regularization method via
contrastive learning for multimodal emotion recognition using audio and text.
By introducing a discriminator that distinguishes same-emotion pairs from
different-emotion pairs, we explicitly constrain the latent code of each
modality to carry the same emotional information, reducing noise interference
and yielding more discriminative representations. Experiments are performed on
the standard IEMOCAP dataset for 4-class emotion recognition. The results show
significant improvements of 1.44% and 1.53% in weighted accuracy (WA) and
unweighted accuracy (UA), respectively, over the baseline system.
Comment: Completed in October 2020 and submitted to ICASSP202
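The sketch below illustrates the kind of discriminator-based regularizer the
abstract describes: a small network scores whether an (audio, text) embedding
pair carries the same emotion, and its loss is added to the classification
objective. The embedding size, discriminator shape, and loss weighting are my
assumptions, not the authors' implementation.

```python
# Hedged PyTorch sketch of contrastive regularization across modalities.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def contrastive_reg(z_audio, z_text, same_emotion):
    """z_audio, z_text: (batch, 128); same_emotion: (batch,) in {0., 1.}."""
    logits = disc(torch.cat([z_audio, z_text], dim=-1)).squeeze(-1)
    return bce(logits, same_emotion)

# Total objective (lambda_reg is an assumed hyperparameter):
#   loss = ce_loss + lambda_reg * contrastive_reg(z_audio, z_text, pair_label)
```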
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Deep neural networks with dual-path bi-directional long short-term memory
(BiLSTM) blocks have proved very effective in sequence modeling, especially in
speech separation. This work investigates how to extend the dual-path BiLSTM
into a new state-of-the-art approach, called TasTas, for multi-talker monaural
speech separation (a.k.a. the cocktail party problem). TasTas introduces two
simple but effective improvements to boost the performance of dual-path
BiLSTM-based networks: an iterative multi-stage refinement scheme, and a
speaker-identity-consistency loss between the separated speech and the original
speech that corrects imperfectly separated speech. TasTas takes the mixed
utterance of two speakers and maps it to two separated utterances, each
containing only one speaker's voice. Our experiments on the notable WSJ0-2mix
benchmark corpus yield a 20.55 dB SDR improvement, a 20.35 dB SI-SDR
improvement, a PESQ of 3.69, and an ESTOI of 94.86%, showing that the proposed
networks lead to a large performance improvement on the speaker separation
task. We have open-sourced our re-implementation of DPRNN-TasNet at
https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation;
TasTas is built on this implementation, so the results in this paper should be
easy to reproduce.
Comment: To appear in Interspeech 2020. arXiv admin note: substantial text
overlap with arXiv:2001.08998, arXiv:1902.04891, arXiv:1902.0065
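For reference, the SI-SDR figure reported above follows a standard closed form
(Le Roux et al., 2019). The sketch below computes that standard metric; it is
not code from the TasTas authors.

```python
# Standard scale-invariant SDR (SI-SDR) in dB, for zero-mean 1-D signals.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))
```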