Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue our research on the i-vector extractor for speaker
verification (SV) and optimize its architecture for fast and effective
discriminative training. We were motivated by the computational and memory
requirements caused by the large number of parameters of the original
generative i-vector model. Our aim is to preserve the power of the original
generative model while focusing the model towards the extraction of
speaker-related information. We show that it is possible to represent a
standard generative i-vector extractor by a model with significantly fewer
parameters and obtain similar performance on SV tasks. We can further refine
this compact model by discriminative training and obtain i-vectors that lead to
better performance on various SV benchmarks representing different acoustic
domains.
Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.1318
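As a rough illustration of the underlying generative model, the sketch below computes the MAP i-vector estimate from Baum-Welch statistics with the per-component loading matrices factorized into a low-rank product. The shapes, the identity covariances, and the particular factorization T_c = A_c B are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

# Minimal sketch of i-vector extraction from Baum-Welch statistics, with the
# per-component loading matrices T_c replaced by a factorized form
# T_c = A_c @ B (shared inner factor B). All shapes and the factorization
# choice are illustrative assumptions, not the paper's exact model.

C, F, R, K = 8, 20, 10, 5        # GMM components, feature dim, i-vector dim, inner dim

rng = np.random.default_rng(0)
A = rng.standard_normal((C, F, K)) * 0.1   # per-component factor
B = rng.standard_normal((K, R)) * 0.1      # shared factor
T = A @ B                                   # (C, F, R) effective loading matrices

def extract_ivector(N, f):
    """MAP point estimate of the i-vector given zero-order stats N (C,)
    and centered first-order stats f (C, F); Sigma_c assumed identity."""
    precision = np.eye(R)
    linear = np.zeros(R)
    for c in range(C):
        precision += N[c] * T[c].T @ T[c]
        linear += T[c].T @ f[c]
    return np.linalg.solve(precision, linear)

N = rng.uniform(1.0, 5.0, size=C)
f = rng.standard_normal((C, F))
print(extract_ivector(N, f))    # (R,) i-vector
```

Because only A and B are stored instead of the full set of T_c matrices, the parameter count drops from C·F·R to C·F·K + K·R, which is the kind of saving that makes discriminative training of the extractor tractable.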
Deep factorization for speech signal
Various informative factors are mixed in speech signals, which makes decoding
any single factor difficult. An intuitive idea is to factorize each speech
frame into individual informative factors, though this turns out to be highly
difficult. Recently, we found that speaker traits, which were assumed to be
long-term distributional properties, are in fact short-time patterns and can
be learned by a carefully designed deep neural network (DNN). This discovery
motivated the cascade deep factorization (CDF) framework presented in this
paper. The proposed framework infers speech factors sequentially, using
factors inferred earlier as conditional variables when inferring the remaining
factors. We show that this approach can effectively factorize speech signals
and that, using these factors, the original speech spectrum can be recovered
with high accuracy. This factorization-and-reconstruction approach offers
potential value for many speech processing tasks, e.g., speaker recognition
and emotion recognition, as demonstrated in the paper.
Comment: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap with arXiv:1706.0177
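To make the cascade concrete, here is a minimal PyTorch sketch in which a first network infers a speaker factor from each frame, a second network infers an emotion factor conditioned on the frame plus the speaker factor, and a decoder reconstructs the spectrum from both factors. The layer sizes, the two-factor choice, and the reconstruction loss are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the cascade idea: infer one factor per stage, feeding
# previously inferred factors in as conditioning inputs, then reconstruct
# the spectrum from all factors. Dimensions and factor choices are assumed.

FRAME_DIM, SPK_DIM, EMO_DIM = 257, 64, 32

speaker_net = nn.Sequential(nn.Linear(FRAME_DIM, 256), nn.ReLU(),
                            nn.Linear(256, SPK_DIM))
emotion_net = nn.Sequential(nn.Linear(FRAME_DIM + SPK_DIM, 256), nn.ReLU(),
                            nn.Linear(256, EMO_DIM))
decoder = nn.Sequential(nn.Linear(SPK_DIM + EMO_DIM, 256), nn.ReLU(),
                        nn.Linear(256, FRAME_DIM))

frames = torch.randn(16, FRAME_DIM)                  # batch of spectral frames
spk = speaker_net(frames)                            # stage 1: speaker factor
emo = emotion_net(torch.cat([frames, spk], dim=-1))  # stage 2: conditioned on spk
recon = decoder(torch.cat([spk, emo], dim=-1))       # reconstruct the spectrum
loss = nn.functional.mse_loss(recon, frames)
```

The key design point is the conditioning: each later stage sees the factors already extracted, so it only has to explain the residual variability rather than re-discovering, say, speaker identity on its own.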
Modelling of Sound Events with Hidden Imbalances Based on Clustering and Separate Sub-Dictionary Learning
This paper proposes an effective modelling of sound event spectra with a
hidden data-size imbalance, for improved Acoustic Event Detection (AED). The
proposed method models each event as an aggregated representation of a few
latent factors, whereas conventional approaches try to find acoustic elements
directly from the event spectra. In the proposed method, all latent factors
across all events are assigned comparable importance and complexity to
overcome the hidden imbalance of data sizes in event spectra. To extract the
latent factors of each event, the method employs clustering, applies
non-negative matrix factorization to each latent factor, and learns its
acoustic elements as a sub-dictionary. Separate sub-dictionary learning
effectively models acoustic elements with limited data sizes and avoids
over-fitting due to hidden imbalances in the training data. On the polyphonic
sound event detection task from the DCASE 2013 challenge, an AED system based
on the proposed modelling achieves a detection F-measure of 46.5%, a
significant improvement of more than 19% over existing state-of-the-art
methods.
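A minimal sketch of the dictionary-learning step, using scikit-learn's KMeans and NMF: the spectral frames of one event are clustered into a few latent factors, and a separate NMF sub-dictionary with a fixed, comparable number of bases is learned per cluster. The cluster and basis counts, and the function name, are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

# Sketch of the proposed modelling step: cluster an event's spectra into a
# few latent factors, then learn one NMF sub-dictionary per factor so every
# factor gets comparable complexity regardless of its data size. The counts
# below are illustrative assumptions.

def learn_event_dictionary(spectra, n_factors=3, bases_per_factor=4):
    """spectra: (n_frames, n_bins) non-negative magnitude spectra of one event."""
    labels = KMeans(n_clusters=n_factors, n_init=10,
                    random_state=0).fit_predict(spectra)
    sub_dicts = []
    for k in range(n_factors):
        frames_k = spectra[labels == k]
        nmf = NMF(n_components=bases_per_factor, init="nndsvda",
                  max_iter=500, random_state=0)
        nmf.fit(frames_k)                  # frames_k ~ W @ H
        sub_dicts.append(nmf.components_)  # (bases_per_factor, n_bins) bases
    return np.vstack(sub_dicts)            # event dictionary = stacked sub-dicts

spectra = np.abs(np.random.default_rng(0).standard_normal((200, 64)))
D = learn_event_dictionary(spectra)
print(D.shape)  # (12, 64): n_factors * bases_per_factor bases
```

Giving every cluster the same number of bases is what equalizes the hidden imbalance: a rare latent factor is not crowded out of the dictionary by the dominant ones.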