Deep factorization for speech signal
Various informative factors are mixed in speech signals, making it difficult
to decode any single one of them. An intuitive idea is to factorize each
speech frame into individual informative factors, though this turns out to
be highly difficult. Recently, we found that speaker traits, which were assumed
to be long-term distributional properties, are actually short-time patterns,
and can be learned by a carefully designed deep neural network (DNN). This
discovery motivated a cascade deep factorization (CDF) framework that will be
presented in this paper. The proposed framework infers speech factors in a
sequential way, where factors previously inferred are used as conditional
variables when inferring other factors. We will show that this approach can
effectively factorize speech signals, and using these factors, the original
speech spectrum can be recovered with a high accuracy. This factorization and
reconstruction approach provides potential values for many speech processing
tasks, e.g., speaker recognition and emotion recognition, as will be
demonstrated in the paper.
Comment: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap
with arXiv:1706.0177
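The cascade idea above can be illustrated with a toy sketch: each later factor extractor sees the previously inferred factors as conditional inputs, and the spectrum is reconstructed from the stacked factors. The linear maps `W_spk`, `W_emo`, and `W_rec` below are hypothetical stand-ins for the paper's DNNs, and all dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 40  # toy spectral-frame dimension (assumption)

# Hypothetical linear "factor extractors" standing in for the trained DNNs.
W_spk = rng.normal(size=(8, dim))       # speaker factor from the raw frame
W_emo = rng.normal(size=(8, dim + 8))   # emotion factor, conditioned on the speaker factor
W_rec = rng.normal(size=(dim, 16))      # spectrum reconstruction from both factors

def cdf_infer(frame):
    """Cascade inference: each later factor is conditioned on earlier ones."""
    spk = W_spk @ frame
    emo = W_emo @ np.concatenate([frame, spk])  # speaker factor as a condition
    return spk, emo

frame = rng.normal(size=dim)
spk, emo = cdf_infer(frame)
recon = W_rec @ np.concatenate([spk, emo])  # recover the frame from its factors
print(spk.shape, emo.shape, recon.shape)
```

With trained networks in place of the random matrices, the same wiring yields the factorize-then-reconstruct pipeline the abstract describes.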
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper we propose the utterance-level Permutation Invariant Training
(uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning
based solution for speaker independent multi-talker speech separation.
Specifically, uPIT extends the recently proposed Permutation Invariant Training
(PIT) technique with an utterance-level cost function, hence eliminating the
need for solving an additional permutation problem during inference, which is
otherwise required by frame-level PIT. We achieve this using Recurrent Neural
Networks (RNNs) that, during training, minimize the utterance-level separation
error, hence forcing separated frames belonging to the same speaker to be
aligned to the same output stream. In practice, this allows RNNs, trained with
uPIT, to separate multi-talker mixed speech without any prior knowledge of
signal duration, number of speakers, speaker identity or gender. We evaluated
uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks
and found that uPIT outperforms techniques based on Non-negative Matrix
Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and
compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network
(DANet). Furthermore, we found that models trained with uPIT generalize well to
unseen speakers and languages. Finally, we found that a single model, trained
with uPIT, can handle both two-speaker and three-speaker speech mixtures
- …
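The core of uPIT is its cost function: the separation error is computed for every assignment of output streams to reference speakers, but one permutation is chosen for the whole utterance rather than per frame. A minimal NumPy sketch of that loss (MSE as the frame-level error, an assumption for illustration):

```python
import itertools
import numpy as np

def upit_loss(estimates, references):
    """Utterance-level PIT loss: minimum mean-squared error over all
    assignments of output streams to reference speakers, with a single
    permutation fixed across the entire utterance."""
    n = len(estimates)
    return min(
        np.mean([np.mean((estimates[p[i]] - references[i]) ** 2)
                 for i in range(n)])
        for p in itertools.permutations(range(n))
    )

# Two toy "utterances" of 100 frames x 64 features each.
rng = np.random.default_rng(0)
s1 = rng.normal(size=(100, 64))
s2 = rng.normal(size=(100, 64))

# Output streams are the sources in swapped order; the min over
# permutations recovers the correct speaker-to-stream alignment.
loss = upit_loss([s2, s1], [s1, s2])
print(loss)
```

Because the permutation is resolved once per utterance during training, inference needs no extra permutation step, which is the advantage over frame-level PIT that the abstract highlights.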