640 research outputs found
Exploring the time-domain deep attractor network with two-stream architectures in a reverberant environment
Despite the success of deep learning in speech signal processing,
speaker-independent speech separation in reverberant environments remains
challenging. The deep attractor network (DAN) performs speech separation with
speaker attractors in the time-frequency domain. The recently proposed
convolutional time-domain audio separation network (Conv-TasNet) surpasses
ideal masks on anechoic mixture signals, but its architecture leaves open the
problem of separating signals with arbitrary numbers of speakers. Moreover,
both models suffer performance degradation in reverberant environments.
In this study, we propose a time-domain deep attractor network (TD-DAN) with
two-stream convolutional networks that efficiently performs both
dereverberation and separation under varying numbers of speakers. The speaker
encoding stream (SES) of the TD-DAN models speaker information and is explored
with various waveform encoders. The speech decoding stream (SDS) accepts
speaker attractors from the SES and learns to predict early reflections.
Experimental results demonstrated that the TD-DAN achieved scale-invariant
source-to-distortion ratio (SI-SDR) gains of 10.40/9.78 dB and 9.15/7.92 dB on
the reverberant two- and three-speaker development/evaluation sets, exceeding
Conv-TasNet by 1.55/1.33 dB and 0.94/1.21 dB, respectively.
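As a point of reference, the SI-SDR metric reported above can be computed directly from time-domain signals. A minimal numpy sketch, assuming zero-mean handling and equal-length estimate/reference waveforms:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant source-to-distortion ratio (SI-SDR) in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference onto the estimate makes the
    # metric invariant to gain errors in the separated signal.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled reference ("target" part)
    noise = estimate - target           # everything else is distortion
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

The SI-SDR gain quoted in the abstract is this value for the separated signal minus the same value computed on the unprocessed mixture.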
Cracking the cocktail party problem by multi-beam deep attractor network
While recent progress in neural network approaches to single-channel speech
separation, and more generally the cocktail party problem, has brought
significant improvements, performance on complex mixtures is still not
satisfactory.
In this work, we propose a novel multi-channel framework for multi-talker
separation. In the proposed model, an input multi-channel mixture signal is
firstly converted to a set of beamformed signals using fixed beam patterns. For
this beamforming, we propose to use differential beamformers as they are more
suitable for speech separation. Then each beamformed signal is fed into a
single-channel anchored deep attractor network to generate separated signals.
The final separation is obtained by post-selecting the separated outputs
across beams. To evaluate the proposed system, we create a challenging
dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the
proposed system largely improves the state of the art in speech separation,
achieving average signal-to-distortion ratio improvements of 11.5 dB, 11.76 dB
and 11.02 dB for 4-, 3- and 2-speaker overlapped mixtures, which is comparable
to the performance of a minimum variance distortionless response beamformer
that uses oracle location, source, and noise information. We also run speech
recognition with a clean-trained acoustic model on the separated speech,
achieving relative word error rate (WER) reductions of 45.76%, 59.40% and
62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a
far-talk acoustic model, the WER is further reduced.
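The post-selection step is described only at a high level here; the sketch below illustrates one plausible greedy criterion (energy ranking plus a correlation-based duplicate check, both assumptions on our part) for picking one output per source from the per-beam separation results:

```python
import numpy as np

def post_select(separated: np.ndarray, num_sources: int) -> np.ndarray:
    """Toy post-selection over per-beam separator outputs.

    separated: (num_beams, num_outputs, num_samples), the result of running
    a single-channel separator on each fixed differential beam.
    """
    beams, outs, samples = separated.shape
    candidates = separated.reshape(beams * outs, samples)
    order = np.argsort(-np.sum(candidates ** 2, axis=1))  # loudest first
    chosen = []
    for idx in order:
        sig = candidates[idx]
        # Reject candidates that mostly duplicate an already-chosen source.
        if all(abs(np.corrcoef(sig, c)[0, 1]) < 0.5 for c in chosen):
            chosen.append(sig)
        if len(chosen) == num_sources:
            break
    return np.stack(chosen)
```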
Single Channel auditory source separation with neural network
Although distinguishing different sounds in a noisy environment is a relatively easy task for humans, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound types, the abundance of mixing conditions, and the unclear mechanism for distinguishing sources, especially similar sounds.
In recent years, neural network based methods have achieved impressive successes in various problems, including speech enhancement, where the task is to separate clean speech from a noisy mixture. However, current deep learning based source separators do not perform well on real recorded noisy speech and, more importantly, are not applicable in more general source separation scenarios such as overlapped speech.
In this thesis, we first propose extensions to the current mask learning network for the problem of speech enhancement, to fix the scale mismatch problem that usually occurs in real recorded audio. We solve this problem by combining two additional restoration layers with the existing mask learning network. We also propose a residual learning architecture for speech enhancement, further improving network generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields a 25.13% relative WER reduction on real data and 34.03% on simulated data.
Then we propose a novel neural network based model called "deep clustering" for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixture. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single-channel mixtures of multiple speakers show that a speaker-independent model trained on two- and three-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of over 10 dB.
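The deep clustering objective described here is the Frobenius-norm distance between the embedding affinity matrix and the ideal one; a compact numpy sketch of the standard low-rank expansion (V for embeddings, Y for one-hot assignments, both assumed to have one row per time-frequency bin):

```python
import numpy as np

def deep_clustering_loss(V: np.ndarray, Y: np.ndarray) -> float:
    """||V V^T - Y Y^T||_F^2 for V: (TF, D) embeddings, Y: (TF, C) labels.

    Expanding the norm avoids ever forming the (TF x TF) affinity
    matrices, which is what keeps training tractable.
    """
    return float(np.sum((V.T @ V) ** 2)
                 - 2.0 * np.sum((V.T @ Y) ** 2)
                 + np.sum((Y.T @ Y) ** 2))
```

At test time, as described above, K-means on the rows of V recovers the per-bin source assignments.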
We then propose an extension of deep clustering named the "deep attractor" network, which allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are first created from the acoustic signals by finding the centroids of the sources in the embedding space; these attractors pull together the time-frequency bins corresponding to each source and are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We show that this framework achieves even better results.
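A training-time sketch of the attractor mechanism just described, under the usual assumptions (embeddings V, ideal binary assignments Y, and a flattened mixture magnitude spectrogram):

```python
import numpy as np

def attractor_masks(V: np.ndarray, Y: np.ndarray, mix_mag: np.ndarray):
    """V: (TF, D) embeddings, Y: (TF, C) assignments, mix_mag: (TF,)."""
    # Attractors: assignment-weighted centroids of the embeddings, (C, D).
    attractors = (Y.T @ V) / (Y.sum(axis=0)[:, None] + 1e-8)
    logits = V @ attractors.T                    # bin-to-attractor similarity
    masks = np.exp(logits)
    masks /= masks.sum(axis=1, keepdims=True)    # softmax over sources
    return masks * mix_mag[:, None]              # masked source magnitudes
```

Reconstruction error between these masked magnitudes and the clean sources is what drives the embeddings during end-to-end training.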
Lastly, we introduce two applications of the proposed models: singing voice separation and a smart hearing aid device. For the former, a multi-task architecture is proposed that combines deep clustering with a classification-based network; it achieves a new state-of-the-art separation result, improving the signal-to-noise ratio by 11.1 dB on music and 7.9 dB on singing voice. For the smart hearing aid device, we combine neural decoding with the separation network. The system first decodes the user's attention, which is then used to steer the separator toward the target source. Both objective and subjective studies show that the proposed system can accurately decode attention and significantly improve the user experience.
Monaural Audio Speaker Separation with Source Contrastive Estimation
We propose an algorithm to separate simultaneously speaking persons from each
other, the "cocktail party problem", using a single microphone. Our approach
involves a deep recurrent neural network regressing to a vector space that is
descriptive of independent speakers. Such a vector space can embed empirically
determined speaker characteristics and is optimized by distinguishing between
speaker masks. We call this technique source-contrastive estimation. The
methodology is inspired by negative sampling, which has seen success in natural
language processing, where an embedding is learned by correlating and
de-correlating a given input vector with output weights. Although the matrix
determined by the output weights is dependent on a set of known speakers, we
only use the input vectors during inference. Doing so will ensure that source
separation is explicitly speaker-independent. Our approach is similar to recent
deep neural network clustering and permutation-invariant training research; we
use weighted spectral features and masks to augment individual speaker
frequencies while filtering out other speakers. We avoid, however, the severe
computational burden of other approaches with our technique. Furthermore, by
training a vector space rather than combinations of different speakers or
differences thereof, we avoid the so-called permutation problem during
training. Our algorithm offers an intuitive, computationally efficient response
to the cocktail party problem, and most importantly boasts better empirical
performance than other current techniques.
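The source-contrastive loss is described by analogy to negative sampling; here is a sketch of that analogy for a single time-frequency bin (the exact form used in the paper may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sce_loss(v, w_pos, w_negs):
    """v: (D,) bin embedding; w_pos: (D,) output weights of the speaker
    dominating this bin; w_negs: (K, D) weights of the other speakers."""
    pos = -np.log(sigmoid(v @ w_pos))              # correlate with target
    neg = -np.sum(np.log(sigmoid(-(w_negs @ v))))  # de-correlate with rest
    return pos + neg
```

Consistent with the abstract, the output weight matrix is only needed for training; at inference the embeddings v alone are clustered, keeping separation speaker-independent.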
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition
(ASR), significant performance degradation is still observed when recognizing
multi-talker mixed speech. In this paper, we propose and evaluate several
architectures to address this problem under the assumption that only a single
channel of mixed signal is available. Our technique extends permutation
invariant training (PIT) by introducing the front-end feature separation module
with the minimum mean square error (MSE) criterion and the back-end recognition
module with the minimum cross entropy (CE) criterion. More specifically, during
training we compute the average MSE or CE over the whole utterance for each
possible utterance-level output-target assignment, pick the one with the
minimum MSE or CE, and optimize for that assignment. This strategy elegantly
solves the label permutation problem observed in the deep learning based
multi-talker mixed speech separation and recognition systems. The proposed
architectures are evaluated and compared on an artificially mixed AMI dataset
with both two- and three-talker mixed speech. The experimental results indicate
that our proposed architectures can cut the word error rate (WER) by 45.0% and
25.0% relative to the state-of-the-art single-talker speech recognition
system across all speakers when their energies are comparable, for two- and
three-talker mixed speech, respectively. To our knowledge, this is the first
work on multi-talker mixed speech recognition for the challenging
speaker-independent spontaneous large vocabulary continuous speech task.
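The utterance-level assignment search at the heart of PIT is easy to state concretely; a minimal numpy sketch with an MSE criterion (the CE case works identically, with the loss swapped):

```python
import itertools
import numpy as np

def pit_mse(estimates: np.ndarray, targets: np.ndarray):
    """estimates, targets: (S, T, F). Averages the MSE over the whole
    utterance for every output-target permutation and keeps the minimum,
    which is the assignment that then gets optimized."""
    S = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The S! search is cheap for the two- and three-talker cases studied here.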
Recognizing Multi-talker Speech with Permutation Invariant Training
In this paper, we propose a novel technique for direct recognition of
multiple speech streams given a single channel of mixed speech, without first
separating them. Our technique is based on permutation invariant training (PIT)
for automatic speech recognition (ASR). In PIT-ASR, we compute the average
cross entropy (CE) over all frames in the whole utterance for each possible
output-target assignment, pick the one with the minimum CE, and optimize for
that assignment. PIT-ASR forces all the frames of the same speaker to be
aligned with the same output layer. This strategy elegantly solves the label
permutation problem and speaker tracing problem in one shot. Our experiments on
artificially mixed AMI data showed that the proposed approach is very
promising.
Auditory Separation of a Conversation from Background via Attentional Gating
We present a model for separating a set of voices out of a sound mixture
containing an unknown number of sources. Our Attentional Gating Network (AGN)
uses a variable attentional context to specify which speakers in the mixture
are of interest. The attentional context is specified by an embedding vector
which modifies the processing of a neural network through an additive bias.
Individual speaker embeddings are learned to separate a single speaker while
superpositions of the individual speaker embeddings are used to separate sets
of speakers. We first evaluate AGN on a traditional single-speaker separation
task and show an improvement of 9% with respect to comparable models. Then, we
introduce a new task to separate an arbitrary subset of voices from a mixture
of an unknown-sized set of voices, inspired by the human ability to separate a
conversation of interest from background chatter at a cafeteria. We show that
AGN is the only model capable of solving this task, performing only 7% worse
than on the single-speaker separation task.
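The additive-bias conditioning that AGN uses can be sketched in a few lines; the names and shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attended_layer(x, W, b, E, context):
    """One layer whose processing is modified by an attentional context.

    x: (N,) input; W: (M, N) and b: (M,) layer parameters;
    E: (M, C) projection of the embedding space; context: (C,) a single
    learned speaker embedding, or a superposition of several embeddings
    to attend to a set of speakers at once.
    """
    return np.maximum(0.0, W @ x + b + E @ context)  # ReLU, biased by context
```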
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress in deep learning based acoustic
models, along with the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Improving Source Separation via Multi-Speaker Representations
Lately there have been novel developments in deep learning towards solving
the cocktail party problem. Initial results are very promising and allow for
more research in the domain. One technique that has not yet been explored in
the neural network approach to this task is speaker adaptation. Intuitively,
information on the speakers that we are trying to separate seems fundamentally
important for the speaker separation task. However, retrieving this speaker
information is challenging since the speaker identities are not known a priori
and multiple speakers are simultaneously active. There is thus a
chicken-and-egg problem. To tackle this, source signals and i-vectors are
estimated alternately. We show that blind multi-speaker adaptation improves the
results of the network and that (in our case) the network is not capable of
adequately retrieving this useful speaker information itself.
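The alternating scheme can be sketched as a simple loop; `separate` and `extract_ivector` are placeholder names standing in for the separation network and the i-vector extractor:

```python
def alternate_separation(mixture, separate, extract_ivector, num_iters=3):
    """Alternately estimate source signals and speaker i-vectors."""
    # First pass: separate blindly, with no speaker information.
    sources = separate(mixture, ivectors=None)
    for _ in range(num_iters):
        # Re-estimate speaker representations from the current separation,
        # then separate again with the network adapted to those speakers.
        ivectors = [extract_ivector(s) for s in sources]
        sources = separate(mixture, ivectors=ivectors)
    return sources
```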
Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss
The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It
attempts to overcome the problem of an unknown number of speakers in an audio
recording during source separation. The mask approximation loss of SBF is
sub-optimal, as it neither computes the direct signal reconstruction error nor
considers the speech context. To address these problems, this paper proposes a
magnitude and temporal spectrum approximation loss to estimate a
phase-sensitive mask for the target speaker with that speaker's
characteristics.
Moreover, this paper explores a concatenation framework instead of the context
adaptive deep neural network in the SBF method to encode a speaker embedding
into the mask estimation network. Experimental results under the open evaluation
condition show that the proposed method achieves 70.4% and 17.7% relative
improvement over the SBF baseline on signal-to-distortion ratio (SDR) and
perceptual evaluation of speech quality (PESQ), respectively. A further
analysis demonstrates 69.1% and 72.3% relative SDR improvements obtained by the
proposed method for different- and same-gender mixtures.
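A sketch of what a magnitude-plus-temporal phase-sensitive approximation loss can look like; the delta-style temporal term is our assumption about how the speech context enters the loss, not necessarily the paper's exact formulation:

```python
import numpy as np

def mag_temporal_psa_loss(mask, mix_mag, src_mag, cos_phase_diff):
    """All arguments are (T, F): estimated mask, mixture magnitude, target
    magnitude, and cosine of the mixture-target phase difference."""
    est = mask * mix_mag
    psa_target = src_mag * cos_phase_diff   # phase-sensitive target
    static = np.mean((est - psa_target) ** 2)
    # Temporal term: also match first-order differences along time, so the
    # loss sees speech context rather than isolated frames.
    delta = np.mean((np.diff(est, axis=0) - np.diff(psa_target, axis=0)) ** 2)
    return static + delta
```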
- ā¦