Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach that applies a single neural network to this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation modifies standard formulations to also penalize competing outputs of the system. Experiments are conducted on the artificially overlapped Switchboard and hub5e-swb datasets. The proposed framework achieves over 30% relative improvement in WER over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation followed by a clean-speech ASR model. The improvement comes from better model generalization, training efficiency, and the integration of sequence-level linguistic knowledge.
Comment: submitted to TASLP, 07/20/2017; accepted by TASLP, 10/13/201
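As a rough illustration of the permutation-invariant idea underlying PIT (not the paper's actual implementation), the sketch below computes a per-utterance loss as the minimum over all output-to-target assignments; the array shapes and the mean-squared-error criterion are assumptions made for the example.

```python
# Minimal PIT-style loss sketch (assumed shapes and MSE criterion, not the paper's code).
import itertools
import numpy as np

def pit_loss(outputs, targets):
    """outputs, targets: arrays of shape (num_speakers, num_frames, feat_dim).
    Returns the loss for the best output-to-target permutation and that permutation."""
    num_speakers = outputs.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # Pair output stream i with target stream perm[i] and average the frame-wise MSE.
        loss = np.mean([np.mean((outputs[i] - targets[p]) ** 2)
                        for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy usage: two speakers, 100 frames, 40-dimensional features.
rng = np.random.default_rng(0)
est = rng.standard_normal((2, 100, 40))
ref = rng.standard_normal((2, 100, 40))
print(pit_loss(est, ref))
```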
End-to-End Monaural Multi-speaker ASR System without Pretraining
Recently, end-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR). Multi-speaker speech separation and recognition is a central task in the cocktail party problem. In this paper, we present a state-of-the-art monaural multi-speaker end-to-end automatic speech recognition model. In contrast to previous studies on monaural multi-speaker speech recognition, this end-to-end framework is trained to recognize multiple label sequences completely from scratch. The system requires only the speech mixture and the corresponding label sequences, without any intermediate supervision obtained from non-mixture speech or its labels/alignments. Moreover, we exploit an individual attention module for each separated speaker and scheduled sampling to further improve performance. Finally, we evaluate the proposed model on 2-speaker mixed speech generated from the WSJ corpus and on the wsj0-2mix dataset, a speech separation and recognition benchmark. The experiments demonstrate that the proposed methods improve the end-to-end model's ability to separate the overlapping speech and recognize the separated streams. The proposed model yields roughly 10.0% relative gains in both CER and WER.
Comment: submitted to ICASSP 2019
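Scheduled sampling, mentioned above, replaces the ground-truth history token with the model's own prediction with some probability during training. A minimal sketch of that sampling step follows; the decoder-step interface and the sampling probability are assumptions, not the paper's code.

```python
# Scheduled sampling for a token-by-token decoder (illustrative only; the
# decoder_step interface and the sampling probability are assumptions).
import random

def decode_with_scheduled_sampling(decoder_step, state, target_tokens, sampling_prob):
    """Run teacher forcing, but with probability `sampling_prob` feed the model's
    own previous prediction instead of the ground-truth token."""
    predictions = []
    prev_token = "<sos>"
    for gold in target_tokens:
        pred, state = decoder_step(prev_token, state)   # returns (predicted token, new state)
        predictions.append(pred)
        # Choose the next input: model output (sampled) or ground truth (teacher forcing).
        prev_token = pred if random.random() < sampling_prob else gold
    return predictions
```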
DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team
In this paper, we present the DKU-LENOVO team's submission to the second DIHARD Speech Diarization Challenge. Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation, and overlap detection. For each module, we explore different techniques to enhance performance. Our final submission employs ResNet-LSTM based VAD, Deep ResNet based speaker embeddings, LSTM based similarity scoring, and spectral clustering. Variational Bayes (VB) diarization is applied in the resegmentation stage, and overlap detection brings a further slight improvement. Our proposed system achieves 18.84% DER on Track 1 and 27.90% DER on Track 2. Although our systems reduce the DERs by 27.5% and 31.7% relative to the official baselines, we believe that the diarization task remains very difficult.
Comment: Submitted to Odyssey 2020
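For the clustering stage described above, a common recipe (sketched here with assumed inputs; not necessarily the team's exact configuration) is to build a pairwise similarity matrix over segment-level speaker embeddings and feed it to spectral clustering:

```python
# Spectral clustering of segment-level speaker embeddings (illustrative sketch).
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings, num_speakers):
    """embeddings: (num_segments, dim) speaker embeddings.
    Returns an integer speaker label per segment."""
    # Cosine similarity matrix, shifted to [0, 1] so it is a valid affinity.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = (norm @ norm.T + 1.0) / 2.0
    labels = SpectralClustering(n_clusters=num_speakers,
                                affinity="precomputed",
                                assign_labels="kmeans",
                                random_state=0).fit_predict(affinity)
    return labels
```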
Linguistic Search Optimization for Deep Learning Based LVCSR
Recent advances in deep learning based large vocabulary continuous speech recognition (LVCSR) have created growing demand for large-scale speech transcription. The inference process of a speech recognizer is to find a sequence of labels whose corresponding acoustic and language models best match the input features [1]. The main computation includes two stages: acoustic model (AM) inference and linguistic search (weighted finite-state transducer, WFST). The large computational overhead of both stages hampers the wide application of LVCSR. Benefiting from stronger classifiers, deep learning, and more powerful computing devices, we propose general ideas and some initial trials to address these fundamental problems.
Comment: accepted by Doctoral Consortium, INTERSPEECH 201
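To make the two computation stages concrete, the toy sketch below scores hypotheses as a weighted sum of per-frame acoustic log-probabilities and a language-model log-probability, pruned with a simple beam; the uniform LM placeholder and beam settings are assumptions, not the paper's system.

```python
# Toy frame-synchronous beam search combining AM and LM scores (illustrative).
import numpy as np

def beam_search(am_logprobs, lm_logprob, lm_weight=0.5, beam=4):
    """am_logprobs: (num_frames, num_labels) per-frame acoustic log-probabilities.
    lm_logprob(prev_label, label) -> language-model log-probability (assumed interface).
    Returns the best label sequence (one label per frame)."""
    num_frames, num_labels = am_logprobs.shape
    hyps = [((), 0.0)]                                   # (label sequence, total score)
    for t in range(num_frames):
        expanded = []
        for seq, score in hyps:
            prev = seq[-1] if seq else None
            for lab in range(num_labels):
                s = score + am_logprobs[t, lab] + lm_weight * lm_logprob(prev, lab)
                expanded.append((seq + (lab,), s))
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]  # prune to beam
    return hyps[0][0]

# Usage with a uniform "language model" placeholder.
am = np.log(np.random.default_rng(0).dirichlet(np.ones(5), size=20))
print(beam_search(am, lambda prev, lab: 0.0))
```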
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
Comment: This is an updated version of the paper, with the latest literature up to ICASSP 2018: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 2017
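As a small companion to the CTC discussion above, the snippet below shows greedy (best-path) CTC decoding: take the argmax label per frame, collapse repeats, and remove blanks. The input shapes and the blank index are assumptions made for the example.

```python
# Greedy (best-path) CTC decoding: argmax per frame, collapse repeats, drop blanks.
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """logits: (num_frames, num_labels) frame-level scores or log-probabilities."""
    best_path = np.argmax(logits, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:   # collapse repeats, skip blanks
            decoded.append(int(label))
        prev = label
    return decoded

# Example: frames predicting blank, 'a' (1), 'a', blank, 'b' (2) -> [1, 2]
frames = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1],
                   [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(frames))
```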
Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
This work introduces sequential neural beamforming, which alternates between
neural network based spectral separation and beamforming based spatial
separation. Our neural networks for separation use an advanced convolutional
architecture trained with a novel stabilized signal-to-noise ratio loss
function. For beamforming, we explore multiple ways of computing time-varying
covariance matrices, including factorizing the spatial covariance into a
time-varying amplitude component and a time-invariant spatial component, as
well as using block-based techniques. In addition, we introduce a multi-frame
beamforming method which improves the results significantly by adding
contextual frames to the beamforming formulations. We extensively evaluate and
analyze the effects of window size, block size, and multi-frame context size
for these methods. Our best method utilizes a sequence of three neural
separation and multi-frame time-invariant spatial beamforming stages, and
demonstrates an average improvement of 2.75 dB in scale-invariant
signal-to-noise ratio and 14.2% absolute reduction in a comparative speech
recognition metric across four challenging reverberant speech enhancement and
separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlapping tracks, and achieve a better word error rate compared to a baseline mask-based beamformer.
Comment: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org)
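To illustrate the block-based time-varying covariance estimation mentioned above, the following generic MVDR-style sketch (with assumed mask-based inputs; not the paper's exact formulation) estimates per-block speech and noise covariances for one frequency bin and derives beamforming weights per block:

```python
# Block-wise covariance estimation and MVDR-style weights for one frequency bin
# (generic sketch; mask-based covariance estimation is an assumption here).
import numpy as np

def block_mvdr_weights(stft_bin, speech_mask, noise_mask, block_size=50):
    """stft_bin: (num_frames, num_mics) complex STFT for a single frequency bin.
    speech_mask, noise_mask: (num_frames,) values in [0, 1].
    Returns (num_blocks, num_mics) beamforming weights, one set per block."""
    num_frames, num_mics = stft_bin.shape
    weights = []
    for start in range(0, num_frames, block_size):
        x = stft_bin[start:start + block_size]                       # (T_b, M)
        ms = speech_mask[start:start + block_size, None]
        mn = noise_mask[start:start + block_size, None]
        phi_s = (ms * x).T @ x.conj() / max(ms.sum(), 1e-6)          # speech covariance
        phi_n = (mn * x).T @ x.conj() / max(mn.sum(), 1e-6)          # noise covariance
        phi_n += 1e-6 * np.eye(num_mics)                             # diagonal loading
        # Steering vector: principal eigenvector of the speech covariance.
        d = np.linalg.eigh(phi_s)[1][:, -1]
        w = np.linalg.solve(phi_n, d)
        w /= (d.conj() @ w)                                          # MVDR normalization
        weights.append(w)
    return np.stack(weights)
```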
Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers overlap.
While speech overlaps have been regarded as a major obstacle in accurately
transcribing meetings, a traditional beamformer with a single output has been
exclusively used because previously proposed speech separation techniques have
critical constraints for application to real meetings. This paper proposes a
new signal processing module, called an unmixing transducer, and describes its
implementation using a windowed BLSTM. The unmixing transducer has a fixed
number, say J, of output channels, where J may be different from the number of
meeting attendees, and transforms an input multi-channel acoustic signal into J
time-synchronous audio streams. Each utterance in the meeting is separated and
emitted from one of the output channels. Then, each output signal can be simply
fed to a speech recognition back-end for segmentation and transcription. Our
meeting transcription system using the unmixing transducer outperforms a system
based on a state-of-the-art neural mask-based beamformer by 10.8%. Significant
improvements are observed in overlapped segments. To the best of our knowledge, this is the first report that applies overlapped speech recognition to unconstrained real meeting audio.
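A minimal skeleton of a BLSTM network with a fixed number J of time-synchronous output streams is sketched below; the layer sizes and the mask-based output formulation are assumptions, and this is not the paper's implementation of the unmixing transducer.

```python
# Skeleton of a BLSTM that maps multi-channel features to J output masks
# (illustrative only; layer sizes and the mask formulation are assumptions).
import torch
import torch.nn as nn

class UnmixingBLSTM(nn.Module):
    def __init__(self, input_dim, num_freq_bins, num_outputs=4, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # One mask per output channel and frequency bin.
        self.proj = nn.Linear(2 * hidden, num_outputs * num_freq_bins)
        self.num_outputs, self.num_freq_bins = num_outputs, num_freq_bins

    def forward(self, features):
        """features: (batch, frames, input_dim) windowed multi-channel features.
        Returns masks of shape (batch, num_outputs, frames, num_freq_bins)."""
        h, _ = self.blstm(features)
        masks = torch.sigmoid(self.proj(h))
        b, t, _ = masks.shape
        return masks.view(b, t, self.num_outputs, self.num_freq_bins).permute(0, 2, 1, 3)
```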
Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides employing various deep learning approaches for frame-level classification, sequence-level discriminative training has proved indispensable for achieving state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), one of the most common speech recognition tasks, benefits almost exclusively from frame-level deep learning due to the difficulty of obtaining competing sequence hypotheses. The few studies on sequence discriminative training for KWS are limited to fixed-vocabulary or LVCSR-based methods and have not been compared to state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed-vocabulary and unrestricted acoustic KWS. Sequence discriminative training for both sequence-level generative and discriminative models is systematically investigated. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments show that the proposed approaches obtain consistent and significant improvements in both fixed-vocabulary and unrestricted KWS tasks compared to previous frame-level deep learning based acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/201
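The core of such a criterion, as described above, is to score the reference keyword hypothesis against a set of competing hypotheses. The toy MMI-style objective below (with hypothesis scores assumed to be precomputed path log-likelihoods) illustrates that numerator/denominator structure:

```python
# Toy MMI-style sequence-discriminative objective: reference score minus the
# log-sum over reference and competing hypothesis scores (scores assumed given).
import numpy as np

def sequence_discriminative_objective(ref_logprob, competing_logprobs):
    """ref_logprob: log-likelihood of the reference (keyword) hypothesis.
    competing_logprobs: log-likelihoods of competing hypotheses (e.g., from
    word-independent phone lattices or blank-based paths)."""
    all_scores = np.append(np.asarray(competing_logprobs, dtype=float), ref_logprob)
    denominator = np.logaddexp.reduce(all_scores)   # log-sum-exp over all hypotheses
    return ref_logprob - denominator                # maximized during training

print(sequence_discriminative_objective(-10.0, [-12.0, -11.5, -15.0]))
```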
Improved Speaker-Dependent Separation for CHiME-5 Challenge
This paper summarizes several follow-up contributions for improving our submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to solve the problem of multi-channel, highly overlapped conversational speech recognition in a dinner-party scenario with reverberation and non-stationary noise. We adopt a speaker-aware training method that uses an i-vector as the target-speaker information for multi-talker speech separation. With only one unified separation model for all speakers, we achieve a 10% absolute improvement in word error rate (WER) over the previous baseline of 80.28% on the development set by leveraging our newly proposed data processing techniques and beamforming approach. With our improved back-end acoustic model, we further reduce the WER to 60.15%, which surpasses the result of our submitted CHiME-5 challenge system without applying any fusion techniques.
Comment: Submitted to Interspeech 2019, Graz, Austria
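Speaker-aware training as described above conditions the separation network on the target speaker. One common way to do this (sketched here with assumed shapes; the paper may use a different fusion scheme) is to append the target speaker's i-vector to every frame of the mixture features:

```python
# Appending a target-speaker i-vector to each frame of the mixture features
# (assumed shapes; fusion by simple concatenation is one common choice).
import numpy as np

def add_speaker_condition(mixture_feats, ivector):
    """mixture_feats: (num_frames, feat_dim); ivector: (ivector_dim,).
    Returns (num_frames, feat_dim + ivector_dim) speaker-conditioned features."""
    tiled = np.tile(ivector[None, :], (mixture_feats.shape[0], 1))
    return np.concatenate([mixture_feats, tiled], axis=1)

feats = np.random.default_rng(0).standard_normal((200, 40))
ivec = np.random.default_rng(1).standard_normal(100)
print(add_speaker_condition(feats, ivec).shape)   # (200, 140)
```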
Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform
In this paper, we propose a novel speech enhancement (SE) method that exploits the discrete wavelet transform (DWT). The new method reduces the fast time-varying portion, viz. the DWT detail component, in the spectrogram of speech signals so as to highlight the speech-dominant component and achieve better speech quality. A particular feature of this method is that it is completely unsupervised and requires no prior information about the clean speech and noise in the processed utterance. The presented DWT-based SE method with various scaling factors for the detail part is evaluated on a subset of the Aurora-2 database, and the PESQ metric is used to indicate the quality of the processed speech signals. The preliminary results show that the processed speech signals achieve a higher PESQ score than the original counterparts. Furthermore, we show that this method can still enhance the signal when the detail part is discarded entirely (by setting the respective scaling factor to zero), revealing that the spectrogram can be down-sampled, and thus compressed, without the cost of lowered quality. In addition, we integrate this new method with conventional speech enhancement algorithms, including spectral subtraction, Wiener filtering, and spectral MMSE estimation, and show that the resulting integration performs better than either component method alone. As a result, the new method is quite effective at improving speech quality and complements other SE methods well.
Comment: 4 pages, 4 figures, to appear in ISCSLP 201
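A small sketch of the core operation described above follows, assuming PyWavelets and a Haar wavelet applied along the time (modulation) axis of a magnitude spectrogram; the wavelet choice and axis handling are assumptions, not the paper's exact setup.

```python
# Scale (or zero) the DWT detail component of a magnitude spectrogram along the
# time axis, then reconstruct (assumes PyWavelets; Haar wavelet is an assumption).
import numpy as np
import pywt

def scale_detail(spectrogram, factor=0.0, wavelet="haar"):
    """spectrogram: (num_freq_bins, num_frames) magnitudes.
    factor: multiplier for the detail (fast time-varying) component."""
    approx, detail = pywt.dwt(spectrogram, wavelet, axis=1)   # DWT along time
    return pywt.idwt(approx, factor * detail, wavelet, axis=1)

spec = np.abs(np.random.default_rng(0).standard_normal((129, 200)))
enhanced = scale_detail(spec, factor=0.0)   # discard the detail part entirely
print(enhanced.shape)
```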