Neural Speech Synthesis with Transformer Network
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty in modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this
paper, we introduce and adapt the multi-head attention mechanism to replace the
RNN structures and also the original attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves the training efficiency. Meanwhile,
any two inputs at different time steps are connected directly by the self-attention mechanism, which effectively addresses the long-range dependency problem. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. In terms of efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. In terms of performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs. 4.44 in MOS).
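A key architectural point here is that multi-head self-attention computes the hidden states for all positions in one pass, rather than unrolling an RNN step by step. The PyTorch sketch below illustrates that parallel encoder computation in spirit only; it is not the authors' implementation, and the layer sizes, names, and toy input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderLayer(nn.Module):
    """Illustrative encoder layer: multi-head self-attention + feed-forward.

    All time steps are processed in batched matrix multiplications,
    unlike an RNN, which must unroll the sequence step by step.
    """
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)        # every pair of positions attends directly
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

# Toy usage: a batch of 8 phoneme-embedding sequences of length 100.
phoneme_emb = torch.randn(8, 100, 256)
hidden = SelfAttentionEncoderLayer()(phoneme_emb)   # (8, 100, 256), computed in parallel
```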
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is otherwise invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 201
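The argument for dropping positional encoding is that, under a causal mask, position t attends to exactly t+1 tokens, so the amount of aggregated information itself encodes position. The sketch below illustrates a decoder-style layer applied to token embeddings with a causal mask and no positional encoding; the hyperparameters are illustrative assumptions and do not reproduce the paper's configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads, vocab = 512, 8, 10000

embed = nn.Embedding(vocab, d_model)       # note: no positional encoding is added
layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                   batch_first=True)
lm_head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (4, 128))  # (batch, time)
x = embed(tokens)

# Causal mask: position t may only attend to positions <= t, so the number of
# visible tokens grows with t and acts as an implicit positional signal.
T = tokens.size(1)
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

h = layer(x, src_mask=causal_mask)          # (4, 128, 512)
logits = lm_head(h)                         # next-token prediction logits
```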
Improving Design of Input Condition Invariant Speech Enhancement
Building a single universal speech enhancement (SE) system that can handle arbitrary input is an in-demand but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE".
Such a model was recently proposed and showed promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper, we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while degradation in real conditions is largely mitigated. For this purpose, we redesign the
key components that comprise such a system. First, we identify that the
channel-modeling module's generalization to unseen scenarios can be sub-optimal
and redesign this module. We further introduce a two-stage training strategy to
enhance training efficiency. Second, we propose two novel dual-path
time-frequency blocks, demonstrating superior performance with fewer parameters
and computational costs compared to the existing method. All proposals
combined, experiments on various public datasets validate the efficacy of the
proposed model, with significantly improved performance on real conditions.
A recipe with full model details is released at https://github.com/espnet/espnet.
Comment: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2).
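The proposed dual-path time-frequency blocks build on the general pattern of alternating sequence modeling along the frequency axis within each frame and along the time axis within each frequency bin. The sketch below shows only that generic dual-path time-frequency pattern, with LSTMs as placeholder sequence models; the blocks actually proposed in the paper differ in their internals and sizes.

```python
import torch
import torch.nn as nn

class DualPathTFBlock(nn.Module):
    """Generic dual-path time-frequency block (illustrative, not the paper's design).

    Input: (batch, time_frames, freq_bins, channels) STFT-domain features.
    A sequence model first runs along the frequency axis of each frame,
    then along the time axis of each frequency bin.
    """
    def __init__(self, channels=64, hidden=64):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x):                              # x: (B, T, F, C)
        b, t, f, c = x.shape
        # Intra-frame path: model across frequency for each time frame.
        y = x.reshape(b * t, f, c)
        y = self.freq_proj(self.freq_rnn(y)[0]).reshape(b, t, f, c)
        x = x + y                                      # residual connection
        # Inter-frame path: model across time for each frequency bin.
        y = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        y = self.time_proj(self.time_rnn(y)[0]).reshape(b, f, t, c).permute(0, 2, 1, 3)
        return x + y

feats = torch.randn(2, 100, 129, 64)                   # (batch, frames, bins, channels)
out = DualPathTFBlock()(feats)                         # same shape as the input
```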
Toward Universal Speech Enhancement for Diverse Input Conditions
The past decade has witnessed substantial growth of data-driven speech
enhancement (SE) techniques thanks to deep learning. While existing approaches
have shown impressive performance in some common datasets, most of them are
designed only for a single condition (e.g., single-channel, multi-channel, or a
fixed sampling frequency) or only consider a single task (e.g., denoising or
dereverberation). Currently, there is no universal SE approach that can
effectively handle diverse input conditions with a single model. In this paper,
we make the first attempt to investigate this line of research. First, we
devise a single SE model that is independent of microphone channels, signal
lengths, and sampling frequencies. Second, we design a universal SE benchmark
by combining existing public corpora with multiple conditions. Our experiments
on a wide range of datasets show that the proposed single model can
successfully handle diverse conditions with strong performance.
Comment: 6 pages, 3 figures, 5 tables, published in ASRU 2023 (corrected the results of noisy speech on CHiME-4 (Simu) in Table 4).
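One common way to make a model independent of the number of microphone channels is to process every channel with shared weights and then pool across channels, so the output shape no longer depends on the channel count. The wrapper below illustrates only that generic trick, not necessarily the design used in this paper; the encoder and shapes are made up for the example.

```python
import torch
import torch.nn as nn

class ChannelAgnosticWrapper(nn.Module):
    """Illustrative channel-count-independent front end (not the paper's design).

    A single-channel encoder is applied to every microphone independently,
    then the features are averaged across channels, so the output shape
    does not depend on how many channels were provided.
    """
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, x):                     # x: (batch, channels, samples)
        b, c, n = x.shape
        feats = self.encoder(x.reshape(b * c, 1, n))   # per-channel encoding
        feats = feats.reshape(b, c, *feats.shape[1:])
        return feats.mean(dim=1)              # channel-invariant pooling

enc = nn.Sequential(nn.Conv1d(1, 32, kernel_size=16, stride=8), nn.ReLU())
wrapper = ChannelAgnosticWrapper(enc)
mono = wrapper(torch.randn(2, 1, 16000))      # works with 1 channel ...
array = wrapper(torch.randn(2, 6, 16000))     # ... and with 6 channels; same output shape
```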
DPATD: Dual-Phase Audio Transformer for Denoising
Recent high-performance transformer-based speech enhancement models demonstrate that time-domain methods can achieve performance similar to that of time-frequency domain methods. However, time-domain speech enhancement systems
typically receive input audio sequences consisting of a large number of time
steps, making it challenging to model extremely long sequences and train models
to perform adequately. In this paper, to address these challenges, we utilize smaller audio chunks as input and make efficient use of the audio information. We propose a dual-phase audio transformer for denoising (DPATD), a
novel model to organize transformer layers in a deep structure to learn clean
audio sequences for denoising. DPATD splits the audio input into smaller
chunks, where the input length can be proportional to the square root of the
original sequence length. Our memory-compressed explainable attention is
efficient and converges faster compared to the frequently used self-attention
module. Extensive experiments demonstrate that our model outperforms
state-of-the-art methods.
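The chunking idea can be made concrete: if a length-T input is split into chunks of length about sqrt(T), both the chunk length and the number of chunks grow only as sqrt(T), which keeps modeling either dimension affordable. The minimal sketch below shows such chunking; zero-padding and the absence of chunk overlap are simplifying assumptions, not the paper's exact recipe.

```python
import math
import torch
import torch.nn.functional as F

def chunk_signal(x: torch.Tensor) -> torch.Tensor:
    """Split a (batch, time) signal into chunks of length ~sqrt(time).

    Returns (batch, num_chunks, chunk_len); the tail is zero-padded.
    With chunk_len ~ sqrt(T), both chunk_len and num_chunks grow only
    as sqrt(T), instead of the sequence length T itself.
    """
    b, t = x.shape
    chunk_len = max(1, int(math.sqrt(t)))
    num_chunks = math.ceil(t / chunk_len)
    pad = num_chunks * chunk_len - t
    x = F.pad(x, (0, pad))
    return x.view(b, num_chunks, chunk_len)

sig = torch.randn(4, 16000)        # 1 s of 16 kHz audio
chunks = chunk_signal(sig)         # (4, 127, 126): both dims are about sqrt(16000)
```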
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
two recently and independently developed technologies: array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we newly design a t-SOT-based ASR model that generates a
serialized multi-talker transcription based on two separated speech signals
from VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves the state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.
Comment: 6 pages, 2 figures, 3 tables, v2: Appendix A has been added.
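In t-SOT, words from overlapping talkers are merged into a single token stream ordered by time, with a special channel-change token marking switches between the (at most two) virtual output channels. The following sketch is a heavily simplified illustration of that serialization for the two-talker case; the token name and word-level timing are assumptions made for the example, not the original t-SOT recipe.

```python
def serialize_t_sot(utterances, cc_token="<cc>"):
    """Simplified sketch of t-SOT-style serialization for two overlapping talkers.

    `utterances` is a list of (word, end_time, channel) tuples, where `channel`
    is the virtual output channel (0 or 1) a word belongs to. Words are merged
    into one stream sorted by time, and a channel-change token is inserted
    whenever the stream switches channels.
    """
    stream, prev_channel = [], None
    for word, _, channel in sorted(utterances, key=lambda u: u[1]):
        if prev_channel is not None and channel != prev_channel:
            stream.append(cc_token)
        stream.append(word)
        prev_channel = channel
    return stream

# Two talkers overlapping in time:
words = [("hello", 0.5, 0), ("good", 0.7, 1), ("world", 0.9, 0), ("morning", 1.2, 1)]
print(serialize_t_sot(words))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```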
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance
Dual-path is a popular architecture for speech separation models (e.g., Sepformer): it splits long sequences into overlapping chunks so that its intra- and inter-blocks can separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks,
which comprise half a dual-path model's parameters, contribute minimally to
performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to
replace inter-blocks. SPGM is named after its structure consisting of a
parameter-free global pooling module followed by a modulation module comprising
only 2% of the model's total parameters. The SPGM block allows all transformer
layers in the model to be dedicated to local feature modelling, making the
overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4
dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and
0.3 dB respectively, and matching the performance of recent SOTA models with up to 8 times fewer parameters.
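As described, the SPGM block pairs a parameter-free global pooling step with a small modulation module that injects the global summary back into every frame. The sketch below shows one plausible reading of that structure, using mean pooling and a sigmoid gate; the exact pooling statistics and modulation form in the paper may differ.

```python
import torch
import torch.nn as nn

class SPGMLikeBlock(nn.Module):
    """Illustrative global-modulation block: parameter-free pooling + tiny modulation.

    Local features of shape (batch, time, channels) are summarized by a
    parameter-free global pooling step (mean over time); a small gating layer
    then modulates every frame with that global summary.
    """
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):                              # x: (B, T, C)
        global_summary = x.mean(dim=1, keepdim=True)   # parameter-free pooling
        return x * self.gate(global_summary)           # broadcast modulation over time

local_feats = torch.randn(2, 400, 256)
modulated = SPGMLikeBlock()(local_feats)               # (2, 400, 256)
```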
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation
Our previously proposed MossFormer has achieved promising performance in
monaural speech separation. However, it predominantly adopts a
self-attention-based MossFormer module, which tends to emphasize longer-range,
coarser-scale dependencies and falls short in effectively modelling
finer-scale recurrent patterns. In this paper, we introduce a novel hybrid
model that provides the capabilities to model both long-range, coarse-scale
dependencies and fine-scale recurrent patterns by integrating a recurrent
module into the MossFormer framework. Instead of applying the recurrent neural
networks (RNNs) that use traditional recurrent connections, we present a
recurrent module based on a feedforward sequential memory network (FSMN), which
is considered "RNN-free" recurrent network due to the ability to capture
recurrent patterns without using recurrent connections. Our recurrent module
mainly comprises an enhanced dilated FSMN block by using gated convolutional
units (GCU) and dense connections. In addition, a bottleneck layer and an
output layer are also added for controlling information flow. The recurrent
module relies on linear projections and convolutions for seamless, parallel
processing of the entire sequence. The integrated MossFormer2 hybrid model
demonstrates remarkable enhancements over MossFormer and surpasses other
state-of-the-art methods in WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR!
benchmarks.
Comment: 5 pages, 3 figures, accepted by ICASSP 202
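An FSMN-style memory block can be read as a learned weighted sum over neighboring frames, which is naturally expressed as a dilated, depthwise 1-D convolution and therefore needs no recurrent connections. The sketch below illustrates that idea only; the gated convolutional units, dense connections, bottleneck, and output layers of MossFormer2 are omitted, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FSMNLikeBlock(nn.Module):
    """Illustrative 'RNN-free' recurrent module in the FSMN spirit.

    Temporal context is captured by a dilated depthwise 1-D convolution over
    past/future frames and added back to the input, so the whole sequence is
    processed in parallel with no recurrent connections.
    """
    def __init__(self, channels=256, kernel_size=11, dilation=2):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.memory = nn.Conv1d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation, groups=channels)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                      # x: (B, T, C)
        mem = self.memory(x.transpose(1, 2)).transpose(1, 2)  # weighted sum of neighbors
        return x + self.proj(mem)              # residual "memory" update

seq = torch.randn(2, 500, 256)
out = FSMNLikeBlock()(seq)                     # (2, 500, 256)
```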