OverFlow: Putting flows on top of neural transducers for better TTS
Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best
features of classic statistical speech synthesis and modern neural TTS,
requiring less data and fewer training updates, and are less prone to gibberish
output caused by neural attention failures. In this paper, we combine neural
HMM TTS with normalising flows for describing the highly non-Gaussian
distribution of speech acoustics. The result is a powerful, fully probabilistic
model of durations and acoustics that can be trained using exact maximum
likelihood. Experiments show that a system based on our proposal needs fewer
updates than comparable methods to produce accurate pronunciations and a
subjective speech quality close to natural speech. Please see
https://shivammehta25.github.io/OverFlow/ for audio examples and code.
Comment: 5 pages, 2 figures. Accepted for publication at Interspeech 202
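The core idea of combining a neural HMM with normalising flows is that an invertible transformation with a tractable Jacobian lets the model describe non-Gaussian acoustics while keeping exact likelihoods. As a minimal sketch of the standard building block, here is an affine coupling layer; all class names, sizes, and the network design are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of an affine coupling layer, the usual building block of
# normalising flows. One half of the features predicts a scale and shift
# for the other half, giving an invertible map with a cheap log-det Jacobian.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),  # outputs log-scale and shift together
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t
        # log|det J| = sum of log-scales; this is what makes exact
        # maximum-likelihood training possible.
        log_det = log_s.sum(dim=-1)
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=-1)
```

Stacking several such layers (with the halves swapped between layers) yields an expressive yet exactly invertible distribution over acoustic frames.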
Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition
Transformer-based models have recently made significant achievements in the
application of end-to-end (E2E) automatic speech recognition (ASR). It is
possible to deploy the E2E ASR system on smart devices with the help of
Transformer-based models. However, these models still have the drawback of
requiring a large number of parameters. To overcome this limitation of
universal Transformer models for ASR on edge devices, we propose a
solution that reuses blocks in Transformer models for small-footprint ASR
systems, accommodating resource limitations without compromising
recognition accuracy.
Specifically, we design a novel block-reusing strategy for speech Transformer
(BRST) to enhance the effectiveness of parameters and propose an adapter module
(ADM) that can produce a compact and adaptable model with only a few additional
trainable parameters accompanying each reusing block. We conducted an
experiment with the proposed method on the public AISHELL-1 corpus, and the
results show that the proposed approach achieves character error rates (CER)
of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM,
respectively. In addition, we provide a deeper analysis of the effect of the
ADM in the general block-reusing method.
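The block-reusing idea can be made concrete: one Transformer layer's parameters are shared across several passes, while each pass adds a small trainable bottleneck adapter. The sketch below is a generic schematic of this pattern under assumed sizes, not the BRST architecture itself.

```python
# Hedged sketch of block reuse with per-pass adapters: a single shared
# Transformer encoder layer is applied several times, and each pass gets
# its own lightweight bottleneck adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class ReusedEncoder(nn.Module):
    def __init__(self, dim=64, heads=4, reuses=6):
        super().__init__()
        # One block's weights are shared across all `reuses` passes...
        self.block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # ...while each pass has only a few extra adapter parameters.
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(reuses))

    def forward(self, x):
        for adapter in self.adapters:
            x = adapter(self.block(x))
        return x
```

Because the adapters are tiny relative to a full layer, depth grows without a proportional growth in parameters, which is exactly the edge-device trade-off the abstract targets.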
One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era
OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is
demonstrated to be one small step for generative AI (GAI), but one giant leap
for artificial general intelligence (AGI). Since its official release in
November 2022, ChatGPT has quickly attracted numerous users with extensive
media coverage. Such unprecedented attention has also motivated numerous
researchers to investigate ChatGPT from various aspects. According to Google
Scholar, there are more than 500 articles with ChatGPT in their titles or
mentioning it in their abstracts. Considering this, a review is urgently
needed, and our work fills this gap. Overall, this work is the first to survey
ChatGPT with a comprehensive review of its underlying technology, applications,
and challenges. Moreover, we present an outlook on how ChatGPT might evolve to
realize general-purpose AIGC (a.k.a. AI-generated content), which will be a
significant milestone for the development of AGI.
Comment: A Survey on ChatGPT and GPT-4, 29 pages. Feedback is appreciated
([email protected])
Enhancing the Unified Streaming and Non-streaming Model with Contrastive Learning
The unified streaming and non-streaming speech recognition model has achieved
great success due to its comprehensive capabilities. In this paper, we propose
to improve the accuracy of the unified model by bridging the inherent
representation gap between the streaming and non-streaming modes with a
contrastive objective. Specifically, the top-layer hidden representations at
the same frame in the streaming and non-streaming modes are regarded as a
positive pair, encouraging the streaming-mode representation to be close to
its non-streaming counterpart. Multiple negative samples are randomly selected
from the remaining frames of the same utterance under the non-streaming mode.
Experimental results demonstrate that the proposed method achieves consistent
improvements over the unified model in both streaming and non-streaming
modes. Our method achieves a CER of 4.66% in the streaming mode and 4.31%
in the non-streaming mode, which sets a new state-of-the-art on the AISHELL-1
benchmark.
Comment: Accepted by INTERSPEECH 202
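The frame-level contrastive objective described above can be sketched as an InfoNCE-style loss: the streaming and non-streaming representations of the same frame form the positive pair, and other non-streaming frames act as negatives. For simplicity this sketch uses all other frames as negatives rather than a random subset; function name, temperature, and normalisation are assumptions.

```python
# Hedged sketch of a frame-level contrastive loss between streaming and
# non-streaming representations of the same utterance.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(h_stream, h_full, temperature=0.1):
    """h_stream, h_full: (T, D) top-layer representations of one utterance
    from the streaming and non-streaming modes respectively."""
    h_stream = F.normalize(h_stream, dim=-1)
    h_full = F.normalize(h_full, dim=-1)
    # (T, T) similarity matrix: row t compares streaming frame t with every
    # non-streaming frame; the diagonal entries are the positive pairs.
    logits = h_stream @ h_full.t() / temperature
    targets = torch.arange(h_stream.size(0))
    return F.cross_entropy(logits, targets)
```

Minimising this pulls each streaming frame toward its non-streaming counterpart while pushing it away from the other frames, which is the representation-gap bridging the abstract describes.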
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
We introduce the Universal Speech Model (USM), a single large model that
performs automatic speech recognition (ASR) across 100+ languages. This is
achieved by pre-training the encoder of the model on a large unlabeled
multilingual dataset of 12 million (M) hours spanning over 300 languages, and
fine-tuning on a smaller labeled dataset. We use multilingual pre-training with
random-projection quantization and speech-text modality matching to achieve
state-of-the-art performance on downstream multilingual ASR and speech-to-text
translation tasks. We also demonstrate that despite using a labeled training
set 1/7-th the size of that used for the Whisper model, our model exhibits
comparable or better performance on both in-domain and out-of-domain speech
recognition tasks across many languages.
Comment: 20 pages, 7 figures, 8 table
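The random-projection quantization used in this style of pre-training can be sketched in a few lines: features are projected by a frozen random matrix and matched to the nearest entry of a frozen random codebook, and the resulting discrete indices serve as prediction targets for masked frames. All sizes and the function name here are illustrative assumptions.

```python
# Hedged sketch of random-projection quantization: frozen random
# projection plus frozen random codebook produce discrete training
# targets with no learned quantizer.
import torch

def random_projection_labels(features, proj, codebook):
    """features: (T, D); proj: (D, P) frozen random matrix;
    codebook: (V, P) frozen random codebook. Returns (T,) code indices."""
    z = features @ proj                 # project into codebook space
    # Nearest codebook entry under Euclidean distance gives the label.
    dists = torch.cdist(z, codebook)    # (T, V)
    return dists.argmin(dim=-1)
```

Because both the projection and the codebook stay frozen, the targets are cheap and stable, and all learning capacity goes into the speech encoder.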
EM-Network: Oracle Guided Self-distillation for Sequence Learning
We introduce EM-Network, a novel self-distillation approach that effectively
leverages target information for supervised sequence-to-sequence (seq2seq)
learning. In contrast to conventional methods, it is trained with oracle
guidance, which is derived from the target sequence. Since the oracle guidance
compactly represents the target-side context that can assist the sequence model
in solving the task, the EM-Network achieves a better prediction compared to
using only the source input. To allow the sequence model to inherit the
promising capability of the EM-Network, we propose a new self-distillation
strategy, where the original sequence model can benefit from the knowledge of
the EM-Network in a one-stage manner. We conduct comprehensive experiments on
two types of seq2seq models: connectionist temporal classification (CTC) for
speech recognition and attention-based encoder-decoder (AED) for machine
translation. Experimental results demonstrate that the EM-Network significantly
advances the current state-of-the-art approaches, improving over the best prior
work on speech recognition and establishing state-of-the-art performance on
WMT'14 and IWSLT'14.
Comment: ICML 202
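The one-stage self-distillation idea can be illustrated with a generic loss: an oracle-guided teacher pass (which also sees target-side context) produces soft labels, and the source-only student matches them alongside the usual hard-label loss. This is a schematic of the general pattern under assumed names and weights, not the EM-Network architecture.

```python
# Hedged sketch of one-stage self-distillation: combine the hard-label
# task loss with a temperature-scaled KL term toward the oracle-guided
# teacher's (detached) predictions.
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, targets,
                      alpha=0.5, tau=2.0):
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return alpha * hard + (1 - alpha) * soft
```

Detaching the teacher logits keeps gradients flowing only into the student branch, which is what lets the source-only model inherit the teacher's target-aware behaviour in a single training stage.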