11 research outputs found
Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data
The lack of code-switch training data is one of the major concerns in the
development of end-to-end code-switching automatic speech recognition (ASR)
models. In this work, we propose a method to train an improved end-to-end
code-switching ASR model using only monolingual data. Our method encourages the
distributions of the output token embeddings of the monolingual languages to be
similar, and hence helps the ASR model to code-switch between languages more
readily. Specifically, we propose constraints based on the Jensen-Shannon
divergence and on the cosine distance. The former enforces similar
distributions for the output embeddings of the monolingual languages, while the
latter simply brings the centroids of the two distributions close to each
other. Experimental results demonstrate the high effectiveness of the proposed
method, yielding up to a 4.5% absolute mixed error rate improvement on a
Mandarin-English code-switching ASR task.
Comment: 5 pages, 3 figures, accepted to INTERSPEECH 201
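As a concrete illustration of the two constraints, here is a minimal PyTorch
sketch (our own, not the authors' code); treating the mean softmax over
embedding dimensions as the distribution proxy is an assumption, since the
abstract only specifies the two distance measures:

    import torch
    import torch.nn.functional as F

    def js_divergence(p, q, eps=1e-8):
        # Jensen-Shannon divergence between two probability vectors.
        m = 0.5 * (p + q)
        kl_pm = torch.sum(p * torch.log((p + eps) / (m + eps)))
        kl_qm = torch.sum(q * torch.log((q + eps) / (m + eps)))
        return 0.5 * (kl_pm + kl_qm)

    def embedding_constraints(emb_lang1, emb_lang2):
        # emb_lang*: (vocab_size, dim) output token embeddings per language.
        # Distribution proxy (assumed): average softmax over embedding dims.
        p = F.softmax(emb_lang1, dim=-1).mean(dim=0)
        q = F.softmax(emb_lang2, dim=-1).mean(dim=0)
        js_loss = js_divergence(p, q)
        # Cosine-distance term pulls the two centroids together.
        c1, c2 = emb_lang1.mean(dim=0), emb_lang2.mean(dim=0)
        cos_loss = 1.0 - F.cosine_similarity(c1, c2, dim=0)
        return js_loss, cos_loss

Both terms would be added to the ASR training loss so that the two monolingual
embedding spaces are drawn together during training.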
Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages
This paper provides an overall introduction of our Automatic Speech
Recognition (ASR) systems for Southeast Asian languages. Because little prior
work exists for such regional languages, several difficulties must be addressed
before the systems can be built: limited speech and text resources, a lack of
linguistic knowledge, and so on. This work takes Bahasa Indonesia and Thai as
examples to illustrate the strategies for collecting the various resources
required to build ASR systems.
Comment: Published by the 2017 IEEE International Conference on Orange
Technologies (ICOT 2017)
Independent language modeling architecture for end-to-end ASR
The attention-based end-to-end (E2E) automatic speech recognition (ASR)
architecture allows for joint optimization of acoustic and language models
within a single network. However, in a vanilla E2E ASR architecture, the
decoder sub-network (subnet), which incorporates the role of the language model
(LM), is conditioned on the encoder output. This means that the acoustic
encoder and the language model are entangled, which prevents the language model
from being trained separately on external text data. To address this problem, in
this work, we propose a new architecture that separates the decoder subnet from
the encoder output. In this way, the decoupled subnet becomes an independently
trainable LM subnet, which can easily be updated using the external text data.
We study two strategies for updating the new architecture. Experimental results
show that: 1) the independent LM architecture benefits from external text data,
achieving 9.3% and 22.8% relative character and word error rate reductions on
the Mandarin HKUST and English NSC datasets, respectively; and 2) the proposed
architecture works well with an external LM and can be generalized to different
amounts of labelled data.
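The abstract's key structural idea can be sketched as follows (a hedged
illustration, with layer choices assumed rather than taken from the paper): the
LM subnet sees only previous tokens, so it can be pretrained or updated on text
alone, and the acoustic context is merged in only afterwards.

    import torch
    import torch.nn as nn

    class DecoupledDecoder(nn.Module):
        def __init__(self, vocab_size, dim=512):
            super().__init__()
            # Independently trainable LM subnet: tokens in, hidden states out.
            self.embed = nn.Embedding(vocab_size, dim)
            self.lm_rnn = nn.LSTM(dim, dim, batch_first=True)
            # Fusion with the acoustic context happens only after the LM.
            self.fuse = nn.Linear(2 * dim, dim)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, tokens, acoustic_context):
            # tokens: (batch, steps); acoustic_context: (batch, steps, dim)
            lm_h, _ = self.lm_rnn(self.embed(tokens))  # no encoder input here
            h = torch.tanh(self.fuse(torch.cat([lm_h, acoustic_context], dim=-1)))
            return self.out(h)

Because the embedding, LM recurrence, and output projection never see the
encoder output, this subnet can be trained as a plain LM on external text and
plugged back in.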
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation
Our previously proposed MossFormer has achieved promising performance in
monaural speech separation. However, it predominantly adopts a
self-attention-based MossFormer module, which tends to emphasize longer-range,
coarser-scale dependencies but is less effective at modelling finer-scale
recurrent patterns. In this paper, we introduce a novel hybrid model that can
capture both long-range, coarse-scale dependencies and fine-scale recurrent
patterns by integrating a recurrent
module into the MossFormer framework. Instead of applying recurrent neural
networks (RNNs) with traditional recurrent connections, we present a recurrent
module based on a feedforward sequential memory network (FSMN), which is
considered an "RNN-free" recurrent network because it captures recurrent
patterns without recurrent connections. Our recurrent module mainly comprises
a dilated FSMN block enhanced with gated convolutional units (GCUs) and dense
connections. In addition, a bottleneck layer and an output layer are added to
control the information flow. The recurrent
module relies on linear projections and convolutions for seamless, parallel
processing of the entire sequence. The integrated MossFormer2 hybrid model
demonstrates remarkable enhancements over MossFormer and surpasses other
state-of-the-art methods in WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR!
benchmarks.
Comment: 5 pages, 3 figures, accepted by ICASSP 202
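To make the "RNN-free" recurrence concrete, here is a rough sketch of a dilated
FSMN-style block with a GCU-style gate (layer sizes and the exact gating form
are our assumptions, not the released model):

    import torch
    import torch.nn as nn

    class DilatedFSMNBlock(nn.Module):
        def __init__(self, dim, kernel=5, dilation=2):
            super().__init__()
            pad = (kernel - 1) // 2 * dilation
            # Depthwise conv acts as the FSMN memory over neighbouring frames:
            # sequence context without any recurrent connection.
            self.memory = nn.Conv1d(dim, dim, kernel, padding=pad,
                                    dilation=dilation, groups=dim)
            # Gated convolutional unit (GCU)-style gate.
            self.gate = nn.Conv1d(dim, dim, 1)

        def forward(self, x):                  # x: (batch, time, dim)
            h = x.transpose(1, 2)              # -> (batch, dim, time)
            mem = self.memory(h) * torch.sigmoid(self.gate(h))
            return x + mem.transpose(1, 2)     # skip connection over the memory

Everything here is linear projections and convolutions, which is what allows
the seamless, parallel processing of the entire sequence mentioned above.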
Contrastive Speech Mixup for Low-resource Keyword Spotting
Most of the existing neural-based models for keyword spotting (KWS) in smart
devices require thousands of training samples to learn a decent audio
representation. However, with the rising demand for smart devices to become
more personalized, KWS models need to adapt quickly to smaller user samples. To
tackle this challenge, we propose a contrastive speech mixup (CosMix) learning
algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss
to the existing mixup augmentation technique to maximize the relative
similarity between the original pre-mixed samples and the augmented samples.
The goal is to inject additional constraints that guide the model towards
simpler but richer content-based speech representations from two augmented
views (i.e., noisy mixed and clean pre-mixed utterances). We conduct our experiments on the
Google Speech Command dataset, where we trim the size of the training set to as
small as 2.5 minutes per keyword to simulate a low-resource condition. Our
experimental results show a consistent improvement in the performance of
multiple models, demonstrating the effectiveness of our method.
Comment: Accepted by ICASSP 202
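A hedged sketch of the CosMix objective as described above (mixing on raw
waveforms, the cosine form of the contrastive term, and the fixed mixing weight
are our assumptions):

    import torch
    import torch.nn.functional as F

    def cosmix_loss(encoder, x1, x2, lam=0.7):
        # Standard mixup on the input audio; lam is often sampled from a
        # Beta distribution rather than fixed.
        x_mix = lam * x1 + (1.0 - lam) * x2
        z1, z2, z_mix = encoder(x1), encoder(x2), encoder(x_mix)
        # Contrastive term: the mixed (noisy) view should stay similar to
        # its clean pre-mixed sources, weighted by the mixing coefficient.
        sim = lam * F.cosine_similarity(z_mix, z1, dim=-1) \
            + (1.0 - lam) * F.cosine_similarity(z_mix, z2, dim=-1)
        return -sim.mean()

This auxiliary loss would be added to the usual KWS classification loss on the
mixed samples.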
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Large self-supervised pre-trained speech models require computationally
expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple
parameter-efficient alternative by utilizing minimal soft prompt guidance,
enhancing portability while also maintaining competitive performance. However,
how and why this is so is not yet well understood. In this study, we aim to
deepen our understanding of this emerging method by investigating the role of
soft prompts in automatic speech recognition (ASR). Our findings highlight
their role as zero-shot learners in improving ASR performance, but also show
that they are vulnerable to malicious modifications. Soft prompts aid
generalization but are not obligatory for inference. We also identify two
primary roles of soft prompts: content refinement and noise information
enhancement, the latter of which improves robustness against background noise.
Additionally, we propose an effective modification to the noise prompts to show
that they are capable of zero-shot learning when adapting to
out-of-distribution noise environments.
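For readers unfamiliar with the method being probed, soft prompt tuning in its
generic form looks like the following (an illustration of the general
technique, not the paper's code): a small set of learnable vectors is prepended
to the frozen model's input features, and only those vectors are trained.

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, prompt_len, dim):
            super().__init__()
            # Only these parameters are trained; the ASR model stays frozen.
            self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

        def forward(self, features):           # features: (batch, time, dim)
            p = self.prompt.unsqueeze(0).expand(features.size(0), -1, -1)
            return torch.cat([p, features], dim=1)

The finding that soft prompts are "not obligatory for inference" would
correspond to simply skipping the concatenation at test time.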
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance
Dual-path is a popular architecture for speech separation models (e.g.,
Sepformer): it splits long sequences into overlapping chunks, with intra-blocks
modelling intra-chunk local features and inter-blocks modelling inter-chunk
global relationships. However, it has been found that the inter-blocks, which
account for half of a dual-path model's parameters, contribute minimally to
performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to
replace inter-blocks. SPGM is named after its structure consisting of a
parameter-free global pooling module followed by a modulation module comprising
only 2% of the model's total parameters. The SPGM block allows all transformer
layers in the model to be dedicated to local feature modelling, making the
overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4
dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and
0.3 dB respectively, and matching the performance of recent SOTA models with
up to 8 times fewer parameters.
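A minimal sketch of the SPGM structure as summarised above (the choice of mean
pooling and a sigmoid-gated linear layer are our assumptions for the
parameter-free pooling and the small modulation module):

    import torch
    import torch.nn as nn

    class SPGMBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # The only learned parameters: a tiny modulation module.
            self.modulate = nn.Linear(dim, dim)

        def forward(self, x):                  # x: (batch, time, dim)
            g = x.mean(dim=1, keepdim=True)    # parameter-free global pooling
            # Rescale local features with the pooled global statistics.
            return x * torch.sigmoid(self.modulate(g))

Replacing every inter-block with such a module leaves all transformer layers
free for intra-chunk (local) modelling, which is what makes the overall model
single-path.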