10 research outputs found
Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures
Despite recent strides made in Speech Separation, most models are trained on
datasets with neutral emotions. Emotional speech is known to degrade the
performance of models in a variety of speech tasks, which reduces the
effectiveness of these models when deployed in real-world scenarios. In this
paper we perform an analysis to differentiate the performance degradation arising
from the emotions in speech from the impact of out-of-domain inference. This is
measured using a carefully designed test dataset, Emo2Mix, consisting of
balanced data across all emotional combinations. We show that even models with
strong out-of-domain performance such as Sepformer can still suffer significant
degradation of up to 5.1 dB SI-SDRi on mixtures with strong emotions. This
demonstrates the importance of accounting for emotions in real-world speech
separation applications.
Comment: Accepted by APSIPA ASC 202
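The SI-SDRi figures quoted above can be made concrete with a short sketch of how the metric is computed. The signals below are synthetic stand-ins, not Emo2Mix data, and the separator output is simulated by attenuating the interference:

```python
import numpy as np

def si_sdr(est, ref):
    # scale-invariant SDR in dB: project the estimate onto the reference,
    # then compare the projected "target" energy against the residual
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref
    residual = est - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)            # 1 s of "speech" at 16 kHz
interference = rng.standard_normal(16000)   # competing source
mix = ref + interference                    # unprocessed mixture
est = ref + 0.1 * interference              # hypothetical separator output
si_sdri = si_sdr(est, ref) - si_sdr(mix, ref)   # improvement in dB
```

A 5.1 dB drop in this quantity, as reported for emotional mixtures, means the separated output retains markedly more interference energy relative to the input mixture.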
Amino Acid Classification in 2D NMR Spectra via Acoustic Signal Embeddings
Nuclear Magnetic Resonance (NMR) is used in structural biology to
experimentally determine the structure of proteins, which is used in many areas
of biology and is an important part of drug development. Unfortunately, NMR
data can cost thousands of dollars per sample to collect and it can take a
specialist weeks to assign the observed resonances to specific chemical groups.
There has thus been growing interest in the NMR community to use deep learning
to automate NMR data annotation. Due to similarities between NMR and audio
data, we propose that methods used in acoustic signal processing can be applied
to NMR as well. Using a simulated amino acid dataset, we show that by swapping
out filter banks with a trainable convolutional encoder, acoustic signal
embeddings from speaker verification models can be used for amino acid
classification in 2D NMR spectra by treating each amino acid as a unique
speaker. On an NMR dataset comparable in size to 46 hours of audio, we
achieve a classification performance of 97.7% on a 20-class problem. We also
achieve a 23% relative improvement by using an acoustic embedding model
compared to an existing NMR-based model.
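The analogy of treating an amino acid as a "speaker" can be sketched minimally: a trainable convolutional front-end replaces fixed filter banks, and speaker-verification-style statistics pooling yields a fixed-size embedding. The 1D signal, kernel shapes, and random "learned" weights below are illustrative stand-ins for real 2D NMR data and trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_encoder(signal, kernels):
    # trainable convolutional front-end standing in for fixed filter banks:
    # each kernel yields one channel of a spectrogram-like feature map
    return np.stack([np.convolve(signal, k, mode="valid") for k in kernels])

def stats_pool(feats):
    # x-vector-style statistics pooling: mean and std over the time axis
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

kernels = rng.standard_normal((8, 32))   # 8 "learned" kernels (random here)
signal = rng.standard_normal(1024)       # stand-in for a slice of an NMR signal
embedding = stats_pool(conv_encoder(signal, kernels))   # fixed-size embedding
```

A classifier head over such embeddings would then assign each spectrum to one of the 20 amino-acid classes.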
Contrastive Speech Mixup for Low-resource Keyword Spotting
Most of the existing neural-based models for keyword spotting (KWS) in smart
devices require thousands of training samples to learn a decent audio
representation. However, with the rising demand for smart devices to become
more personalized, KWS models need to adapt quickly to smaller user samples. To
tackle this challenge, we propose a contrastive speech mixup (CosMix) learning
algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss
to the existing mixup augmentation technique to maximize the relative
similarity between the original pre-mixed samples and the augmented samples.
The goal is to impose constraints that guide the model towards simpler
but richer content-based speech representations from two augmented views (i.e.,
noisy mixed and clean pre-mixed utterances). We conduct our experiments on the
Google Speech Command dataset, where we trim the size of the training set to as
small as 2.5 mins per keyword to simulate a low-resource condition. Our
experimental results show a consistent improvement in the performance of
multiple models, which demonstrates the effectiveness of our method.
Comment: Accepted by ICASSP 202
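The combination of mixup with an auxiliary contrastive term can be sketched in a few lines. The embeddings, the temperature `tau`, and the single-negative setup are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, x2, lam):
    # standard mixup augmentation on raw inputs
    return lam * x1 + (1 - lam) * x2

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(z_mix, z_clean, z_neg, tau=0.1):
    # pull the mixed view toward its clean pre-mixed view and push it
    # away from an unrelated negative, InfoNCE style
    pos = np.exp(cosine(z_mix, z_clean) / tau)
    neg = np.exp(cosine(z_mix, z_neg) / tau)
    return -np.log(pos / (pos + neg))

z_clean = rng.standard_normal(64)    # embedding of the pre-mixed utterance
z_neg = rng.standard_normal(64)      # embedding of an unrelated utterance
loss_aligned = contrastive_loss(z_clean, z_clean, z_neg)   # views agree
loss_mismatch = contrastive_loss(rng.standard_normal(64), z_clean, z_neg)
```

The loss is small when the mixed-view embedding matches its clean counterpart and grows when the two views drift apart, which is the relative-similarity constraint described above.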
Noise robust distillation of self-supervised speech models via correlation metrics
Compared to large speech foundation models, small distilled models exhibit
degraded noise robustness. The student's robustness can be improved by
introducing noise at the inputs during pre-training. Despite this, using the
standard distillation loss still yields a student with degraded performance.
Thus, this paper proposes improving student robustness via distillation with
correlation metrics. Teacher behavior is learned by pushing the
cross-correlation matrix between teacher and student representations towards
identity. Noise robustness is encouraged by minimizing the student's
self-correlation. The proposed method is agnostic to the teacher model and
consistently outperforms the previous approach. This work also proposes a
heuristic to weigh the importance of the two correlation terms automatically.
Experiments show consistently better clean and noise generalization on Intent
Classification, Keyword Spotting, and Automatic Speech Recognition tasks on
the SUPERB Challenge.
Comment: 6 pages
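The two correlation terms can be illustrated with a Barlow-Twins-style sketch; the batch size, feature dimension, and equal on/off-diagonal weighting below are assumptions for illustration:

```python
import numpy as np

def normalize(z):
    # standardize each feature dimension over the batch
    return (z - z.mean(axis=0)) / z.std(axis=0)

def cross_corr_loss(z_student, z_teacher):
    # push the student-teacher cross-correlation matrix towards identity:
    # diagonal terms to 1 (mimic the teacher), off-diagonal terms to 0
    c = normalize(z_student).T @ normalize(z_teacher) / len(z_student)
    on = ((np.diag(c) - 1) ** 2).sum()
    off = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on + off

def self_corr_loss(z_student):
    # decorrelate the student's own features (off-diagonal terms only)
    c = normalize(z_student).T @ normalize(z_student) / len(z_student)
    return (c ** 2).sum() - (np.diag(c) ** 2).sum()

rng = np.random.default_rng(0)
z_teacher = rng.standard_normal((256, 8))              # batch of teacher features
loss_matched = cross_corr_loss(z_teacher, z_teacher)   # student copies teacher
loss_random = cross_corr_loss(rng.standard_normal((256, 8)), z_teacher)
```

A weighted sum of the two losses would then be minimized during distillation, with the weighting chosen by the heuristic the paper proposes.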
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation
Our previously proposed MossFormer has achieved promising performance in
monaural speech separation. However, it predominantly adopts a
self-attention-based MossFormer module, which tends to emphasize longer-range,
coarser-scale dependencies, with a deficiency in effectively modelling
finer-scale recurrent patterns. In this paper, we introduce a novel hybrid
model that provides the capabilities to model both long-range, coarse-scale
dependencies and fine-scale recurrent patterns by integrating a recurrent
module into the MossFormer framework. Instead of applying the recurrent neural
networks (RNNs) that use traditional recurrent connections, we present a
recurrent module based on a feedforward sequential memory network (FSMN), which
is considered an "RNN-free" recurrent network due to its ability to capture
recurrent patterns without using recurrent connections. Our recurrent module
mainly comprises an enhanced dilated FSMN block that uses gated convolutional
units (GCUs) and dense connections. In addition, a bottleneck layer and an
output layer are also added for controlling information flow. The recurrent
module relies on linear projections and convolutions for seamless, parallel
processing of the entire sequence. The integrated MossFormer2 hybrid model
demonstrates remarkable enhancements over MossFormer and surpasses other
state-of-the-art methods in WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR!
benchmarks.
Comment: 5 pages, 3 figures, accepted by ICASSP 202
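The "RNN-free" recurrence idea can be sketched with a minimal FSMN-style memory update: each frame is augmented with a learned weighted sum of a fixed window of past frames, computed in parallel over the whole sequence. Real FSMN blocks use per-dimension weight vectors, dilation, and gating; the scalar taps here are a simplifying assumption:

```python
import numpy as np

def fsmn_memory(x, taps):
    # FSMN-style memory: no recurrent connections, just fixed look-back
    # taps applied to the whole (time, feature) sequence in parallel
    out = x.copy()
    for k, w in enumerate(taps, start=1):
        out[k:] += w * x[:-k]    # add the frame from k steps back, weighted
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))        # (time, feature) sequence
y = fsmn_memory(x, taps=[0.5, 0.25])    # memory over the two previous frames
```

Because the update is a convolution-like operation rather than a sequential state update, the whole sequence can be processed at once, which is the parallelism advantage noted above.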
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Large self-supervised pre-trained speech models require computationally
expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple
parameter-efficient alternative by utilizing minimal soft prompt guidance,
enhancing portability while also maintaining competitive performance. However,
how and why soft prompts work remains poorly understood. In this study, we aim to
deepen our understanding of this emerging method by investigating the role of
soft prompts in automatic speech recognition (ASR). Our findings highlight
their role as zero-shot learners in improving ASR performance, but also reveal
their vulnerability to malicious modifications. Soft prompts aid generalization but
are not obligatory for inference. We also identify two primary roles of soft
prompts: content refinement and noise information enhancement, which improves
robustness against background noise. Additionally, we propose an effective
modification on noise prompts to show that they are capable of zero-shot
learning when adapting to out-of-distribution noise environments.
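The soft-prompt mechanism itself can be sketched in a few lines: trainable vectors are prepended to the frozen input embeddings, and only those vectors receive gradient updates. The dimensions and the numpy stand-in for a frozen backbone are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16                                          # embedding dimension (illustrative)
frozen_embed = rng.standard_normal((100, D))    # frozen token-embedding table
soft_prompt = rng.standard_normal((4, D))       # the only trainable parameters

def with_soft_prompt(token_ids):
    # prepend the soft prompt to the frozen input embeddings; during
    # tuning, gradients update `soft_prompt` only, never the backbone
    return np.concatenate([soft_prompt, frozen_embed[token_ids]], axis=0)

inputs = with_soft_prompt(np.array([3, 14, 15]))   # 4 prompt + 3 token vectors
```

This is also why the prompts are portable and why tampering with them can alter model behavior: they are a small, swappable prefix to every input.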
SPGM: Prioritizing Local Features for enhanced speech separation performance
Dual-path is a popular architecture for speech separation models (e.g.
Sepformer) which splits long sequences into overlapping chunks for its intra-
and inter-blocks that separately model intra-chunk local features and
inter-chunk global relationships. However, it has been found that inter-blocks,
which comprise half a dual-path model's parameters, contribute minimally to
performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to
replace inter-blocks. SPGM is named after its structure consisting of a
parameter-free global pooling module followed by a modulation module comprising
only 2% of the model's total parameters. The SPGM block allows all transformer
layers in the model to be dedicated to local feature modelling, making the
overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4
dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and
0.3 dB respectively, and matching the performance of recent SOTA models with up
to 8 times fewer parameters.
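The pooling-plus-modulation structure described above might be sketched as follows. The choice of pooling statistics (mean and std), the sigmoid gate, and the weight shapes are assumptions for illustration, not the paper's exact block:

```python
import numpy as np

def spgm_block(x, w_gate, w_out):
    # parameter-free global pooling: mean and std over the time axis
    pooled = np.concatenate([x.mean(axis=0), x.std(axis=0)])   # (2D,)
    # lightweight modulation: a sigmoid gate derived from the global
    # summary rescales every frame's local features
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_gate)))            # (D,)
    return (x * gate) @ w_out                                  # (T, D)

rng = np.random.default_rng(0)
T, D = 50, 8
x = rng.standard_normal((T, D))            # chunk of local features
w_gate = rng.standard_normal((2 * D, D)) * 0.1
w_out = rng.standard_normal((D, D)) * 0.1
y = spgm_block(x, w_gate, w_out)
```

The pooling itself has no parameters, so the only learned weights are the small modulation projections, which is how the block stays at roughly 2% of total model parameters while replacing the inter-blocks.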
ConvMixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-field Keyword Spotting
Building efficient architecture in neural speech processing is paramount to
success in keyword spotting deployment. However, it is very challenging for
lightweight models to achieve noise robustness with concise neural operations.
In a real-world application, the user environment is typically noisy and may
also contain reverberations. We propose a novel feature-interactive
convolutional model with merely 100K parameters to tackle this problem under
noisy far-field conditions. An interactive unit is used in place of the
attention module, promoting the flow of information with more efficient
computation.
Moreover, curriculum-based multi-condition training is adopted to attain better
noise robustness. Our model achieves 98.2% top-1 accuracy on Google Speech
Command V2-12 and is competitive against large transformer models under the
designed noise condition.
Comment: submitted to ICASSP 202
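Curriculum-based multi-condition training can be sketched as an SNR schedule that moves from near-clean to challenging conditions. The schedule values and the SNR-mixing helper below are illustrative, not the paper's exact recipe:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # scale the noise so the mixture hits the requested SNR exactly
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

# curriculum: begin near-clean, end at challenging far-field SNRs
snr_schedule = [20, 15, 10, 5, 0]   # dB, one stage per training phase
mixtures = [add_noise_at_snr(speech, noise, s) for s in snr_schedule]
```

Each stage of training would draw mixtures at the current schedule point, letting the small model master easy conditions before the hardest ones.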
Deep Learning Systems for Pneumothorax Detection on Chest Radiographs: A Multicenter External Validation Study
DOI: 10.1148/ryai.2021200190. Radiology: Artificial Intelligence, 3
Guidelines for the use and interpretation of assays for monitoring autophagy (4th edition)
In 2008, we published the first set of guidelines for standardizing research in autophagy. Since then, this topic has received increasing attention, and many scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Thus, it is important to formulate on a regular basis updated guidelines for monitoring autophagy in different organisms. Despite numerous reviews, there continues to be confusion regarding acceptable methods to evaluate autophagy, especially in multicellular eukaryotes. Here, we present a set of guidelines for investigators to select and interpret methods to examine autophagy and related processes, and for reviewers to provide realistic and reasonable critiques of reports that are focused on these processes. These guidelines are not meant to be a dogmatic set of rules, because the appropriateness of any assay largely depends on the question being asked and the system being used. Moreover, no individual assay is perfect for every situation, calling for the use of multiple techniques to properly monitor autophagy in each experimental setting. Finally, several core components of the autophagy machinery have been implicated in distinct autophagic processes (canonical and noncanonical autophagy), implying that genetic approaches to block autophagy should rely on targeting two or more autophagy-related genes that ideally participate in distinct steps of the pathway. Along similar lines, because multiple proteins involved in autophagy also regulate other cellular pathways including apoptosis, not all of them can be used as a specific marker for bona fide autophagic responses. Here, we critically discuss current methods of assessing autophagy and the information they can, or cannot, provide. Our ultimate goal is to encourage intellectual and technical innovation in the field.