116 research outputs found
Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition
In recent years, speech-based self-supervised learning (SSL) has made
significant progress in various tasks, including automatic speech recognition
(ASR). An ASR model with decent performance can be realized by fine-tuning an
SSL model with a small fraction of labeled data. Reducing the demand for
labeled data is always of great practical value. In this paper, we further
extend the use of SSL to cut down labeling costs with active learning. Three
types of units on different granularities are derived from speech signals in an
unsupervised way, and their effects are compared by applying a contrastive data
selection method. The experimental results show that our proposed data
selection framework can effectively reduce the word error rate (WER) by more
than 11% with the same amount of labeled data, or halve the labeling cost while
maintaining the same WER, compared to random selection.
Comment: 5 pages, 3 figures. Accepted to Interspeech 202
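As a rough illustration of contrastive data selection (the abstract does not spell out the criterion, so the scoring rule, the two unit language models, and all names below are assumptions, not the authors' implementation), each unlabelled utterance can be scored by how much more likely its unsupervised units are under a target-domain model than under a general one, with the top-scoring utterances sent for transcription:

    # Hypothetical sketch: score utterances (as sequences of unsupervised unit
    # IDs) by a length-normalised contrastive log-likelihood and pick the best.
    from typing import Callable, List, Sequence

    def contrastive_scores(
        utterances: Sequence[Sequence[int]],
        logp_target: Callable[[Sequence[int]], float],   # target-domain unit LM
        logp_general: Callable[[Sequence[int]], float],  # general unit LM
    ) -> List[float]:
        # Higher score = closer to the target domain relative to general speech.
        return [(logp_target(u) - logp_general(u)) / max(len(u), 1)
                for u in utterances]

    def select_for_labeling(utterances, logp_target, logp_general, budget: int):
        # Return indices of the `budget` highest-scoring utterances.
        scores = contrastive_scores(utterances, logp_target, logp_general)
        return sorted(range(len(utterances)),
                      key=lambda i: scores[i], reverse=True)[:budget]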
GL-Fusion: Global-Local Fusion Network for Multi-view Echocardiogram Video Segmentation
Cardiac structure segmentation from echocardiogram videos plays a crucial
role in diagnosing heart disease. The combination of multi-view echocardiogram
data is essential to enhance the accuracy and robustness of automated methods.
However, due to the visual disparity of the data, deriving cross-view context
information remains a challenging task, and unsophisticated fusion strategies
can even lower performance. In this study, we propose a novel Global-Local
Fusion (GL-Fusion) network that jointly exploits multi-view information globally
and locally to improve the accuracy of echocardiogram analysis. Specifically,
a Multi-view Global-based Fusion Module (MGFM) is proposed to extract global
context information and to explore the cyclic relationship of different
heartbeat cycles in an echocardiogram video. Additionally, a Multi-view
Local-based Fusion Module (MLFM) is designed to extract correlations of cardiac
structures from different views. Furthermore, we collect a multi-view
echocardiogram video dataset (MvEVD) to evaluate our method. Our method
achieves an 82.29% average Dice score, a 7.83% improvement
over the baseline method, and outperforms other existing state-of-the-art
methods. To our knowledge, this is the first exploration of a multi-view method
for echocardiogram video segmentation. Code is available at:
https://github.com/xmed-lab/GL-Fusion
Comment: Accepted by MICCAI 202
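As a minimal sketch of what global/local multi-view fusion could look like (the internals below are guesses based on standard cross-attention; the actual MGFM/MLFM designs are in the linked repository):

    # Hypothetical cross-view fusion block: tokens from one view attend into
    # another view's tokens, with a residual connection. A "global" variant
    # would operate on spatially pooled per-frame tokens (heartbeat-cycle
    # context); a "local" variant on per-position tokens so cardiac structures
    # can be matched across views.
    import torch
    import torch.nn as nn

    class CrossViewFusion(nn.Module):
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, query_tokens, other_view_tokens):
            fused, _ = self.attn(query_tokens, other_view_tokens, other_view_tokens)
            return self.norm(query_tokens + fused)

    a = torch.randn(2, 64, 128)           # view A: batch x tokens x dim
    b = torch.randn(2, 64, 128)           # view B
    a_fused = CrossViewFusion(128)(a, b)  # view A enriched with view-B context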
GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation
Echocardiogram video segmentation plays an important role in cardiac disease
diagnosis. This paper studies unsupervised domain adaptation (UDA) for
echocardiogram video segmentation, where the goal is to generalize the model
trained on the source domain to other unlabelled target domains. Existing UDA
segmentation methods are not suitable for this task because they do not model
local information and the cyclical consistency of the heartbeat. In this paper, we
introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for
cardiac structure segmentation. Our GraphEcho comprises two innovative modules,
the Spatial-wise Cross-domain Graph Matching (SCGM) module and the Temporal Cycle
Consistency (TCC) module, which exploit prior knowledge of echocardiogram
videos, i.e., the consistency of cardiac structures across patients and centers
and the cyclical consistency of the heartbeat, respectively. These two modules can better
align global and local features from source and target domains, improving UDA
segmentation results. Experimental results show that GraphEcho outperforms
existing state-of-the-art UDA segmentation methods. This work lays a new and
solid cornerstone for cardiac structure segmentation from echocardiogram
videos. Code and dataset are available at:
https://github.com/xmed-lab/GraphEcho
Comment: Accepted by ICCV 202
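A minimal sketch of the temporal cycle-consistency idea (assuming per-frame embeddings and a known heartbeat period in frames; GraphEcho's actual TCC module lives in the repository above):

    # Hypothetical penalty: per-frame features one heartbeat period apart
    # should agree, encoding the cyclical consistency of the heartbeat.
    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(frame_feats: torch.Tensor, period: int) -> torch.Tensor:
        # frame_feats: (T, D) per-frame embeddings; period: frames per cycle.
        if frame_feats.size(0) <= period:
            return frame_feats.new_zeros(())
        a, b = frame_feats[:-period], frame_feats[period:]
        # 1 - cosine similarity, averaged over all valid frame pairs.
        return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()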
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning
Recent years have witnessed significant advancements in self-supervised
learning (SSL) methods for speech-processing tasks. Various speech-based SSL
models have been developed and present promising performance on a range of
downstream tasks, including speech recognition. However, existing speech-based
SSL models share a common drawback of high computational cost, which might
hinder their potential applications and in-depth academic research. To address
this issue, we first analyze the computational cost of different modules during
HuBERT pre-training and then introduce a stack of efficiency optimizations,
which is named Fast-HuBERT in this paper. The proposed Fast-HuBERT can be
trained in 1.1 days with 8 V100 GPUs on the LibriSpeech 960h benchmark, without
performance degradation, resulting in a 5.2x speedup, compared to the original
implementation. Moreover, we explore two well-studied techniques in Fast-HuBERT
and demonstrate consistent improvements, in line with previous work.
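One representative optimization of this kind is replacing the waveform convolutional front end with precomputed filterbank features at a lower frame rate; the sketch below uses common Fbank defaults and is only an illustration, not Fast-HuBERT's exact configuration:

    # Illustrative front-end swap: compute 80-dim log-Mel (Fbank) features once,
    # instead of running a multi-layer waveform CNN every training step.
    import torch
    import torchaudio

    fbank = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80  # 10 ms frames
    )

    def frontend(waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> log-Mel features (batch, frames, 80).
        feats = fbank(waveform)          # (batch, 80, frames)
        feats = (feats + 1e-6).log()     # log compression
        return feats.transpose(1, 2)     # (batch, frames, 80)

    x = torch.randn(4, 16000)            # 4 one-second clips
    print(frontend(x).shape)             # torch.Size([4, 101, 80])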
Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition
Recent years have witnessed great strides in self-supervised learning (SSL)
in speech processing. An SSL model is normally pre-trained on a wide
variety of unlabelled data, and a large model size is preferred to increase the
modeling capacity. However, this might limit its potential applications due to
the expensive computation and memory costs introduced by the oversized model.
Miniaturization for SSL models has become an important research direction of
practical value. To this end, we explore the effective distillation of
HuBERT-based SSL models for automatic speech recognition (ASR). First, in order
to establish a strong baseline, a comprehensive study on different student
model structures is conducted. On top of this, as a supplement to the
regression loss widely adopted in previous works, a discriminative loss is
introduced for HuBERT to enhance the distillation performance, especially in
low-resource scenarios. In addition, we design a simple and effective algorithm
to distill the front-end input from waveform to Fbank features, resulting in a
17% reduction in parameters and a doubling of inference speed, with only
marginal performance degradation.
Comment: Submitted to ICASSP 202
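A minimal sketch of combining the two objectives (the weighting, tensor shapes, and unit-prediction head below are assumptions; the teacher's discrete units would come from its HuBERT clustering):

    # Hypothetical distillation loss: regression on teacher hidden states plus
    # a discriminative term that predicts the teacher's discrete units.
    import torch
    import torch.nn.functional as F

    def distill_loss(student_hidden, teacher_hidden, unit_logits, teacher_units,
                     alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
        # student_hidden / teacher_hidden: (B, T, D); unit_logits: (B, T, K);
        # teacher_units: (B, T) discrete cluster IDs from the teacher.
        reg = F.l1_loss(student_hidden, teacher_hidden)   # regression term
        disc = F.cross_entropy(                           # discriminative term
            unit_logits.flatten(0, 1), teacher_units.flatten()
        )
        return alpha * reg + beta * disc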
Dual adaptive training of photonic neural networks
A photonic neural network (PNN) is a remarkable analog artificial intelligence
(AI) accelerator that computes with photons instead of electrons, featuring low
latency, high energy efficiency, and high parallelism. However, the existing
training approaches cannot address the extensive accumulation of systematic
errors in large-scale PNNs, resulting in a significant decrease in model
performance in physical systems. Here, we propose dual adaptive training (DAT),
which allows the PNN model to adapt to substantial systematic errors and
preserve its performance during deployment. By introducing systematic error
prediction networks with task-similarity joint optimization, DAT achieves a
high-similarity mapping between the PNN numerical models and the physical
systems, together with highly accurate gradient calculations during
dual-backpropagation training. We validated the effectiveness of DAT using diffractive PNNs and
interference-based PNNs on image classification tasks. DAT successfully trained
large-scale PNNs under major systematic errors and preserved the model
classification accuracies comparable to error-free systems. The results further
demonstrated its superior performance over the state-of-the-art in situ
training approaches. DAT provides critical support for constructing large-scale
PNNs to achieve advanced architectures and can be generalized to other types of
AI systems with analog computing errors.
Comment: 31 pages, 11 figure
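A toy sketch of the dual-training loop (the network shapes, optimizers, and alternation schedule are stand-ins; the paper's actual DAT formulation is more involved): first fit an error-prediction network so it maps the numerical model's output to the measured physical output, then backpropagate the task loss through the composed model to update the PNN weights:

    # Hypothetical dual-training step: error_net learns the gap between the
    # numerical model and the physical system; the model then trains through it.
    import torch
    import torch.nn as nn

    numeric_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
    error_net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 10))
    opt_model = torch.optim.Adam(numeric_model.parameters(), lr=1e-3)
    opt_error = torch.optim.Adam(error_net.parameters(), lr=1e-3)

    def train_step(x, y, physical_forward):
        # Step 1: fit error_net so error_net(numeric output) matches the
        # measured physical output (similarity mapping).
        with torch.no_grad():
            numeric_out = numeric_model(x)
            measured = physical_forward(x)   # reading from the real system
        sim_loss = nn.functional.mse_loss(error_net(numeric_out), measured)
        opt_error.zero_grad(); sim_loss.backward(); opt_error.step()

        # Step 2: backpropagate the task loss through error_net to adapt the
        # model weights to the systematic errors (error_net is not updated here).
        task_loss = nn.functional.cross_entropy(error_net(numeric_model(x)), y)
        opt_model.zero_grad(); task_loss.backward(); opt_model.step()
        return task_loss.item()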
Modeling trajectories with recurrent neural networks