Graph Relation Distillation for Efficient Biomedical Instance Segmentation
Instance-aware embeddings predicted by deep neural networks have
revolutionized biomedical instance segmentation, but the resource requirements
of these networks are substantial. Knowledge distillation offers a solution by transferring
are substantial. Knowledge distillation offers a solution by transferring
distilled knowledge from heavy teacher networks to lightweight yet
high-performance student networks. However, existing knowledge distillation
methods struggle to extract knowledge for distinguishing instances and overlook
global relation information. To address these challenges, we propose a graph
relation distillation approach for efficient biomedical instance segmentation,
which considers three essential types of knowledge: instance-level features,
instance relations, and pixel-level boundaries. We introduce two graph
distillation schemes deployed at both the intra-image level and the inter-image
level: instance graph distillation (IGD) and affinity graph distillation (AGD).
IGD constructs a graph representing instance features and relations,
transferring these two types of knowledge by enforcing instance graph
consistency. AGD constructs an affinity graph representing pixel relations to
capture structured knowledge of instance boundaries, transferring
boundary-related knowledge by ensuring pixel affinity consistency. Experimental
results on a number of biomedical datasets validate the effectiveness of our
approach, enabling student models with only a fraction of the parameters and
inference time of their teacher models while achieving promising performance.
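The graph-consistency idea behind both distillation schemes can be sketched in a few lines: build a pairwise-similarity graph over teacher embeddings and over student embeddings, then penalize the difference between the two graphs. This is a minimal NumPy illustration under assumed choices (cosine similarity as the edge weight, mean-squared error as the consistency loss); the function names are hypothetical and the paper's exact graph construction may differ.

```python
import numpy as np

def cosine_graph(feats):
    """Pairwise cosine-similarity graph over a set of embeddings.

    feats: (n, d) array, one row per instance (or per pixel for an
    affinity graph). Returns an (n, n) similarity matrix."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    normed = feats / np.clip(norms, 1e-8, None)
    return normed @ normed.T

def graph_distillation_loss(teacher_feats, student_feats):
    """Mean-squared difference between teacher and student graphs,
    i.e. a graph-consistency penalty: the student is pushed to
    reproduce the teacher's pairwise relations, not just its features."""
    g_teacher = cosine_graph(teacher_feats)
    g_student = cosine_graph(student_feats)
    return float(np.mean((g_teacher - g_student) ** 2))
```

Applied to instance embeddings this corresponds to an instance-graph term; applied to pixel features it corresponds to an affinity-graph term. The loss is zero exactly when the student's relational structure matches the teacher's.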
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with
the aid of extra visual information such as lip videos, and has been shown to be more
effective than audio-only speech enhancement. This paper proposes the
incorporation of ultrasound tongue images to improve the performance of
lip-based AV-SE systems further. To address the challenge of acquiring
ultrasound tongue images during inference, we first propose to employ knowledge
distillation during training to investigate the feasibility of leveraging
tongue-related information without directly inputting ultrasound tongue images.
Specifically, we guide an audio-lip speech enhancement student model to learn
from a pre-trained audio-lip-tongue speech enhancement teacher model, thus
transferring tongue-related knowledge. To better model the alignment between
the lip and tongue modalities, we further propose the introduction of a
lip-tongue key-value memory network into the AV-SE model. This network enables
the retrieval of tongue features based on readily available lip features,
thereby assisting the subsequent speech enhancement task. Experimental results
demonstrate that both methods significantly improve the quality and
intelligibility of the enhanced speech compared to traditional lip-based AV-SE
baselines. Moreover, both proposed methods exhibit strong generalization
performance on unseen speakers and in the presence of unseen noises.
Furthermore, phone error rate (PER) analysis of automatic speech recognition
(ASR) reveals that while all phonemes benefit from introducing ultrasound
tongue images, palatal and velar consonants benefit most.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language
Processing. arXiv admin note: text overlap with arXiv:2305.1493
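The key-value memory mechanism described above can be illustrated as a soft attention lookup: a lip feature at inference time queries stored lip keys, and the matching weights are used to retrieve a weighted combination of the paired tongue values. The following NumPy sketch is a generic attention-based retrieval under assumed shapes and names; it is not the paper's exact network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieve_tongue_features(lip_query, lip_keys, tongue_values, temperature=1.0):
    """Attention-style lookup in a lip-tongue key-value memory.

    lip_query:     (d,)   lip feature available at inference time.
    lip_keys:      (m, d) stored lip features (memory keys).
    tongue_values: (m, k) tongue features paired with each key.
    Returns a (k,) pseudo-tongue feature: an attention-weighted sum
    of the stored tongue values, usable when no ultrasound input exists."""
    scores = lip_keys @ lip_query / temperature
    weights = softmax(scores)
    return weights @ tongue_values
```

The retrieved pseudo-tongue feature can then be fed to the enhancement model alongside the lip features, which is how readily available lip information can stand in for the missing ultrasound modality at inference time.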
ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse
The rapid expansion of foundation pre-trained models and their fine-tuned
counterparts has significantly contributed to the advancement of machine
learning. Leveraging pre-trained models to extract knowledge and expedite
learning in real-world tasks, known as "Model Reuse", has become crucial in
various applications. Previous research has focused on reusing models in
particular aspects, such as reusing model weights, structures, or hypothesis
spaces. This paper introduces ZhiJian, a comprehensive and user-friendly
toolbox for model reuse, utilizing the PyTorch backend. ZhiJian presents a
novel paradigm that unifies diverse perspectives on model reuse, encompassing
target architecture construction with PTM, tuning target model with PTM, and
PTM-based inference. This empowers deep learning practitioners to explore
downstream tasks and identify the complementary advantages among different
methods. ZhiJian is readily accessible at
https://github.com/zhangyikaii/lamda-zhijian, facilitating seamless utilization
of pre-trained models and streamlining the model reuse process for researchers
and developers.
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with
the aid of extra visual information such as lip videos, and has been shown to be more
effective than audio-only speech enhancement. This paper proposes further
incorporating ultrasound tongue images to improve lip-based AV-SE systems'
performance. Knowledge distillation is employed at the training stage to
address the challenge of acquiring ultrasound tongue images during inference,
enabling an audio-lip speech enhancement student model to learn from a
pre-trained audio-lip-tongue speech enhancement teacher model. Experimental
results demonstrate significant improvements in the quality and intelligibility
of the speech enhanced by the proposed method compared to the traditional
audio-lip speech enhancement baselines. Further analysis using phone error
rates (PER) of automatic speech recognition (ASR) shows that palatal and velar
consonants benefit most from the introduction of ultrasound tongue images.
Comment: To be published in InterSpeech 202