Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
Large-scale pre-trained language models (PLMs) have shown great potential in
natural language processing tasks. Leveraging the capabilities of PLMs to
enhance automatic speech recognition (ASR) systems has also emerged as a
promising research direction. However, previous works may be limited by the
inflexible structures of PLMs and by their insufficient utilization. To
alleviate these problems, we propose the hierarchical knowledge distillation
(HKD) on the continuous integrate-and-fire (CIF) based ASR models. To transfer
knowledge from PLMs to the ASR models, HKD employs cross-modal knowledge
distillation with contrastive loss at the acoustic level and knowledge
distillation with regression loss at the linguistic level. Compared with the
original CIF-based model, our method achieves 15% and 9% relative error rate
reduction on the AISHELL-1 and LibriSpeech datasets, respectively.
Comment: Accepted by INTERSPEECH 202
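To make the two distillation levels concrete, the following is a minimal PyTorch sketch of how such losses are often written: an InfoNCE-style contrastive loss aligning CIF acoustic embeddings with PLM token embeddings, and an MSE regression loss on decoder hidden states. All function and tensor names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(acoustic_emb, plm_emb, temperature=0.1):
    """Acoustic-level cross-modal KD (assumed InfoNCE form): pull each
    token-level acoustic embedding toward the PLM embedding of the same
    token and push it away from the others."""
    a = F.normalize(acoustic_emb, dim=-1)   # (T, D) CIF token embeddings
    t = F.normalize(plm_emb, dim=-1)        # (T, D) PLM token embeddings
    logits = a @ t.T / temperature          # (T, T) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def regression_kd_loss(decoder_hidden, plm_hidden):
    """Linguistic-level KD: regress decoder hidden states onto the
    corresponding PLM hidden states (assumed MSE form)."""
    return F.mse_loss(decoder_hidden, plm_hidden)

# Hypothetical usage: add both terms to the usual ASR loss with weights
# alpha and beta chosen on a development set.
# total_loss = asr_loss + alpha * contrastive_kd_loss(a_emb, p_emb) \
#                       + beta * regression_kd_loss(dec_h, plm_h)
```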
DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog
Visual Dialog is a vision-language task that requires an AI agent to engage
in a conversation with humans grounded in an image. It remains a challenging
task since it requires the agent to fully understand a given question before
making an appropriate response not only from the textual dialog history, but
also from the visually-grounded information. However, previous models typically
leverage single-hop or single-channel reasoning to deal with this complex
multimodal reasoning task, which is intuitively insufficient. In this
paper, we thus propose a novel and more powerful Dual-channel Multi-hop
Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures
information from the dialog history and the image to enrich the semantic
representation of the question by exploiting dual-channel reasoning.
Specifically, DMRM maintains a dual channel to obtain the question- and
history-aware image features and the question- and image-aware dialog history
features by a multi-hop reasoning process in each channel. Additionally, we
design an effective multimodal attention mechanism that further enhances the
decoder to generate more accurate responses. Experimental results on the VisDial v0.9 and
v1.0 datasets demonstrate that the proposed model is effective and outperforms
the compared models by a significant margin.
Comment: Accepted at AAAI 202
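As a rough illustration of the multi-hop reasoning sketched above, the PyTorch snippet below implements one attention channel that repeatedly attends over a set of features (image regions in the image channel, sentence encodings in the history channel) and refines the question vector at each hop; a second, symmetric channel would complete the dual-channel design. Module names and shapes are assumptions for exposition, not the DMRM reference code.

```python
import torch
import torch.nn as nn

class AttentionHop(nn.Module):
    """One reasoning hop: attend over a feature set conditioned on the
    current query vector, then fold the attended context back into it."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, query, feats):
        # query: (B, D), feats: (B, N, D)
        scores = self.score(torch.tanh(feats + query.unsqueeze(1)))  # (B, N, 1)
        attn = torch.softmax(scores, dim=1)
        context = (attn * feats).sum(dim=1)                          # (B, D)
        return self.proj(torch.cat([query, context], dim=-1))

class MultiHopChannel(nn.Module):
    """Stack of hops forming one channel (e.g., the image channel)."""
    def __init__(self, dim, hops=3):
        super().__init__()
        self.hops = nn.ModuleList([AttentionHop(dim) for _ in range(hops)])

    def forward(self, question_vec, feats):
        q = question_vec
        for hop in self.hops:
            q = hop(q, feats)
        return q  # question representation enriched by this channel
```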
Jointly Modeling Heterogeneous Student Behaviors and Interactions Among Multiple Prediction Tasks
Prediction tasks about students have practical significance for both students
and colleges. Making multiple predictions about students is an important part of
a smart campus. For instance, predicting whether a student will fail to
graduate can alert the student affairs office to take preventive measures to
help the student improve his/her academic performance. With the development of
information technology in colleges, we can collect digital footprints which
encode heterogeneous behaviors continuously. In this paper, we focus on
modeling heterogeneous behaviors and making multiple predictions together,
since some prediction tasks are related and learning the model for a specific
task may suffer from data sparsity. To this end, we propose a variant of
LSTM and a soft-attention mechanism. The proposed LSTM is able to learn the
student profile-aware representation from heterogeneous behavior sequences. The
proposed soft-attention mechanism can dynamically learn different importance
degrees of different days for every student. In this way, heterogeneous
behaviors can be well modeled. In order to model interactions among multiple
prediction tasks, we propose a co-attention mechanism based unit. With the help
of the stacked units, we can explicitly control the knowledge transfer among
multiple tasks. We design three motivating behavior prediction tasks based on a
real-world dataset collected from a college. Qualitative and quantitative
experiments on the three prediction tasks have demonstrated the effectiveness
of our model.
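The day-level soft attention described above can be pictured with the short sketch below; for brevity it substitutes a plain nn.LSTM for the proposed profile-aware LSTM variant and omits the co-attention units that share knowledge across tasks. All names are hypothetical.

```python
import torch
import torch.nn as nn

class DayAttentionEncoder(nn.Module):
    """Encode a student's daily behavior sequence and weight each day by a
    learned importance score (a plain LSTM stands in for the profile-aware
    variant proposed in the paper)."""
    def __init__(self, behavior_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(behavior_dim, hidden_dim, batch_first=True)
        self.day_score = nn.Linear(hidden_dim, 1)

    def forward(self, daily_behaviors):
        # daily_behaviors: (B, num_days, behavior_dim)
        h, _ = self.lstm(daily_behaviors)                 # (B, num_days, H)
        weights = torch.softmax(self.day_score(h), dim=1) # per-day importance
        return (weights * h).sum(dim=1)                   # (B, H) student vector

# In a multi-task setup, one such shared representation would feed several
# task-specific prediction heads, with co-attention-style units controlling
# how much information flows between tasks.
```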
VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition
Enhancing automatic speech recognition (ASR) performance by leveraging
additional multimodal information has shown promising results in previous
studies. However, most of these works have primarily focused on utilizing
visual cues derived from human lip motions. In fact, context-dependent visual
and linguistic cues can also benefit ASR in many scenarios. In this paper, we first
propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel
multimodal ASR model based on the continuous integrate-and-fire (CIF)
mechanism, which can integrate visual and textual context simultaneously or
separately, to facilitate speech recognition. Next, we introduce an effective
training strategy that improves performance in modal-incomplete test scenarios.
Then, to explore the effects of integrating vision and language, we create
VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese
and English versions. Finally, empirical results are reported on the public
Flickr8K and self-constructed VSDial datasets. We explore various cross-modal
fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and
provide insights into the effects of integrating multimodal information on
speech recognition.
Comment: Accepted to ICASSP 202
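To give a concrete picture of context integration, the sketch below fuses token-level acoustic embeddings (such as those a CIF module emits) with optional visual and textual context via cross-attention, so either context can be supplied or dropped at test time. It illustrates the general fusion pattern under assumed names and shapes, not the actual ViLaS architecture.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Cross-attend acoustic token embeddings to visual and/or textual
    context; either context may be absent (modal-incomplete scenarios)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, acoustic_tokens, visual_ctx=None, text_ctx=None):
        # acoustic_tokens: (B, T, D); visual_ctx: (B, Nv, D); text_ctx: (B, Nt, D)
        fused = acoustic_tokens
        if visual_ctx is not None:
            v, _ = self.vis_attn(fused, visual_ctx, visual_ctx)
            fused = fused + v            # residual fusion of visual context
        if text_ctx is not None:
            t, _ = self.txt_attn(fused, text_ctx, text_ctx)
            fused = fused + t            # residual fusion of textual context
        return fused                     # passed on to the ASR decoder
```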