38 research outputs found
Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search
Cascading multiple pre-trained models is an effective way to compose an
end-to-end system. However, fine-tuning the full cascaded model is parameter-
and memory-inefficient, and our observations reveal that applying adapter
modules alone to the cascaded model cannot match the performance of full
fine-tuning. We propose an automatic and effective adaptive learning method to
optimize end-to-end cascaded multi-task models based on the Neural Architecture
Search (NAS) framework. The candidate adaptive operations on each specific
module are freezing the module, inserting an adapter, and fine-tuning. We
further add a penalty term to the loss that takes the number of trainable
parameters into account, constraining the learned structure. The penalty term
successfully restricts the searched architecture, and the proposed approach is
able to find tuning schemes similar to hand-crafted ones, compressing the
trainable parameters to 8.7% of full fine-tuning on SLURP with even better
performance.
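The search over {freeze, adapter, fine-tune} per module can be relaxed into a differentiable choice, with the parameter-count penalty computed as an expectation over softmax-relaxed architecture weights. A minimal NumPy sketch under that assumption, with hypothetical per-operation parameter costs (not the paper's numbers):

```python
import numpy as np

# Hypothetical trainable-parameter cost of each candidate operation
# from the abstract: freeze a module, insert a small adapter, or
# fine-tune the whole module.
OP_PARAM_COST = {"freeze": 0, "adapter": 50_000, "finetune": 5_000_000}
OPS = list(OP_PARAM_COST)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expected_param_penalty(arch_logits, lam=1e-8):
    """Expected trainable-parameter count under the softmax-relaxed
    architecture weights (one logit vector per cascaded module),
    scaled by lam and added to the task loss during the search."""
    costs = np.array([OP_PARAM_COST[op] for op in OPS], dtype=float)
    return lam * sum(softmax(logits) @ costs for logits in arch_logits)

# Two cascaded modules: the first undecided, the second leaning "adapter".
penalty = expected_param_penalty([np.zeros(3), np.array([-5.0, 5.0, -5.0])])
```

Because the penalty grows with the expected number of trainable parameters, gradient descent on the architecture weights is pushed toward cheaper operations unless fine-tuning clearly pays off.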
Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking
Continual learning is crucial for dialog state tracking (DST) in dialog
systems, since requirements from users for new functionalities are frequently
encountered. However, most existing continual learning methods for DST
require task identities during testing, which is a severe limitation in
real-world applications. In this paper, we aim to address continual learning of
DST in the class-incremental scenario (i.e., the task identity is unknown
during testing). Inspired by the recently emerging prompt tuning methods that
perform well on dialog systems, we propose to use a prompt pool, in which we
maintain a pool of key-value paired prompts and select prompts from the pool
according to the distance between the dialog history and the prompt keys. The
proposed method can automatically identify tasks and select appropriate prompts
during testing. We conduct experiments on the Schema-Guided Dialog (SGD)
dataset and another dataset collected from a real-world dialog application.
Experimental results show that the prompt pool method achieves much higher
joint goal accuracy than the baseline. Combining the method with a rehearsal
buffer further improves performance.
Fine-grained Recognition with Learnable Semantic Data Augmentation
Fine-grained image recognition is a longstanding computer vision challenge
that focuses on differentiating objects belonging to multiple subordinate
categories within the same meta-category. Since images belonging to the same
meta-category usually share similar visual appearances, mining discriminative
visual cues is the key to distinguishing fine-grained categories. Although
commonly used image-level data augmentation techniques have achieved great
success in generic image classification problems, they are rarely applied in
fine-grained scenarios, because their random region-editing behavior risks
destroying the discriminative visual cues residing in subtle regions. In
this paper, we propose diversifying the training data at the feature-level to
alleviate the discriminative region loss problem. Specifically, we produce
diversified augmented samples by translating image features along semantically
meaningful directions. The semantic directions are estimated with a covariance
prediction network, which predicts a sample-wise covariance matrix to adapt to
the large intra-class variation inherent in fine-grained images. Furthermore,
the covariance prediction network is jointly optimized with the classification
network in a meta-learning manner to alleviate the degenerate solution problem.
Experiments on four competitive fine-grained recognition benchmarks
(CUB-200-2011, Stanford Cars, FGVC-Aircraft, NABirds) demonstrate that our
method significantly improves the generalization performance of several popular
classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and
ViT). Combined with a recently proposed method, our semantic data augmentation
approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The
source code will be released.
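The feature-level augmentation amounts to translating each feature vector along directions sampled from a zero-mean Gaussian with a per-sample covariance. A simplified NumPy sketch that assumes a diagonal covariance (the paper predicts a full sample-wise covariance matrix with a dedicated network, trained jointly via meta-learning):

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_augment(features, cov_diag, n_aug=4):
    """Produce n_aug augmented copies of each feature vector by adding
    Gaussian noise with a per-sample (diagonal) covariance, i.e. by
    translating features along sampled semantic directions."""
    out = []
    for feat, var in zip(features, cov_diag):
        noise = rng.normal(0.0, np.sqrt(var), size=(n_aug, feat.shape[0]))
        out.append(feat + noise)        # augmented copies of this sample
    return np.concatenate(out, axis=0)

feats = np.ones((2, 8))                 # two toy 8-d feature vectors
cov = np.full((2, 8), 0.01)             # small, sample-wise variances
aug = semantic_augment(feats, cov)      # 2 samples x 4 copies = 8 rows
```

Keeping the perturbation in feature space, with variance adapted per sample, is what lets the augmentation diversify training data without editing (and possibly destroying) the discriminative image regions.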
VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting
The performance of audio-only keyword spotting (KWS) systems, commonly
measured in false alarms and false rejects, degrades significantly under
far-field and noisy conditions. Therefore, audio-visual keyword spotting,
which leverages complementary relationships across modalities, has recently
gained much attention. However, current studies mainly focus on combining
independently learned representations of the different modalities, rather than
exploiting cross-modal relationships while each modality is being modeled.
In this paper, we propose a novel visual modality enhanced end-to-end KWS
framework (VE-KWS), which fuses audio and visual modalities from two aspects.
The first uses the speaker location information obtained from the lip region
in videos to assist the training of a multi-channel audio beamformer. With the
beamformer serving as an audio enhancement module, acoustic distortions caused
by far-field or noisy environments can be significantly suppressed. The second
conducts cross-attention between
different modalities to capture the inter-modal relationships and help the
representation learning of each modality. Experiments on the MISP challenge
corpus show that our proposed model achieves a 2.79% false rejection rate and
a 2.95% false alarm rate on the Eval set, establishing new SOTA performance
compared with the top-ranking systems in the ICASSP 2022 MISP challenge.
Comment: 5 pages. Accepted at ICASSP202
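The second fusion mechanism, cross-attention between modalities, can be sketched as single-head attention where one modality supplies the queries and the other supplies the keys and values. A minimal NumPy sketch with made-up frame counts and dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """One modality queries the other, so each stream absorbs
    inter-modal context during its own representation learning."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # scaled dot-product
    return softmax(scores) @ keys_values

audio = np.random.default_rng(1).normal(size=(10, 16))  # audio frames
video = np.random.default_rng(2).normal(size=(6, 16))   # lip-region frames
audio_enh = cross_attention(audio, video)  # audio attends to video
video_enh = cross_attention(video, audio)  # video attends to audio
```

Running the attention in both directions, as above, is what distinguishes this design from simply concatenating two independently learned representations.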
Learning to Check Contract Inconsistencies
Contract consistency is important in ensuring the legal validity of the
contract. In many scenarios, a contract is written by filling the blanks in a
precompiled form. Due to carelessness, two blanks that should be filled with
the same (or different) content may be incorrectly filled with different (or
the same) content. This results in contract inconsistencies, which may
severely impair the legal validity of the contract. Traditional methods to
address this issue mainly rely on manual contract review, which is
labor-intensive and costly. In this work, we formulate a novel Contract
Inconsistency Checking (CIC) problem, and design an end-to-end framework,
called Pair-wise Blank Resolution (PBR), to solve the CIC problem with high
accuracy. Our PBR model contains a novel BlankCoder to address the challenge of
modeling meaningless blanks. BlankCoder adopts a two-stage attention mechanism
that adequately associates a meaningless blank with its relevant descriptions
while avoiding the incorporation of irrelevant context words. Experiments
conducted on real-world datasets show the promising performance of our method,
with a balanced accuracy of 94.05% and an F1 score of 90.90% on the CIC
problem.
Comment: Accepted by AAAI 202
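The two-stage attention idea behind BlankCoder, first scoring description words for relevance to the blank's surrounding context and then attending only over the top-scoring words, might be sketched as follows. The function name, dimensions, and hard top-k cutoff are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_blank(context_emb, word_embs, keep=3):
    """Stage 1: score each candidate description word against the
    blank's local context. Stage 2: attend only over the top-scoring
    words, so irrelevant context words never enter the blank's
    representation."""
    scores = word_embs @ context_emb          # stage 1: relevance scores
    keep_idx = np.argsort(-scores)[:keep]     # discard irrelevant words
    weights = softmax(scores[keep_idx])       # stage 2: attention weights
    return weights @ word_embs[keep_idx]      # blank representation

rng = np.random.default_rng(0)
words = rng.normal(size=(12, 16))   # description word embeddings
ctx = rng.normal(size=16)           # embedding of the blank's context
blank_vec = encode_blank(ctx, words)
```

Representations built this way for a pair of blanks could then be compared to decide whether the pair was filled consistently.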
Dynamic Perceiver for Efficient Visual Recognition
Early exiting has become a promising approach to improving the inference
efficiency of deep networks. By structuring models with multiple classifiers
(exits), predictions for ``easy'' samples can be generated at earlier exits,
negating the need for executing deeper layers. Current multi-exit networks
typically implement linear classifiers at intermediate layers, compelling
low-level features to encapsulate high-level semantics. This sub-optimal design
invariably undermines the performance of later exits. In this paper, we propose
Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure
and the early classification task with a novel dual-branch architecture. A
feature branch serves to extract image features, while a classification branch
processes a latent code assigned for classification tasks. Bi-directional
cross-attention layers are established to progressively fuse the information of
both branches. Early exits are placed exclusively within the classification
branch, thus eliminating the need for linear separability in low-level
features. Dyn-Perceiver constitutes a versatile and adaptable framework that
can be built upon various architectures. Experiments on image classification,
action recognition, and object detection demonstrate that our method
significantly improves the inference efficiency of different backbones,
outperforming numerous competitive approaches across a broad range of
computational budgets. Evaluations on both CPU and GPU platforms substantiate
the superior practical efficiency of Dyn-Perceiver. Code is available at
https://www.github.com/LeapLabTHU/Dynamic_Perceiver.
Comment: Accepted at ICCV 202
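The early-exit control flow in the classification branch can be sketched as a confidence-thresholded loop over stages. A toy NumPy version with hypothetical stages and linear exit classifiers (stand-ins for the latent-code updates and exits of the real architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_predict(x, stages, classifiers, threshold=0.9):
    """Run stages in order; after each stage, its exit classifier may
    terminate inference early when prediction confidence clears the
    threshold, so "easy" samples skip the deeper stages."""
    for stage, clf in zip(stages, classifiers):
        x = stage(x)                     # update the latent code
        probs = softmax(clf @ x)         # this exit's class distribution
        if probs.max() >= threshold:
            return probs.argmax(), True  # exited early
    return probs.argmax(), False         # fell through to the last exit

rng = np.random.default_rng(0)
stages = [lambda v: np.tanh(v) for _ in range(3)]      # toy stages
classifiers = [rng.normal(size=(5, 8)) for _ in range(3)]
label, exited = early_exit_predict(rng.normal(size=8), stages, classifiers)
```

Placing every exit on the classification branch's latent code, rather than on intermediate image features, is what removes the linear-separability burden the abstract criticizes.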
Deep Learning for Medication Recommendation: A Systematic Survey
Making medication prescriptions in response to a patient's diagnosis is a challenging task. The number of pharmaceutical companies, their inventories of medicines, and the recommended dosages confront a doctor with the well-known problem of information and cognitive overload. To assist medical practitioners in making informed decisions about prescriptions, researchers have exploited electronic health records (EHRs) to automatically recommend medication. In recent years, medication recommendation using EHRs has been a salient research direction, attracting researchers to apply various deep learning (DL) models to patients' EHRs. Yet, in the absence of a holistic survey article, studying these publications to understand the current state of research and to identify the best-performing models, trends, and challenges takes considerable effort and time. To fill this research gap, this survey reports on state-of-the-art DL-based medication recommendation methods. It reviews the classification of DL-based medication recommendation (MR) models, compares their performance, and discusses the unavoidable issues they face. It also reports on the most common datasets and metrics used in evaluating MR models. The findings of this study have implications for researchers interested in MR models.