171 research outputs found
Efficient Multimodal Fusion via Interactive Prompting
Large-scale pre-training has brought unimodal fields such as computer vision
and natural language processing to a new era. Following this trend, the size of
multi-modal learning models constantly increases, leading to an urgent need to
reduce the massive computational cost of finetuning these models for downstream
tasks. In this paper, we propose an efficient and flexible multimodal fusion
method, namely PMF, tailored for fusing unimodally pre-trained transformers.
Specifically, we first present a modular multimodal fusion framework that
exhibits high flexibility and facilitates mutual interactions among different
modalities. In addition, we disentangle vanilla prompts into three types in
order to learn different optimizing objectives for multimodal learning. It is
also worth noting that we propose to add prompt vectors only on the deep layers
of the unimodal transformers, thus significantly reducing the training memory
usage. Experimental results show that our proposed method achieves performance
comparable to several other multimodal finetuning methods with less than 3% of
the trainable parameters and up to 66% savings in training memory usage.
Comment: Camera-ready version for CVPR 2023
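As a rough illustration of the deep-layer prompting idea, the PyTorch sketch
below prepends learnable prompt vectors only at the deeper layers of a frozen
unimodal encoder; the class and argument names (DeepLayerPromptFusion,
fusion_from) are illustrative assumptions, not PMF's actual API.

```python
import torch
import torch.nn as nn

class DeepLayerPromptFusion(nn.Module):
    """Hypothetical sketch: prompts live only on deep layers, so shallow
    layers need no prompt gradients and activation memory drops."""
    def __init__(self, dim=768, n_prompts=4, n_layers=12, fusion_from=8):
        super().__init__()
        self.fusion_from = fusion_from          # first layer that gets prompts
        deep = n_layers - fusion_from
        # one bank of prompt vectors per deep layer and per modality
        self.vis_prompts = nn.Parameter(torch.randn(deep, n_prompts, dim) * 0.02)
        self.txt_prompts = nn.Parameter(torch.randn(deep, n_prompts, dim) * 0.02)

    def inject(self, tokens, layer_idx, modality):
        """Prepend this layer's prompts to the token sequence, if deep enough."""
        if layer_idx < self.fusion_from:
            return tokens                       # shallow layers run unmodified
        bank = self.vis_prompts if modality == "vision" else self.txt_prompts
        p = bank[layer_idx - self.fusion_from].expand(tokens.size(0), -1, -1)
        return torch.cat([p, tokens], dim=1)

# toy usage with ViT-sized token sequences
fusion = DeepLayerPromptFusion()
x = torch.randn(2, 197, 768)
print(fusion.inject(x, layer_idx=10, modality="vision").shape)  # (2, 201, 768)
```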
Exploiting Prompt Caption for Video Grounding
Video grounding aims to locate a moment of interest matching the given query
sentence from an untrimmed video. Previous works ignore the sparsity dilemma
in video annotations, where sparse labels fail to provide context information
between potential events and query sentences in the dataset. In this paper, we
contend that exploiting easily available captions that describe general
actions, i.e., prompt captions (PC) as defined in our paper, will significantly
boost the performance. To this end, we propose a Prompt Caption Network (PCNet)
for video grounding. Specifically, we first introduce dense video captioning to
generate dense captions and then obtain prompt captions by Non-Prompt Caption
Suppression (NPCS). To capture the potential information in prompt captions, we
propose Caption Guided Attention (CGA) to project the semantic relations between
prompt captions and query sentences into temporal space and fuse them into
visual representations. Considering the gap between prompt captions and ground
truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) for
constructing more negative pairs to maximize cross-modal mutual information.
Without bells and whistles, extensive experiments on three public datasets
(i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our
method significantly outperforms state-of-the-art methods.
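To give a flavor of the ACCL objective, here is a hedged InfoNCE-style sketch
in which prompt-caption features enlarge the negative pool on the
visual-to-text side only; accl_loss and its tensor layout are assumptions, not
the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def accl_loss(vis, qry, pc, tau=0.07):
    """Assumed sketch of an asymmetric cross-modal contrastive loss:
    vis (B, D) moment features, qry (B, D) matched query features,
    pc (M, D) prompt-caption features used as extra negatives."""
    vis, qry, pc = (F.normalize(t, dim=-1) for t in (vis, qry, pc))
    logits_v2t = vis @ torch.cat([qry, pc]).T / tau  # prompt captions add negatives
    logits_t2v = qry @ vis.T / tau                   # the other side stays plain
    target = torch.arange(vis.size(0))               # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits_v2t, target) +
                  F.cross_entropy(logits_t2v, target))

v, q, p = torch.randn(8, 256), torch.randn(8, 256), torch.randn(32, 256)
print(accl_loss(v, q, p).item())
```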
Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions
The study of causal relationships between emotions and causes in texts has
recently received much attention. Most works focus on extracting causally
related clauses from documents. However, none of these works has considered
that the causal relationships among the extracted emotion and cause clauses can
only be valid under some specific context clauses. To highlight the context in
such special causal relationships, we propose a new task to determine whether
or not an input pair of emotion and cause has a valid causal relationship under
different contexts and extract the specific context clauses that participate in
the causal relationship. Since the task is new and no existing dataset is
available, we manually annotate a benchmark dataset to obtain the labels for
our tasks, along with annotations of each context clause's type, which can
also be used in other applications. We adopt negative sampling to
construct the final dataset to balance the number of documents with and without
causal relationships. Based on the constructed dataset, we propose an
end-to-end multi-task framework, where we design two novel and general modules
to handle the two goals of our task. Specifically, we propose a context masking
module to extract the context clauses participating in the causal
relationships. We propose a prediction aggregation module to fine-tune the
prediction results according to whether the input emotion and causes depend on
specific context clauses. Results of extensive comparative experiments and
ablation studies demonstrate the effectiveness and generality of our proposed
framework.
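A minimal sketch of what a context masking module could look like, assuming
clause, emotion, and cause representations are already encoded; the module and
tensor names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextMasking(nn.Module):
    """Hypothetical sketch: score each context clause for participation in
    the emotion-cause relation, then gate clause features with the mask."""
    def __init__(self, dim=768):
        super().__init__()
        self.scorer = nn.Linear(3 * dim, 1)  # [clause; emotion; cause] -> logit

    def forward(self, clauses, emotion, cause):
        # clauses: (B, L, D); emotion, cause: (B, D)
        pair = torch.cat([emotion, cause], dim=-1).unsqueeze(1)       # (B, 1, 2D)
        pair = pair.expand(-1, clauses.size(1), -1)                   # (B, L, 2D)
        logits = self.scorer(torch.cat([clauses, pair], dim=-1))      # (B, L, 1)
        mask = torch.sigmoid(logits)
        return mask.squeeze(-1), clauses * mask  # soft selection of context

cm = ContextMasking(dim=64)
m, gated = cm(torch.randn(2, 5, 64), torch.randn(2, 64), torch.randn(2, 64))
print(m.shape, gated.shape)  # torch.Size([2, 5]) torch.Size([2, 5, 64])
```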
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
Automatic radiology report generation has attracted enormous research
interest due to its practical value in reducing the workload of radiologists.
However, simultaneously establishing global correspondences between the image
(e.g., Chest X-ray) and its related report and local alignments between image
patches and keywords remains challenging. To this end, we propose a Unify,
Align and then Refine (UAR) approach to learn multi-level cross-modal
alignments and introduce three novel modules: Latent Space Unifier (LSU),
Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR).
Specifically, LSU unifies multimodal data into discrete tokens, making it
flexible to learn common knowledge among modalities with a shared network. The
modality-agnostic CRA first learns discriminative features via a set of
orthonormal bases and a dual-gate mechanism, and then globally aligns visual and
textual representations under a triplet contrastive loss. TIR boosts
token-level local alignment via calibrating text-to-image attention with a
learnable mask. Additionally, we design a two-stage training procedure to make
UAR gradually grasp cross-modal alignments at different levels, which imitates
radiologists' workflow: writing sentence by sentence first and then checking
word by word. Extensive experiments and analyses on IU-Xray and MIMIC-CXR
benchmark datasets demonstrate the superiority of our UAR against varied
state-of-the-art methods.
Comment: 8 pages, 6 figures, 4 tables
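The global alignment step of CRA can be pictured with a standard hinge-based
triplet loss over in-batch hardest negatives, sketched below; this is a generic
stand-in, and UAR's actual triplet contrastive loss may differ in its margin
and mining strategy.

```python
import torch
import torch.nn.functional as F

def triplet_contrastive(img, txt, margin=0.2):
    """Generic sketch: pull matched image-report pairs together and push
    each sample's hardest in-batch negative at least `margin` away."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    sim = img @ txt.T                      # (B, B) cosine similarity matrix
    pos = sim.diag()                       # matched image-report pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    neg_i2t = sim.masked_fill(mask, -1.0).max(dim=1).values  # hardest report
    neg_t2i = sim.masked_fill(mask, -1.0).max(dim=0).values  # hardest image
    return (F.relu(margin - pos + neg_i2t) +
            F.relu(margin - pos + neg_t2i)).mean()

print(triplet_contrastive(torch.randn(4, 128), torch.randn(4, 128)).item())
```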
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Vision-and-language navigation (VLN) is the task of enabling an embodied agent
to navigate to a remote location by following natural language instructions in
real scenes. Most previous approaches utilize entire-view features or
object-centric features to represent navigable candidates. However, these
representations are not efficient enough for an agent to perform actions to
arrive at the target location. As knowledge provides crucial information that is
complementary to visible content, in this paper, we propose a Knowledge
Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent
navigation ability. Specifically, we first retrieve facts (i.e., knowledge
described by language descriptions) for the navigation views based on local
regions from the constructed knowledge base. The retrieved facts range from
properties of a single object (e.g., color, shape) to relationships between
objects (e.g., action, spatial position), providing crucial information for
VLN. We further present KERM, which contains the purification, fact-aware
interaction, and instruction-guided aggregation modules to integrate visual,
history, instruction, and fact features. The proposed KERM can automatically
select and gather crucial and relevant cues, obtaining more accurate action
prediction. Experimental results on the REVERIE, R2R, and SOON datasets
demonstrate the effectiveness of the proposed method.
Comment: Accepted by CVPR 2023. The code is available at
https://github.com/XiangyangLi20/KERM
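The fact-aware interaction can be pictured as cross-attention from visual
region features to retrieved fact embeddings, as in the minimal sketch below;
the class name and the residual design here are assumptions rather than KERM's
exact module.

```python
import torch
import torch.nn as nn

class FactAwareInteraction(nn.Module):
    """Hypothetical sketch: region features attend over retrieved fact
    embeddings so complementary knowledge flows into the visual tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions, facts):
        # regions: (B, R, D) visual regions; facts: (B, K, D) retrieved facts
        fused, _ = self.attn(query=regions, key=facts, value=facts)
        return self.norm(regions + fused)   # residual keeps the visual evidence

fai = FactAwareInteraction()
out = fai(torch.randn(2, 36, 512), torch.randn(2, 5, 512))
print(out.shape)  # torch.Size([2, 36, 512])
```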
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
Recent video grounding works attempt to introduce vanilla contrastive
learning into video grounding. However, we claim that this naive solution is
suboptimal. Contrastive learning requires two key properties: (1)
alignment of features of similar samples, and (2) uniformity of
the induced distribution of the normalized features on the hypersphere. Due to
two annoying issues in video grounding: (1) the co-existence of some visual
entities in both the ground truth and other moments, i.e., semantic overlapping;
(2) only a few moments in the video being annotated, i.e., the sparse annotation
dilemma, vanilla contrastive learning is unable to model the correlations between
temporally distant moments and learns inconsistent video representations. Both
characteristics make vanilla contrastive learning unsuitable for video
grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a
semantically aligned and uniform video grounding framework via geodesic and
game theory. We quantify the correlations among moments leveraging the geodesic
distance that guides the model to learn the correct cross-modal
representations. Furthermore, from the novel perspective of game theory, we
propose semantic Shapley interaction based on geodesic distance sampling to
learn fine-grained semantic alignment in similar moments. Experiments on three
benchmarks demonstrate the effectiveness of our method.
Comment: ICCV 2023
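One way to picture geodesic-guided contrast is to soften the one-hot
contrastive target by each moment's geodesic distance to the ground truth, so
semantically overlapping moments are penalized less; the sketch below assumes
gd is precomputed (e.g., shortest-path distance on a feature k-NN graph) and
is not G2L's exact objective.

```python
import torch
import torch.nn.functional as F

def geodesic_soft_nce(moments, query, gd, tau=0.07):
    """Assumed sketch: moments (N, D), query (D,), gd (N,) geodesic
    distances to the ground-truth moment (0 for the GT itself).
    Soft targets need PyTorch >= 1.10 for probabilistic cross_entropy."""
    moments, query = F.normalize(moments, dim=-1), F.normalize(query, dim=-1)
    logits = moments @ query / tau               # (N,) moment-query similarity
    targets = F.softmax(-gd / gd.mean(), dim=0)  # closer on manifold -> more mass
    return F.cross_entropy(logits.unsqueeze(0), targets.unsqueeze(0))

m, q = torch.randn(10, 256), torch.randn(256)
gd = torch.rand(10)
print(geodesic_soft_nce(m, q, gd).item())
```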
Differential Expression Levels of Genes Related to Myogenesis During Embryogenesis of Quail and Chicken
The present study was designed to investigate the expression dynamics of genes during myogenesis in quail and chicken. Real-time PCR was used to detect mRNA expression of MyoD, MyoG, MLP and MSTN in the breast muscle of quail and chicken embryos during embryonic days E7-17. Results showed that the expression profile of each gene displayed a similar trend over the experimental period in quail and chicken; however, the expression levels of the two species differed at each time point examined. MyoD mRNA expression in quail was significantly lower in the early phase of the experimental period (E7-9) (P<0.01 on E7; P<0.05 on both E8 and E9). For MyoG and MLP, mRNA expression was lower in quail than in chicken throughout the experimental period. Additionally, the embryonic day on which quail reached peak expression was earlier than that in chicken (MyoG: quail E12 vs. chicken E13; MLP: quail E14 vs. chicken E15), and the peak expression of both genes in quail was significantly lower than that in chicken (P<0.01 for both). For MSTN, expression was significantly higher in quail than in chicken at each time point examined (P<0.01). It is concluded that the differential expression of these genes might have contributed, at least partially, to the differences in muscle development between quail and chicken.
MixBCT: Towards Self-Adapting Backward-Compatible Training
The exponential growth of data, alongside advancements in model structures
and loss functions, has necessitated the enhancement of image retrieval systems
through the utilization of new models with superior feature embeddings.
However, the expensive process of updating the old retrieval database by
replacing embeddings poses a challenge. As a solution, backward-compatible
training can be employed to avoid the necessity of updating old retrieval
datasets. While previous methods achieved backward compatibility by aligning
prototypes of the old model, they often overlooked the distribution of the old
features, thus limiting their effectiveness when the old model's low quality
leads to a weakly discriminative feature distribution. On the other hand,
instance-based methods like L2 regression take into account the distribution of
old features but impose strong constraints on the performance of the new model
itself. In this paper, we propose MixBCT, a simple yet highly effective
backward-compatible training method that serves as a unified framework for old
models of varying qualities. Specifically, we summarize four constraints that
are essential for ensuring backward compatibility in an ideal scenario, and we
construct a single loss function to facilitate backward-compatible training.
Our approach adaptively adjusts the constraint domain for new features based on
the distribution of the old embeddings. We conducted extensive experiments on
the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the
effectiveness of our method. The experimental results clearly demonstrate its
superiority over previous methods. Code is available at
https://github.com/yuleung/MixBCT
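A much-simplified view of the backward-compatibility constraint is sketched
below: new embeddings stay closer to their own old embeddings than to any
other instance's old embedding. MixBCT's actual single loss additionally
adapts the constraint domain to the old feature distribution, which this toy
version omits.

```python
import torch
import torch.nn.functional as F

def bct_loss(new_feat, old_feat, margin=0.2):
    """Simplified sketch (not MixBCT's actual loss): hinge that keeps each
    new embedding nearer its own old embedding than any other old one."""
    new_feat = F.normalize(new_feat, dim=-1)
    old_feat = F.normalize(old_feat, dim=-1)
    sim = new_feat @ old_feat.T                   # (B, B) new-to-old similarity
    pos = sim.diag()                              # same instance, old gallery
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    neg = sim.masked_fill(mask, -1.0).max(dim=1).values
    return F.relu(neg - pos + margin).mean()

print(bct_loss(torch.randn(16, 512), torch.randn(16, 512)).item())
```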