Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Given a descriptive text query, text-based person search (TBPS) aims to
retrieve the best-matched target person from an image gallery. Such a
cross-modal retrieval task is quite challenging due to the significant modality
gap, fine-grained differences, and the insufficiency of annotated data. To better
align the two modalities, most existing works focus on introducing
sophisticated network structures and auxiliary tasks, which are complex and
hard to implement. In this paper, we propose a simple yet effective dual
Transformer model for text-based person search. By exploiting a hardness-aware
contrastive learning strategy, our model achieves state-of-the-art performance
without any special design for local feature alignment or side information.
Moreover, we propose a proximity data generation (PDG) module to automatically
produce more diverse data for cross-modal training. The PDG module first
introduces an automatic generation algorithm based on a text-to-image diffusion
model, which generates new text-image pair samples in the proximity space of
original ones. Then it combines approximate text generation and feature-level
mixup during training to further strengthen the data diversity. The PDG module
can largely guarantee the reasonability of the generated samples that are
directly used for training without any human inspection for noise rejection. It
improves the performance of our model significantly, providing a feasible
solution to the data insufficiency problem faced by such fine-grained
visual-linguistic tasks. Extensive experiments on two popular datasets of the
TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach
clearly outperforms state-of-the-art approaches, e.g., improving by 3.88%,
4.02%, and 2.92% in terms of Top-1, Top-5, and Top-10 accuracy on CUHK-PEDES. The
code will be available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG
Comment: Accepted by IEEE T-CSV
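As a rough, self-contained illustration of what a hardness-aware cross-modal
contrastive objective can look like, the sketch below up-weights negatives whose
similarity approaches that of the matched pair. The weighting rule, margin, and
temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hardness_aware_contrastive_loss(img_feats, txt_feats,
                                    temperature=0.07, hard_weight=2.0):
    """Cross-modal contrastive loss that up-weights hard negatives (sketch only).

    img_feats, txt_feats: (B, D) L2-normalized embeddings of matched image-text pairs.
    """
    logits = img_feats @ txt_feats.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Mark negatives whose similarity comes close to the positive pair's as "hard".
    with torch.no_grad():
        pos = logits.diag().unsqueeze(1)               # (B, 1) positive logits
        weights = torch.where(logits > pos - 1.0,
                              torch.full_like(logits, hard_weight),
                              torch.ones_like(logits))
        weights.fill_diagonal_(1.0)                    # never re-weight the positive

    # Adding log-weights to logits multiplies the corresponding softmax terms.
    weighted = logits + weights.log()
    loss_i2t = F.cross_entropy(weighted, targets)      # image-to-text retrieval
    loss_t2i = F.cross_entropy(weighted.t(), targets)  # text-to-image retrieval
    return 0.5 * (loss_i2t + loss_t2i)
```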
Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation
Video scene graph generation (VidSGG) aims to identify objects in visual
scenes and infer their relationships for a given video. It requires not only a
comprehensive understanding of each object scattered across the whole scene but
also a deep dive into their temporal motions and interactions. Inherently,
object pairs and their relationships enjoy spatial co-occurrence correlations
within each image and temporal consistency/transition correlations across
different images, which can serve as prior knowledge to facilitate VidSGG model
learning and inference. In this work, we propose a spatial-temporal
knowledge-embedded transformer (STKET) that incorporates the prior
spatial-temporal knowledge into the multi-head cross-attention mechanism to
learn more representative relationship representations. Specifically, we first
learn spatial co-occurrence and temporal transition correlations in a
statistical manner. Then, we design spatial and temporal knowledge-embedded
layers that introduce the multi-head cross-attention mechanism to fully explore
the interaction between visual representation and the knowledge to generate
spatial- and temporal-embedded representations, respectively. Finally, we
aggregate these representations for each subject-object pair to predict the
final semantic labels and their relationships. Extensive experiments show that
STKET outperforms current competing algorithms by a large margin, e.g.,
improving mR@50 by 8.1%, 4.7%, and 2.1% under different settings.
Comment: Technical Report
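As a loose sketch of the knowledge-embedded attention idea, the module below
injects prior predicate knowledge into pair-wise visual features via multi-head
cross-attention; the learned knowledge table, dimensions, and residual fusion
are assumptions for illustration rather than the paper's exact layer design.

```python
import torch
import torch.nn as nn

class KnowledgeEmbeddedLayer(nn.Module):
    """Fuse visual relationship features with prior knowledge via cross-attention."""

    def __init__(self, dim=512, num_heads=8, num_predicates=50):
        super().__init__()
        # Prior knowledge, e.g. co-occurrence/transition statistics projected to `dim`.
        self.knowledge = nn.Embedding(num_predicates, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rel_feats):
        # rel_feats: (B, N_pairs, dim) visual features of subject-object pairs.
        k = self.knowledge.weight.unsqueeze(0).expand(rel_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(query=rel_feats, key=k, value=k)
        return self.norm(rel_feats + attended)  # knowledge-embedded representation
```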
Knowledge Graph Transfer Network for Few-Shot Recognition
Few-shot learning aims to learn novel categories from very few samples given
some base categories with sufficient training samples. The main challenge of
this task is that the novel categories are prone to being dominated by the
color, texture, and shape of the object or the background context (namely,
specificity), which are distinctive for the given few training samples but not
common to the corresponding categories (see Figure 1). Fortunately, we find
that transferring information from the correlated base categories can help
learn the novel concepts and thus prevent them from being dominated by such
specificity.
Besides, incorporating semantic correlations among different categories can
effectively regularize this information transfer. In this work, we represent
the semantic correlations in the form of structured knowledge graph and
integrate this graph into deep neural networks to promote few-shot learning by
a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing
each node with the classifier weight of the corresponding category, a
propagation mechanism is learned to adaptively propagate node message through
the graph to explore node interaction and transfer classifier information of
the base categories to those of the novel ones. Extensive experiments on the
ImageNet dataset show significant performance improvement compared with current
leading competitors. Furthermore, we construct an ImageNet-6K dataset that
covers a larger set of categories, i.e., 6,000 categories, and experiments on
this dataset further demonstrate the effectiveness of our proposed model.
Comment: Accepted by AAAI 2020 as oral paper
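The sketch below mimics the core idea of propagating classifier weights over a
semantic graph so that base-category knowledge flows to novel categories; the
plain GCN-style update with a residual connection is an illustrative choice,
not KGTN's exact gated propagation mechanism.

```python
import torch
import torch.nn as nn

class ClassifierWeightPropagation(nn.Module):
    """Propagate classifier weights over a semantic correlation graph (sketch)."""

    def __init__(self, adjacency, feat_dim, num_steps=2):
        super().__init__()
        # adjacency: (C, C) semantic correlations among base + novel categories.
        self.register_buffer("adj", adjacency / adjacency.sum(dim=1, keepdim=True))
        self.update = nn.Linear(feat_dim, feat_dim)
        self.num_steps = num_steps

    def forward(self, classifier_weights):
        # classifier_weights: (C, D); novel-category rows may start poorly estimated.
        w = classifier_weights
        for _ in range(self.num_steps):
            w = torch.relu(self.update(self.adj @ w)) + w  # propagate + residual
        return w  # refined classifier weights for all categories
```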
Semantic Representation and Dependency Learning for Multi-Label Image Recognition
Recently, many multi-label image recognition (MLR) works have made significant
progress by introducing pre-trained object detection models to generate
numerous proposals or by utilizing statistical label co-occurrence to enhance
the correlation among different categories. However, these works have some
limitations: (1) the
effectiveness of the network significantly depends on pre-trained object
detection models that bring expensive and unaffordable computation; (2) the
network performance degrades when there exist occasionally co-occurring objects
in images, especially for the rare categories. To address these problems, we
propose a novel and effective semantic representation and dependency learning
(SRDL) framework to learn category-specific semantic representation for each
category and capture semantic dependency among all categories. Specifically, we
design a category-specific attentional regions (CAR) module to generate
channel/spatial-wise attention matrices to guide the model to focus on
semantic-aware regions. We also design an object erasing (OE) module to
implicitly learn semantic dependency among categories by erasing semantic-aware
regions to regularize the network training. Extensive experiments and
comparisons on two popular MLR benchmark datasets (i.e., MS-COCO and Pascal VOC
2007) demonstrate the effectiveness of the proposed framework over current
state-of-the-art algorithms.
Comment: 25 pages, 7 figures
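A minimal sketch of the object-erasing idea follows: the most attended spatial
locations for a category are zeroed out so the network must infer the label
from the remaining context, implicitly encouraging it to model inter-category
dependency. The top-k thresholding and keep ratio are illustrative assumptions.

```python
import torch

def erase_semantic_regions(features, attention, keep_ratio=0.7):
    """Zero out the most attended spatial locations (illustrative sketch).

    features:  (B, C, H, W) backbone feature map
    attention: (B, 1, H, W) spatial attention for a target category
    """
    b = attention.size(0)
    flat = attention.view(b, -1)
    k = max(1, int(flat.size(1) * (1 - keep_ratio)))  # locations to erase
    thresh = flat.topk(k, dim=1).values[:, -1:].view(b, 1, 1, 1)
    mask = (attention < thresh).float()               # keep only low-attention regions
    return features * mask
```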
Dual-Perspective Semantic-Aware Representation Blending for Multi-Label Image Recognition with Partial Labels
Despite achieving impressive progress, current multi-label image recognition
(MLR) algorithms heavily depend on large-scale datasets with complete labels,
which makes collecting such datasets extremely time-consuming and
labor-intensive. Training multi-label image recognition models with partial
labels (MLR-PL) is an alternative way, in which merely some labels are known
while others are unknown for each image. However, current MLR-PL algorithms
rely on pre-trained image similarity models or iteratively updating the image
classification models to generate pseudo labels for the unknown labels. Thus,
they depend on a certain amount of annotations and inevitably suffer from
obvious performance drops, especially when the known label proportion is low.
To address this dilemma, we propose a dual-perspective semantic-aware
representation blending (DSRB) that blends multi-granularity category-specific
semantic representations across different images, from the instance and
prototype perspectives respectively, to transfer information of known labels to
complement
unknown labels. Specifically, an instance-perspective representation blending
(IPRB) module is designed to blend the representations of the known labels in
an image with the representations of the corresponding unknown labels in
another image to complement these unknown labels. Meanwhile, a
prototype-perspective representation blending (PPRB) module is introduced to
learn more stable representation prototypes for each category and blends the
representation of unknown labels with the prototypes of corresponding labels,
in a location-sensitive manner, to complement these unknown labels. Extensive
experiments on the MS-COCO, Visual Genome, and Pascal VOC 2007 datasets show
that the proposed DSRB consistently outperforms current state-of-the-art
algorithms on all known label proportion settings.
Comment: Technical Report. arXiv admin note: text overlap with arXiv:2203.0217
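A minimal sketch of the instance-perspective blending step, under the
assumption that each image carries one category-specific representation per
category: for categories known in image A but unknown in image B, A's
representation is mixed into B's. The blending coefficient and selection rule
are illustrative, not the paper's exact IPRB design.

```python
import torch

def blend_known_into_unknown(reps_a, reps_b, known_a, unknown_b, alpha=0.5):
    """Blend known-label representations of image A into image B (sketch).

    reps_a, reps_b:     (C, D) category-specific representations of two images
    known_a, unknown_b: (C,) boolean masks over the C categories
    """
    transfer = (known_a & unknown_b).unsqueeze(1).float()  # categories to transfer
    mixed = alpha * reps_a + (1.0 - alpha) * reps_b        # blended representation
    return reps_b * (1.0 - transfer) + mixed * transfer
```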