DISC: Deep Image Saliency Computing via Progressive Representation Learning
Salient object detection increasingly receives attention as an important
component or step in several pattern recognition and image processing tasks.
Although a variety of powerful saliency models have been intensively proposed,
they usually involve heavy feature (or model) engineering based on priors (or
assumptions) about the properties of objects and backgrounds. Inspired by the
effectiveness of recently developed feature learning, we provide a novel Deep
Image Saliency Computing (DISC) framework for fine-grained image saliency
computing. In particular, we model the image saliency from both the coarse- and
fine-level observations, and utilize the deep convolutional neural network
(CNN) to learn the saliency representation in a progressive manner.
Specifically, our saliency model is built upon two stacked CNNs. The first CNN
generates a coarse-level saliency map by taking the overall image as the input,
roughly identifying saliency regions in the global context. Furthermore, we
integrate superpixel-based local context information in the first CNN to refine
the coarse-level saliency map. Guided by the coarse saliency map, the second
CNN focuses on the local context to produce a fine-grained and accurate saliency
map while preserving object details. For a testing image, the two CNNs
collaboratively conduct the saliency computing in one shot. Our DISC framework
is capable of uniformly highlighting the objects-of-interest against complex
backgrounds while preserving object details well. Extensive experiments on
several standard benchmarks suggest that DISC outperforms other
state-of-the-art methods and it also generalizes well across datasets without
additional training. The executable version of DISC is available online:
http://vision.sysu.edu.cn/projects/DISC.
Comment: This manuscript is the accepted version for IEEE Transactions on
Neural Networks and Learning Systems (T-NNLS), 201
Knowledge Graph Transfer Network for Few-Shot Recognition
Few-shot learning aims to learn novel categories from very few samples given
some base categories with sufficient training samples. The main challenge of
this task is that the novel categories are prone to being dominated by the color,
texture, or shape of the object or by background context (namely specificity), which are
distinct for the given few training samples but not common for the
corresponding categories (see Figure 1). Fortunately, we find that transferring
information from the correlated base categories can help learn the novel
concepts and thus avoid the novel concept being dominated by the specificity.
Besides, incorporating semantic correlations among different categories can
effectively regularize this information transfer. In this work, we represent
the semantic correlations in the form of a structured knowledge graph and
integrate this graph into deep neural networks to promote few-shot learning via
a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing
each node with the classifier weight of the corresponding category, a
propagation mechanism is learned to adaptively propagate node messages through
the graph to explore node interaction and transfer classifier information of
the base categories to those of the novel ones. Extensive experiments on the
ImageNet dataset show significant performance improvement compared with current
leading competitors. Furthermore, we construct an ImageNet-6K dataset that
covers a larger set of categories, i.e., 6,000 categories, and experiments on this
dataset further demonstrate the effectiveness of our proposed model.
Comment: accepted by AAAI 2020 as oral pape
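The propagation idea in the KGTN abstract can be illustrated with a minimal sketch. It assumes a fixed weighted adjacency matrix and a plain averaging update; the function name `propagate` and the parameters `steps` and `alpha` are hypothetical, and the learned, adaptive propagation mechanism described in the abstract is deliberately replaced here by this simpler rule.

```python
def propagate(weights, adjacency, steps=2, alpha=0.5):
    """Mix each node's vector with the weighted mean of its neighbours.

    weights   -- one classifier-weight vector per category (node)
    adjacency -- square matrix of non-negative edge weights (assumed given)
    steps     -- number of propagation rounds (illustrative parameter)
    alpha     -- how strongly neighbours influence a node (illustrative)
    """
    nodes = [list(w) for w in weights]
    for _ in range(steps):
        updated = []
        for i, row in enumerate(adjacency):
            total = sum(row)
            if total == 0:
                # Isolated node: nothing to receive, keep its vector as-is.
                updated.append(list(nodes[i]))
                continue
            dim = len(nodes[i])
            neighbour_mean = [
                sum(row[j] * nodes[j][d] for j in range(len(nodes))) / total
                for d in range(dim)
            ]
            updated.append([
                (1 - alpha) * nodes[i][d] + alpha * neighbour_mean[d]
                for d in range(dim)
            ])
        nodes = updated
    return nodes


# Two base categories (nodes 0 and 1) and one novel category (node 2)
# whose initial classifier weight is zero.
base_and_novel = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
graph = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
result = propagate(base_and_novel, graph, steps=1)
```

After one round, the novel node has absorbed part of both base classifiers, which is the intuition behind transferring classifier information along semantic correlations.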
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Given a descriptive text query, text-based person search (TBPS) aims to
retrieve the best-matched target person from an image gallery. Such a
cross-modal retrieval task is quite challenging due to the significant modality
gap, fine-grained differences, and insufficiency of annotated data. To better
align the two modalities, most existing works focus on introducing
sophisticated network structures and auxiliary tasks, which are complex and
hard to implement. In this paper, we propose a simple yet effective dual
Transformer model for text-based person search. By exploiting a hardness-aware
contrastive learning strategy, our model achieves state-of-the-art performance
without any special design for local feature alignment or side information.
Moreover, we propose a proximity data generation (PDG) module to automatically
produce more diverse data for cross-modal training. The PDG module first
introduces an automatic generation algorithm based on a text-to-image diffusion
model, which generates new text-image pair samples in the proximity space of
original ones. Then it combines approximate text generation and feature-level
mixup during training to further strengthen the data diversity. The PDG module
can largely guarantee the reasonability of the generated samples that are
directly used for training without any human inspection for noise rejection. It
improves the performance of our model significantly, providing a feasible
solution to the data insufficiency problem faced by such fine-grained
visual-linguistic tasks. Extensive experiments on two popular datasets of the
TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach
clearly outperforms state-of-the-art approaches, e.g., improving Top1, Top5, and
Top10 on CUHK-PEDES by 3.88%, 4.02%, and 2.92%, respectively. The code will be
available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG
Comment: Accepted by IEEE T-CSV
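Two ingredients named in the abstract above, contrastive learning and feature-level mixup, can be sketched in a few lines. The abstract does not give the exact loss, so the code below uses the generic temperature-scaled InfoNCE loss as a stand-in (a low temperature puts more weight on hard negatives, which is one common reading of "hardness-aware"). The names `info_nce` and `mixup` and the default `temperature` are assumptions for illustration, not the authors' implementation.

```python
import math


def info_nce(sim, temperature=0.1):
    """Mean image-to-text InfoNCE loss.

    sim is a square similarity matrix in which sim[i][i] is the score of
    the matched image-text pair and every other entry is a negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def mixup(feat_a, feat_b, lam):
    """Feature-level mixup: convex combination of two feature vectors."""
    return [lam * a + (1 - lam) * b for a, b in zip(feat_a, feat_b)]


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])     # matched pairs score highest
misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]])  # negatives score highest
blended = mixup([1.0, 0.0], [0.0, 1.0], 0.5)
```

The loss is small when matched pairs have the highest similarity and large when a negative outranks the match; `mixup` produces an intermediate feature between two samples.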
RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs
Blind face restoration aims at recovering high-quality face images from those
with unknown degradations. Current algorithms mainly introduce priors to
complement high-quality details and achieve impressive progress. However, most
of these algorithms ignore abundant contextual information in the face and its
interplay with the priors, leading to sub-optimal performance. Moreover, they
pay less attention to the gap between the synthetic and real-world scenarios,
limiting the robustness and generalization to real-world applications. In this
work, we propose RestoreFormer++, which on the one hand introduces
fully-spatial attention mechanisms to model the contextual information and the
interplay with the priors, and on the other hand, explores an extending
degrading model to help generate more realistic degraded face images to
alleviate the synthetic-to-real-world gap. Compared with current algorithms,
RestoreFormer++ has several crucial benefits. First, instead of using a
multi-head self-attention mechanism like the traditional visual transformer, we
introduce multi-head cross-attention over multi-scale features to fully explore
spatial interactions between corrupted information and high-quality priors. This
helps RestoreFormer++ restore face images with higher
realness and fidelity. Second, in contrast to the recognition-oriented
dictionary, we learn a reconstruction-oriented dictionary as priors, which
contains more diverse high-quality facial details and better accords with the
restoration target. Third, we introduce an extending degrading model that
contains more realistic degraded scenarios for training data synthesis, and
thus helps to enhance the robustness and generalization of our RestoreFormer++
model. Extensive experiments show that RestoreFormer++ outperforms
state-of-the-art algorithms on both synthetic and real-world datasets.
Comment: Submitted to TPAMI. An extension of RestoreForme
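The cross-attention idea in the RestoreFormer++ abstract, where queries come from the degraded features and keys and values come from high-quality priors, can be sketched with plain scaled dot-product attention. This is a simplified single-head version without learned projections or multi-scale features; the function name `cross_attention` and the toy inputs are assumptions for illustration only.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries -- feature vectors from the degraded input
    keys    -- vectors from the prior dictionary, matched against queries
    values  -- prior vectors that are blended into the output
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        m = max(scores)  # subtract the max for a stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        outputs.append([
            sum(w * v[c] for w, v in zip(attn, values))
            for c in range(len(values[0]))
        ])
    return outputs


# One degraded query attending over two prior entries with equal keys:
# the output is the average of the two prior values.
restored = cross_attention([[1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], [[0.0], [2.0]])
```

Each output row is a weighted mix of prior values, with the weights set by how well the degraded feature matches each prior key. In self-attention, by contrast, queries, keys, and values would all come from the same features.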