Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition
Recognizing multiple labels of images is a fundamental but challenging task
in computer vision, and remarkable progress has been attained by localizing
semantic-aware image regions and predicting their labels with deep
convolutional neural networks. However, localizing hypothesis regions (region
proposals) in these existing multi-label image recognition pipelines usually
incurs redundant computation, e.g., generating hundreds of uninformative
proposals with non-discriminative information and extracting their features,
and the spatial contextual dependencies among the localized regions are often
ignored or over-simplified. To resolve these
issues, this paper proposes a recurrent attention reinforcement learning
framework to iteratively discover a sequence of attentional and informative
regions that are related to different semantic objects and further predict
label scores conditioned on these regions. Besides, our method explicitly
models long-term dependencies among these attentional regions that help to
capture semantic label co-occurrence and thus facilitate multi-label
recognition. Extensive experiments and comparisons on two large-scale
benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model surpasses
existing state-of-the-art methods in both accuracy and efficiency, while also
explicitly associating image-level semantic labels with specific object
regions.
Comment: Accepted at AAAI 201
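As a rough illustration of the recurrent part of this idea, the sketch below shows how an LSTM over a sequence of attended region features can model dependencies among regions and aggregate per-step label scores. It is a minimal PyTorch fragment written for this summary: the module names, feature sizes, and max-pooling aggregation are assumptions, and the reinforcement-learning-based region localization from the paper is omitted.

```python
import torch
import torch.nn as nn


class RecurrentAttentionHead(nn.Module):
    """Aggregates multi-label scores over a sequence of attended regions (illustrative)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, num_labels=20, num_steps=5):
        super().__init__()
        self.num_steps = num_steps
        # The LSTM cell models long-term dependencies among attended regions,
        # which helps capture label co-occurrence.
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, region_features):
        # region_features: (batch, num_steps, feat_dim), one pooled feature per
        # attended region; how regions are selected (RL in the paper) is omitted here.
        batch = region_features.size(0)
        h = region_features.new_zeros(batch, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        step_scores = []
        for t in range(self.num_steps):
            h, c = self.rnn(region_features[:, t], (h, c))
            step_scores.append(self.classifier(h))
        # Element-wise max over steps gives the final image-level label scores.
        return torch.stack(step_scores, dim=1).max(dim=1).values
```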
DISC: Deep Image Saliency Computing via Progressive Representation Learning
Salient object detection increasingly receives attention as an important
component or step in several pattern recognition and image processing tasks.
Although a variety of powerful saliency models have been proposed,
they usually involve heavy feature (or model) engineering based on priors (or
assumptions) about the properties of objects and backgrounds. Inspired by the
effectiveness of recently developed feature learning, we provide a novel Deep
Image Saliency Computing (DISC) framework for fine-grained image saliency
computing. In particular, we model the image saliency from both the coarse- and
fine-level observations, and utilize the deep convolutional neural network
(CNN) to learn the saliency representation in a progressive manner.
Specifically, our saliency model is built upon two stacked CNNs. The first CNN
generates a coarse-level saliency map by taking the overall image as the input,
roughly identifying saliency regions in the global context. Furthermore, we
integrate superpixel-based local context information in the first CNN to refine
the coarse-level saliency map. Guided by the coarse saliency map, the second
CNN focuses on the local context to produce a fine-grained and accurate saliency
map while preserving object details. For a testing image, the two CNNs
collaboratively conduct the saliency computing in one shot. Our DISC framework
is capable of uniformly highlighting the objects of interest against complex
backgrounds while preserving object details well. Extensive experiments on
several standard benchmarks suggest that DISC outperforms other
state-of-the-art methods and it also generalizes well across datasets without
additional training. The executable version of DISC is available online:
http://vision.sysu.edu.cn/projects/DISC.
Comment: This manuscript is the accepted version for IEEE Transactions on
Neural Networks and Learning Systems (T-NNLS), 201
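A minimal sketch of the coarse-to-fine, two-CNN pipeline described above is given below. It is an illustrative PyTorch fragment, not the networks from the paper: the layer choices are placeholders and the superpixel-based refinement of the coarse map is omitted; only the idea of letting the coarse map guide the second CNN is shown.

```python
import torch
import torch.nn as nn


class CoarseToFineSaliency(nn.Module):
    """Two stacked CNNs: global coarse saliency, then locally refined saliency."""

    def __init__(self):
        super().__init__()
        # First CNN: sees the whole image and outputs a coarse saliency map
        # in the global context (placeholder layers).
        self.coarse_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Second CNN: sees the image concatenated with the coarse map and
        # focuses on local context to recover fine object details.
        self.fine_net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image):
        coarse = self.coarse_net(image)              # rough saliency regions
        guided = torch.cat([image, coarse], dim=1)   # coarse map guides stage two
        fine = self.fine_net(guided)                 # fine-grained saliency map
        return coarse, fine
```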
Learning a Wavelet-like Auto-Encoder to Accelerate Deep Neural Networks
Accelerating deep neural networks (DNNs) has been attracting increasing
attention as it can benefit a wide range of applications, e.g., enabling mobile
systems with limited computing resources to own powerful visual recognition
ability. A practical strategy toward this goal usually relies on a two-stage
process: operating on the trained DNNs (e.g., approximating the convolutional
filters with tensor decomposition) and then fine-tuning the amended network,
which makes it difficult to balance acceleration against recognition
performance. In this work, aiming at a general and comprehensive
way for neural network acceleration, we develop a Wavelet-like Auto-Encoder
(WAE) that decomposes the original input image into two low-resolution channels
(sub-images) and incorporate the WAE into the classification neural networks
for joint training. The two decomposed channels, in particular, are encoded to
carry the low-frequency information (e.g., image profiles) and the
high-frequency information (e.g., image details or noise), respectively, and
enable reconstructing the
original input image through the decoding process. Then, we feed the
low-frequency channel into a standard classification network such as VGG or
ResNet and employ a very lightweight network to fuse with the high-frequency
channel to obtain the classification result. Compared to existing DNN
acceleration solutions, our framework has the following advantages: i) it is
compatible with any existing convolutional neural network for classification
without amending its structure; ii) the WAE provides an interpretable way to
preserve the main components of the input image for classification.
Comment: Accepted at AAAI 201
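The fragment below sketches this decompose-classify-fuse structure in PyTorch. All layer choices and the additive fusion of logits are illustrative assumptions; the actual WAE design, its wavelet-like constraints, and the joint training losses are described in the paper.

```python
import torch
import torch.nn as nn


class WaveletLikeAE(nn.Module):
    """Splits an image into two half-resolution channels and can reconstruct it."""

    def __init__(self):
        super().__init__()
        # Strided convolutions halve the resolution, mimicking a wavelet decomposition.
        self.encode_low = nn.Conv2d(3, 3, kernel_size=4, stride=2, padding=1)
        self.encode_high = nn.Conv2d(3, 3, kernel_size=4, stride=2, padding=1)
        # Transposed convolution reconstructs the full-resolution image, which a
        # reconstruction loss would use to keep the two channels complementary.
        self.decode = nn.ConvTranspose2d(6, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, image):
        low = self.encode_low(image)    # low-frequency sub-image (profiles)
        high = self.encode_high(image)  # high-frequency sub-image (details/noise)
        recon = self.decode(torch.cat([low, high], dim=1))
        return low, high, recon


class AcceleratedClassifier(nn.Module):
    """Classifies the low-frequency channel with a standard backbone and fuses a light branch."""

    def __init__(self, backbone, num_classes, high_dim=64):
        super().__init__()
        self.wae = WaveletLikeAE()
        self.backbone = backbone                     # e.g. an unmodified VGG/ResNet returning logits
        self.high_branch = nn.Sequential(            # very lightweight branch on the high channel
            nn.Conv2d(3, high_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(high_dim, num_classes),
        )

    def forward(self, image):
        low, high, _ = self.wae(image)
        # Assumed fusion: simply sum the logits of the two branches.
        return self.backbone(low) + self.high_branch(high)
```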
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Given a descriptive text query, text-based person search (TBPS) aims to
retrieve the best-matched target person from an image gallery. Such a
cross-modal retrieval task is quite challenging due to the significant modality
gap, fine-grained differences, and the insufficiency of annotated data. To better
align the two modalities, most existing works focus on introducing
sophisticated network structures and auxiliary tasks, which are complex and
hard to implement. In this paper, we propose a simple yet effective dual
Transformer model for text-based person search. By exploiting a hardness-aware
contrastive learning strategy, our model achieves state-of-the-art performance
without any special design for local feature alignment or side information.
Moreover, we propose a proximity data generation (PDG) module to automatically
produce more diverse data for cross-modal training. The PDG module first
introduces an automatic generation algorithm based on a text-to-image diffusion
model, which generates new text-image pair samples in the proximity space of
original ones. Then it combines approximate text generation and feature-level
mixup during training to further strengthen the data diversity. The PDG module
can largely ensure the plausibility of the generated samples, which are used
directly for training without any human inspection for noise rejection. It
improves the performance of our model significantly, providing a feasible
solution to the data insufficiency problem faced by such fine-grained
visual-linguistic tasks. Extensive experiments on two popular datasets of the
TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach
outperforms state-of-the-art approaches markedly, e.g., improving Top-1, Top-5,
and Top-10 accuracy by 3.88%, 4.02%, and 2.92%, respectively, on CUHK-PEDES. The
code will be available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG
Comment: Accepted by IEEE T-CSV
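As an illustration of the dual-encoder training signal, the sketch below implements a generic image-text contrastive loss with an assumed hardness weighting (negatives whose similarity comes close to the matched pair's are up-weighted). It does not reproduce the paper's exact hardness-aware strategy or the PDG module; it only conveys the general idea.

```python
import torch
import torch.nn.functional as F


def hardness_aware_contrastive_loss(img_emb, txt_emb, temperature=0.07,
                                    hard_margin=0.1, hard_weight=2.0):
    # img_emb, txt_emb: (batch, dim) embeddings from the two Transformer encoders.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                      # pairwise cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)

    # Assumed hardness weighting: negatives within `hard_margin` of the matched
    # pair's similarity get a larger logit scale, so they contribute more to the loss.
    pos = sim.diag().unsqueeze(1)
    weights = torch.where(sim > pos - hard_margin,
                          torch.full_like(sim, hard_weight),
                          torch.ones_like(sim))
    weights.fill_diagonal_(1.0)
    logits = sim * weights / temperature

    # Symmetric image-to-text and text-to-image cross-entropy.
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_i2t + loss_t2i)
```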
Knowledge Graph Transfer Network for Few-Shot Recognition
Few-shot learning aims to learn novel categories from very few samples given
some base categories with sufficient training samples. The main challenge of
this task is that the novel categories are prone to being dominated by the
color, texture, and shape of the object or by the background context (namely,
specificity), which are distinctive for the given few training samples but not
common to the corresponding categories (see Figure 1). Fortunately, we find
that transferring information from the correlated base categories can help
learn the novel concepts and thus prevent them from being dominated by such
specificity.
Besides, incorporating semantic correlations among different categories can
effectively regularize this information transfer. In this work, we represent
the semantic correlations in the form of a structured knowledge graph and
integrate this graph into deep neural networks to promote few-shot learning by
a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing
each node with the classifier weight of the corresponding category, a
propagation mechanism is learned to adaptively propagate node messages through
the graph to explore node interaction and transfer classifier information of
the base categories to those of the novel ones. Extensive experiments on the
ImageNet dataset show significant performance improvement compared with current
leading competitors. Furthermore, we construct an ImageNet-6K dataset that
covers a larger scale of categories, i.e., 6,000 categories, and experiments on
this dataset further demonstrate the effectiveness of our proposed model.
Comment: Accepted by AAAI 2020 as an oral paper
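A conceptual sketch of propagating classifier weights over a category graph is given below. The propagation rule, gating, and graph construction are simplified assumptions rather than the paper's exact formulation; the point is that reliable base-class classifier weights can be transferred to correlated novel classes through a learned message-passing step.

```python
import torch
import torch.nn as nn


class ClassifierWeightPropagation(nn.Module):
    """Refines per-category classifier weights by message passing over a category graph."""

    def __init__(self, feat_dim, adjacency, num_steps=2):
        super().__init__()
        # adjacency: (num_classes, num_classes) row-normalized semantic-correlation matrix.
        self.register_buffer("adj", adjacency)
        self.num_steps = num_steps
        self.transform = nn.Linear(feat_dim, feat_dim)   # learned message transformation
        self.gate = nn.Parameter(torch.tensor(0.0))      # balances original vs. propagated weights

    def forward(self, classifier_weights):
        # classifier_weights: (num_classes, feat_dim); novel-class rows are poorly
        # estimated from few shots, base-class rows are reliable.
        w = classifier_weights
        g = torch.sigmoid(self.gate)
        for _ in range(self.num_steps):
            messages = self.adj @ torch.relu(self.transform(w))  # aggregate correlated categories
            w = (1 - g) * w + g * messages                       # transfer base-class knowledge
        return w


# Usage (illustrative): logits = image_features @ propagation(classifier_weights).t()
```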