Diversified Visual Attention Networks for Fine-Grained Object Classification
Fine-grained object classification is a challenging task due to the subtle
inter-class difference and large intra-class variation. Recently, visual
attention models have been applied to automatically localize the discriminative
regions of an image to better capture critical differences, and have demonstrated
promising performance. However, without considering the diversity of the
attention process, most existing attention models perform poorly at
classifying fine-grained objects. In this paper, we propose a diversified
visual attention network (DVAN) to address the problem of fine-grained object
classification, which substantially relieves the dependency on
strongly-supervised information for learning to localize discriminative regions
compared with attentionless models. More importantly, DVAN explicitly pursues
the diversity of attention and is able to gather discriminative information to
the maximal extent. Multiple attention canvases are generated to extract
convolutional features for attention. An LSTM recurrent unit is employed to
learn the attentiveness and discrimination of attention canvases. The proposed
DVAN has the ability to attend to the object from coarse to fine granularity, and
a dynamic internal representation for classification is built up by
incrementally combining the information from different locations and scales of
the image. Extensive experiments conducted on CUB-2011, Stanford Dogs and
Stanford Cars datasets have demonstrated that the proposed diversified visual
attention networks achieve competitive performance compared to the
state-of-the-art approaches, without using any prior knowledge, user
interaction or external resources in training or testing.
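As a concrete illustration of the diversity mechanism described above, here is a minimal PyTorch sketch in which an LSTM cell incrementally aggregates soft-attended convolutional features while a running attention history suppresses already-visited locations. All module names, dimensions, and the suppression rule are illustrative assumptions, not the authors' exact DVAN.

```python
# A hedged sketch of diversified visual attention (assumed sizes and rules).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiversifiedAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=200, steps=4):
        super().__init__()
        self.steps = steps
        self.att = nn.Linear(feat_dim + hidden, 1)  # scores each spatial location
        self.rnn = nn.LSTMCell(feat_dim, hidden)    # accumulates attended evidence
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, feats):                       # feats: (B, L, D) spatial features
        B, L, D = feats.shape
        h = feats.new_zeros(B, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        logits, history = [], feats.new_zeros(B, L)
        for _ in range(self.steps):
            q = h.unsqueeze(1).expand(B, L, h.size(1))
            e = self.att(torch.cat([feats, q], dim=-1)).squeeze(-1)
            a = F.softmax(e - history, dim=1)       # penalize visited regions -> diversity
            history = history + a                   # running attention history
            h, c = self.rnn((a.unsqueeze(-1) * feats).sum(1), (h, c))
            logits.append(self.cls(h))
        return torch.stack(logits).mean(0)          # combine evidence across steps
```

Averaging the per-step logits mirrors the incremental combination of information from different locations and scales that the abstract describes.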
Object-Part Attention Model for Fine-grained Image Classification
Fine-grained image classification aims to recognize hundreds of subcategories
belonging to the same basic-level category, such as 200 subcategories of
birds, which is highly challenging due to large variance within the same
subcategory and small variance among different subcategories. Existing methods
generally first locate the objects or parts and then discriminate which
subcategory the image belongs to. However, they mainly have two limitations:
(1) Relying on object or part annotations, which are labor-intensive to obtain.
(2) Ignoring the spatial relationships between the object and its parts as well
as among these parts, both of which are significantly helpful for finding
discriminative parts. Therefore, this paper proposes the object-part attention
model (OPAM) for weakly supervised fine-grained image classification, and the
main novelties are: (1) The object-part attention model integrates two levels
of attention: object-level attention localizes the objects in images, and
part-level attention selects discriminative parts of the objects. Both are
jointly employed to learn multi-view and multi-scale features that mutually
reinforce each other.
(2) The object-part spatial constraint model combines two spatial constraints:
the object spatial constraint ensures that selected parts are highly
representative, and the part spatial constraint eliminates redundancy and
enhances the discrimination of the selected parts. Both are jointly employed
to exploit the subtle and local
differences for distinguishing the subcategories. Importantly, neither object
nor part annotations are used in our proposed approach, which avoids the heavy
labor consumption of labeling. Compared with more than 10 state-of-the-art
methods on 4 widely-used datasets, our OPAM approach achieves the best
performance.
Comment: 14 pages, submitted to IEEE Transactions on Image Processing
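To make the two spatial constraints concrete, the toy sketch below greedily keeps high-scoring part boxes whose centers fall inside the object box (object constraint) and that overlap the already kept parts little (part constraint). The scoring, thresholds, and greedy scheme are hypothetical stand-ins, not OPAM's actual formulation.

```python
# Illustrative part selection under object and part spatial constraints.
def box_iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-8)

def center_inside(part, obj):
    cx, cy = (part[0] + part[2]) / 2.0, (part[1] + part[3]) / 2.0
    return obj[0] <= cx <= obj[2] and obj[1] <= cy <= obj[3]

def select_parts(obj_box, candidates, scores, k=2, max_overlap=0.3):
    """Greedy selection: representative (inside the object) and
    non-redundant (low mutual overlap) part boxes."""
    kept = []
    for i in sorted(range(len(candidates)), key=lambda i: -scores[i]):
        if not center_inside(candidates[i], obj_box):
            continue  # object spatial constraint
        if any(box_iou(candidates[i], candidates[j]) > max_overlap for j in kept):
            continue  # part spatial constraint
        kept.append(i)
        if len(kept) == k:
            break
    return [candidates[i] for i in kept]
```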
Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
Text in natural images contains rich semantics that are often highly relevant
to the objects or the scene. In this paper, we focus on the problem of fully
exploiting scene text for visual understanding. The main idea is to combine word
representations and deep visual features into a globally trainable deep
convolutional neural network. First, the recognized words are obtained by a
scene text reading system. Then, we combine the word embedding of the
recognized words and the deep visual features into a single representation,
which is optimized by a convolutional neural network for fine-grained image
classification. In our framework, the attention mechanism is adopted to reveal
the relevance between each recognized word and the given image, which further
enhances the recognition performance. We have performed experiments on two
datasets, the Con-Text dataset and the Drink Bottle dataset, which are proposed
for fine-grained classification of business places and drink bottles,
respectively.
The experimental results consistently demonstrate that the proposed method
combining textual and visual cues significantly outperforms classification with
only visual representations. Moreover, we have shown that the learned
representation improves the retrieval performance on the drink bottle images by
a large margin, making it potentially useful in product search.
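A minimal sketch of the attention-based fusion described above, assuming per-word embeddings from the text reader and one global CNN feature per image; the module names and dimensions are assumptions, not the paper's exact architecture.

```python
# Hedged sketch: relevance-weighted word features fused with a visual feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVisualFusion(nn.Module):
    def __init__(self, word_dim=300, vis_dim=2048, num_classes=28):
        super().__init__()
        self.proj = nn.Linear(word_dim, vis_dim)   # map words into the visual space
        self.cls = nn.Linear(2 * vis_dim, num_classes)

    def forward(self, word_emb, vis_feat):
        # word_emb: (B, N, word_dim) embeddings of N recognized words
        # vis_feat: (B, vis_dim) global CNN feature of the image
        w = self.proj(word_emb)                                  # (B, N, vis_dim)
        rel = torch.bmm(w, vis_feat.unsqueeze(-1)).squeeze(-1)   # word-image relevance
        a = F.softmax(rel, dim=1)                                # attention over words
        text = (a.unsqueeze(-1) * w).sum(1)                      # weighted text feature
        return self.cls(torch.cat([text, vis_feat], dim=1))
```

The softmax weights play the role of the relevance scores between each recognized word and the image; irrelevant words (e.g. OCR noise) receive low weight and contribute little to the fused representation.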
Fine-grained pose prediction, normalization, and recognition
Pose variation and subtle differences in appearance are key challenges to
fine-grained classification. While deep networks have markedly improved general
recognition, many approaches to fine-grained recognition rely on anchoring
networks to parts for better accuracy. Identifying parts to find correspondence
discounts pose variation so that features can be tuned to appearance. To this
end, previous methods have examined how to find parts and extract
pose-normalized features. These methods have generally separated fine-grained
recognition into stages which first localize parts using hand-engineered and
coarsely-localized proposal features, and then separately learn deep
descriptors centered on inferred part positions. We unify these steps in an
end-to-end trainable network supervised by keypoint locations and class labels
that localizes parts by a fully convolutional network to focus the learning of
feature representations for the fine-grained classification task. Experiments
on the popular CUB200 dataset show that our method is state-of-the-art and
suggest a continuing role for strong supervision.
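The unification can be pictured as one network with a part-localization head and a classification head trained under a single objective. The sketch below assumes a generic shape-preserving backbone and Gaussian heatmap targets around the annotated keypoints; the paper's actual fully convolutional localizer is more involved.

```python
# Hedged sketch of joint keypoint and class supervision in one network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartSupervisedNet(nn.Module):
    def __init__(self, backbone, feat_ch=512, num_parts=15, num_classes=200):
        super().__init__()
        self.backbone = backbone                         # any conv feature extractor
        self.kp_head = nn.Conv2d(feat_ch, num_parts, 1)  # per-part heatmaps (FCN-style)
        self.cls_head = nn.Linear(feat_ch, num_classes)

    def forward(self, x):
        f = self.backbone(x)                        # (B, C, H, W)
        heatmaps = self.kp_head(f)                  # part localization
        logits = self.cls_head(f.mean(dim=(2, 3)))  # globally pooled classification
        return heatmaps, logits

def joint_loss(heatmaps, logits, kp_targets, labels, lam=1.0):
    # kp_targets: (B, P, H, W) Gaussian heatmaps at annotated keypoints (assumed)
    loc = F.mse_loss(heatmaps, kp_targets)
    cls = F.cross_entropy(logits, labels)
    return cls + lam * loc                          # one end-to-end objective
```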
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental yet practical problem in many machine vision
applications such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables the full exploitation of the context
information from both RGB and depth data. Compared to existing RGB-D object
detection frameworks, our approach has several appealing properties. First, it
consists of an attention-based global context model for exploiting adaptive
contextual information and incorporating this information into a region-based
CNN (e.g., Fast RCNN) framework to achieve improved object detection
performance. Second, our CMAC framework further contains a fine-grained object
part attention module to harness multiple discriminative object parts inside
each possible object region for superior local feature representation. While
greatly improving the accuracy of RGB-D object detection, the effective
cross-modal information fusion as well as attentional context modeling in our
proposed model provide an interpretable visualization scheme. Experimental
results demonstrate that the proposed method significantly improves upon the
state of the art on all public benchmarks.
Comment: Accepted as a regular paper to IEEE Transactions on Image Processing
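One assumption-laden way to picture the attention-based global context model: pool the RGB and depth maps under a learned attention and add the resulting context vector to each RoI-pooled region feature before the detection heads. This is a rough reading of the idea, not the published CMAC.

```python
# Hedged sketch: attended cross-modal global context added to region features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextFusion(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # attention over spatial positions
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, rgb_map, depth_map, region_feats):
        # rgb_map, depth_map: (B, L, dim) flattened conv maps from each modality
        # region_feats: (B, R, dim) RoI-pooled features (e.g. from Fast R-CNN)
        fused = torch.cat([rgb_map, depth_map], dim=-1)      # (B, L, 2*dim)
        a = F.softmax(self.score(fused).squeeze(-1), dim=1)  # (B, L) attention
        ctx = self.merge((a.unsqueeze(-1) * fused).sum(1))   # (B, dim) global context
        return region_feats + ctx.unsqueeze(1)               # context-augmented regions
```

The attention weights over spatial positions are also what makes such a scheme visualizable: plotting them shows which parts of the scene the detector treats as context.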
Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories
belonging to the same basic-level category, which is a highly challenging task
due to the quite subtle and local visual distinctions among similar
subcategories. Most existing methods generally learn part detectors to discover
discriminative regions for better categorization performance. However, not all
parts are beneficial and indispensable for visual categorization, and the
setting of part detector number heavily relies on prior knowledge as well as
experimental validation. When we describe the object in an image with text, we
naturally focus on its pivotal characteristics and rarely mention common
characteristics or the background. This involuntary transfer from human visual
attention to textual attention means that textual attention tells us how many
and which parts are discriminative and significant for categorization, so
textual attention can help us discover visual attention in the image. Inspired by
this, we propose a fine-grained visual-textual representation learning (VTRL)
approach, and its main contributions are: (1) Fine-grained visual-textual
pattern mining is devoted to discovering discriminative visual-textual pairwise
information for boosting categorization performance through jointly modeling
vision and text with generative adversarial networks (GANs), which
automatically and adaptively discovers discriminative parts. (2) Visual-textual
representation learning jointly combines visual and textual information, which
preserves the intra-modality and inter-modality information to generate
complementary fine-grained representations and further improves
categorization performance.
Comment: 12 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
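VTRL's pattern mining relies on GANs, which does not compress into a short snippet, so the sketch below illustrates only the inter-modality side with a generic hinge-based alignment loss between image and text embeddings. This is a deliberate simplification, not the paper's method.

```python
# Hedged sketch: align matched image/text embeddings above mismatched ones.
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) embeddings of matched image/description pairs.
    Pulls each image toward its own description and away from the others'."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)             # matched-pair scores
    cost = (margin + sim - pos).clamp(min=0)  # hinge on mismatched pairs
    cost.fill_diagonal_(0)
    return cost.mean()
```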
Dual Skipping Networks
Inspired by the recent neuroscience studies on the left-right asymmetry of
the human brain in processing low and high spatial frequency information, this
paper introduces a dual skipping network which carries out coarse-to-fine
object categorization. Such a network has two branches to simultaneously deal
with both coarse and fine-grained classification tasks. Specifically, we
propose a layer-skipping mechanism that learns a gating network to predict
which layers to skip in the testing stage. This layer-skipping mechanism endows
the network with good flexibility and capability in practice. Evaluations are
conducted on several widely used coarse-to-fine object categorization
benchmarks, and promising results are achieved by our proposed network model.
Comment: CVPR 2018 (poster); fix typo
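A minimal sketch of the layer-skipping idea: a tiny gating network, computed from pooled input features, decides per image which shape-preserving blocks to execute, with skipped blocks replaced by the identity. The gate design and hard threshold are assumptions reflecting the testing stage; training such a gate would need a differentiable relaxation (e.g. a straight-through estimator), which this sketch omits.

```python
# Hedged sketch of test-time layer skipping with a learned gate.
import torch
import torch.nn as nn

class SkippableStack(nn.Module):
    def __init__(self, blocks, gate_in=64):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # shape-preserving blocks
        self.gate = nn.Linear(gate_in, len(blocks))  # one on/off score per block

    def forward(self, x):
        g = torch.sigmoid(self.gate(x.mean(dim=(2, 3))))  # gate from pooled input
        for i, block in enumerate(self.blocks):
            keep = (g[:, i] > 0.5).float().view(-1, 1, 1, 1)
            x = keep * block(x) + (1 - keep) * x          # skip = identity pass
        return x
```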
Object Discovery From a Single Unlabeled Image by Mining Frequent Itemset With Multi-scale Features
The goal of our work is to discover dominant objects in a very general
setting where only a single unlabeled image is given. This is far more
challenging than typical co-localization or weakly-supervised localization tasks.
To tackle this problem, we propose a simple but effective pattern mining-based
method, called Object Location Mining (OLM), which exploits the advantages of
data mining and feature representation of pre-trained convolutional neural
networks (CNNs). Specifically, we first convert the feature maps from a
pre-trained CNN model into a set of transactions, and then discover frequent
patterns from the transaction database through pattern mining techniques. We
observe that these discovered patterns, i.e., co-occurring highlighted
regions, typically exhibit appearance and spatial consistency. Motivated by this
observation, we can easily discover and localize possible objects by merging
relevant meaningful patterns. Extensive experiments on a variety of benchmarks
demonstrate that OLM achieves competitive localization performance compared
with the state-of-the-art methods. We also compare our approach with
unsupervised saliency detection methods and achieve competitive results on
seven benchmark datasets. Moreover, we conduct experiments on fine-grained
classification to show that our proposed method can locate the entire object
and its parts accurately, which significantly benefits the classification
results.
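The transaction construction lends itself to a direct illustration: treat each spatial position of a feature map as a transaction containing the indices of its strongly firing channels, then keep the channel sets that occur frequently. The thresholds and the naive counting below are illustrative; a real implementation would use a proper miner such as Apriori or FP-Growth.

```python
# Hedged sketch: CNN feature maps -> transactions -> frequent channel pairs.
from collections import Counter
from itertools import combinations
import numpy as np

def mine_cooccurring_channels(feat, act_thresh=0.5, support=0.2, set_size=2):
    """feat: (C, H, W) feature map from a pre-trained CNN layer."""
    C, H, W = feat.shape
    flat = feat.reshape(C, -1)
    norm = flat / (flat.max(axis=1, keepdims=True) + 1e-8)
    transactions = [frozenset(np.flatnonzero(norm[:, p] > act_thresh))
                    for p in range(H * W)]   # one transaction per position
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), set_size):
            counts[itemset] += 1
    min_count = support * len(transactions)
    return [s for s, c in counts.items() if c >= min_count]
```

Positions supporting a frequent channel set can then be merged into candidate object regions, echoing the appearance and spatial consistency observation above.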
Top-Down Saliency Detection Driven by Visual Classification
This paper presents an approach for top-down saliency detection guided by
visual classification tasks. We first learn how to compute visual saliency when
a specific visual task has to be accomplished, as opposed to most
state-of-the-art methods which assess saliency merely through bottom-up
principles. Afterwards, we investigate if and to what extent visual saliency
can support visual classification in nontrivial cases. To achieve this, we
propose SalClassNet, a CNN framework consisting of two networks jointly
trained: a) the first one computing top-down saliency maps from input images,
and b) the second one exploiting the computed saliency maps for visual
classification. To test our approach, we collected a dataset of eye-gaze maps,
using a Tobii T60 eye tracker, by asking several subjects to look at images
from the Stanford Dogs dataset, with the objective of distinguishing dog
breeds. Performance analysis on our dataset and other saliency benchmarking
datasets, such as POET, showed that SalClassNet outperforms state-of-the-art
saliency detectors, such as SalNet and SALICON. Finally, we analyzed the
performance of SalClassNet in a fine-grained recognition task and found
that it generalizes better than existing visual classifiers. The achieved
results, thus, demonstrate that 1) conditioning saliency detectors with object
classes reaches state-of-the-art performance, and 2) providing explicitly
top-down saliency maps to visual classifiers enhances classification accuracy.
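The two-network coupling can be sketched as a saliency network whose output map is appended to the classifier's input channels, with one combined loss supervised by class labels and eye-gaze maps. The placeholder modules and the extra-channel coupling are assumptions, not SalClassNet's published design.

```python
# Hedged sketch: saliency map guides the classifier; both trained jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyGuidedClassifier(nn.Module):
    def __init__(self, saliency_net, classifier_net):
        super().__init__()
        self.saliency = saliency_net      # image -> (B, 1, H, W) saliency logits
        self.classifier = classifier_net  # must accept the extra saliency channel

    def forward(self, x):
        s = torch.sigmoid(self.saliency(x))
        guided = torch.cat([x, s], dim=1)  # feed the map as an extra channel
        return self.classifier(guided), s

def combined_loss(logits, smap, labels, gaze_maps, lam=0.5):
    cls = F.cross_entropy(logits, labels)
    sal = F.mse_loss(smap, gaze_maps)      # supervise with eye-gaze maps
    return cls + lam * sal                 # joint objective for both networks
```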
Two-View Fine-grained Classification of Plant Species
Automatic plant classification is a challenging problem due to the wide
biodiversity of the existing plant species in a fine-grained scenario. Powerful
deep learning architectures have been used to improve the classification
performance in such a fine-grained problem, but they usually build models that
are highly dependent on a large training dataset and are not scalable. In
this paper, we propose a novel method based on a two-view leaf image
representation and a hierarchical classification strategy for fine-grained
recognition of plant species. It uses the botanical taxonomy as a basis for a
coarse-to-fine strategy applied to identify the plant genus and species. The
two-view representation provides complementary global and local features of
leaf images. A deep metric based on Siamese convolutional neural networks is
used to reduce the dependence on a large number of training samples and make
the method scalable to new plant species. The experimental results on two
challenging fine-grained datasets of leaf images (i.e. LifeCLEF 2015 and
LeafSnap) have shown the effectiveness of the proposed method, which achieved
recognition accuracies of 0.87 and 0.96, respectively.
Comment: Submitted to Ecological Informatics
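A minimal sketch of the Siamese deep metric: a shared backbone embeds both leaf views and a contrastive loss pulls same-species pairs together while pushing different-species pairs apart. Because matching at test time needs only distances, new species can be handled without retraining, which is the scalability argument above. Architecture details and the margin are assumptions, not the paper's exact networks.

```python
# Hedged sketch of a Siamese embedder with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    def __init__(self, backbone, feat_dim=512, emb_dim=128):
        super().__init__()
        self.backbone = backbone                  # shared CNN for both inputs
        self.head = nn.Linear(feat_dim, emb_dim)

    def embed(self, x):
        return F.normalize(self.head(self.backbone(x)), dim=1)

    def forward(self, a, b):
        return self.embed(a), self.embed(b)

def contrastive_loss(ea, eb, same, margin=1.0):
    """same: (B,) 1.0 if the two leaves share a species, else 0.0."""
    d = (ea - eb).pow(2).sum(1).sqrt()            # pairwise embedding distance
    pull = same * d.pow(2)                        # same species: reduce distance
    push = (1 - same) * (margin - d).clamp(min=0).pow(2)  # others: enforce margin
    return (pull + push).mean()
```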