Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories
belonging to the same basic-level category, which is a highly challenging task
due to the quite subtle and local visual distinctions among similar
subcategories. Most existing methods generally learn part detectors to discover
discriminative regions for better categorization performance. However, not all
parts are beneficial and indispensable for visual categorization, and the
setting of part detector number heavily relies on prior knowledge as well as
experimental validation. When we describe the object in an image with text,
we mainly focus on its pivotal characteristics and rarely attend to common
characteristics or background areas. This is an involuntary transfer from
human visual attention to textual attention, so textual attention indicates
how many and which parts are discriminative and significant for
categorization. Textual attention can therefore help discover visual
attention in images. Inspired by
this, we propose a fine-grained visual-textual representation learning (VTRL)
approach, and its main contributions are: (1) Fine-grained visual-textual
pattern mining discovers discriminative visual-textual pairwise
information to boost categorization performance by jointly modeling
vision and text with generative adversarial networks (GANs), which
automatically and adaptively discovers discriminative parts. (2) Visual-textual
representation learning jointly combines visual and textual information, which
preserves the intra-modality and inter-modality information to generate a
complementary fine-grained representation and further improves
categorization performance.
Comment: 12 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technology (TCSVT)
Part-Aware Fine-grained Object Categorization using Weakly Supervised Part Detection Network
Fine-grained object categorization aims to distinguish objects of
subordinate categories that belong to the same entry-level object category. The
task is challenging because (1) training images with ground-truth
labels are difficult to obtain, and (2) variations among different subordinate
categories are subtle. It is well established that the characterizing features of
different subordinate categories are located on local parts of object
instances. In fact, careful part annotations are available in many fine-grained
categorization datasets. However, manually annotating object parts requires
expertise and is difficult to generalize to new fine-grained
categorization tasks. In this work, we propose a Weakly Supervised Part
Detection Network (PartNet) that is able to detect discriminative local parts
for use in fine-grained categorization. A vanilla PartNet adds two parallel
streams of upper network layers on top of a base subnetwork; these streams
respectively compute scores of classification probabilities (over subordinate
categories) and detection probabilities (over a specified number of
discriminative part detectors) for local regions of interest (RoIs). The
image-level prediction is obtained by aggregating element-wise products of
these region-level probabilities. To generate a diverse set of RoIs as inputs
to PartNet, we propose a simple Discretized Part Proposals (DPP) module that
directly proposes candidates for discriminative local parts, with
no bridging via object-level proposals. Experiments on the benchmark
CUB-200-2011 and Oxford Flower 102 datasets show the efficacy of our proposed
method for both discriminative part detection and fine-grained categorization.
In particular, we achieve new state-of-the-art performance on the CUB-200-2011
dataset when ground-truth part annotations are not available.
Comment: TMM paper version. Code is available at:
https://github.com/YBZh/PartNe
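
The two-stream aggregation described in the PartNet abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how each stream is normalized or how the class and detector dimensions are matched, so the sketch assumes a softmax over categories per RoI for the classification stream, a softmax over RoIs per part detector for the detection stream, and an average over detectors. The function names `softmax` and `image_level_prediction` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_prediction(cls_scores, det_scores):
    """Aggregate region-level scores into an image-level class distribution.

    cls_scores: R x C raw scores (RoIs x subordinate categories).
    det_scores: R x K raw scores (RoIs x part detectors).
    """
    num_rois = len(cls_scores)
    num_classes = len(cls_scores[0])
    num_parts = len(det_scores[0])

    # Classification stream: softmax over categories, separately per RoI.
    cls_prob = [softmax(row) for row in cls_scores]

    # Detection stream (assumption): softmax over RoIs, separately per
    # part detector, so each detector spreads unit weight across regions.
    det_prob = [
        softmax([det_scores[r][k] for r in range(num_rois)])
        for k in range(num_parts)
    ]

    # Weight each RoI's class probabilities by its detection probability,
    # sum over RoIs, and average over detectors (assumption).
    image_prob = [0.0] * num_classes
    for k in range(num_parts):
        for r in range(num_rois):
            for c in range(num_classes):
                image_prob[c] += cls_prob[r][c] * det_prob[k][r] / num_parts
    return image_prob
```

Under these assumptions the output sums to one, and a RoI that a detector scores highly dominates the image-level prediction, which is the behavior the abstract attributes to aggregating products of region-level probabilities.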