Attribute-Aware Attention Model for Fine-grained Representation Learning
How to learn a discriminative fine-grained representation is a key point in
many computer vision applications, such as person re-identification,
fine-grained classification, and fine-grained image retrieval. Most previous
methods focus on learning metrics or ensembles to derive a better global
representation, which usually lacks local information. Based on these
considerations, we propose a novel Attribute-Aware Attention Model
(A^3M), which can learn local attribute representation and global category
representation simultaneously in an end-to-end manner. The proposed model
contains two attention modules: an attribute-guided attention module uses
attribute information to help select category features in different regions,
while a category-guided attention module selects local features of different
attributes with the help of category cues. Through this attribute-category
reciprocal process, local and global features benefit from each other. Finally,
the resulting feature captures information intrinsic to image recognition
rather than noisy and irrelevant cues. Extensive experiments conducted
on Market-1501, CompCars, CUB-200-2011 and CARS196 demonstrate the
effectiveness of our A^3M. Code is available at
https://github.com/iamhankai/attribute-aware-attention.
Comment: Accepted by ACM Multimedia 2018 (Oral). Code is available at
https://github.com/iamhankai/attribute-aware-attention
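As a rough illustration of the reciprocal mechanism described above, here is a minimal PyTorch sketch of one attribute-guided and one category-guided attention step. The module name, shapes, and gating form are assumptions made for this sketch, not the authors' A^3M code.

```python
import torch
import torch.nn as nn

class ReciprocalAttention(nn.Module):
    """Toy sketch of attribute-guided and category-guided attention.

    local_feats: (B, N, D) region descriptors; attr_cue/cat_cue: (B, D).
    All shapes and names are illustrative, not the A^3M reference code.
    """
    def __init__(self, dim):
        super().__init__()
        self.attr_gate = nn.Linear(dim, dim)  # attribute cue -> query over regions
        self.cat_gate = nn.Linear(dim, dim)   # category cue -> query over regions

    def forward(self, local_feats, attr_cue, cat_cue):
        # Attribute-guided attention: score each region against the attribute cue,
        # then pool regions into a category feature.
        q_a = self.attr_gate(attr_cue).unsqueeze(1)               # (B, 1, D)
        w_a = torch.softmax((local_feats * q_a).sum(-1), dim=1)   # (B, N)
        category_feat = (w_a.unsqueeze(-1) * local_feats).sum(1)  # (B, D)

        # Category-guided attention: score regions against the category cue,
        # then pool regions into a local attribute feature.
        q_c = self.cat_gate(cat_cue).unsqueeze(1)
        w_c = torch.softmax((local_feats * q_c).sum(-1), dim=1)
        attribute_feat = (w_c.unsqueeze(-1) * local_feats).sum(1)
        return category_feat, attribute_feat

# Usage: 49 regions of 256-d features, with 256-d attribute and category cues.
m = ReciprocalAttention(256)
cat_f, attr_f = m(torch.randn(2, 49, 256), torch.randn(2, 256), torch.randn(2, 256))
```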
Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval
With the increasing number of online stores, there is a pressing need for
intelligent search systems to understand the item photos snapped by customers
and search against large-scale product databases to find their desired items.
However, it is challenging for conventional retrieval systems to match up the
item photos captured by customers and the ones officially released by stores,
especially for garment images. To bridge the customer- and store-provided
garment photos, existing studies widely exploit clothing attributes (e.g.,
black) and landmarks (e.g., collar) to learn a common embedding space for
garment representations. Unfortunately, they omit the sequential correlation
of attributes and consume a large quantity of human labor to label the
landmarks. In this paper, we propose a deep
multi-task cross-domain hashing termed \textit{DMCH}, in which cross-domain
embedding and sequential attribute learning are modeled simultaneously.
Sequential attribute learning not only provides the semantic guidance for
embedding, but also generates rich attention on discriminative local details
(e.g., black buttons) of clothing items without requiring extra landmark
labels. This leads to promising performance and a 306x boost in efficiency
compared with state-of-the-art models, as demonstrated through rigorous
experiments on two public fashion datasets.
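To make the multi-task design concrete, the following hedged PyTorch sketch pairs a relaxed hashing head with an LSTM that emits one attribute prediction per step. Layer sizes, the tanh relaxation of the discrete codes, and the fixed number of steps are assumptions for the sketch, not DMCH's actual architecture.

```python
import torch
import torch.nn as nn

class DMCHSketch(nn.Module):
    """Illustrative multi-task head: relaxed hash codes plus an LSTM that
    predicts garment attributes as a sequence (illustrative only)."""
    def __init__(self, feat_dim=512, hash_bits=48, n_attrs=100, hid=256):
        super().__init__()
        self.hash_head = nn.Linear(feat_dim, hash_bits)
        self.attr_rnn = nn.LSTM(feat_dim, hid, batch_first=True)
        self.attr_cls = nn.Linear(hid, n_attrs)

    def forward(self, feat, steps=5):
        code = torch.tanh(self.hash_head(feat))           # relaxed code in (-1, 1)
        seq_in = feat.unsqueeze(1).expand(-1, steps, -1)  # image feature fed at each step
        h, _ = self.attr_rnn(seq_in)
        attr_logits = self.attr_cls(h)                    # (B, steps, n_attrs)
        return code, attr_logits

m = DMCHSketch()
code, attrs = m(torch.randn(4, 512))
binary = code.sign()  # at retrieval time, codes are binarized for fast Hamming matching
```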
Hierarchical Feature Embedding for Attribute Recognition
Attribute recognition is a crucial but challenging task due to viewpoint
changes, illumination variations, and appearance diversity. Most previous
work considers only attribute-level feature embedding, which may perform
poorly under complicated, heterogeneous conditions. To address this
problem, we propose a hierarchical feature embedding (HFE) framework, which
learns a fine-grained feature embedding by combining attribute and ID
information. In HFE, we maintain the inter-class and intra-class feature
embedding simultaneously. Samples with the same attribute, and especially
samples with the same ID, are gathered more closely, which constrains the
feature embedding of visually hard samples with respect to attributes and
improves robustness to varying conditions. We establish this hierarchical
structure with an HFE loss consisting of attribute-level and ID-level
constraints. We also introduce an absolute boundary regularization and a
dynamic loss weight as supplementary components to help build up the feature
embedding. Experiments show that our method achieves the state-of-the-art
results on two pedestrian attribute datasets and a facial attribute dataset.
Comment: CVPR 2020
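The two-level constraint can be pictured with a short sketch: samples sharing an ID should sit closer to the anchor than samples merely sharing an attribute, which in turn sit closer than negatives. The margin values and exact hinge form below are illustrative assumptions, not the paper's HFE loss.

```python
import torch
import torch.nn.functional as F

def hfe_loss_sketch(anchor, pos_id, pos_attr, neg, m_id=0.2, m_attr=0.4):
    """Hedged sketch of a two-level embedding loss: same-ID samples are pulled
    closer than same-attribute samples, which are closer than negatives.
    Margins are illustrative, not the paper's values."""
    d = lambda a, b: (a - b).pow(2).sum(-1)  # squared Euclidean distance
    id_term = F.relu(d(anchor, pos_id) - d(anchor, pos_attr) + m_id)   # ID level
    attr_term = F.relu(d(anchor, pos_attr) - d(anchor, neg) + m_attr)  # attribute level
    return (id_term + attr_term).mean()

loss = hfe_loss_sketch(torch.randn(8, 128), torch.randn(8, 128),
                       torch.randn(8, 128), torch.randn(8, 128))
```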
Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition
A key challenge in fine-grained recognition is how to find and represent
discriminative local regions. Recent attention models are capable of learning
discriminative region localizers only from category labels with reinforcement
learning. However, without utilizing any explicit part information, they
cannot accurately find multiple distinctive regions. In this work, we
introduce an attribute-guided attention localization scheme where the local
region localizers are learned under the guidance of part attribute
descriptions. By designing a novel reward strategy, we learn to locate
regions that are spatially and semantically distinctive using a reinforcement
learning algorithm. The scheme's attribute labeling requirement is less
demanding than the accurate part location annotations required by traditional
part-based fine-grained recognition methods. Experimental results
on the CUB-200-2011 dataset demonstrate the superiority of the proposed scheme
on both fine-grained recognition and attribute recognition.
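A toy version of the reward idea: a candidate region earns a reward when it supports both the category prediction and its assigned part-attribute prediction, and that reward reinforces the localizer's action. The weighting and the probability-based reward below are assumptions, not the paper's exact strategy.

```python
import torch

def region_reward_sketch(cls_probs, attr_probs, labels, attr_labels, w=0.5):
    """Illustrative reward for a region localizer trained with REINFORCE:
    reward a region that helps both category and part-attribute prediction.
    The weighting w and reward form are assumptions."""
    cls_r = cls_probs.gather(1, labels.unsqueeze(1)).squeeze(1)         # p(true category)
    attr_r = attr_probs.gather(1, attr_labels.unsqueeze(1)).squeeze(1)  # p(true attribute)
    return (1 - w) * cls_r + w * attr_r  # higher reward -> reinforce the action

# Usage with CUB-like sizes: 200 categories, 312 binary part attributes.
r = region_reward_sketch(torch.softmax(torch.randn(4, 200), 1),
                         torch.softmax(torch.randn(4, 312), 1),
                         torch.randint(0, 200, (4,)), torch.randint(0, 312, (4,)))
```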
ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
Person search by natural language aims at retrieving a specific person in a
large-scale image pool that matches the given textual descriptions. While most
of the current methods treat the task as a holistic visual and textual feature
matching one, we approach it from an attribute-aligning perspective that allows
grounding specific attribute phrases to the corresponding visual regions. We
achieve success, as well as a performance boost, through robust feature
learning in which the referred identity can be accurately grounded by
multiple attribute visual cues. Concretely, our Visual-Textual Attribute
Alignment model (dubbed ViTAA) learns to disentangle the feature space of a
person into attribute-specific subspaces using a lightweight auxiliary
attribute segmentation branch. It then aligns these visual features with the
textual attributes parsed from the sentences by using a novel contrastive
learning loss. Upon that, we validate our ViTAA framework through extensive
experiments on tasks of person search by natural language and by
attribute-phrase queries, on which our system achieves state-of-the-art
performance. Code will be publicly available upon publication.
Comment: ECCV 2020, 18 pages, 6 figures
Fine-grained Apparel Classification and Retrieval without rich annotations
The ability to correctly classify and retrieve apparel images has a variety
of applications important to e-commerce, online advertising and internet
search. In this work, we propose a robust framework for fine-grained apparel
classification and for in-shop and cross-domain retrieval that eliminates the
need for rich annotations such as bounding boxes, human joints, or clothing
landmarks, and the training of a bounding-box or key-landmark detector.
Factors such as subtle appearance differences, variations in human poses,
different shooting angles, apparel deformations, and self-occlusion add to the
challenges in classification and retrieval of apparel items. Cross-domain
retrieval is even harder due to the large variation between online shopping
images, usually taken under ideal lighting, pose, and angle against a clean
background, and street photos captured by users under complicated conditions
with poor lighting and cluttered scenes. Our framework
uses a compact bilinear CNN with the tensor sketch algorithm to generate
embeddings that capture local pairwise feature interactions in a
translation-invariant manner. For apparel classification, we pass the feature
embeddings through a softmax classifier, while the in-shop and cross-domain
retrieval pipelines use a triplet-loss-based optimization approach, such that
the squared Euclidean distance between embeddings measures the dissimilarity
between images. Unlike previous works that rely on bounding-box,
key-clothing-landmark, or human-joint detectors to assist the final deep
classifier, the proposed framework can be trained directly on the provided
category labels or on generated triplets for triplet-loss optimization.
Lastly, experimental results on the DeepFashion
fine-grained categorization, and in-shop and consumer-to-shop retrieval
datasets provide a comparative analysis with previous work performed in the
domain.
Comment: 14 pages, 6 figures, 3 tables, submitted to the Springer Journal of
Applied Intelligence
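The retrieval objective stated above, squared Euclidean distance as dissimilarity under a triplet loss, can be written in a few lines. The margin value and the L2 normalization of embeddings are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def triplet_sq_euclidean(anchor, positive, negative, margin=0.3):
    """Sketch of the retrieval objective: squared Euclidean distance between
    embeddings as dissimilarity, hinged with a margin (margin is assumed)."""
    d_ap = (anchor - positive).pow(2).sum(-1)  # anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(-1)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

emb = lambda x: F.normalize(x, dim=-1)  # embeddings are typically L2-normalized
loss = triplet_sq_euclidean(emb(torch.randn(8, 512)), emb(torch.randn(8, 512)),
                            emb(torch.randn(8, 512)))
```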
MagnifierNet: Towards Semantic Adversary and Fusion for Person Re-identification
Although person re-identification (ReID) has achieved significant improvement
recently by enforcing part alignment, it is still a challenging task when it
comes to distinguishing visually similar identities or identifying occluded
persons. In these scenarios, magnifying the details in each part feature and
selectively fusing them may provide a feasible solution. In this work,
we propose MagnifierNet, a triple-branch network which accurately mines details
from whole to parts. Firstly, the holistic salient features are encoded by a
global branch. Secondly, to enhance detailed representation for each semantic
region, the "Semantic Adversarial Branch" is designed to learn from dynamically
generated semantic-occluded samples during training. Meanwhile, we introduce
a "Semantic Fusion Branch" that filters out irrelevant noise by selectively
fusing semantic region information in sequence. To further improve feature diversity,
we introduce a novel loss function, the "Semantic Diversity Loss", to remove
redundant overlaps across the learned semantic representations.
State-of-the-art performance is achieved on three benchmarks by large margins;
specifically, the mAP score is improved by 6% and 5% on the most challenging
CUHK03-L and CUHK03-D benchmarks, respectively.
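One plausible reading of such a diversity loss is a penalty on the pairwise overlap between the K learned semantic representations, e.g., their squared cosine similarities. The exact form used in MagnifierNet may differ; this is a hedged sketch.

```python
import torch
import torch.nn.functional as F

def semantic_diversity_sketch(sem_feats):
    """Toy diversity penalty: discourage overlap between the K semantic
    representations by penalizing pairwise cosine similarity.
    sem_feats: (B, K, D). The form is an assumption, not MagnifierNet's loss."""
    f = F.normalize(sem_feats, dim=-1)
    sim = f @ f.transpose(1, 2)                       # (B, K, K) cosine similarities
    k = sim.size(-1)
    off_diag = sim - torch.eye(k, device=sim.device)  # zero out self-similarity
    return off_diag.pow(2).sum(dim=(1, 2)).mean() / (k * (k - 1))

loss = semantic_diversity_sketch(torch.randn(4, 6, 256))  # 6 semantic regions
```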
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing together, which has attracted the attention of many
researchers. Typical approaches encode an image into feature representations
and decode them into natural language sentences, but they neglect high-level
semantic concepts and the subtle relationships between image regions and
natural language elements. To make full use of this information, this paper
exploits text-guided attention and semantic-guided attention (SA) to find
more correlated spatial information and reduce the semantic gap between
vision and language. Our method includes two levels of
attention networks. One is the text-guided attention network which is used to
select the text-related regions. The other is SA network which is used to
highlight the concept-related regions and the region-related concepts. Finally,
all of this information is incorporated to generate captions or answers.
Practically, image captioning and visual question answering experiments have
been carried out, and the experimental results have shown the excellent
performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
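The text-guided attention network can be pictured as a single dot-product attention over region features conditioned on a sentence feature. The dimensions and the single-head form below are assumptions for the sketch, not the paper's exact network.

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Minimal sketch of attending over image regions with a text query;
    single-head dot-product attention with assumed dimensions."""
    def __init__(self, txt_dim, img_dim):
        super().__init__()
        self.proj = nn.Linear(txt_dim, img_dim)  # map text into the visual space

    def forward(self, regions, text):
        # regions: (B, N, img_dim) region features; text: (B, txt_dim) sentence feature
        q = self.proj(text).unsqueeze(1)                 # (B, 1, img_dim)
        w = torch.softmax((regions * q).sum(-1), dim=1)  # (B, N) region relevance
        return (w.unsqueeze(-1) * regions).sum(1)        # text-conditioned visual feature

att = TextGuidedAttention(300, 512)
ctx = att(torch.randn(2, 36, 512), torch.randn(2, 300))  # 36 regions per image
```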
A-Lamp: Adaptive Layout-Aware Multi-Patch Deep Convolutional Neural Network for Photo Aesthetic Assessment
Deep convolutional neural networks (CNN) have recently been shown to generate
promising results for aesthetics assessment. However, the performance of these
deep CNN methods is often compromised by the constraint that the neural network
only takes the fixed-size input. To accommodate this requirement, input images
need to be transformed via cropping, warping, or padding, which often alter
image composition, reduce image resolution, or cause image distortion. Thus,
the aesthetics of the original images are impaired by the potential loss of
fine-grained details and holistic image layout, both of which are critical
for evaluating an image's aesthetics. In
this paper, we present an Adaptive Layout-Aware Multi-Patch Convolutional
Neural Network (A-Lamp CNN) architecture for photo aesthetic assessment. This
novel scheme accepts arbitrarily sized images and learns from both
fine-grained details and holistic image layout simultaneously. To enable
training on these hybrid inputs, we develop a dedicated double-subnet neural
network structure, i.e., a Multi-Patch subnet and a
Layout-Aware subnet. We further construct an aggregation layer to effectively
combine the hybrid features from these two subnets. Extensive experiments on
the large-scale aesthetics assessment benchmark (AVA) demonstrate significant
performance improvement over the state-of-the-art in photo aesthetic
assessment.
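A compact way to picture the double-subnet design: a shared patch subnet encodes several fixed-size crops from the arbitrarily sized input, and an aggregation layer fuses them with a layout descriptor into an aesthetic score. All layer sizes below are placeholders, not A-Lamp's architecture.

```python
import torch
import torch.nn as nn

class MultiPatchSketch(nn.Module):
    """Illustrative double-subnet idea: a shared CNN scores several patches
    and an aggregation layer combines them with a layout vector.
    All sizes are assumptions, not the A-Lamp architecture."""
    def __init__(self, n_patches=5, layout_dim=16):
        super().__init__()
        self.patch_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32))
        self.agg = nn.Linear(n_patches * 32 + layout_dim, 1)  # aggregation layer

    def forward(self, patches, layout):
        # patches: (B, P, 3, H, W) fixed-size crops; layout: (B, layout_dim) layout code
        b, p = patches.shape[:2]
        f = self.patch_net(patches.flatten(0, 1)).view(b, p * 32)
        return self.agg(torch.cat([f, layout], dim=1))  # scalar aesthetic score

m = MultiPatchSketch()
score = m(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 16))
```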
Pre-training of Context-aware Item Representation for Next Basket Recommendation
Next basket recommendation, which aims to predict the next few items that a
user will most probably purchase given their historical transactions, plays a
vital role in market basket analysis. From an item's viewpoint, it can be
purchased by different users, together with different items, and for different
reasons. Therefore, an ideal recommender system should represent an item
considering its transaction contexts. Existing state-of-the-art deep learning
methods usually adopt static item representations, which are invariant
across all transactions and thus cannot realize the full potential of
deep learning. Inspired by the pre-trained representations of BERT in natural
language processing, we propose to conduct context-aware item representation
for next basket recommendation, called Item Encoder Representations from
Transformers (IERT). In the offline phase, IERT pre-trains deep item
representations conditioning on their transaction contexts. In the online
recommendation phase, the pre-trained model is further fine-tuned with an
additional output layer. The output contextualized item embeddings are used
to capture users' sequential behaviors and general tastes for recommendation.
Experimental results on the Ta-Feng dataset show that IERT outperforms
state-of-the-art baseline methods, demonstrating its effectiveness in next
basket recommendation.
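The BERT-style pre-training can be sketched as masked-item prediction over a transaction: contextualize item embeddings with a transformer encoder and predict the items at masked positions. The vocabulary size, depth, and mask-token convention below are assumptions, not IERT's exact setup.

```python
import torch
import torch.nn as nn

class ItemEncoderSketch(nn.Module):
    """BERT-style sketch: contextualize item embeddings within a transaction
    and pre-train by predicting masked items. Sizes and the mask-token
    convention (id 0) are assumptions, not IERT's exact setup."""
    def __init__(self, n_items=10000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_items, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, n_items)  # output layer predicting the masked item

    def forward(self, basket_ids):
        # basket_ids: (B, L) item ids in a transaction, with masked slots set to 0
        h = self.enc(self.emb(basket_ids))
        return self.out(h)  # (B, L, n_items) logits per position

m = ItemEncoderSketch()
logits = m(torch.randint(0, 10000, (4, 8)))
# In practice, the loss is evaluated only at masked positions; all positions
# are scored here for brevity.
loss = nn.functional.cross_entropy(logits.flatten(0, 1), torch.randint(0, 10000, (32,)))
```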