Improving Facial Attribute Prediction using Semantic Segmentation
Attributes are semantically meaningful characteristics whose applicability
widely crosses category boundaries. They are particularly important in
describing and recognizing concepts where no explicit training example is
given, e.g., zero-shot learning. Additionally, since attributes are
human describable, they can be used for efficient human-computer interaction.
In this paper, we propose to employ semantic segmentation to improve facial
attribute prediction. The core idea lies in the fact that many facial
attributes describe local properties. In other words, the probability of an
attribute appearing in a face image is far from uniform in the spatial
domain. We build our facial attribute prediction model jointly with a deep
semantic segmentation network. This harnesses the localization cues learned by
the semantic segmentation to guide the attention of the attribute prediction to
the regions where different attributes naturally show up. As a result of this
approach, in addition to recognition, we are able to localize the attributes,
despite merely having access to image level labels (weak supervision) during
training. We evaluate our proposed method on the CelebA and LFWA datasets and
achieve superior results to prior work. Furthermore, we show that in the
reverse problem, semantic face parsing improves when facial attributes are
available. This reaffirms the need to jointly model these two interconnected
tasks.
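To make the localization idea concrete, the following is a minimal PyTorch-style sketch of pooling attribute features under soft segmentation masks; the function name, the soft-mask weighting, and the tensor shapes are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def mask_guided_pooling(features, seg_logits):
    """Pool attribute-branch activations over regions proposed by a
    segmentation branch (illustrative sketch, not the paper's exact layer).

    features:   (B, C, H, W) activations from the attribute network.
    seg_logits: (B, K, h, w) logits from the segmentation network,
                K semantic regions (e.g. hair, eyes, mouth).
    Returns a (B, K * C) descriptor: one C-dim feature vector per region.
    """
    B, C, H, W = features.shape
    masks = F.softmax(seg_logits, dim=1)                      # soft region masks
    masks = F.interpolate(masks, size=(H, W), mode="bilinear",
                          align_corners=False)
    # Mask-weighted average of the features under each semantic region.
    num = torch.einsum("bkhw,bchw->bkc", masks, features)
    den = masks.sum(dim=(2, 3)).clamp_min(1e-6).unsqueeze(-1)
    return (num / den).reshape(B, -1)
```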
Persistent Evidence of Local Image Properties in Generic ConvNets
Supervised training of a convolutional network for object classification
should make explicit any information related to the class of objects and
disregard any auxiliary information associated with the capture of the image or
the variation within the object class. Does this happen in practice? Although
this seems to pertain to the very final layers in the network, if we look at
earlier layers we find that this is not the case. Surprisingly, strong spatial
information is implicit. This paper addresses this issue, in particular by
exploiting the image representation at the first fully connected layer, i.e.
the global image descriptor, which has recently been shown to be most effective
in a range of visual recognition tasks. We empirically demonstrate evidence for
this finding in the context of four different tasks: 2D landmark detection, 2D
object keypoint prediction, estimation of the RGB values of the input image,
and recovery of the semantic label of each pixel. We base our investigation on
a simple framework with ridge regression used commonly across these tasks, and
show results that all support our insight. Such spatial information can be used
to compute landmark correspondences with good accuracy, and it could
potentially be useful for improving the training of convolutional networks for
classification purposes.
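The probing protocol described above amounts to fitting a linear model from the fc-layer descriptor to the spatial quantity of interest. Below is a hedged sketch using scikit-learn's Ridge with placeholder data; the descriptor dimension, image size, and train/test split are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder data: in practice X would hold (N, D) descriptors taken from
# the first fully connected layer of a pretrained ConvNet, and Y the
# (N, 2 * L) ground-truth (x, y) coordinates of L landmarks per image.
rng = np.random.default_rng(0)
N, D, L = 1000, 4096, 5
X = rng.normal(size=(N, D))
Y = rng.uniform(0, 224, size=(N, 2 * L))

probe = Ridge(alpha=1.0)          # regularization strength is a free choice
probe.fit(X[:800], Y[:800])       # fit the linear probe on a training split
pred = probe.predict(X[800:])

# Mean localization error in pixels; on real descriptors a low error would
# indicate that the "global" representation still encodes precise locations.
err = np.linalg.norm(pred.reshape(-1, 2) - Y[800:].reshape(-1, 2), axis=1).mean()
print(f"mean landmark error: {err:.2f} px")
```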
Face Captioning Using Prominent Feature Recognition
Humans rely on prominent feature recognition to correctly identify and describe previously seen faces. Despite this fact, there is little existing work investigating how prominent facial features can be automatically recognized and used to create natural language face descriptions. Facial attribute prediction, a more commonly studied problem in computer vision, has previously been used for this task. However, the evaluation metrics and baseline models currently used to compare different attribute prediction methods are insufficient for determining which approaches are best at classifying highly imbalanced attributes. We also show that CelebA, the largest and most widely used facial attribute dataset, is too poorly labeled to be suitable for prominent feature recognition. To deal with these issues, we propose a method for generating weak prominent feature labels using semantic segmentation and show that we can use these labels to improve attribute-based face description.
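The abstract does not spell out its weak-labeling rule, so the following is only one plausible, purely illustrative reading: mark an attribute as prominent when the area of its associated segmentation region is unusually large relative to the dataset. The attribute-to-region mapping and the z-score threshold below are assumptions, not the paper's method.

```python
import numpy as np

def weak_prominence_labels(region_areas, attr_to_region, z_thresh=1.5):
    """Illustrative weak-labeling rule (an assumption, not the paper's method).

    region_areas:   (N, K) pixel counts per semantic region for N faces,
                    computed from predicted segmentation masks.
    attr_to_region: hypothetical mapping from attribute name to region index,
                    e.g. {"big_nose": 2, "big_lips": 4}.
    Returns a dict of (N,) boolean arrays, one weak label vector per attribute.
    """
    mu = region_areas.mean(axis=0)
    sigma = region_areas.std(axis=0) + 1e-6
    z = (region_areas - mu) / sigma          # how atypical each region's size is
    return {attr: z[:, idx] > z_thresh for attr, idx in attr_to_region.items()}
```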
Cross-stitch Networks for Multi-task Learning
Multi-task learning in Convolutional Networks has displayed remarkable
success in the field of recognition. This success can be largely attributed to
learning shared representations from multiple supervisory tasks. However,
existing multi-task approaches rely on enumerating multiple network
architectures specific to the tasks at hand, which do not generalize. In this
paper, we propose a principled approach to learn shared representations in
ConvNets using multi-task learning. Specifically, we propose a new sharing
unit, the "cross-stitch" unit. These units combine the activations from multiple
networks and can be trained end-to-end. A network with cross-stitch units can
learn an optimal combination of shared and task-specific representations. Our
proposed method generalizes across multiple tasks and shows dramatically
improved performance over baseline methods for categories with few training
examples. Comment: To appear in CVPR 2016 (Spotlight).
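The cross-stitch unit itself is a small linear mixing step between two task-specific networks. Below is a minimal PyTorch-style sketch of a per-layer scalar variant; the near-identity initialization is an assumption, and the paper also considers richer (e.g. per-channel) mixing.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learns a 2x2 mixing matrix that linearly combines the activations of
    two task-specific networks at a given layer (per-layer scalar sketch)."""

    def __init__(self):
        super().__init__()
        # Near-identity initialization (an assumption): each task starts out
        # relying mostly on its own activations.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        # x_a, x_b: same-shaped activations from the two task networks.
        out_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        out_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return out_a, out_b
```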
On Symbiosis of Attribute Prediction and Semantic Segmentation
In this paper, we propose to employ semantic segmentation to improve
person-related attribute prediction. The core idea lies in the fact that the
probability of an attribute appearing in an image is far from uniform in
the spatial domain. We build our attribute prediction model jointly with a deep
semantic segmentation network. This harnesses the localization cues learned by
the semantic segmentation to guide the attention of the attribute prediction to
the regions where different attributes naturally show up. Therefore, in
addition to prediction, we are able to localize the attributes despite merely
having access to image-level labels (weak supervision) during training. We
first propose semantic segmentation-based pooling and gating, respectively
denoted as SSP and SSG. In the former, the estimated segmentation masks are
used to pool the final activations of the attribute prediction network, from
multiple semantically homogeneous regions. In SSG, the same idea is applied to
the intermediate layers of the network. SSP and SSG, while effective, impose
heavy memory utilization since each channel of the activations is pooled/gated
with all the semantic segmentation masks. To circumvent this, we propose
Symbiotic Augmentation (SA), where we learn only one mask per activation
channel. SA allows the model to either pick one, or combine (weighted
superposition) multiple semantic maps, in order to generate the proper mask for
each channel. SA simultaneously applies the same mechanism to the reverse
problem by leveraging output logits of attribute prediction to guide the
semantic segmentation task. We evaluate our proposed methods for facial
attributes on the CelebA and LFWA datasets, while benchmarking the WIDER
Attribute and Berkeley Attributes of People datasets for whole-body attributes.
Our proposed methods achieve superior results compared to previous works.
Comment: Accepted for publication in PAMI. arXiv admin note: substantial text
overlap with arXiv:1704.0874
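A rough PyTorch-style sketch of the channel-wise masking behind Symbiotic Augmentation is given below: instead of gating every channel with all K segmentation masks, each channel receives one learned superposition of the semantic maps. Parameterizing that combination as a 1x1 convolution and using a sigmoid gate are assumptions, not the exact formulation in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseSemanticGate(nn.Module):
    """One learned mask per activation channel, built as a weighted
    superposition of K semantic maps (illustrative sketch of the SA idea)."""

    def __init__(self, num_seg_classes, num_channels):
        super().__init__()
        # Learns, for each channel, how to combine the K semantic maps.
        self.mix = nn.Conv2d(num_seg_classes, num_channels,
                             kernel_size=1, bias=False)

    def forward(self, features, seg_logits):
        # features: (B, C, H, W) attribute activations; seg_logits: (B, K, h, w).
        seg = F.softmax(seg_logits, dim=1)
        seg = F.interpolate(seg, size=features.shape[2:], mode="bilinear",
                            align_corners=False)
        masks = torch.sigmoid(self.mix(seg))   # (B, C, H, W), one mask/channel
        return features * masks                # gate the activations
```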