Person Recognition in Personal Photo Collections
People nowadays share large parts of their personal lives through social
media. Being able to automatically recognise people in personal photos may
greatly enhance user convenience by easing photo album organisation. For the
task of human identification, however, the traditional focus of computer vision
has been face recognition and pedestrian re-identification. Person recognition in social
media photos sets new challenges for computer vision, including non-cooperative
subjects (e.g. backward viewpoints, unusual poses) and great changes in
appearance. To tackle this problem, we build a simple person recognition
framework that leverages convnet features from multiple image regions (head,
body, etc.). We propose new recognition scenarios that focus on the time and
appearance gap between training and testing samples. We present an in-depth
analysis of the importance of different features according to time and
viewpoint generalisability. In the process, we verify that our simple approach
achieves state-of-the-art results on the PIPA benchmark, arguably the
largest social media based benchmark for person recognition to date with
diverse poses, viewpoints, social groups, and events.
Compared to the conference version of the paper, this paper additionally
presents (1) an analysis of a face recogniser (DeepID2+), (2) a new method, naeil2,
that combines the conference version method naeil and DeepID2+ to achieve state
of the art results even compared to post-conference works, (3) discussion of
related work since the conference version, (4) additional analysis including
the head viewpoint-wise breakdown of performance, and (5) results on the
open-world setup.
Comment: 18 pages, 20 figures; to appear in IEEE Transactions on Pattern
Analysis and Machine Intelligence
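The multi-region idea described above (convnet features from head, body, etc.)
can be sketched in a few lines. This is a toy illustration only: the feature
dimensions, region names, and random weights stand in for trained convnet
features and per-identity classifiers, and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-region convnet features for one photo (names assumed).
regions = {
    "head": rng.standard_normal(256),
    "upper_body": rng.standard_normal(256),
    "full_body": rng.standard_normal(256),
}

# Concatenate region features into one person descriptor, so that body and
# context cues can compensate when the face is not visible.
descriptor = np.concatenate(
    [regions[r] for r in ["head", "upper_body", "full_body"]]
)

# Score the descriptor against per-identity linear classifiers
# (one weight vector per identity; random stand-ins here).
num_identities = 5
W = rng.standard_normal((num_identities, descriptor.size))
scores = W @ descriptor
predicted_identity = int(np.argmax(scores))
```

Concatenation keeps the pipeline simple: each region contributes a fixed slice
of the descriptor, and a single linear classifier per identity can weight the
regions as the training data dictates.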
Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models
Professional-grade software applications are powerful but
complicated: expert users can achieve impressive results, but novices often
struggle to complete even basic tasks. Photo editing is a prime example: after
loading a photo, the user is confronted with an array of cryptic sliders like
"clarity", "temp", and "highlights". An automatically generated suggestion
could help, but there is no single "correct" edit for a given image: different
experts may make very different aesthetic decisions when faced with the same
image, and a single expert may make different choices depending on the intended
use of the image (or on a whim). We therefore want a system that can propose
multiple diverse, high-quality edits while also learning from and adapting to a
user's aesthetic preferences. In this work, we develop a statistical model that
meets these objectives. Our model builds on recent advances in neural network
generative modeling and scalable inference, and uses hierarchical structure to
learn editing patterns across many diverse users. Empirically, we find that our
model outperforms other approaches on this challenging multimodal prediction
task
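The "multiple diverse, high-quality edits" requirement above is naturally met
by a latent-variable decoder: sampling several latents yields several candidate
edits for one image. The sketch below is a toy stand-in with random weights,
invented slider names taken from the abstract, and no claim to match the
paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

sliders = ["clarity", "temp", "highlights"]

# Toy decoder: maps (image feature, latent sample) -> slider settings.
# The weights are random stand-ins for a trained network.
W_img = rng.standard_normal((len(sliders), 8))
W_z = rng.standard_normal((len(sliders), 4))

def propose_edits(image_feat, n_proposals=3):
    """Sample several latents to get diverse candidate edits for one image."""
    proposals = []
    for _ in range(n_proposals):
        z = rng.standard_normal(4)  # latent captures "which expert style"
        edit = np.tanh(W_img @ image_feat + W_z @ z)  # sliders in [-1, 1]
        proposals.append(dict(zip(sliders, edit)))
    return proposals

edits = propose_edits(rng.standard_normal(8))
```

Because the latent is resampled per proposal, the same image yields different
but plausible slider settings, mimicking disagreement between experts.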
Image captioning with weakly-supervised attention penalty
Stories are essential for genealogy research since they can help build
emotional connections with people. Many family stories are preserved in
historical photos and albums. Recent development on image captioning models
makes it feasible to "tell stories" for photos automatically. The attention
mechanism has been widely adopted in many state-of-the-art encoder-decoder
based image captioning models, since it can bridge the gap between the visual
part and the language part. Most existing captioning models implicitly train
attention modules with a word-likelihood loss. Meanwhile, many studies have
investigated intrinsic attentions for visual models using gradient-based
approaches. Ideally, attention maps predicted by captioning models should be
consistent with intrinsic attentions from visual models for any given visual
concept. However, no work has been done to align implicitly learned attention
maps with intrinsic visual attentions. In this paper, we propose a novel model
that measures consistency between captioning-predicted attentions and intrinsic
visual attentions. This alignment loss allows explicit attention correction
without using any expensive bounding box annotations. We developed and
evaluated our model on COCO dataset as well as a genealogical dataset from
Ancestry.com Operations Inc., which contains billions of historical photos. The
proposed model achieved better performances on all commonly used language
evaluation metrics for both datasets.
Comment: 10 pages, 5 figures
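One plausible way to measure the consistency between a captioning model's
predicted attention map and an intrinsic (gradient-based) attention map is a KL
divergence between the two maps after normalising each to a distribution. This
is a hedged sketch of that idea, not the paper's exact loss.

```python
import numpy as np

def attention_alignment_loss(pred_attn, intrinsic_attn, eps=1e-8):
    """KL divergence between normalised attention maps: a plausible
    consistency measure; the paper's exact loss may differ."""
    p = pred_attn / (pred_attn.sum() + eps)
    q = intrinsic_attn / (intrinsic_attn.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

a = np.array([[1.0, 2.0], [3.0, 4.0]])
identical = attention_alignment_loss(a, a)      # zero when maps agree
different = attention_alignment_loss(a, a[::-1])  # positive when they disagree
```

Such a loss is fully differentiable in the predicted attention, so it can be
added to the word-likelihood objective without any bounding-box annotations.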
The Cross-Depiction Problem: Computer Vision Algorithms for Recognising Objects in Artwork and in Photographs
The cross-depiction problem is that of recognising visual objects regardless
of whether they are photographed, painted, drawn, etc. It is a potentially
significant yet under-researched problem. Emulating the remarkable human
ability to recognise objects in an astonishingly wide variety of depictive
forms is likely to advance both the foundations and the applications of
Computer Vision.
In this paper we benchmark classification, domain adaptation, and deep
learning methods; demonstrating that none perform consistently well in the
cross-depiction problem. Given the current interest in deep learning, it is
notable that such methods exhibit the same behaviour as all but one other
method: they show a significant fall in performance on inhomogeneous databases
compared to their peak performance, which is always achieved on data comprising
photographs only. Instead, we find that methods with strong models of spatial relations
between parts tend to be more robust and therefore conclude that such
information is important in modelling object classes regardless of appearance
details.
Comment: 12 pages, 6 figures
Merge or Not? Learning to Group Faces via Imitation Learning
Given a large number of unlabeled face images, face grouping aims at
clustering the images into individual identities present in the data. This task
remains a challenging problem despite the remarkable capability of deep
learning approaches in learning face representation. In particular, grouping
results can still be egregious given profile faces and a large number of
uninteresting faces and noisy detections. Often, a user needs to correct the
erroneous grouping manually. In this study, we formulate a novel face grouping
framework that learns clustering strategy from ground-truth simulated behavior.
This is achieved through imitation learning (a.k.a. apprenticeship learning or
learning by watching) via inverse reinforcement learning (IRL). In contrast to
existing clustering approaches that group instances by similarity, our
framework makes sequential decisions, dynamically deciding when to merge two
face instances/groups driven by short- and long-term rewards. Extensive
experiments on three benchmark datasets show that our framework outperforms
unsupervised and supervised baselines.
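The sequential-decision view above can be illustrated with a greedy merging
loop: at each step, merge the pair of groups whose merge looks most rewarding,
and stop when no merge does. In this toy sketch a thresholded cosine similarity
stands in for the reward that the paper learns via inverse reinforcement
learning; nothing here reproduces the authors' actual model.

```python
import numpy as np

def sequential_grouping(embeddings, merge_reward, threshold=0.0):
    """Greedy sequential clustering: repeatedly merge the highest-reward pair
    of groups; stop when no merge exceeds the threshold."""
    groups = [[i] for i in range(len(embeddings))]
    centroids = [embeddings[i].copy() for i in range(len(embeddings))]
    while len(groups) > 1:
        best, best_r = None, threshold
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                r = merge_reward(centroids[i], centroids[j])
                if r > best_r:
                    best, best_r = (i, j), r
        if best is None:
            break  # no merge is rewarding any more
        i, j = best
        groups[i] += groups.pop(j)
        centroids[i] = (centroids[i] + centroids.pop(j)) / 2
    return groups

def cosine_reward(a, b):
    """Stand-in reward: cosine similarity shifted so dissimilar pairs score
    negative (the learned IRL reward plays this role in the paper)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))) - 0.5

faces = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
result = sorted(map(sorted, sequential_grouping(faces, cosine_reward)))
```

The first two embeddings point in nearly the same direction and get merged; the
third stays its own identity, so `result` is `[[0, 1], [2]]`.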
Dual-Glance Model for Deciphering Social Relationships
Since the beginning of early civilizations, social relationships between
individuals have formed the basis of social structure in our daily lives. In
the computer vision literature, much progress has been made in scene
understanding, such as object detection and scene parsing. Recent research
focuses on the relationships between objects based on their functionality and
geometric relations. In this work, we study the problem of social
relationship recognition in still images. We propose a dual-glance model
for social relationship recognition, where the first glance fixates at the
individual pair of interest and the second glance deploys attention mechanism
to explore contextual cues. We also collect a new large-scale People in
Social Context (PISC) dataset, which comprises 22,670 images and 76,568
annotated samples covering 9 types of social relationships. We provide benchmark
results on the PISC dataset, and qualitatively demonstrate the efficacy of the
proposed model.
Comment: IEEE International Conference on Computer Vision (ICCV), 2017
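The two-glance structure above can be sketched as: a first glance encodes the
person pair, a second glance attends over contextual region features, and the
two are fused for classification. The shapes, random weights, and attention
form below are invented for illustration and are not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def dual_glance(pair_feat, context_feats, W_attn):
    """First glance: features of the person pair of interest.
    Second glance: attention-weighted sum of contextual region features.
    Returns the fused descriptor and the attention weights."""
    logits = context_feats @ (W_attn @ pair_feat)  # relevance of each region
    attn = np.exp(logits - logits.max())           # stable softmax
    attn /= attn.sum()
    context = attn @ context_feats                 # weighted context summary
    return np.concatenate([pair_feat, context]), attn

pair = rng.standard_normal(16)          # first-glance pair features
regions = rng.standard_normal((5, 16))  # second-glance candidate regions
W = rng.standard_normal((16, 16))
fused, attn = dual_glance(pair, regions, W)
```

Conditioning the attention on the pair features lets the second glance pick
different contextual cues for different person pairs in the same image.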
Face Recognition: A Novel Multi-Level Taxonomy based Survey
In a world where security issues have been gaining growing importance, face
recognition systems have attracted increasing attention in multiple application
areas, ranging from forensics and surveillance to commerce and entertainment.
To help understand the landscape and abstraction levels relevant for face
recognition systems, face recognition taxonomies allow a deeper dissection and
comparison of the existing solutions. This paper proposes a new, more
encompassing and richer multi-level face recognition taxonomy, facilitating the
organization and categorization of available and emerging face recognition
solutions; this taxonomy may also guide researchers in the development of more
efficient face recognition solutions. The proposed multi-level taxonomy
considers levels related to the face structure, feature support and feature
extraction approach. Following the proposed taxonomy, a comprehensive survey of
representative face recognition solutions is presented. The paper concludes
with a discussion on current algorithmic and application related challenges
which may define future research directions for face recognition.
Comment: This paper is a preprint of a paper submitted to IET Biometrics. If
accepted, the copy of record will be available at the IET Digital Library
Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient
Learning goal-oriented dialogues by means of deep reinforcement learning has
recently become a popular research topic. However, commonly used policy-based
dialogue agents often end up focusing on simple utterances and suboptimal
policies. To mitigate this problem, we propose a class of novel
temperature-based extensions for policy gradient methods, which are referred to
as Tempered Policy Gradients (TPGs). On a recent AI testbed, the
GuessWhat?! game, we achieve significant improvements with two innovations. The
first one is an extension of the state-of-the-art solutions with Seq2Seq and
Memory Network structures that leads to an improvement of 7%. The second one is
the application of our newly developed TPG methods, which improves the
performance additionally by around 5% and, even more importantly, helps produce
more convincing utterances.
Comment: Published in IEEE Spoken Language Technology (SLT 2018), Athens,
Greece
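The core mechanism of a temperature-based policy extension can be shown in a
few lines: dividing the policy logits by a temperature before the softmax
flattens the sampling distribution (more exploration) or sharpens it (more
exploitation). This sketch shows only that generic mechanism, not the specific
TPG schedules the paper proposes.

```python
import numpy as np

def tempered_policy(logits, temperature):
    """Softmax with temperature: T > 1 flattens the policy toward uniform
    (more exploration); T < 1 sharpens it toward greedy behaviour."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    p = np.exp(scaled)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
sharp = tempered_policy(logits, 0.5)  # low T: concentrate on the best action
flat = tempered_policy(logits, 2.0)   # high T: spread mass across actions
```

Sampling utterance tokens from the flattened distribution during training is
one way such methods keep a dialogue agent from collapsing onto a few simple
utterances.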
Generative Adversarial Networks: An Overview
Generative adversarial networks (GANs) provide a way to learn deep
representations without extensively annotated training data. They achieve this
through deriving backpropagation signals through a competitive process
involving a pair of networks. The representations that can be learned by GANs
may be used in a variety of applications, including image synthesis, semantic
image editing, style transfer, image super-resolution and classification. The
aim of this review paper is to provide an overview of GANs for the signal
processing community, drawing on familiar analogies and concepts where
possible. In addition to identifying different methods for training and
constructing GANs, we also point to remaining challenges in their theory and
application.
Comment: Accepted in the IEEE Signal Processing Magazine Special Issue on Deep
Learning for Visual Understanding
Multiple-Human Parsing in the Wild
Human parsing is attracting increasing research attention. In this work, we
aim to push the frontier of human parsing by introducing the problem of
multi-human parsing in the wild. Existing works on human parsing mainly tackle
single-person scenarios, which deviates from real-world applications where
multiple persons are present simultaneously with interaction and occlusion. To
address the multi-human parsing problem, we introduce a new multi-human parsing
(MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP
dataset contains multiple persons captured in real-world scenes with
pixel-level fine-grained semantic annotations in an instance-aware setting. The
MH-Parser generates global parsing maps and person instance masks
simultaneously in a bottom-up fashion with the help of a new Graph-GAN model.
We envision that the MHP dataset will serve as a valuable data resource to
develop new multi-human parsing models, and the MH-Parser offers a strong
baseline to drive future research for multi-human parsing in the wild.
Comment: The first two authors are with equal contribution
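The two outputs described above, a global parsing map and per-person instance
masks, combine straightforwardly into instance-aware parsing: mask the semantic
labels with each person's instance mask. The sketch below shows only that
combination step on toy arrays; it is not the MH-Parser model itself.

```python
import numpy as np

def instance_parsing(global_parsing, instance_masks):
    """Combine a global semantic parsing map with per-person instance masks to
    get one instance-aware parsing map per person (0 = background)."""
    return [np.where(mask, global_parsing, 0) for mask in instance_masks]

parsing = np.array([[1, 1, 2],
                    [0, 3, 3]])                 # semantic part labels
masks = np.array([[[1, 1, 0], [0, 0, 0]],       # person 1's pixels
                  [[0, 0, 1], [0, 1, 1]]],      # person 2's pixels
                 dtype=bool)
per_person = instance_parsing(parsing, masks)
```

Person 1 keeps labels `[[1, 1, 0], [0, 0, 0]]` and person 2 keeps
`[[0, 0, 2], [0, 3, 3]]`, so overlapping people in a scene each receive their
own fine-grained parse.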