363 research outputs found
Can deep learning help you find the perfect match?
Is he/she my type or not? The answer to this question depends on the personal
preferences of the one asking it. The individual process of obtaining a full
answer may generally be difficult and time consuming, but often an approximate
answer can be obtained simply by looking at a photo of the potential match.
Such approximate answers based on visual cues can be produced in a fraction of
a second, a phenomenon that has led to a series of recently successful dating
apps in which users rate others positively or negatively using primarily a
single photo. In this paper we explore using convolutional networks to create a
model of an individual's personal preferences based on rated photos. This
introduced task is difficult due to the large number of variations in profile
pictures and the noise in attractiveness labels. Toward this task we collect a
dataset comprised of pictures and binary labels for each. We compare
performance of convolutional models trained in three ways: first directly on
the collected dataset, second with features transferred from a network trained
to predict gender, and third with features transferred from a network trained
on ImageNet. Our findings show that ImageNet features transfer best, producing
a model that attains accuracy on the test set and is moderately
successful at predicting matches
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
Deep neural networks (DNNs) have recently been achieving state-of-the-art
performance on a variety of pattern-recognition tasks, most notably visual
classification problems. Given that DNNs are now able to classify objects in
images with near-human-level performance, questions naturally arise as to what
differences remain between computer and human vision. A recent study revealed
that changing an image (e.g. of a lion) in a way imperceptible to humans can
cause a DNN to label the image as something else entirely (e.g. mislabeling a
lion a library). Here we show a related result: it is easy to produce images
that are completely unrecognizable to humans, but that state-of-the-art DNNs
believe to be recognizable objects with 99.99% confidence (e.g. labeling with
certainty that white noise static is a lion). Specifically, we take
convolutional neural networks trained to perform well on either the ImageNet or
MNIST datasets and then find images with evolutionary algorithms or gradient
ascent that DNNs label with high confidence as belonging to each dataset class.
It is possible to produce images totally unrecognizable to human eyes that DNNs
believe with near certainty are familiar objects, which we call "fooling
images" (more generally, fooling examples). Our results shed light on
interesting differences between human vision and current DNNs, and raise
questions about the generality of DNN computer vision.Comment: To appear at CVPR 201
Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation
Deep neural networks with alternating convolutional, max-pooling and
decimation layers are widely used in state of the art architectures for
computer vision. Max-pooling purposefully discards precise spatial information
in order to create features that are more robust, and typically organized as
lower resolution spatial feature maps. On some tasks, such as whole-image
classification, max-pooling derived features are well suited; however, for
tasks requiring precise localization, such as pixel level prediction and
segmentation, max-pooling destroys exactly the information required to perform
well. Precise localization may be preserved by shallow convnets without pooling
but at the expense of robustness. Can we have our max-pooled multi-layered cake
and eat it too? Several papers have proposed summation and concatenation based
methods for combining upsampled coarse, abstract features with finer features
to produce robust pixel level predictions. Here we introduce another model ---
dubbed Recombinator Networks --- where coarse features inform finer features
early in their formation such that finer features can make use of several
layers of computation in deciding how to use coarse features. The model is
trained once, end-to-end and performs better than summation-based
architectures, reducing the error from the previous state of the art on two
facial keypoint datasets, AFW and AFLW, by 30\% and beating the current
state-of-the-art on 300W without using extra data. We improve performance even
further by adding a denoising prediction model based on a novel convnet
formulation.Comment: accepted in CVPR 201
- …