Face R-CNN
Faster R-CNN is one of the most representative and successful methods for
object detection, and has become increasingly popular in a wide range of
object detection applications. In this report, we propose a robust deep face
detection approach based on Faster R-CNN. Our approach exploits several new
techniques, including a new multi-task loss function design, online hard
example mining, and a multi-scale training strategy, to improve Faster R-CNN
in multiple respects. The proposed approach is well suited for face detection,
so we call it Face R-CNN. Extensive experiments are conducted on two of the
most popular and challenging face detection benchmarks, FDDB and WIDER FACE,
demonstrating the superiority of the proposed approach over the state of the
art.
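Of the techniques listed above, online hard example mining (OHEM) is the easiest to illustrate in isolation: during training, only the highest-loss proposals contribute to the gradient update. The sketch below is a minimal NumPy illustration of that selection step; the function name and keep ratio are assumptions, not details from the paper.

```python
import numpy as np

def select_hard_examples(losses, keep_ratio=0.25):
    """Online hard example mining (OHEM): keep only the highest-loss
    fraction of proposals so gradients focus on the hard cases.
    `losses` is a 1-D array of per-proposal loss values."""
    losses = np.asarray(losses, dtype=float)
    k = max(1, int(len(losses) * keep_ratio))
    # Indices of the k largest losses (the "hard" examples).
    hard_idx = np.argsort(losses)[::-1][:k]
    # Only the mean loss over the hard subset is backpropagated.
    return hard_idx, losses[hard_idx].mean()
```

In a detector's training loop this selection would run per mini-batch, after computing the per-proposal classification and regression losses.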
Generic 3D Representation via Pose Estimation and Matching
Though a large body of computer vision research has investigated developing
generic semantic representations, efforts towards developing a similar
representation for 3D have been limited. In this paper, we learn a generic 3D
representation through solving a set of foundational proxy 3D tasks:
object-centric camera pose estimation and wide baseline feature matching. Our
method is based upon the premise that by providing supervision over a set of
carefully selected foundational tasks, generalization to novel tasks and
abstraction capabilities can be achieved. We empirically show that the internal
representation of a multi-task ConvNet trained to solve the above core problems
generalizes to novel 3D tasks (e.g., scene layout estimation, object pose
estimation, surface normal estimation) without the need for fine-tuning and
shows traits of abstraction abilities (e.g., cross-modality pose estimation).
In the context of the core supervised tasks, we demonstrate our representation
achieves state-of-the-art wide baseline feature matching results without
requiring apriori rectification (unlike SIFT and the majority of learned
features). We also demonstrate 6DOF camera pose estimation given a pair of
local image patches. On both supervised tasks, the accuracy is comparable to
that of humans.
Finally, we contribute a large-scale dataset composed of object-centric street
view scenes along with point correspondences and camera pose information, and
conclude with a discussion on the learned representation and open research
questions.
Comment: Published in ECCV16. See the project website
http://3drepresentation.stanford.edu/ and dataset website
https://github.com/amir32002/3D_Street_Vie
An Empirical Study of Recent Face Alignment Methods
The problem of face alignment has been intensively studied in the past years.
A large number of novel methods have been proposed, reporting very good
performance on benchmark datasets such as 300W. However, differences in
experimental settings and evaluation metrics, together with missing details in
the descriptions of the methods, make it hard to reproduce the reported
results and to evaluate their relative merits. For instance, most recent face
alignment methods are built on top of face detection, but use different face
detectors. In this paper, we
carry out a rigorous evaluation of these methods by making the following
contributions: 1) we propose a new evaluation metric for face alignment on a
set of images, namely the area under the error distribution curve within a
threshold (AUC), since the traditional evaluation measure (mean error) is very
sensitive to large alignment errors; 2) we extend the 300W database with more
practical face detections to make fair comparisons possible; 3) we carry out a
face alignment sensitivity analysis w.r.t. face detection, on both synthetic
and real data, using both off-the-shelf and re-trained models; 4) we study
factors that are particularly important for achieving good performance and
provide suggestions for practical applications. Most of the conclusions drawn
from our comparative analysis cannot be inferred from the original
publications.
Comment: under review at a conference. Project page:
https://www.cl.cam.ac.uk/~hy306/FaceAlignment.htm
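The proposed AUC metric can be sketched directly: build the cumulative error distribution (CED) over a set of images and integrate it up to a threshold, normalizing to [0, 1]. The function below is an illustrative implementation under assumed defaults (a threshold of 0.08 and uniform sampling), not the authors' exact code.

```python
import numpy as np

def alignment_auc(errors, threshold=0.08, steps=1000):
    """Area under the cumulative error distribution (CED) curve up to
    `threshold`, normalized to [0, 1]. Unlike the mean error, this
    measure is insensitive to a few catastrophic alignment failures.
    `errors` holds one normalized alignment error per image."""
    errors = np.asarray(errors, dtype=float)
    xs = np.linspace(0.0, threshold, steps)
    # CED: fraction of images whose error falls at or below each x.
    ced = np.array([(errors <= x).mean() for x in xs])
    # Trapezoidal integration of the CED, normalized by the threshold.
    dx = xs[1] - xs[0]
    area = np.sum((ced[:-1] + ced[1:]) * dx / 2.0)
    return float(area / threshold)
```

A single failure case (say, error 1.0 among otherwise small errors) barely moves this AUC, whereas it would dominate the mean error.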
A Review on Deep Learning Techniques Applied to Semantic Segmentation
Image semantic segmentation is of increasing interest to computer
vision and machine learning researchers. Many applications on the rise need
accurate and efficient segmentation mechanisms: autonomous driving, indoor
navigation, and even virtual or augmented reality systems to name a few. This
demand coincides with the rise of deep learning approaches in almost every
field or application target related to computer vision, including semantic
segmentation or scene understanding. This paper provides a review on deep
learning methods for semantic segmentation applied to various application
areas. First, we describe the terminology of this field as well as the
essential background concepts. Next, the main datasets and challenges are
presented to help researchers decide which ones best suit their needs and
targets. Then, existing methods are reviewed, highlighting their contributions
and their significance in the field. Finally, quantitative results are given
for the described methods and the datasets in which they were evaluated,
following up with a discussion of the results. Lastly, we point out a set of
promising directions for future work and draw our own conclusions about the
state of the art of semantic segmentation using deep learning techniques.
Comment: Submitted to TPAMI on Apr. 22, 201
von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification
A number of pattern recognition tasks, \textit{e.g.}, face verification, can
be boiled down to classification or clustering of unit-length directional
feature vectors whose distance can simply be computed from their angle. In
this paper, we propose the von Mises-Fisher (vMF) mixture model as the
theoretical foundation for effective deep learning of such directional
features and derive a novel vMF Mixture Loss and its corresponding vMF deep
features. The
proposed vMF feature learning achieves the characteristics of discriminative
learning, \textit{i.e.}, compacting the instances of the same class while
increasing the distance of instances from different classes. Moreover, it
subsumes a number of popular loss functions as well as an effective method in
deep learning, namely normalization. We conduct extensive experiments on face
verification using 4 different challenging face datasets, \textit{i.e.}, LFW,
YouTube faces, CACD and IJB-A. Results show the effectiveness and excellent
generalization ability of the proposed approach as it achieves state-of-the-art
results on the LFW, YouTube faces and CACD datasets and competitive results on
the IJB-A dataset.
Comment: Under review
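With unit-normalized features and class means sharing a single concentration parameter, a vMF mixture classifier reduces to a softmax over scaled cosine similarities, which is a convenient way to sketch the idea. The version below is a simplified illustration with an assumed fixed `kappa`; the paper's actual vMF Mixture Loss may differ in detail.

```python
import numpy as np

def vmf_mixture_loss(features, class_means, labels, kappa=16.0):
    """Cross-entropy under a von Mises-Fisher mixture with a shared
    concentration `kappa`: features and class means are projected onto
    the unit sphere, so the logits reduce to kappa * cosine similarity.
    `kappa` is a hypothetical fixed value chosen for illustration."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    mu = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    logits = kappa * f @ mu.T                     # (N, C) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

The built-in normalization step is exactly the "effective method in deep learning" the abstract alludes to: angle, not magnitude, carries the class information.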
Object Detection with Deep Learning: A Review
Due to object detection's close relationship with video analysis and image
understanding, it has attracted much research attention in recent years.
Traditional object detection methods are built on handcrafted features and
shallow trainable architectures. Their performance easily stagnates, even when
complex ensembles are constructed that combine multiple low-level image
features with high-level context from object detectors and scene classifiers.
With the
rapid development in deep learning, more powerful tools, which are able to
learn semantic, high-level, deeper features, are introduced to address the
problems existing in traditional architectures. These models differ in network
architecture, training strategy, optimization function, and so on. In
this paper, we provide a review on deep learning based object detection
frameworks. Our review begins with a brief introduction on the history of deep
learning and its representative tool, namely Convolutional Neural Network
(CNN). Then we focus on typical generic object detection architectures along
with some modifications and useful tricks to improve detection performance
further. As distinct specific detection tasks exhibit different
characteristics, we also briefly survey several specific tasks, including
salient object detection, face detection and pedestrian detection. Experimental
analyses are also provided to compare various methods and draw some meaningful
conclusions. Finally, several promising directions and tasks are provided to
serve as guidelines for future work in both object detection and relevant
neural-network-based learning systems.
End-to-end 3D face reconstruction with deep neural networks
Monocular 3D facial shape reconstruction from a single 2D facial image has
been an active research area due to its wide applications. Inspired by the
success of deep neural networks (DNN), we propose a DNN-based approach for
End-to-End 3D FAce Reconstruction (UH-E2FAR) from a single 2D image. Different
from recent works that reconstruct and refine the 3D face in an iterative
manner using both an RGB image and an initial 3D facial shape rendering, our
DNN model is end-to-end, and thus the complicated 3D rendering process can be
avoided. Moreover, we integrate in the DNN architecture two components, namely
a multi-task loss function and a fusion convolutional neural network (CNN) to
improve facial expression reconstruction. With the multi-task loss function, 3D
face reconstruction is divided into neutral 3D facial shape reconstruction and
expressive 3D facial shape reconstruction. The neutral 3D facial shape is
class-specific. Therefore, higher layer features are useful. In comparison, the
expressive 3D facial shape favors lower or intermediate layer features. With
the fusion-CNN, features from different intermediate layers are fused and
transformed for predicting the 3D expressive facial shape. Through extensive
experiments, we demonstrate the superiority of our end-to-end framework in
improving the accuracy of 3D face reconstruction.
Comment: Accepted to CVPR1
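The multi-task decomposition described above can be summarized as a two-term loss, one supervising the neutral shape and one the expressive shape. This is a schematic sketch; the mean-squared-error form and the weight `w_exp` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def multi_task_shape_loss(pred_neutral, pred_expressive,
                          gt_neutral, gt_expressive, w_exp=1.0):
    """Two-term 3D reconstruction loss: one term supervises the
    identity-specific neutral shape, the other the expressive shape.
    All shapes are (num_vertices, 3) arrays; w_exp is an assumed weight."""
    l_neutral = np.mean((pred_neutral - gt_neutral) ** 2)
    l_expressive = np.mean((pred_expressive - gt_expressive) ** 2)
    return l_neutral + w_exp * l_expressive
```

In the paper's architecture, the neutral branch would draw on higher-layer (class-specific) features, while the expressive branch uses the fusion-CNN over lower and intermediate layers.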
360 Depth Estimation from Multiple Fisheye Images with Origami Crown Representation of Icosahedron
In this study, we present a method for all-around depth estimation from
multiple omnidirectional images for indoor environments. In particular, we
focus on plane-sweeping stereo as the method for depth estimation from the
images. We propose a new icosahedron-based representation and ConvNets for
omnidirectional images, which we name "CrownConv" because the representation
resembles a crown made of origami. CrownConv can be applied to both fisheye
images and equirectangular images to extract features. Furthermore, we propose
icosahedron-based spherical sweeping for generating the cost volume on an
icosahedron from the extracted features. The cost volume is regularized using
the three-dimensional CrownConv, and the final depth is obtained by depth
regression from the cost volume. Our proposed method is robust to variations
in camera alignment because it uses the extrinsic camera parameters;
therefore, it can achieve precise depth estimation even when the camera
alignment differs from that of the training dataset. We evaluate the proposed
model on synthetic datasets and
demonstrate its effectiveness. As our proposed method is computationally
efficient, the depth is estimated from four fisheye images in less than a
second using a laptop with a GPU. Therefore, it is suitable for real-world
robotics applications. Our source code is available at
https://github.com/matsuren/crownconv360depth.
Comment: 8 pages, Accepted to the 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2020). For supplementary video, see
https://youtu.be/_vVD-zDMvy
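The final step, depth regression from the cost volume, is commonly implemented as a soft-argmax over the swept depth hypotheses. The sketch below shows that generic pattern, not the paper's exact CrownConv-regularized formulation.

```python
import numpy as np

def depth_from_cost_volume(cost, depth_hypotheses):
    """Soft-argmax depth regression: convert a (D, H, W) cost volume
    into a per-pixel depth map by taking the probability-weighted
    average of the swept depth hypotheses (lower cost = more likely)."""
    cost = np.asarray(cost, dtype=float)
    # Softmax over the depth dimension of the negated costs.
    p = np.exp(-cost - (-cost).max(axis=0, keepdims=True))
    p /= p.sum(axis=0, keepdims=True)
    d = np.asarray(depth_hypotheses, dtype=float)
    return np.tensordot(d, p, axes=(0, 0))   # (H, W) depth map
```

Because the weighted average is differentiable, the whole pipeline from feature extraction through depth regression can be trained end to end.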
Deep Global Registration
We present Deep Global Registration, a differentiable framework for pairwise
registration of real-world 3D scans. Deep global registration is based on three
modules: a 6-dimensional convolutional network for correspondence confidence
prediction, a differentiable Weighted Procrustes algorithm for closed-form pose
estimation, and a robust gradient-based SE(3) optimizer for pose refinement.
Experiments demonstrate that our approach outperforms state-of-the-art methods,
both learning-based and classical, on real-world data.
Comment: Accepted for CVPR'20 oral presentation
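The Weighted Procrustes module has a well-known closed form: a weighted Kabsch alignment solved with one SVD, which is what makes pose estimation differentiable with respect to the per-correspondence confidence weights. Below is an illustrative NumPy sketch for 3D point sets, not the authors' implementation.

```python
import numpy as np

def weighted_procrustes(src, dst, w):
    """Closed-form weighted rigid alignment (Kabsch/Procrustes): find
    rotation R and translation t minimizing
    sum_i w_i * ||R @ src_i + t - dst_i||^2 for 3D points.
    src, dst: (N, 3) arrays; w: (N,) nonnegative confidence weights."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)       # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance between the centered point sets.
    S = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(S)
    # Sign correction to guarantee a proper rotation (det R = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In the framework described above, the confidence network's outputs would supply `w`, and gradients flow back through the SVD to those weights.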