2,206 research outputs found
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact
with the real world. The problem is challenging due to the variety of objects
as well as the complexity of a scene caused by clutter and occlusions between
objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network
for 6D object pose estimation. PoseCNN estimates the 3D translation of an
object by localizing its center in the image and predicting its distance from
the camera. The 3D rotation of the object is estimated by regressing to a
quaternion representation. We also introduce a novel loss function that enables
PoseCNN to handle symmetric objects. In addition, we contribute a large scale
video dataset for 6D object pose estimation named the YCB-Video dataset. Our
dataset provides accurate 6D poses of 21 objects from the YCB dataset observed
in 92 videos with 133,827 frames. We conduct extensive experiments on our
YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is
highly robust to occlusions, can handle symmetric objects, and provides accurate
pose estimation using only color images as input. When using depth data to
further refine the poses, our approach achieves state-of-the-art results on the
challenging OccludedLINEMOD dataset. Our code and dataset are available at
https://rse-lab.cs.washington.edu/projects/posecnn/.
Comment: Accepted to RSS 2018
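As a rough sketch of the translation decomposition described above (not the authors' code), the 3D translation can be recovered from a predicted object center (cx, cy) and a predicted camera distance tz by back-projecting through a pinhole camera model; the intrinsics fx, fy, the principal point px, py, and the example values below are assumptions:

    import numpy as np

    def backproject_center(cx, cy, tz, fx, fy, px, py):
        """Recover the 3D translation T = (Tx, Ty, Tz) of an object from its
        predicted image-plane center (cx, cy) and predicted distance tz, given
        pinhole intrinsics (fx, fy) and principal point (px, py)."""
        tx = (cx - px) * tz / fx
        ty = (cy - py) * tz / fy
        return np.array([tx, ty, tz])

    # Hypothetical example: object centered at pixel (320, 240), 0.75 m away.
    print(backproject_center(320.0, 240.0, 0.75, fx=572.4, fy=573.6, px=325.3, py=242.0))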
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition, and video surveillance.
In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs) is provided.
The performance of LSTM-RNNs is explored in depth, owing to their ability to model the long-term contextual information of temporal sequences, which makes them suitable for analysing body movements.
All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared to methods in the current literature.
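As a purely illustrative sketch of the second module's idea (the thesis's exact architecture is not reproduced here), a two-branch stacked LSTM over 2D-skeleton sequences might be written as follows in PyTorch; the split of joints into two groups, the layer sizes, and all names are assumptions:

    import torch
    import torch.nn as nn

    class TwoBranchLSTM(nn.Module):
        """Two stacked-LSTM branches over different joint groups of a
        2D-skeleton sequence, fused before classification (illustrative only)."""
        def __init__(self, joints_a, joints_b, hidden=128, num_classes=10):
            super().__init__()
            self.branch_a = nn.LSTM(joints_a * 2, hidden, num_layers=2, batch_first=True)
            self.branch_b = nn.LSTM(joints_b * 2, hidden, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, xa, xb):
            # xa: (batch, time, joints_a*2), xb: (batch, time, joints_b*2)
            _, (ha, _) = self.branch_a(xa)
            _, (hb, _) = self.branch_b(xb)
            fused = torch.cat([ha[-1], hb[-1]], dim=1)  # last-layer hidden states
            return self.classifier(fused)

    # Hypothetical usage: batch of 4 sequences, 30 frames, 8 + 7 joints (x, y each).
    model = TwoBranchLSTM(joints_a=8, joints_b=7)
    logits = model(torch.randn(4, 30, 16), torch.randn(4, 30, 14))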
Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks
Recognizing arbitrary multi-character text in unconstrained natural
photographs is a hard problem. In this paper, we address an equally hard
sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from
Street View imagery. Traditional approaches to solve this problem typically
separate out the localization, segmentation, and recognition steps. In this
paper we propose a unified approach that integrates these three steps via the
use of a deep convolutional neural network that operates directly on the image
pixels. We employ the DistBelief implementation of deep neural networks in
order to train large, distributed neural networks on high quality images. We
find that the performance of this approach increases with the depth of the
convolutional network, with the best performance occurring in the deepest
architecture we trained, with eleven hidden layers. We evaluate this approach
on the publicly available SVHN dataset and achieve over 96% accuracy in
recognizing complete street numbers. We show that on a per-digit recognition
task, we improve upon the state-of-the-art, achieving 97.84% accuracy. We
also evaluate this approach on an even more challenging dataset generated from
Street View imagery containing several tens of millions of street number
annotations and achieve over 90% accuracy. To further explore the
applicability of the proposed system to broader text recognition tasks, we
apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the
most secure reverse Turing tests that uses distorted text to distinguish humans
from bots. We report a 99.8% accuracy on the hardest category of reCAPTCHA.
Our evaluations on both tasks indicate that at specific operating thresholds,
the performance of the proposed system is comparable to, and in some cases
exceeds, that of human operators.
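A hedged sketch of the unified sequence output such a network can use: shared convolutional features feed one softmax over the sequence length and one classifier per digit position. The feature dimension, the maximum number of digits, and all names below are assumptions, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    class MultiDigitHead(nn.Module):
        """Predict a digit sequence directly from shared CNN features:
        one softmax for the sequence length, one per digit position."""
        def __init__(self, feat_dim=512, max_digits=5):
            super().__init__()
            self.length = nn.Linear(feat_dim, max_digits + 1)  # lengths 0..max_digits
            self.digits = nn.ModuleList(
                [nn.Linear(feat_dim, 10) for _ in range(max_digits)])  # classes 0-9

        def forward(self, feats):
            # feats: (batch, feat_dim) pooled from a deep convolutional trunk
            return self.length(feats), [d(feats) for d in self.digits]

    # Hypothetical usage on a batch of two feature vectors.
    head = MultiDigitHead()
    length_logits, digit_logits = head(torch.randn(2, 512))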
Hotels-50K: A Global Hotel Recognition Dataset
Recognizing a hotel from an image of a hotel room is important for human
trafficking investigations. Images directly link victims to places and can help
verify where victims have been trafficked, and where their traffickers might
move them or others in the future. Recognizing the hotel from images is
challenging because of low image quality, uncommon camera perspectives, large
occlusions (often the victim), and the similarity of objects (e.g., furniture,
art, bedding) across different hotel rooms.
To support efforts towards this hotel recognition task, we have curated a
dataset of over 1 million annotated hotel room images from 50,000 hotels. These
images include professionally captured photographs from travel websites and
crowd-sourced images from a mobile application, which are more similar to the
types of images analyzed in real-world investigations. We present a baseline
approach based on a standard network architecture and a collection of
data-augmentation approaches tuned to this problem domain.
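One plausible domain-tuned augmentation for this setting, sketched below purely as an illustration, is erasing a large random region of each training image to mimic the large foreground occlusions (often the victim) mentioned above; the size range and implementation are assumptions, not the paper's exact recipe:

    import torch

    def random_occlusion(img, max_frac=0.5):
        """Zero out a random rectangle of an image tensor (C, H, W) to mimic
        large foreground occlusions in investigation images (illustrative)."""
        c, h, w = img.shape
        oh = int(h * torch.empty(1).uniform_(0.2, max_frac).item())
        ow = int(w * torch.empty(1).uniform_(0.2, max_frac).item())
        top = torch.randint(0, h - oh + 1, (1,)).item()
        left = torch.randint(0, w - ow + 1, (1,)).item()
        img = img.clone()
        img[:, top:top + oh, left:left + ow] = 0.0
        return img

    # Hypothetical usage on a single 3x224x224 image tensor.
    augmented = random_occlusion(torch.rand(3, 224, 224))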
Analysis of Hand Segmentation in the Wild
A large number of works in egocentric vision have concentrated on action and
object recognition. Detection and segmentation of hands in first-person videos,
however, have been explored less. For many applications in this domain, it is
necessary to accurately segment not only the hands of the camera wearer but also
the hands of others with whom the wearer is interacting. Here, we take an in-depth look
at the hand segmentation problem. In the quest for robust hand segmentation
methods, we evaluated the performance of state-of-the-art semantic
segmentation methods, off-the-shelf and fine-tuned, on existing datasets. We
fine-tune RefineNet, a leading semantic segmentation method, for hand
segmentation and find that it does much better than the best contenders.
Existing hand segmentation datasets were collected in laboratory settings. To
overcome this limitation, we contribute two new datasets: a) EgoYouTubeHands,
which includes egocentric videos containing hands in the wild, and b)
HandOverFace, to analyze the performance of our models in the presence of
similar-appearance occlusions. We further explore whether conditional random fields can
help refine generated hand segmentations. To demonstrate the benefit of
accurate hand maps, we train a CNN for hand-based activity recognition and
achieve higher accuracy when the CNN is trained using hand maps produced by the
fine-tuned RefineNet. Finally, we annotate a subset of the EgoHands dataset for
fine-grained action recognition and show that an accuracy of 58.6% can be
achieved by just looking at a single hand pose, which is much better than the
chance level (12.5%).
Comment: Accepted at CVPR 2018
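Since RefineNet itself is not bundled with torchvision, the sketch below substitutes an off-the-shelf DeepLabV3 model to illustrate the general fine-tuning pattern for binary (hand/background) segmentation; the substitution, the hyperparameters, and the placeholder batch are all assumptions, and weights="DEFAULT" requires a reasonably recent torchvision:

    import torch
    import torch.nn as nn
    from torchvision.models.segmentation import deeplabv3_resnet50

    # Replace the pretrained 21-class head with a 2-class (hand / background) head.
    model = deeplabv3_resnet50(weights="DEFAULT")
    model.classifier[4] = nn.Conv2d(256, 2, kernel_size=1)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    images = torch.randn(2, 3, 320, 320)        # placeholder image batch
    masks = torch.randint(0, 2, (2, 320, 320))  # placeholder binary hand masks

    optimizer.zero_grad()
    logits = model(images)["out"]               # (batch, 2, H, W)
    loss = criterion(logits, masks)
    loss.backward()
    optimizer.step()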
Multi-view Face Detection Using Deep Convolutional Neural Networks
In this paper we consider the problem of multi-view face detection. While
there has been significant research on this problem, current state-of-the-art
approaches for this task require annotation of facial landmarks, e.g. TSM [25],
or annotation of face poses [28, 22]. They also require training dozens of
models to fully capture faces in all orientations, e.g. 22 models in HeadHunter
method [22]. In this paper we propose Deep Dense Face Detector (DDFD), a method
that does not require pose/landmark annotation and is able to detect faces in a
wide range of orientations using a single model based on deep convolutional
neural networks. The proposed method has minimal complexity; unlike other
recent deep learning object detection methods [9], it does not require
additional components such as segmentation, bounding-box regression, or SVM
classifiers. Furthermore, we analyzed scores of the proposed face detector for
faces in different orientations and found that 1) the proposed method is able
to detect faces from different angles and can handle occlusion to some extent,
2) there seems to be a correlation between the distribution of positive examples
in the training set and the scores of the proposed face detector. The latter
suggests that the proposed method's performance can be further improved by using
better sampling strategies and more sophisticated data augmentation techniques.
Evaluations on popular face detection benchmark datasets show that our
single-model face detector algorithm has similar or better performance compared
to the previous methods, which are more complex and require annotations of
either different poses or facial landmarks.
Comment: in International Conference on Multimedia Retrieval 2015 (ICMR 2015)
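A single-model detector of this kind still has to prune overlapping detections across image positions and scales; a generic greedy non-maximum suppression step of the sort such detectors rely on is sketched below as a pure illustration, not the authors' code:

    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def nms(boxes, scores, iou_thresh=0.3):
        """Greedy non-maximum suppression: keep the highest-scoring box,
        drop boxes overlapping it above the threshold, repeat."""
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
        return keep

    # Hypothetical example: two heavily overlapping boxes collapse to one.
    print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7]))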
- …