Human Pose Estimation using Deep Consensus Voting
In this paper we consider the problem of human pose estimation from a single
still image. We propose a novel approach where each location in the image votes
for the position of each keypoint using a convolutional neural net. The voting
scheme allows us to utilize information from the whole image, rather than rely
on a sparse set of keypoint locations. Using dense, multi-target votes not
only produces good keypoint predictions but also enables us to compute
image-dependent joint keypoint probabilities by looking at consensus voting.
This differs from most previous methods where joint probabilities are learned
from relative keypoint locations and are independent of the image. We finally
combine the keypoint votes and joint probabilities in order to identify the
optimal pose configuration. We show competitive performance on the MPII
Human Pose and Leeds Sports Pose datasets.
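The voting idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the vote format (source location, offset, weight), the grid size, and the function names accumulate_votes and consensus_location are all assumptions made for illustration, and a real system would obtain the votes from a convolutional network rather than a hand-written list.

```python
def accumulate_votes(votes, height, width):
    """Sum weighted votes into a height x width score map.

    votes: iterable of (source_y, source_x, offset_y, offset_x, weight).
    This format is hypothetical: each image location casts a vote for
    where it believes the keypoint lies, as an offset from itself.
    """
    score = [[0.0] * width for _ in range(height)]
    for sy, sx, dy, dx, w in votes:
        ty, tx = sy + dy, sx + dx
        # Votes that point outside the image are dropped.
        if 0 <= ty < height and 0 <= tx < width:
            score[ty][tx] += w
    return score


def consensus_location(score):
    """Return the (y, x) cell with the largest accumulated vote mass."""
    cells = ((y, x) for y in range(len(score)) for x in range(len(score[0])))
    return max(cells, key=lambda p: score[p[0]][p[1]])


# Three locations agree on cell (2, 3); one disagrees; one vote leaves the image.
votes = [
    (0, 0, 2, 3, 0.6),
    (4, 4, -2, -1, 0.7),
    (1, 3, 1, 0, 0.5),
    (3, 0, 0, 1, 0.9),
    (0, 4, 9, 9, 1.0),
]
score = accumulate_votes(votes, 5, 5)
location = consensus_location(score)
```

In this toy input the single strongest vote (weight 0.9) loses to three weaker votes that agree with each other, which is the sense in which dense voting uses evidence from the whole image.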
Discovery and recognition of motion primitives in human activities
We present a novel framework for the automatic discovery and recognition of
motion primitives in videos of human activities. Given the 3D pose of a human
in a video, human motion primitives are discovered by optimizing the 'motion
flux', a quantity which captures the motion variation of a group of skeletal
joints. A normalization of the primitives is proposed in order to make them
invariant with respect to a subject's anatomical variations and the data sampling
rate. The discovered primitives are initially unknown and unlabeled, and are
grouped into classes without supervision via a hierarchical non-parametric Bayes
mixture model. Once classes are determined and labeled, they are further
analyzed for establishing models for recognizing discovered primitives. Each
primitive model is defined by a set of learned parameters.
Given new video data and the estimated pose of the subject appearing in
the video, the motion is segmented into primitives, which are recognized with a
probability given according to the parameters of the learned models.
Using our framework we build a publicly available dataset of human motion
primitives, using sequences taken from well-known motion capture datasets. We
expect that our framework, by providing an objective way for discovering and
categorizing human motion, will be a useful tool in numerous research fields
including video analysis, human inspired motion generation, learning by
demonstration, intuitive human-robot interaction, and human behavior analysis.
Anchor Loss: Modulating Loss Scale Based on Prediction Difficulty
We propose a novel loss function that dynamically re-scales the cross entropy based on the prediction difficulty of a sample. Deep neural network architectures in image classification tasks struggle to disambiguate visually similar objects. Likewise, in human pose estimation, symmetric body parts often confuse the network, which assigns indiscriminative scores to them. This is due to the output prediction, in which only the highest-confidence label is selected without taking a measure of uncertainty into consideration. In this work, we define the prediction difficulty as a relative property derived from the confidence score gap between positive and negative labels. More precisely, the proposed loss function penalizes the network so that the score of a false prediction does not become significant. To demonstrate the efficacy of our loss function, we evaluate it on two different domains: image classification and human pose estimation. We find improvements in both applications, achieving higher accuracy than the baseline methods.
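A minimal sketch of the idea of re-scaling cross entropy by the score gap between positive and negative labels follows. The abstract does not give the formula, so the weight (1 + p_k - p_target) ** gamma, the parameter gamma, and the function name modulated_bce are assumptions chosen for illustration and should not be read as the paper's exact definition.

```python
import math


def modulated_bce(probs, target, gamma=0.5, eps=1e-7):
    """Binary cross entropy with negatives re-weighted by their gap to the target.

    probs: per-class probabilities (e.g. sigmoid outputs).
    target: index of the true class.
    Assumed weight: (1 + p_k - p_target) ** gamma, which exceeds 1 when a
    wrong class outscores the true class and falls below 1 otherwise.
    """
    p_t = min(max(probs[target], eps), 1.0 - eps)
    loss = -math.log(p_t)  # positive-label term, left unscaled
    for k, p in enumerate(probs):
        if k == target:
            continue
        p = min(max(p, eps), 1.0 - eps)
        weight = (1.0 + p - p_t) ** gamma
        loss += -weight * math.log(1.0 - p)
    return loss


# A wrong class scoring above the true class is penalized more than under
# plain cross entropy (gamma=0); a confidently correct sample is penalized less.
hard_plain = modulated_bce([0.3, 0.6], target=0, gamma=0.0)
hard_scaled = modulated_bce([0.3, 0.6], target=0, gamma=2.0)
easy_plain = modulated_bce([0.8, 0.1], target=0, gamma=0.0)
easy_scaled = modulated_bce([0.8, 0.1], target=0, gamma=2.0)
```

Setting gamma to 0 recovers ordinary binary cross entropy, which makes the effect of the re-scaling easy to compare against a baseline.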
Eye in the Sky: Real-time Drone Surveillance System (DSS) for Violent Individuals Identification using ScatterNet Hybrid Deep Learning Network
Drone systems have been deployed by various law enforcement agencies to
monitor hostiles, spy on foreign drug cartels, conduct border control
operations, etc. This paper introduces a real-time drone surveillance system to
identify violent individuals in public areas. The system first uses the Feature
Pyramid Network to detect humans from aerial images. The image region with the
human is used by the proposed ScatterNet Hybrid Deep Learning (SHDL) network
for human pose estimation. The orientations between the limbs of the estimated
pose are next used to identify the violent individuals. The proposed deep
network can learn meaningful representations quickly using ScatterNet and
structural priors with relatively fewer labeled examples. The system detects
the violent individuals in real-time by processing the drone images in the
cloud. This research also introduces the aerial violent individual dataset used
for training the deep network which hopefully may encourage researchers
interested in using deep learning for aerial surveillance. The pose estimation
and violent individuals identification performance is compared with the
state-of-the-art techniques.
Comment: To appear in the Efficient Deep Learning for Computer Vision (ECV)
workshop at IEEE Computer Vision and Pattern Recognition (CVPR) 2018. YouTube
demo: https://www.youtube.com/watch?v=zYypJPJipY
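The abstract states that orientations between limbs of the estimated pose are used for identification but does not say how they are computed. As a hedged illustration only, the angle at a joint between two limbs can be computed from 2D keypoints as below; the function name limb_angle and the three-point input convention are assumptions, and the classifier that would consume these angles is not shown.

```python
import math


def limb_angle(joint_a, joint_b, joint_c):
    """Angle in degrees at joint_b between limbs b->a and b->c.

    Each joint is an (x, y) pair. Hypothetical helper for illustration.
    """
    v1 = (joint_a[0] - joint_b[0], joint_a[1] - joint_b[1])
    v2 = (joint_c[0] - joint_b[0], joint_c[1] - joint_b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    # atan2 on |cross| and dot gives an unsigned angle in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))


# Example: shoulder, elbow, wrist of a bent arm and of a straight arm.
bent = limb_angle((1, 0), (0, 0), (0, 1))
straight = limb_angle((-1, 0), (0, 0), (1, 0))
```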
MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation
In this work, we propose a novel and efficient method for articulated human
pose estimation in videos using a convolutional network architecture, which
incorporates both color and motion features. We propose a new human body pose
dataset, FLIC-motion, that extends the FLIC dataset with additional motion
features. We apply our architecture to this dataset and report significantly
better performance than current state-of-the-art pose detection systems.
F-formation Detection: Individuating Free-standing Conversational Groups in Images
Detection of groups of interacting people is a very interesting and useful
task in many modern technologies, with application fields spanning from
video-surveillance to social robotics. In this paper we first furnish a
rigorous definition of group considering the background of the social sciences:
this allows us to specify many kinds of group, so far neglected in the Computer
Vision literature. On top of this taxonomy, we present a detailed state of the
art on the group detection algorithms. Then, as a main contribution, we present
a brand new method for the automatic detection of groups in still images, which
is based on a graph-cuts framework for clustering individuals; in particular we
are able to codify in a computational sense the sociological definition of
F-formation, which is very useful for encoding a group given only proxemic
information: position and orientation of people. We call the proposed method
Graph-Cuts for F-formation (GCFF). We show how GCFF definitely outperforms all
the state of the art methods in terms of different accuracy measures (some of
them are brand new), demonstrating also a strong robustness to noise and
versatility in recognizing groups of various cardinality.
Comment: 32 pages, submitted to PLOS On
Efficient Object Localization Using Convolutional Networks
Recent state-of-the-art performance on human-body pose estimation has been
achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet
architectures include pooling and sub-sampling layers which reduce
computational requirements, introduce invariance and prevent over-training.
These benefits of pooling come at the cost of reduced localization accuracy. We
introduce a novel architecture which includes an efficient 'position
refinement' model that is trained to estimate the joint offset location within
a small region of the image. This refinement model is jointly trained in
cascade with a state-of-the-art ConvNet model to achieve improved accuracy in
human joint location estimation. We show that the variance of our detector
approaches the variance of human annotations on the FLIC dataset and
outperforms all existing approaches on the MPII-human-pose dataset.
Comment: 8 pages with 1 page of citation
A multi-projector CAVE system with commodity hardware and gesture-based interaction
Spatially-immersive systems such as CAVEs provide users with surrounding worlds by projecting 3D models on multiple screens around the viewer. Compared to alternative immersive systems such as HMDs, CAVE systems are a powerful tool for collaborative inspection of virtual environments due to better use of peripheral vision, less sensitivity to tracking errors, and higher communication possibilities among users. Unfortunately, traditional CAVE setups require sophisticated equipment including stereo-ready projectors and tracking systems with high acquisition and maintenance costs. In this paper we present the design and construction of a passive-stereo, four-wall CAVE system based on commodity hardware. Our system works with any mix of a wide range of projector models that can be replaced independently at any time, and achieves high resolution and brightness at a minimum cost. The key ingredients of our CAVE are a self-calibration approach that guarantees continuity across the screen, as well as a gesture-based interaction approach based on a clever
combination of skeletal data from multiple Kinect sensors.
Preprin