708 research outputs found
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which highlights the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and, thus, the constraints imposed on the type of video
that each technique is able to address. Making the hypotheses and constraints
explicit renders the framework particularly useful for selecting a method,
given an application. Another advantage of the proposed organization is that it
allows categorizing the newest approaches seamlessly alongside traditional
ones, while providing an insightful perspective on the evolution of the action
recognition task up to now. That perspective is the basis for the discussion at
the end of the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
Robust Face Tracking in Video Sequences
This work presents a detailed analysis and discussion of a novel face tracking system that utilizes multiple appearance models along with a tracking-by-detection framework, and that can aid a video-based face recognition system by giving face locations of specific individuals (Region Of Interest, ROI) for every frame. A face recognition system can utilize the ROIs provided by the face tracker to get accumulated evidence of a person being present in a video, in order to identify a person of interest that is already enrolled in the face recognition system.
The primary task of a face tracker is to find the location of a face present in an image by utilizing its location information from the previous frame. The searching process is done by finding the best region that maximizes the possibility of a face being present in the frame by
comparing the region with a face appearance model. However, during this face search, several external factors inhibit the performance of a face tracker. These external factors are termed tracking nuisances, and usually appear in the form of illumination variation, background clutter, motion blur, partial occlusion, etc. Thus, the main challenge for a face tracker is to find the best region in spite of frequent appearance changes of the face during the tracking process. Since it is not possible to control these nuisances, robust face appearance models are designed and developed such that they are less affected by these nuisances and can still track a face successfully in such scenarios.
Although a single face appearance model can be used for tracking a face, it cannot tackle all the tracking nuisances. Hence, the proposed method utilizes multiple face appearance models, so that the different models can facilitate tracking in the presence of different nuisances. In addition, the proposed method combines the tracking-by-detection methodology by employing a face detector that outputs a bounding box for every frame.
Therefore, the face detector aids the face tracker in tackling the tracking nuisances. A face detector also aids in the re-initialization of the tracker during tracking drift. The precision of the tracker can be further improved by generating face candidates around the face tracking output and choosing the best among them. Thus, in the proposed method, face tracking is formulated as selecting the face candidate that maximizes the similarity across all the appearance models.
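The candidate-selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: the appearance models are hypothetical toy similarity functions over 2-D positions, and fusing similarities by summation is an assumption (the thesis leaves the fusion rule to the full method).

```python
import random

def combined_score(candidate, appearance_models):
    # Fuse the similarity of one candidate over all appearance models
    # (summation is an assumed fusion rule for this sketch).
    return sum(model(candidate) for model in appearance_models)

def sample_candidates(center, radius=8.0, n=20, seed=0):
    # Generate additional face candidates around the tracker's estimate.
    rng = random.Random(seed)
    cx, cy = center
    return [(cx + rng.uniform(-radius, radius),
             cy + rng.uniform(-radius, radius)) for _ in range(n)]

def best_candidate(candidates, appearance_models):
    # Tracking output: the candidate maximizing the combined similarity.
    return max(candidates, key=lambda c: combined_score(c, appearance_models))

# Toy appearance models: each scores a candidate by closeness to the
# appearance it has learned (here reduced to a fixed 2-D "face" position).
true_face = (100.0, 60.0)
models = [
    lambda c: -abs(c[0] - true_face[0]) - abs(c[1] - true_face[1]),
    lambda c: -((c[0] - true_face[0]) ** 2 + (c[1] - true_face[1]) ** 2),
]
tracker_estimate = (103.0, 58.0)  # slightly drifted tracker output
candidates = [tracker_estimate] + sample_candidates(tracker_estimate)
best = best_candidate(candidates, models)
```

Because extra candidates are sampled around the drifted estimate, the selected candidate can only score at least as well as the raw tracker output under the combined similarity.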
Using Prior Knowledge for Verification and Elimination of Stationary and Variable Objects in Real-time Images
With the evolving technologies in the autonomous vehicle industry, it has now become possible for automobile passengers to sit relaxed instead of driving the car. Technologies like object detection, object identification, and image segmentation have enabled an autonomous car to identify and detect objects on the road in order to drive safely. While an autonomous car drives by itself on the road, the types of objects surrounding the car can be dynamic (e.g., cars and pedestrians), stationary (e.g., buildings and benches), or variable (e.g., trees), depending on whether the location or shape of an object changes. Different from the existing image-based approaches to detect and recognize objects in the scene, in this research a 3D virtual world is employed to verify and eliminate stationary and variable objects, allowing the autonomous car to focus on dynamic objects that may endanger its driving. This methodology takes advantage of prior knowledge of stationary and variable objects present in a virtual city and verifies their existence in a real-time scene by matching keypoints between the virtual and real objects. In case a stationary or variable object does not exist in the virtual world due to incomplete pre-existing information, this method uses machine learning for object detection. Verified objects are then removed from the real-time image with a combined algorithm using contour detection and class activation maps (CAM), which helps to enhance the efficiency and accuracy when recognizing moving objects.
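The verification step described above can be sketched as keypoint matching between the virtual-world prior and the live scene. This is a hedged toy sketch: the 2-D tuples stand in for real keypoint descriptors, and the ratio value, distance threshold, and minimum match count are all assumed parameters, not the paper's.

```python
def descriptor_distance(a, b):
    # Euclidean distance between two descriptor vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match_count(virtual_desc, real_desc, ratio=0.75, max_dist=1.0):
    # Count matches between a virtual object's descriptors and the
    # real-time image's descriptors, using a nearest/second-nearest
    # ratio test plus an absolute distance threshold.
    matches = 0
    for d in virtual_desc:
        dists = sorted(descriptor_distance(d, r) for r in real_desc)
        if (len(dists) >= 2 and dists[0] <= max_dist
                and dists[0] < ratio * dists[1]):
            matches += 1
    return matches

def verify_object(virtual_desc, real_desc, min_matches=3):
    # Verify that a stationary/variable object from the virtual city is
    # present in the scene; verified objects can then be removed so the
    # car focuses on dynamic objects.
    return match_count(virtual_desc, real_desc) >= min_matches

# Toy descriptors: a "building" known from the virtual city, and the
# descriptors extracted from the current real-time frame.
building = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
scene = [(0.1, 0.0), (10.0, 0.2), (0.0, 9.9), (10.1, 10.0), (50.0, 50.0)]
```

An object absent from the scene (e.g., `verify_object([(100.0, 100.0)], scene)`) yields no matches and fails verification, which is exactly the case the paper hands over to the machine-learning detector.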
A Voting Algorithm for Dynamic Object Identification and Pose Estimation
While object identification enables autonomous vehicles to detect and recognize objects from real-time images, pose estimation further enhances their capability of navigating in a dynamically changing environment. This thesis proposes an approach which makes use of keypoint features from 3D object models for recognition and pose estimation of dynamic objects in the context of self-driving vehicles. A voting technique is developed to vote out a suitable model from the repository of 3D models that offers the best match with the dynamic objects in the input image. The matching is done based on the identified keypoints on the image and the keypoints corresponding to each template model stored in the repository. A confidence score value is then assigned to measure the confidence with which the system can confirm the presence of the matched object in the input image. Since pedestrians are dynamic objects with complex structure, human models in the COCO-DensePose dataset, along with the DensePose deep-learning model developed by the Facebook research team, have been adopted and integrated into the system for 3D pose estimation of pedestrians on the road. Additionally, object tracking is performed to find the speed and location details for each of the recognized dynamic objects from consecutive image frames of the input video. This research demonstrates with experimental results that the use of 3D object models enhances the confidence of recognition and pose estimation of dynamic objects in the real-time input image. The 3D pose information of the recognized dynamic objects, along with their corresponding speed and location information, would help the autonomous navigation system of self-driving cars to take appropriate navigation decisions, thus ensuring smooth and safe driving.
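The voting scheme described above can be illustrated with a small sketch. This is an assumed simplification, not the thesis algorithm: keypoints are reduced to 2-D tuples, a vote is an image keypoint lying within a tolerance of some template keypoint, and the confidence score is taken as the winner's share of all votes cast.

```python
def count_matches(image_kps, model_kps, tol=1.5):
    # Votes contributed by one template model: image keypoints lying
    # within `tol` of at least one keypoint of the model.
    def near(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5 <= tol
    return sum(1 for p in image_kps if any(near(p, m) for m in model_kps))

def vote_model(image_kps, repository):
    # Vote out the repository model that best matches the input image,
    # plus a confidence score (the winner's share of all votes cast).
    votes = {name: count_matches(image_kps, kps)
             for name, kps in repository.items()}
    winner = max(votes, key=votes.get)
    total = sum(votes.values())
    confidence = votes[winner] / total if total else 0.0
    return winner, confidence

# Hypothetical repository of 3D model keypoints (flattened to 2-D here),
# and keypoints detected in the input image.
repository = {
    "car": [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)],
    "pedestrian": [(0.0, 0.0), (0.0, 3.0), (0.0, 6.0), (1.0, 4.0)],
}
image_kps = [(0.1, 0.0), (0.0, 3.1), (0.1, 5.9), (1.0, 4.2)]
winner, confidence = vote_model(image_kps, repository)
```

All four image keypoints fall near the pedestrian template, while only two fall near the car template, so the pedestrian model wins with a confidence of 4/6.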
DART: Distribution Aware Retinal Transform for Event-based Cameras
We introduce a generic visual descriptor, termed the distribution aware
retinal transform (DART), that encodes the structural context using log-polar
grids for event cameras. The DART descriptor is applied to four different
problems, namely object classification, tracking, detection and feature
matching: (1) The DART features are directly employed as local descriptors in a
bag-of-features classification framework and testing is carried out on four
standard event-based object datasets (N-MNIST, MNIST-DVS, CIFAR10-DVS,
NCaltech-101). (2) Extending the classification system, tracking is
demonstrated using two key novelties: (i) For overcoming the low-sample problem
for the one-shot learning of a binary classifier, statistical bootstrapping is
leveraged with online learning; (ii) To achieve tracker robustness, the scale
and rotation equivariance property of the DART descriptors is exploited for the
one-shot learning. (3) To solve the long-term object tracking problem, an
object detector is designed using the principle of cluster majority voting. The
detection scheme is then combined with the tracker to result in a high
intersection-over-union score with augmented ground truth annotations on the
publicly available event camera dataset. (4) Finally, the event context encoded
by DART greatly simplifies the feature correspondence problem, especially for
spatio-temporal slices far apart in time, which has not been explicitly tackled
in the event-based vision domain. Comment: 12 pages, revision submitted to TPAMI in Nov 201
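The log-polar binning at the heart of the descriptor can be sketched as follows. This is a minimal, assumed illustration of log-polar gridding in general, not the paper's DART construction: the ring/wedge counts and radii are arbitrary, and the paper's actual sampling and normalization differ.

```python
import math

def log_polar_bin(x, y, cx, cy, n_rings=4, n_wedges=8, r_min=1.0, r_max=32.0):
    # Map an event at pixel (x, y) to a (ring, wedge) cell of a log-polar
    # grid centred at (cx, cy). Rings are log-spaced in radius, wedges are
    # uniform in angle; accumulating event counts per cell yields a
    # DART-style descriptor histogram.
    dx, dy = x - cx, y - cy
    r = min(max(math.hypot(dx, dy), r_min), r_max * (1 - 1e-9))
    theta = math.atan2(dy, dx) % (2 * math.pi)
    ring = int(n_rings * math.log(r / r_min) / math.log(r_max / r_min))
    wedge = int(n_wedges * theta / (2 * math.pi)) % n_wedges
    return min(ring, n_rings - 1), wedge

# Build a descriptor histogram from a handful of events around (10, 10).
events = [(11, 10), (12, 10), (10, 41)]
hist = {}
for ex, ey in events:
    cell = log_polar_bin(ex, ey, 10, 10)
    hist[cell] = hist.get(cell, 0) + 1
```

The log spacing gives fine cells near the grid centre and coarse cells in the periphery, which is what makes such descriptors tolerant to small misalignments far from the keypoint.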
A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild"
Recently, technologies such as face detection, facial landmark localisation
and face recognition and verification have matured enough to provide effective
and efficient solutions for imagery captured under arbitrary conditions
(referred to as "in-the-wild"). This is partially attributed to the fact that
comprehensive "in-the-wild" benchmarks have been developed for face detection,
landmark localisation and recognition/verification. A very important technology
that has not been thoroughly evaluated yet is deformable face tracking
"in-the-wild". Until now, the performance has mainly been assessed
qualitatively by visually assessing the result of a deformable face tracking
technology on short videos. In this paper, we perform the first, to the best of
our knowledge, thorough evaluation of state-of-the-art deformable face tracking
pipelines using the recently introduced 300VW benchmark. We evaluate many
different architectures focusing mainly on the task of on-line deformable face
tracking. In particular, we compare the following general strategies: (a)
generic face detection plus generic facial landmark localisation, (b) generic
model free tracking plus generic facial landmark localisation, as well as (c)
hybrid approaches using state-of-the-art face detection, model free tracking
and facial landmark localisation technologies. Our evaluation reveals future
avenues for further research on the topic. Comment: E. Antonakos and P. Snape contributed equally and have joint second authorship
- âŠ