
    Action Recognition in Videos: from Motion Capture Labs to the Web

    This paper presents a survey of human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework that highlights the evolution of the area, with techniques moving from heavily constrained motion-capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and, thus, the constraints imposed on the type of video that each technique is able to address. Making these hypotheses and constraints explicit renders the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task to date. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area.

    Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables

    Robust Face Tracking in Video Sequences

    This work presents a detailed analysis and discussion of a novel face tracking system that uses multiple appearance models within a tracking-by-detection framework. The system can aid a video-based face recognition system by providing the face locations (regions of interest, ROIs) of specific individuals in every frame. A face recognition system can use the ROIs provided by the face tracker to accumulate evidence of a person's presence in a video, in order to identify a person of interest already enrolled in the face recognition system.
The primary task of a face tracker is to find the location of a face in an image using its location in the previous frame. The search is performed by finding the region that best matches a face appearance model, i.e., the region that maximizes the likelihood of a face being present in the frame. During this search, however, several external factors degrade the performance of a face tracker. These factors are termed tracking nuisances and usually appear in the form of illumination variation, background clutter, motion blur, partial occlusion, etc. The main challenge for a face tracker is therefore to find the best region in spite of frequent changes in the appearance of the face during tracking. Since these nuisances cannot be controlled, robust face appearance models are designed so that they are less affected by them and can still track a face successfully in such scenarios. Although a single face appearance model can be used to track a face, it cannot tackle all tracking nuisances. Hence, the proposed method uses multiple face appearance models, so that different models support tracking under different nuisances. The proposed method also incorporates the tracking-by-detection methodology by employing a face detector that outputs a bounding box for every frame; the detector both helps the tracker cope with tracking nuisances and allows the tracker to be re-initialized when it drifts. The precision of the tracker is further improved by generating face candidates around the tracker's output and choosing the best among them. Thus, in the proposed method, face tracking is formulated as selecting the face candidate that maximizes the similarity across all the appearance models.
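    The candidate-selection step described in this abstract lends itself to a short sketch. The following is a minimal illustration, assuming grayscale template patches as the appearance models and normalized cross-correlation as the similarity measure; all function names and parameter values are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def ncc(patch, template):
    # Normalized cross-correlation between a candidate patch and a template.
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    t = (template - template.mean()) / (template.std() + 1e-8)
    return float((p * t).mean())

def generate_candidates(frame, box, n=50, sigma=8.0, rng=None):
    # Sample candidate boxes by perturbing the previous face location.
    rng = rng if rng is not None else np.random.default_rng(0)
    x, y, w, h = box
    H, W = frame.shape[:2]
    offsets = rng.normal(0.0, sigma, size=(n, 2))
    return [(int(np.clip(x + dx, 0, W - w)), int(np.clip(y + dy, 0, H - h)), w, h)
            for dx, dy in offsets]

def track_step(frame, prev_box, appearance_models):
    # Choose the candidate that maximizes the summed similarity over all models.
    best_box, best_score = prev_box, -np.inf
    for (cx, cy, w, h) in generate_candidates(frame, prev_box):
        patch = frame[cy:cy + h, cx:cx + w]
        score = sum(ncc(patch, m) for m in appearance_models)
        if score > best_score:
            best_box, best_score = (cx, cy, w, h), score
    return best_box, best_score
```

    A complete system along the lines described above would also fold in the face detector's bounding box as an extra candidate and re-initialize from it whenever the best similarity score drops below a drift threshold.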

    Using Prior Knowledge for Verification and Elimination of Stationary and Variable Objects in Real-time Images

    With evolving technologies in the autonomous vehicle industry, it has become possible for automobile passengers to sit back instead of driving the car. Technologies such as object detection, object identification, and image segmentation enable an autonomous car to identify and detect objects on the road in order to drive safely. As an autonomous car drives on the road, the objects surrounding it can be dynamic (e.g., cars and pedestrians), stationary (e.g., buildings and benches), or variable (e.g., trees), depending on whether the location or shape of the object changes. In contrast to existing image-based approaches that detect and recognize objects in the scene, this research employs a 3D virtual world to verify and eliminate stationary and variable objects, allowing the autonomous car to focus on the dynamic objects that may endanger its driving. The methodology takes advantage of prior knowledge of the stationary and variable objects present in a virtual city and verifies their existence in the real-time scene by matching keypoints between the virtual and real objects. When a stationary or variable object does not exist in the virtual world because of incomplete pre-existing information, the method falls back on machine learning for object detection. Verified objects are then removed from the real-time image with a combined algorithm using contour detection and class activation maps (CAM), which improves the efficiency and accuracy of recognizing moving objects.
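    As a rough illustration of the keypoint-verification step, the sketch below matches ORB features between a rendered view of an object from the virtual city and the live frame; the feature type, `min_matches`, and the ratio threshold are assumptions for illustration, not the values used in this research.

```python
import cv2

def verify_object(virtual_view, real_frame, min_matches=25, ratio=0.75):
    # Verify a known stationary/variable object by matching keypoints between
    # its rendering in the virtual city and the real-time camera frame.
    orb = cv2.ORB_create(nfeatures=1000)
    _, des_v = orb.detectAndCompute(virtual_view, None)
    _, des_r = orb.detectAndCompute(real_frame, None)
    if des_v is None or des_r is None:
        return False, []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des_v, des_r, k=2)
    # Lowe's ratio test keeps only distinctive correspondences.
    good = []
    for pair in pairs:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good) >= min_matches, good
```

    Objects that pass this verification are the ones the method then removes from the real-time image (via contour detection and CAM), leaving only potentially dangerous dynamic objects for downstream processing.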

    A Voting Algorithm for Dynamic Object Identification and Pose Estimation

    While object identification enables autonomous vehicles to detect and recognize objects in real-time images, pose estimation further enhances their capability to navigate a dynamically changing environment. This thesis proposes an approach that uses keypoint features from 3D object models for recognition and pose estimation of dynamic objects in the context of self-driving vehicles. A voting technique is developed to select, from a repository of 3D models, the model that best matches the dynamic objects in the input image. The matching is based on the keypoints identified in the image and the keypoints of each template model stored in the repository. A confidence score is then assigned to measure the confidence with which the system can confirm the presence of the matched object in the input image. Because pedestrians are dynamic objects with complex structure, human models from the COCO-DensePose dataset, along with the DensePose deep-learning model developed by the Facebook research team, have been adopted and integrated into the system for 3D pose estimation of pedestrians on the road. Additionally, object tracking is performed to obtain the speed and location of each recognized dynamic object from consecutive frames of the input video. The experimental results demonstrate that the use of 3D object models increases the confidence of recognition and pose estimation of dynamic objects in the real-time input image. The 3D pose information of the recognized dynamic objects, together with their speed and location, would help the autonomous navigation system of a self-driving car make appropriate navigation decisions, thus ensuring smooth and safe driving.
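    The voting step can be sketched compactly: each image keypoint casts a vote for the repository model containing its best-matching keypoint, and the confidence score is the fraction of keypoints supporting the winner. Here `match_fn` is a hypothetical placeholder for whatever descriptor matcher the system uses.

```python
from collections import Counter

def vote_for_model(image_keypoints, model_repository, match_fn):
    # Each keypoint votes for the template model that best matches it;
    # match_fn returns the id of that model, or None if nothing matches.
    votes = Counter()
    for kp in image_keypoints:
        model_id = match_fn(kp, model_repository)
        if model_id is not None:
            votes[model_id] += 1
    if not votes:
        return None, 0.0
    best_model, count = votes.most_common(1)[0]
    # Confidence: fraction of image keypoints that support the winning model.
    return best_model, count / len(image_keypoints)
```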

    DART: Distribution Aware Retinal Transform for Event-based Cameras

    We introduce a generic visual descriptor, termed the distribution aware retinal transform (DART), that encodes structural context using log-polar grids for event cameras. The DART descriptor is applied to four different problems, namely object classification, tracking, detection, and feature matching: (1) The DART features are directly employed as local descriptors in a bag-of-features classification framework, with testing carried out on four standard event-based object datasets (N-MNIST, MNIST-DVS, CIFAR10-DVS, NCaltech-101). (2) Extending the classification system, tracking is demonstrated using two key novelties: (i) to overcome the low-sample problem in one-shot learning of a binary classifier, statistical bootstrapping is leveraged with online learning; (ii) to achieve tracker robustness, the scale and rotation equivariance of the DART descriptors is exploited in the one-shot learning. (3) To solve the long-term object tracking problem, an object detector is designed using the principle of cluster majority voting. The detection scheme is then combined with the tracker, resulting in a high intersection-over-union score with augmented ground-truth annotations on the publicly available event camera dataset. (4) Finally, the event context encoded by DART greatly simplifies the feature correspondence problem, especially for spatio-temporal slices far apart in time, which has not been explicitly tackled in the event-based vision domain.

    Comment: 12 pages, revision submitted to TPAMI in Nov 201
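    To give a flavour of the log-polar encoding, the simplified stand-in below histograms neighbouring event locations on a log-polar grid centred on a reference event; the grid geometry, bin counts, and normalization of the actual DART descriptor are more involved, so the values here are assumptions.

```python
import numpy as np

def log_polar_descriptor(events, center, r_bins=8, theta_bins=12, r_max=31.0):
    # Histogram event locations on a log-polar grid centred at `center`.
    cx, cy = center
    desc = np.zeros((r_bins, theta_bins))
    for (x, y) in events:
        dx, dy = x - cx, y - cy
        r = np.hypot(dx, dy)
        if r == 0 or r > r_max:
            continue  # skip the centre event and events outside the grid
        # log1p binning gives finer radial resolution near the centre.
        ri = min(int(np.log1p(r) / np.log1p(r_max) * r_bins), r_bins - 1)
        ti = min(int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * theta_bins),
                 theta_bins - 1)
        desc[ri, ti] += 1.0
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```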

    A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild"

    Recently, technologies such as face detection, facial landmark localisation, and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as "in-the-wild"). This is partially attributed to the fact that comprehensive "in-the-wild" benchmarks have been developed for face detection, landmark localisation, and recognition/verification. A very important technology that has not yet been thoroughly evaluated is deformable face tracking "in-the-wild". Until now, performance has mainly been assessed qualitatively, by visually inspecting the result of a deformable face tracking technique on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines, using the recently introduced 300VW benchmark. We evaluate many different architectures, focusing mainly on the task of online deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation; (b) generic model-free tracking plus generic facial landmark localisation; and (c) hybrid approaches using state-of-the-art face detection, model-free tracking, and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.

    Comment: E. Antonakos and P. Snape contributed equally and have joint second authorship
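    The three strategies compared in the paper can be written down as pipeline skeletons. In the sketch below, `detect`, `track`, and `fit` stand in for any concrete face detector, model-free tracker, and landmark localiser, and the re-detection interval in the hybrid variant is an assumption rather than the paper's protocol.

```python
def strategy_a(frames, detect, fit):
    # (a) Generic face detection plus landmark localisation, per frame.
    return [fit(f, detect(f)) for f in frames]

def strategy_b(frames, init_box, track, fit):
    # (b) Model-free tracking from an initial box, plus landmark localisation.
    out, box = [], init_box
    for f in frames:
        box = track(f, box)        # propagate the face box one frame
        out.append(fit(f, box))
    return out

def strategy_c(frames, detect, track, fit, redetect_every=30):
    # (c) Hybrid: model-free tracking with periodic re-detection to limit drift.
    out, box = [], None
    for i, f in enumerate(frames):
        box = detect(f) if box is None or i % redetect_every == 0 else track(f, box)
        out.append(fit(f, box))
    return out
```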
    • 

    corecore