
    Going Deeper into Action Recognition: A Survey

    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation, and semantic segmentation. Over the last decade, human action analysis has evolved from early schemes, often limited to controlled environments, to advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are reached ever more rapidly, quickly rendering obsolete what was recently considered state of the art. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then move into the realm of deep learning-based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable setbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

    Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics

    This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, used to propose regions of interest in which to find objects, and recursive Bayesian filtering, used to integrate observations over time. The proposal is evaluated on six virtual indoor environments, accounting for the detection of nine object classes over a total of ∼ 7k frames. Results show that our proposal improves recall and F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction of the object categorization entropy (58.8%) when compared to a two-stage video object detection method used as baseline, at the cost of a small time overhead (120 ms) and a small precision loss (0.92).
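
    As an illustration of the geometric half of such a pipeline, the minimal sketch below (our own example, not the paper's code) warps a previous detection into the current frame with OpenCV's perspectiveTransform; the recursive Bayesian fusion of class beliefs over time is omitted. The function name and box layout are assumptions for illustration.

```python
# Hedged sketch: propagate a detected bounding box from frame t-1 to
# frame t using a planar homography derived from known camera motion.
import cv2
import numpy as np

def propagate_box(box, H):
    """Warp an axis-aligned box (x1, y1, x2, y2) with homography H and
    return the axis-aligned bounding box of the warped corners."""
    x1, y1, x2, y2 = box
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]])
    warped = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), H)
    warped = warped.reshape(-1, 2)
    return (warped[:, 0].min(), warped[:, 1].min(),
            warped[:, 0].max(), warped[:, 1].max())

# Example: a purely translational camera motion of 5 px to the left.
H = np.float32([[1, 0, -5], [0, 1, 0], [0, 0, 1]])
print(propagate_box((100, 120, 180, 200), H))  # box shifted left by 5 px
```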

    Dynamic texture recognition using time-causal and time-recursive spatio-temporal receptive fields

    This work presents a first evaluation of using spatio-temporal receptive fields from a recently proposed time-causal spatio-temporal scale-space framework as primitives for video analysis. We propose a new family of video descriptors based on regional statistics of spatio-temporal receptive field responses and evaluate this approach on the problem of dynamic texture recognition. Our approach generalises a previously used method, based on joint histograms of receptive field responses, from the spatial to the spatio-temporal domain and from object recognition to dynamic texture recognition. The time-recursive formulation enables computationally efficient time-causal recognition. The experimental evaluation demonstrates competitive performance compared to the state of the art. In particular, it is shown that binary versions of our dynamic texture descriptors achieve improved performance compared to a large range of similar methods using different primitives, either handcrafted or learned from data. Further, our qualitative and quantitative investigation into parameter choices and the use of different sets of receptive fields highlights the robustness and flexibility of our approach. Together, these results support the descriptive power of this family of time-causal spatio-temporal receptive fields, validate our approach for dynamic texture recognition, and point towards the possibility of designing a range of video analysis methods based on these new time-causal spatio-temporal primitives.
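
    To make the joint-histogram construction concrete, here is a hedged sketch: plain finite-difference derivatives stand in for the paper's time-causal, time-recursive receptive fields, and the descriptor is a joint histogram of the quantized responses over a video region. All names and parameter values are illustrative.

```python
# Sketch of a joint-histogram descriptor over spatio-temporal responses.
# np.gradient is only a stand-in for the paper's receptive fields.
import numpy as np

def joint_histogram_descriptor(video, bins=4):
    """video: (T, H, W) grayscale volume -> flattened joint histogram."""
    dt, dy, dx = np.gradient(video.astype(np.float64))   # temporal + spatial derivatives
    responses = np.stack([dt, dy, dx], axis=-1).reshape(-1, 3)
    # Jointly quantize the three response channels into bins^3 cells.
    hist, _ = np.histogramdd(responses, bins=bins)
    hist = hist.ravel()
    return hist / hist.sum()                              # normalized descriptor

video = np.random.rand(16, 32, 32)       # toy clip
print(joint_histogram_descriptor(video).shape)  # (4**3,) = (64,)
```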

    Deep Learning for Crowd Anomaly Detection

    Today, public areas across the globe are monitored by an increasing number of surveillance cameras. This widespread usage produces an ever-growing volume of data that cannot realistically be examined in real time. Efforts to understand crowd dynamics have therefore given rise to automatic systems for the detection of anomalies in crowds. This thesis explores the methods used across the literature for this purpose, with a focus on those that fuse dense optical flow into a feature extraction stage for the crowd anomaly detection problem. To this end, five different deep learning architectures are trained using optical flow maps estimated by three deep learning-based techniques. More specifically, a 2D convolutional network, a 3D convolutional network, an LSTM-based convolutional recurrent network, a pre-trained variant of the latter, and a ConvLSTM-based autoencoder are trained using both regular frames and optical flow maps estimated by LiteFlowNet3, RAFT, and GMA on the UCSD Pedestrian 1 dataset. The experimental results show that, while prone to overfitting, the use of optical flow maps may improve the performance of supervised spatio-temporal architectures.
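
    The sketch below illustrates the general idea of turning frame pairs into flow maps that such architectures consume as input; it uses OpenCV's classical Farneback estimator purely as a stand-in for the deep estimators (LiteFlowNet3, RAFT, GMA) used in the thesis, and all parameter values are illustrative.

```python
# Hedged sketch: dense optical flow as a 2-channel network input.
import cv2
import numpy as np

def flow_input(prev_gray, next_gray):
    """Dense flow between two grayscale frames -> (H, W, 2) float32 map."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow.astype(np.float32)

rng = np.random.default_rng(0)
f0 = rng.integers(0, 255, size=(240, 320), dtype=np.uint8)
f0 = cv2.GaussianBlur(f0, (9, 9), 2)      # smooth texture so flow is trackable
f1 = np.roll(f0, 2, axis=1)               # synthetic 2-px horizontal motion
x = flow_input(f0, f1)
print(x.shape, x[..., 0].mean())          # (240, 320, 2), mean x-flow roughly 2 px
```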

    Action recognition from RGB-D data

    In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, each with its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition. First, effective handcrafted features are designed and a frequent pattern mining method is employed to mine the most discriminative, representative, and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), it is proposed to represent the spatio-temporal information carried in 3D skeleton sequences as three 2D images, by encoding the joint trajectories and their dynamics into the color distribution of the images, and ConvNets are adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed so that ConvNets can be applied to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact, yet effective image-based representations are proposed, with ConvNets adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to deal with this issue, it is proposed to represent a depth map sequence as three pairs of structured dynamic images at the body, part, and joint levels through bidirectional rank pooling. The structured dynamic images preserve spatio-temporal information, enhance structure information across body parts/joints and across temporal scales, and take advantage of ConvNets for action recognition. Sixth, it is proposed to extract and use scene flow for action recognition from RGB and depth data. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), it is proposed to cooperatively train a single ConvNet (referred to as c-ConvNet) on both RGB and depth features, deeply aggregating the two modalities to achieve robust action recognition.
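
    As a concrete illustration of the second method's idea, a skeleton sequence can be rendered as a color image whose rows index joints, columns index time, and RGB channels encode normalized joint coordinates, so an ordinary 2D ConvNet can consume the sequence. The encoding below is our simplified assumption, not the thesis implementation.

```python
# Hedged sketch: 3D skeleton sequence (T frames x J joints x XYZ)
# -> (J, T, 3) color image for a 2D ConvNet.
import numpy as np

def skeleton_to_image(seq):
    """seq: (T, J, 3) joint coordinates -> (J, T, 3) uint8 image."""
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - lo) / (hi - lo + 1e-8)   # scale each coordinate to [0, 1]
    img = (norm * 255).astype(np.uint8)    # XYZ -> RGB intensities
    return img.transpose(1, 0, 2)          # rows = joints, columns = time

seq = np.random.rand(60, 25, 3)            # e.g. 60 frames, 25 Kinect joints
print(skeleton_to_image(seq).shape)        # (25, 60, 3)
```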

    Event-based Vision: A Survey

    Event cameras are bio-inspired sensors that differ from conventional frame cameras: instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes and output a stream of events that encode the time, location, and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (on the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision in scenarios that are challenging for traditional cameras, such as those demanding low latency, high speed, and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available, and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks. Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world.
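
    As a minimal illustration of what this unconventional output looks like in code, the sketch below accumulates signed events into a frame that conventional frame-based vision code can consume. The (t, x, y, polarity) tuple layout is our assumption for illustration, not a specific camera SDK's format.

```python
# Hedged sketch: accumulate a stream of events into a signed frame.
import numpy as np

def accumulate_events(events, height, width):
    """events: iterable of (t, x, y, p) with polarity p in {-1, +1}
    -> (H, W) frame of net brightness-change counts per pixel."""
    frame = np.zeros((height, width), dtype=np.int32)
    for t, x, y, p in events:
        frame[int(y), int(x)] += int(p)    # sign of the brightness change
    return frame

events = [(0.001, 10, 5, +1), (0.002, 10, 5, +1), (0.003, 3, 7, -1)]
frame = accumulate_events(events, 8, 16)
print(frame[5, 10], frame[7, 3])           # 2, -1
```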

    On Video Restoration and Editing: Line Scratch Detection and Inpainting of Complex Scenes

    The inevitable degradation of visual content such as images and films leads to the goal of image and video restoration. In this thesis, we look at two specific restoration problems: the detection of line scratches in old films and the automatic completion of videos, also known as video inpainting. Line scratches are caused when the film physically rubs against a mechanical part during projection; this physical origin gives the defect its specific characteristics: scratches are more or less vertical lines, white or black (or sometimes colored), and temporally persistent, meaning their position is continuous over time. We propose a detection algorithm based on the statistical approach known as a contrario methods, which is precise and robust to the noise and textures present in the image. We also propose a temporal filtering step that removes the false alarms of the first detection step by analyzing the motion of the spatial detections. Comparisons of the complete algorithm (spatial detection and temporal filtering) with previous work show greatly improved recall and precision, and robustness with respect to the presence of noise and clutter in the film. The second part of the thesis concerns video inpainting, whose goal is to fill a region of a video with content that looks visually coherent and convincing. While a plethora of methods address this problem for images, the literature for videos is more limited, notably because execution time is a real obstacle. We propose a video inpainting algorithm based on the minimisation of a patch-based functional of the video content, where patches are small cubes of video content. In this framework, we address the following problems: extremely high execution times, the correct handling of textures in the video, and the inpainting of videos whose background moves or which were shot with moving cameras. Finally, we investigate some convergence questions of the algorithm in very simplified inpainting contexts.
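
    To give a flavor of the a contrario principle behind the scratch detector, here is a toy simplification (ours, not the thesis algorithm): a column is flagged when its count of strong horizontal-gradient pixels is statistically unexpected under a uniform background model, i.e., when its number of false alarms (NFA) falls below a threshold. All parameter values are illustrative.

```python
# Toy a-contrario test for vertical line scratches.
import numpy as np
from scipy.stats import binom

def detect_scratch_columns(gray, grad_thresh=20.0, p=0.1, eps=1.0):
    """gray: (H, W) float image -> indices of columns flagged as scratches."""
    gx = np.abs(np.diff(gray, axis=1))          # horizontal gradient magnitude
    aligned = (gx > grad_thresh).sum(axis=0)    # strong pixels per column
    H, W = gray.shape
    # NFA = (number of tested columns) * P[Binomial(H, p) >= count]
    nfa = (W - 1) * binom.sf(aligned - 1, H, p)
    return np.where(nfa < eps)[0]               # meaningful (unexpected) columns

img = np.full((100, 64), 128.0)
img[:, 30] += 60.0                              # synthetic vertical scratch
print(detect_scratch_columns(img))             # ~[29 30] (the scratch's two edges)
```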