Vision-based traffic surveys in urban environments
This paper presents a state-of-the-art, vision-based vehicle detection and type classification system for performing traffic surveys from a roadside closed-circuit television camera. Vehicles are detected using background subtraction based on a Gaussian mixture model that can cope with vehicles that remain stationary over significant periods of time. Vehicle silhouettes are described using a combination of shape and appearance features based on an intensity-based pyramid histogram of orientation gradients (HOG). Classification is performed by a support vector machine trained on a small set of hand-labeled silhouette exemplars. These exemplars are identified using a model-based pre-classifier that utilizes calibrated images mapped by Google Earth to provide accurately surveyed scene geometry matched to visible image landmarks. Kalman filters track the vehicles, enabling classification by majority voting over several consecutive frames. The system counts vehicles and separates them into four categories: car, van, bus, and motorcycle (including bicycles). Experiments with real-world data show a vehicle detection rate of 96.45% and a classification accuracy of 95.70%. The authors gratefully acknowledge the Royal Borough of Kingston for providing the video data. S.A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement nº 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander
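As a rough illustration of the detection-and-classification pipeline described above, the sketch below combines OpenCV's Gaussian-mixture background subtractor, HOG descriptors of the resulting silhouettes, and a support vector machine. All parameter values, thresholds and helper names are assumptions made for illustration, not the authors' configuration.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

# Gaussian mixture background model; history and threshold are assumed values.
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                              detectShadows=True)
# Plain (non-pyramid) HOG descriptor as a stand-in for the paper's
# intensity-based pyramid HOG features.
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def silhouette_features(frame):
    """Return one HOG vector per foreground blob in a video frame."""
    mask = bg_model.apply(frame)
    mask = cv2.medianBlur(mask, 5)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    feats = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < 400:                      # skip tiny blobs (assumed threshold)
            continue
        patch = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        feats.append(hog.compute(cv2.resize(patch, (64, 64))).ravel())
    return feats

# A support vector machine would then be trained on hand-labeled exemplars:
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```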
Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural networks
Recognising human actions in untrimmed videos is an important and challenging task. An effective three-dimensional (3D) motion representation and a powerful learning model are two key factors influencing recognition performance. In this study, the authors introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform the 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a colour-encoding process. By normalising the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the colour-coded representation is able to represent the spatio-temporal evolution of complex 3D motions, independently of the length of each sequence. The authors then design and train different deep convolutional neural networks based on the residual network architecture on the obtained image-based representations to learn 3D motion features and classify them into action classes. The proposed method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches while requiring less computation for training and prediction
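A minimal sketch of the colour-encoding idea follows, assuming per-axis min-max normalisation and a fixed output image size; the division of each frame into five body parts and the exact joint ordering used by the authors are omitted.

```python
import cv2
import numpy as np

def skeleton_to_rgb(seq, out_size=224):
    """seq: (T, J, 3) array of 3D joint coordinates over T frames and J joints."""
    lo, hi = seq.min(axis=(0, 1)), seq.max(axis=(0, 1))
    norm = (seq - lo) / (hi - lo + 1e-8)       # per-axis min-max normalisation
    img = (norm * 255).astype(np.uint8)        # x -> R, y -> G, z -> B
    # Resize so that sequences of any length map to one fixed image size.
    return cv2.resize(img, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```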
Evaluation framework for crowd behaviour simulation and analysis based on real videos and scene reconstruction
This paper was presented at the 6th Latin-American Conference on Networked and Electronic Media (LACNEM 2015). Crowd simulation has been regarded as an important research topic in computer graphics, computer vision, and related areas. Various approaches have been proposed to simulate real-life scenarios. In this paper, a novel framework that evaluates the accuracy and realism of crowd simulation algorithms is presented. The framework is based on the concept of recreating real video scenes in 3D environments and applying crowd and pedestrian simulation algorithms to the agents using a plug-in architecture. The real videos are compared with recorded videos of the simulated scene, and novel Human Visual System (HVS) based similarity features and metrics are introduced in order to compare and evaluate simulation methods. The experiments show that the proposed framework provides efficient methods to evaluate crowd and pedestrian simulation algorithms with high accuracy and low cost
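By way of illustration, one HVS-inspired comparison could use SSIM between corresponding frames of the real and simulated videos; this stands in for, and is not, the paper's own similarity features.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def mean_frame_ssim(real_path, sim_path):
    """Average SSIM between paired frames of a real and a simulated video."""
    cap_r, cap_s = cv2.VideoCapture(real_path), cv2.VideoCapture(sim_path)
    scores = []
    while True:
        ok_r, fr = cap_r.read()
        ok_s, fs = cap_s.read()
        if not (ok_r and ok_s):
            break
        gr = cv2.cvtColor(fr, cv2.COLOR_BGR2GRAY)
        gs = cv2.cvtColor(cv2.resize(fs, (gr.shape[1], gr.shape[0])),
                          cv2.COLOR_BGR2GRAY)
        scores.append(ssim(gr, gs))
    return sum(scores) / len(scores) if scores else 0.0
```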
3D-Hog Embedding Frameworks for Single and Multi-Viewpoints Action Recognition Based on Human Silhouettes
This paper was presented at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Given the high demand for automated systems for human action recognition, great efforts have been undertaken in recent decades to advance the field. In this paper, we present frameworks for single- and multi-viewpoint action recognition based on the Space-Time Volume (STV) of human silhouettes and 3D-Histogram of Oriented Gradient (3D-HOG) embedding. We exploit fast computational approaches involving Principal Component Analysis (PCA) over the local feature spaces to compactly describe actions as combinations of local gestures, and L2-Regularized Logistic Regression (L2-RLR) to learn the action model from local features. Results on the Weizmann and i3DPost datasets confirm the efficacy of the proposed approaches compared to the baseline method and other works, in terms of accuracy and robustness to appearance changes
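The PCA-plus-L2-RLR stage named in the abstract can be sketched as follows; the descriptor dimensions and placeholder data are assumptions, and the 3D-HOG extraction itself is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# X: pooled local 3D-HOG descriptors per clip, y: action labels (placeholders).
X = np.random.rand(200, 972)
y = np.random.randint(0, 10, 200)

clf = make_pipeline(
    PCA(n_components=64),                                    # compact local feature space
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # L2-regularised logistic regression
)
clf.fit(X, y)
```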
A Bag of Expression framework for improved human action recognition
The Bag of Words (BoW) approach has been widely used for human action recognition in recent state-of-the-art methods. In this paper, we introduce what we call a Bag of Expression (BoE) framework, based on the bag of words method, for recognizing human actions in simple and realistic scenarios. The proposed approach includes space-time neighborhood information in addition to visual words. The main focus is to enhance the existing strengths of the BoW approach, such as view independence, scale invariance and occlusion handling. BoE builds expressions from independent pairs of neighbors, so it is tolerant to occlusion and capable of handling view independence to some extent in realistic scenarios. Our main contribution is a class-specific visual word extraction approach that establishes relationships between the extracted visual words in both the space and time dimensions. Finally, we carry out a set of experiments to optimize different parameters and compare performance with recent state-of-the-art methods. Our approach outperforms existing Bag of Words based approaches when evaluated using the same performance evaluation methods. We tested our approach on four publicly available datasets for human action recognition, i.e. UCF-Sports, KTH, UCF11 and UCF50, achieving average accuracies of 97.3%, 99.5%, 96.7% and 93.42%, respectively. Sergio A. Velastin has received funding from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement nº 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander
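For orientation, a conventional bag-of-visual-words encoding, the starting point the BoE framework builds on, can be sketched as below; the expression-building step that pairs each word with its space-time neighbours is only indicated in a comment, and the vocabulary size is an assumed value.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=200):
    """descriptors: (N, d) local spatio-temporal features from training videos."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def encode_video(vocab, descriptors):
    """Histogram of visual words for one video. BoE would additionally pair each
    word with its space-time neighbours to form expressions before pooling."""
    words = vocab.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)
```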
Exploiting deep residual networks for human action recognition from skeletal data
The computer vision community is currently focusing on solving action recognition problems in real videos, which contain thousands of samples and many challenges. In this process, Deep Convolutional Neural Networks (D-CNNs) have played a significant role in advancing the state-of-the-art in various vision-based action recognition systems. Recently, the introduction of residual connections in conjunction with a more traditional CNN model in a single architecture, called the Residual Network (ResNet), has shown impressive performance and great potential for image recognition tasks. In this paper, we investigate and apply deep ResNets to human action recognition using skeletal data provided by depth sensors. First, the 3D coordinates of the human body joints carried in skeleton sequences are transformed into image-based representations and stored as RGB images. These color images capture the spatio-temporal evolution of 3D motions from skeleton sequences and can be efficiently learned by D-CNNs. We then propose a novel deep learning architecture based on ResNets to learn features from the obtained color-based representations and classify them into action classes. The proposed method is evaluated on three challenging benchmark datasets: MSR Action 3D, KARD, and NTU-RGB+D. Experimental results demonstrate that our method achieves state-of-the-art performance on all these benchmarks whilst requiring fewer computational resources. In particular, the proposed method surpasses previous approaches by a significant margin of 3.4% on the MSR Action 3D dataset, 0.67% on the KARD dataset, and 2.5% on the NTU-RGB+D dataset
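A minimal PyTorch sketch of the learning stage, fine-tuning a standard torchvision ResNet on the colour-encoded skeleton images; the depth, optimiser and hyper-parameters are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 60                          # e.g. NTU-RGB+D defines 60 action classes
model = models.resnet18(weights=None)     # any ResNet depth could be substituted
model.fc = nn.Linear(model.fc.in_features, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """images: (B, 3, 224, 224) colour-encoded skeleton sequences."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```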
Fall detection and activity recognition using human skeleton features
Human activity recognition has attracted the attention of researchers around the world. This is an interesting problem that can be addressed in different ways, and many approaches have been presented in recent years. These applications provide solutions for recognizing different kinds of activities, such as whether a person is walking, running, jumping, jogging, or falling, among others. Amongst all these activities, fall detection is of special importance because a fall is a common dangerous event for people of all ages, with a more negative impact on the elderly population. Usually, these applications use sensors to detect sudden changes in a person's movement. Such sensors can be embedded in smartphones, necklaces, or smart wristbands to make them “wearable” devices. The main inconvenience is that these devices have to be placed on the subject's body, which might be uncomfortable and is not always feasible because this type of sensor must be monitored constantly and cannot be used in open spaces with unknown people. Fall detection from video camera images therefore presents advantages over wearable sensor-based approaches. This paper presents a vision-based approach to fall detection and activity recognition. The main contribution of the proposed method is to detect falls using only images from a standard video camera, without the need for environmental sensors; detection is carried out using human skeleton estimation for feature extraction. The use of human skeleton detection opens the possibility of detecting not only falls but also different kinds of activities for several subjects in the same scene, so this approach can be used in real environments where a large number of people may be present at the same time. The method is evaluated on the public UP-FALL dataset and surpasses the performance of other fall detection and activity recognition systems that use that dataset
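As a toy illustration of a skeleton-based fall cue, the sketch below flags a fall when hip keypoints from a pose estimator drop rapidly within a short window; the thresholds are assumptions, and the actual method learns a classifier over richer skeleton features.

```python
import numpy as np

def fall_detected(hip_y, fps=30, drop_ratio=0.35, window_s=1.0):
    """hip_y: per-frame hip height normalised to [0, 1] in image coordinates
    (larger = lower in the frame). Flags a fall when the hips drop by more
    than drop_ratio of the frame height within window_s seconds."""
    hip_y = np.asarray(hip_y, dtype=float)
    win = int(fps * window_s)
    for t in range(len(hip_y) - win):
        if hip_y[t + win] - hip_y[t] > drop_ratio:
            return True
    return False
```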
Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition
This article belongs to the Section Intelligent Sensors. Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach, along with its variations, has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube around each visual word. To handle inter-class variation, we use class-specific visual word representations for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, since constructing neighborhoods with a fixed number of neighbors can include non-relevant information, making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. Thus, the proposed approach makes the model more robust to the occlusion and changing-viewpoint challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) to classify bags of expressions into action classes. Comprehensive experiments on four publicly available datasets, KTH, UCF Sports, UCF11 and UCF50, show that the proposed model outperforms existing state-of-the-art human action recognition methods in terms of accuracy, achieving 99.21%, 98.60%, 96.94% and 94.10%, respectively. Sergio A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement N° 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander. Muhammad Haroon Yousaf received funding from the Higher Education Commission, Pakistan, for the Swarm Robotics Lab under the National Centre for Robotics and Automation (NCRA). The authors also acknowledge support from the Directorate of ASR&TD, University of Engineering and Technology Taxila, Pakistan
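To illustrate the density idea, the sketch below keeps only visual-word occurrences that have enough neighbours inside a fixed spatio-temporal cube; the cube size and density threshold are assumed values, not those of D-STBoE.

```python
import numpy as np

def dense_word_mask(points, cube_xy=20, cube_t=10, min_density=5):
    """points: (N, 3) array of (x, y, t) locations of visual-word occurrences
    in one video; returns a boolean mask of the occurrences dense enough to
    take part in expression building."""
    keep = np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        inside = ((np.abs(points[:, 0] - p[0]) <= cube_xy) &
                  (np.abs(points[:, 1] - p[1]) <= cube_xy) &
                  (np.abs(points[:, 2] - p[2]) <= cube_t))
        keep[i] = inside.sum() - 1 >= min_density   # exclude the word itself
    return keep
```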
Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks
Recognizing human actions in untrimmed videos is an important and challenging task. An effective 3D motion representation and a powerful learning model are two key factors influencing recognition performance. In this paper we introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color encoding process. By normalizing the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the color-coded representation is able to represent spatio-temporal evolutions of complex 3D motions, independently of the length of each sequence. We then design and train different Deep Convolutional Neural Networks (D-CNNs) based on the Residual Network (ResNet) architecture on the obtained image-based representations to learn 3D motion features and classify them into classes. Our method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches whilst requiring less computation for training and prediction. This research was carried out at the Cerema Research Center (CEREMA) and the Toulouse Institute of Computer Science Research (IRIT), Toulouse, France. Sergio A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for Research, Technological Development and demonstration under grant agreement N. 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander
Federated learning enables big data for rare cancer boundary detection
Although machine learning (ML) has shown promise across disciplines, out-of-sample generalizability remains a concern. This is currently addressed by sharing multi-site data, but such centralization is challenging or infeasible to scale due to various limitations. Federated ML (FL) provides an alternative paradigm for accurate and generalizable ML by sharing only numerical model updates. Here we present the largest FL study to date, involving data from 71 sites across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, reporting the largest such dataset in the literature (n = 6,314). We demonstrate a 33% delineation improvement for the surgically targetable tumor, and 23% for the complete tumor extent, over a publicly trained model. We anticipate our study to: 1) enable more healthcare studies informed by large, diverse data, ensuring meaningful results for rare diseases and underrepresented populations; 2) facilitate further analyses for glioblastoma by releasing our consensus model; and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing
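A generic federated-averaging sketch of the "share only numerical model updates" idea: each site trains locally, and only its weights, weighted by local sample counts, are aggregated centrally. This is textbook FedAvg, not the consortium's actual implementation.

```python
import torch

def federated_average(site_state_dicts, site_sizes):
    """site_state_dicts: list of model.state_dict() objects, one per site;
    site_sizes: number of local training samples at each site."""
    total = float(sum(site_sizes))
    avg = {}
    for key in site_state_dicts[0]:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(site_state_dicts, site_sizes))
    return avg

# The aggregated weights then replace the global model before the next round:
# global_model.load_state_dict(federated_average(updates, sizes))
```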
- …