3,063 research outputs found

    Hierarchical Attention Network for Action Segmentation

    The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video. Several attempts have been made to capture frame-level salient aspects through attention, but they lack the capacity to effectively map the temporal relationships between frames, as they capture only a limited span of temporal dependencies. To this end, we propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time, thus improving the overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales to form embeddings at the frame level and the segment level, and performs fine-grained action segmentation. This yields a simple, lightweight, yet highly effective architecture for segmenting continuous video streams, with multiple application domains. We evaluate our system on multiple challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets, and achieve state-of-the-art performance. The evaluated datasets encompass numerous video capture settings, including static overhead camera views and dynamic, egocentric head-mounted camera views, demonstrating the direct applicability of the proposed framework in a variety of settings. (Published in Pattern Recognition Letters.)
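    As a rough illustration of the two-level idea described above (a sketch, not the authors' published code), the PyTorch snippet below forms segment embeddings by attending over frames within fixed-length chunks, contextualizes the segments with a second recurrent layer, and classifies every frame. The chunk size, feature dimension, and layer widths are illustrative assumptions.

```python
# A minimal sketch of hierarchical recurrent attention for action segmentation:
# a frame-level GRU with attention produces segment embeddings, a segment-level
# GRU contextualizes them, and a per-frame classifier emits action labels.
import torch
import torch.nn as nn

class HierarchicalAttentionSegmenter(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256, num_classes=10, chunk=16):
        super().__init__()
        self.chunk = chunk                      # frames per local segment (assumed)
        self.frame_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # frame-level attention scores
        self.segment_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (B, T, feat_dim), T % chunk == 0
        B, T, D = x.shape
        chunks = x.view(B * T // self.chunk, self.chunk, D)
        h, _ = self.frame_rnn(chunks)           # (B*S, chunk, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention over frames in a chunk
        seg = (w * h).sum(dim=1)                # (B*S, hidden) segment embeddings
        seg = seg.view(B, T // self.chunk, -1)
        ctx, _ = self.segment_rnn(seg)          # segment-level temporal context
        ctx = ctx.repeat_interleave(self.chunk, dim=1)  # broadcast back to frames
        frame_h = h.view(B, T, -1)
        return self.classifier(torch.cat([frame_h, ctx], dim=-1))  # (B, T, classes)

logits = HierarchicalAttentionSegmenter()(torch.randn(2, 64, 2048))
print(logits.shape)  # torch.Size([2, 64, 10])
```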

    Action recognition based on efficient deep feature learning in the spatio-temporal domain

    Hand-crafted feature functions are usually designed based on domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly. Data-driven feature learning methods, on the other hand, have emerged as an alternative that often generalizes better in uncontrolled environments. We present a simple, yet robust, 2D convolutional neural network extended to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data. The resulting network model is used for content-based recognition of videos. Relying on a 2D convolutional neural network allows us to exploit a pretrained network as a descriptor that yielded the best results on the large and challenging ILSVRC-2014 dataset. Experimental results on commonly used benchmark video datasets demonstrate that our results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optical flow) or a priori knowledge of the data capture (e.g., camera motion estimation), which makes our approach more general and flexible than others. Our implementation is made available.
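    A minimal sketch of the described design, assuming a recent PyTorch/torchvision and with ResNet-18 standing in for the paper's unnamed ILSVRC-2014 backbone: a pretrained 2D CNN produces per-frame feature maps, which are stacked along time and passed through a small 3D convolutional head for clip classification.

```python
# Sketch only: 2D per-frame features -> concatenated 3D convolution over time.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class TwoDPlus3DNet(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.features2d = nn.Sequential(*list(backbone.children())[:-2])  # (512, 7, 7) maps
        self.head3d = nn.Sequential(
            nn.Conv3d(512, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, 224, 224)
        B, T, C, H, W = clip.shape
        f = self.features2d(clip.view(B * T, C, H, W))       # (B*T, 512, 7, 7)
        f = f.view(B, T, 512, 7, 7).permute(0, 2, 1, 3, 4)   # (B, 512, T, 7, 7)
        return self.fc(self.head3d(f).flatten(1))

logits = TwoDPlus3DNet()(torch.randn(1, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 101])
```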

    Recent advances in video-based human action recognition using deep learning: A review

    Video-based human action recognition has become one of the most popular research areas in the field of computer vision and pattern recognition in recent years. It has a wide variety of applications, such as surveillance, robotics, health care, video search, and human-computer interaction. There are many challenges involved in human action recognition in videos, such as cluttered backgrounds, occlusions, viewpoint variation, execution rate, and camera motion. A large number of techniques have been proposed to address these challenges over the decades. Three different types of datasets, namely single-viewpoint, multi-viewpoint, and RGB-depth videos, are used for this research. This paper presents a review of various state-of-the-art deep learning-based techniques proposed for human action recognition on the three types of datasets. In light of the growing popularity and the recent developments in video-based human action recognition, this review imparts details of current trends and potential directions for future work to assist researchers.

    Applied deep learning in intelligent transportation systems and embedding exploration

    Deep learning techniques have achieved tremendous success in many real applications in recent years and show great potential in many areas, including transportation. Even though transportation has become increasingly indispensable in people's daily life, its related problems, such as traffic congestion and energy waste, have not been completely solved, and some have become even more critical. This dissertation focuses on solving the following fundamental problems in the transportation field using deep learning: (1) passenger demand prediction, (2) transportation mode detection, and (3) traffic light control. The dissertation also extends the application of deep learning to an embedding system for visualization and data retrieval.

    The first part of this dissertation presents a Spatio-TEmporal Fuzzy neural Network (STEF-Net), which accurately predicts passenger demand by incorporating the complex interaction of all known important factors, such as temporal, spatial, and external information. Specifically, a convolutional long short-term memory network is employed to simultaneously capture spatio-temporal feature interactions, and a fuzzy neural network to model external factors. A novel feature fusion method with convolution and an attention layer is proposed to preserve the temporal relations and the discriminative spatio-temporal feature interactions. Experiments on a large-scale real-world dataset show that the proposed model outperforms state-of-the-art approaches.

    The second part is a lightweight and energy-efficient system that detects transportation modes using only the accelerometer sensors in smartphones. Understanding people's transportation modes is beneficial to many civilian applications, such as urban transportation planning. The system collects accelerometer data in an efficient way and leverages a convolutional neural network to determine transportation modes. Different architectures and classification methods are tested with the proposed convolutional neural network to optimize the system design. Performance evaluation shows that the proposed approach achieves better accuracy than existing work in detecting people's transportation modes.

    The third component of this dissertation is a deep reinforcement learning model, based on Q-learning, to control traffic lights. Existing inefficient traffic light control causes numerous problems, such as long delays and wasted energy. In the proposed model, the complex traffic scenario is quantified into states by collecting data and dividing the whole intersection into grids; the timing changes of a traffic light are the actions, modeled as a high-dimensional Markov decision process; and the reward is the cumulative waiting-time difference between two cycles. To solve the model, a convolutional neural network is employed to map states to rewards, further optimized by several components such as a dueling network, a target network, double Q-learning, and prioritized experience replay. Simulation results in Simulation of Urban MObility (SUMO) show the efficiency of the proposed model in controlling traffic lights.

    The last part of this dissertation studies the hierarchical structure in an embedding system. Traditional embedding approaches associate a real-valued embedding vector with each symbol or data point, which yields a storage-inefficient representation and fails to effectively encode the internal semantic structure of the data. A regularized autoencoder framework is proposed to learn a compact Hierarchical K-way D-dimensional (HKD) discrete embedding of data points, aiming to capture the semantic structure of the data. Experimental results on synthetic and real-world datasets show that the proposed HKD embedding can effectively reveal the semantic structure of data via visualization and greatly reduce the search space of nearest-neighbor retrieval while preserving high accuracy.
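    As one concrete reading of the traffic-light component (the dissertation's exact architecture and hyperparameters are not given here), the sketch below shows a deep Q-learning update with a CNN over a grid-shaped intersection state and a target network for stable bootstrapping; the grid size, action count, and learning rate are assumptions.

```python
# Illustrative DQN-style update for traffic light control, not the thesis code.
import copy
import torch
import torch.nn as nn

GRID, N_ACTIONS, GAMMA = 8, 4, 0.99   # assumed discretization and action set

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * GRID * GRID, N_ACTIONS),  # one Q-value per timing action
        )

    def forward(self, s):              # s: (B, 1, GRID, GRID) occupancy grid
        return self.net(s)

policy = QNet()
target = copy.deepcopy(policy)         # target network, refreshed periodically
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def td_update(batch):
    """One Q-learning step; batch = (s, a, r, s_next) sampled from a replay buffer."""
    s, a, r, s_next = batch
    q = policy(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # reward: the cumulative waiting-time difference between two cycles
        y = r + GAMMA * target(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy transition batch to show the expected shapes
batch = (torch.rand(32, 1, GRID, GRID), torch.randint(0, N_ACTIONS, (32,)),
         torch.rand(32), torch.rand(32, 1, GRID, GRID))
print(td_update(batch))
```

    The dueling, double-Q, and prioritized-replay refinements mentioned in the abstract would slot into this loop: dueling changes the head of QNet, double Q-learning selects the bootstrap action with the policy network, and prioritized replay reweights the sampled batch.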

    Fuzzy Layered Convolutional Neural Network for Feature-Level Fusion Based on Multimodal Sentiment Classification

    Multimodal sentiment analysis (MSA) is one of the core research topics of natural language processing (NLP). MSA has become a challenge for scholars and is equally complicated for a machine to comprehend. MSA involves learning opinions, emotions, and attitudes from audio-visual data; in other words, it is necessary to use such diverse modalities to obtain opinions and identify emotions. This can be achieved via modality data fusion, such as feature fusion. For handling the fusion of such diverse modalities while maintaining high performance, a typical machine learning approach is deep learning, particularly the Convolutional Neural Network (CNN), which has the capacity to handle tasks of great intricacy and difficulty. In this paper, we present a CNN architecture with an integrated layer built via fuzzy methodologies for MSA, an approach yet to be explored for improving the accuracy of CNNs on diverse inputs. Experiments conducted on a benchmark multimodal dataset, MOSI, obtain 37.5% and 81% accuracy on seven-class and binary classification respectively, an improvement over a standard CNN, which achieves 28.9% and 78%, respectively.
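    The paper's exact fuzzy layer is not reproduced here; the sketch below shows one plausible construction, in which learnable Gaussian membership functions are computed over the fused multimodal feature vector and concatenated with crisp CNN features before classification. The MOSI feature dimensions used are common conventions, not taken from the paper.

```python
# Hypothetical fuzzy-layer CNN for feature-level multimodal fusion.
import torch
import torch.nn as nn

class FuzzyLayer(nn.Module):
    """Learnable Gaussian memberships: mu_k(x) = exp(-(x - c_k)^2 / (2 * s_k^2))."""
    def __init__(self, in_dim, n_memberships=3):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_memberships, in_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_memberships, in_dim))

    def forward(self, x):                          # x: (B, in_dim)
        d = x.unsqueeze(1) - self.centers          # (B, K, in_dim)
        m = torch.exp(-0.5 * (d / self.log_sigma.exp()) ** 2)
        return m.flatten(1)                        # (B, K * in_dim) memberships

class FuzzyFusionCNN(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=35, n_classes=7):
        super().__init__()
        fused = text_dim + audio_dim + video_dim
        self.conv = nn.Sequential(                 # 1D conv over the fused feature vector
            nn.Conv1d(1, 8, 5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(64), nn.Flatten(),
        )
        self.fuzzy = FuzzyLayer(fused)
        self.fc = nn.Linear(8 * 64 + 3 * fused, n_classes)

    def forward(self, text, audio, video):
        x = torch.cat([text, audio, video], dim=-1)        # feature-level fusion
        crisp = self.conv(x.unsqueeze(1))                  # (B, 512) crisp features
        return self.fc(torch.cat([crisp, self.fuzzy(x)], dim=-1))

out = FuzzyFusionCNN()(torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35))
print(out.shape)  # torch.Size([4, 7])
```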

    SPACE-TIME GRAPH-BASED CONVOLUTIONAL NEURAL NETWORKS FOR MOVEMENT RECOGNITION OF FOOTBALL PLAYERS

    Behaviour recognition technology is an interdisciplinary technology, integrating many research achievements in computer vision, deep learning, pattern recognition, and other fields. The key information in skeletal data about human behaviour can accurately describe the motion posture of the human body in three-dimensional space, and its rigid connection structure is robust to various external interference factors. However, behaviour recognition algorithms are influenced by factors such as background, lighting, and environment, which easily leads to unstable recognition accuracy and limited application scenarios. To address this problem, in this paper we propose a noise filtering algorithm based on data correlation and skeleton energy model filtering, construct a football player dataset, use the ST-GCN algorithm to train on the skeleton features of football players, and build a behaviour recognition system applied to football players. Finally, comparing against Deep LSTM and 2s-AGCN, the proposed algorithm achieves TOP-1 and TOP-5 accuracies of 39.97% and 66.34%, respectively, significantly higher than the other two algorithms. The system can compile statistics on athletes and analyse the technical and tactical movements of players on the football field.
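    A simplified spatio-temporal graph convolution block in the spirit of ST-GCN (not the paper's implementation): joint features are mixed through a normalized skeleton adjacency matrix, then convolved along the temporal axis. The five-joint chain graph below is a toy example.

```python
# Sketch of one ST-GCN-style block on skeleton sequences.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        A = A + torch.eye(A.size(0))                        # add self-loops
        self.register_buffer("A_hat", A / A.sum(dim=1, keepdim=True))  # normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                   # x: (B, C, T, V), V = joints
        x = torch.einsum("bctv,vw->bctw", x, self.A_hat)    # spatial aggregation over joints
        x = self.relu(self.spatial(x))                      # per-joint feature transform
        return self.relu(self.temporal(x))                  # temporal conv over frames

# toy skeleton graph: 5 joints in a chain
A = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
block = STGCNBlock(3, 64, A)                                # channels: (x, y, confidence)
out = block(torch.randn(2, 3, 30, 5))                       # 2 clips, 30 frames, 5 joints
print(out.shape)  # torch.Size([2, 64, 30, 5])
```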

    An original framework for understanding human actions and body language by using deep neural networks

    The evolution of both Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, meanwhile, plays a key role in the fields of action recognition and affective computing: the former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements. Both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: the first proposes a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) for the recognition of sign language and semaphoric hand gestures; the second presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; the last provides a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs). The performance of LSTM-RNNs is explored in depth, due to their ability to model the long-term contextual information of temporal sequences, which makes them suitable for analysing body movements. All the modules were tested on challenging datasets, well known in the state of the art, showing remarkable results compared to current literature methods.
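    As an illustration of the second module's design (an interpretation, not the thesis code), the sketch below feeds raw 2D joint coordinates to one stacked LSTM and their frame-to-frame differences to another, fusing the final hidden states for action classification; the joint count and layer sizes are assumptions.

```python
# Sketch of a two-branch stacked-LSTM action classifier on 2D skeletons.
import torch
import torch.nn as nn

class TwoBranchLSTM(nn.Module):
    def __init__(self, n_joints=18, hidden=128, n_classes=12):
        super().__init__()
        in_dim = n_joints * 2                       # (x, y) per joint, flattened
        self.pose_branch = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.motion_branch = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, skel):                        # skel: (B, T, n_joints*2)
        motion = skel[:, 1:] - skel[:, :-1]         # frame-to-frame differences
        _, (h_pose, _) = self.pose_branch(skel)
        _, (h_motion, _) = self.motion_branch(motion)
        fused = torch.cat([h_pose[-1], h_motion[-1]], dim=-1)  # last-layer states
        return self.fc(fused)

logits = TwoBranchLSTM()(torch.randn(4, 50, 36))
print(logits.shape)  # torch.Size([4, 12])
```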