9 research outputs found

    Artificial Intelligence Enabled Methods for Human Action Recognition using Surveillance Videos

    Get PDF
    Computer vision applications have been attracting increasing attention from researchers and academia, all the more so now that cloud computing resources make such applications practical. Analysing surveillance video has become an important research area due to its widespread applications. For instance, CCTV cameras are used in public places to monitor situations and identify instances of theft or crime. With thousands of such surveillance videos streaming simultaneously, manual analysis is a tedious and time-consuming task. There is a need for automated approaches that analyse the footage and report notifications or findings to the officers concerned. Such automation is very useful to police and investigation agencies for ascertaining facts, recovering evidence, and supporting digital forensics. In this context, this paper surveys different methods of human action recognition (HAR) based on machine learning (ML) and deep learning (DL), both of which fall under Artificial Intelligence (AI). It also reviews methods for privacy-preserving action recognition and Generative Adversarial Networks (GANs). Finally, the paper describes the datasets used in HAR research and gives an account of research gaps that can guide further work in the area.

    A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings

    Get PDF
    A task crucial to overall video understanding is the recognition and temporal localisation of the different actions or events present in a scene. To address this problem, action segmentation must be achieved: temporally segmenting a video by labelling each frame with a specific action. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method extracts spatio-temporal features from videos in samples of 0.5 s using a pre-trained deep network. The data are then transformed using a positional encoder, and finally a clustering algorithm is applied, using the silhouette score to find the optimal number of clusters, where each cluster presumably corresponds to a single, distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos benchmarks.
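
    A minimal sketch of this pipeline is shown below. Here `features` stands in for the output of any pre-trained spatio-temporal network applied to 0.5 s clips; the sinusoidal encoding scheme, the use of k-means, and the cluster range are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def positional_encoding(n: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding over clip indices."""
    pos = np.arange(n)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def segment_actions(features: np.ndarray, k_min: int = 2, k_max: int = 10) -> np.ndarray:
    """Cluster positionally encoded clip features; pick k by silhouette score."""
    feats = features + positional_encoding(*features.shape)
    best_labels, best_score = None, -np.inf
    for k in range(k_min, min(k_max, len(feats) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # one presumed action label per 0.5 s clip
```

    Selecting the number of clusters via the silhouette score is what lets such a method run without annotated data or prior knowledge of how many actions the video contains.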

    Localizing the Common Action Among a Few Videos

    Get PDF
    This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many training examples annotated with the start, the end, and/or the class of the action, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined from just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
    Comment: ECCV 2020
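
    The pairwise-matching idea can be illustrated with a short sketch. This is not the authors' exact modules: the feature shapes, cosine similarity, and softmax fusion rule are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def match_and_score(query: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """
    query:    (T, D) clip-level features of the untrimmed query video
    supports: (N, D) pooled features of the N trimmed support videos
    Returns a relevance score per query clip, shape (T,).
    """
    q = F.normalize(query, dim=-1)               # (T, D)
    s = F.normalize(supports, dim=-1)            # (N, D)
    sim = q @ s.t()                              # (T, N) cosine similarities
    weights = F.softmax(sim.mean(dim=0), dim=0)  # importance of each support video
    prototype = (weights[:, None] * s).sum(dim=0)  # (D,) weighted support prototype
    return q @ prototype                         # (T,) per-clip relevance
```

    Thresholding these per-clip scores, or taking the best contiguous run, would then yield a candidate start and end for the common action.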

    Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

    Full text link
    We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport. In particular, we incorporate a temporal regularization term, which preserves the temporal order of the activity, into the standard optimal transport module for computing pseudo-label cluster assignments. The temporal optimal transport module enables our approach to learn effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50 Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, despite having significantly lower memory requirements.
    Comment: Preprint. Under review
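
    A compact sketch of the temporally regularized optimal transport idea follows. It assumes Sinkhorn-style normalization and a Gaussian prior along the diagonal as the temporal order regularizer; the paper's exact formulation and hyperparameters may differ.

```python
import numpy as np

def temporal_ot_pseudo_labels(scores: np.ndarray, eps: float = 0.05,
                              sigma: float = 0.1, n_iters: int = 50) -> np.ndarray:
    """
    scores: (T, K) frame-to-cluster affinities for one mini-batch.
    Returns soft pseudo-label assignments of shape (T, K), one row per frame.
    """
    T, K = scores.shape
    # Temporal prior: frame t is nudged towards cluster ~ t * K / T, which
    # preserves the temporal order of the activity across clusters.
    t = np.arange(T)[:, None] / T
    k = np.arange(K)[None, :] / K
    prior = np.exp(-((t - k) ** 2) / (2 * sigma ** 2))
    Q = np.exp((scores - scores.max()) / eps) * prior
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # balance mass across clusters
        Q /= Q.sum(axis=1, keepdims=True)  # one unit of mass per frame
    return Q  # each row is a soft pseudo-label distribution over clusters
```

    Because the assignments are computed one mini-batch at a time, no features need to be stored for the whole dataset, which is what makes the online setting possible.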

    Temporal Segmentation of Human Actions in Videos

    Get PDF
    Understanding human actions in videos is of great interest in various scenarios, ranging from surveillance and quality control in production processes to content-based video search. Algorithms for automatic temporal action segmentation need to overcome severe difficulties in order to be reliable and provide sufficiently good quality. Not only can human actions occur in different scenes and surroundings, the definition of an action itself is also inherently fuzzy, leading to a significant amount of inter-class variations. Moreover, besides finding the correct action label for a pre-defined temporal segment in a video, localizing an action in the first place is anything but trivial. Different actions not only vary in their appearance and duration but can also have long-range temporal dependencies that span the complete video. Further, obtaining reliable annotations for large amounts of video data is time consuming and expensive.

    The goal of this thesis is to advance current approaches to temporal action segmentation. We therefore propose a generic framework that models the three components of the task explicitly: long-range temporal dependencies are handled by a context model, variations in segment durations are represented by a length model, and the short-term appearance and motion of actions are addressed with a visual model. While the inspiration for the context model mainly comes from word sequence models in natural language processing, the visual model builds upon recent advances in the classification of pre-segmented action clips. Since long-range temporal context is crucial, we avoid local segmentation decisions and find the globally optimal temporal segmentation of a video under the explicit models.

    Throughout the thesis, we provide explicit formulations and training strategies for the proposed generic action segmentation framework under different supervision conditions. First, we address fully supervised temporal action segmentation, where frame-level annotations are available during training. We show that our approach can outperform early sliding-window baselines as well as recent deep architectures, and that explicit length and context modeling leads to substantial improvements. Since full frame-level annotation is expensive to obtain, we then formulate a weakly supervised training algorithm that uses the ordered sequence of actions occurring in the video as the only supervision. While a first approach reduces the weakly supervised setup to a fully supervised one by generating a pseudo ground-truth during training, we propose a second approach that avoids this intermediate step and directly optimizes a loss based on the weak supervision. Closing the gap between the fully and weakly supervised setups, we moreover evaluate semi-supervised learning, where video frames are sparsely annotated. Motivated by the fact that the vast amount of video data on the Internet comes only with meta-tags or content keywords that provide no temporal ordering information, we finally propose a method for action segmentation that learns from unordered sets of actions only. All approaches are evaluated on several commonly used benchmark datasets. With the proposed methods, we reach state-of-the-art performance for both fully and weakly supervised action segmentation.
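
    The combination of the three explicit models into one globally optimal decision can be sketched as a Viterbi-style dynamic program. The following is an illustrative simplification, not the thesis's exact probabilistic formulation: the function names, score shapes, and the maximum-length cap are assumptions.

```python
import numpy as np

def optimal_segmentation(frame_logp, trans_logp, length_logp, max_len):
    """
    frame_logp:  (T, C) log-score of action c at frame t        (visual model)
    trans_logp:  (C, C) log-score of action c' following c      (context model)
    length_logp: (C, max_len) log-score of action c lasting l frames (length model)
    Returns the best labeling as a list of (start, end, action) segments.
    """
    T, C = frame_logp.shape
    cum = np.vstack([np.zeros((1, C)), np.cumsum(frame_logp, axis=0)])  # prefix sums
    best = np.full((T + 1, C), -np.inf)  # best[t, c]: best score ending at t with action c
    best[0, :] = 0.0
    back = {}
    for t in range(1, T + 1):
        for c in range(C):
            for l in range(1, min(max_len, t) + 1):
                visual = cum[t, c] - cum[t - l, c]  # frames t-l..t under action c
                prev = best[t - l] + (trans_logp[:, c] if t - l > 0 else 0.0)
                p = int(np.argmax(prev))
                score = prev[p] + visual + length_logp[c, l - 1]
                if score > best[t, c]:
                    best[t, c] = score
                    back[(t, c)] = (t - l, p)
    # Backtrack from the best final state to recover the segmentation.
    segments, t, c = [], T, int(np.argmax(best[T]))
    while t > 0:
        start, p = back[(t, c)]
        segments.append((start, t, c))
        t, c = start, p
    return segments[::-1]
```

    The sketch runs an exhaustive O(T·C²·L) search; a practical implementation would prune it, but it shows how visual, length, and context scores combine so that no local segmentation decision is ever made.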