314,086 research outputs found

    An Investigation and Application of Biology and Bioinformatics for Activity Recognition

    Get PDF
    Activity recognition in a smart home context is inherently difficult due to the variable nature of human activities and the tracking artifacts introduced by video-based tracking systems. This thesis addresses the activity recognition problem by introducing a biologically inspired chemotactic approach and bioinformatics-inspired sequence alignment techniques to recognise spatial activities. The approaches are demonstrated in real-world conditions to improve robustness and to recognise activities in the presence of innate activity variability and tracking noise.
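    As a loose illustration of the bioinformatics-inspired side of this thesis, the sketch below aligns two symbolic location traces with a Needleman-Wunsch-style global alignment and scores how well an observed trace matches an activity template; the symbols, scoring weights, and traces are invented here and are not the thesis's actual encoding.

```python
# Minimal sketch: global alignment of two symbolic activity traces
# (Needleman-Wunsch). Symbols, scores, and traces are hypothetical;
# the thesis's actual representation and scoring are not reproduced here.

def align(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of two symbol sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

# Two noisy traces of the same hypothetical routine: kitchen -> table -> sink.
observed = ["kitchen", "kitchen", "table", "sink"]
template = ["kitchen", "table", "table", "sink"]
print(align(observed, template))  # higher score = closer match to the template activity
```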

    Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

    Full text link
    Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art. Comment: Technical Report
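    A minimal sketch of this idea, under the assumption that the task graph can be read as first-order keystep transition probabilities: estimate transitions from training keystep sequences, then use them to rescore noisy per-segment keystep probabilities with Viterbi decoding. The keystep names, sequences, and scores below are invented for illustration; this is not the paper's actual task-graph construction or regularization.

```python
import numpy as np

keysteps = ["crack eggs", "whisk", "heat pan", "pour", "serve"]
K = len(keysteps)

# Hypothetical training sequences of keystep indices mined from how-to videos.
train_sequences = [[0, 1, 2, 3, 4], [0, 1, 3, 4], [2, 0, 1, 3, 4]]

# Transition-probability matrix with additive smoothing.
trans = np.ones((K, K))
for seq in train_sequences:
    for a, b in zip(seq, seq[1:]):
        trans[a, b] += 1
trans /= trans.sum(axis=1, keepdims=True)

def viterbi(frame_scores, trans):
    """frame_scores: (T, K) per-segment keystep probabilities. Returns the best keystep path."""
    T, K = frame_scores.shape
    log_p, log_t = np.log(frame_scores + 1e-9), np.log(trans)
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = log_p[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_t          # (prev, current)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + log_p[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Noisy per-segment scores for a 4-segment video (rows sum to 1).
scores = np.array([[.6, .2, .1, .05, .05],
                   [.2, .5, .2, .05, .05],
                   [.1, .2, .3, .3, .1],
                   [.05, .05, .1, .3, .5]])
print([keysteps[i] for i in viterbi(scores, trans)])
```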

    Multimodal Generation of Novel Action Appearances for Synthetic-to-Real Recognition of Activities of Daily Living

    Full text link
    Domain shifts, such as appearance changes, are a key challenge in real-world applications of activity recognition models, which range from assistive robotics and smart homes to driver observation in intelligent vehicles. For example, while simulations are an excellent way of economical data collection, a Synthetic-to-Real domain shift leads to a > 60% drop in accuracy when recognizing Activities of Daily Living (ADLs). We tackle this challenge and introduce an activity domain generation framework which creates novel ADL appearances (novel domains) from different existing activity modalities (source domains) inferred from video training data. Our framework computes human poses, heatmaps of body joints, and optical flow maps and uses them alongside the original RGB videos to learn the essence of the source domains in order to generate completely new ADL domains. The model is optimized by maximizing the distance between the existing source appearances and the generated novel appearances while ensuring that the semantics of an activity are preserved through an additional classification loss. While source data multimodality is an important concept in this design, our setup does not rely on multi-sensor setups (i.e., all source modalities are inferred from a single video only). The newly created activity domains are then integrated into the training of the ADL classification networks, resulting in models far less susceptible to changes in data distributions. Extensive experiments on the Synthetic-to-Real benchmark Sims4Action demonstrate the potential of the domain generation paradigm for cross-domain ADL recognition, setting new state-of-the-art results. Our code is publicly available at https://github.com/Zrrr1997/syn2real_DG Comment: 8 pages, 7 figures, to be published in IROS 2022
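    A hedged sketch of the training objective described above, assuming it can be reduced to two terms: push generated appearances away from the source appearances in feature space while a classification loss keeps the activity semantics intact. The function name, weighting, and toy tensors are placeholders; the paper's actual losses and architecture live in the linked repository.

```python
import torch
import torch.nn.functional as F

def domain_generation_loss(gen_feat, src_feat, logits, labels, alpha=1.0):
    """gen_feat/src_feat: (B, D) embeddings of generated and source clips;
    logits: (B, C) activity predictions made on the generated clips."""
    novelty = -F.mse_loss(gen_feat, src_feat)    # maximize distance to the source appearance
    semantics = F.cross_entropy(logits, labels)  # keep the activity label recognizable
    return alpha * novelty + semantics

# Toy usage with random tensors standing in for real network outputs.
B, D, C = 4, 128, 10
loss = domain_generation_loss(torch.randn(B, D), torch.randn(B, D),
                              torch.randn(B, C), torch.randint(0, C, (B,)))
print(float(loss))
```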

    Qualitative and quantitative spatio-temporal relations in daily living activity recognition

    Get PDF
    For the effective operation of intelligent assistive systems working in real-world human environments, it is important to be able to recognise human activities and their intentions. In this paper we propose a novel approach to activity recognition from visual data. Our approach is based on qualitative and quantitative spatio-temporal features which encode the interactions between human subjects and objects in an efficient manner. Unlike the state of the art, our approach uses significantly fewer assumptions and does not require knowledge about object types, their affordances, or the sub-level activities that high-level activities consist of. We perform an automatic feature selection process which provides the most representative descriptions of the learnt activities. We validated the method using these descriptions on the CAD-120 benchmark dataset, consisting of video sequences showing humans performing daily real-world activities. The method is shown to outperform state-of-the-art benchmarks.
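    A minimal sketch of one qualitative spatio-temporal feature of the kind described above, assuming tracked 2D positions for a person and an object: discretize the per-frame distance into symbols and keep only the sequence of changes. The thresholds, relation labels, and toy trajectories are illustrative, not the paper's feature set.

```python
import math

def qualitative_distance(human_xy, object_xy, near=0.5, mid=1.5):
    """Map a metric distance to a qualitative relation symbol."""
    d = math.dist(human_xy, object_xy)
    return "touching" if d < near else "near" if d < mid else "far"

def relation_sequence(human_traj, object_traj):
    """Collapse per-frame relations into their sequence of distinct values."""
    rels = [qualitative_distance(h, o) for h, o in zip(human_traj, object_traj)]
    return [r for i, r in enumerate(rels) if i == 0 or r != rels[i - 1]]

# Toy trajectories: the person approaches a cup and picks it up.
person = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.4, 0.0)]
cup = [(1.5, 0.0)] * 4
print(relation_sequence(person, cup))  # ['far', 'near', 'touching']
```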

    Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos

    Full text link
    Recognizing activities that cause distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of CLIP's visual representation for the distracted driving detection and classification task and report the results. Comment: 15 pages, 10 figures
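    A hedged sketch of the zero-shot part of such a pipeline using the public CLIP checkpoint from Hugging Face transformers; the prompt templates, class names, and image path are assumptions for illustration and do not reproduce the paper's prompts, datasets, or finetuning.

```python
# Zero-shot distracted-driver classification of a single frame with CLIP.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["driving attentively", "texting on a phone",
           "talking on a phone", "drinking", "reaching behind"]
prompts = [f"a photo of a driver {c}" for c in classes]  # hypothetical prompt template

image = Image.open("driver_frame.jpg")  # placeholder path to a frame from a driving video
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print(classes[int(probs.argmax())], float(probs.max()))
```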

    REACT: Recognize Every Action Everywhere All At Once

    Full text link
    Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (Recognize Every Action Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model, explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes and enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension. Comment: 10 pages, 4 figures, 5 tables
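    A very rough skeleton of the transformer encoder-decoder pattern the abstract builds on: spatiotemporal video tokens are encoded, learned query tokens are decoded against that memory, and the result is classified into a group activity. All module sizes, the query design, and the pooling are placeholders; REACT's Vision-Language Encoder, Action Decoder, and Actor Fusion blocks are not reproduced here.

```python
import torch
import torch.nn as nn

class EncoderDecoderGAR(nn.Module):
    def __init__(self, dim=256, heads=8, num_queries=12, num_classes=8):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learned query tokens
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, video_tokens):                 # (B, N, dim) spatiotemporal tokens
        memory = self.encoder(video_tokens)
        q = self.queries.expand(video_tokens.size(0), -1, -1)
        decoded = self.decoder(q, memory)            # queries attend to the video memory
        return self.classifier(decoded.mean(dim=1))  # one group-activity prediction per clip

logits = EncoderDecoderGAR()(torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 8])
```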

    Data Efficient Learning: Towards Reducing Risk and Uncertainty of Data Driven Learning Paradigm

    Get PDF
    The success of Deep Learning in various tasks is highly dependent on large amounts of domain-specific annotated data, which are expensive to acquire and may contain varying degrees of noise. In this doctoral journey, our research goal is first to identify and then tackle the issues relating to data that cause significant performance degradation in real-world applications of Deep Learning algorithms. Human Activity Recognition from RGB data is challenging due to the lack of relative motion parameters. To address this issue, we propose a novel framework that introduces skeleton information derived from RGB data for activity recognition. With experimentation, we demonstrate that our RGB-only solution surpasses state-of-the-art methods, all of which exploit RGB-D video streams, by a notable margin. The predictive uncertainty of Deep Neural Networks (DNNs) makes them unreliable for real-world deployment. Moreover, available labeled data may contain noise. We aim to address these two issues holistically by proposing a unified density-driven framework, which can effectively denoise training data as well as avoid predicting uncertain test data points. Our plug-and-play framework is easy to deploy in real-world applications while achieving superior performance over state-of-the-art techniques. To assess the effectiveness of our proposed framework in a real-world scenario, we experimented with X-ray images from COVID-19 patients. Supervised learning of DNNs inherits the limitation of a very narrow field of view in terms of known data distributions. Moreover, annotating data is costly. Hence, we explore self-supervised Siamese networks to avoid these constraints. Through extensive experimentation, we demonstrate that the self-supervised method performs surprisingly comparably to its supervised counterpart in a real-world use case. We also delve deeper with activation mapping and feature distribution visualization to understand the causality of this method. Through our research, we achieve a better understanding of issues relating to data-driven learning while solving some of the core problems of this paradigm, and we expose some novel and intriguing research questions to the community.
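    A generic stand-in for the density intuition behind the framework described above: score each training sample by the average distance to its nearest neighbours in feature space and flag low-density points as candidates for label noise or uncertain predictions. The feature dimensions, neighbour count, and percentile threshold are arbitrary choices for illustration, not the dissertation's method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (200, 16)),   # dense, in-distribution cluster
                      rng.normal(6, 1, (5, 16))])    # a few outlying samples (indices 200-204)

nn = NearestNeighbors(n_neighbors=6).fit(features)
dists, _ = nn.kneighbors(features)
density_score = dists[:, 1:].mean(axis=1)            # skip the self-distance in column 0

threshold = np.percentile(density_score, 97)
flagged = np.where(density_score > threshold)[0]
print("flagged as low-density:", flagged)             # the outlying samples should dominate this set
```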

    Exploring real-time video interactivity with Scratch

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 40-41). Real-time video interactivity is becoming increasingly popular in today's world with the advent of better and more affordable video input devices. With the recent release of the Microsoft Kinect followed by an official Kinect SDK, there has been an explosion of activity around utilizing this now easily-accessible video sensor data. Many creative uses have surfaced including object recognition, gesture recognition, and more. The audience capable of taking full advantage of these video technologies continues to be a technical crowd, likely with a background in computer science. But what if such video technology were made accessible to a much more diverse crowd? This thesis presents a set of computer vision tools for exploration of the real-time video interactivity space in the context of Scratch (scratch.mit.edu), a graphical block-based programming language accessible to all ages. To decide what functionality to provide to Scratch users, various computer vision algorithms are tested, including object detection, object recognition, face recognition and optical flow. Ultimately, an optical flow implementation is realized and its generative abilities are observed through testing with different user groups. by Ting-Hsiang Tony Hwang. M.Eng.
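    A hedged sketch of the kind of real-time motion signal the thesis describes surfacing to Scratch users, using OpenCV's Farneback dense optical flow on consecutive webcam frames and reducing it to a single motion amount and direction; the camera index, parameters, and console output are illustrative, and this is not the thesis's actual Scratch integration.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # default webcam; index is an assumption
ok, frame = cap.read()
if not ok:
    raise RuntimeError("no camera frame available")
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow between the previous and current frame.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0].mean(), flow[..., 1].mean()
    print(f"motion amount={np.hypot(dx, dy):.2f} direction=({dx:+.2f}, {dy:+.2f})")
    cv2.imshow("camera", frame)
    prev_gray = gray
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press q to stop
        break

cap.release()
cv2.destroyAllWindows()
```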