An Investigation and Application of Biology and Bioinformatics for Activity Recognition
Activity recognition in a smart home context is inherently difficult due to the variable nature of human activities and tracking artifacts introduced by video-based tracking systems. This thesis addresses the activity recognition problem by introducing a biologically-inspired chemotactic approach and bioinformatics-inspired sequence alignment techniques to recognise spatial activities. The approaches are demonstrated in real-world conditions to improve robustness and recognise activities in the presence of innate activity variability and tracking noise.
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
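A minimal sketch of the idea described above, not the paper's actual method: estimate keystep transition probabilities from training sequences, then use them as a prior to regularize a per-frame classifier's scores in a new video. The smoothing, the greedy decoding, and the simple product-of-scores combination are all illustrative assumptions.

```python
def learn_transitions(sequences, n_steps, alpha=1.0):
    """Estimate P(next keystep | current keystep) with add-alpha smoothing."""
    counts = [[alpha] * n_steps for _ in range(n_steps)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Normalize each row into a probability distribution.
    return [[c / sum(row) for c in row] for row in counts]

def regularized_decode(frame_scores, trans):
    """Greedily pick a keystep per frame, weighting the classifier's
    scores by the transition probability from the previous choice."""
    n = len(frame_scores[0])
    path = [max(range(n), key=lambda k: frame_scores[0][k])]
    for scores in frame_scores[1:]:
        prev = path[-1]
        path.append(max(range(n), key=lambda k: scores[k] * trans[prev][k]))
    return path
```

With transitions learned from sequences like `[0, 1, 2]`, an ambiguous middle frame that the raw classifier would mislabel gets pulled toward the keystep that typically follows the previous one.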
Multimodal Generation of Novel Action Appearances for Synthetic-to-Real Recognition of Activities of Daily Living
Domain shifts, such as appearance changes, are a key challenge in real-world applications of activity recognition models, which range from assistive robotics and smart homes to driver observation in intelligent vehicles. For example, while simulations are an excellent way of economical data collection, a Synthetic-to-Real domain shift leads to a > 60% drop in accuracy when recognizing Activities of Daily Living (ADLs). We tackle this challenge and introduce an activity domain generation framework which creates novel ADL appearances (novel domains) from different existing activity modalities (source domains) inferred from video training data. Our framework computes human poses, heatmaps of body joints, and optical flow maps and uses them alongside the original RGB videos to learn the essence of source domains in order to generate completely new ADL domains. The model is optimized by maximizing the distance between the existing source appearances and the generated novel appearances, while ensuring that the semantics of an activity are preserved through an additional classification loss. While source-data multimodality is an important concept in this design, our setup does not rely on multi-sensor setups (i.e., all source modalities are inferred from a single video only). The newly created activity domains are then integrated in the training of the ADL classification networks, resulting in models far less susceptible to changes in data distributions. Extensive experiments on the Synthetic-to-Real benchmark Sims4Action demonstrate the potential of the domain generation paradigm for cross-domain ADL recognition, setting new state-of-the-art results. Our code is publicly available at https://github.com/Zrrr1997/syn2real_DG
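The training objective described above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the Euclidean distance, the cross-entropy weighting, and all names are assumptions; the point is only the sign structure, where minimizing the loss pushes generated features away from source domains while a classification term keeps the activity label recoverable.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def domain_generation_loss(gen_feat, src_feats, logits, label, weight=1.0):
    """Loss = -(mean distance to source features) + weighted cross-entropy.

    The negative distance term rewards novelty relative to existing
    source appearances; the cross-entropy term preserves semantics."""
    dist = np.mean([np.linalg.norm(gen_feat - s) for s in src_feats])
    ce = -np.log(softmax(logits)[label])
    return -dist + weight * ce
```

A generated sample that is farther from the sources (with the same confident classification) yields a lower loss, which is exactly the trade-off the abstract describes.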
Qualitative and quantitative spatio-temporal relations in daily living activity recognition
For the effective operation of intelligent assistive systems working in real-world human environments, it is important to be able to recognise human activities and their intentions. In this paper we propose a novel approach to activity recognition from visual data. Our approach is based on qualitative and quantitative spatio-temporal features which encode the interactions between human subjects and objects in an efficient manner. Unlike the state of the art, our approach uses significantly fewer assumptions and does not require knowledge about object types, their affordances, or the sub-level activities that high-level activities consist of. We perform an automatic feature selection process which provides the most representative descriptions of the learnt activities. We validated the method using these descriptions on the CAD-120 benchmark dataset, consisting of video sequences showing humans performing daily real-world activities. The method is shown to outperform state-of-the-art benchmarks.
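A minimal sketch of the kind of qualitative spatio-temporal encoding the abstract describes: per-frame human-object distances are discretized into qualitative relations, and repeated relations are collapsed into a compact interaction sequence. The thresholds and relation names here are assumptions for illustration, not the paper's actual feature set.

```python
import math

def qualitative_relation(human, obj, near=1.0, far=3.0):
    """Map a quantitative distance to a qualitative relation."""
    d = math.dist(human, obj)
    if d < near:
        return "touching"
    if d < far:
        return "near"
    return "far"

def encode_track(human_track, obj_track):
    """Collapse per-frame relations into a sequence of distinct states,
    e.g. an 'approach object' interaction becomes far -> near -> touching."""
    seq = []
    for h, o in zip(human_track, obj_track):
        rel = qualitative_relation(h, o)
        if not seq or seq[-1] != rel:
            seq.append(rel)
    return seq
```

Such symbolic sequences are invariant to exact trajectories, which is one reason qualitative features can need fewer assumptions than object-affordance models.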
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing distracting activities in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of CLIP's visual representation for the distracted driving detection and classification task and report the results.
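The zero-shot transfer mentioned above can be sketched as follows. In real usage the embeddings would come from an actual CLIP image and text encoder; here plain vectors stand in for those outputs, and the prompt strings are illustrative, not the paper's prompts.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the text prompt whose embedding is most cosine-similar to
    the image embedding -- no task-specific training data needed."""
    img = image_emb / np.linalg.norm(image_emb)
    txts = [t / np.linalg.norm(t) for t in text_embs]
    sims = [float(img @ t) for t in txts]
    return labels[int(np.argmax(sims))], sims
```

Fine-tuning, by contrast, would update the encoders (or a head on top of them) on labeled driving frames; the similarity-based decision rule stays the same.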
REACT: Recognize Every Action Everywhere All At Once
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (\textbf{R}ecognize \textbf{E}very \textbf{Act}ion Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model, explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes and enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.
Data Efficient Learning: Towards Reducing Risk and Uncertainty of Data Driven Learning Paradigm
The success of Deep Learning in various tasks is highly dependent on large amounts of domain-specific annotated data, which are expensive to acquire and may contain varying degrees of noise. In this doctoral journey, our research goal is first to identify and then tackle the data issues that cause significant performance degradation in real-world applications of Deep Learning algorithms.
Human Activity Recognition from RGB data is challenging due to the lack of relative motion parameters. To address this issue, we propose a novel framework that introduces skeleton information derived from RGB data for activity recognition. Through experimentation, we demonstrate that our RGB-only solution surpasses state-of-the-art methods, all of which exploit RGB-D video streams, by a notable margin.
The predictive uncertainty of Deep Neural Networks (DNNs) makes them unreliable for real-world deployment. Moreover, available labeled data may contain noise. We aim to address these two issues holistically by proposing a unified density-driven framework, which can effectively denoise training data as well as avoid predicting on uncertain test data points. Our plug-and-play framework is easy to deploy in real-world applications while achieving superior performance over state-of-the-art techniques. To assess the effectiveness of our proposed framework in a real-world scenario, we experimented with X-ray images from COVID-19 patients.
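The density-driven idea above can be illustrated with a toy version: score how "typical" each point is under a simple Gaussian kernel density, drop low-density (likely noisy) training points, and abstain on low-density test points instead of predicting. The kernel choice, bandwidth, and thresholds are assumptions; the thesis's actual framework is more sophisticated.

```python
import numpy as np

def kde_scores(points, data, bandwidth=1.0):
    """Mean Gaussian-kernel affinity of each point to the dataset."""
    pts, d = np.asarray(points, float), np.asarray(data, float)
    sq = ((pts[:, None, :] - d[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2)).mean(axis=1)

def denoise(data, threshold):
    """Keep only training points that lie in sufficiently dense regions."""
    scores = kde_scores(data, data)
    return [x for x, s in zip(data, scores) if s >= threshold]

def predict_or_abstain(x, data, classify, threshold):
    """Predict only where the model has seen similar data; else abstain."""
    score = kde_scores([x], data)[0]
    return classify(x) if score >= threshold else None
```

The same density estimate serves both roles: filtering label noise at training time and flagging out-of-distribution inputs at test time.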
Supervised learning of DNNs inherits the limitation of a very narrow field of view in terms of known data distributions. Moreover, annotating data is costly. Hence, we explore self-supervised Siamese networks to avoid these constraints. Through extensive experimentation, we demonstrate that the self-supervised method performs surprisingly comparably to its supervised counterpart in a real-world use case. We also delve deeper with activation mapping and feature-distribution visualization to understand why this method works.
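A minimal sketch of the Siamese objective mentioned above, under the common contrastive-margin formulation: matching pairs are pulled together, non-matching pairs are pushed at least a margin apart. The specific loss and margin are assumptions; the thesis may use a different pairing or loss.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Squared distance for matching pairs; squared margin shortfall
    for non-matching pairs (zero once they are margin apart)."""
    d = np.linalg.norm(np.asarray(emb_a, float) - np.asarray(emb_b, float))
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Because the label is only "same source or not" (e.g. two augmented views of one clip), no human annotation is needed, which is the self-supervised appeal.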
Through our research, we achieve a better understanding of issues relating to data-driven learning while solving some of the core problems of this paradigm, and we expose some novel and intriguing research questions to the community.
Exploring real-time video interactivity with Scratch
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 40-41).
Real-time video interactivity is becoming increasingly popular in today's world with the advent of better and more affordable video input devices. With the recent release of the Microsoft Kinect followed by an official Kinect SDK, there has been an explosion of activity around utilizing this now easily-accessible video sensor data. Many creative uses have surfaced, including object recognition, gesture recognition, and more. The audience capable of taking full advantage of these video technologies continues to be a technical crowd, likely with a background in computer science. But what if such video technology were made accessible to a much more diverse crowd? This thesis presents a set of computer vision tools for exploration of the real-time video interactivity space in the context of Scratch (scratch.mit.edu), a graphical block-based programming language accessible to all ages. To decide what functionality to provide to Scratch users, various computer vision algorithms are tested, including object detection, object recognition, face recognition, and optical flow. Ultimately, an optical flow implementation is realized and its generative abilities are observed through testing with different user groups.
By Ting-Hsiang Tony Hwang. M.Eng.