Human Activity Recognition with Pose-driven Attention to RGB
We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of pre-defined locations specified by the pose stream, namely the four hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally, a temporal attention mechanism learns how to fuse LSTM features over time. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D.
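The final step of this pipeline, fusing per-timestep LSTM features with a learned temporal attention, can be sketched as softmax-weighted pooling over time. The shapes and the fixed scoring vector below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def temporal_attention_pool(features, w):
    """Fuse per-timestep features into one clip descriptor.

    features: (T, D) array of per-timestep LSTM outputs (hypothetical shapes).
    w: (D,) scoring vector; learned in practice, fixed here for illustration.
    """
    scores = features @ w                           # (T,) relevance per timestep
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    return alpha @ features                         # (D,) attention-weighted sum

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))   # 8 timesteps, 16-dim LSTM features
pooled = temporal_attention_pool(feats, rng.standard_normal(16))
print(pooled.shape)  # (16,)
```

With a zero scoring vector the weights are uniform and the pooling reduces to a temporal average, which makes the mechanism easy to sanity-check.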
Actor-Transformers for Group Activity Recognition
This paper strives to recognize individual actions and group activities from
videos. While existing solutions for this challenging problem explicitly model
spatial and temporal relationships based on location of individual actors, we
propose an actor-transformer model able to learn and selectively extract
information relevant for group activity recognition. We feed the transformer
with rich actor-specific static and dynamic representations expressed by
features from a 2D pose network and 3D CNN, respectively. We empirically study
different ways to combine these representations and show their complementary
benefits. Experiments show what is important to transform and how it should be
transformed. Moreover, actor-transformers achieve state-of-the-art results
on two publicly available benchmarks for group activity recognition,
outperforming the previous best published results by a considerable margin.
Comment: CVPR 2020
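The core operation of an actor-transformer, letting each actor's embedding attend to every other actor's, can be sketched as scaled dot-product self-attention. Identity Q/K/V projections and random embeddings stand in for the learned projections and the pose/3D-CNN features used in the paper.

```python
import numpy as np

def actor_self_attention(actors):
    """Scaled dot-product self-attention over actor embeddings.

    actors: (N, D) array, one row per detected actor (in the paper these
    would be pose-network and 3D-CNN features; random here for illustration).
    """
    n, d = actors.shape
    scores = actors @ actors.T / np.sqrt(d)         # (N, N) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over actors
    refined = attn @ actors                         # each actor mixes in others
    return refined, attn

actors = np.random.default_rng(1).standard_normal((5, 8))  # 5 actors, 8-dim
refined, attn = actor_self_attention(actors)
print(refined.shape)  # (5, 8)
```

Each row of `attn` sums to one, so every refined actor embedding is a convex combination of all actor embeddings, which is what lets the model selectively extract group-level context.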
Video Action Transformer Network
We introduce the Action Transformer model for recognizing and localizing
human actions in video clips. We repurpose a Transformer-style architecture to
aggregate features from the spatiotemporal context around the person whose
actions we are trying to classify. We show that by using high-resolution,
person-specific, class-agnostic queries, the model spontaneously learns to
track individual people and to pick up on semantic context from the actions of
others. Additionally, its attention mechanism learns to emphasize hands and
faces, which are often crucial to discriminate an action, all without explicit
supervision other than boxes and class labels. We train and test our Action
Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming
the state-of-the-art by a significant margin using only raw RGB frames as
input.
Comment: CVPR 2019
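The person-specific query described above can be sketched as single-head cross-attention: one query vector for the person being classified reads from a bank of spatiotemporal context features. Shapes and names are illustrative assumptions, not the model's actual layers.

```python
import numpy as np

def person_query_attention(query, context):
    """Cross-attention: a person-specific query summarizes its context.

    query: (D,) embedding of the person box being classified (hypothetical).
    context: (M, D) features from surrounding spatiotemporal locations.
    """
    d = query.shape[0]
    scores = context @ query / np.sqrt(d)   # (M,) relevance of each location
    scores -= scores.max()                  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()                    # softmax over context locations
    return alpha @ context                  # (D,) context summary for this person

rng = np.random.default_rng(2)
summary = person_query_attention(rng.standard_normal(8),
                                 rng.standard_normal((20, 8)))
print(summary.shape)  # (8,)
```

Because the query is class-agnostic, nothing in this computation is tied to a particular action label; the abstract's point is that tracking and context selection emerge from training alone.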
Toyota Smarthome: Real-World Activities of Daily Living
The performance of deep neural networks is strongly influenced by the quantity and quality of annotated data. Most of the large activity recognition datasets consist of data sourced from the web, which does not reflect challenges that exist in activities of daily living. In this paper, we introduce a large real-world video dataset for activities of daily living: Toyota Smarthome. The dataset consists of 16K RGB+D clips of 31 activity classes, performed by seniors in a smarthome. Unlike previous datasets, videos were fully unscripted. As a result, the dataset poses several challenges: high intra-class variation, high class imbalance, simple and composite activities, and activities with similar motion and variable duration. Activities were annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome from other datasets for activity recognition. As recent activity recognition approaches fail to address the challenges posed by Toyota Smarthome, we present a novel activity recognition method with attention mechanism. We propose a pose-driven spatio-temporal attention mechanism through 3D ConvNets. We show that our novel method outperforms state-of-the-art methods on benchmark datasets, as well as on the Toyota Smarthome dataset. We release the dataset for research use.
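The pose-driven spatial attention over 3D ConvNet features can be sketched as follows: a pose embedding is projected into the feature space and used to score each spatial location of a feature map. The function names, the projection matrix, and all shapes are hypothetical; this only illustrates the general conditioning pattern, not the paper's architecture.

```python
import numpy as np

def pose_conditioned_spatial_attention(fmap, pose_vec, proj):
    """Pool a ConvNet feature map with a pose-conditioned attention map.

    fmap: (H, W, C) feature map from one time slice of a 3D ConvNet (assumed).
    pose_vec: (P,) pose embedding; proj: (P, C) hypothetical projection.
    Returns a (C,) pooled descriptor and the (H, W) attention map.
    """
    h, w, c = fmap.shape
    key = pose_vec @ proj                       # (C,) pose-conditioned key
    scores = fmap.reshape(-1, c) @ key          # (H*W,) per-location relevance
    scores -= scores.max()                      # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()                        # softmax over spatial locations
    attended = alpha @ fmap.reshape(-1, c)      # (C,) attention-weighted pool
    return attended, alpha.reshape(h, w)
```

A zero pose key yields a uniform attention map, so the mechanism degrades gracefully to average pooling when the pose provides no signal.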
Practical and Rich User Digitization
A long-standing vision in computer science has been to evolve computing
devices into proactive assistants that enhance our productivity, health and
wellness, and many other facets of our lives. User digitization is crucial in
achieving this vision as it allows computers to intimately understand their
users, capturing activity, pose, routine, and behavior. Today's consumer
devices, like smartphones and smartwatches, provide a glimpse of this
potential, offering coarse digital representations of users with metrics such
as step count, heart rate, and a handful of human activities like running and
biking. Even these very low-dimensional representations are already bringing
value to millions of people's lives, but there is significant potential for
improvement. At the other end, professional, high-fidelity, comprehensive user
digitization systems exist: motion capture suits and multi-camera rigs
digitize our full body and appearance, and scanning machines such as MRI
capture our detailed anatomy. However, these carry significant user
practicality burdens, such as financial, privacy, ergonomic, aesthetic, and
instrumentation considerations, that preclude consumer use. In general, the
higher the fidelity of capture, the lower the user's practicality. Most
conventional approaches strike a balance between user practicality and
digitization fidelity.
My research aims to break this trend, developing sensing systems that
increase user digitization fidelity to create new and powerful computing
experiences while retaining or even improving user practicality and
accessibility, allowing such technologies to have a societal impact. Armed with
such knowledge, our future devices could offer longitudinal health tracking,
more productive work environments, full body avatars in extended reality, and
embodied telepresence experiences, to name just a few domains.
Comment: PhD thesis