STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
We study the problem of human action recognition using motion capture (MoCap)
sequences. Unlike existing techniques that take multiple manual steps to derive
standardized skeleton representations as model input, we propose a novel
Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The model uses a hierarchical transformer with intra-frame offset attention
and inter-frame self-attention. The attention mechanism allows the model to
freely attend between any two vertex patches to learn non-local relationships
in the spatial-temporal domain. Masked vertex modeling and future frame
prediction are used as two self-supervised tasks to fully activate the
bi-directional and auto-regressive attention in our hierarchical transformer.
The proposed method achieves state-of-the-art performance compared to
skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is
available at https://github.com/zgzxy001/STMT. Comment: CVPR 2023.
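To make the two-level attention pattern concrete, here is a minimal, hypothetical PyTorch sketch: vertex patches of the same frame attend to each other (intra-frame), then corresponding patches attend across frames (inter-frame). Plain self-attention stands in for the paper's offset attention, and all module names and dimensions are illustrative, not taken from the released STMT code.

```python
# Hypothetical sketch of the two-level attention pattern described above.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -- one token per vertex patch.
        b, t, p, d = x.shape
        # Intra-frame: attend among patches of the same frame.
        s = x.reshape(b * t, p, d)
        s = s + self.intra(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        # Inter-frame: attend across frames for the same patch index.
        v = s.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        v = v + self.inter(self.norm2(v), self.norm2(v), self.norm2(v))[0]
        return v.reshape(b, p, t, d).permute(0, 2, 1, 3)


tokens = torch.randn(2, 8, 64, 256)          # 8 frames, 64 vertex patches
print(SpatialTemporalBlock()(tokens).shape)  # torch.Size([2, 8, 64, 256])
```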
3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information
This work describes an end-to-end approach for real-time human action recognition from
raw depth image-sequences. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from raw
depth sequences. The described 3D-CNN allows action classification from the spatially and temporally encoded information of depth sequences. The use of depth data ensures that action recognition is carried out while protecting people's privacy, since their identities cannot be recognized from these data. The proposed 3DFCNN has been optimized to reach good accuracy while working in real time. It has been evaluated and compared with other state-of-the-art systems on three widely used public datasets with different characteristics, demonstrating that 3DFCNN outperforms all the non-DNN-based state-of-the-art methods with a maximum accuracy of 83.6% and obtains results comparable to the DNN-based approaches, while maintaining a much lower computational cost of 1.09 seconds, which significantly increases its applicability in real-world environments.
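As a rough illustration of the fully convolutional idea, the sketch below classifies a raw depth clip with stacked 3D convolutions and a convolutional head instead of dense layers; the layer widths and input size are assumptions, not the published 3DFCNN configuration.

```python
# Minimal sketch of a 3D fully convolutional classifier over raw depth clips.
import torch
import torch.nn as nn


class Depth3DFCN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # halves time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fully convolutional head: global pooling + 1x1x1 conv, no dense layer.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(64, num_classes, kernel_size=1),
            nn.Flatten(),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, frames, height, width) raw depth values
        return self.head(self.features(clip))


logits = Depth3DFCN()(torch.randn(2, 1, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```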
On the Benefits of 3D Pose and Tracking for Human Action Recognition
In this work we study the benefits of using tracking and 3D poses for action
recognition. To achieve this, we take the Lagrangian view on analysing actions
over a trajectory of human motion rather than at a fixed point in space. Taking
this stand allows us to use the tracklets of people to predict their actions.
In this spirit, first we show the benefits of using 3D pose to infer actions,
and study person-person interactions. Subsequently, we propose a Lagrangian
Action Recognition model by fusing 3D pose and contextualized appearance over
tracklets. Our method achieves state-of-the-art performance on the AVA v2.2 dataset on both pose-only and standard benchmark settings.
When reasoning about the action using only pose cues, our pose model achieves
+10.0 mAP gain over the corresponding state-of-the-art while our fused model
has a gain of +2.8 mAP over the best state-of-the-art model. Code and results
are available at: https://brjathu.github.io/LART. Comment: CVPR 2023 (project page: https://brjathu.github.io/LART).
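The sketch below illustrates one simple way the fusion described above could look: per-frame 3D pose and appearance features of a tracklet are concatenated, passed through a small temporal transformer, and pooled into a single action prediction. The dimensions and the concatenation-based fusion are assumptions for illustration, not the LART architecture.

```python
# Illustrative fusion of per-frame 3D pose and appearance along a tracklet.
import torch
import torch.nn as nn


class TrackletActionHead(nn.Module):
    def __init__(self, pose_dim=135, app_dim=256, dim=256, num_actions=80):
        super().__init__()
        self.embed = nn.Linear(pose_dim + app_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.classify = nn.Linear(dim, num_actions)

    def forward(self, pose, appearance):
        # pose: (batch, frames, pose_dim), appearance: (batch, frames, app_dim)
        tokens = self.embed(torch.cat([pose, appearance], dim=-1))
        tokens = self.temporal(tokens)            # attend across the tracklet
        return self.classify(tokens.mean(dim=1))  # one prediction per tracklet


logits = TrackletActionHead()(torch.randn(2, 12, 135), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 80])
```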
Self-Supervised Object-in-Gripper Segmentation from Robotic Motions
Accurate object segmentation is a crucial task in the context of robotic
manipulation. However, creating sufficient annotated training data for neural
networks is particularly time-consuming and often requires manual labeling. To
this end, we propose a simple, yet robust solution for learning to segment
unknown objects grasped by a robot. Specifically, we exploit motion and
temporal cues in RGB video sequences. Using optical flow estimation we first
learn to predict segmentation masks of our given manipulator. Then, these
annotations are used in combination with motion cues to automatically
distinguish between background, manipulator and unknown, grasped object. In
contrast to existing systems, our approach is fully self-supervised and
independent of precise camera calibration, 3D models or potentially imperfect
depth data. We perform a thorough comparison with alternative baselines and
approaches from literature. The object masks and views are shown to be suitable
training data for segmentation networks that generalize to novel environments
and also allow for watertight 3D reconstruction. Comment: 15 pages, 11 figures. Video: https://www.youtube.com/watch?v=srEwuuIIgz
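The toy labelling rule below conveys the spirit of the separation described above: pixels that move but do not belong to the predicted manipulator mask are assigned to the grasped object. The flow threshold and the rule itself are simplifications for illustration, not the paper's pipeline.

```python
# Simplified three-way labelling from optical flow and a manipulator mask.
import numpy as np

BACKGROUND, MANIPULATOR, OBJECT = 0, 1, 2


def label_frame(flow: np.ndarray, manipulator_mask: np.ndarray,
                motion_thresh: float = 1.0) -> np.ndarray:
    """flow: (H, W, 2) optical flow; manipulator_mask: (H, W) bool."""
    moving = np.linalg.norm(flow, axis=-1) > motion_thresh
    labels = np.full(flow.shape[:2], BACKGROUND, dtype=np.uint8)
    labels[moving] = OBJECT                 # moving but not the arm itself
    labels[manipulator_mask] = MANIPULATOR  # arm overrides the object label
    return labels


flow = np.zeros((4, 6, 2)); flow[1:3, 2:5] = 3.0   # a small moving region
arm = np.zeros((4, 6), dtype=bool); arm[1, 2] = True
print(label_frame(flow, arm))
```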
Capturing Hand-Object Interaction and Reconstruction of Manipulated Objects
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a priori knowledge of the object's shape and skeleton. In the case of unknown object shape, there are existing 3D reconstruction methods that capitalize on distinctive geometric or texture features. These methods, though, fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects, and we fuse the rich additional information of hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically using RGB-D data. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
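A toy sketch of the "single objective function" idea: several energy terms (data fit, salient-point correspondence, a collision-style penalty) are combined into one weighted cost and minimized with a standard optimizer. The terms and weights below are placeholders for illustration, not the formulation used in this work.

```python
# Toy weighted multi-term objective minimized with a standard optimizer.
import numpy as np
from scipy.optimize import minimize


def total_cost(pose, terms, weights):
    return sum(w * term(pose) for term, w in zip(terms, weights))


# Placeholder terms over a dummy 10-DoF pose vector.
terms = [
    lambda p: np.sum(p ** 2),                     # data / model fit
    lambda p: np.sum((p - 0.1) ** 2),             # salient-point correspondence
    lambda p: np.sum(np.maximum(0.0, -p) ** 2),   # collision-style penalty
]
weights = [1.0, 0.5, 10.0]

result = minimize(total_cost, x0=np.zeros(10), args=(terms, weights))
print(result.x.round(3))
```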
Gesture retrieval and its application to the study of multimodal communication
Comprehending communication is dependent on analyzing the different modalities of conversation, including audio, visual, and others. This is a natural process for humans, but in digital libraries, where preservation and dissemination of digital information are crucial, it is a complex task. A rich conversational model, encompassing all modalities and their co-occurrences, is required to effectively analyze and interact with digital information. Currently, the analysis of co-speech gestures in videos is done through manual annotation by linguistic experts based on textual searches. However, this approach is limited and does not fully utilize the visual modality of gestures. This paper proposes a visual gesture retrieval method using a deep learning architecture to extend current research in this area. The method is based on body keypoints and uses an attention mechanism to focus on specific groups. Experiments were conducted on a subset of the NewsScape dataset, which presents challenges such as multiple people, camera perspective changes, and occlusions. A user study was conducted to assess the usability of the results, establishing a baseline for future gesture retrieval methods in real-world video collections. The results of the experiment demonstrate the high potential of the proposed method in multimodal communication research and highlight the significance of visual gesture retrieval in enhancing interaction with video content. The integration of visual similarity search for gestures in the open-source multimedia retrieval stack, vitrivr, can greatly contribute to the field of computational linguistics. This research advances the understanding of the role of the visual modality in co-speech gestures and highlights the need for further development in this area
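A hypothetical sketch of the keypoint-and-attention idea: 2D keypoints are grouped (hands, arms, and so on), each group is embedded, attention mixes the groups, and the pooled vector is normalized so gestures can be retrieved by cosine similarity. Group sizes, dimensions and the pooling scheme are assumptions, not the implementation integrated into vitrivr.

```python
# Illustrative keypoint-group embedding for gesture similarity search.
import torch
import torch.nn as nn


class GestureEmbedder(nn.Module):
    def __init__(self, groups=4, joints_per_group=8, dim=128):
        super().__init__()
        self.proj = nn.Linear(joints_per_group * 2, dim)  # 2D keypoints
        self.attend = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, keypoints):
        # keypoints: (batch, groups, joints_per_group, 2)
        b, g, j, _ = keypoints.shape
        tokens = self.proj(keypoints.reshape(b, g, j * 2))
        fused, _ = self.attend(tokens, tokens, tokens)   # focus across groups
        emb = fused.mean(dim=1)
        return nn.functional.normalize(emb, dim=-1)      # unit length for cosine search


query = GestureEmbedder()(torch.randn(1, 4, 8, 2))
print(query.shape)  # torch.Size([1, 128])
```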
EgoHumans: An Egocentric 3D Multi-Human Benchmark
We present EgoHumans, a new multi-view multi-human video benchmark to advance
the state-of-the-art of egocentric human 3D pose estimation and tracking.
Existing egocentric benchmarks capture either a single subject or indoor-only scenarios, which limits the generalization of computer vision algorithms for
real-world applications. We propose a novel 3D capture setup to construct a
comprehensive egocentric multi-human benchmark in the wild with annotations to
support diverse tasks such as human detection, tracking, 2D/3D pose estimation,
and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses
for the egocentric view, which enables us to capture dynamic activities like
playing tennis, fencing, volleyball, etc. Furthermore, our multi-view setup
generates accurate 3D ground truth even under severe or complete occlusion. The
dataset consists of more than 125k egocentric images, spanning diverse scenes
with a particular focus on challenging and unchoreographed multi-human
activities and fast-moving egocentric views. We rigorously evaluate existing
state-of-the-art methods and highlight their limitations in the egocentric
scenario, specifically on multi-human tracking. To address such limitations, we
propose EgoFormer, a novel approach with a multi-stream transformer
architecture and explicit 3D spatial reasoning to estimate and track the human
pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the
EgoHumans dataset. Comment: Accepted to ICCV 2023 (Oral).
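As a generic illustration of what explicit 3D spatial reasoning can buy for multi-human tracking, the sketch below associates per-frame 3D pose detections with existing tracks by torso-centre distance using a standard assignment step. It is a baseline-style example under assumed inputs, not the EgoFormer architecture.

```python
# Generic 3D track-detection association by torso-centre distance.
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(track_centres: np.ndarray, det_centres: np.ndarray,
              max_dist: float = 0.5):
    """Both inputs are (N, 3) / (M, 3) arrays of 3D torso centres in metres."""
    cost = np.linalg.norm(track_centres[:, None] - det_centres[None], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]


tracks = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 3.0]])
dets = np.array([[0.05, 0.0, 2.1], [2.0, 0.0, 5.0]])
print(associate(tracks, dets))  # [(0, 0)] -- the far detection is left unmatched
```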
Visual Dynamics Models for Robotic Planning and Control
For a robot to interact with its environment, it must perceive the world and understand how the world evolves as a consequence of its actions. This thesis studies a few methods that a robot can use to respond to its observations, with a focus on instances that can leverage visual dynamics models. In general, these are models of how the visual observations of a robot evolve as a consequence of its actions. This could be in the form of predictive models that directly predict the future in the space of image pixels, in the space of visual features extracted from these images, or in the space of compact learned latent representations. The three instances that this thesis studies are in the context of visual servoing, visual planning, and representation learning for reinforcement learning. In the first case, we combine learned visual features with learned single-step predictive dynamics models and reinforcement learning to learn visual servoing mechanisms. In the second case, we use a deterministic multi-step video prediction model to achieve various manipulation tasks through visual planning. In addition, we show that conventional video prediction models are unequipped to model uncertainty and multiple futures, which could limit the planning capabilities of the robot. To address this, we propose a stochastic video prediction model that is trained with a combination of variational losses, adversarial losses, and perceptual losses, and show that this model can predict futures that are more realistic, diverse, and accurate. Unlike the first two cases, in which the dynamics model is used to make predictions for decision-making, the third case learns the model solely for representation learning. We train a stochastic sequential latent variable model to obtain a latent representation, and then use it as an intermediate representation for reinforcement learning. We show that this approach improves final performance and sample efficiency.
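The following sketch shows how the loss terms mentioned above (reconstruction, variational KL, adversarial, perceptual) might be combined into a single training objective for a stochastic video prediction model; the weights and the stand-in feature and discriminator outputs are placeholders, not the thesis configuration.

```python
# Schematic combination of reconstruction, KL, adversarial and perceptual losses.
import torch
import torch.nn.functional as F


def prediction_loss(pred, target, mu, logvar, disc_score_fake,
                    feat_pred, feat_target, w_kl=1e-3, w_adv=1e-2, w_perc=1.0):
    recon = F.l1_loss(pred, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Non-saturating generator loss: push the discriminator score towards "real".
    adv = F.binary_cross_entropy_with_logits(
        disc_score_fake, torch.ones_like(disc_score_fake))
    perceptual = F.mse_loss(feat_pred, feat_target)  # e.g. features of a fixed network
    return recon + w_kl * kl + w_adv * adv + w_perc * perceptual


loss = prediction_loss(
    torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
    torch.zeros(2, 8), torch.zeros(2, 8),
    torch.zeros(2, 1), torch.rand(2, 128), torch.rand(2, 128))
print(loss.item())
```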
A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Deep learning has the potential to revolutionize sports performance, with
applications ranging from perception and comprehension to decision. This paper
presents a comprehensive survey of deep learning in sports performance,
focusing on three main aspects: algorithms, datasets and virtual environments,
and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance, which includes perception, comprehension and decision, while comparing their strengths and weaknesses. Secondly, we list
widely used existing datasets in sports and highlight their characteristics and
limitations. Finally, we summarize current challenges and point out future
trends of deep learning in sports. Our survey provides valuable reference
material for researchers interested in deep learning in sports applications.