5 research outputs found
Leveraging over depth in egocentric activity recognition
Activity recognition from first-person videos is a growing research area. The increasing diffusion of egocentric sensors in various devices makes it timely to develop approaches able to recognize fine-grained first-person actions such as picking up, putting down, and pouring. While most previous work has focused on RGB data, some authors have pointed out the importance of leveraging depth information in this domain. In this paper we follow this trend and propose the first deep architecture that uses depth maps as an attention mechanism for first-person activity recognition. Specifically, we blend the RGB and depth data so as to obtain an enriched input for the network. This blending puts more or less emphasis on different parts of the image based on their distance from the observer, hence acting as an attention mechanism. To further strengthen the proposed activity recognition protocol, we opt for a self-labeling approach. This, combined with a Conv-LSTM block that extracts temporal information across frames, leads to a new state of the art on two publicly available benchmark databases. An ablation study completes our experimental findings, confirming the effectiveness of our approach.
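The depth-as-attention blending described above can be illustrated with a minimal sketch, assuming depth maps pixel-aligned with the RGB frames; the normalization, weighting function, and blending coefficient below are illustrative choices, not the authors' exact formulation.

    import numpy as np

    def depth_attention_blend(rgb, depth, alpha=0.5, eps=1e-6):
        # Normalize depth to [0, 1]; closer pixels get larger weights,
        # so nearby regions (typically hands and manipulated objects)
        # are emphasized.
        d = (depth - depth.min()) / (depth.max() - depth.min() + eps)
        weight = 1.0 - d
        attended = rgb * weight[..., None]    # per-pixel reweighting
        # Mix the attended image back with the original RGB frame.
        return alpha * rgb + (1.0 - alpha) * attended

    # Example: one 224x224 frame with a matching depth map
    # (random arrays stand in for real data here).
    rgb = np.random.rand(224, 224, 3).astype(np.float32)
    depth = np.random.rand(224, 224).astype(np.float32)
    blended = depth_attention_blend(rgb, depth)

In a pipeline like the one described, frames blended this way would then be fed to the temporal (Conv-LSTM) part of the network.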
Egocentric Scene Understanding via Multimodal Spatial Rectifier
In this paper, we study a problem of egocentric scene understanding, i.e.,
predicting depths and surface normals from an egocentric image. Egocentric
scene understanding poses unprecedented challenges: (1) due to large head
movements, the images are taken from non-canonical viewpoints (i.e., tilted
images) where existing models of geometry prediction do not apply; (2) dynamic
foreground objects including hands constitute a large proportion of visual
scenes. These challenges limit the performance of the existing models learned
from large indoor datasets, such as ScanNet and NYUv2, which comprise
predominantly upright images of static scenes. We present a multimodal spatial
rectifier that stabilizes the egocentric images to a set of reference
directions, which allows learning a coherent visual representation. Unlike a
unimodal spatial rectifier, which often produces excessive perspective warp for
egocentric images, the multimodal spatial rectifier learns from multiple
directions, minimizing the impact of the perspective warp. To learn
visual representations of the dynamic foreground objects, we present a new
dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that
comprises more than 500K synchronized RGBD frames and gravity directions.
Equipped with the multimodal spatial rectifier and the EDINA dataset, our
proposed method on single-view depth and surface normal estimation
significantly outperforms the baselines not only on our EDINA dataset, but also
on other popular egocentric datasets, such as First Person Hand Action (FPHA)
and EPIC-KITCHENS.
Comment: Appearing in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
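The core rectification step can be sketched as follows, assuming a known 3x3 camera intrinsics matrix K and a measured gravity direction (e.g., from an IMU): rotating the camera so gravity aligns with a reference direction corresponds to the standard rotation homography H = K R K^-1. This is a generic single-reference warp for illustration, not the paper's learned multimodal rectifier.

    import numpy as np
    import cv2

    def rectify_to_reference(image, K, gravity, reference=(0.0, 1.0, 0.0)):
        # Unit vectors for the measured gravity and the target direction.
        g = np.asarray(gravity, dtype=np.float64)
        g /= np.linalg.norm(g)
        r = np.asarray(reference, dtype=np.float64)
        r /= np.linalg.norm(r)
        # Rodrigues-style rotation taking g onto r (degenerate when g is
        # exactly opposite to r; that case is not handled in this sketch).
        v = np.cross(g, r)
        c = float(np.dot(g, r))
        vx = np.array([[0.0, -v[2], v[1]],
                       [v[2], 0.0, -v[0]],
                       [-v[1], v[0], 0.0]])
        R = np.eye(3) + vx + vx @ vx / (1.0 + c)
        # Rotation homography: warp the image as if the camera had rotated.
        H = K @ R @ np.linalg.inv(K)
        h, w = image.shape[:2]
        return cv2.warpPerspective(image, H, (w, h))

A multimodal rectifier, as the abstract describes, would instead choose among several reference directions per image, picking the one that keeps this warp small.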
Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition
Human gesture recognition has drawn much attention in computer vision.
However, recognition performance is often hampered by gesture-irrelevant
factors such as the background and the performers' clothing. Focusing on the
hand/arm regions is therefore important for gesture recognition. Meanwhile, an
adaptive, architecture-searched network structure can outperform block-fixed
ones such as ResNet, since it better increases the diversity of features
across the stages of the network. In this paper, we propose a regional
attention with architecture-rebuilt 3D network (RAAR3DNet) for gesture
recognition. We replace the fixed Inception modules with structures
automatically rebuilt via Neural Architecture Search (NAS), reflecting the
different shapes and representation abilities of features in the early,
middle, and late stages of the network. This enables the network to capture
feature representations at different levels more adaptively. We also design a
stackable regional attention module called Dynamic-Static Attention (DSA),
which derives a Gaussian guidance heatmap and a dynamic motion map to
highlight the hand/arm regions and the motion information in the spatial and
temporal domains, respectively. Extensive experiments on two recent
large-scale RGB-D gesture datasets validate the effectiveness of the proposed
method and show that it outperforms state-of-the-art methods. The code is
available at:
https://github.com/zhoubenjia/RAAR3DNet.
Comment: Accepted by AAAI 2021.
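The static half of a DSA-style regional attention can be sketched as a Gaussian heatmap built around hand/arm keypoints and used to reweight feature maps. The keypoint source, the sigma, and the residual weighting below are assumptions for illustration, not the published module.

    import numpy as np

    def gaussian_guidance(h, w, centers, sigma=8.0):
        # One Gaussian bump per hand/arm keypoint; the pixelwise max over
        # bumps forms the guidance heatmap. In practice, 'centers' would
        # come from a pose or hand detector.
        ys, xs = np.mgrid[0:h, 0:w]
        heat = np.zeros((h, w), dtype=np.float32)
        for cy, cx in centers:
            bump = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2)
                          / (2.0 * sigma ** 2))
            heat = np.maximum(heat, bump)
        return heat

    def apply_regional_attention(features, heat):
        # Residual reweighting of a C x H x W feature map: hand/arm
        # regions are amplified, other regions pass through unchanged.
        return features * (1.0 + heat[None, :, :])

    feat = np.random.rand(64, 56, 56).astype(np.float32)
    heat = gaussian_guidance(56, 56, centers=[(20, 30), (40, 15)])
    out = apply_regional_attention(feat, heat)

The dynamic half would analogously build a motion map from frame-to-frame differences, covering the temporal domain the abstract mentions.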
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios
ENIGMA-51 is a new egocentric dataset acquired in a real industrial domain by
19 subjects who followed instructions to complete the repair of electrical
boards using industrial tools (e.g., electric screwdriver) and electronic
instruments (e.g., oscilloscope). The 51 sequences are densely annotated with a
rich set of labels that enable the systematic study of human-object
interactions in the industrial domain. We provide benchmarks on four tasks
related to human-object interactions: 1) untrimmed action detection, 2)
egocentric human-object interaction detection, 3) short-term object interaction
anticipation and 4) natural language understanding of intents and entities.
Baseline results show that ENIGMA-51 is a challenging benchmark for studying
human-object interactions in industrial scenarios. We publicly release
the dataset at: https://iplab.dmi.unict.it/ENIGMA-51/