Unsupervised Understanding of Location and Illumination Changes in Egocentric Videos
Wearable cameras stand out as one of the most promising devices for the
upcoming years, and as a consequence, the demand for computer algorithms to
automatically understand the videos recorded with them is increasing quickly.
Automatic understanding of these videos is not an easy task, and their mobile
nature poses important challenges, such as changing light
conditions and the unrestricted locations recorded. This paper proposes an
unsupervised strategy based on global features and manifold learning to endow
wearable cameras with contextual information regarding the light conditions and
the location captured. Results show that non-linear manifold methods can
capture contextual patterns from global features without requiring large
computational resources. As an application case, the proposed strategy is used
as a switching mechanism to improve hand detection in egocentric
videos.
Comment: Submitted for publication
Egocentric Activity Recognition with Multimodal Fisher Vector
With the increasing availability of wearable devices, research on egocentric
activity recognition has received much attention recently. In this paper, we
build a Multimodal Egocentric Activity dataset which includes egocentric videos
and sensor data of 20 fine-grained and diverse activity categories. We present
a novel strategy to extract temporal trajectory-like features from sensor data.
We propose to apply the Fisher Kernel framework to fuse video and temporally
enhanced sensor features. Experimental results show that, with careful design of
the feature extraction and fusion algorithms, sensor data can enhance
information-rich video data. We make publicly available the Multimodal
Egocentric Activity dataset to facilitate future research.
Comment: 5 pages, 4 figures, ICASSP 2016 accepted
Multi-modal Egocentric Activity Recognition using Audio-Visual Features
Egocentric activity recognition in first-person videos has an increasing
importance with a variety of applications such as lifelogging, summarization,
assisted-living and activity tracking. Existing methods for this task are based
on interpretation of various sensor information using pre-determined weights
for each feature. In this work, we propose a new framework for the egocentric
activity recognition problem based on combining audio-visual features with
multi-kernel learning (MKL) and multi-kernel boosting (MKBoost). To that
end, grid optical-flow, virtual-inertia, log-covariance, and cuboid features
are first extracted from the video. The audio signal is characterized using a
"supervector", obtained based on Gaussian mixture modelling of frame-level
features, followed by a maximum a-posteriori adaptation. Then, the extracted
multi-modal features are adaptively fused by MKL classifiers in which both the
feature and kernel selection/weighting and recognition tasks are performed
together. The proposed framework was evaluated on a number of egocentric
datasets. The results showed that using multi-modal features with MKL
outperforms the existing methods.
Egocentric Scene Understanding via Multimodal Spatial Rectifier
In this paper, we study the problem of egocentric scene understanding, i.e.,
predicting depths and surface normals from an egocentric image. Egocentric
scene understanding poses unprecedented challenges: (1) due to large head
movements, the images are taken from non-canonical viewpoints (i.e., tilted
images) where existing models of geometry prediction do not apply; (2) dynamic
foreground objects including hands constitute a large proportion of visual
scenes. These challenges limit the performance of the existing models learned
from large indoor datasets, such as ScanNet and NYUv2, which comprise
predominantly upright images of static scenes. We present a multimodal spatial
rectifier that stabilizes the egocentric images to a set of reference
directions, which allows learning a coherent visual representation. Unlike
a unimodal spatial rectifier, which often produces excessive perspective warp for
egocentric images, the multimodal spatial rectifier learns from multiple
directions that can minimize the impact of the perspective warp. To learn
visual representations of the dynamic foreground objects, we present a new
dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that
comprises more than 500K synchronized RGBD frames and gravity directions.
Equipped with the multimodal spatial rectifier and the EDINA dataset, our
proposed method for single-view depth and surface normal estimation
significantly outperforms the baselines not only on our EDINA dataset, but also
on other popular egocentric datasets, such as First Person Hand Action (FPHA)
and EPIC-KITCHENS.
Comment: Appearing in the Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 202
MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain
Wearable cameras make it possible to acquire images and videos from the user's
perspective. These data can be processed to understand human behavior. Although
human behavior analysis has been thoroughly investigated in third person
vision, it is still understudied in egocentric settings and in particular in
industrial scenarios. To encourage research in this field, we present MECCANO,
a multimodal dataset of egocentric videos to study human behavior
understanding in industrial-like settings. The multimodality is characterized
by the presence of gaze signals, depth maps and RGB videos acquired
simultaneously with a custom headset. The dataset has been explicitly labeled
for fundamental tasks in the context of human behavior understanding from a
first person view, such as recognizing and anticipating human-object
interactions. With the MECCANO dataset, we explored five different tasks
including 1) Action Recognition, 2) Active Objects Detection and Recognition,
3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and
5) Next-Active Objects Detection. We propose a benchmark aimed at studying human
behavior in the considered industrial-like scenario which demonstrates that the
investigated tasks and the considered scenario are challenging for
state-of-the-art algorithms. To support research in this field, we publicly
release the dataset at https://iplab.dmi.unict.it/MECCANO/.
Comment: arXiv admin note: text overlap with arXiv:2010.0565
EGO-TOPO: Environment Affordances from Egocentric Video
First-person video naturally brings the use of a physical environment to the
forefront, since it shows the camera wearer interacting fluidly in a space
based on his intentions. However, current methods largely separate the observed
actions from the persistent space itself. We introduce a model for environment
affordances that is learned directly from egocentric video. The main idea is to
gain a human-centric model of a physical space (such as a kitchen) that
captures (1) the primary spatial zones of interaction and (2) the likely
activities they support. Our approach decomposes a space into a topological map
derived from first-person activity, organizing an ego-video into a series of
visits to the different zones. Further, we show how to link zones across
multiple related environments (e.g., from videos of multiple kitchens) to
obtain a consolidated representation of environment functionality. On
EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene
affordances and anticipating future actions in long-form video.
Comment: Published in CVPR 2020, project page: http://vision.cs.utexas.edu/projects/ego-topo