Identifying First-person Camera Wearers in Third-person Videos
We consider scenarios in which we wish to perform joint scene understanding,
object tracking, activity recognition, and other tasks in environments in which
multiple people are wearing body-worn cameras while a third-person static
camera also captures the scene. To do this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging
because a camera wearer is not visible in their own egocentric video,
preventing the use of direct feature matching. In this paper, we propose a new
semi-Siamese Convolutional Neural Network architecture to address this novel
challenge. We formulate the problem as learning a joint embedding space for
first- and third-person videos that considers both spatial- and motion-domain
cues. A new triplet loss function is designed to minimize the distance between
correct first- and third-person matches while maximizing the distance between
incorrect ones. This end-to-end approach performs significantly better than
several baselines, in part by learning first- and third-person features that
are optimized for matching jointly with the distance measure itself.
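The paper's exact formulation is not reproduced in the abstract, but the described objective matches a standard hinge-style triplet loss; a minimal sketch follows, where the `margin` value and Euclidean distance are illustrative assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the correct first-/third-person
    pair together and push the incorrect pair apart by at least `margin`.
    (margin=0.5 is an illustrative choice, not the paper's value.)"""
    d_pos = np.linalg.norm(anchor - positive)  # distance to correct match
    d_neg = np.linalg.norm(anchor - negative)  # distance to incorrect match
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this loss over embedded video pairs is what lets the network learn the features and the matching distance jointly, end to end.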
ECO: Egocentric Cognitive Mapping
We present a new method to localize a camera within a previously unseen
environment perceived from an egocentric point of view. Although this is, in
general, an ill-posed problem, humans can effortlessly and efficiently
determine their relative location and orientation and navigate in a
previously unseen environment, e.g., finding a specific item in a new grocery
store. To enable such a capability, we design a new egocentric representation,
which we call ECO (Egocentric COgnitive map). ECO is biologically inspired by
the cognitive map that enables human navigation, and it encodes the surrounding
visual semantics with respect to both distance and orientation. ECO possesses
three main properties: (1) reconfigurability: complex semantics and geometry are
captured via the synthesis of atomic visual representations (e.g., image
patch); (2) robustness: the visual semantics are registered in a geometrically
consistent way (e.g., aligning with respect to the gravity vector,
frontalizing, and rescaling to canonical depth), thus enabling us to learn
meaningful atomic representations; (3) adaptability: a domain adaptation
framework is designed to generalize the learned representation without manual
calibration. As a proof-of-concept, we use ECO to localize a camera within
real-world scenes---various grocery stores---and demonstrate performance
improvements over existing semantic localization approaches.
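The abstract does not detail how patches are registered; a minimal sketch of the kind of geometric normalization it names (gravity alignment and rescaling to canonical depth), with hypothetical names and units, might look like:

```python
import math

def register_patch(size_px, depth_m, roll_rad, canonical_depth_m=1.0):
    """Hypothetical sketch of ECO-style geometric registration of an
    atomic image patch. Function name, arguments, and units are
    illustrative assumptions, not the paper's API."""
    # A patch's apparent size scales inversely with depth, so bringing it
    # to canonical depth multiplies its size by depth / canonical_depth.
    canonical_size = size_px * depth_m / canonical_depth_m
    # Undo camera roll so the patch is registered w.r.t. the gravity vector.
    gravity_align_rad = -roll_rad
    return canonical_size, gravity_align_rad
```

Normalizing patches this way is what makes the atomic representations comparable across viewpoints, so that a learned matcher can generalize to new stores.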
An Egocentric Look at Video Photographer Identity
Egocentric cameras are being worn by an increasing number of users, among
them many security forces worldwide. GoPro cameras have already penetrated the
mass market, with substantial increases in sales reported every year. As head-worn
cameras do not capture the photographer, it may seem that the anonymity of the
photographer is preserved even when the video is publicly distributed.
We show that camera motion, as can be computed from the egocentric video,
provides unique identity information. The photographer can be reliably
recognized from a few seconds of video captured when walking. The proposed
method achieves more than 90% recognition accuracy in cases where the random
success rate is only 3%.
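The abstract does not specify the descriptor or classifier; one plausible shape of such a pipeline, sketched here with a hypothetical motion-angle histogram and nearest-neighbour matching (both illustrative assumptions), is:

```python
import numpy as np

def motion_signature(flows, n_bins=16):
    """Summarize a clip's camera motion as a histogram of frame-to-frame
    global motion directions. This descriptor is a hypothetical stand-in;
    the paper's exact features are not given in the abstract."""
    angles = np.arctan2(flows[:, 1], flows[:, 0])  # per-frame motion direction
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)  # normalize to a distribution

def identify(query_sig, gallery):
    """Nearest-neighbour match of a query signature against a gallery of
    per-photographer signatures (dict: name -> signature)."""
    names = list(gallery)
    dists = [np.linalg.norm(query_sig - gallery[n]) for n in names]
    return names[int(np.argmin(dists))]
```

The key point the sketch illustrates is that identity is inferred purely from camera motion statistics, with no appearance information about the photographer at all.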
Applications can include theft prevention by locking the camera when not worn
by its rightful owner. Searching video sharing services (e.g. YouTube) for
egocentric videos shot by a specific photographer may also become possible. An
important message in this paper is that photographers should be aware that
sharing egocentric video will compromise their anonymity, even when their face
is not visible.
Unsupervised Mapping and Semantic User Localisation from First-Person Monocular Video
We propose an unsupervised probabilistic framework for learning a human-centred representation of a person’s environment from first-person video. Specifically, non-geometric maps modelled as hierarchies of probabilistic place graphs and view graphs are learned. Place graphs model a user’s patterns of transition between physical locations whereas view graphs capture an aspect of user behaviour within those locations. Furthermore, we describe an implementation in which the notion of place is divided into stations and the routes that interconnect them. Stations typically correspond to rooms or areas where a user spends time. Visits to stations are temporally segmented based on qualitative visual motion. We describe how to learn maps online in an unsupervised manner, and how to localise the user within these maps. We report experiments on two datasets, including comparison of performance with and without view graphs, and demonstrate better online mapping than when using offline clustering.
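The abstract does not give the place-graph model itself; a minimal sketch of an online place graph, assuming nodes are stations and edge weights are observed transition counts (a simplification of the probabilistic hierarchy described), could look like:

```python
from collections import defaultdict

class PlaceGraph:
    """Hypothetical minimal online place graph: nodes are stations and
    edge weights count observed transitions between them. This is an
    illustrative sketch, not the paper's full probabilistic model."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, station):
        """Update the graph online as the user's current station changes."""
        if self.prev is not None and station != self.prev:
            self.counts[self.prev][station] += 1
        self.prev = station

    def transition_prob(self, src, dst):
        """Empirical probability of moving from `src` to `dst`."""
        total = sum(self.counts[src].values())
        return self.counts[src][dst] / total if total else 0.0
```

Because the graph is updated one observation at a time, it supports the online, unsupervised mapping and localisation regime the paper evaluates.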