Local features for view matching across independently moving cameras.
PhD Thesis. Moving platforms, such as wearable and robotic cameras, need to recognise the same place
observed from different viewpoints in order to collaboratively reconstruct a 3D scene and to support
augmented reality or autonomous navigation. However, matching views is challenging for
independently moving cameras that directly interact with each other, because severe geometric and
photometric differences, such as viewpoint, scale, and illumination changes, can considerably
decrease the matching performance. This thesis proposes novel, compact, local features that can
cope with scale and viewpoint variations. We extract and describe an image patch at different
scales of an image pyramid by comparing intensity values between learnt pixel pairs (binary
tests), and employ a cross-scale distance when matching these features. We capture, at multiple
scales, the temporal changes of a 3D point, as observed in the image sequence of a camera, by
tracking local binary descriptors. After validating the feature-point trajectories through 3D reconstruction,
we reduce, for each scale, the sequence of binary features to a compact, fixed-length
descriptor that identifies the most frequent and the most stable binary tests over time. We then
propose XC-PR, a cross-camera place recognition approach that stores locally, for each uncalibrated
camera, spatio-temporal descriptors, extracted at a single scale, in a tree that is selectively
updated, as the camera moves. Cameras exchange descriptors selected from previous frames
within an adaptive temporal window and with the highest number of local features corresponding
to the descriptors. The other camera locally searches and matches the received descriptors to
identify and geometrically validate a previously seen place. Experiments on different scenarios
show the improved matching accuracy of the joint multi-scale extraction and temporal reduction
through comparisons of different temporal reduction strategies, as well as the cross-camera
matching strategy based on Bag of Binary Words, and the application to several binary descriptors.
We also show that XC-PR achieves similar accuracy but is faster, on average, than a baseline
consisting of an incremental list of spatio-temporal descriptors. Moreover, XC-PR achieves
accuracy similar to that of a frame-based Bag of Binary Words method adapted to our approach, while
avoiding matching features that cannot be informative, e.g. for 3D reconstruction.
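The binary tests, cross-scale matching, and temporal reduction described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The function names are hypothetical, the cross-scale distance is assumed to be the smallest Hamming distance over all pairs of scales, and the temporal reduction is assumed to be a per-bit majority vote over a tracked feature; the abstract does not give either definition.

```python
def binary_descriptor(patch, pairs):
    """One bit per pixel pair: 1 if the first pixel is darker than the second."""
    return [1 if patch[r1][c1] < patch[r2][c2] else 0
            for (r1, c1), (r2, c2) in pairs]


def hamming(a, b):
    """Number of differing bits between two equal-length bit lists."""
    return sum(x != y for x, y in zip(a, b))


def cross_scale_distance(scales_a, scales_b):
    """ASSUMPTION: smallest Hamming distance over all pairs of pyramid scales."""
    return min(hamming(a, b) for a in scales_a for b in scales_b)


def temporal_reduce(track):
    """ASSUMPTION: keep, per binary test, its most frequent outcome over the track."""
    n = len(track)
    return [1 if 2 * sum(bits) >= n else 0 for bits in zip(*track)]


# Toy 2x2 patch and two hypothetical pixel pairs (in the thesis the pairs are learnt).
patch = [[1, 2],
         [3, 4]]
pairs = [((0, 0), (1, 1)), ((1, 0), (0, 1))]
print(binary_descriptor(patch, pairs))                             # [1, 0]

# One feature described at two scales, matched against a single-scale feature.
print(cross_scale_distance([[0, 0, 0], [1, 1, 1]], [[1, 1, 0]]))   # 1

# Three observations of the same tracked point reduced to one descriptor.
print(temporal_reduce([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))          # [1, 0, 1]
```

Taking the minimum over scale pairs lets two views of the same point match even when the point appears at different sizes in each camera, which is the situation the abstract describes.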
DART: Distribution Aware Retinal Transform for Event-based Cameras
We introduce a generic visual descriptor, termed the distribution aware
retinal transform (DART), that encodes the structural context using log-polar
grids for event cameras. The DART descriptor is applied to four different
problems, namely object classification, tracking, detection and feature
matching: (1) The DART features are directly employed as local descriptors in a
bag-of-features classification framework and testing is carried out on four
standard event-based object datasets (N-MNIST, MNIST-DVS, CIFAR10-DVS,
NCaltech-101). (2) Extending the classification system, tracking is
demonstrated using two key novelties: (i) For overcoming the low-sample problem
for the one-shot learning of a binary classifier, statistical bootstrapping is
leveraged with online learning; (ii) To achieve tracker robustness, the scale
and rotation equivariance property of the DART descriptors is exploited for the
one-shot learning. (3) To solve the long-term object tracking problem, an
object detector is designed using the principle of cluster majority voting. The
detection scheme is then combined with the tracker to yield a high
intersection-over-union score with augmented ground truth annotations on the
publicly available event camera dataset. (4) Finally, the event context encoded
by DART greatly simplifies the feature correspondence problem, especially for
spatio-temporal slices far apart in time, which has not been explicitly tackled
in the event-based vision domain.
Comment: 12 pages, revision submitted to TPAMI in Nov 201
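To make the log-polar grid idea concrete, the sketch below bins event locations around a centre into rings and wedges. It is only an illustration, not the DART descriptor itself: the function name, the ring spacing via log(1 + r), and the parameters `n_rings`, `n_wedges`, and `r_max` are all assumptions, since the abstract does not give them.

```python
import math


def log_polar_descriptor(events, center, n_rings=4, n_wedges=8, r_max=16.0):
    """Normalised histogram of event positions over a log-polar grid.

    ASSUMPTIONS: rings are spaced by log(1 + r), events farther than r_max
    (or exactly at the centre) are ignored, and event polarity and timestamps
    are not used.
    """
    cx, cy = center
    hist = [0.0] * (n_rings * n_wedges)
    for x, y in events:
        dx, dy = x - cx, y - cy
        r = math.hypot(dx, dy)
        if r == 0 or r > r_max:
            continue
        # Ring index grows with the logarithm of the radius.
        ring = min(n_rings - 1,
                   int(n_rings * math.log(1 + r) / math.log(1 + r_max)))
        # Wedge index from the angle, mapped into [0, 2*pi).
        theta = math.atan2(dy, dx) % (2 * math.pi)
        wedge = min(n_wedges - 1, int(theta / (2 * math.pi) * n_wedges))
        hist[ring * n_wedges + wedge] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Two hypothetical events around the origin.
desc = log_polar_descriptor([(2, 1), (1, -10)], center=(0, 0))
print([i for i, v in enumerate(desc) if v > 0])
```

On such a grid, a rotation of the events about the centre moves mass between wedges and a change of scale moves it between rings, which is the general reason log-polar layouts are associated with the scale and rotation equivariance mentioned above.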
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework that highlights the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and, thus, the constraints imposed on the type of video
that each technique is able to address. Making the hypotheses and
constraints explicit makes the framework particularly useful for selecting a method, given
an application. Another advantage of the proposed organization is that it
allows the newest approaches to be categorized seamlessly alongside traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion at the end of
the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
table
Early Recognition of Human Activities from First-Person Videos Using Onset Representations
In this paper, we propose a methodology for early recognition of human
activities from videos taken with a first-person viewpoint. Early recognition,
which is also known as activity prediction, is the ability to infer an ongoing
activity at its early stage. We present an algorithm to perform recognition of
activities targeted at the camera from streaming videos, enabling the system to
predict intended activities of the interacting person and avoid harmful events
before they actually happen. We introduce the novel concept of 'onset' that
efficiently summarizes pre-activity observations, and design an approach to
consider event history in addition to ongoing video observation for early
first-person recognition of activities. We propose to represent onset using
cascade histograms of time series gradients, and we describe a novel
algorithmic setup to take advantage of onset for early recognition of
activities. The experimental results clearly illustrate that the proposed
concept of onset enables better/earlier recognition of human activities from
first-person videos.
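One possible reading of "cascade histograms of time series gradients" is sketched below. It is an assumption-laden illustration rather than the paper's method: gradients are taken as first differences of a one-dimensional time series, each histogram simply counts rising and falling steps, and the cascade is assumed to be a temporal pyramid that splits the gradient sequence into 1, 2, 4, ... equal parts. The function names and the `levels` parameter are hypothetical.

```python
def gradient_histogram(grads):
    """Two-bin histogram: number of rising and falling steps."""
    pos = sum(1 for g in grads if g > 0)
    neg = sum(1 for g in grads if g < 0)
    return [pos, neg]


def cascade_histogram(series, levels=2):
    """ASSUMPTION: concatenate gradient histograms over a temporal pyramid.

    Level 0 covers the whole sequence, level 1 its two halves, and so on.
    """
    grads = [b - a for a, b in zip(series, series[1:])]
    n = len(grads)
    feats = []
    for level in range(levels):
        parts = 2 ** level
        for i in range(parts):
            segment = grads[i * n // parts:(i + 1) * n // parts]
            feats.extend(gradient_histogram(segment))
    return feats


# A toy signal that rises and then falls.
print(cascade_histogram([0, 1, 2, 1, 0], levels=2))  # [2, 2, 2, 0, 0, 2]
```

The coarse level records that the signal both rose and fell, while the finer level records the order in which it did so, which is the kind of temporal structure a summary of pre-activity observations would need to retain.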