Local velocity-adapted motion events for spatio-temporal recognition
In this paper we address the problem of motion recognition using event-based local motion representations. We assume that similar patterns of motion contain similar events with consistent motion across image sequences. Under this assumption, we formulate motion recognition as a matching of corresponding events in image sequences. To enable the matching, we present and evaluate a set of motion descriptors that exploit the spatial and temporal coherence of motion measurements between corresponding events in image sequences. As motion measurements may depend on the relative motion of the camera, we also present a mechanism for local velocity adaptation of events and evaluate its influence when recognizing image sequences subjected to different camera motions. When recognizing motion, we compare the performance of a nearest neighbor (NN) classifier with that of a support vector machine (SVM). We also compare event-based motion representations to motion representations by global histograms. An experimental evaluation on a large video database with human actions demonstrates the advantage of the proposed scheme for event-based motion representation in combination with SVM classification. The particular advantage of event-based representations and velocity adaptation is further emphasized when recognizing human actions in unconstrained scenes with complex and non-stationary backgrounds.
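As an illustration of the matching idea, here is a minimal sketch (not the paper's implementation) of nearest-neighbor classification over sets of local event descriptors, where each event in a query sequence votes for the label of its closest training event:

```python
import numpy as np

def classify_by_event_matching(query_events, train_events, train_labels):
    """Label a query sequence by nearest-neighbor matching of its local
    event descriptors against a pool of labeled training events.

    query_events : (M, D) array of descriptors from the query sequence
    train_events : (N, D) array of descriptors pooled from training sequences
    train_labels : (N,) array of class labels, one per training event
    """
    votes = {}
    for q in query_events:
        # Euclidean distance from this event to every training event
        dists = np.linalg.norm(train_events - q, axis=1)
        label = train_labels[np.argmin(dists)]
        votes[label] = votes.get(label, 0) + 1
    # The sequence takes the label that received the most event votes
    return max(votes, key=votes.get)
```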
Invariance of visual operations at the level of receptive fields
Receptive field profiles registered by cell recordings have shown that
mammalian vision has developed receptive fields tuned to different sizes and
orientations in the image domain as well as to different image velocities in
space-time. This article presents a theoretical model by which families of
idealized receptive field profiles can be derived mathematically from a small
set of basic assumptions that correspond to structural properties of the
environment. The article also presents a theory for how basic invariance
properties to variations in scale, viewing direction and relative motion can be
obtained from the output of such receptive fields, using complementary
selection mechanisms that operate over the output of families of receptive
fields tuned to different parameters. Thereby, the theory shows how basic
invariance properties of a visual system can be obtained already at the level
of receptive fields, and we can explain the different shapes of receptive field
profiles found in biological vision from a requirement that the visual system
should be invariant to the natural types of image transformations that occur in
its environment.
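As a concrete illustration of such a selection mechanism, the following sketch (illustrative only, using the standard scale-normalized Laplacian from scale-space theory) selects, at each image point, the scale at which the normalized response over a family of Gaussian receptive fields is maximal:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def select_scales(image, scales):
    """Scale selection over a family of receptive fields: for each pixel,
    pick the scale t = sigma**2 maximizing the magnitude of the
    scale-normalized Laplacian t * (Lxx + Lyy).

    image  : 2D float array
    scales : sequence of sigma values defining the receptive field family
    """
    responses = []
    for sigma in scales:
        # gaussian_laplace computes the Laplacian of the Gaussian-smoothed
        # image; multiplying by t = sigma**2 makes the responses comparable
        # across scales
        responses.append(np.abs(sigma**2 * gaussian_laplace(image, sigma)))
    responses = np.stack(responses)          # (num_scales, H, W)
    best = np.argmax(responses, axis=0)      # index of maximal response
    return np.asarray(list(scales))[best]    # selected sigma per pixel
```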
Time-causal and time-recursive spatio-temporal receptive fields
We present an improved model and theory for time-causal and time-recursive
spatio-temporal receptive fields, based on a combination of Gaussian receptive
fields over the spatial domain and first-order integrators or equivalently
truncated exponential filters coupled in cascade over the temporal domain.
Compared to previous spatio-temporal scale-space formulations in terms of
non-enhancement of local extrema or scale invariance, these receptive fields
are based on different scale-space axiomatics over time by ensuring
non-creation of new local extrema or zero-crossings with increasing temporal
scale. Specifically, extensions are presented about (i) parameterizing the
intermediate temporal scale levels, (ii) analysing the resulting temporal
dynamics, (iii) transferring the theory to a discrete implementation, (iv)
computing scale-normalized spatio-temporal derivative expressions for
spatio-temporal feature detection and (v) computational modelling of receptive
fields in the lateral geniculate nucleus (LGN) and the primary visual cortex
(V1) in biological vision.
We show that by distributing the intermediate temporal scale levels according
to a logarithmic distribution, we obtain much faster temporal response
properties (shorter temporal delays) compared to a uniform distribution.
Specifically, these kernels converge very rapidly to a limit kernel possessing
true self-similar scale-invariant properties over temporal scales, thereby
allowing for true scale invariance over variations in the temporal scale,
although the underlying temporal scale-space representation is based on a
discretized temporal scale parameter.
We show how scale-normalized temporal derivatives can be defined for these
time-causal scale-space kernels and how the composed theory can be used for
computing basic types of scale-normalized spatio-temporal derivative
expressions in a computationally efficient manner.
(Published in the Journal of Mathematical Imaging and Vision.)
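To make the temporal smoothing concrete, here is a minimal sketch (an illustrative reading of the construction, with an assumed scale ratio c and number of levels K, and using the continuous-kernel relation variance = mu**2 as an approximation for the discrete stages) of a cascade of first-order recursive integrators with logarithmically distributed temporal scale levels:

```python
import numpy as np

def temporal_scale_levels(tau_max, K, c=2.0):
    """Logarithmically distributed temporal scale (variance) levels
    tau_k = c**(2*(k - K)) * tau_max for k = 1..K, so that successive
    levels are related by a constant ratio c**2."""
    return np.array([c**(2 * (k - K)) * tau_max for k in range(1, K + 1)])

def integrator_cascade(signal, tau_levels):
    """Time-causal, time-recursive smoothing of a 1D signal: a cascade of
    first-order integrators (truncated exponential kernels), one per scale
    increment. Only the previous output sample is stored per stage."""
    out = signal.astype(float).copy()
    prev_tau = 0.0
    for tau in tau_levels:
        mu = np.sqrt(tau - prev_tau)   # time constant for this stage
        prev_tau = tau
        smoothed = np.empty_like(out)
        smoothed[0] = out[0]
        for t in range(1, len(out)):
            # first-order recursive update toward the incoming sample
            smoothed[t] = smoothed[t-1] + (out[t] - smoothed[t-1]) / (1.0 + mu)
        out = smoothed
    return out
```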
Dynamic texture recognition using time-causal and time-recursive spatio-temporal receptive fields
This work presents a first evaluation of using spatio-temporal receptive
fields from a recently proposed time-causal spatio-temporal scale-space
framework as primitives for video analysis. We propose a new family of video
descriptors based on regional statistics of spatio-temporal receptive field
responses and evaluate this approach on the problem of dynamic texture
recognition. Our approach generalises a previously used method, based on joint
histograms of receptive field responses, from the spatial to the
spatio-temporal domain and from object recognition to dynamic texture
recognition. The time-recursive formulation enables computationally efficient
time-causal recognition. The experimental evaluation demonstrates competitive
performance compared to the state of the art. In particular, it is shown that
binary versions of our dynamic texture descriptors achieve improved performance
compared to a large range of similar methods using different primitives, either
handcrafted or learned from data. Further, our qualitative and quantitative
investigation into parameter choices and the use of different sets of receptive
fields highlights the robustness and flexibility of our approach. Together,
these results support the descriptive power of this family of time-causal
spatio-temporal receptive fields, validate our approach for dynamic texture
recognition and point towards the possibility of designing a range of video
analysis methods based on these new time-causal spatio-temporal primitives.
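The descriptor idea can be illustrated with a minimal sketch (hypothetical binning choices, not the paper's exact configuration): binarize a set of spatio-temporal receptive field responses per pixel and collect a joint histogram over a region as the video descriptor:

```python
import numpy as np

def joint_binary_histogram(responses):
    """Regional joint histogram of binarized receptive field responses.

    responses : (F, N) array with F receptive field response maps over
                N pixels (flattened region). Each response is binarized
                by its sign, each pixel then falls into one of 2**F joint
                bins, and the normalized bin counts form the descriptor.
    """
    F, N = responses.shape
    bits = (responses > 0).astype(int)     # sign-binarized responses
    codes = np.zeros(N, dtype=int)
    for f in range(F):
        codes = (codes << 1) | bits[f]     # pack F bits into a bin index
    hist = np.bincount(codes, minlength=2**F)
    return hist / hist.sum()               # normalized joint histogram
```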
Event-based Vision: A Survey
Event cameras are bio-inspired sensors that differ from conventional frame
cameras: Instead of capturing images at a fixed rate, they asynchronously
measure per-pixel brightness changes, and output a stream of events that encode
the time, location and sign of the brightness changes. Event cameras offer
attractive properties compared to traditional cameras: high temporal resolution
(in the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low
power consumption, and high pixel bandwidth (on the order of kHz) resulting in
reduced motion blur. Hence, event cameras have a large potential for robotics
and computer vision in scenarios that are challenging for traditional cameras,
such as those demanding low latency, high speed, and high dynamic range.
However, novel methods are
required to process the unconventional output of these sensors in order to
unlock their potential. This paper provides a comprehensive overview of the
emerging field of event-based vision, with a focus on the applications and the
algorithms developed to unlock the outstanding properties of event cameras. We
present event cameras from their working principle, the actual sensors that are
available and the tasks that they have been used for, from low-level vision
(feature detection and tracking, optic flow, etc.) to high-level vision
(reconstruction, segmentation, recognition). We also discuss the techniques
developed to process events, including learning-based techniques, as well as
specialized processors for these novel sensors, such as spiking neural
networks. Additionally, we highlight the challenges that remain to be tackled
and the opportunities that lie ahead in the search for a more efficient,
bio-inspired way for machines to perceive and interact with the world.
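For readers new to the data format, here is a minimal sketch (illustrative field names) of the event stream described above and its accumulation into a frame-like representation over a time window:

```python
import numpy as np

def events_to_frame(events, height, width, t_start, t_end):
    """Accumulate an event stream into a signed 2D frame.

    events : structured array with fields 't' (timestamp, seconds),
             'x', 'y' (pixel coordinates) and 'p' (polarity, +1/-1),
             matching the (time, location, sign) encoding of event cameras.
    Returns a (height, width) image where each pixel sums the polarities
    of the events it received in [t_start, t_end).
    """
    frame = np.zeros((height, width), dtype=np.int32)
    mask = (events['t'] >= t_start) & (events['t'] < t_end)
    for e in events[mask]:
        frame[e['y'], e['x']] += e['p']   # +1 brightness increase, -1 decrease
    return frame

# Example: three synthetic events on a 4x4 sensor
ev = np.array([(0.001, 1, 2, 1), (0.002, 1, 2, 1), (0.003, 3, 0, -1)],
              dtype=[('t', 'f8'), ('x', 'i4'), ('y', 'i4'), ('p', 'i4')])
print(events_to_frame(ev, 4, 4, 0.0, 0.01))
```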
What Will I Do Next? The Intention from Motion Experiment
In computer vision, video-based approaches have been widely explored for the
early classification and the prediction of actions or activities. However, it
remains unclear whether this modality (as compared to 3D kinematics) can still
be reliable for the prediction of human intentions, defined as the overarching
goal embedded in an action sequence. Since the same action can be performed
with different intentions, this problem is more challenging, yet tractable, as
demonstrated by quantitative cognitive studies that exploit 3D kinematics
acquired through motion capture systems. In this paper, we bridge cognitive and
computer vision studies by demonstrating the effectiveness of video-based
approaches for the prediction of human intentions. Specifically, we propose
Intention from Motion, a new paradigm where, without using any contextual
information, we consider instantaneous grasping motor acts involving a bottle
in order to forecast why the bottle itself has been reached (to pass it, to
place it in a box, or to pour or drink the liquid inside). We process only the
grasping onsets, casting intention prediction as a classification problem.
Leveraging our multimodal acquisition (3D motion capture data and 2D optical
videos), we compare the most commonly used 3D descriptors from cognitive
studies with state-of-the-art video-based techniques. Since the two analyses
achieve equivalent performance, we demonstrate that computer vision tools are
effective in capturing the kinematics and in addressing the cognitive problem
of human intention prediction.
(2017 IEEE Conference on Computer Vision and Pattern Recognition Workshop.)
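As an illustration of casting intention prediction as classification over grasping onsets, here is a minimal sketch (with hypothetical feature choices, not the paper's descriptors) that turns a wrist trajectory from the onset into simple kinematic features and trains a standard SVM:

```python
import numpy as np
from sklearn.svm import SVC

def onset_features(trajectory, dt=1/100.0):
    """Simple kinematic features from a (T, 3) wrist trajectory sampled
    at 100 Hz during the grasping onset: mean speed, peak speed and total
    path length (hypothetical descriptors, for illustration only)."""
    velocities = np.diff(trajectory, axis=0) / dt
    speeds = np.linalg.norm(velocities, axis=1)
    return np.array([speeds.mean(), speeds.max(), speeds.sum() * dt])

def train_intention_classifier(trajectories, intentions):
    """trajectories: list of (T, 3) arrays; intentions: labels such as
    'pass', 'place', 'pour', 'drink'."""
    X = np.stack([onset_features(tr) for tr in trajectories])
    clf = SVC(kernel='rbf')   # standard SVM classifier
    clf.fit(X, intentions)
    return clf
```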
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task,
emphasizing the hypotheses assumed and, thus, the constraints imposed on the
type of video that each technique is able to address. Making these hypotheses
and constraints explicit makes the framework particularly useful for selecting
a method given an application. Another advantage of the proposed organization
is that it allows categorizing the newest approaches seamlessly alongside
traditional ones, while providing an insightful perspective on the evolution
of the action recognition task up to now. That perspective is the basis for
the discussion at the end of the paper, where we also present the main open
issues in the area.
(Preprint submitted to CVIU.)
Separable time-causal and time-recursive spatio-temporal receptive fields
We present an improved model and theory for time-causal and time-recursive
spatio-temporal receptive fields, obtained by a combination of Gaussian
receptive fields over the spatial domain and first-order integrators or
equivalently truncated exponential filters coupled in cascade over the temporal
domain. Compared to previous spatio-temporal scale-space formulations in terms
of non-enhancement of local extrema or scale invariance, these receptive fields
are based on different scale-space axiomatics over time by ensuring
non-creation of new local extrema or zero-crossings with increasing temporal
scale. Specifically, extensions are presented about parameterizing the
intermediate temporal scale levels, analysing the resulting temporal dynamics
and transferring the theory to a discrete implementation in terms of recursive
filters over time.
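Separability means the spatio-temporal smoothing factorizes into a spatial Gaussian per frame followed by a recursive filter over time. A minimal sketch of that factorized computation (with a single integrator stage for brevity, where the theory couples several in cascade; parameter choices are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def separable_spatiotemporal_smoothing(video, sigma_space, mu_time):
    """Separable time-causal smoothing of a (T, H, W) video: a Gaussian
    over the spatial domain of each frame, followed by a first-order
    recursive integrator (time constant mu_time) over the temporal domain,
    processing frames strictly in time order."""
    out = np.empty_like(video, dtype=float)
    state = gaussian_filter(video[0].astype(float), sigma_space)
    out[0] = state
    for t in range(1, video.shape[0]):
        frame = gaussian_filter(video[t].astype(float), sigma_space)
        # time-recursive update: only the previous state is stored
        state = state + (frame - state) / (1.0 + mu_time)
        out[t] = state
    return out
```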
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g., onset and offset phases for "smile", running and jumping for
"highjump"). The proposed model is inspired by recent works on Multiple
Instance Learning and latent SVM/HCRF; it extends such frameworks to
approximately model the ordinal aspect of the videos. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method.
(Accepted to IEEE TPAMI.)
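To illustrate the ordinal aspect, here is a minimal sketch (illustrative, not the trained model) of scoring a video under K sub-event templates that must fire in temporal order, solved by dynamic programming over frames:

```python
import numpy as np

def ordinal_score(features, templates):
    """Best score of assigning K ordered sub-event templates to frames.

    features  : (T, D) per-frame feature vectors
    templates : (K, D) one linear template (weight vector) per sub-event
    Returns the maximum over frame indices t_1 < t_2 < ... < t_K of
    sum_k templates[k] . features[t_k], via dynamic programming.
    """
    T, _ = features.shape
    K = templates.shape[0]
    scores = features @ templates.T        # (T, K) frame-template scores
    dp = np.full((K, T), -np.inf)
    dp[0] = scores[:, 0]                   # sub-event 0 at any frame
    for k in range(1, K):
        best_prev = -np.inf
        for t in range(1, T):
            # best placement of sub-event k-1 strictly before frame t
            best_prev = max(best_prev, dp[k-1, t-1])
            dp[k, t] = best_prev + scores[t, k]
    return dp[K-1].max()
```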