Dynamic texture recognition using time-causal and time-recursive spatio-temporal receptive fields
This work presents a first evaluation of using spatio-temporal receptive
fields from a recently proposed time-causal spatio-temporal scale-space
framework as primitives for video analysis. We propose a new family of video
descriptors based on regional statistics of spatio-temporal receptive field
responses and evaluate this approach on the problem of dynamic texture
recognition. Our approach generalises a previously used method, based on joint
histograms of receptive field responses, from the spatial to the
spatio-temporal domain and from object recognition to dynamic texture
recognition. The time-recursive formulation enables computationally efficient
time-causal recognition. The experimental evaluation demonstrates competitive
performance compared to the state-of-the-art. In particular, it is shown that
binary versions of our dynamic texture descriptors achieve improved performance
compared to a large range of similar methods using different primitives, either
handcrafted or learned from data. Further, our qualitative and quantitative
investigation into parameter choices and the use of different sets of receptive
fields highlights the robustness and flexibility of our approach. Together,
these results support the descriptive power of this family of time-causal
spatio-temporal receptive fields, validate our approach for dynamic texture
recognition and point towards the possibility of designing a range of video
analysis methods based on these new time-causal spatio-temporal primitives.
Comment: 29 pages, 16 figures
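As a rough illustration of this family of descriptors, the sketch below computes first-order spatio-temporal Gaussian-derivative responses over a video volume and summarises them in a joint histogram. It is a minimal stand-in, not the paper's method: ordinary non-causal Gaussians replace the time-causal, time-recursive kernels, and all function names and parameter values are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def receptive_field_responses(video, sigma_s=2.0, sigma_t=1.5):
    """First-order Gaussian-derivative responses along x, y and t.

    video: array of shape (T, H, W). A non-causal Gaussian stands in
    for the time-causal temporal kernels used in the paper.
    """
    v = video.astype(float)
    dt = gaussian_filter(v, sigma=(sigma_t, sigma_s, sigma_s), order=(1, 0, 0))
    dy = gaussian_filter(v, sigma=(sigma_t, sigma_s, sigma_s), order=(0, 1, 0))
    dx = gaussian_filter(v, sigma=(sigma_t, sigma_s, sigma_s), order=(0, 0, 1))
    return np.stack([dx, dy, dt], axis=-1)           # (T, H, W, 3)

def joint_histogram_descriptor(responses, bins=5):
    """Joint histogram over jointly quantised receptive field responses."""
    flat = responses.reshape(-1, responses.shape[-1])
    # quantile-based bin edges so each marginal bin is equally populated
    edges = [np.quantile(flat[:, c], np.linspace(0, 1, bins + 1))
             for c in range(flat.shape[1])]
    hist, _ = np.histogramdd(flat, bins=edges)
    return (hist / hist.sum()).ravel()               # bins**3 entries

video = np.random.rand(20, 64, 64)                   # stand-in for a texture clip
descriptor = joint_histogram_descriptor(receptive_field_responses(video))
```

Recognition would then compare such histograms across clips, e.g. with a chi-squared distance or a linear classifier.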
DroTrack: High-speed Drone-based Object Tracking Under Uncertainty
We present DroTrack, a high-speed visual single-object tracking framework for
drone-captured video sequences. Most of the existing object tracking methods
are designed to tackle well-known challenges, such as occlusion and cluttered
backgrounds. The complex motion of drones, i.e., multiple degrees of freedom in
three-dimensional space, causes high uncertainty. The uncertainty problem leads
to inaccurate location predictions and fuzziness in scale estimations. DroTrack
solves such issues by discovering the dependency between object representation
and motion geometry. We implement an effective object segmentation based on
Fuzzy C-Means (FCM) and incorporate spatial information into the membership
function to cluster the most discriminative segments. We then enhance the
object segmentation using a pre-trained Convolutional Neural Network (CNN)
model. DroTrack also leverages the geometrical angular motion to estimate a
reliable object scale. We discuss the experimental results and performance
evaluation using two datasets of 51,462 drone-captured frames. The combination
of the FCM segmentation and the angular scaling increased DroTrack precision by
up to and decreased the centre location error by pixels on average.
DroTrack outperforms all the high-speed trackers and achieves results
comparable to those of deep learning trackers. DroTrack offers high frame
rates, up to 1,000 frames per second (fps), with better location precision
than a set of state-of-the-art real-time trackers.
Comment: 10 pages, 12 figures, FUZZ-IEEE 2020
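The segmentation step builds on Fuzzy C-Means with spatial information folded into the membership function. The sketch below is a generic spatially-regularised FCM in NumPy, offered only to illustrate that idea; it is not DroTrack's exact formulation, and the blending weight `alpha` and other parameters are assumptions.

```python
import numpy as np

def spatial_fcm(image, k=2, m=2.0, alpha=0.5, iters=20):
    """Fuzzy C-Means on pixel intensities with a simple spatial regulariser:
    each pixel's memberships are blended with those of its 4-neighbours."""
    h, w = image.shape
    x = image.reshape(-1, 1).astype(float)
    u = np.random.dirichlet(np.ones(k), size=h * w)       # memberships (N, k)
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]    # (k, 1) cluster centres
        d = np.abs(x - centers.T) + 1e-9                  # (N, k) distances
        u = 1.0 / d ** (2.0 / (m - 1.0))                  # standard FCM update
        u /= u.sum(axis=1, keepdims=True)
        ug = u.reshape(h, w, k)                           # spatial blending step
        nbr = (np.roll(ug, 1, 0) + np.roll(ug, -1, 0) +
               np.roll(ug, 1, 1) + np.roll(ug, -1, 1)) / 4.0
        u = ((1.0 - alpha) * ug + alpha * nbr).reshape(-1, k)
        u /= u.sum(axis=1, keepdims=True)
    return u.reshape(h, w, k)

frame = np.random.rand(48, 48)                            # stand-in for a frame
segmentation = spatial_fcm(frame).argmax(axis=-1)         # hard segment labels
```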
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity-sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, which is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018
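A hedged PyTorch sketch of the kind of objective this describes: faces and voices from the same video are treated as positive pairs without identity labels, and a `hard_frac` knob (a hypothetical parameter, introduced here only for illustration) keeps the hardest negatives per anchor, so annealing it gives a simple curriculum over negative difficulty. This is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(face_emb, voice_emb, tau=0.1, hard_frac=1.0):
    """Contrastive loss pairing each face with the voice from the same video.

    face_emb, voice_emb: (B, D); row i of each comes from the same
    (unlabelled) talking-face video. hard_frac in (0, 1] selects the
    hardest negatives per anchor.
    """
    f = F.normalize(face_emb, dim=1)
    v = F.normalize(voice_emb, dim=1)
    sim = f @ v.t() / tau                                  # (B, B) similarities
    B = sim.size(0)
    pos = sim.diag().unsqueeze(1)                          # matched pairs
    neg = sim.masked_fill(torch.eye(B, dtype=torch.bool), float('-inf'))
    k = max(1, int(hard_frac * (B - 1)))
    hard_neg, _ = neg.topk(k, dim=1)                       # hardest negatives
    logits = torch.cat([pos, hard_neg], dim=1)
    return F.cross_entropy(logits, torch.zeros(B, dtype=torch.long))

loss = cross_modal_loss(torch.randn(8, 128), torch.randn(8, 128), hard_frac=0.5)
```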
Age-related delay in information accrual for faces: Evidence from a parametric, single-trial EEG approach
Background: In this study, we quantified age-related changes in the time-course of face processing
by means of an innovative single-trial ERP approach. Unlike analyses used in previous studies, our
approach does not rely on peak measurements and can provide a more sensitive measure of
processing delays. Young and old adults (mean ages 22 and 70 years) performed a non-speeded
discrimination task between two faces. The phase spectrum of these faces was manipulated
parametrically to create pictures that ranged between pure noise (0% phase information) and the
undistorted signal (100% phase information), with five intermediate steps.
Results: Behavioural 75% correct thresholds were on average lower, and maximum accuracy was
higher, in younger than older observers. ERPs from each subject were entered into a single-trial
general linear regression model to identify variations in neural activity statistically associated with
changes in image structure. The earliest age-related ERP differences occurred in the time window
of the N170. Older observers had a significantly stronger N170 in response to noise, but this age
difference decreased with increasing phase information. Overall, manipulating image phase
information had a greater effect on ERPs from younger observers, which was quantified using a
hierarchical modelling approach. Importantly, visual activity was modulated by the same stimulus
parameters in younger and older subjects. The fit of the model, indexed by R2, was computed at
multiple post-stimulus time points. The time-course of the R2 function showed a significantly slower
processing in older observers starting around 120 ms after stimulus onset. This age-related delay
increased over time to reach a maximum around 190 ms, at which latency younger observers had
around 50 ms time lead over older observers.
Conclusion: Using a component-free ERP analysis that provides a precise timing of the visual
system sensitivity to image structure, the current study demonstrates that older observers
accumulate face information more slowly than younger subjects. Additionally, the N170 appears to
be less face-sensitive in older observers.
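The per-timepoint model fit can be made concrete with a small NumPy sketch: regress single-trial ERP amplitudes on the phase-information level at every time point and read off the R² time course. Variable names and the toy data below are illustrative, not the study's actual design matrix.

```python
import numpy as np

def single_trial_r2(phase_info, erps):
    """R^2 at every time point from regressing single-trial ERP amplitude
    on phase information (with an intercept).

    phase_info: (n_trials,) phase coherence per trial, e.g. 0..100.
    erps:       (n_trials, n_timepoints) single-trial amplitudes.
    """
    X = np.column_stack([np.ones_like(phase_info), phase_info])
    beta, *_ = np.linalg.lstsq(X, erps, rcond=None)        # (2, n_timepoints)
    ss_res = ((erps - X @ beta) ** 2).sum(axis=0)
    ss_tot = ((erps - erps.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

trials, times = 700, 300                                   # toy single-trial data
phase = np.random.choice([0., 20., 40., 60., 80., 100.], size=trials)
eeg = 0.02 * phase[:, None] + np.random.randn(trials, times)
r2 = single_trial_r2(phase, eeg)    # compare its time course across age groups
```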
Deep Shape-from-Template: Single-image quasi-isometric deformable registration and reconstruction
Shape-from-Template (SfT) solves 3D vision from a single image and a deformable 3D object model, called a template. Concretely, SfT computes registration (the correspondence between the template and the image) and reconstruction (the depth in camera frame). It constrains the object deformation to quasi-isometry. Real-time and automatic SfT represents an open problem for complex objects and imaging conditions. We present four contributions to address core unmet challenges to realise SfT with a Deep Neural Network (DNN). First, we propose a novel DNN called DeepSfT, which encodes the template in its weights and hence copes with highly complex templates. Second, we propose a semi-supervised training procedure to exploit real data. This is a practical solution to overcome the render gap that occurs when training only with simulated data. Third, we propose a geometry adaptation module to deal with different cameras at training and inference. Fourth, we combine statistical learning with physics-based reasoning. DeepSfT runs automatically and in real-time, and we show with numerous experiments and an ablation study that it consistently achieves a lower 3D error than previous work. It also generalises better and achieves strong reconstruction and registration accuracy under wide-baseline viewpoints, occlusions, illumination changes, weak texture and blur.
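The quasi-isometry constraint has a standard mesh surrogate: isometric deformation preserves geodesic distances, which on a triangle mesh can be approximated by preserving every edge length. The PyTorch sketch below shows such a loss term as an illustration; it is not necessarily the formulation used in DeepSfT.

```python
import torch

def quasi_isometry_loss(verts_pred, verts_template, edges):
    """Penalise deviation of predicted 3D edge lengths from the template's.

    verts_pred, verts_template: (V, 3) vertex positions;
    edges: (E, 2) long tensor of vertex index pairs.
    """
    def edge_len(v):
        return (v[edges[:, 0]] - v[edges[:, 1]]).norm(dim=1)
    return ((edge_len(verts_pred) - edge_len(verts_template)) ** 2).mean()

template = torch.rand(100, 3)                       # toy template mesh
edges = torch.randint(0, 100, (300, 2))
pred = template + 0.01 * torch.randn_like(template) # near-isometric deformation
loss = quasi_isometry_loss(pred, template, edges)   # add to the training loss
```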
Discovering Relationships between Object Categories via Universal Canonical Maps
We tackle the problem of learning the geometry of multiple categories of
deformable objects jointly. Recent work has shown that it is possible to learn
a unified dense pose predictor for several categories of related objects.
However, training such models requires initializing inter-category
correspondences by hand. This is suboptimal, and the resulting models fail to
maintain correct correspondences as individual categories are learned. In this
paper, we show that improved correspondences can be learned automatically as a
natural byproduct of learning category-specific dense pose predictors. To do
this, we express correspondences between different categories and between
images and categories using a unified embedding. Then, we use the latter to
enforce two constraints: symmetric inter-category cycle consistency and a new
asymmetric image-to-category cycle consistency. Without any manual annotations
for the inter-category correspondences, we obtain state-of-the-art alignment
results, outperforming dedicated methods for matching 3D shapes. Moreover, the
new model is also better at the task of dense pose prediction than prior work.
Comment: Accepted at CVPR 2021; Project page: https://gdude.de/discovering-3d-obj-re
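A hedged PyTorch sketch of an inter-category cycle-consistency term of this kind: soft nearest-neighbour maps built from the shared embedding send points from category A to category B and back, and the round trip is penalised for not returning to its starting point. The soft-assignment form and all names are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_map(src, dst, tau=0.07):
    """Soft nearest-neighbour assignment from src to dst: each row holds
    softmax-normalised matching weights based on embedding similarity."""
    sim = F.normalize(src, dim=1) @ F.normalize(dst, dim=1).t()
    return (sim / tau).softmax(dim=1)                # (Ns, Nd)

def cycle_consistency_loss(emb_a, emb_b, xyz_a):
    """The cycle A -> B -> A should land back on the starting points.

    emb_a, emb_b: per-vertex embeddings of two categories;
    xyz_a: (Na, 3) vertex positions of category A.
    """
    p_ab = soft_map(emb_a, emb_b)
    p_ba = soft_map(emb_b, emb_a)
    round_trip = p_ab @ (p_ba @ xyz_a)               # positions after the cycle
    return ((round_trip - xyz_a) ** 2).sum(dim=1).mean()

emb_a, emb_b = torch.randn(200, 64), torch.randn(180, 64)
loss = cycle_consistency_loss(emb_a, emb_b, torch.rand(200, 3))
```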
Bioinspired symmetry detection on resource limited embedded platforms
This work is inspired by the vision of flying insects, which enables them to detect and locate a set of relevant objects with remarkable effectiveness despite very limited brainpower. The bioinspired approach worked out here focuses on the detection of symmetric objects by resource-limited embedded platforms such as micro air vehicles. Symmetry detection is posed as a pattern matching problem and solved with composite correlation filters. Two variants of the approach are proposed, analysed and tested, in which symmetry detection is cast as 1) a static and 2) a dynamic pattern matching problem. In the static variant, images of objects are input to two-dimensional spatial composite correlation filters. In the dynamic variant, a video (resulting from platform motion) is input to a composite correlation filter whose peak response is used to define symmetry. In both cases, a novel method is used to design the composite filter templates for symmetry detection; this method significantly reduces the level of detail that needs to be matched to achieve good detection performance. The resulting performance is systematically quantified using ROC analysis, and it is demonstrated that the bioinspired detection approach achieves better detection at a lower computational cost than the best state-of-the-art solution hitherto available.
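A minimal NumPy sketch of correlation-filter matching of this kind: a composite filter is synthesised from several training views (a simple average-of-conjugate-spectra design, standing in for the paper's novel template design method), and detection reads off the sharpness of the correlation peak.

```python
import numpy as np

def composite_filter(templates):
    """One filter matched to several training views at once: the average
    of the conjugate DFTs of the templates (a basic composite design)."""
    return np.mean([np.conj(np.fft.fft2(t)) for t in templates], axis=0)

def correlate(image, H):
    """FFT-based correlation; a sharp peak signals a match."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))

views = [np.random.rand(64, 64) for _ in range(5)]   # stand-in training views
H = composite_filter(views)
response = correlate(np.random.rand(64, 64), H)
score = response.max() / (response.std() + 1e-9)     # peak-sharpness score
```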
A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild"
Recently, technologies such as face detection, facial landmark localisation
and face recognition and verification have matured enough to provide effective
and efficient solutions for imagery captured under arbitrary conditions
(referred to as "in-the-wild"). This is partially attributed to the fact that
comprehensive "in-the-wild" benchmarks have been developed for face detection,
landmark localisation and recognition/verification. A very important technology
that has not been thoroughly evaluated yet is deformable face tracking
"in-the-wild". Until now, the performance has mainly been assessed
qualitatively by visually assessing the result of a deformable face tracking
technology on short videos. In this paper, we perform the first, to the best of
our knowledge, thorough evaluation of state-of-the-art deformable face tracking
pipelines using the recently introduced 300VW benchmark. We evaluate many
different architectures focusing mainly on the task of on-line deformable face
tracking. In particular, we compare the following general strategies: (a)
generic face detection plus generic facial landmark localisation, (b) generic
model free tracking plus generic facial landmark localisation, as well as (c)
hybrid approaches using state-of-the-art face detection, model free tracking
and facial landmark localisation technologies. Our evaluation reveals future
avenues for further research on the topic.
Comment: E. Antonakos and P. Snape contributed equally and have joint second authorship
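The compared strategies reduce to variants of one loop. The skeleton below illustrates strategies (a) and (b); the `detector`, `tracker` and `landmark_model` callables are placeholders for whatever concrete components a pipeline plugs in, not specific systems from the paper.

```python
def track_faces(frames, detector, tracker, landmark_model, strategy="a"):
    """(a): per-frame face detection + landmark localisation.
    (b): model-free tracking of the face box + landmark localisation."""
    results, box = [], None
    for frame in frames:
        if strategy == "a" or box is None:
            box = detector(frame)           # generic face detection
        else:
            box = tracker(frame, box)       # generic model-free tracking
        results.append(landmark_model(frame, box))
    return results

# toy stand-ins so the skeleton runs end-to-end
frames = [object()] * 5
detect = lambda f: (0, 0, 10, 10)
track = lambda f, box: box
landmarks = lambda f, box: [(1, 1)] * 68
points = track_faces(frames, detect, track, landmarks, strategy="b")
```

The hybrid strategies (c) would add switching logic to this loop, e.g. re-running the detector when tracking confidence drops.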
Hand Gesture Recognition using Depth Data for Indian Sign Language
It is hard for most people who are not familiar with a sign language to communicate without an interpreter. Thus, a system that transcribes symbols in sign languages into plain text can help with real-time communication, and it may also provide interactive training for people learning a sign language. A sign language uses manual communication and body language to convey meaning. Depth data for five different gestures, corresponding to the letters Y, V, L, S and I, were obtained from an online database. Each segmented gesture is represented by its time-series curve, and a feature vector is extracted from it. To recognise the class of a noisy input hand shape, a distance metric for hand dissimilarity, called the Finger-Earth Mover's Distance (FEMD), is used. As it matches only the fingers and not the complete hand shape, it can better distinguish hand gestures with slight differences.
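A hedged Python sketch of the representation and matching steps: the hand contour becomes a time-series curve of radial distances from the palm centre, finger-like segments are kept, and an earth mover's distance compares them. SciPy's 1D Wasserstein distance serves here as a simple stand-in for the full FEMD, which additionally penalises unmatched fingers; the threshold and names are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def radial_curve(contour, center):
    """Hand shape as a time-series curve: distance of each contour point
    to the palm centre, ordered by angle around it."""
    d = contour - center
    ang = np.arctan2(d[:, 1], d[:, 0])
    return np.linalg.norm(d, axis=1)[np.argsort(ang)]

def femd_like(curve_a, curve_b, thresh=1.2):
    """FEMD-style dissimilarity: keep the finger-like parts of each curve
    (points well above the mean radius) and compare them with a 1D
    earth mover's distance."""
    fa = curve_a[curve_a > thresh * curve_a.mean()]
    fb = curve_b[curve_b > thresh * curve_b.mean()]
    if len(fa) == 0 or len(fb) == 0:                 # no fingers extended
        return float(abs(len(fa) - len(fb)))
    return wasserstein_distance(fa, fb)

a = np.random.rand(200, 2)                           # stand-ins for contours
b = np.random.rand(200, 2)
score = femd_like(radial_curve(a, a.mean(axis=0)),
                  radial_curve(b, b.mean(axis=0)))   # smaller = more similar
```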