Online Metric-Weighted Linear Representations for Robust Visual Tracking
In this paper, we propose a visual tracker based on a metric-weighted linear
representation of appearance. In order to capture the interdependence of
different feature dimensions, we develop two online distance metric learning
methods using proximity comparison information and structured output learning.
The learned metric is then incorporated into a linear representation of
appearance.
We show that online distance metric learning significantly improves the
robustness of the tracker, especially on those sequences exhibiting drastic
appearance changes. In order to bound growth in the number of training samples,
we design a time-weighted reservoir sampling method.
Moreover, we enable our tracker to automatically perform object
identification during the process of object tracking, by introducing a
collection of static template samples belonging to several object classes of
interest. Object identification results for an entire video sequence are
achieved by systematically combining the tracking information and visual
recognition at each frame. Experimental results on challenging video sequences
demonstrate the effectiveness of the method for both inter-frame tracking and
object identification.
Comment: 51 pages. Appearing in IEEE Transactions on Pattern Analysis and Machine Intelligence.
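The time-weighted reservoir sampling used above to bound the number of training samples can be sketched with the Efraimidis-Spirakis weighted-reservoir scheme. The exponential recency weighting and the `decay` constant below are illustrative assumptions, not the paper's exact formulation:

```python
import heapq
import math
import random

def time_weighted_reservoir(stream, capacity, decay=0.05):
    """Keep a fixed-size sample biased toward recent items.

    Each item gets key u**(1/w) with u ~ Uniform(0,1); the `capacity`
    largest keys are retained (Efraimidis-Spirakis weighted reservoir).
    The weight grows exponentially with arrival time, so newer samples
    are more likely to survive in the reservoir.
    """
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for t, item in enumerate(stream):
        weight = math.exp(decay * t)          # newer items weigh more
        key = random.random() ** (1.0 / weight)
        if len(heap) < capacity:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

With this weighting, old appearance samples are gradually displaced, which matches the goal of tracking drastic appearance changes while keeping memory bounded.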
Classification of Occluded Objects using Fast Recurrent Processing
Recurrent neural networks are powerful tools for handling incomplete data
problems in computer vision, thanks to their significant generative
capabilities. However, the computational demand for these algorithms is too
high to work in real time, without specialized hardware or software solutions.
In this paper, we propose a framework for augmenting recurrent processing
capabilities into a feedforward network without sacrificing much
computational efficiency. We assume a mixture model and generate samples of the
last hidden layer according to the class decisions of the output layer, modify
the hidden layer activity using the samples, and propagate to lower layers. For
the visual occlusion problem, the iterative procedure emulates a feedforward-feedback
loop, filling in the missing hidden layer activity with meaningful
representations. The proposed algorithm is tested on a widely used dataset, and
shown to achieve a 2% improvement in classification accuracy for occluded
objects. When compared to Restricted Boltzmann Machines, our algorithm shows
superior performance for occluded object classification.
Comment: arXiv admin note: text overlap with arXiv:1409.8576 by other authors.
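The feedback loop described above, sampling the last hidden layer according to the output decision, modifying the hidden activity, and re-propagating, can be sketched as follows. The toy network weights, the class-conditional means `mu`, and the blending weight are hypothetical stand-ins for quantities the actual method would learn from data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feedforward net: 16-d input -> 8-d hidden -> 3 class scores.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 3))

def forward(x):
    h = np.tanh(x @ W1)          # hidden activity
    return h, h @ W2             # hidden activity and class scores

# Hypothetical class-conditional hidden-layer means (one mixture
# component per class; estimated from training data in practice).
mu = rng.normal(size=(3, 8))

def iterative_fill_in(x, steps=3, blend=0.5):
    """Emulate a feedforward-feedback loop: sample hidden activity
    from the mixture component of the current class decision, blend
    it with the observed (possibly occluded) hidden activity, and
    re-classify."""
    h, scores = forward(x)
    for _ in range(steps):
        c = int(np.argmax(scores))                  # current decision
        sample = mu[c] + 0.1 * rng.normal(size=8)   # draw from component c
        h = blend * h + (1 - blend) * sample        # fill in hidden activity
        scores = h @ W2                             # re-propagate
    return int(np.argmax(scores))
```

The iteration is cheap because only the top layers are revisited, which is the source of the claimed efficiency over full recurrent inference.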
Face recognition technologies for evidential evaluation of video traces
Human recognition from video traces is an important task in forensic investigations and evidence evaluations. Compared with other biometric traits, the face is one of the most widely used modalities for human recognition, because its collection is non-intrusive and requires little cooperation from the subject. Moreover, face images taken at a long distance can still provide reasonable resolution, a merit that most other biometric modalities, such as iris and fingerprint, lack. In this chapter, we discuss automatic face recognition technologies for evidential evaluation of video traces. We first introduce the general concepts in both forensic and automatic face recognition, then analyse the difficulties in face recognition from videos. We summarise and categorise the approaches for handling different uncontrollable factors in difficult recognition conditions. Finally, we discuss some challenges and trends in face recognition research in both forensics and biometrics. Given its merits tested in many deployed systems and its great potential in other emerging applications, considerable research and development effort is expected to be devoted to face recognition in the near future.
Representation and recognition of human actions in video
Automated human action recognition plays a critical role in the development of human-machine
communication, by aiming for a more natural interaction between artificial intelligence and the
human society. Recent developments in technology have permitted a shift from a traditional
human action recognition performed in a well-constrained laboratory environment to realistic
unconstrained scenarios. This advancement has given rise to new problems and challenges still
not addressed by the available methods. Thus, the aim of this thesis is to study innovative approaches
that address the challenging problems of human action recognition from video captured
in unconstrained scenarios. To this end, novel action representations, feature selection methods,
fusion strategies and classification approaches are formulated.
More specifically, a novel interest point based action representation is first introduced; this
representation seeks to describe actions as clouds of interest points accumulated at different temporal
scales. The idea behind this method consists of extracting holistic features from the point
clouds and explicitly and globally describing the spatial and temporal action dynamics. Since
the proposed clouds of points representation exploits alternative and complementary information
compared to the conventional interest points-based methods, a more solid representation is then
obtained by fusing the two representations, adopting a Multiple Kernel Learning strategy. The
validity of the proposed approach in recognising actions from a well-known benchmark dataset is
demonstrated as well as the superior performance achieved by fusing representations.
Since the proposed method appears limited by the presence of a dynamic background and fast
camera movements, a novel trajectory-based representation is formulated. Different from interest
points, trajectories can simultaneously retain motion and appearance information even in noisy
and crowded scenarios. Additionally, they can handle drastic camera movements and support a robust
region-of-interest estimation. An equally important contribution is the proposed collaborative
feature selection performed to remove redundant and noisy components. In particular, a novel
feature selection method based on Multi-Class Delta Latent Dirichlet Allocation (MC-DLDA)
is introduced. Crucially, to enrich the final action representation, the trajectory representation is
adaptively fused with a conventional interest point representation. The proposed approach is
extensively validated on different datasets, and the reported performances are comparable with
the best state-of-the-art. The obtained results also confirm the fundamental contribution of both
collaborative feature selection and adaptive fusion.
Finally, the problem of realistic human action classification in very ambiguous scenarios is
taken into account. In these circumstances, standard feature selection methods and multi-class
classifiers appear inadequate due to a sparse training set, high intra-class variation and inter-class
similarity. Thus, both the feature selection and classification problems need to be redesigned.
The proposed idea is to iteratively decompose the classification task in subtasks and select the
optimal feature set and classifier in accordance with the subtask context. To this end, a cascaded
feature selection and action classification approach is introduced. The proposed cascade aims to
classify actions by exploiting as much information as possible, and at the same time trying to
simplify the multi-class classification in a cascade of binary separations. Specifically, instead of
separating multiple action classes simultaneously, the overall task is automatically divided into
easier binary sub-tasks. Experiments have been carried out using challenging public datasets;
the obtained results demonstrate that with identical action representation, the cascaded classifier
significantly outperforms standard multi-class classifiers.
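The cascade idea above, replacing one hard multi-class decision with a sequence of easier binary separations, can be sketched with a minimal nearest-centroid rule per stage. The centroid classifier and the one-class-at-a-time peeling order are illustrative simplifications; the thesis's cascade also selects a context-specific feature set at each stage:

```python
import numpy as np

class BinaryCascade:
    """Decompose a K-way classification task into a cascade of
    binary decisions: each stage separates one class from the
    remaining classes with a simple centroid rule, so every stage
    faces an easier two-class problem."""

    def fit(self, X, y):
        self.stages = []  # (class_label, own_centroid, rest_centroid)
        remaining = list(np.unique(y))
        while len(remaining) > 1:
            c = remaining.pop(0)
            own = X[y == c].mean(axis=0)                 # this class
            rest = X[np.isin(y, remaining)].mean(axis=0)  # all later classes
            self.stages.append((c, own, rest))
        self.fallback = remaining[0]  # last class needs no separator
        return self

    def predict_one(self, x):
        for c, own, rest in self.stages:
            if np.linalg.norm(x - own) < np.linalg.norm(x - rest):
                return c  # stage accepts: stop the cascade
        return self.fallback
```

Each stage only has to answer "this class or not", which is the sense in which the cascade simplifies the original multi-class problem.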
The Visual Social Distancing Problem
One of the main and most effective measures to contain the recent viral
outbreak is the maintenance of the so-called Social Distancing (SD). To comply
with this constraint, workplaces, public institutions, transports and schools
will likely adopt restrictions over the minimum inter-personal distance between
people. Given this scenario, it is crucial to measure compliance with this
physical constraint at scale, in order to understand the reasons behind
breaches of the distance limit and whether such breaches pose a threat given
the scene context, all while complying with privacy policies and keeping the
measurement acceptable. To this end, we
introduce the Visual Social Distancing (VSD) problem, defined as the automatic
estimation of the inter-personal distance from an image, and the
characterization of the related people aggregations. VSD is pivotal for a
non-invasive analysis of whether people comply with the SD restriction, and
for providing statistics about the level of safety of specific areas whenever
this constraint is violated. We then discuss how VSD relates to previous
literature in Social Signal Processing and indicate which existing Computer
Vision methods can be used to address this problem. We conclude with future
challenges related to the effectiveness of VSD systems, ethical implications
and future application scenarios.
Comment: 9 pages, 5 figures. All the authors contributed equally to this manuscript and are listed in alphabetical order. Under submission.
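A standard Computer Vision route to the inter-personal distance estimation defined above is to map detected feet positions to the ground plane through a homography and measure metric distances there. The sketch below assumes detections and a floor calibration are already available; the homography, the pixel coordinates, and the 1-metre threshold are illustrative, not a prescription of the VSD paper:

```python
import numpy as np

def to_ground(pt, H):
    """Map a feet pixel (u, v) to ground-plane coordinates via the
    homography H (assumed known from a floor calibration, e.g. four
    reference points with measured metric positions)."""
    x = H @ np.array([pt[0], pt[1], 1.0])
    return x[:2] / x[2]  # dehomogenise

def sd_violations(feet_pixels, H, min_dist=1.0):
    """Return index pairs of detected people standing closer than
    `min_dist` metres on the ground plane."""
    pts = [to_ground(p, H) for p in feet_pixels]
    return [(i, j)
            for i in range(len(pts))
            for j in range(i + 1, len(pts))
            if np.linalg.norm(pts[i] - pts[j]) < min_dist]
```

The returned pairs are exactly the "people aggregations" whose characterization the VSD problem calls for; keeping only ground-plane coordinates (no identities or appearance) also helps with the privacy requirement.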