Confidence-based Cue Integration for Visual Place Recognition
A distinctive feature of intelligent systems is their capability to analyze their level of expertise for a given task; in other words, they know what they know. As a step towards this ambitious goal, this paper presents a recognition algorithm able to measure its own level of confidence and, in case of uncertainty, to seek extra information so as to increase its own knowledge and ultimately achieve better performance. We focus on the visual place recognition problem for topological localization, and we take an SVM approach. We propose a new method for measuring the confidence level of the classification output, based on the distance of a test image and the average distance of training vectors. This method is combined with a discriminative accumulation scheme for cue integration. We show with extensive experiments that the resulting algorithm achieves better performance for two visual cues than the classic single-cue SVM on the same task, while minimising the computational load. More importantly, our method provides a reliable measure of the level of confidence of the decision.
Online Metric-Weighted Linear Representations for Robust Visual Tracking
In this paper, we propose a visual tracker based on a metric-weighted linear
representation of appearance. In order to capture the interdependence of
different feature dimensions, we develop two online distance metric learning
methods using proximity comparison information and structured output learning.
The learned metric is then incorporated into a linear representation of
appearance.
We show that online distance metric learning significantly improves the
robustness of the tracker, especially on those sequences exhibiting drastic
appearance changes. In order to bound growth in the number of training samples,
we design a time-weighted reservoir sampling method.
Moreover, we enable our tracker to automatically perform object
identification during the process of object tracking, by introducing a
collection of static template samples belonging to several object classes of
interest. Object identification results for an entire video sequence are
achieved by systematically combining the tracking information and visual
recognition at each frame. Experimental results on challenging video sequences
demonstrate the effectiveness of the method for both inter-frame tracking and
object identification. Comment: 51 pages. Appearing in IEEE Transactions on Pattern Analysis and
Machine Intelligence.
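The abstract above does not spell out its time-weighted reservoir sampling scheme, so the following is only a minimal sketch of one way to bound a sample set while favoring recent items. It assumes the Efraimidis-Spirakis weighted reservoir algorithm with a hypothetical exponential weight `w_t = exp(rate * t)`; the function name, the `rate` parameter, and the weighting itself are illustrative assumptions, not the authors' method.

```python
import heapq
import math
import random

def time_weighted_reservoir(stream, k, rate=0.1, seed=0):
    """Keep at most k items from a stream, biased towards recent ones.

    Assumption: item t has weight w_t = exp(rate * t) (hypothetical choice).
    Each item gets the key log(u) / w_t with u uniform in (0, 1]; the k items
    with the largest keys form the sample (Efraimidis-Spirakis scheme).
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, t, item); the smallest key is evicted first
    for t, item in enumerate(stream):
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is finite
        key = math.log(u) * math.exp(-rate * t)  # equals log(u) / w_t
        if len(heap) < k:
            heapq.heappush(heap, (key, t, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, t, item))
    return [item for _, _, item in heap]

sample = time_weighted_reservoir(range(1000), k=10)
```

The memory footprint stays at `k` items regardless of stream length, which is the property the abstract asks of its sampler; a larger `rate` shifts the sample more strongly towards recent frames.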
Multiple sparse representations classification
Sparse representations classification (SRC) is a powerful technique for pixelwise classification of images and it is increasingly being used for a wide variety of image analysis tasks. The method uses sparse representation and learned redundant dictionaries to classify image pixels. In this empirical study we propose to further leverage the redundancy of the learned dictionaries to achieve a more accurate classifier. In conventional SRC, each image pixel is associated with a small patch surrounding it. Using these patches, a dictionary is trained for each class in a supervised fashion. Commonly, redundant/overcomplete dictionaries are trained and image patches are sparsely represented by a linear combination of only a few of the dictionary elements. Given a set of trained dictionaries, a new patch is sparse coded using each of them, and subsequently assigned to the class whose dictionary yields the minimum residual energy. We propose a generalization of this scheme. The method, which we call multiple sparse representations classification (mSRC), is based on the observation that an overcomplete, class-specific dictionary is capable of generating multiple accurate and independent estimates of a patch belonging to the class. So instead of finding a single sparse representation of a patch for each dictionary, we find multiple, and the corresponding residual energies provide an enhanced statistic which is used to improve classification. We demonstrate the efficacy of mSRC for three example applications: pixelwise classification of texture images, lumen segmentation in carotid artery magnetic resonance imaging (MRI), and bifurcation point detection in carotid artery MRI. We compare our method with conventional SRC, K-nearest neighbor, and support vector machine classifiers. The results show that mSRC outperforms SRC and the other reference methods.
In addition, we present an extensive evaluation of the effect of the main mSRC parameters: patch size, dictionary size, and sparsity level.
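The conventional SRC decision rule described above (sparse code the patch with each class dictionary, then pick the class with the minimum residual energy) can be sketched as follows. This is a toy illustration, not the authors' implementation: the greedy orthogonal matching pursuit coder, the function names `omp_residual` and `src_classify`, and the tiny hand-made dictionaries are all assumptions, and the multiple-representation step of mSRC is not shown.

```python
import numpy as np

def omp_residual(D, x, k):
    """Greedy orthogonal matching pursuit: residual of x after using <= k atoms of D."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom improves the fit
        support.append(j)
        # Re-fit the coefficients of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    return residual

def src_classify(dictionaries, x, k=2):
    """Return (class index with minimum residual energy, list of energies)."""
    energies = [float(np.sum(omp_residual(D, x, k) ** 2)) for D in dictionaries]
    return int(np.argmin(energies)), energies

# Toy data: class 0 spans the first two axes of R^4, class 1 the last two.
D0 = np.eye(4)[:, :2]
D1 = np.eye(4)[:, 2:]
patch = np.array([2.0, 1.0, 0.0, 0.0])
label, energies = src_classify([D0, D1], patch)
```

In practice the dictionaries would be learned and overcomplete, as the abstract states; the orthonormal toy dictionaries here only make the residual comparison easy to follow.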
Robust Visual Tracking Revisited: From Correlation Filter to Template Matching
In this paper, we propose a novel matching-based tracker by investigating the
relationship between template matching and the recently popular correlation
filter based trackers (CFTs). Compared to the correlation operation in CFTs, a
sophisticated similarity metric termed "mutual buddies similarity" (MBS) is
proposed to exploit the relationship of multiple reciprocal nearest neighbors
for target matching. By doing so, our tracker obtains powerful discriminative
ability in distinguishing target from background, as demonstrated by both
empirical and theoretical analyses. Moreover, instead of utilizing a single
template with an improper updating scheme as in CFTs, we design a novel online
template updating strategy named "memory filtering" (MF), which aims to select
a certain number of representative and reliable past tracking results to
construct the current stable and expressive template set. This scheme helps
the proposed tracker to comprehensively "understand" the target
appearance variations and "recall" some stable results. Both qualitative and
quantitative evaluations on two benchmarks suggest that the proposed tracking
method performs favorably against some recently developed CFTs and other
competitive trackers. Comment: has been published on IEEE TI
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight into how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking, and promising avenues for research.
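To make the bag formulation concrete, here is a minimal sketch of bag-level prediction under the standard MIL assumption, in which a bag is positive if at least one of its instances is positive. This is only one of the settings a survey like the one above distinguishes; the instance scores, the threshold, and the function name `predict_bag` are hypothetical.

```python
def predict_bag(instance_scores, threshold=0.5):
    """Standard MIL assumption: a bag is positive if any instance is positive.

    instance_scores are assumed to come from some instance-level classifier
    (hypothetical here); an empty bag is treated as negative.
    """
    return any(score >= threshold for score in instance_scores)

# Two toy bags: only bag_a contains an instance scored above the threshold.
bags = {"bag_a": [0.1, 0.2, 0.9], "bag_b": [0.1, 0.3, 0.2]}
labels = {name: predict_bag(scores) for name, scores in bags.items()}
```

Only the bag label is observed during training, which is why the ambiguity of instance labels appears as one of the four problem characteristics.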
Spatiotemporal visual analysis of human actions
In this dissertation we propose four methods for the recognition of human activities. In all four of
them, the representation of the activities is based on spatiotemporal features that are automatically
detected at areas where there is a significant amount of independent motion, that is, motion that is
due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features
throughout this dissertation. The algorithms presented, however, can be used with any kind of features,
as long as the latter are well localized and have a well-defined area of support in space and time. We
introduce the utilized spatiotemporal salient points in the first method presented in this dissertation.
By extending previous work on spatial saliency, we measure the variations in the information content of
pixel neighborhoods both in space and time, and detect the points at the locations and scales for which
this information content is locally maximized. In this way, an activity is represented as a collection of
spatiotemporal salient points. We propose an iterative linear space-time warping technique in order
to align the representations in space and time and propose to use Relevance Vector Machines (RVM)
in order to classify each example into an action category. In the second method, we
enhance the representations acquired by the first method. More specifically,
we propose to track each detected point in time, and create representations based on sets of trajectories,
where each trajectory expresses how the information engulfed by each salient point evolves over time.
In order to deal with imperfect localization of the detected points, we augment the observation model
of the tracker with background information, acquired using a fully automatic background estimation
algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the
body, like the hands and the head, and we further augment the tracker’s observation model using a
human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm
(LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and
RVMs for classification. In the third method that we propose, we assume that neighboring salient
points follow a similar motion. This is in contrast to the previous method, where each salient point was
tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual
descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The
latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal
neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant to
translation and scaling in space-time. This invariance is ensured by coupling the neighborhood dimensions to the scale at which the
corresponding spatiotemporal salient points are detected. The descriptors that are
extracted across the whole dataset are subsequently clustered in order to create a codebook, which is
used in order to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting in order to select the most discriminative of these windows for each class, and RVMs for
classification. The fourth and last method addresses the joint problem of localization and recognition
of human activities depicted in unsegmented image sequences. Its main contribution is the use of an
implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal
localization of characteristic ensembles of spatiotemporal features. The latter are localized around
automatically detected salient points. Evidence for the spatiotemporal localization of the activity
is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in
order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct
class-specific spatiotemporal models, which encode where in space and time each codeword ensemble
appears in the training set. During testing, each activated codeword ensemble casts probabilistic
votes concerning the spatiotemporal localization of the activity, according to the information stored
during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable
hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume
which potentially engulfs the activity, and is verified by performing action category classification with
an RVM classifier.
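The second method above compares trajectory representations with a variant of the Longest Common Subsequence (LCSS) algorithm. The abstract does not say which variant, so the sketch below assumes a common formulation for trajectories: two points match when every coordinate differs by at most a tolerance `eps`, and the similarity is the LCSS length divided by the shorter trajectory length. The function name, `eps`, and the normalization are illustrative assumptions.

```python
def lcss_similarity(a, b, eps=1.0):
    """LCSS-based similarity in [0, 1] between two point trajectories.

    a, b: sequences of equal-dimension points (tuples of floats).
    Two points match if all coordinates differ by at most eps (assumed rule).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # table[i][j] = LCSS length of the first i points of a and first j of b.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = all(abs(p - q) <= eps for p, q in zip(a[i - 1], b[j - 1]))
            if close:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m] / min(n, m)

track_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
track_b = [(0.0, 0.1), (1.0, 1.2), (5.0, 5.0), (2.0, 2.1)]  # one outlier point
similarity = lcss_similarity(track_a, track_b, eps=0.5)
```

Because unmatched points are simply skipped, the outlier in `track_b` does not reduce the score, which is the usual motivation for LCSS when point localization is imperfect.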