Sparse Modeling for Image and Vision Processing
In recent years, a large amount of multi-disciplinary research has been
conducted on sparse models and their applications. In statistics and machine
learning, the sparsity principle is used to perform model selection---that is,
automatically selecting a simple model among a large collection of them. In
signal processing, sparse coding consists of representing data with linear
combinations of a few dictionary elements. Subsequently, the corresponding
tools have been widely adopted by several scientific communities such as
neuroscience, bioinformatics, or computer vision. The goal of this monograph is
to offer a self-contained view of sparse modeling for visual recognition and
image processing. More specifically, we focus on applications where the
dictionary is learned and adapted to data, yielding a compact representation
that has been successful in various contexts.
Comment: 205 pages, to appear in Foundations and Trends in Computer Graphics
and Visio
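To make "linear combinations of a few dictionary elements" concrete, the following is a minimal sketch of greedy sparse coding by matching pursuit. It is an illustration only, not the monograph's method: the dictionary is fixed rather than learned, the atoms are assumed to have unit norm, and the toy `dictionary`, `signal`, and function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedily approximate `signal` using at most `n_atoms` selection steps.

    Assumes every atom in `dictionary` has unit Euclidean norm.
    Returns (coefficients, residual).
    """
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        # Correlate the current residual with every atom.
        scores = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) < 1e-12:
            break  # nothing left to explain
        # Add the best atom's contribution and subtract it from the residual.
        coeffs[k] += scores[k]
        residual = [r - scores[k] * a for r, a in zip(residual, dictionary[k])]
    return coeffs, residual


# Hypothetical toy dictionary: four unit-norm atoms in R^3.
s = 1.0 / math.sqrt(2.0)
dictionary = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [s, s, 0.0],
]
# The signal equals 2 * atom 2 + 1 * atom 3, so two atoms suffice.
signal = [s, s, 2.0]

coeffs, residual = matching_pursuit(signal, dictionary, n_atoms=2)
residual_norm = math.sqrt(dot(residual, residual))
```

Here only two of the four coefficients end up nonzero, which is the sense in which the code is sparse. Dictionary learning, the focus of the monograph, would additionally adapt the atoms to the data.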
EmoNets: Multimodal deep learning approaches for emotion recognition in video
The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to
assign one of seven emotions to short video clips extracted from
Hollywood-style movies. The videos depict acted-out emotions under realistic conditions
with a large degree of variation in attributes such as pose and illumination,
making it worthwhile to explore approaches which consider combinations of
features from multiple modalities for label assignment. In this paper we
present our approach to learning several specialist models using deep learning
techniques, each focusing on one modality. Among these are a convolutional
neural network that captures visual information in detected faces; a
deep belief net that models the audio stream; a K-Means-based
"bag-of-mouths" model, which extracts visual features around the mouth
region; and a relational autoencoder, which addresses spatio-temporal aspects of
videos. We explore multiple methods for combining cues from these
modalities into one common classifier. This achieves considerably higher
accuracy than predictions from our strongest single-modality classifier. Our
method was the winning submission in the 2013 EmotiW challenge and achieved a
test set accuracy of 47.67% on the 2014 dataset.
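One simple way to combine cues from several specialist models is a weighted average of their class probabilities. The sketch below shows that idea only; the abstract does not say which combination methods the authors used, and the modality names, probabilities, and weights here are made-up values for three classes.

```python
def fuse_predictions(prob_by_modality, weights):
    """Weighted average of per-modality class-probability vectors.

    `prob_by_modality` maps a modality name to a list of class probabilities;
    `weights` maps the same names to non-negative weights.
    """
    total = sum(weights[name] for name in prob_by_modality)
    n_classes = len(next(iter(prob_by_modality.values())))
    fused = [0.0] * n_classes
    for name, probs in prob_by_modality.items():
        w = weights[name] / total  # normalise so the fused vector sums to 1
        for c, p in enumerate(probs):
            fused[c] += w * p
    return fused


# Hypothetical outputs of three single-modality classifiers.
probs = {
    "faces_cnn": [0.5, 0.3, 0.2],
    "audio_dbn": [0.2, 0.6, 0.2],
    "mouth_bag": [0.3, 0.5, 0.2],
}
# Hypothetical weights, e.g. tuned on a validation set.
weights = {"faces_cnn": 2.0, "audio_dbn": 1.0, "mouth_bag": 1.0}

fused = fuse_predictions(probs, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the face model alone would pick class 0, while the fused prediction picks class 1, which shows how the weaker modalities can overrule the strongest one.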
Regularized brain reading with shrinkage and smoothing
Functional neuroimaging measures how the brain responds to complex stimuli.
However, sample sizes are modest, noise is substantial, and stimuli are high
dimensional. Hence, direct estimates are inherently imprecise and call for
regularization. We compare a suite of approaches which regularize via
shrinkage: ridge regression, the elastic net (a generalization of ridge
regression and the lasso), and a hierarchical Bayesian model based on small
area estimation (SAE). We contrast regularization with spatial smoothing and
combinations of smoothing and shrinkage. All methods are tested on functional
magnetic resonance imaging (fMRI) data from multiple subjects participating in
two different experiments related to reading, for both predicting neural
response to stimuli and decoding stimuli from responses. Interestingly, when
the regularization parameters are chosen by cross-validation independently for
every voxel, low/high regularization is chosen in voxels where the
classification accuracy is high/low, indicating that the regularization
intensity is a good tool for identification of relevant voxels for the
cognitive task. Surprisingly, all the regularization methods work about equally
well, suggesting that beating basic smoothing and shrinkage will take not only
clever methods, but also careful modeling.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS837 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
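The shrinkage effect of ridge regression and the elastic net is easiest to see with a single predictor, where both have closed forms. The sketch below assumes the objectives 0.5*||y - x*w||^2 + 0.5*lam*w^2 (ridge) and 0.5*||y - x*w||^2 + l1*|w| + 0.5*l2*w^2 (elastic net). This parameterisation is an assumption chosen for illustration, not the paper's, and the data are made up.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, clipping to exactly zero inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_1d(x, y, lam):
    """Minimiser of 0.5*||y - x*w||^2 + 0.5*lam*w^2 for scalar w."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + lam)


def elastic_net_1d(x, y, l1, l2):
    """Minimiser of 0.5*||y - x*w||^2 + l1*|w| + 0.5*l2*w^2 for scalar w."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return soft_threshold(xy, l1) / (xx + l2)


# Made-up data with exact slope 2 (x.y = 28, x.x = 14).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

w_ols = ridge_1d(x, y, lam=0.0)                  # no shrinkage
w_ridge = ridge_1d(x, y, lam=14.0)               # shrunk toward zero
w_enet = elastic_net_1d(x, y, l1=14.0, l2=14.0)  # shrunk further by the L1 term
w_zero = elastic_net_1d(x, y, l1=30.0, l2=0.0)   # L1 term large enough to zero it
```

Ridge only scales the estimate down, whereas the L1 part of the elastic net can set it exactly to zero. Choosing the penalty per voxel by cross-validation, as the abstract describes, would wrap a search over `lam` around functions like these.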
Beyond KernelBoost
In this Technical Report we propose a set of improvements with respect to the
KernelBoost classifier presented in [Becker et al., MICCAI 2013]. We start with
a scheme inspired by Auto-Context, but that is suitable in situations where the
lack of large training sets poses a potential problem of overfitting. The aim
is to capture the interactions between neighboring image pixels to better
regularize the boundaries of segmented regions. As in Auto-Context [Tu et al.,
PAMI 2009] the segmentation process is iterative and, at each iteration, the
segmentation results for the previous iterations are taken into account in
conjunction with the image itself. However, unlike in [Tu et al., PAMI 2009],
we organize our recursion so that the classifiers can progressively focus on
difficult-to-classify locations. This lets us exploit the power of the
decision-tree paradigm while avoiding over-fitting. In the context of this
architecture, KernelBoost represents a powerful building block due to its
ability to learn on the score maps coming from previous iterations. We first
introduce two important mechanisms to empower the KernelBoost classifier,
namely pooling and the clustering of positive samples based on the appearance
of the corresponding ground-truth. These operations contribute significantly to
increasing the effectiveness of the system on biomedical images, where texture
plays a major role in the recognition of the different image components. We
then present some other techniques that can be easily integrated in the
KernelBoost framework to further improve the accuracy of the final
segmentation. We show extensive results on different medical image datasets,
including some multi-label tasks, on which our method is shown to outperform
state-of-the-art approaches. The resulting segmentations display high accuracy,
neat contours, and reduced noise.
Generalized Rank Pooling for Activity Recognition
Most popular deep models for action recognition split video sequences into
short sub-sequences consisting of a few frames; frame-based features are then
pooled for recognizing the activity. Usually, this pooling step discards the
temporal order of the frames, which could otherwise be used for better
recognition. To this end, we propose a novel pooling method, generalized
rank pooling (GRP), that takes as input features from the intermediate layers
of a CNN trained on tiny sub-sequences, and produces as output the
parameters of a subspace which (i) provides a low-rank approximation to the
features and (ii) preserves their temporal order. We propose to use these
parameters as a compact representation for the video sequence, which is then
used in a classification setup. We formulate an objective for computing this
subspace as a Riemannian optimization problem on the Grassmann manifold, and
propose an efficient conjugate gradient scheme for solving it. Experiments on
several activity recognition datasets show that our scheme leads to
state-of-the-art performance.
Comment: Accepted at IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), 201
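The claim that ordinary pooling discards temporal order can be checked directly: averaging frame features gives the same result for a sequence and its reverse. The sketch below contrasts mean pooling with a crude time-weighted sum that does depend on order. The time-weighted pooling is a stand-in chosen for illustration and is not GRP, which instead fits a subspace on the Grassmann manifold; the frame features are made up.

```python
def mean_pool(frames):
    """Average frame features; invariant to the order of the frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def time_weighted_pool(frames):
    """Sum of frame features weighted by centred time index.

    Illustrative only: reversing the sequence flips the sign of the output,
    so the temporal order is reflected in the pooled vector.
    """
    n = len(frames)
    centre = (n - 1) / 2.0
    dim = len(frames[0])
    return [
        sum((t - centre) * frame[d] for t, frame in enumerate(frames))
        for d in range(dim)
    ]


# Hypothetical 2-D features for three consecutive frames.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
reversed_frames = frames[::-1]

same_mean = mean_pool(frames) == mean_pool(reversed_frames)
forward = time_weighted_pool(frames)
backward = time_weighted_pool(reversed_frames)
```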
Towards Effective Codebookless Model for Image Classification
The bag-of-features (BoF) model for image classification has been thoroughly
studied over the last decade. Unlike the widely used BoF methods, which
model images with a pre-trained codebook, the alternative codebook-free image
modeling method, which we call the Codebookless Model (CLM), has attracted little
attention. In this paper, we present an effective CLM that represents an image
with a single Gaussian for classification. By embedding the Gaussian manifold into
a vector space, we show that the simple incorporation of our CLM into a linear
classifier achieves very competitive accuracy compared with state-of-the-art
BoF methods (e.g., Fisher Vector). Since our CLM lies in a high-dimensional
Riemannian manifold, we further propose a method that jointly learns a low-rank
transformation and a support vector machine (SVM) classifier on the Gaussian
manifold, in order to reduce computational and storage costs. To study and
alleviate the side effect of background clutter on our CLM, we also present a
simple yet effective partial background removal method based on saliency
detection. Extensive experiments on eight widely used databases
demonstrate the effectiveness and efficiency of our CLM method.
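Representing an image with a single Gaussian amounts to summarising its local features by a mean vector and a covariance matrix. The sketch below computes both and stacks them into one vector. The stacking step is a naive assumption for illustration and is not the manifold embedding proposed in the paper; the feature values and function names are hypothetical.

```python
def gaussian_descriptor(features):
    """Sample mean and unbiased covariance of a set of local feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    cov = [
        [
            sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov


def naive_vectorize(mean, cov):
    """Concatenate the mean with the upper triangle of the covariance.

    Placeholder vectorisation only; it ignores the geometry of the
    Gaussian manifold that the paper's embedding is designed to respect.
    """
    d = len(mean)
    return list(mean) + [cov[i][j] for i in range(d) for j in range(i, d)]


# Hypothetical 2-D local features extracted from one image.
features = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]

mean, cov = gaussian_descriptor(features)
vector = naive_vectorize(mean, cov)
```

The resulting fixed-length vector could then be fed to a linear classifier, which is the overall pipeline the abstract describes.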