Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction
Frame-level visual features are generally aggregated in time with
techniques such as LSTM, Fisher Vectors, NetVLAD, etc. to produce a robust
video-level representation. We here introduce a learnable aggregation technique
whose primary objective is to retain short-time temporal structure between
frame-level features and their spatial interdependencies in the representation.
It can also be easily adapted to cases where training samples are very
scarce. We evaluate the method on a real-fake expression prediction
dataset to demonstrate its superiority. Our method obtains a score of 65% on the
test dataset in the official MAP evaluation, differing by only one misclassified
decision from the best reported result in the ChaLearn Challenge (i.e., 66.7%).
Lastly, we believe that this method can be extended to different problems such
as action/event recognition in the future.
Comment: Submitted to International Conference on Computer Vision Workshop
EmoNets: Multimodal deep learning approaches for emotion recognition in video
The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to
assign one of seven emotions to short video clips extracted from
Hollywood-style movies. The videos depict acted-out emotions under realistic conditions
with a large degree of variation in attributes such as pose and illumination,
making it worthwhile to explore approaches which consider combinations of
features from multiple modalities for label assignment. In this paper we
present our approach to learning several specialist models using deep learning
techniques, each focusing on one modality. Among these are a convolutional
neural network, which captures visual information in detected faces; a
deep belief net, which focuses on the representation of the audio stream; a
K-Means based "bag-of-mouths" model, which extracts visual features around the
mouth region; and a relational autoencoder, which addresses spatio-temporal
aspects of videos. We explore multiple methods for the combination of cues from these
modalities into one common classifier. This achieves a considerably greater
accuracy than predictions from our strongest single-modality classifier. Our
method was the winning submission in the 2013 EmotiW challenge and achieved a
test set accuracy of 47.67% on the 2014 dataset.
Personalized Pancreatic Tumor Growth Prediction via Group Learning
Tumor growth prediction, a highly challenging task, has long been viewed as a
mathematical modeling problem, where the tumor growth pattern is personalized
based on imaging and clinical data of a target patient. Though mathematical
models yield promising results, their prediction accuracy may be limited by the
absence of population trend data and personalized clinical characteristics. In
this paper, we propose a statistical group learning approach to predict the
tumor growth pattern that incorporates both the population trend and
personalized data, in order to discover high-level features from multimodal
imaging data. A deep convolutional neural network approach is developed to
model the voxel-wise spatio-temporal tumor progression. The deep features are
combined with the time intervals and the clinical factors to feed a process of
feature selection. Our predictive model is pretrained on a group data set and
personalized on the target patient data to estimate the future spatio-temporal
progression of the patient's tumor. Multimodal imaging data at multiple time
points are used in the learning, personalization and inference stages. Our
method achieves a Dice coefficient (DSC) of 86.8% ± 3.6% and an RVD of 7.9% ± 5.4% on
a pancreatic tumor data set, outperforming the DSC of 84.4% ± 4.0% and RVD of
13.9% ± 9.8% obtained by a previous state-of-the-art model-based method.
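The two metrics quoted above can be illustrated with a minimal sketch. It assumes the common definitions, which the abstract itself does not spell out: the Dice coefficient as 2|A∩B| / (|A| + |B|) and RVD (relative volume difference) as |V_pred − V_ref| / V_ref. The function names and the flattened toy masks are hypothetical and not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (an assumption of this sketch).
    return 2.0 * overlap / total if total else 1.0


def relative_volume_difference(pred, truth):
    """Absolute difference in voxel count, relative to the reference volume."""
    vol_pred = sum(1 for p in pred if p)
    vol_truth = sum(1 for t in truth if t)
    if vol_truth == 0:
        raise ValueError("reference mask is empty, RVD is undefined")
    return abs(vol_pred - vol_truth) / vol_truth


# Toy flattened masks (hypothetical data): 5 predicted voxels, 4 reference voxels.
predicted = [0, 1, 1, 1, 0, 0, 1, 1]
reference = [0, 1, 1, 0, 0, 1, 1, 0]

print(f"Dice: {dice_coefficient(predicted, reference):.3f}")
print(f"RVD:  {relative_volume_difference(predicted, reference):.3f}")
```

A higher Dice value and a lower RVD both indicate closer agreement between the predicted and the reference tumor volume, which is the direction of improvement reported in the abstract.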
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figures
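The idea of randomly dropping whole modality channels during fused training can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the modality names, the drop probability of 0.5, the rule of always keeping at least one channel, and fusion by plain concatenation are all hypothetical choices made for the example.

```python
import random


def moddrop(modalities, drop_prob=0.5, rng=random):
    """Zero out whole modality channels at random (intended for training time only)."""
    names = list(modalities)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Assumption of this sketch: never drop every channel at once.
        kept = [rng.choice(names)]
    return {
        name: list(vec) if name in kept else [0.0] * len(vec)
        for name, vec in modalities.items()
    }


def fuse(modalities):
    """Concatenate modality vectors in a fixed (sorted) order for a shared layer."""
    fused = []
    for name in sorted(modalities):
        fused.extend(modalities[name])
    return fused


# Hypothetical per-modality feature vectors for a single training sample.
sample = {"depth": [0.2, 0.7], "video": [0.5, 0.1, 0.9], "skeleton": [0.3]}

rng = random.Random(0)  # seeded so that the run is repeatable
dropped = moddrop(sample, drop_prob=0.5, rng=rng)
fused = fuse(dropped)
print(dropped)
print(fused)
```

Because a dropped channel is replaced by zeros rather than removed, the fused input keeps a fixed length, so the same shared layers can in principle be used at test time when one or more signals are missing.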