Joint cross-domain classification and subspace learning for unsupervised adaptation
Domain adaptation aims at adapting the knowledge acquired on a source domain
to a new different but related target domain. Several approaches have
been proposed for classification tasks in the unsupervised scenario, where no
labeled target data are available. Most of the attention has been dedicated to
searching a new domain-invariant representation, leaving the definition of the
prediction function to a second stage. Here we propose to learn both jointly.
Specifically we learn the source subspace that best matches the target subspace
while at the same time minimizing a regularized misclassification loss. We
provide an alternating optimization technique based on stochastic sub-gradient
descent to solve the learning problem and we demonstrate its performance on
several domain adaptation tasks.
Comment: Paper is under consideration at Pattern Recognition Letter
Unsupervised Human Action Detection by Action Matching
We propose a new task of unsupervised action detection by action matching.
Given two long videos, the objective is to temporally detect all pairs of
matching video segments. A pair of video segments is matched if both share the
same human action. The task is category independent---it does not matter what
action is being performed---and no supervision is used to discover such video
segments. Unsupervised action detection by action matching allows us to align
videos in a meaningful manner. As such, it can be used to discover new action
categories or as an action proposal technique within, say, an action detection
pipeline. Moreover, it is a useful pre-processing step for generating video
highlights, e.g., from sports videos.
We present an effective and efficient method for unsupervised action
detection. We use an unsupervised temporal encoding method and exploit the
temporal consistency in human actions to obtain candidate action segments. We
evaluate our method on this challenging task using three activity recognition
benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action
detection benchmark and a new dataset called the IKEA dataset. On the MPII
Cooking dataset we detect action segments with a precision of 21.6% and recall
of 11.7% over 946 long video pairs and over 5000 ground truth action segments.
Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over
5094 ground truth action segment pairs.
Comment: IEEE International Conference on Computer Vision and Pattern
Recognition CVPR 2017 Workshop
Location recognition over large time lags
Would it be possible to automatically associate ancient pictures with modern ones and create fancy cultural heritage city maps? We introduce here the task of recognizing the location depicted in an old photo given modern annotated images collected from the Internet. We present an extensive analysis of different features, looking for the most discriminative and most robust to the image variability induced by large time lags. Moreover, we show that the described task benefits from domain adaptation
Generalized Rank Pooling for Activity Recognition
Most popular deep models for action recognition split video sequences into
short sub-sequences consisting of a few frames; frame-based features are then
pooled for recognizing the activity. Usually, this pooling step discards the
temporal order of the frames, which could otherwise be used for better
recognition. Towards this end, we propose a novel pooling method, generalized
rank pooling (GRP), which takes as input features from the intermediate layers
of a CNN that is trained on tiny sub-sequences, and produces as output the
parameters of a subspace which (i) provides a low-rank approximation to the
features and (ii) preserves their temporal order. We propose to use these
parameters as a compact representation for the video sequence, which is then
used in a classification setup. We formulate an objective for computing this
subspace as a Riemannian optimization problem on the Grassmann manifold, and
propose an efficient conjugate gradient scheme for solving it. Experiments on
several activity recognition datasets show that our scheme leads to
state-of-the-art performance.
Comment: Accepted at IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), 201
Guided Open Vocabulary Image Captioning with Constrained Beam Search
Existing image captioning models do not generalize well to out-of-domain
images containing novel scenes or objects. This limitation severely hinders the
use of these models in real world applications dealing with images in the wild.
We address this problem using a flexible approach that enables existing deep
captioning architectures to take advantage of image taggers at test time,
without re-training. Our method uses constrained beam search to force the
inclusion of selected tag words in the output, and fixed, pretrained word
embeddings to facilitate vocabulary expansion to previously unseen tag words.
Using this approach we achieve state-of-the-art results for out-of-domain
captioning on MSCOCO (and improved results for in-domain captioning). Perhaps
surprisingly, our results significantly outperform approaches that incorporate
the same tag predictions into the learning algorithm. We also show that we can
significantly improve the quality of generated ImageNet captions by leveraging
ground-truth labels.
Comment: EMNLP 201
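To make the idea of forcing tag words into a caption more concrete, here is a minimal sketch of one way constrained beam search can be organized: a separate beam is kept for each set of constraint words already satisfied, and only hypotheses that satisfy all of them are accepted at the end. This is an illustration under stated assumptions, not the authors' implementation. The names `constrained_beam_search`, `step_logprobs`, and `toy_model` are hypothetical, the toy model ignores the prefix, and captions have a fixed length rather than an end-of-sentence token.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (log-probability, tokens so far)


def constrained_beam_search(
    step_logprobs: Callable[[Tuple[str, ...]], Dict[str, float]],
    constraints: Sequence[str],
    max_len: int,
    beam_size: int = 3,
) -> Hypothesis:
    """Return the best fixed-length sequence containing every constraint word.

    Assumption: `step_logprobs(prefix)` returns next-word log-probabilities.
    """
    required = frozenset(constraints)
    # One beam per "state", where a state is the set of constraints met so far.
    beams: Dict[FrozenSet[str], List[Hypothesis]] = {frozenset(): [(0.0, ())]}
    for _ in range(max_len):
        candidates: Dict[FrozenSet[str], List[Hypothesis]] = {}
        for state, hyps in beams.items():
            for score, tokens in hyps:
                for word, logp in step_logprobs(tokens).items():
                    # Emitting a constraint word moves the hypothesis to a new state.
                    new_state = state | {word} if word in required else state
                    candidates.setdefault(new_state, []).append(
                        (score + logp, tokens + (word,))
                    )
        # Prune each state's beam independently to the top `beam_size` entries.
        beams = {
            state: sorted(hyps, reverse=True)[:beam_size]
            for state, hyps in candidates.items()
        }
    finished = beams.get(required, [])
    if not finished:
        raise ValueError("no hypothesis satisfies all constraints")
    return max(finished)


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for a captioning model; it ignores the prefix."""
    return {"a": math.log(0.6), "dog": math.log(0.3), "zebra": math.log(0.1)}
```

Calling `constrained_beam_search(toy_model, ["zebra"], max_len=2)` returns a two-word sequence that contains "zebra", even though an unconstrained search over the same toy model would prefer the more probable words. Keeping one beam per state prevents high-scoring hypotheses that ignore the constraints from crowding out those that satisfy them.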
DeepPermNet: Visual Permutation Learning
We present a principled approach to uncover the structure of visual data by
solving a novel deep learning task coined visual permutation learning. The goal
of this task is to find the permutation that recovers the structure of data
from shuffled versions of it. In the case of natural images, this task boils
down to recovering the original image from patches shuffled by an unknown
permutation matrix. Unfortunately, permutation matrices are discrete, thereby
posing difficulties for gradient-based methods. To this end, we resort to a
continuous approximation of these matrices using doubly-stochastic matrices
which we generate from standard CNN predictions using Sinkhorn iterations.
Unrolling these iterations in a Sinkhorn network layer, we propose DeepPermNet,
an end-to-end CNN model for this task. The utility of DeepPermNet is
demonstrated on two challenging computer vision problems, namely, (i) relative
attributes learning and (ii) self-supervised representation learning. Our
results show state-of-the-art performance on the Public Figures and OSR
benchmarks for (i) and on the classification and segmentation tasks on the
PASCAL VOC dataset for (ii).
Comment: Accepted in IEEE International Conference on Computer Vision and
Pattern Recognition CVPR 201
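As a rough illustration of the Sinkhorn iterations mentioned in the abstract above, the sketch below turns a square matrix of scores into an approximately doubly-stochastic matrix by alternating row and column normalization. It is a minimal pure-Python sketch, not the DeepPermNet layer itself. The function name `sinkhorn`, the exponentiation used to make all entries positive, and the iteration count are assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(scores: Matrix, n_iters: int = 50) -> Matrix:
    """Alternate row and column normalization of exp(scores).

    Assumption: `scores` is a square matrix of real-valued predictions.
    """
    n = len(scores)
    # Subtract the global maximum before exponentiating for numerical stability;
    # exponentiation makes every entry strictly positive.
    top = max(max(row) for row in scores)
    mat = [[math.exp(v - top) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to one.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column step: make every column sum to one.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat
```

After enough iterations both the rows and the columns sum to approximately one, so the result is a continuous relaxation of a permutation matrix. Every step is a differentiable division, which is what allows such iterations to be unrolled inside a network and trained with gradient-based methods.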