Representation Learning by Learning to Count
We introduce a novel method for representation learning that uses an
artificial supervision signal based on counting visual primitives. This
supervision signal is obtained from an equivariance relation, which does not
require any manual annotation. We relate transformations of images to
transformations of the representations. More specifically, we look for the
representation that satisfies such a relation rather than the transformations
that match a given representation. In this paper, we use two image
transformations in the context of counting: scaling and tiling. The first
transformation exploits the fact that the number of visual primitives should be
invariant to scale. The second transformation allows us to equate the total
number of visual primitives in each tile to that in the whole image. These two
transformations are combined in one constraint and used to train a neural
network with a contrastive loss. The proposed task produces representations
that perform on par with or exceed the state of the art in transfer learning
benchmarks.
Comment: ICCV 2017 (oral)
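The scaling and tiling constraint described in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_count` is a hypothetical stand-in for the learned network, max-pooling stands in for the downsampling operator, and the margin-based contrastive form and its `margin` value are one plausible reading of "contrastive loss", not necessarily the authors' exact formulation.

```python
import numpy as np

def downsample(x, factor=2):
    """Downscale by max-pooling over factor x factor blocks (assumed operator)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))

def tiles(x, grid=2):
    """Split an image into grid x grid non-overlapping tiles."""
    h, w = x.shape
    th, tw = h // grid, w // grid
    return [x[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            for i in range(grid) for j in range(grid)]

def toy_count(x):
    """Hypothetical stand-in for the network: counts nonzero pixels."""
    return np.array([float(np.count_nonzero(x))])

def counting_loss(x, y, phi=toy_count, margin=10.0):
    """Count of the downscaled x should equal the summed tile counts of x,
    while the count of a different image y should differ by a margin."""
    tile_sum = sum(phi(t) for t in tiles(x))
    same = np.sum((phi(downsample(x)) - tile_sum) ** 2)
    diff = np.sum((phi(downsample(y)) - tile_sum) ** 2)
    return same + max(0.0, margin - diff)

# Toy images: isolated single-pixel "primitives", one per 2x2 block at most.
x = np.zeros((8, 8))
x[0, 0] = x[2, 5] = x[6, 3] = 1           # 3 primitives
y = np.zeros((8, 8))
y[::2, 0] = 1                              # 4 primitives
y[0, 2] = y[0, 4] = y[0, 6] = 1            # 7 primitives in total
```

With this toy counter, `counting_loss(x, y)` vanishes because the scaled and tiled counts of `x` agree and `y` has a clearly different count, whereas `counting_loss(x, x)` is penalized by the margin term.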
Anticipating Visual Representations from Unlabeled Video
Anticipating actions and objects before they start or appear is a difficult
problem in computer vision with several real-world applications. This task is
challenging partly because it requires leveraging extensive knowledge of the
world that is difficult to write down. We believe that a promising resource for
efficiently learning this knowledge is readily available unlabeled
video. We present a framework that capitalizes on temporal structure in
unlabeled video to learn to anticipate human actions and objects. The key idea
behind our approach is that we can train deep networks to predict the visual
representation of images in the future. Visual representations are a promising
prediction target because they encode images at a higher semantic level than
pixels yet are automatic to compute. We then apply recognition algorithms on
our predicted representation to anticipate objects and actions. We
experimentally validate this idea on two datasets, anticipating actions one
second in the future and objects five seconds in the future.
Comment: CVPR 201
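The idea of predicting a future visual representation and then running a recognizer on the prediction can be sketched as follows. This is only a minimal illustration under stated assumptions: a least-squares linear map replaces the deep network, the features are synthetic random vectors rather than real frame features, and `fit_future_predictor`, `anticipate`, and `class_weights` are hypothetical names introduced here.

```python
import numpy as np

def fit_future_predictor(current_feats, future_feats):
    """Fit a linear map W so that current_feats @ W approximates future_feats.
    (A linear stand-in for the deep network; an assumption of this sketch.)"""
    W, *_ = np.linalg.lstsq(current_feats, future_feats, rcond=None)
    return W

def anticipate(current_feat, W, class_weights):
    """Predict the future representation, then score it with a linear classifier."""
    predicted = current_feat @ W
    scores = predicted @ class_weights
    return int(np.argmax(scores))

# Synthetic data: features of a frame now and of a frame some time later.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))                # unknown "temporal dynamics"
current = rng.normal(size=(200, 8))        # features at time t
future = current @ A                       # features at time t + delta

W = fit_future_predictor(current, future)
class_weights = rng.normal(size=(8, 3))    # hypothetical recognizer with 3 classes
label = anticipate(current[0], W, class_weights)
```

No labels are needed to fit `W`: the supervision comes from pairing each frame's features with those of a later frame, which is why unlabeled video suffices in this toy setting.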
Predicting Motivations of Actions by Leveraging Text
Understanding human actions is a key problem in computer vision. However,
recognizing actions is only the first step of understanding what a person is
doing. In this paper, we introduce the problem of predicting why a person has
performed an action in images. This problem has many applications in human
activity understanding, such as anticipating or explaining an action. To study
this problem, we introduce a new dataset of people performing actions annotated
with likely motivations. However, the information in an image alone may not be
sufficient to automatically solve this task. Since humans can rely on their
lifetime of experiences to infer motivation, we propose to give computer vision
systems access to some of these experiences by using recently developed natural
language models to mine knowledge stored in massive amounts of text. While we
are still far away from fully understanding motivation, our results suggest
that transferring knowledge from language into vision can help machines
understand why people in images might be performing an action.
Comment: CVPR 201
Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.
Comment: Conference paper at CVPR 201
Hidden Trigger Backdoor Attacks
With the success of deep learning algorithms in various domains, studying
adversarial attacks to secure deep models in real world applications has become
an important research topic. Backdoor attacks are a form of adversarial attack
on deep networks where the attacker provides poisoned data to the victim to
train the model with, and then activates the attack by showing a specific small
trigger pattern at test time. Most state-of-the-art backdoor attacks either
provide mislabeled poisoning data that is possible to identify by visual
inspection, reveal the trigger in the poisoned data, or use noise to hide the
trigger. We propose a novel form of backdoor attack in which the poisoned data look
natural and carry correct labels and, more importantly, the attacker hides the
trigger in the poisoned data and keeps it secret until test time.
We perform an extensive study on various image classification settings and show
that our attack can fool the model by pasting the trigger at random locations
on unseen images although the model performs well on clean data. We also show
that our proposed attack cannot be easily defended against using a state-of-the-art
defense algorithm for backdoor attacks.
Comment: AAAI 2020 - Main Technical Track (Oral)