Representation Learning by Learning to Count
We introduce a novel method for representation learning that uses an
artificial supervision signal based on counting visual primitives. This
supervision signal is obtained from an equivariance relation, which does not
require any manual annotation. We relate transformations of images to
transformations of the representations. More specifically, we look for the
representation that satisfies such a relation rather than the transformations
that match a given representation. In this paper, we use two image
transformations in the context of counting: scaling and tiling. The first
transformation exploits the fact that the number of visual primitives should be
invariant to scale. The second transformation allows us to equate the total
number of visual primitives in each tile to that in the whole image. These two
transformations are combined in one constraint and used to train a neural
network with a contrastive loss. The proposed task produces representations
that perform on par with or exceed the state of the art on transfer learning
benchmarks.
Comment: ICCV 2017 (oral)
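To make the counting constraint concrete, the following Python/NumPy sketch shows the kind of training signal the abstract describes; the helper names (phi, downscale, tile) and the hinge-style contrastive term are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def counting_loss(phi, image, other_image, downscale, tile, margin=1.0):
    """Sketch of the scaling/tiling counting constraint (illustrative only).

    phi        -- feature extractor returning a non-negative "count" vector
    downscale  -- reduces the image to the resolution of a single tile
    tile       -- splits the image into non-overlapping tiles
    """
    c_whole = phi(downscale(image))                    # count on the scaled whole image
    c_tiles = sum(phi(t) for t in tile(image))         # total count over the tiles
    c_other = phi(downscale(other_image))              # count on a different image

    match = np.sum((c_whole - c_tiles) ** 2)           # counts must agree across the two transformations
    push = max(0.0, margin - np.sum((c_other - c_tiles) ** 2))  # contrastive term avoids the trivial zero solution
    return match + push
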
Neural Discrete Representation Learning
Learning useful representations without supervision remains a key challenge
in machine learning. In this paper, we propose a simple yet powerful generative
model that learns such discrete representations. Our model, the Vector
Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways:
the encoder network outputs discrete, rather than continuous, codes; and the
prior is learnt rather than static. In order to learn a discrete latent
representation, we incorporate ideas from vector quantisation (VQ). Using the
VQ method allows the model to circumvent issues of "posterior collapse" --
where the latents are ignored when they are paired with a powerful
autoregressive decoder -- typically observed in the VAE framework. Pairing
these representations with an autoregressive prior, the model can generate
high-quality images, videos, and speech, perform high-quality speaker
conversion, and learn phonemes without supervision, providing further evidence
of the utility of the learnt representations.
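As a rough illustration of the quantisation step described above, the sketch below replaces each encoder output with its nearest codebook entry; shapes and names are assumptions, and the training losses are only summarised in comments.

import numpy as np

def vector_quantize(z_e, codebook):
    """Nearest-neighbour quantisation of encoder outputs (illustrative sketch).

    z_e       -- encoder outputs, shape (N, D)
    codebook  -- K embedding vectors, shape (K, D)
    """
    # Squared Euclidean distance from every encoder output to every code word.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
    codes = d.argmin(axis=1)                                      # discrete latent indices
    z_q = codebook[codes]                                         # quantised vectors
    return codes, z_q

# In training, the reconstruction gradient is copied "straight through" the
# quantisation step, a codebook term ||sg[z_e] - e||^2 moves the code words
# toward the encoder outputs, and a commitment term ||z_e - sg[e]||^2 keeps
# the encoder committed to its codes (sg = stop-gradient).
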
Active Discriminative Text Representation Learning
We propose a new active learning (AL) method for text classification with
convolutional neural networks (CNNs). In AL, one selects the instances to be
manually labeled with the aim of maximizing model performance with minimal
effort. Neural models capitalize on word embeddings as representations
(features), tuning these to the task at hand. We argue that AL strategies for
multi-layered neural models should focus on selecting instances that most
affect the embedding space (i.e., induce discriminative word representations).
This is in contrast to traditional AL approaches (e.g., entropy-based
uncertainty sampling), which specify higher-level objectives. We propose a
simple approach for sentence classification that selects instances containing
words whose embeddings are likely to be updated with the greatest magnitude,
thereby rapidly learning discriminative, task-specific embeddings. We extend
this approach to document classification by jointly considering: (1) the
expected changes to the constituent word representations; and (2) the model's
current overall uncertainty regarding the instance. The relative emphasis
placed on these criteria is governed by a stochastic process that favors
selecting instances likely to improve representations at the outset of
learning, and then shifts toward general uncertainty sampling as AL progresses.
Empirical results show that our method outperforms baseline AL approaches on
both sentence and document classification tasks. We also show that, as
expected, the method quickly learns discriminative word embeddings. To the best
of our knowledge, this is the first work on AL addressing neural models for
text classification.
Comment: Accepted to AAAI 2017
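A minimal sketch of the selection heuristic, assuming a hypothetical model interface (predict_proba, embedding_grads) that exposes per-word embedding gradients; it illustrates the idea rather than the authors' implementation, and omits the stochastic switch toward uncertainty sampling described above.

import numpy as np

def expected_update_score(model, sentence, labels):
    """Expected magnitude of the word-embedding update if `sentence` were labeled.

    The expectation is over the model's current predictive distribution; for
    each candidate label we take the largest gradient norm over the sentence's
    word embeddings.
    """
    p = model.predict_proba(sentence)                  # (num_labels,)
    score = 0.0
    for y, p_y in zip(labels, p):
        grads = model.embedding_grads(sentence, y)     # (num_words, embed_dim), hypothetical API
        score += p_y * np.linalg.norm(grads, axis=1).max()
    return score

def select_batch(model, pool, labels, k):
    """Pick the k unlabeled sentences whose embeddings are expected to move the most."""
    scores = [expected_update_score(model, s, labels) for s in pool]
    return list(np.argsort(scores)[::-1][:k])
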
Multi-View Task-Driven Recognition in Visual Sensor Networks
Nowadays, distributed smart cameras are deployed for a wide range of tasks in
several application scenarios, from object recognition and image retrieval to
forensic applications. Because bandwidth in distributed systems is limited,
efficient coding of local visual features has been an active topic of
research. In this paper, we propose a novel approach to obtain a
compact representation of high-dimensional visual data using sensor fusion
techniques. We cast visual analysis in resource-limited scenarios as a
multi-view representation learning problem, and we show that the key to
finding a properly compressed representation is to exploit the positions of
the cameras with respect to each other as a norm-based regularization in a
sparse-coding signal representation. Learning the representation
of each camera is viewed as an individual task, and multi-task learning with
joint sparsity across all nodes is employed. The proposed representation learning
scheme is referred to as the multi-view task-driven learning for visual sensor
network (MT-VSN). We demonstrate that MT-VSN outperforms the state of the art
in various surveillance recognition tasks.
Comment: 5 pages, accepted at the IEEE International Conference on Image Processing (ICIP) 2017
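The joint-sparsity idea can be sketched as multi-view sparse coding with a row-wise (l2,1) penalty, so that all camera views share the same active dictionary atoms; the ISTA-style solver, variable names, and regulariser below are assumptions for illustration, not the MT-VSN algorithm itself.

import numpy as np

def joint_sparse_coding(X, D, lam=0.1, step=0.01, iters=200):
    """Multi-view sparse coding with joint (row-wise) sparsity -- illustrative sketch.

    X  -- list of observations, one per camera view, each of shape (d,)
    D  -- list of dictionaries, one per view, each of shape (d, k)
    Returns A of shape (k, num_views): codes whose row support is shared across views.
    """
    V, k = len(X), D[0].shape[1]
    A = np.zeros((k, V))
    for _ in range(iters):
        # Gradient step on the reconstruction error of every view.
        for v in range(V):
            A[:, v] -= step * (D[v].T @ (D[v] @ A[:, v] - X[v]))
        # Proximal step for the l2,1 norm: group soft-threshold each row, so
        # atoms that are weak across all views are driven to exactly zero.
        row_norms = np.linalg.norm(A, axis=1, keepdims=True)
        A *= np.maximum(1.0 - step * lam / np.maximum(row_norms, 1e-12), 0.0)
    return A
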
