176 research outputs found
From Image-level to Pixel-level Labeling with Convolutional Networks
We are interested in inferring object segmentation by leveraging only object
class information, and by considering only minimal priors on the object
segmentation task. This problem could be viewed as a kind of weakly supervised
segmentation task, and naturally fits the Multiple Instance Learning (MIL)
framework: every training image is known to have (or not) at least one pixel
corresponding to the image class label, and the segmentation task can be
rewritten as inferring the pixels belonging to the class of the object (given
one image, and its object class). We propose a Convolutional Neural
Network-based model, which is constrained during training to put more weight on
pixels which are important for classifying the image. We show that at test
time the model has learned to discriminate the right pixels well enough that,
with only a few smoothing priors added, it performs very well on an existing
segmentation benchmark. Our system is trained on a subset of the ImageNet
dataset, and the segmentation experiments are performed on the challenging
Pascal VOC dataset (with no fine-tuning of the model on Pascal VOC). Our model
beats state-of-the-art results on the weakly supervised object segmentation
task by a large margin. We also compare the performance of our model with
state-of-the-art fully supervised segmentation approaches.
Comment: CVPR 2015
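To make the training constraint concrete, here is a minimal Python sketch of one way to pool per-pixel class scores into a single image-level score with a smooth max (Log-Sum-Exp), so that a loss on the image label pushes weight onto the most discriminative pixels. The pooling choice, the array shapes, and the value of r are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def lse_pool(pixel_scores, r=5.0):
    # pixel_scores: (H, W, C) per-pixel class scores from a convnet.
    # Log-Sum-Exp acts as a smooth max over pixels: large r approaches a
    # hard max, small r approaches a plain average; r=5.0 is illustrative.
    flat = pixel_scores.reshape(-1, pixel_scores.shape[-1])  # (H*W, C)
    m = flat.max(axis=0)                                     # stabilize exp
    return m + np.log(np.exp(r * (flat - m)).mean(axis=0)) / r

# Toy usage: the image-level score is dominated by the strongest pixels,
# so a loss on the image label concentrates gradient on those pixels.
scores = np.random.randn(8, 8, 3)   # 8x8 score map, 3 classes
print(lse_pool(scores))             # one score per class
```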
Recurrent Convolutional Neural Networks for Scene Parsing
Scene parsing is the task of labeling every pixel in an image with the class
it belongs to. To ensure good visual coherence and high class accuracy, it is
essential for a scene parser to capture long-range image dependencies. In a
feed-forward architecture, this can be achieved simply by considering a
sufficiently large input context patch around each pixel to be labeled. We
propose an approach consisting of a
recurrent convolutional neural network which allows us to consider a large
input context, while limiting the capacity of the model. Contrary to most
standard approaches, our method does not rely on any segmentation methods, nor
any task-specific features. The system is trained in an end-to-end manner over
raw pixels, and models complex spatial dependencies with low inference cost. As
the context size increases with the built-in recurrence, the system identifies
and corrects its own errors. Our approach yields state-of-the-art performance
on both the Stanford Background Dataset and the SIFT Flow Dataset, while
remaining very fast at test time.
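The recurrence can be pictured as one shared convnet applied repeatedly, each pass re-reading the image together with its own previous predictions, so the effective context grows and earlier mistakes can be corrected. The sketch below is a toy version of that idea; the layer sizes, the tanh nonlinearity, and the number of iterations are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RecurrentParser(nn.Module):
    # One small convnet with shared weights is applied several times; each
    # pass re-reads the image together with its own previous predictions,
    # so the effective context grows while model capacity stays fixed.
    def __init__(self, in_ch=3, n_classes=8, n_iters=3):
        super().__init__()
        self.n_classes, self.n_iters = n_classes, n_iters
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + n_classes, 32, 5, padding=2), nn.Tanh(),
            nn.Conv2d(32, n_classes, 5, padding=2),
        )

    def forward(self, img):
        b, _, h, w = img.shape
        probs = torch.zeros(b, self.n_classes, h, w)   # "no prediction" yet
        for _ in range(self.n_iters):                  # recurrence
            probs = self.net(torch.cat([img, probs], dim=1)).softmax(dim=1)
        return probs

labels = RecurrentParser()(torch.randn(1, 3, 64, 64)).argmax(dim=1)
```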
Phrase-based Image Captioning
Generating a novel textual description of an image is an interesting problem
that connects computer vision and natural language processing. In this paper,
we present a simple model that is able to generate descriptive sentences given
a sample image. This model has a strong focus on the syntax of the
descriptions. We train a purely bilinear model that learns a metric between an
image representation (generated from a previously trained Convolutional Neural
Network) and the phrases used to describe it. The system is then able
to infer phrases from a given image sample. Based on caption syntax statistics,
we propose a simple language model that can produce relevant descriptions for a
given test image using the phrases inferred. Our approach, which is
considerably simpler than state-of-the-art models, achieves comparable results
on two popular datasets for the task: Flickr30k and the recently proposed
Microsoft COCO.
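The bilinear scoring step can be sketched in a few lines: project the CNN image feature into the phrase space and rank candidate phrases by dot product. In the sketch below the map W is random; in the paper it would be learned with a suitable training loss. All dimensions and the stand-in features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_phr = 4096, 400                          # illustrative dimensions
W = rng.normal(scale=0.01, size=(d_phr, d_img))   # stand-in for the learned map

def phrase_scores(image_feat, phrase_vecs):
    # Bilinear scoring: project the image feature into the phrase space,
    # then score every candidate phrase by dot product.
    return phrase_vecs @ (W @ image_feat)         # one score per phrase

image_feat = rng.normal(size=d_img)               # stand-in CNN output
phrase_vecs = rng.normal(size=(1000, d_phr))      # stand-in phrase embeddings
top5 = np.argsort(phrase_scores(image_feat, phrase_vecs))[::-1][:5]
print(top5)                                       # indices of top-ranked phrases
```

The inferred phrases would then be assembled into sentences by the language model described in the abstract, which is not sketched here.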
End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks
Most state-of-the-art phoneme recognition systems rely on classical neural
network classifiers fed with highly tuned features, such as MFCC or PLP
features. Recent advances in "deep learning" approaches have questioned such
systems, but while some attempts were made with simpler features such as
spectrograms, state-of-the-art systems still rely on MFCCs. This might be
viewed as a kind of failure of deep learning approaches, which are often
claimed to be able to train on raw signals, alleviating the need for
hand-crafted features. In this paper, we investigate a convolutional neural
network approach for raw speech signals. While convolutional architectures
have achieved tremendous success in computer vision and text processing, they
have been largely set aside in recent years in the speech processing field. We
show that it is possible to learn an end-to-end phoneme sequence classifier
directly from the raw signal, with performance on the TIMIT and WSJ datasets
similar to that of existing MFCC-based systems, questioning the need for
complex hand-crafted features on large datasets.
Comment: NIPS Deep Learning Workshop, 2013
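The core idea, a convnet reading the raw waveform so that its first layers act as a learned filterbank in place of MFCCs, can be sketched as below. Filter widths, strides, channel counts, and the number of phoneme classes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch of a convnet over the raw waveform: the first 1-D convolution
# plays the role of a learned filterbank; later layers score phoneme
# classes for each output frame.
raw_speech_net = nn.Sequential(
    nn.Conv1d(1, 40, kernel_size=250, stride=160),  # learned "features"
    nn.Tanh(),
    nn.Conv1d(40, 80, kernel_size=5),
    nn.Tanh(),
    nn.Conv1d(80, 40, kernel_size=1),               # per-frame class scores
)

wave = torch.randn(1, 1, 16000)       # one second of speech at 16 kHz
frame_scores = raw_speech_net(wave)   # shape (1, 40, n_frames)
```

Turning these frame-level scores into a phoneme sequence still requires a sequence-level decoding step, which is omitted here.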
Rehabilitation of Count-based Models for Word Vector Representations
Recent works on word representations mostly rely on predictive models.
Distributed word representations (aka word embeddings) are trained to optimally
predict the contexts in which the corresponding words tend to appear. Such
models have succeeded in capturing word similarities as well as semantic and
syntactic regularities. Instead, we aim at reviving interest in a model based
on counts. We present a systematic study of the use of the Hellinger distance
to extract semantic representations from the word co-occurrence statistics of
large text corpora. We show that this distance gives good performance on word
similarity and analogy tasks, with a proper type and size of context, and a
dimensionality reduction based on a stochastic low-rank approximation. Besides
being both simple and intuitive, this method also provides an encoding function
which can be used to infer representations for unseen words or phrases. This
is a clear advantage over predictive models, which must be trained explicitly
on any new words.
Comment: A. Gelbukh (Ed.), Springer International Publishing Switzerland
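The pipeline the abstract describes maps onto a few lines of linear algebra. Below is a minimal sketch assuming an exact truncated SVD; the paper's stochastic low-rank approximation, the choice of context, and all sizes are simplified or invented for illustration.

```python
import numpy as np

def hellinger_vectors(cooc, dim=64):
    # Rows of the word/context co-occurrence matrix become probability
    # distributions P(context | word); taking element-wise square roots
    # turns Euclidean distance into Hellinger distance (up to a factor of
    # sqrt(2)), and a truncated SVD yields low-dimensional word vectors.
    p = cooc / cooc.sum(axis=1, keepdims=True)
    h = np.sqrt(p)
    u, s, _ = np.linalg.svd(h, full_matrices=False)
    return u[:, :dim] * s[:dim]

# Toy usage with a random count matrix standing in for corpus statistics.
cooc = np.random.randint(0, 50, size=(500, 2000)).astype(float) + 1.0
vectors = hellinger_vectors(cooc)   # one 64-d vector per word
```

The square-root map is also the encoding function mentioned above: a count vector for a new word or phrase can be normalized, square-rooted, and projected without retraining.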
N-gram-Based Low-Dimensional Representation for Document Classification
The bag-of-words (BOW) model is the common approach for classifying
documents, where words are used as features for training a classifier. This
generally involves a huge number of features. Some techniques, such as Latent
Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been
designed to summarize documents in a lower-dimensional space with minimal
semantic information loss. Some semantic information is nevertheless always lost, since
only words are considered. Instead, we aim at using information coming from
n-grams to overcome this limitation, while remaining in a low-dimension space.
Many approaches, such as the Skip-gram model, provide good word vector
representations very quickly. We propose to average these representations to
obtain representations of n-grams. All n-grams are thus embedded in the same
semantic space. A K-means clustering can then group them into semantic
concepts. The number of features is therefore dramatically reduced, and
documents can be represented as bags of semantic concepts. We show that this
model outperforms LSA and LDA on a sentiment classification task, and yields
results similar to a traditional BOW model with far fewer features.
Comment: Accepted as a workshop contribution at ICLR 2015
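The whole pipeline (average word vectors into n-gram vectors, cluster them into concepts, count concepts per document) fits in a short sketch. The tiny corpus, the random stand-ins for pretrained Skip-gram vectors, the vector size, K, and the bigram order are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for pretrained Skip-gram word vectors over a toy vocabulary.
corpus = "the movie was great the plot was dull the acting was fine".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
word_vecs = rng.normal(size=(len(vocab), 50))

def ngram_vec(ngram):
    # An n-gram is represented by the average of its word vectors.
    return np.mean([word_vecs[vocab[w]] for w in ngram], axis=0)

# Cluster all bigram vectors into K "semantic concepts"...
bigrams = [tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1)]
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(np.stack([ngram_vec(g) for g in bigrams]))

# ...and represent a document as a histogram over those concepts.
def bag_of_concepts(tokens, n=2):
    hist = np.zeros(kmeans.n_clusters)
    for i in range(len(tokens) - n + 1):
        hist[kmeans.predict(ngram_vec(tokens[i:i + n])[None])[0]] += 1
    return hist

print(bag_of_concepts(corpus))   # feature vector of length K, not |vocab|
```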
Joint RNN-Based Greedy Parsing and Word Composition
This paper introduces a greedy parser based on neural networks, which
leverages a new compositional sub-tree representation. The greedy parser and
the compositional procedure are jointly trained and tightly depend on each
other. The composition procedure outputs a vector representation which
summarizes sub-trees both syntactically (parsing tags) and semantically
(words). Composition and tagging are achieved over continuous (word or tag)
representations, using recurrent neural networks. We reach F1 performance on par
with well-known existing parsers, while having the advantage of speed, thanks
to the greedy nature of the parser. We provide a fully functional
implementation of the method described in this paper.
Comment: Published as a conference paper at ICLR 2015
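As a rough illustration of joint composition and greedy parsing, the sketch below composes adjacent vector pairs into parent vectors and greedily reduces the best-scoring pair until one vector summarizes the sentence. The fixed two-child arity, the sizes, the scoring rule, and the absence of training are simplifying assumptions; the actual parser and its transition system are richer.

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    # Merges the vectors of two children into one vector for the parent
    # sub-tree and scores a parsing tag for it, so reduced spans can be
    # treated like single tokens in later steps.
    def __init__(self, dim=64, n_tags=20):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.tagger = nn.Linear(dim, n_tags)

    def forward(self, left, right):
        node = self.compose(torch.cat([left, right], dim=-1))
        return node, self.tagger(node)

# Greedy loop: repeatedly reduce the best-scoring adjacent pair until a
# single vector summarizes the whole sentence.
model = Composer()
spans = [torch.randn(64) for _ in range(5)]        # stand-in word vectors
while len(spans) > 1:
    pairs = [model(spans[i], spans[i + 1]) for i in range(len(spans) - 1)]
    best = max(range(len(pairs)), key=lambda i: pairs[i][1].max().item())
    spans[best:best + 2] = [pairs[best][0]]        # replace pair by parent
sentence_vec = spans[0]
```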