1,889 research outputs found
Deformable Part Networks
In this paper we propose novel Deformable Part Networks (DPNs) to learn {\em
pose-invariant} representations for 2D object recognition. In contrast to the
state-of-the-art pose-aware networks such as CapsNet \cite{sabour2017dynamic}
and STN \cite{jaderberg2015spatial}, DPNs can be naturally {\em interpreted} as
an efficient solver for a challenging detection problem, namely Localized
Deformable Part Models (LDPMs) where localization is introduced to DPMs as
another latent variable for searching for the best poses of objects over all
pixels and (predefined) scales. In particular we construct DPNs as sequences of
such LDPM units to model the semantic and spatial relations among the
deformable parts as hierarchical composition and spatial parsing trees.
Empirically our 17-layer DPN can outperform both CapsNets and STNs
significantly on affNIST \cite{sabour2017dynamic}, for instance, by 19.19\% and
12.75\%, respectively, with better generalization and better tolerance to
affine transformations.
Combined convolutional and recurrent neural networks for hierarchical classification of images
Deep learning models based on CNNs are predominantly used in image
classification tasks. Such approaches, assuming independence of object
categories, normally use a CNN as a feature learner and apply a flat classifier
on top of it. Object classes in many settings have hierarchical relations, and
classifiers exploiting these relations should perform better. We propose
hierarchical classification models combining a CNN to extract hierarchical
representations of images, and an RNN or sequence-to-sequence model to capture
a hierarchical tree of classes. In addition, we apply residual learning to the
RNN part in order to facilitate training our compound model and improve
its generalization. Experimental results on a real-world proprietary
dataset of images show that our hierarchical networks perform better than
state-of-the-art CNNs.
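To illustrate what predicting a path through a class hierarchy can look like, here is a minimal Python sketch. It is not the authors' model: the `tree` dictionary, the per-step score dictionaries and the `decode_path` helper are hypothetical, and restricting each step to the children of the previously chosen class is an assumption made only for this example.

```python
def decode_path(tree, step_scores, root="root"):
    """Greedily pick one class per hierarchy level, coarse to fine.

    tree: dict mapping a class name to the list of its child classes
          (hypothetical structure, not taken from the paper).
    step_scores: one dict per level, mapping class name -> score, standing in
          for the per-step outputs of a sequence model.
    """
    path = []
    node = root
    for scores in step_scores:
        children = tree.get(node, [])
        if not children:
            # Reached a leaf class: nothing finer to predict.
            break
        # Assumption: only children of the current node are valid candidates.
        node = max(children, key=lambda c: scores.get(c, float("-inf")))
        path.append(node)
    return path


if __name__ == "__main__":
    tree = {
        "root": ["animal", "vehicle"],
        "animal": ["cat", "dog"],
        "vehicle": ["car"],
    }
    step_scores = [
        {"animal": 0.7, "vehicle": 0.3},
        {"cat": 0.2, "dog": 0.5, "car": 0.9},
    ]
    print(decode_path(tree, step_scores))
```

In this toy run the second step ignores "car" despite its high score, because it is not a child of "animal". A flat classifier has no such constraint.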
Scene Parsing via Dense Recurrent Neural Networks with Attentional Selection
Recurrent neural networks (RNNs) have shown the ability to improve scene
parsing through capturing long-range dependencies among image units. In this
paper, we propose dense RNNs for scene labeling by exploring various long-range
semantic dependencies among image units. Different from existing RNN based
approaches, our dense RNNs are able to capture richer contextual dependencies
for each image unit by enabling immediate connections between each pair of
image units, which significantly enhances their discriminative power. In addition,
to select relevant dependencies and suppress irrelevant ones for
each unit from the dense connections, we introduce an attention model into dense
RNNs. The attention model automatically assigns more importance to
helpful dependencies and less weight to irrelevant ones. Integrating
with convolutional neural networks (CNNs), we develop an end-to-end scene
labeling system. Extensive experiments on three large-scale benchmarks
demonstrate that the proposed approach can improve the baselines by large
margins and outperform other state-of-the-art algorithms.
Comment: 10 pages. arXiv admin note: substantial text overlap with
arXiv:1801.0683
Multi-level Contextual RNNs with Attention Model for Scene Labeling
Context in an image is crucial for scene labeling, yet existing methods only
exploit local context generated from a small area surrounding an image patch
or a pixel, while long-range and global contextual information is
ignored. To address this issue, we propose a novel approach for
scene labeling by exploring multi-level contextual recurrent neural networks
(ML-CRNNs). Specifically, we encode three kinds of contextual cues, i.e., local
context, global context and image topic context in structural recurrent neural
networks (RNNs) to model long-range local and global dependencies in an image. In
this way, our method is able to `see' the image in terms of both long-range
local and holistic views, and make a more reliable inference for image
labeling. Besides, we integrate the proposed contextual RNNs into hierarchical
convolutional neural networks (CNNs), and exploit dependence relationships in
multiple levels to provide rich spatial and semantic information. Moreover, we
adopt an attention model to effectively merge multiple levels and show
that it outperforms average- or max-pooling fusion strategies. Extensive
experiments demonstrate that the proposed approach achieves new
state-of-the-art results on the CamVid, SiftFlow and Stanford-background
datasets.
Comment: 8 pages, 8 figures
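The contrast between attention-based fusion and average- or max-pooling fusion can be sketched as follows. This is an illustrative Python example under stated assumptions, not the paper's implementation: features are plain lists, the attention scores are given as inputs rather than learned, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(level_features, level_scores):
    """Attention fusion: softmax-weighted sum of per-level feature vectors.

    Assumption: level_scores would come from a learned attention model;
    here they are passed in directly.
    """
    weights = softmax(level_scores)
    dim = len(level_features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, level_features))
        for i in range(dim)
    ]


def average_fuse(level_features):
    """Average pooling across levels: every level gets the same weight."""
    n = len(level_features)
    dim = len(level_features[0])
    return [sum(f[i] for f in level_features) / n for i in range(dim)]


def max_fuse(level_features):
    """Max pooling across levels: keep the largest value per dimension."""
    dim = len(level_features[0])
    return [max(f[i] for f in level_features) for i in range(dim)]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
    print(fuse_levels(feats, [2.0, 0.5, -1.0]))
    print(average_fuse(feats))
    print(max_fuse(feats))
```

With equal scores the attention fusion reduces to average pooling, so averaging is the special case in which no level is preferred. Unequal scores let the fusion emphasize some levels over others.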
Scene Parsing with Integration of Parametric and Non-parametric Models
We adopt Convolutional Neural Networks (CNNs) to be our parametric model to
learn discriminative features and classifiers for local patch classification.
Based on the occurrence frequency distribution of classes, an ensemble of CNNs
(CNN-Ensemble) is learned, in which each CNN component focuses on learning
different and complementary visual patterns. The local beliefs of pixels are
output by CNN-Ensemble. Considering that visually similar pixels are
indistinguishable under local context, we leverage the global scene semantics
to alleviate the local ambiguity. The global scene constraint is mathematically
achieved by adding a global energy term to the labeling energy function, and it
is practically estimated in a non-parametric framework. A large margin based
CNN metric learning method is also proposed for better global belief
estimation. In the end, the integration of local and global beliefs gives rise
to the class likelihood of pixels, based on which maximum marginal inference is
performed to generate the label prediction maps. Even without any
post-processing, we achieve state-of-the-art results on the challenging
SiftFlow and Barcelona benchmarks.
Comment: 13 Pages, 6 figures, IEEE Transactions on Image Processing (T-IP)
201
Dense Recurrent Neural Networks for Scene Labeling
Recently recurrent neural networks (RNNs) have demonstrated the ability to
improve scene labeling through capturing long-range dependencies among image
units. In this paper, we propose dense RNNs for scene labeling by exploring
various long-range semantic dependencies among image units. In comparison with
existing RNN based approaches, our dense RNNs are able to capture richer
contextual dependencies for each image unit via dense connections between each
pair of image units, which significantly enhances their discriminative power.
In addition, to select relevant dependencies and suppress irrelevant ones for
each unit from the dense connections, we introduce an attention model into dense
RNNs. The attention model automatically assigns more importance to
helpful dependencies and less weight to irrelevant ones. Integrated
with convolutional neural networks (CNNs), our method achieves state-of-the-art
performance on the PASCAL Context, MIT ADE20K and SiftFlow benchmarks.
Comment: Tech. Report
Semi-Supervised Hierarchical Semantic Object Parsing
Models based on Convolutional Neural Networks (CNNs), which yield hierarchies
of features, have proven very successful for semantic segmentation and object
parsing. Our key insight is to build convolutional networks that take input
of arbitrary size and produce object parsing output with efficient inference
and learning. In this work, we focus on the task of instance segmentation and
parsing, which recognizes and localizes objects down to the pixel level based on
a deep CNN. Therefore, unlike in some related work, a pixel cannot belong to
multiple instances or parsing outputs. Our model is based on a deep neural network
trained for object masking that is supervised with the input image and then
incorporates a Conditional Random Field (CRF) with end-to-end trainable
piecewise order potentials based on object parsing outputs. In each CRF unit we
design terms to capture the short-range and long-range dependencies from
various neighbors. The accurate instance-level segmentation that our network
produces is reflected by the considerable improvements obtained over previous
work at high APr thresholds. We demonstrate the effectiveness of our model with
extensive experiments on a challenging subset of the PASCAL VOC2012 dataset.
Interactively Transferring CNN Patterns for Part Localization
In the scenario of one/multi-shot learning, conventional end-to-end learning
strategies without sufficient supervision are usually not powerful enough to
learn correct patterns from noisy signals. Thus, given a CNN pre-trained for
object classification, this paper proposes a method that first summarizes the
knowledge hidden inside the CNN into a dictionary of latent activation
patterns, and then builds a new model for part localization by manually
assembling latent patterns related to the target part via human interactions.
We use very few (e.g., three) annotations of a semantic object part to retrieve
certain latent patterns from conv-layers to represent the target part. We then
visualize these latent patterns and ask users to further remove incorrect
patterns, in order to refine part representation. With the guidance of human
interactions, our method exhibited superior part localization performance in
experiments.
Unsupervised Category Discovery via Looped Deep Pseudo-Task Optimization Using a Large Scale Radiology Image Database
Obtaining semantic labels on a large-scale radiology image database (215,786
key images from 61,845 unique patients) is a prerequisite for, yet a bottleneck in,
training highly effective deep convolutional neural network (CNN) models for image
recognition. Nevertheless, conventional methods for collecting image labels
(e.g., Google search followed by crowd-sourcing) are not applicable due to the
formidable difficulties of medical annotation tasks for those who are not
clinically trained. This type of image labeling task remains non-trivial even
for radiologists due to uncertainty and possible drastic inter-observer
variation or inconsistency.
In this paper, we present a looped deep pseudo-task optimization procedure
for automatic category discovery of visually coherent and clinically semantic
(concept) clusters. Our system can be initialized by domain-specific (CNN
trained on radiology images and text report derived labels) or generic
(ImageNet based) CNN models. Afterwards, a sequence of pseudo-tasks is
exploited by the looped deep image feature clustering (to refine image labels)
and deep CNN training/classification using new labels (to obtain more task
representative deep features). Our method is conceptually simple and based on
the hypothesized "convergence" of better labels leading to better trained CNN
models which in turn feed more effective deep image features to facilitate more
meaningful clustering/labels. We have empirically validated the convergence and
demonstrated promising quantitative and qualitative results. Category labels of
significantly higher quality than those in previous work are discovered. This
allows for further investigation of the hierarchical semantic nature of the
given large-scale radiology image database.
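The alternation described above (cluster features to get labels, retrain on those labels, re-extract features, repeat until the labels stabilize) can be sketched as a generic loop. This is a toy Python illustration, not the authors' system: `cluster_fn` and `train_and_extract_fn` are hypothetical callables standing in for deep feature clustering and CNN training, and using the fraction of changed labels as the stopping test is an assumption that only makes sense when cluster ids stay comparable between rounds.

```python
def looped_pseudo_task(features, cluster_fn, train_and_extract_fn,
                       max_iters=10, tol=0.01):
    """Alternate clustering and feature refinement until labels stabilize."""
    labels = cluster_fn(features)
    for _ in range(max_iters):
        # "Pseudo-task": refine features using the current labels.
        features = train_and_extract_fn(features, labels)
        new_labels = cluster_fn(features)
        # Assumed convergence measure: share of samples whose label changed.
        changed = sum(a != b for a, b in zip(labels, new_labels)) / len(labels)
        labels = new_labels
        if changed <= tol:
            break
    return labels, features


def threshold_cluster(features):
    """Toy two-way clustering of scalar features around the range midpoint."""
    mid = (min(features) + max(features)) / 2.0
    return [0 if x < mid else 1 for x in features]


def pull_toward_group_mean(features, labels):
    """Toy stand-in for training: move each feature halfway to its group mean."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return [(x + sums[y] / counts[y]) / 2.0 for x, y in zip(features, labels)]


if __name__ == "__main__":
    labels, feats = looped_pseudo_task(
        [0.1, 0.2, 0.9, 1.0], threshold_cluster, pull_toward_group_mean
    )
    print(labels, feats)
```

In a real setting the two callables would be far more expensive, and the stopping criterion would need a measure that is robust to cluster relabeling.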
CNN+CNN: Convolutional Decoders for Image Captioning
Image captioning is a challenging task that combines the fields of computer
vision and natural language processing. A variety of approaches have been
proposed to achieve the goal of automatically describing an image, and
recurrent neural network (RNN) or long short-term memory (LSTM) based models
dominate this field. However, RNNs or LSTMs cannot be calculated in parallel
and ignore the underlying hierarchical structure of a sentence. In this paper,
we propose a framework that only employs convolutional neural networks (CNNs)
to generate captions. Owing to parallel computing, our basic model is around 3
times faster than NIC (an LSTM-based model) during training time, while also
providing better results. We conduct extensive experiments on MSCOCO and
investigate the influence of the model width and depth. Compared with
LSTM-based models that apply similar attention mechanisms, our proposed models
achieve comparable BLEU-1,2,3,4 and METEOR scores and higher CIDEr
scores. We also test our model on the paragraph annotation dataset and obtain a
higher CIDEr score than hierarchical LSTM.