2,677 research outputs found
Deep Contextual Recurrent Residual Networks for Scene Labeling
Designed as extremely deep architectures, deep residual networks, which
provide rich visual representations and robust convergence behavior, have
recently achieved exceptional performance in numerous computer vision
problems. When applied directly to scene labeling, however, they are limited
in capturing long-range contextual dependencies, a critical aspect of the
task. To address this issue, we propose a novel approach, Contextual
Recurrent Residual Networks (CRRN), which simultaneously handles rich visual
representation learning and long-range context modeling within a fully
end-to-end deep network. Furthermore, our end-to-end CRRN is trained entirely
from scratch, without any pre-trained models, in contrast to most existing
methods, which are usually fine-tuned from state-of-the-art pre-trained
models such as VGG-16 or ResNet. Experiments are conducted on four
challenging scene labeling datasets, i.e. SiftFlow, CamVid, Stanford
Background and SUN, and compared against various state-of-the-art scene
labeling methods.
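To make the combination concrete, here is a minimal PyTorch sketch of
interleaving a residual block with a recurrent context pass. The module name,
channel sizes, and the single horizontal GRU sweep are our own illustrative
assumptions, not the authors' architecture:

    import torch
    import torch.nn as nn

    class ContextualResidualBlock(nn.Module):
        """Residual block followed by a recurrent sweep that spreads
        context along image rows (a simple stand-in for long-range
        context modeling; 'channels' is assumed even)."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU(inplace=True)
            # Bidirectional GRU per row; the two directions concatenate
            # back to the original channel count.
            self.row_rnn = nn.GRU(channels, channels // 2,
                                  batch_first=True, bidirectional=True)

        def forward(self, x):
            # Standard residual computation for rich local features.
            h = self.relu(self.conv1(x))
            h = x + self.conv2(h)
            # Recurrent context pass: treat each row as a sequence.
            b, c, hgt, wid = h.shape
            rows = h.permute(0, 2, 3, 1).reshape(b * hgt, wid, c)
            ctx, _ = self.row_rnn(rows)
            ctx = ctx.reshape(b, hgt, wid, c).permute(0, 3, 1, 2)
            return self.relu(h + ctx)

Stacking several such blocks would let local residual features and propagated
context refine each other within one trainable-from-scratch network.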
Beyond Forward Shortcuts: Fully Convolutional Master-Slave Networks (MSNets) with Backward Skip Connections for Semantic Segmentation
Recent deep CNNs contain forward shortcut connections, i.e. skip connections
from low to high layers. Reusing features from lower layers, which have
higher resolution (location information), helps higher layers recover lost
details and mitigates information degradation. However, during inference the
lower layers have no access to high-layer features, even though those
features carry high-level contextual semantics that could help low layers
adaptively extract informative features for later layers. In this paper, we
study the influence of backward skip connections, which run in the opposite
direction to forward shortcuts, i.e. paths from high layers to low layers. To
achieve this -- which indeed runs counter to the nature of feed-forward
networks -- we propose a new fully convolutional model that consists of a
pair of networks. A `Slave' network is dedicated to providing backward
connections from its top layers to the `Master' network's bottom layers. The
Master network is used to produce the final label predictions. In our
experiments we validate the proposed FCN model on the ADE20K (ImageNet scene
parsing), PASCAL-Context, and PASCAL VOC 2011 datasets.
Comment: 9 pages, 5 figures
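A rough sketch of the backward-skip idea follows; the class name and the
simple concatenate-and-convolve fusion are our assumptions, not the paper's
exact design. The slave's top-layer features are upsampled and injected into
the master's bottom layers, so the master's early features are computed with
high-level context already available:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MasterSlaveSegNet(nn.Module):
        def __init__(self, in_ch=3, num_classes=21, width=64):
            super().__init__()
            # Slave: a small forward CNN whose top features go backward.
            self.slave = nn.Sequential(
                nn.Conv2d(in_ch, width, 3, stride=2, padding=1),
                nn.ReLU(True),
                nn.Conv2d(width, width * 2, 3, stride=2, padding=1),
                nn.ReLU(True))
            # Master: its first layer consumes image + backward skip.
            self.master_low = nn.Sequential(
                nn.Conv2d(in_ch + width * 2, width, 3, padding=1),
                nn.ReLU(True))
            self.master_high = nn.Sequential(
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(True),
                nn.Conv2d(width, num_classes, 1))

        def forward(self, x):
            # Backward skip: high-level slave features, upsampled.
            top = self.slave(x)
            top = F.interpolate(top, size=x.shape[-2:], mode='bilinear',
                                align_corners=False)
            # Master's low layers see high-level context from the start.
            low = self.master_low(torch.cat([x, top], dim=1))
            return self.master_high(low)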
Scene Parsing via Dense Recurrent Neural Networks with Attentional Selection
Recurrent neural networks (RNNs) have shown the ability to improve scene
parsing by capturing long-range dependencies among image units. In this
paper, we propose dense RNNs for scene labeling that explore various
long-range semantic dependencies among image units. Unlike existing RNN-based
approaches, our dense RNNs capture richer contextual dependencies for each
image unit by enabling immediate connections between each pair of image
units, which significantly enhances their discriminative power. Moreover, to
select relevant dependencies and suppress irrelevant ones for each unit among
the dense connections, we introduce an attention model into the dense RNNs.
The attention model automatically assigns more importance to helpful
dependencies and less weight to irrelevant ones. Integrated with
convolutional neural networks (CNNs), this yields an end-to-end scene
labeling system. Extensive experiments on three large-scale benchmarks
demonstrate that the proposed approach improves the baselines by large
margins and outperforms other state-of-the-art algorithms.
Comment: 10 pages. arXiv admin note: substantial text overlap with
arXiv:1801.0683
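The attentional selection over dense connections can be sketched roughly as
below (a simplification we wrote; the paper's actual dense RNN topology is
richer than this single module). Each image unit aggregates features from
every other unit, with learned attention deciding how much each dependency
contributes:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseAttentiveAggregation(nn.Module):
        """Each unit attends to all units of a feature map and mixes in
        an attention-weighted context vector (a stand-in for dense
        connections with attentional selection)."""
        def __init__(self, channels):
            super().__init__()
            self.query = nn.Conv2d(channels, channels, 1)
            self.key = nn.Conv2d(channels, channels, 1)
            self.update = nn.Conv2d(2 * channels, channels, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c)
            k = self.key(x).flatten(2)                    # (b, c, hw)
            v = x.flatten(2).transpose(1, 2)              # (b, hw, c)
            # Scores between every pair of units: relevant dependencies
            # get high weight, irrelevant ones are suppressed.
            attn = F.softmax(q @ k / c ** 0.5, dim=-1)    # (b, hw, hw)
            ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
            return self.update(torch.cat([x, ctx], dim=1))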
Dense Recurrent Neural Networks for Scene Labeling
Recently, recurrent neural networks (RNNs) have demonstrated the ability to
improve scene labeling by capturing long-range dependencies among image
units. In this paper, we propose dense RNNs for scene labeling that explore
various long-range semantic dependencies among image units. In comparison
with existing RNN-based approaches, our dense RNNs capture richer contextual
dependencies for each image unit via dense connections between each pair of
image units, which significantly enhances their discriminative power.
Moreover, to select relevant dependencies and suppress irrelevant ones for
each unit among the dense connections, we introduce an attention model into
the dense RNNs. The attention model automatically assigns more importance to
helpful dependencies and less weight to irrelevant ones. Integrated with
convolutional neural networks (CNNs), our method achieves state-of-the-art
performance on the PASCAL Context, MIT ADE20K and SiftFlow benchmarks.
Comment: Tech. Report
Recurrent Iterative Gating Networks for Semantic Segmentation
In this paper, we present an approach for Recurrent Iterative Gating called
RIGNet. The core elements of RIGNet are recurrent connections that control
the flow of information in neural networks in a top-down manner, and
different variants of the core structure are considered. The iterative nature
of this mechanism allows gating to spread in both spatial extent and feature
space. This proves to be a powerful mechanism that is broadly compatible with
common existing networks. Analysis shows how gating interacts with different
network characteristics, and we also show that shallower networks with gating
can be made to outperform much deeper networks that do not include RIGNet
modules.
Comment: WACV 2019
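A compact sketch of the gating idea as we read it (the module names, the
sigmoid-gate formulation, and the fixed iteration count are our assumptions):
high-layer features produce a gate map that modulates low-layer features, and
the pass is repeated so the gating spreads spatially:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IterativeGatedNet(nn.Module):
        def __init__(self, channels=64, num_classes=21, iterations=3):
            super().__init__()
            self.iterations = iterations
            self.low = nn.Conv2d(3, channels, 3, padding=1)
            self.high = nn.Conv2d(channels, channels, 3, stride=2,
                                  padding=1)
            # Maps high-level features to a gate on low-level features.
            self.gate = nn.Conv2d(channels, channels, 1)
            self.classifier = nn.Conv2d(channels, num_classes, 1)

        def forward(self, x):
            low = F.relu(self.low(x))
            gated = low
            for _ in range(self.iterations):
                high = F.relu(self.high(gated))
                g = torch.sigmoid(self.gate(
                    F.interpolate(high, size=low.shape[-2:],
                                  mode='bilinear', align_corners=False)))
                # Top-down gating: high-level context decides which
                # low-level responses pass through next iteration.
                gated = low * g
            return self.classifier(gated)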
Multimodal Recurrent Neural Networks with Information Transfer Layers for Indoor Scene Labeling
This paper proposes a new method called Multimodal RNNs for RGB-D scene
semantic segmentation. It is optimized to classify image pixels given two
input sources: RGB color channels and depth maps. It simultaneously trains
two recurrent neural networks (RNNs) that are cross-connected through
information transfer layers, which are learned to adaptively extract relevant
cross-modality features. Each RNN model learns its representations from its
own previous hidden states and from patterns transferred from the other RNN's
previous hidden states; thus, both model-specific and cross-modality features
are retained. We exploit the structure of quad-directional 2D-RNNs to model
the short- and long-range contextual information in the 2D input image. We
carefully design various baselines to thoroughly examine the proposed model
structure. We test our Multimodal RNNs method on popular RGB-D benchmarks and
show that it significantly outperforms previous methods and achieves results
competitive with other state-of-the-art work.
Comment: 15 pages, 13 figures, IEEE TMM 201
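One direction of the quad-directional scan might be sketched as below. This
is our own minimal formulation; in particular, realizing the transfer layers
as single linear maps is an assumption. Two GRU streams, one per modality,
each receive a transferred version of the other stream's previous hidden
state:

    import torch
    import torch.nn as nn

    class CrossModalRNNStep(nn.Module):
        """One scan direction of a two-stream RNN with information
        transfer layers between the RGB and depth streams."""
        def __init__(self, in_dim, hidden):
            super().__init__()
            self.rgb_cell = nn.GRUCell(in_dim + hidden, hidden)
            self.dep_cell = nn.GRUCell(in_dim + hidden, hidden)
            # Transfer layers: learn which cross-modality patterns
            # to pass to the other stream.
            self.dep_to_rgb = nn.Linear(hidden, hidden)
            self.rgb_to_dep = nn.Linear(hidden, hidden)

        def forward(self, rgb_seq, dep_seq):
            # rgb_seq, dep_seq: (batch, steps, in_dim)
            b, t, _ = rgb_seq.shape
            h_rgb = rgb_seq.new_zeros(b, self.rgb_cell.hidden_size)
            h_dep = dep_seq.new_zeros(b, self.dep_cell.hidden_size)
            for i in range(t):
                # Each stream sees its own input plus a transferred
                # pattern from the other stream's previous hidden state.
                h_rgb_new = self.rgb_cell(
                    torch.cat([rgb_seq[:, i], self.dep_to_rgb(h_dep)],
                              dim=1), h_rgb)
                h_dep = self.dep_cell(
                    torch.cat([dep_seq[:, i], self.rgb_to_dep(h_rgb)],
                              dim=1), h_dep)
                h_rgb = h_rgb_new
            return h_rgb, h_dep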
Reading Scene Text with Attention Convolutional Sequence Modeling
Reading text in the wild is a challenging task in the field of computer
vision. Existing approaches have mainly adopted Connectionist Temporal
Classification (CTC) or attention models based on recurrent neural networks
(RNNs), which are computationally expensive and hard to train. In this paper,
we present an end-to-end Attention Convolutional Network for scene text
recognition. Firstly, instead of an RNN, we adopt stacked convolutional
layers to effectively capture the contextual dependencies of the input
sequence, with lower computational complexity and easier parallel
computation. Compared to the chain structure of recurrent networks, the
convolutional neural network (CNN) provides a natural way to capture
long-term dependencies between elements and is 9 times faster than a
Bidirectional Long Short-Term Memory (BLSTM). Furthermore, in order to
enhance the representation of foreground text and suppress background noise,
we incorporate residual attention modules into a small densely connected
network to improve the discriminability of CNN features. We validate the
performance of our approach on standard benchmarks, including the Street View
Text, IIIT5K and ICDAR datasets. The resulting state-of-the-art or highly
competitive performance and efficiency demonstrate the superiority of the
proposed approach.
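The contrast with a recurrent encoder can be made concrete with a toy sketch
(ours; the layer count and width below are arbitrary, not the paper's). A
stack of 1D convolutions sees a context window that grows with depth and
processes all time steps in parallel, which is where the speed advantage over
a BLSTM comes from:

    import torch.nn as nn

    def conv_sequence_encoder(channels=256, layers=4):
        """Stacked 1D convolutions over a feature sequence (B, C, T).
        Each 3-wide layer grows the receptive field by 2, so 'layers'
        convs see a window of 2*layers + 1 steps, and every step is
        computed in parallel rather than sequentially as in an RNN."""
        blocks = []
        for _ in range(layers):
            blocks += [nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1),
                       nn.BatchNorm1d(channels),
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*blocks)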
Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision
Scene labeling is a challenging classification problem in which each input
image requires a pixel-level prediction map. Recently, deep-learning-based
methods have shown their effectiveness in solving this problem. However, we
argue that large intra-class variation provides ambiguous training
information and hinders deep models' ability to learn more discriminative
deep feature representations. Unlike existing methods that mainly utilize
semantic context for regularizing or smoothing the prediction map, we design
novel supervision from semantic context for learning better deep feature
representations. Two types of semantic context, scene names of images and
label-map statistics of image patches, are exploited to create label
hierarchies between the original classes and newly created subclasses that
serve as the learning supervision. Such subclasses show lower intra-class
variation and help the CNN detect more meaningful visual patterns and learn
more effective deep features. Novel training strategies and a network
structure that take advantage of such label hierarchies are introduced. Our
proposed method is evaluated extensively on four popular datasets, Stanford
Background (8 classes), SIFTFlow (33 classes), Barcelona (170 classes) and
LM+Sun (232 classes), with three different network structures, and shows
state-of-the-art performance. The experiments show that our proposed method
makes deep models learn more discriminative feature representations without
increasing model size or complexity.
Comment: 13 pages
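The supervision scheme can be sketched roughly as follows. This is a
simplification we wrote: how subclasses are derived from scene names and
label-map statistics is the paper's contribution and is only assumed as given
targets here. The network gets a second head trained on the finer subclass
labels, and both losses shape the shared features:

    import torch
    import torch.nn as nn

    class SubclassSupervisedHead(nn.Module):
        """Dual heads: pixel-wise prediction over the original classes
        and over context-derived subclasses, which carry lower
        intra-class variation."""
        def __init__(self, feat_ch, num_classes, num_subclasses):
            super().__init__()
            self.class_head = nn.Conv2d(feat_ch, num_classes, 1)
            self.subclass_head = nn.Conv2d(feat_ch, num_subclasses, 1)
            self.loss = nn.CrossEntropyLoss()

        def forward(self, feats, class_target, subclass_target,
                    alpha=0.5):
            # The subclass loss pushes the shared features to separate
            # context-specific visual patterns; alpha balances the two.
            loss_c = self.loss(self.class_head(feats), class_target)
            loss_s = self.loss(self.subclass_head(feats),
                               subclass_target)
            return loss_c + alpha * loss_s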
LatentGNN: Learning Efficient Non-local Relations for Visual Recognition
Capturing long-range dependencies in feature representations is crucial for
many visual recognition tasks. Despite the recent successes of deep
convolutional networks, it remains challenging to model non-local context
relations between visual features. A promising strategy is to model the
feature context with a fully connected graph neural network (GNN), which
augments traditional convolutional features with an estimated non-local
context representation. However, most GNN-based approaches require computing
a dense graph affinity matrix and hence have difficulty scaling up to complex
real-world visual problems. In this work, we propose an efficient yet
flexible non-local relation representation based on a novel class of graph
neural networks. Our key idea is to introduce a latent space that reduces the
complexity of the graph, which allows us to use a low-rank representation for
the graph affinity matrix and to achieve linear complexity in computation.
Extensive experimental evaluations on three major visual recognition tasks
show that our method outperforms prior work by a large margin while
maintaining a low computation cost.
Comment: ICML 2019
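The low-rank trick can be sketched in a few lines (our own minimal version;
the paper's exact parameterization may differ). Instead of forming an
hw-by-hw affinity matrix, features are projected onto a small set of latent
nodes, mixed there, and projected back, so the cost is linear in the number
of visual nodes:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankNonLocal(nn.Module):
        """Non-local context via d latent nodes: O(hw * d) instead of
        the O((hw)^2) of a dense affinity matrix."""
        def __init__(self, channels, num_latent=16):
            super().__init__()
            # Per-position affinity to each latent node: a low-rank
            # factor of the full graph affinity matrix.
            self.to_latent = nn.Conv2d(channels, num_latent, 1)
            self.mix = nn.Linear(channels, channels)

        def forward(self, x):
            b, c, h, w = x.shape
            psi = F.softmax(self.to_latent(x).flatten(2), dim=-1)
            feats = x.flatten(2).transpose(1, 2)   # (b, hw, c)
            latent = psi @ feats                   # gather: (b, d, c)
            latent = F.relu(self.mix(latent))      # mix among latents
            ctx = psi.transpose(1, 2) @ latent     # scatter: (b, hw, c)
            return x + ctx.transpose(1, 2).reshape(b, c, h, w)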
Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation
State-of-the-art results in semantic segmentation are established by Fully
Convolutional Networks (FCNs). FCNs rely on cascaded convolutional and
pooling layers to gradually enlarge the receptive fields of neurons,
resulting in an indirect way of modeling distant contextual dependencies. In
this work, we advocate the use of spatially recurrent layers (i.e. ReNet
layers), which directly capture global context and lead to improved feature
representations. We demonstrate the effectiveness of ReNet layers by building
a Naive deep ReNet (N-ReNet), which achieves competitive performance on the
Stanford Background dataset. Furthermore, we integrate ReNet layers with FCNs
and develop a novel Hybrid deep ReNet (H-ReNet). It enjoys a few remarkable
properties, including full-image receptive fields, end-to-end training, and
efficient network execution. On the PASCAL VOC 2012 benchmark, the H-ReNet
improves the results of the state-of-the-art approaches Piecewise, CRFasRNN
and DeepParsing by 3.6%, 2.3% and 0.2%, respectively, and achieves the
highest IoUs for 13 out of the 20 object classes.
Comment: 14 pages
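The ReNet layer itself is easy to sketch (a minimal version under our own
assumptions about sizes; 'channels' is assumed even). A bidirectional GRU
sweeps each row, then a second one sweeps each column of the result, giving
every output unit a full-image receptive field in two passes:

    import torch.nn as nn

    class ReNetLayer(nn.Module):
        """Spatially recurrent layer: horizontal then vertical
        bidirectional sweeps yield full-image receptive fields."""
        def __init__(self, channels):
            super().__init__()
            self.h_rnn = nn.GRU(channels, channels // 2,
                                batch_first=True, bidirectional=True)
            self.v_rnn = nn.GRU(channels, channels // 2,
                                batch_first=True, bidirectional=True)

        def forward(self, x):
            b, c, h, w = x.shape
            # Horizontal sweep: each row is a bidirectional sequence.
            rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
            rows, _ = self.h_rnn(rows)
            x = rows.reshape(b, h, w, c)
            # Vertical sweep over horizontally contextualized features.
            cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
            cols, _ = self.v_rnn(cols)
            return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)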