Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, a lot of data (e.g. videos captured from
many distributed sensors) need to be automatically processed and analyzed. In
this paper, we review deep learning algorithms applied to video analytics
for smart cities across several research topics: object detection, object
tracking, face recognition, image classification and scene labeling.
Comment: 8 pages, 18 figures
Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision
Scene labeling is a challenging classification problem where each input image
requires a pixel-level prediction map. Recently, deep-learning-based methods
have shown their effectiveness on solving this problem. However, we argue that
the large intra-class variation provides ambiguous training information and
hinders the deep models' ability to learn more discriminative deep feature
representations. Unlike existing methods that mainly utilize semantic context
for regularizing or smoothing the prediction map, we design novel supervisions
from semantic context for learning better deep feature representations. Two
types of semantic context, scene names of images and label map statistics of
image patches, are exploited to create label hierarchies between the original
classes and newly created subclasses as the learning supervisions. Such
subclasses show lower intra-class variation, and help the CNN detect more
meaningful visual patterns and learn more effective deep features. Novel
training strategies and a network structure that take advantage of such label
hierarchies are introduced. Our proposed method is evaluated extensively on
four popular datasets, Stanford Background (8 classes), SIFTFlow (33 classes),
Barcelona (170 classes) and LM+Sun (232 classes), with 3 different
network structures, and shows state-of-the-art performance. The experiments
show that our proposed method makes deep models learn more discriminative
feature representations without increasing model size or complexity.
Comment: 13 pages
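A minimal sketch of the label-hierarchy idea described above, under loud assumptions: the abstract does not say exactly how subclasses are formed, so this example simply creates one subclass per (class, scene name) pair and collapses subclass scores back to the original classes by summation. The function names `build_label_hierarchy` and `collapse_to_parent` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_label_hierarchy(samples):
    """Create one subclass per (class_label, scene_name) pair.

    samples: iterable of (scene_name, class_label) pairs.
    Returns (subclass_ids, parent_of), where subclass_ids maps
    (class_label, scene_name) -> subclass id, and parent_of maps
    subclass id -> original class label.
    This pairing rule is an illustrative assumption.
    """
    subclass_ids = {}
    parent_of = {}
    for scene_name, class_label in samples:
        key = (class_label, scene_name)
        if key not in subclass_ids:
            new_id = len(subclass_ids)
            subclass_ids[key] = new_id
            parent_of[new_id] = class_label
    return subclass_ids, parent_of


def collapse_to_parent(subclass_scores, parent_of):
    """Sum subclass scores into scores for the original classes."""
    parent_scores = defaultdict(float)
    for sub_id, score in subclass_scores.items():
        parent_scores[parent_of[sub_id]] += score
    return dict(parent_scores)


if __name__ == "__main__":
    samples = [("street", "road"), ("highway", "road"), ("street", "sky")]
    ids, parents = build_label_hierarchy(samples)
    print(ids)
    print(collapse_to_parent({0: 0.2, 1: 0.3, 2: 0.5}, parents))
```

A network trained on the finer subclasses can still be evaluated on the original classes by collapsing its outputs this way, which is one way to read "without increasing model size or complexity" at test time.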
Learning Hierarchical Shape Segmentation and Labeling from Online Repositories
We propose a method for converting geometric shapes into hierarchically
segmented parts with part labels. Our key idea is to train category-specific
models from the scene graphs and part names that accompany 3D shapes in public
repositories. These freely-available annotations represent an enormous,
untapped source of information on geometry. However, because the models and
corresponding scene graphs are created by a wide range of modelers with
different levels of expertise, modeling tools, and objectives, these models
have very inconsistent segmentations and hierarchies with sparse and noisy
textual tags. Our method involves two analysis steps. First, we perform a joint
optimization to simultaneously cluster and label parts in the database while
also inferring a canonical tag dictionary and part hierarchy. We then use this
labeled data to train a method for hierarchical segmentation and labeling of
new 3D shapes. We demonstrate that our method can mine complex information,
detecting hierarchies in man-made objects and their constituent parts,
obtaining finer scale details than existing alternatives. We also show that, by
performing domain transfer using a few supervised examples, our technique
outperforms fully-supervised techniques that require hundreds of
manually-labeled models.
Deep Structured Scene Parsing by Learning with Image Descriptions
This paper addresses a fundamental problem of scene understanding: How to
parse the scene image into a structured configuration (i.e., a semantic object
hierarchy with object interaction relations) that finely accords with human
perception. We propose a deep architecture consisting of two networks: i) a
convolutional neural network (CNN) extracting the image representation for
pixelwise object labeling and ii) a recursive neural network (RNN) discovering
the hierarchical object structure and the inter-object relations. Rather than
relying on elaborate user annotations (e.g., manually labeling semantic maps
and relations), we train our deep model in a weakly-supervised manner by
leveraging the descriptive sentences of the training images. Specifically, we
decompose each sentence into a semantic tree consisting of nouns and verb
phrases, and use these trees to discover the configurations of the
training images. Once these scene configurations are determined, the
parameters of both the CNN and RNN are updated accordingly by backpropagation.
The entire model training is accomplished through an Expectation-Maximization
method. Extensive experiments suggest that our model is capable of producing
meaningful and structured scene configurations and achieving more favorable
scene labeling performance on PASCAL VOC 2012 over other state-of-the-art
weakly-supervised methods.
Comment: Discovering a semantic object hierarchy with object interaction
relations. Published in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2016 (oral).
Semantic Instance Labeling Leveraging Hierarchical Segmentation
Most of the approaches for indoor RGBD semantic labeling focus on using
pixels or superpixels to train a classifier. In this paper, we implement a
higher level segmentation using a hierarchy of superpixels to obtain a better
segmentation for training our classifier. By focusing on meaningful segments
that conform more directly to objects, regardless of size, we train a random
forest of decision trees as a classifier using simple features such as the 3D
size, LAB color histogram, width, height, and shape as specified by a
histogram of surface normals. We test our method on the NYU V2 depth dataset, a
challenging dataset of cluttered indoor environments. Our experiments using the
NYU V2 depth dataset show that our method achieves state-of-the-art results
on both a general semantic labeling introduced by the dataset (floor,
structure, furniture, and objects) and a more object-specific semantic
labeling. We show that training a classifier on a segmentation from a hierarchy
of superpixels yields better results than training directly on superpixels,
patches, or pixels as in previous work.
Geometric Scene Parsing with Hierarchical LSTM
This paper addresses the problem of geometric scene parsing, i.e.
simultaneously labeling geometric surfaces (e.g. sky, ground and vertical
plane) and determining the interaction relations (e.g. layering, supporting,
siding and affinity) between main regions. This problem is more challenging
than the traditional semantic scene labeling, as recovering geometric
structures necessarily requires the rich and diverse contextual information. To
achieve these goals, we propose a novel recurrent neural network model, named
Hierarchical Long Short-Term Memory (H-LSTM). It contains two coupled
sub-networks: the Pixel LSTM (P-LSTM) and the Multi-scale Super-pixel LSTM
(MS-LSTM) for handling the surface labeling and relation prediction,
respectively. The two sub-networks provide complementary information to each
other to exploit hierarchical scene contexts, and they are jointly optimized
for boosting the performance. Our extensive experiments show that our model is
capable of parsing scene geometric structures and outperforming several
state-of-the-art methods by large margins. In addition, we show promising 3D
reconstruction results from the still images based on the geometric parsing.
Comment: To be presented at IJCAI'1
Learning to Group and Label Fine-Grained Shape Components
A majority of stock 3D models in modern shape repositories are assembled with
many fine-grained components. The main cause of such data form is the
component-wise modeling process widely practiced by human modelers. These
modeling components thus inherently reflect some function-based shape
decomposition the artist had in mind during modeling. On the other hand,
modeling components represent an over-segmentation since a functional part is
usually modeled as a multi-component assembly. Based on these observations, we
advocate that labeled segmentation of stock 3D models should not overlook the
modeling components and propose a learning solution to grouping and labeling of
the fine-grained components. However, directly characterizing the shape of
individual components for the purpose of labeling is unreliable, since they can
be arbitrarily tiny and semantically meaningless. We propose to generate part
hypotheses from the components based on a hierarchical grouping strategy, and
perform labeling on those part groups instead of directly on the components.
Part hypotheses are mid-level elements which are more likely to carry
semantic information. A multiscale 3D convolutional neural network is trained
to extract context-aware features for the hypotheses. To accomplish a labeled
segmentation of the whole shape, we formulate higher-order conditional random
fields (CRFs) to infer an optimal label assignment for all components.
Extensive experiments demonstrate that our method achieves significantly robust
labeling results on raw 3D models from public shape repositories. Our work also
contributes the first benchmark for component-wise labeling.
Comment: Accepted to SIGGRAPH Asia 2018. Corresponding Author: Kai Xu
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Few prior works study deep learning on point sets. PointNet by Qi et al. is a
pioneer in this direction. However, by design PointNet does not capture local
structures induced by the metric space points live in, limiting its ability to
recognize fine-grained patterns and generalizability to complex scenes. In this
work, we introduce a hierarchical neural network that applies PointNet
recursively on a nested partitioning of the input point set. By exploiting
metric space distances, our network is able to learn local features with
increasing contextual scales. With further observation that point sets are
usually sampled with varying densities, which results in greatly decreased
performance for networks trained on uniform densities, we propose novel set
learning layers to adaptively combine features from multiple scales.
Experiments show that our network called PointNet++ is able to learn deep point
set features efficiently and robustly. In particular, results significantly
better than state-of-the-art have been obtained on challenging benchmarks of 3D
point clouds.
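The nested partitioning by metric-space distance mentioned above can be illustrated with a small sketch. It assumes centroids are chosen by farthest point sampling and neighborhoods are gathered with a fixed-radius ball query; the abstract itself does not name these steps, so treat them as one plausible instantiation. The function names and the toy points are hypothetical, and no learned features are computed here.

```python
import math


def farthest_point_sampling(points, k):
    """Return indices of up to k points, each chosen to be as far as
    possible from the points already selected (Euclidean distance)."""
    if k <= 0 or not points:
        return []
    chosen = [0]  # arbitrary starting point
    # Distance from every point to its nearest chosen centroid so far.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def ball_query(points, centroid_indices, radius):
    """For each centroid, list the indices of points within `radius`."""
    return [
        [i for i, p in enumerate(points) if math.dist(p, points[c]) <= radius]
        for c in centroid_indices
    ]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
    centroids = farthest_point_sampling(pts, 2)
    print(centroids, ball_query(pts, centroids, radius=0.5))
```

Applying a per-group feature extractor to each neighborhood and then repeating the sampling and grouping on the resulting centroids would give the increasing contextual scales the abstract refers to.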
Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions
This paper investigates a fundamental problem of scene understanding: how to
parse a scene image into a structured configuration (i.e., a semantic object
hierarchy with object interaction relations). We propose a deep architecture
consisting of two networks: i) a convolutional neural network (CNN) extracting
the image representation for pixel-wise object labeling and ii) a recursive
neural network (RsNN) discovering the hierarchical object structure and the
inter-object relations. Rather than relying on elaborate annotations (e.g.,
manually labeled semantic maps and relations), we train our deep model in a
weakly-supervised learning manner by leveraging the descriptive sentences of
the training images. Specifically, we decompose each sentence into a semantic
tree consisting of nouns and verb phrases, and apply these tree structures to
discover the configurations of the training images. Once these scene
configurations are determined, the parameters of both the CNN and RsNN are
updated accordingly by backpropagation. The entire model training is
accomplished through an Expectation-Maximization method. Extensive experiments
show that our model is capable of producing meaningful scene configurations and
achieving more favorable scene labeling results on two benchmarks (i.e., PASCAL
VOC 2012 and SYSU-Scenes) compared with other state-of-the-art
weakly-supervised deep learning methods. In particular, SYSU-Scenes contains
more than 5000 scene images with their semantic sentence descriptions; we
created it to advance research on scene parsing.
Comment: Accepted by Transactions on Pattern Analysis and Machine Intelligence
(T-PAMI) 201
Geometric Context from Videos
We present a novel algorithm for estimating the broad 3D geometric structure
of outdoor video scenes. Leveraging spatio-temporal video segmentation, we
decompose a dynamic scene captured by a video into geometric classes, based on
predictions made by region-classifiers that are trained on appearance and
motion features. By examining the homogeneity of the prediction, we combine
predictions across multiple segmentation hierarchy levels, alleviating the need
to determine the granularity a priori. We built a novel, extensive dataset on
geometric context of video to evaluate our method, consisting of over 100
ground-truth annotated outdoor videos with over 20,000 frames. To further scale
beyond this dataset, we propose a semi-supervised learning framework to expand
the pool of labeled data with high confidence predictions obtained from
unlabeled data. Our system produces an accurate prediction of geometric context
of video, achieving 96% accuracy across main geometric classes.
Comment: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference
o
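The semi-supervised step described in the abstract above, expanding the labeled pool with high-confidence predictions on unlabeled data, can be sketched as follows. This is a generic pseudo-labeling loop under stated assumptions, not the authors' system: the function name `expand_labeled_pool`, the `predict_proba` callback and the 0.9 threshold are all hypothetical.

```python
def expand_labeled_pool(labeled, unlabeled, predict_proba, threshold=0.9):
    """Move confidently predicted samples from the unlabeled pool to the
    labeled pool.

    labeled: list of (sample, label) pairs.
    unlabeled: list of samples.
    predict_proba: callable mapping a sample to {label: probability}
        (a stand-in for any trained region classifier).
    threshold: minimum top probability required to accept a pseudo-label
        (an assumed value, not one reported in the abstract).
    Returns (new_labeled, still_unlabeled).
    """
    new_labeled = list(labeled)
    still_unlabeled = []
    for sample in unlabeled:
        probs = predict_proba(sample)
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            new_labeled.append((sample, label))
        else:
            still_unlabeled.append(sample)
    return new_labeled, still_unlabeled


if __name__ == "__main__":
    # Toy classifier: confident only for clearly small or large values.
    def toy_predict(x):
        if x < 0.2:
            return {"sky": 0.95, "ground": 0.05}
        if x > 0.8:
            return {"sky": 0.03, "ground": 0.97}
        return {"sky": 0.5, "ground": 0.5}

    print(expand_labeled_pool([(0.0, "sky")], [0.1, 0.5, 0.9], toy_predict))
```

In practice the classifier would be retrained on the enlarged pool and the step repeated until few new samples pass the threshold.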