Pixelwise Instance Segmentation with a Dynamically Instantiated Network
Semantic segmentation and object detection research have recently achieved
rapid progress. However, the former task has no notion of different instances
of the same object, and the latter operates at a coarse, bounding-box level. We
propose an Instance Segmentation system that produces a segmentation map where
each pixel is assigned an object class and instance identity label. Most
approaches adapt object detectors to produce segments instead of boxes. In
contrast, our method is based on an initial semantic segmentation module, which
feeds into an instance subnetwork. This subnetwork uses the initial
category-level segmentation, along with cues from the output of an object
detector, within an end-to-end CRF to predict instances. This part of our model
is dynamically instantiated to produce a variable number of instances per
image. Our end-to-end approach requires no post-processing and considers the
image holistically, instead of processing independent proposals. Therefore,
unlike some related work, a pixel cannot belong to multiple instances.
Furthermore, far more precise segmentations are achieved, as shown by our
state-of-the-art results (particularly at high IoU thresholds) on the Pascal
VOC and Cityscapes datasets. (CVPR 2017)
Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding
Recent trends in image understanding have pushed for holistic scene
understanding models that jointly reason about various tasks such as object
detection, scene recognition, shape analysis, contextual reasoning, and local
appearance based classifiers. In this work, we are interested in understanding
the roles of these different tasks in improved scene understanding, in
particular semantic segmentation, object detection and scene recognition.
Towards this goal, we "plug-in" human subjects for each of the various
components in a state-of-the-art conditional random field model. Comparisons
among various hybrid human-machine CRFs give us indications of how much "head
room" there is to improve scene understanding by focusing research efforts on
various individual tasks.
A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing
Text-level discourse parsing remains a challenge. The current state-of-the-art overall accuracy in relation assignment is 55.73%, achieved by Joty et al. (2013). However, their model has a high order of time complexity, and thus cannot be applied in practice. In this work, we develop a much faster model whose time complexity is linear in the number of sentences. Our model adopts a greedy bottom-up approach, with two linear-chain CRFs applied in cascade as local classifiers. To enhance the accuracy of the pipeline, we add additional constraints in the Viterbi decoding of the first CRF. In addition to efficiency, our parser also significantly outperforms the state of the art. Moreover, our novel approach of post-editing, which modifies a fully-built tree by considering information from constituents on upper levels, can further improve the accuracy.
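The Viterbi decoding used in the cascaded linear-chain CRFs above can be sketched as follows; this is a generic implementation under standard notation, not the authors' code, and the additional constraints they inject into decoding are omitted:

```python
import numpy as np

def viterbi(unary, transition):
    """Viterbi decoding for a linear-chain CRF.

    unary: (T, K) array of per-position label scores (log-potentials).
    transition: (K, K) array; transition[i, j] = score of label i -> label j.
    Returns the highest-scoring label sequence as a list of ints.
    """
    T, K = unary.shape
    score = unary[0].copy()              # best score ending in each label at t = 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best path ending in label i at t-1, extended by label j at t
        cand = score[:, None] + transition + unary[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final label
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(backptr[t, labels[-1]]))
    return labels[::-1]
```

With a zero transition matrix this reduces to per-position argmax; a transition matrix that penalizes label switches yields smoother sequences, which is exactly the kind of structural preference a linear-chain CRF encodes.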
Geometric Supervision and Deep Structured Models for Image Segmentation
The task of semantic segmentation aims at understanding an image at a pixel level. Due to its applicability in many areas, such as autonomous vehicles, robotics and medical surgery assistance, semantic segmentation has become an essential task in image analysis. During the last few years much progress has been made on image segmentation algorithms, mainly due to the introduction of deep learning methods, in particular the use of Convolutional Neural Networks (CNNs). CNNs are powerful for modeling complex connections between input and output data but have two drawbacks when it comes to semantic segmentation. Firstly, CNNs lack the ability to directly model dependent output structures, for instance, explicitly enforcing properties such as label smoothness and coherence. This drawback motivates the use of Conditional Random Fields (CRFs), applied as a post-processing step in semantic segmentation. Secondly, training CNNs requires large amounts of annotated data. For segmentation this amounts to dense, pixel-level annotations that are very time-consuming to acquire. This thesis summarizes the content of five papers addressing the two aforementioned drawbacks of CNNs. The first two papers present methods for using geometric 3D models to improve segmentation models. The 3D models can be created with little human labour and can be used as a supervisory signal to improve the robustness of semantic segmentation and long-term visual localization methods. The last three papers focus on models combining CNNs and CRFs for semantic segmentation. The models consist of a CNN capable of learning complex image features coupled with a CRF capable of learning dependencies between output variables. Emphasis has been on creating models that are possible to train end-to-end, giving the CNN and the CRF a chance to learn how to interact and exploit complementary information to achieve better performance.
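As a toy illustration of CRF-style post-processing on CNN outputs, the sketch below enforces label smoothness with a Potts pairwise term, optimized by iterated conditional modes (ICM). This is a deliberately simplified stand-in: the models in the thesis use richer CRFs and end-to-end training, and all names here are illustrative:

```python
import numpy as np

def icm_smooth(unary, pairwise_weight=1.0, iters=5):
    """Toy CRF post-processing via iterated conditional modes (ICM).

    unary: (H, W, K) negative log-probabilities from a hypothetical CNN.
    A Potts pairwise term penalizes label disagreement with 4-neighbours.
    Returns an (H, W) array of smoothed labels.
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)        # start from the CNN's per-pixel prediction
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                cost = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # penalize every label that disagrees with this neighbour
                        cost += pairwise_weight * (np.arange(K) != labels[ny, nx])
                labels[y, x] = cost.argmin()
    return labels
```

An isolated pixel whose unary weakly prefers a different label than all its neighbours gets flipped to the majority label, which is the smoothness behaviour a CNN alone cannot enforce.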
On the Role of Context at Different Scales in Scene Parsing
Scene parsing can be formulated as a labeling problem where each
visual data element, e.g., each pixel of an image or each 3D
point in a point cloud, is assigned a semantic class label. One
can approach this problem by training a classifier and predicting
a class label for the data elements purely based on their local
properties. This approach, however, does not take into account
any kind of contextual information between different elements in
the image or point cloud. For example, in an application where we
are interested in labeling roadside objects, the fact that most
of the utility poles are connected to some power wires can be
very helpful in disambiguating them from other similar looking
classes. Recurrence of certain class combinations can also be
considered a good contextual hint, since they are very likely
to co-occur again. These forms of high-level contextual
information are often formulated using pairwise and higher-order
Conditional Random Fields (CRFs). A CRF is a probabilistic
graphical model that encodes the contextual relationships between
the data elements in a scene. In this thesis, we study the
potential of contextual information at different scales (ranges)
in scene parsing problems.
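Concretely, a pairwise CRF of the kind discussed here scores a joint labeling x of all data elements with an energy of the following standard form (generic notation, not quoted from this thesis):

```latex
E(\mathbf{x}) = \sum_{i} \psi_u(x_i) \;+\; \sum_{(i,j) \in \mathcal{E}} \psi_p(x_i, x_j)
```

where the unary potentials \psi_u come from local classifiers, the pairwise potentials \psi_p encode contextual relationships over neighbouring elements \mathcal{E}, and inference seeks the labeling that minimizes E. Higher-order models generalize \psi_p to potentials over larger cliques of elements.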
First, we propose a model that utilizes the local context of the
scene via a pairwise CRF. Our model acquires contextual
interactions between different classes by assessing their
misclassification rates using only the local properties of data.
In other words, no extra training is required for obtaining the
class interaction information.
Next, we expand the context field of view from a local range to a
longer range, and make use of higher-order models to encode more
complex contextual cues. More specifically, we introduce a new
model to employ geometric higher-order terms in a CRF for
semantic labeling of 3D point cloud data.
Despite the potential of the above models at capturing the
contextual cues in the scene, there are higher-level context cues
that cannot be encoded via pairwise and higher-order CRFs. For
instance, a vehicle is very unlikely to appear in a sea scene, or
buildings are frequently observed in a street scene. Such
information can be described using scene context and is modeled
using global image descriptors. In particular, through an image
retrieval procedure, we find images whose content is similar to
that of the query image, and use them for scene parsing. Another
problem of the above methods is that they rely on a
computationally expensive training process for the classification
using the local properties of data elements, which needs to be
repeated every time the training data is modified. We address
this issue by proposing a fast and efficient approach that
exempts us from the cumbersome training task, by transferring the
ground-truth information directly from the training data to the
test data.
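The retrieval step described above, finding training images whose content resembles the query via global image descriptors before transferring their ground truth, can be sketched as follows; the descriptor and function names are illustrative, not from the thesis:

```python
import numpy as np

def retrieve_similar(query_desc, database_descs, k=3):
    """Rank database images by cosine similarity of global image descriptors.

    query_desc: (D,) global descriptor of the query image.
    database_descs: (N, D) descriptors of the training images.
    Returns the indices of the k most similar training images.
    """
    q = query_desc / np.linalg.norm(query_desc)
    db = database_descs / np.linalg.norm(database_descs, axis=1, keepdims=True)
    sims = db @ q                        # cosine similarity of each image to the query
    return np.argsort(-sims)[:k].tolist()
```

The ground-truth labels of the retrieved images can then be transferred to the test image directly, avoiding the expensive classifier retraining the passage above describes.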