On the Role of Context at Different Scales in Scene Parsing
Scene parsing can be formulated as a labeling problem where each
visual data element, e.g., each pixel of an image or each 3D
point in a point cloud, is assigned a semantic class label. One
can approach this problem by training a classifier that predicts a
class label for each data element based purely on its local
properties. This approach, however, does not take into account
any kind of contextual information between different elements in
the image or point cloud. For example, in an application where we
are interested in labeling roadside objects, the fact that most
of the utility poles are connected to some power wires can be
very helpful in disambiguating them from other similar-looking
classes. The recurrence of certain class combinations is another
useful contextual hint, since such combinations are very likely to
co-occur again. These forms of high-level contextual
information are often formulated using pairwise and higher-order
Conditional Random Fields (CRFs). A CRF is a probabilistic
graphical model that encodes the contextual relationships between
the data elements in a scene. In this thesis, we study the
potential of contextual information at different scales (ranges)
in scene parsing problems.
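As an illustration (the notation here is ours, not taken from the
thesis), a pairwise CRF over a labeling y = (y_1, ..., y_n) is
typically defined through a Gibbs energy of the form

    E(y) = \sum_i \psi_u(y_i) + \sum_{(i,j) \in \mathcal{E}} \psi_p(y_i, y_j),

where \psi_u is a unary potential derived from a local classifier,
\psi_p is a pairwise potential encoding the compatibility of the
labels of neighboring elements i and j, and \mathcal{E} is the set
of neighboring pairs; parsing then amounts to (approximately)
minimizing E(y).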
First, we propose a model that utilizes the local context of the
scene via a pairwise CRF. Our model acquires the contextual
interactions between different classes by assessing their
misclassification rates when only the local properties of the data
are used. In other words, no extra training is required to obtain
the class-interaction information.
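As a concrete sketch of this idea (our illustration: the function
name, the symmetrization, and the negative-log mapping are
assumptions, not the thesis recipe), the misclassification rates of
a local classifier could be turned into pairwise potentials as
follows:

    import numpy as np

    def pairwise_potentials_from_confusion(confusion, eps=1e-6):
        """Map a local classifier's confusion matrix to pairwise CRF
        penalties. confusion[a, b] counts how often true class a was
        predicted as class b using local properties alone."""
        # Row-normalize to get P(predicted = b | true = a).
        p = confusion / (confusion.sum(axis=1, keepdims=True) + eps)
        # Classes that are often confused are treated as compatible
        # neighbors, i.e., they receive a low pairwise penalty.
        affinity = 0.5 * (p + p.T)         # symmetrize
        penalty = -np.log(affinity + eps)  # low energy = compatible
        np.fill_diagonal(penalty, 0.0)     # no cost for equal labels
        return penalty

The appeal of such a construction is that the interaction matrix
falls out of statistics that are available anyway, so no extra
training pass is needed.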
Next, we expand the field of view of the context from a local
range to a longer one, and make use of higher-order models to encode more
complex contextual cues. More specifically, we introduce a new
model to employ geometric higher-order terms in a CRF for
semantic labeling of 3D point cloud data.
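Schematically (again in our own notation), such a model augments
the pairwise energy with terms defined over larger cliques c of
points:

    E(y) = \sum_i \psi_u(y_i) + \sum_{(i,j) \in \mathcal{E}} \psi_p(y_i, y_j) + \sum_{c \in \mathcal{C}} \psi_h(y_c),

where each higher-order potential \psi_h scores the joint labeling
y_c of a whole clique, for example by favoring consistent labels on
points that belong to the same geometric structure, such as a
planar segment.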
Despite the ability of the above models to capture the contextual
cues in the scene, there are higher-level context cues that cannot
be encoded via pairwise and higher-order CRFs. For
instance, a vehicle is very unlikely to appear in a sea scene,
while buildings are frequently observed in a street scene. Such
information is best described by the scene context and is modeled
using global image descriptors. In particular, through an image
retrieval procedure, we find images whose content is similar to
that of the query image, and use them for scene parsing.
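As a rough sketch (the descriptor choice and the function name are
our illustrative assumptions), such retrieval can be as simple as a
nearest-neighbor search over global descriptors:

    import numpy as np

    def retrieve_similar_scenes(query_desc, train_descs, k=5):
        """Return the indices of the k training images whose global
        descriptors (one vector per image) lie closest to the
        query's descriptor in Euclidean distance."""
        dists = np.linalg.norm(train_descs - query_desc, axis=1)
        return np.argsort(dists)[:k]

The labels observed in the retrieved images can then bias the
parser toward classes that plausibly occur in that scene type.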
Another limitation of the above methods is that they rely on a
computationally expensive training process for classifying the
data elements from their local properties, a process that needs to
be repeated every time the training data is modified. We address
this issue by proposing a fast and efficient approach that spares
us the cumbersome training task by transferring the ground-truth
information directly from the training data to the test data.
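A minimal sketch of such nonparametric label transfer (our
illustration of the general idea; the feature matching and the
majority vote are assumptions, not the exact procedure):

    import numpy as np
    from collections import Counter

    def transfer_labels(test_feats, train_feats, train_labels, k=7):
        """Label each test element by a majority vote over the
        labels of its k nearest training elements in local-feature
        space; no classifier is ever trained."""
        out = np.empty(len(test_feats), dtype=train_labels.dtype)
        for i, feat in enumerate(test_feats):
            dists = np.linalg.norm(train_feats - feat, axis=1)
            nearest = np.argsort(dists)[:k]
            out[i] = Counter(train_labels[nearest]).most_common(1)[0][0]
        return out

Because no model is fit, modifying the training data only changes
the pool of candidate neighbors, so no retraining is required.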