Exploring Context with Deep Structured models for Semantic Segmentation
State-of-the-art semantic image segmentation methods are mostly based on
training deep convolutional neural networks (CNNs). In this work, we propose to
improve semantic segmentation with the use of contextual information. In
particular, we explore `patch-patch' context and `patch-background' context in
deep CNNs. We formulate deep structured models by combining CNNs and
Conditional Random Fields (CRFs) for learning the patch-patch context between
image regions. Specifically, we formulate CNN-based pairwise potential
functions to capture semantic correlations between neighboring patches.
Efficient piecewise training of the proposed deep structured model is then
applied in order to avoid repeated expensive CRF inference during the course of
back propagation. For capturing the patch-background context, we show that a
network design with traditional multi-scale image inputs and sliding pyramid
pooling is very effective for improving performance. We perform comprehensive
evaluation of the proposed method. We achieve new state-of-the-art performance
on a number of challenging semantic segmentation datasets including […].
Particularly, we report an intersection-over-union score of […] on the […] dataset.
Comment: 16 pages. Accepted to IEEE T. Pattern Analysis & Machine
Intelligence, 2017. Extended version of arXiv:1504.0101
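The intersection-over-union score mentioned in this abstract can be illustrated with a minimal sketch. It assumes flat lists of integer class labels and per-class averaging; the function names and toy data are hypothetical and are not taken from the paper or its evaluation code.

```python
def intersection_over_union(pred, truth, label):
    """IoU for one class over two equally sized flat label sequences."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    # Pixels that both maps assign to the class.
    intersection = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    # Pixels that at least one map assigns to the class.
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    # A class absent from both maps has no defined IoU.
    return intersection / union if union else None


def mean_iou(pred, truth, labels):
    """Average the per-class IoU over the classes that actually occur."""
    scores = [intersection_over_union(pred, truth, c) for c in labels]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


# Toy example (hypothetical data): six pixels, three classes.
pred = [0, 0, 1, 1, 1, 2]
truth = [0, 1, 1, 1, 2, 2]
print(mean_iou(pred, truth, [0, 1, 2]))
```

Benchmarks usually accumulate the intersection and union counts over the whole dataset before dividing, rather than averaging per image; the sketch handles a single label map only.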
Line Based Multi-Range Asymmetric Conditional Random Field For Terrestrial Laser Scanning Data Classification
Terrestrial Laser Scanning (TLS) is a ground-based, active imaging method that rapidly acquires accurate, highly dense three-dimensional point clouds of object surfaces by laser range finding. To fully utilize its benefits, a robust method for classifying many objects of interest from huge amounts of laser point cloud data is urgently required. However, classifying massive TLS data faces many challenges, such as complex urban scenes and partial data acquisition due to occlusion. To achieve automatic, accurate and robust TLS data classification, we present a line-based multi-range asymmetric Conditional Random Field algorithm.
The first contribution is a line-based TLS data classification method. In this thesis, we are interested in seven classes: building, roof, pedestrian road (PR), tree, low man-made object (LMO), vehicle road (VR), and low vegetation (LV). The line-based classification is implemented in each scan profile, which follows the line-profiling nature of the laser scanning mechanism. Ten conventional local classifiers are tested, including popular generative and discriminative classifiers, and experimental results validate that the line-based method can achieve satisfactory classification performance. However, local classifiers label each line independently of its neighborhood, so their inference often suffers from similar local appearance across different object classes. The second contribution is a multi-range asymmetric Conditional Random Field (maCRF) model, which uses object context as a post-classification step to improve the performance of a local generative classifier. The maCRF incorporates appearance, a local smoothness constraint, and global scene layout regularity into a single probabilistic graphical model. The local smoothness term encourages lines in a local area to share the same class label, while the scene layout term favours an asymmetric regularity in the long-range spatial arrangement of different object classes, considered in both the vertical (above-below) and horizontal (front-behind) directions. The asymmetric regularity captures the directional spatial arrangement between pairs of objects (e.g. the ground may lie below a building, but not vice versa). The third contribution extends the maCRF model by adding context across scan profiles, giving the Across scan profile Multi-range Asymmetric Conditional Random Field (amaCRF) model.
Due to the sweeping nature of laser scanning, the sequentially acquired TLS data have strong spatial dependency, and the context across scan profiles can provide more contextual information. The final contribution is a sequential classification strategy. Along the sweeping direction of laser scanning, amaCRF models are sequentially constructed. By dynamically updating the posterior probability of common scan profiles, contextual information propagates through adjacent scan profiles.
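The idea of an asymmetric layout term can be sketched as a cost table indexed by an ordered pair of labels, so that swapping the two labels changes the cost. The cost values, the three-class subset, and the function names below are hypothetical illustrations. They are not the potentials learned in the thesis, and the sketch covers only the vertical layout term, not appearance or smoothness.

```python
# Hypothetical costs for an ordered pair (label_below, label_above).
# Expected vertical orderings cost nothing; the reversed orderings are penalised.
VERTICAL_COST = {
    ("road", "building"): 0.0,
    ("road", "roof"): 0.0,
    ("building", "roof"): 0.0,
    ("building", "road"): 2.0,
    ("roof", "road"): 2.0,
    ("roof", "building"): 2.0,
}


def vertical_cost(below, above):
    """Asymmetric pairwise cost: cost(a, b) need not equal cost(b, a)."""
    if below == above:
        return 0.0
    # Pairs missing from the table get an assumed neutral cost.
    return VERTICAL_COST.get((below, above), 1.0)


def layout_energy(labels_bottom_to_top):
    """Sum the asymmetric cost over all ordered pairs, not just adjacent ones."""
    energy = 0.0
    n = len(labels_bottom_to_top)
    for i in range(n):
        for j in range(i + 1, n):
            energy += vertical_cost(labels_bottom_to_top[i], labels_bottom_to_top[j])
    return energy


print(layout_energy(["road", "building", "roof"]))  # plausible layout
print(layout_energy(["roof", "building", "road"]))  # implausible layout
```

Summing over all pairs rather than only neighbouring lines is one simple way to express a long-range relation; a symmetric smoothness term could not tell the two orderings apart.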
Spatial and temporal background modelling of non-stationary visual scenes
PhD thesis. The prevalence of electronic imaging systems in everyday life has become increasingly apparent
in recent years. Applications are to be found in medical scanning, automated manufacture, and
perhaps most significantly, surveillance. Metropolitan areas, shopping malls, and road traffic
management all employ and benefit from an unprecedented quantity of video cameras for monitoring
purposes. But the high cost and limited effectiveness of employing humans as the final
link in the monitoring chain has driven scientists to seek solutions based on machine vision techniques.
Whilst the field of machine vision has enjoyed consistent rapid development in the last
20 years, some of the most fundamental issues still remain to be solved in a satisfactory manner.
Central to a great many vision applications is the concept of segmentation, and in particular,
most practical systems perform background subtraction as one of the first stages of video
processing. This involves separation of ‘interesting foreground’ from the less informative but
persistent background. But the definition of what is ‘interesting’ is somewhat subjective, and
liable to be application specific. Furthermore, the background may be interpreted as including
the visual appearance of normal activity of any agents present in the scene, human or otherwise.
Thus a background model might be called upon to absorb lighting changes, moving trees and
foliage, or normal traffic flow and pedestrian activity, in order to effect what might be termed in
‘biologically-inspired’ vision as pre-attentive selection. This challenge is one of the Holy Grails
of the computer vision field, and consequently the subject has received considerable attention.
This thesis sets out to address some of the limitations of contemporary methods of background
segmentation by investigating methods of inducing local mutual support amongst pixels
in three starkly contrasting paradigms: (1) locality in the spatial domain, (2) locality in the short-term
time domain, and (3) locality in the domain of cyclic repetition frequency.
Conventional per pixel models, such as those based on Gaussian Mixture Models, offer no
spatial support between adjacent pixels at all. At the other extreme, eigenspace models impose
a structure in which every image pixel bears the same relation to every other pixel. But Markov
Random Fields permit definition of arbitrary local cliques by construction of a suitable graph, and
are used here to facilitate a novel structure capable of exploiting probabilistic local co-occurrence
of adjacent Local Binary Patterns. The result is a method exhibiting strong sensitivity to multiple
learned local pattern hypotheses, whilst relying solely on monochrome image data.
Many background models enforce temporal consistency constraints on a pixel in an attempt to
confirm background membership before being accepted as part of the model, and typically some
control over this process is exercised by a learning rate parameter. But in busy scenes, a true
background pixel may be visible for a relatively small fraction of the time and in a temporally
fragmented fashion, thus hindering such background acquisition. However, support in terms of
temporal locality may still be achieved by using Combinatorial Optimization to derive short-term
background estimates which induce a similar consistency, but are considerably more robust
to disturbance. A novel technique is presented here in which the short-term estimates act as
‘pre-filtered’ data from which a far more compact eigen-background may be constructed.
Many scenes entail elements exhibiting repetitive periodic behaviour. Some road junctions
employing traffic signals are among these, yet little is to be found amongst the literature regarding
the explicit modelling of such periodic processes in a scene. Previous work focussing on gait
recognition has demonstrated approaches based on recurrence of self-similarity by which local
periodicity may be identified. The present work harnesses and extends this method in order
to characterize scenes displaying multiple distinct periodicities by building a spatio-temporal
model. The model may then be used to highlight abnormality in scene activity. Furthermore, a
Phase Locked Loop technique with a novel phase detector is detailed, enabling such a model to
maintain correct synchronization with scene activity in spite of noise and drift of periodicity.
This thesis contends that these three approaches are all manifestations of the same broad
underlying concept: local support in each of the space, time and frequency domains, and furthermore,
that the support can be harnessed practically, as will be demonstrated experimentally.
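For readers unfamiliar with the Local Binary Patterns mentioned in this abstract, the basic 3x3 operator can be sketched as follows. The neighbour ordering, the `>=` comparison, and the function name are assumptions for illustration. The thesis's Markov Random Field over co-occurring patterns is not reproduced here.

```python
# Offsets (row, col) of the 8 neighbours, clockwise from the top-left corner.
# This ordering is an assumed convention; others are equally valid.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Basic 8-bit Local Binary Pattern of an interior pixel of a grey-level image."""
    centre = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # Set the bit when the neighbour is at least as bright as the centre.
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


# Toy 3x3 monochrome patch (hypothetical values).
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]

# Adding a constant to every pixel leaves the comparisons unchanged.
brighter = [[value + 50 for value in line] for line in patch]

print(lbp_code(patch, 1, 1), lbp_code(brighter, 1, 1))
```

Because the code depends only on the ordering of grey levels around the centre, it is unchanged by a uniform brightness offset, which is one reason such patterns suit monochrome background modelling under lighting changes.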
An attention model and its application in man-made scene interpretation
The ultimate aim of research into computer vision is to design a system which interprets
its surrounding environment in the way a human does effortlessly. However, the
state of technology is far from achieving such a goal. In this thesis different components of
a computer vision system that are designed for the task of interpreting man-made scenes,
in particular images of buildings, are described. The flow of information in the proposed
system is bottom-up i.e., the image is first segmented into its meaningful components and
subsequently the regions are labelled using a contextual classifier.
Starting from simple observations concerning the human vision system and the gestalt laws
of human perception, like the law of “good (simple) shape” and “perceptual grouping”, a
blob detector is developed that identifies components in a 2D image. These components
are convex regions of interest, with interest being defined as significant gradient magnitude
content. An eye-tracking experiment is conducted, which shows that the regions identified
by the blob detector correlate significantly with the regions which drive the attention of
viewers.
Having identified these blobs, it is postulated that a blob represents an object, linguistically
identified with its own semantic name. In other words, a blob may contain a window, a
door or a chimney in a building. These regions are used to identify and segment higher
order structures in a building, like facade, window array and also environmental regions
like sky and ground.
Because of inconsistency in the unary features of buildings, a contextual learning algorithm
is used to classify the segmented regions. A model that learns spatial and topological
relationships between different objects from a set of hand-labelled data is used. This
model utilises this information in an MRF to achieve consistent labellings of new scenes.
Activity Analysis; Finding Explanations for Sets of Events
Automatic activity recognition is the computational process of analysing visual input and reasoning about detections to understand the performed events. In all but the simplest scenarios, an activity involves multiple interleaved events, some related and others independent. The activity in a car park or at a playground would typically include many events. This research assumes the possible events and any constraints between the events can be defined for the given scene. Analysing the activity should thus recognise a complete and consistent set of events; this is referred to as a global explanation of the activity. By seeking a global explanation that satisfies the activity’s constraints, infeasible interpretations can be avoided, and ambiguous observations may be resolved.
An activity’s events and any natural constraints are defined using a grammar formalism. Attribute Multiset Grammars (AMG) are chosen because they allow defining hierarchies, as well as attribute rules and constraints. When used for recognition, detectors are employed to gather a set of detections. Parsing the set of detections by the AMG provides a global explanation. To find the best parse tree given a set of detections, a Bayesian network models the probability distribution over the space of possible parse trees. Heuristic and exhaustive search techniques are proposed to find the maximum a posteriori global explanation.
The framework is tested on two activities: the activity at a bicycle rack and the activity around a building entrance. The first case study involves people locking bicycles onto a bicycle rack and picking them up later. The best global explanation for all detections gathered during the day resolves local ambiguities from occlusion or clutter. Intensive testing on 5 full days showed that global analysis achieves higher recognition rates. The second case study tracks people and any objects they are carrying as they enter and exit a building entrance. A complete sequence of the person entering and exiting multiple times is recovered by the global explanation.
Patch-based semantic labelling of images.
PhD thesis. The work presented in this thesis focuses on associating semantics
with the content of an image, linking the content to high-level
semantic categories. The process can take place at two levels: either
at image level, towards image categorisation, or at pixel level, in semantic
segmentation or semantic labelling. To this end, an analysis
framework is proposed, and the different steps of part (or patch) extraction,
description and probabilistic modelling are detailed. Parts of
different nature are used, and one of the contributions is a method to
complement the information associated with them. Context for parts has to
be considered at different scales. Short-range pixel dependences are accounted
for by associating pixels with larger patches. A Conditional Random
Field, that is, a probabilistic discriminative graphical model, is used
to model medium range dependences between neighbouring patches.
Another contribution is an efficient method to consider rich neighbourhoods
without having loops in the inference graph. To this end, weak
neighbours are introduced, that is, neighbours whose label probability
distribution is pre-estimated rather than mutable during the inference.
Longer range dependences, that tend to make the inference problem
intractable, are addressed as well. A novel descriptor based on local
histograms of visual words has been proposed, meant to both complement
the feature descriptor of the patches and augment the context
awareness in the patch labelling process. Finally, an alternative approach
to consider multiple scales in a hierarchical framework based
on image pyramids is proposed. An image pyramid is a compositional
representation of the image based on hierarchical clustering. All the
presented contributions are extensively detailed throughout the thesis,
and experimental results obtained on publicly available datasets are
reported to assess their validity. A critical comparison with the state
of the art in this research area is also presented, and the advantages of
adopting the proposed improvements are clearly highlighted.
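A local histogram of visual words, as named in this abstract, can be sketched as a normalised count of word indices in a window around a patch. The grid layout, the square window, the radius parameter, and the function name are assumptions for illustration. The thesis's actual descriptor may differ in all of these.

```python
from collections import Counter


def local_word_histogram(word_map, row, col, radius, vocabulary_size):
    """Normalised histogram of visual-word indices in a square window.

    word_map is a 2D grid in which each cell holds the index of the visual
    word assigned to that patch. The window is clipped at the grid borders.
    """
    counts = Counter()
    for r in range(max(0, row - radius), min(len(word_map), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(word_map[0]), col + radius + 1)):
            counts[word_map[r][c]] += 1
    total = sum(counts.values())
    # One bin per vocabulary entry; words absent from the window get 0.0.
    return [counts[word] / total for word in range(vocabulary_size)]


# Toy 3x3 grid of visual-word indices from a 3-word vocabulary (hypothetical).
word_map = [[0, 0, 1],
            [0, 2, 1],
            [2, 2, 1]]

print(local_word_histogram(word_map, 0, 0, 1, 3))
```

Such a histogram summarises what surrounds a patch without adding edges to the inference graph, which is how a descriptor of this kind can bring in context while keeping inference tractable.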