Hierarchical Scene Annotation
We present a computer-assisted annotation system, together with a labeled dataset and benchmark suite, for evaluating an algorithm’s ability to recover hierarchical scene structure. We evolve segmentation groundtruth from the two-dimensional image partition
into a tree model that captures both occlusion and object-part relationships among possibly overlapping regions. Our tree model extends the segmentation problem to encompass object detection, object-part containment, and figure-ground ordering.
We mitigate the cost of providing richer groundtruth labeling through a new web-based annotation tool with an intuitive graphical interface for rearranging the region hierarchy. Using precomputed superpixels, our tool also guides creation of user-specified regions with pixel-perfect boundaries. Widespread adoption of this human-machine combination should make the inaccuracies of bounding box labeling a relic of the past.
Evaluating the state-of-the-art in fully automatic image segmentation reveals that it produces accurate two-dimensional partitions, but does not respect groundtruth object-part structure. Our dataset and benchmark are the first to quantify these inadequacies. We illuminate recovery of rich scene structure as an important new goal for segmentation.
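The tree model described above, with object-part containment and figure-ground ordering over regions, can be sketched as a small data structure. This is a minimal illustrative sketch, not the authors' implementation; all region names are hypothetical:

```python
# A minimal sketch (not the authors' implementation) of a region tree that
# records both object-part containment (parent/child links) and
# figure-ground ordering (a depth rank among siblings).

class Region:
    def __init__(self, name, depth_rank=0):
        self.name = name                # region label, e.g. "face"
        self.depth_rank = depth_rank    # lower = closer to the viewer
        self.children = []              # object-part containment

    def add_part(self, child):
        self.children.append(child)
        return child

    def parts(self):
        """All descendant region names, depth-first."""
        out = []
        for c in self.children:
            out.append(c.name)
            out.extend(c.parts())
        return out

# Hypothetical scene: a person occluding a wall; the face is a part of the person.
scene = Region("scene")
person = scene.add_part(Region("person", depth_rank=0))  # figure (closer)
wall = scene.add_part(Region("wall", depth_rank=1))      # ground (occluded)
person.add_part(Region("face"))

print(scene.parts())  # -> ['person', 'face', 'wall']
```

The single tree thus answers both "what is a part of what" (via the parent links) and "what occludes what" (via the depth ranks among overlapping siblings).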
Cornsweet surfaces for selective contrast enhancement
A typical goal when enhancing the contrast of images is to increase the perceived contrast without altering the original feel of the
image. Such contrast enhancement can be achieved by modelling Cornsweet profiles into the image. We demonstrate that previous
methods aiming to model Cornsweet profiles for contrast enhancement, often employing the unsharp mask operator, are not robust
to image content. To achieve robustness, we propose a fundamentally different vector-centric approach with Cornsweet surfaces.
Cornsweet surfaces are parametrised 3D surfaces (2D in space, 1D in luminance enhancement) that are extruded or depressed in
the luminance dimension to create countershading that respects image structure. In contrast to previous methods, our method is
robust against the topology of the edges to be enhanced and the relative luminance across those edges. In user trials, our solution
was significantly preferred over the most related contrast enhancement method. Kosinka was funded by EPSRC grant EP/H024816/1. Lieng was funded by a scholarship from the Norwegian Government. This is the accepted manuscript; the final version is available from Elsevier at http://www.sciencedirect.com/science/article/pii/S0097849314000405
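As a rough illustration of the countershading idea, a 1D Cornsweet-style profile can be added across a step edge: luminance is raised on the bright side and lowered on the dark side, decaying away from the edge, so distant pixels are unchanged. This is a minimal sketch with assumed parameters, not the paper's surface-based method:

```python
# A 1D sketch of Cornsweet-style countershading (assumed parameters,
# not the paper's method).

import math

def cornsweet_profile(n, amplitude=0.1, decay=5.0):
    """Countershading across an edge placed at index n // 2."""
    profile = []
    for i in range(n):
        x = i - n // 2                     # signed distance from the edge
        side = 1.0 if x >= 0 else -1.0     # bright side vs dark side
        profile.append(amplitude * side * math.exp(-abs(x) / decay))
    return profile

# A step edge from 0.4 to 0.6; the profile deepens the dark side and
# brightens the bright side near the edge, leaving distant pixels untouched.
n = 64
step = [0.4 if i < n // 2 else 0.6 for i in range(n)]
enhanced = [s + p for s, p in zip(step, cornsweet_profile(n, amplitude=0.05))]
```

The paper's Cornsweet surfaces generalise this idea to 2D, shaping the countershading so that it respects image structure rather than applying a fixed filter.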
The Image Torque Operator for Mid-Level Vision: Theory and Experiment
A problem central to visual scene understanding and computer vision is to extract semantically meaningful parts of images. A visual scene consists of objects, and objects and their parts are delineated from their surroundings by closed contours. This thesis introduces a new bottom-up visual operator, called the Torque operator, which captures the concept of closed contours. Its computation is inspired by the mechanical definition of torque, or moment of force, applied to image edges. It takes edges as input and computes, over regions of different sizes, a measure of how well the edges align to form a closed, convex contour. The Torque operator is by definition scale independent and can be seen as a mid-level vision operator that captures the organizational concept of 'closure' and the grouping of edges. The thesis studies fundamental properties of the torque measure and presents experiments demonstrating that it can be made a useful tool for a variety of applications, including visual attention, segmentation, and boundary edge detection.
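The torque idea can be illustrated with a toy computation: each edge pixel contributes the 2D cross product of its displacement from a patch centre with its unit tangent, so edges circulating around the centre as a closed contour accumulate a large magnitude, while edges pointing through the centre contribute nothing. This is an illustrative sketch, not the thesis implementation:

```python
# A toy sketch of the torque idea (illustrative, not the thesis code):
# each edge pixel contributes the cross product of its displacement from a
# patch centre with its unit tangent; edges forming a closed contour around
# the centre all "rotate" the same way, giving a large magnitude.

import math

def torque(center, edges):
    """edges: list of (x, y, tx, ty) with (tx, ty) a unit tangent vector."""
    cx, cy = center
    total = 0.0
    for x, y, tx, ty in edges:
        rx, ry = x - cx, y - cy
        total += rx * ty - ry * tx   # 2D cross product r x t
    return total / max(len(edges), 1)

# Hypothetical example: tangents circulating counter-clockwise around (0, 0).
circle = []
for k in range(8):
    a = 2 * math.pi * k / 8
    circle.append((math.cos(a), math.sin(a), -math.sin(a), math.cos(a)))

print(torque((0, 0), circle))   # close to 1.0: a closed contour
# A straight edge through the centre contributes zero torque.
print(torque((0, 0), [(1, 0, 1, 0), (2, 0, 1, 0)]))
```

Scanning such a measure over patch centres and sizes yields a response map that peaks where edges enclose a region, which is the closure cue the operator exploits.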
Monocular Depth Ordering Using Occlusion Cues
This project proposes a system that relates the objects in an image using occlusion cues and arranges them according to depth. The system does not rely on a priori knowledge of the scene structure; it focuses on detecting special points, such as T-junctions and high-convexity regions, to infer the depth relationships between objects in the scene. The system makes extensive use of the Binary Partition Tree (BPT) as the segmentation tool, together with a new approach to estimating T-junction candidate points. The BPT is a bottom-up strategy in which regions are iteratively merged and grown from pixels until only one region is left. At each step, our system estimates the junction points where three regions meet. Once the BPT is constructed and pruned, this information is used for depth ordering. Since many images lack occlusion points formed by junctions, occlusion is also detected by examining convex shapes on region boundaries. Combining T-junctions and convexity leads to a system that relies only on low-level depth cues and involves no learning process, yet performs comparably to the state of the art.
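The bottom-up merging behind a BPT can be sketched as greedy pairwise region merging: repeatedly fuse the most similar adjacent pair, recording the merge order as the tree. The colour-difference criterion and region names here are assumptions for illustration, not the project's merging model:

```python
# A minimal sketch of bottom-up BPT construction (assumed similarity
# measure, not the project's code): merge the most similar adjacent pair
# until one region remains, recording the merge tree.

def build_bpt(values, edges):
    """values: {region: mean colour}; edges: iterable of (a, b) adjacent pairs."""
    values = dict(values)
    edges = {tuple(sorted(e)) for e in edges}
    merges = []                       # merge order, bottom-up
    n = 0
    while edges:
        a, b = min(edges, key=lambda e: abs(values[e[0]] - values[e[1]]))
        n += 1
        new = f"m{n}"
        values[new] = (values[a] + values[b]) / 2   # merged region model
        merges.append((a, b, new))
        # rewire adjacencies of a and b to the merged region
        nxt = set()
        for e in edges:
            e = [new if r in (a, b) else r for r in e]
            if e[0] != e[1]:
                nxt.add(tuple(sorted(e)))
        edges = nxt
        del values[a], values[b]
    return merges

# Hypothetical 4-region image: two dark regions, two bright ones, in a chain.
merges = build_bpt({"A": 0.1, "B": 0.15, "C": 0.8, "D": 0.9},
                   [("A", "B"), ("B", "C"), ("C", "D")])
print(merges)  # -> [('A', 'B', 'm1'), ('C', 'D', 'm2'), ('m1', 'm2', 'm3')]
```

In the actual system, junction points are estimated at each merge step, where three regions meet, and that information is later used for depth ordering after pruning.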
Global optimisation techniques for image segmentation with higher order models
Energy minimisation methods are one of the most successful approaches to image segmentation.
Typically used energy functions are limited to pairwise interactions due to the increased
complexity when working with higher-order functions. However, some important assumptions
about objects are not translatable to pairwise interactions. The goal of this thesis is to explore
higher order models for segmentation that are applicable to a wide range of objects. We consider:
(1) a connectivity constraint, (2) a joint model over the segmentation and the appearance,
and (3) a model for segmenting the same object in multiple images.
We start by investigating a connectivity prior, which is a natural assumption about objects.
We show how this prior can be formulated in the energy minimisation framework and explore
the complexity of the underlying optimisation problem, introducing two different algorithms for
optimisation. This connectivity prior is useful to overcome the “shrinking bias” of the pairwise
model, in particular in interactive segmentation systems.
Secondly, we consider an existing model that treats the appearance of the image segments
as variables. We show how to globally optimise this model using a Dual Decomposition technique
and show that this optimisation method outperforms existing ones.
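The Dual Decomposition idea can be illustrated on a toy problem: minimise f(x) + g(x) over a binary variable by giving each term its own copy of x and using a Lagrange multiplier, updated by subgradient steps, to push the copies to agree. The costs below are hypothetical; this is a sketch of the technique, not the thesis model:

```python
# A toy sketch of Dual Decomposition (illustrative, not the thesis model):
# minimise f(x) + g(x) over x in {0, 1} by letting each subproblem keep its
# own copy of x and coupling them through a Lagrange multiplier.

def dual_decomposition(f, g, steps=50, rate=0.5):
    lam = 0.0
    for _ in range(steps):
        # each slave solves its own (trivial) minimisation independently
        x_f = min((0, 1), key=lambda x: f[x] + lam * x)
        x_g = min((0, 1), key=lambda x: g[x] - lam * x)
        if x_f == x_g:                # copies agree: a primal solution
            return x_f
        lam += rate * (x_f - x_g)     # subgradient step on the multiplier
    return x_f

# Hypothetical unary costs: f strongly prefers x=1, g weakly prefers x=0;
# the joint optimum is x=1 (total cost 1 vs 2).
f = {0: 2.0, 1: 0.0}
g = {0: 0.0, 1: 1.0}
print(dual_decomposition(f, g))  # -> 1
```

In the thesis setting the slaves are far richer subproblems, but the mechanism is the same: independent slave optimisations coordinated by subgradient updates on the multipliers.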
Finally, we explore the current limits of the energy minimisation framework. We consider
the cosegmentation task and show that a preference for object-like segmentations is an
important addition to cosegmentation. This preference is, however, not easily encoded in the
energy minimisation framework. Instead, we use a practical proposal generation approach that
allows not only the inclusion of a preference for object-like segmentations, but also to learn the
similarity measure needed to define the cosegmentation task.
We conclude that higher order models are useful for different object segmentation tasks.
We show how some of these models can be formulated in the energy minimisation framework.
Furthermore, we introduce global optimisation methods for these energies and make extensive
use of the Dual Decomposition optimisation approach, which proves to be suitable for this type of model.
Surface modelling for 2D imagery
Vector graphics provides powerful tools for drawing scalable 2D imagery. With
the rise of mobile computers and the growing diversity of display types and image
resolutions, vector graphics is receiving an increasing amount of attention. However,
vector graphics is not the leading framework for creating and manipulating 2D imagery.
The reason for this reluctance to employ vector graphics frameworks is that it
is difficult to handle complex behaviour of colour across the 2D domain.
A challenging problem within vector graphics is to define smooth colour functions
across the image. In previous work, two approaches exist. The first approach,
known as diffusion curves, diffuses colours from a set of input curves and points.
The second approach, known as gradient meshes, defines smooth colour functions
from control meshes. These two approaches are incompatible: diffusion curves do
not support the local behaviour provided by gradient meshes and gradient meshes
do not support freeform curves as input. My research aims to narrow the gap between
diffusion curves and gradient meshes.
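The diffusion behind diffusion curves can be sketched in 1D: colours fixed at a few constrained positions are spread into the free pixels by repeated neighbour averaging, converging to a smooth harmonic colour function between the constraints. This is an illustrative sketch of the diffusion process, not the thesis's subdivision-surface method:

```python
# A minimal 1D sketch (illustrative, not the thesis method) of the diffusion
# idea behind diffusion curves: colours fixed at a few "curve" positions are
# diffused into the rest of the domain by repeatedly averaging neighbours.

def diffuse(n, constraints, iterations=2000):
    """constraints: {index: colour} held fixed while the rest diffuses."""
    colour = [0.0] * n
    for i, c in constraints.items():
        colour[i] = c
    for _ in range(iterations):
        new = colour[:]
        for i in range(1, n - 1):
            if i not in constraints:
                new[i] = 0.5 * (colour[i - 1] + colour[i + 1])
        colour = new
    return colour

# Two colour constraints at the ends; diffusion fills in a smooth ramp.
c = diffuse(11, {0: 0.0, 10: 1.0})
```

Because the colour everywhere depends on a global solve, local edits are hard to express, which is exactly the limitation the surface-based approach with colour-gradient curves is designed to overcome.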
With this aim in mind, I propose solutions to create control meshes from freeform
curves. I demonstrate that these control meshes can be used to render a vector
primitive similar to diffusion curves using subdivision surfaces. With the use of
subdivision surfaces, instead of a diffusion process, colour gradients can be locally
controlled using colour-gradient curves associated with the input curves.
The advantage of local control is further explored in the setting of vector-centric
image processing. I demonstrate that a certain contrast enhancement profile, known
as the Cornsweet profile, can be modelled via surfaces in images. This approach
does not produce the saturation artefacts associated with previous filter-based methods.
Additionally, I demonstrate various approaches to artistic filtering, where the artist
locally models given artistic effects.
Gradient meshes are restricted to rectangular topology of the control meshes. I
argue that this restriction hinders the applicability of the approach and its potential
to be used with control meshes extracted from freeform curves. To this end, I
propose a mesh-based vector primitive that supports arbitrary manifold topology of
the mesh.
Computational Models of Perceptual Organization and Bottom-up Attention in Visual and Audio-Visual Environments
Figure Ground Organization (FGO) - inferring spatial depth ordering of objects in a visual scene - involves determining which side of an occlusion boundary (OB) is figure (closer to the observer) and which is ground (further away from the observer). Attention, the process that governs how only some part of sensory information is selected for further analysis based on behavioral relevance, can be exogenous, driven by stimulus properties such as an abrupt sound or a bright flash, the processing of which is purely bottom-up; or endogenous (goal-driven or voluntary), where top-down factors such as familiarity, aesthetic quality, etc., determine attentional selection. The two main objectives of this thesis are developing computational models of: (i) FGO in visual environments; (ii) bottom-up attention in audio-visual environments.
In the visual domain, we first identify Spectral Anisotropy (SA), characterized by an anisotropic distribution of oriented high-frequency spectral power on the figure side and its absence on the ground side, as a novel FGO cue that can determine Figure/Ground (FG) relations at an OB with an accuracy exceeding 60%. Next, we show that a non-linear Support Vector Machine classifier trained on the SA features achieves an accuracy close to 70% in determining FG relations, the highest for a stand-alone local cue. We then show SA can be computed in a biologically plausible manner by pooling the complex cell responses of different scales in a specific orientation, which also achieves an accuracy of at least 60% in determining FG relations. Next, we present a biologically motivated, feed-forward model of FGO incorporating convexity, surroundedness, and parallelism as global cues and SA and T-junctions as local cues, where SA is computed in a biologically plausible manner. Each local cue, when added alone, gives a statistically significant improvement in the model's performance. The model with both local cues achieves higher accuracy than the models with individual cues in determining FG relations, indicating that SA and T-junctions are not mutually contradictory. Compared to the model with no local cues, the model with both local cues achieves at least an 8.78% improvement in determining FG relations at every border location of images in the BSDS dataset.
In the audio-visual domain, we first build a simple computational model to explain how visual search can be aided by providing concurrent, co-spatial auditory cues. Our model shows that adding a co-spatial, concurrent auditory cue can enhance the saliency of a weakly visible target among prominent visual distractors, the behavioral effect of which could be faster reaction times and/or better search accuracy. Lastly, a bottom-up, feed-forward, proto-object-based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes is presented. We demonstrate that the performance of the proto-object-based AVSM in detecting and localizing salient objects/events agrees with human judgment. In addition, we show that the AVSM, computed as a linear combination of visual and auditory feature conspicuity maps, captures a higher number of valid salient events than unisensory saliency maps.
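The linear combination of conspicuity maps can be sketched as follows; the weights, normalisation, and toy maps here are assumptions for illustration, not the thesis's parameters:

```python
# A minimal sketch (assumed weights and normalisation, not the thesis model)
# of combining visual and auditory conspicuity maps into a single
# audio-visual saliency map as a weighted linear combination.

import numpy as np

def normalise(m):
    """Scale a conspicuity map to [0, 1]."""
    m = m - m.min()
    return m / m.max() if m.max() > 0 else m

def avsm(visual, auditory, w_v=0.5, w_a=0.5):
    return w_v * normalise(visual) + w_a * normalise(auditory)

# Hypothetical maps: a visual hotspot top-left, an auditory one bottom-right.
visual = np.zeros((4, 4)); visual[0, 0] = 1.0
auditory = np.zeros((4, 4)); auditory[3, 3] = 2.0
s = avsm(visual, auditory)
```

A unisensory map would keep only one of the two hotspots; the combined map retains both, which is the sense in which the AVSM captures more valid salient events.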
Monocular depth estimation in images and sequences using occlusion cues
When humans observe a scene, they are able to distinguish perfectly the different parts composing it. Moreover, humans can easily reconstruct the spatial position of these parts and conceive a consistent structure. The mechanisms involved in visual perception have been studied since the beginning of neuroscience but, still today, not all of its component processes are known.
In usual situations, humans can make use of three different methods to estimate the scene structure. The first is the so-called divergence, which makes use of both eyes. When objects lie in front of the observer at distances of up to a hundred meters, subtle differences in the image formed in each eye can be used to determine depth. When objects are not in the field of view of both eyes, other mechanisms must be used. In these cases, both visual cues and previously learned information can be used to determine depth. Even though these mechanisms are less accurate than divergence, humans can almost always infer the correct depth structure when using them. As examples of visual cues, occlusion, perspective, and object size provide a lot of information about the structure of the scene. A priori information depends on each observer, but it is normally used subconsciously by humans to detect commonly known regions such as the sky, the ground, or different types of objects.
In recent years, as technology has become able to handle the processing burden of vision systems, much effort has been devoted to designing automated scene-interpretation systems. In this thesis we address the problem of depth estimation using only one point of view and only occlusion depth cues. The objective is to detect occlusions present in the scene and combine them with a segmentation system so as to generate a relative depth-order map of the scene. We explore both static and dynamic situations: single images, single frames within sequences, and full video sequences. Where a full image sequence is available, a system exploiting motion information to recover the depth structure is also designed. Results are promising and competitive with respect to the state-of-the-art literature, but there is still much room for improvement compared to human depth perception performance.