10 research outputs found

    Hierarchical Scene Annotation

    We present a computer-assisted annotation system, together with a labeled dataset and benchmark suite, for evaluating an algorithm’s ability to recover hierarchical scene structure. We evolve segmentation groundtruth from the two-dimensional image partition into a tree model that captures both occlusion and object-part relationships among possibly overlapping regions. Our tree model extends the segmentation problem to encompass object detection, object-part containment, and figure-ground ordering. We mitigate the cost of providing richer groundtruth labeling through a new web-based annotation tool with an intuitive graphical interface for rearranging the region hierarchy. Using precomputed superpixels, our tool also guides creation of user-specified regions with pixel-perfect boundaries. Widespread adoption of this human-machine combination should make the inaccuracies of bounding-box labeling a relic of the past. Evaluating the state of the art in fully automatic image segmentation reveals that it produces accurate two-dimensional partitions, but does not respect groundtruth object-part structure. Our dataset and benchmark are the first to quantify these inadequacies. We illuminate recovery of rich scene structure as an important new goal for segmentation.
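    The occlusion and object-part relations described above can be held in a small tree structure. The sketch below is a hypothetical illustration (names such as `RegionNode` and `depth_rank` are ours, not the paper's): each node stores its child parts plus a figure-ground rank among siblings, and a pre-order walk visits parts front-to-back.

```python
class RegionNode:
    """One region in a hypothetical hierarchical scene annotation."""

    def __init__(self, label, depth_rank=0):
        self.label = label            # region/object identity
        self.depth_rank = depth_rank  # figure-ground order among siblings (0 = frontmost)
        self.parts = []               # child regions: object-part containment

    def add_part(self, child):
        self.parts.append(child)
        return child

    def preorder(self):
        # Visit this region, then its parts front-to-back by depth rank.
        yield self
        for p in sorted(self.parts, key=lambda n: n.depth_rank):
            yield from p.preorder()


# Toy scene: a person (figure) in front of a wall (ground); the face is a part.
scene = RegionNode("scene")
person = scene.add_part(RegionNode("person", depth_rank=0))
wall = scene.add_part(RegionNode("wall", depth_rank=1))
person.add_part(RegionNode("face"))
labels = [n.label for n in scene.preorder()]
```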

    Cornsweet surfaces for selective contrast enhancement

    A typical goal when enhancing the contrast of images is to increase the perceived contrast without altering the original feel of the image. Such contrast enhancement can be achieved by modelling Cornsweet profiles into the image. We demonstrate that previous methods aiming to model Cornsweet profiles for contrast enhancement, often employing the unsharp mask operator, are not robust to image content. To achieve robustness, we propose a fundamentally different vector-centric approach with Cornsweet surfaces. Cornsweet surfaces are parametrised 3D surfaces (2D in space, 1D in luminance enhancement) that are extruded or depressed in the luminance dimension to create countershading that respects image structure. In contrast to previous methods, our method is robust against the topology of the edges to be enhanced and the relative luminance across those edges. In user trials, our solution was significantly preferred over the most related contrast enhancement method. Kosinka was funded by EPSRC grant EP/H024816/1. Lieng was funded by a scholarship from the Norwegian Government. This is the accepted manuscript; the final version is available from Elsevier at http://www.sciencedirect.com/science/article/pii/S0097849314000405
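    As a rough illustration of the unsharp-mask-style countershading that the abstract argues is not content-robust, the 1D sketch below (our own, not the paper's method) subtracts a box-blurred copy of a signal, producing Cornsweet-like overshoot and undershoot around a step edge:

```python
import numpy as np

def countershade_1d(signal, gain=0.2, width=5):
    """Unsharp-mask countershading: boost the difference to a blurred copy."""
    kernel = np.ones(width) / width                  # box blur
    blurred = np.convolve(signal, kernel, mode="same")
    return signal + gain * (signal - blurred)        # overshoot near edges

# A step edge: 0 on the left, 1 on the right.
edge = np.concatenate([np.zeros(20), np.ones(20)])
enhanced = countershade_1d(edge)
# Near the step the result exceeds 1 on the bright side and dips below 0
# on the dark side, the classic Cornsweet countershading profile.
```

    A Cornsweet surface, by contrast, shapes this countershading as a parametrised surface that follows the image's edge structure rather than a fixed filter response.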

    The Image Torque Operator for Mid-Level Vision: Theory and Experiment

    A problem central to visual scene understanding and computer vision is extracting semantically meaningful parts of images. A visual scene consists of objects, and the objects and their parts are delineated from their surroundings by closed contours. This thesis introduces a new bottom-up visual operator, called the Torque operator, which captures the concept of closed contours. Its computation is inspired by the mechanical definition of torque, or moment of force, applied to image edges. It takes edges as input and computes, over regions of different sizes, a measure of how well the edges align to form a closed, convex contour. The Torque operator is by definition scale independent and can be seen as a mid-level vision operator that captures the organizational concept of 'closure' and the grouping mechanism of edges. This thesis studies fundamental properties of the torque measure and presents experiments demonstrating that it is a useful tool for a variety of applications, including visual attention, segmentation, and boundary edge detection.
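    The torque computation can be sketched schematically. In the simplified illustration below (our own; the thesis normalizes by region area, whereas here we simply average), each edge element contributes the z-component of the cross product between its displacement from the patch center and its unit tangent, so edges tracing a closed contour around the center add up coherently:

```python
import math

def patch_torque(edges, center):
    """Schematic torque measure over one patch (normalization simplified).

    Each edge element is (x, y, theta), theta being the edge tangent
    direction. Closed contours around the center give same-sign terms.
    """
    cx, cy = center
    total = 0.0
    for x, y, theta in edges:
        dx, dy = x - cx, y - cy
        tx, ty = math.cos(theta), math.sin(theta)
        total += dx * ty - dy * tx  # z-component of r x t
    return total / max(len(edges), 1)

# Edge elements tracing a small counter-clockwise square around the origin:
square = [(1, 0, math.pi / 2), (0, 1, math.pi),
          (-1, 0, -math.pi / 2), (0, -1, 0.0)]
tau = patch_torque(square, (0.0, 0.0))  # every element contributes +1
```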

    Monocular Depth Ordering Using Occlusion Cues

    This project proposes a system that relates the objects in an image using occlusion cues and arranges them according to depth. The system does not rely on a priori knowledge of the scene structure; it focuses on detecting special points, such as T-junctions and high-convexity regions, to infer the depth relationships between objects in the scene. The system makes extensive use of the Binary Partition Tree (BPT) as the segmentation tool, jointly with a new approach for T-junction candidate point estimation. In a BPT approach, as a bottom-up strategy, regions are iteratively merged and grown from pixels until only one region is left. At each step, our system estimates the junction points where three regions meet. Once the BPT is constructed and the pruning is performed, this information is used for depth ordering. Since many images may not have occlusion points formed by junctions, occlusion is also detected by examining convex shapes on region boundaries. Combining T-junctions and convexity leads to a system that relies only on low-level depth cues and does not involve any learning process, yet it performs comparably to the state of the art.
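    Once pairwise occlusion relations have been extracted from T-junctions or boundary convexity, turning them into a depth ordering amounts to a topological sort of the occlusion graph. A minimal sketch (hypothetical names, not the authors' code):

```python
from collections import defaultdict, deque

def depth_order(regions, occludes):
    """Front-to-back ordering from pairwise occlusion relations.

    occludes: list of (a, b) pairs meaning region a occludes (is in
    front of) region b, e.g. inferred from T-junctions or convexity.
    """
    indeg = {r: 0 for r in regions}
    adj = defaultdict(list)
    for a, b in occludes:
        adj[a].append(b)
        indeg[b] += 1
    queue = deque(r for r in regions if indeg[r] == 0)  # frontmost first
    order = []
    while queue:
        r = queue.popleft()
        order.append(r)
        for nxt in adj[r]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order  # incomplete if the occlusion graph contains cycles

order = depth_order(["cup", "table", "wall"],
                    [("cup", "table"), ("table", "wall")])
```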

    Global optimisation techniques for image segmentation with higher order models

    Energy minimisation methods are one of the most successful approaches to image segmentation. Typically used energy functions are limited to pairwise interactions due to the increased complexity when working with higher-order functions. However, some important assumptions about objects are not translatable to pairwise interactions. The goal of this thesis is to explore higher order models for segmentation that are applicable to a wide range of objects. We consider: (1) a connectivity constraint, (2) a joint model over the segmentation and the appearance, and (3) a model for segmenting the same object in multiple images. We start by investigating a connectivity prior, which is a natural assumption about objects. We show how this prior can be formulated in the energy minimisation framework and explore the complexity of the underlying optimisation problem, introducing two different algorithms for optimisation. This connectivity prior is useful to overcome the “shrinking bias” of the pairwise model, in particular in interactive segmentation systems. Secondly, we consider an existing model that treats the appearance of the image segments as variables. We show how to globally optimise this model using a Dual Decomposition technique and show that this optimisation method outperforms existing ones. Finally, we explore the current limits of the energy minimisation framework. We consider the cosegmentation task and show that a preference for object-like segmentations is an important addition to cosegmentation. This preference is, however, not easily encoded in the energy minimisation framework. Instead, we use a practical proposal generation approach that allows not only the inclusion of a preference for object-like segmentations, but also to learn the similarity measure needed to define the cosegmentation task. We conclude that higher order models are useful for different object segmentation tasks. 
We show how some of these models can be formulated in the energy minimisation framework. Furthermore, we introduce global optimisation methods for these energies and make extensive use of the Dual Decomposition optimisation approach, which proves to be well suited to this type of model.
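    For concreteness, the pairwise baseline that these higher-order priors extend can be written as a data term plus a Potts smoothness term. The toy sketch below (our own illustration) evaluates such an energy on a three-pixel chain; priors such as connectivity cannot be decomposed into a sum of terms of this form:

```python
def energy(labels, unary, edges, lam=1.0):
    """Pairwise (Potts) segmentation energy: data term + smoothness term."""
    data = sum(unary[i][labels[i]] for i in range(len(labels)))
    # Pay lam for every neighbouring pair assigned different labels.
    smooth = sum(lam for i, j in edges if labels[i] != labels[j])
    return data + smooth

unary = [[0.0, 2.0], [0.0, 2.0], [2.0, 0.0]]  # cost of labels {0, 1} per pixel
edges = [(0, 1), (1, 2)]                      # 3-pixel chain neighbourhood
e = energy([0, 0, 1], unary, edges, lam=0.5)  # one label change on edge (1, 2)
```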

    Computational Models of Perceptual Organization and Bottom-up Attention in Visual and Audio-Visual Environments

    Figure Ground Organization (FGO) - inferring spatial depth ordering of objects in a visual scene - involves determining which side of an occlusion boundary (OB) is figure (closer to the observer) and which is ground (further away from the observer). Attention, the process that governs how only some part of sensory information is selected for further analysis based on behavioral relevance, can be exogenous, driven by stimulus properties such as an abrupt sound or a bright flash, the processing of which is purely bottom-up; or endogenous (goal-driven or voluntary), where top-down factors such as familiarity, aesthetic quality, etc., determine attentional selection. The two main objectives of this thesis are developing computational models of: (i) FGO in visual environments; (ii) bottom-up attention in audio-visual environments. In the visual domain, we first identify Spectral Anisotropy (SA), characterized by anisotropic distribution of oriented high-frequency spectral power on the figure side and lack of it on the ground side, as a novel FGO cue that can determine Figure/Ground (FG) relations at an OB with an accuracy exceeding 60%. Next, we show a non-linear Support Vector Machine (SVM)-based classifier trained on the SA features achieves an accuracy close to 70% in determining FG relations, the highest for a stand-alone local cue. We then show SA can be computed in a biologically plausible manner by pooling the complex cell responses of different scales in a specific orientation, which also achieves an accuracy greater than or equal to 60% in determining FG relations. Next, we present a biologically motivated, feed-forward model of FGO incorporating convexity, surroundedness, and parallelism as global cues and SA and T-junctions as local cues, where SA is computed in a biologically plausible manner. Each local cue, when added alone, gives a statistically significant improvement in the model's performance. 
The model with both local cues achieves higher accuracy than models with individual cues in determining FG relations, indicating SA and T-junctions are not mutually contradictory. Compared to the model with no local cues, the model with both local cues achieves a greater than or equal to 8.78% improvement in determining FG relations at every border location of images in the BSDS dataset. In the audio-visual domain, we first build a simple computational model to explain how visual search can be aided by providing concurrent, co-spatial auditory cues. Our model shows that adding a co-spatial, concurrent auditory cue can enhance the saliency of a weakly visible target among prominent visual distractors, the behavioral effect of which could be faster reaction time and/or better search accuracy. Lastly, a bottom-up, feed-forward, proto-object-based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes is presented. We demonstrate that the performance of the proto-object-based AVSM in detecting and localizing salient objects/events is in agreement with human judgment. In addition, we show the AVSM computed as a linear combination of visual and auditory feature conspicuity maps captures a higher number of valid salient events compared to unisensory saliency maps.
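    The linear fusion behind the AVSM can be sketched as follows (a minimal illustration with assumed min-max normalization and fixed weights; the actual proto-object feature and conspicuity computation is far richer):

```python
import numpy as np

def audiovisual_saliency(visual_map, auditory_map, wv=0.4, wa=0.6):
    """Linear fusion of unisensory conspicuity maps into one saliency map."""
    def norm(m):
        # Min-max normalize each map to [0, 1] before weighting.
        m = m - m.min()
        rng = m.max()
        return m / rng if rng > 0 else m
    return wv * norm(visual_map) + wa * norm(auditory_map)

v = np.array([[0.0, 2.0], [4.0, 0.0]])  # toy visual conspicuity map
a = np.array([[0.0, 0.0], [0.0, 3.0]])  # toy auditory conspicuity map
avsm = audiovisual_saliency(v, a)
```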

    Monocular depth estimation in images and sequences using occlusion cues

    When humans observe a scene, they are able to perfectly distinguish the different parts composing it. Moreover, humans can easily reconstruct the spatial position of these parts and conceive a consistent structure. The mechanisms involved in visual perception have been studied since the beginning of neuroscience but, still today, not all of its processes are known. In usual situations, humans can make use of three different methods to estimate the scene structure. The first one is so-called divergence, and it makes use of both eyes. When objects lie in front of the observer at a distance of up to a hundred meters, subtle differences in the image formed in each eye can be used to determine depth. When objects are not in the field of view of both eyes, other mechanisms must be used. In these cases, both visual cues and previously learned information can be used to determine depth. Even if these mechanisms are less accurate than divergence, humans can almost always infer the correct depth structure when using them. As examples of visual cues, occlusion, perspective, and object size provide a lot of information about the structure of the scene. A priori information depends on each observer, but it is normally used subconsciously by humans to detect commonly known regions such as the sky, the ground, or different types of objects. In recent years, as technology has become able to handle the processing burden of vision systems, much effort has been devoted to designing automated scene-interpreting systems. In this thesis we address the problem of depth estimation using only one point of view and only occlusion depth cues. The objective is to detect occlusions present in the scene and combine them with a segmentation system so as to generate a relative depth-order map for the scene. We explore both static and dynamic situations, such as single images, frames within sequences, and full video sequences. 
In the case where a full image sequence is available, a system exploiting motion information to recover the depth structure is also designed. Results are promising and competitive with respect to the state of the art, but there is still much room for improvement when compared to human depth perception.