
    ImageSpirit: Verbal Guided Image Parsing

    Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging this gap between how humans would like to access images and their typical representation is the goal of image parsing, which involves assigning object and attribute labels to pixels. In this paper we propose treating nouns as object labels and adjectives as visual attribute labels. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive-time) solution. Using the extracted labels as handles, our system empowers a user to verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new generations of devices (e.g. smartphones, Google Glass, living-room devices). We demonstrate our system on a large number of real-world images of varying complexity. To help understand the trade-offs compared to traditional mouse-based interaction, results are reported for both a large-scale quantitative evaluation and a user study. Comment: http://mmcheng.net/imagespirit
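As an illustration of the noun/adjective "handle" idea above, the sketch below maps words in a spoken phrase to per-pixel object and attribute masks and intersects them to select pixels. The label vocabulary, array layout, and function names are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch: nouns act as object labels and adjectives as attribute
# labels when selecting pixels from a parsed image. Vocabulary and data
# layout are hypothetical, not the ImageSpirit system's real format.
import numpy as np

OBJECT_LABELS = {"chair", "table", "wall", "floor"}   # nouns (assumed)
ATTRIBUTE_LABELS = {"red", "shiny", "wooden"}         # adjectives (assumed)

def select_pixels(phrase, object_map, attribute_maps):
    """Boolean mask of pixels matching e.g. 'the shiny red chair'.

    object_map:     HxW array of per-pixel object-label strings
    attribute_maps: dict attribute -> HxW boolean mask
    """
    words = phrase.lower().split()
    nouns = [w for w in words if w in OBJECT_LABELS]
    adjectives = [w for w in words if w in ATTRIBUTE_LABELS]

    mask = np.ones(object_map.shape, dtype=bool)
    if nouns:
        mask &= np.isin(object_map, nouns)   # keep pixels of the named objects
    for adj in adjectives:
        mask &= attribute_maps[adj]          # intersect with attribute masks
    return mask

# Toy 2x2 "image": only the top-left pixel is a shiny red chair.
objects = np.array([["chair", "wall"], ["chair", "floor"]])
attrs = {"red":   np.array([[True, False], [False, False]]),
         "shiny": np.array([[True, True],  [False, False]]),
         "wooden": np.zeros((2, 2), dtype=bool)}
print(select_pixels("the shiny red chair", objects, attrs))
```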

    Dense Semantic Image Segmentation with Objects and Attributes

    The concepts of objects and attributes are both important for describing images precisely, since verbal descriptions often contain both adjectives and nouns (e.g. 'I see a shiny red chair'). In this paper, we formulate the problem of joint visual attribute and object class image segmentation as a dense multi-labelling problem, where each pixel in an image can be associated with both an object class and a set of visual attribute labels. In order to learn the label correlations, we adopt a boosting-based piecewise training approach with respect to the visual appearance and co-occurrence cues. We use a filtering-based mean-field approximation approach for efficient joint inference. Further, we develop a hierarchical model to incorporate region-level object and attribute information. Experiments on the aPASCAL, CORE and attribute-augmented NYU indoor scenes datasets show that the proposed approach is able to achieve state-of-the-art results.
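The filtering-based mean-field inference mentioned above can be sketched as repeated Gaussian smoothing of the label marginals followed by a compatibility transform. The snippet below is a generic, single-field version of that update (object labels only, a plain spatial Gaussian standing in for bilateral filtering); the paper's joint object/attribute model and its learned potentials are not reproduced here.

```python
# Illustrative sketch of one filtering-based mean-field update for a dense CRF.
# This is the generic style of update, not the authors' exact joint model.
import numpy as np
from scipy.ndimage import gaussian_filter

def mean_field_step(Q, unary, compat, sigma=3.0):
    """Q: HxWxL label marginals, unary: HxWxL negative log-likelihoods,
    compat: LxL label compatibility (penalty) matrix."""
    H, W, L = Q.shape
    # Message passing: smooth each label's marginal with a Gaussian filter
    # (stand-in for the spatial/bilateral filtering used in practice).
    msg = np.stack([gaussian_filter(Q[..., l], sigma) for l in range(L)], axis=-1)
    # Compatibility transform followed by a local softmax update.
    pairwise = msg.reshape(-1, L) @ compat
    logits = -unary.reshape(-1, L) - pairwise
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    Q_new = np.exp(logits)
    Q_new /= Q_new.sum(axis=1, keepdims=True)
    return Q_new.reshape(H, W, L)

# Toy run: 8x8 image, 3 object labels, Potts compatibility.
rng = np.random.default_rng(0)
unary = rng.random((8, 8, 3))
Q = np.exp(-unary); Q /= Q.sum(axis=-1, keepdims=True)
compat = 1.0 - np.eye(3)
for _ in range(5):
    Q = mean_field_step(Q, unary, compat)
```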

    Towards open-universe image parsing with broad coverage

    One of the main goals of computer vision is to develop algorithms that allow the computer to interpret an image not as a pattern of colors but as the semantic relationships that make up a real-world three-dimensional scene. In this dissertation, I present a system for image parsing, or labeling the regions of an image with their semantic categories, as a means of scene understanding. Most existing image parsing systems use a fixed set of a few hundred hand-labeled images as examples from which they learn how to label image regions, but our world cannot be adequately described with only a few hundred images. A new breed of open-universe datasets has recently started to emerge. These datasets not only have more images but are constantly expanding, with new images and labels assigned by users on the web. Here I present a system that is able both to learn from these larger datasets of labeled images and to scale as the dataset expands, thus greatly broadening the number of class labels that can correctly be identified in an image. Throughout this work I employ a retrieval-based methodology: I first retrieve images similar to the query and then match image regions from this set of retrieved images. My system can assign to each image region multiple forms of meaning: for example, it can simultaneously label the wing of a crow as an animal, crow, wing, and feather. I also broaden the label coverage by using both region- and detector-based similarity measures to effectively match a broad range of label types. This work shows the power of retrieval-based systems and the importance of having a diverse set of image cues and interpretations.
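A minimal sketch of the retrieval-based methodology described above, under simplifying assumptions: global descriptors are compared with Euclidean distance, and each query region takes the label of its nearest region from the retrieved images. All descriptors here are random placeholders, and the function names are invented for illustration.

```python
# Toy retrieval-based label transfer: retrieve similar training images, then
# label query regions by nearest-neighbour matching against their regions.
import numpy as np

def retrieve(query_desc, train_descs, k=5):
    """Indices of the k training images closest to the query descriptor."""
    d = np.linalg.norm(train_descs - query_desc, axis=1)
    return np.argsort(d)[:k]

def transfer_labels(query_regions, retrieved_regions, retrieved_labels):
    """Assign each query region the label of its nearest retrieved region."""
    labels = []
    for r in query_regions:
        d = np.linalg.norm(retrieved_regions - r, axis=1)
        labels.append(retrieved_labels[int(np.argmin(d))])
    return labels

rng = np.random.default_rng(1)
train_descs = rng.random((100, 64))      # global descriptors of the training set
query_desc = rng.random(64)
idx = retrieve(query_desc, train_descs)
print("retrieved image indices:", idx)

# Region descriptors and labels from the retrieved images (placeholders).
retrieved_regions = rng.random((40, 32))
retrieved_labels = rng.choice(["sky", "tree", "road", "building"], size=40)
query_regions = rng.random((6, 32))
print(transfer_labels(query_regions, retrieved_regions, retrieved_labels))
```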

    A NOVEL KNOWLEDGE-COMPATIBILITY BENCHMARKER FOR SEMANTIC SEGMENTATION


    An Unsupervised Approach to Modelling Visual Data

    For very large visual datasets, producing expert ground-truth data for training supervised algorithms can represent a substantial human effort. In these situations there is scope for the use of unsupervised approaches that can model collections of images and automatically summarise their content. The primary motivation for this thesis comes from the problem of labelling large visual datasets of the seafloor obtained by an Autonomous Underwater Vehicle (AUV) for ecological analysis. It is expensive to label this data, as taxonomical experts for the specific region are required, whereas automatically generated summaries can be used to focus the efforts of experts, and inform decisions on additional sampling. The contributions in this thesis arise from modelling this visual data in entirely unsupervised ways to obtain comprehensive visual summaries. Firstly, popular unsupervised image feature learning approaches are adapted to work with large datasets and unsupervised clustering algorithms. Next, using Bayesian models the performance of rudimentary scene clustering is boosted by sharing clusters between multiple related datasets, such as regular photo albums or AUV surveys. These Bayesian scene clustering models are extended to simultaneously cluster sub-image segments to form unsupervised notions of "objects" within scenes. The frequency distribution of these objects within scenes is used as the scene descriptor for simultaneous scene clustering. Finally, this simultaneous clustering model is extended to make use of whole-image descriptors, which encode rudimentary spatial information, as well as object frequency distributions to describe scenes. This is achieved by unifying the previously presented Bayesian clustering models, and in so doing rectifies some of their weaknesses and limitations. Hence, the final contribution of this thesis is a practical unsupervised algorithm for modelling images from the super-pixel to album levels, and is applicable to large datasets.
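As a rough illustration of using object-frequency distributions as scene descriptors, the sketch below clusters scenes by their normalized object histograms. It substitutes plain k-means for the thesis's Bayesian models (which additionally share clusters across datasets), and the data are synthetic.

```python
# Simplified stand-in for scene clustering by object-frequency histograms;
# the thesis uses Bayesian models rather than k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_scenes, n_object_types = 200, 30

# Per-scene counts of unsupervised "object" segments (placeholder data).
counts = rng.poisson(lam=2.0, size=(n_scenes, n_object_types))
histograms = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

# Group scenes with similar object-frequency profiles.
scene_clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(histograms)
print(np.bincount(scene_clusters))   # sizes of the resulting scene clusters
```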

    ์˜๋ฏธ๋ก ์  ์˜์ƒ ๋ถ„ํ• ์„ ์œ„ํ•œ ๋งฅ๋ฝ ์ธ์‹ ๊ธฐ๋ฐ˜ ํ‘œํ˜„ ํ•™์Šต

    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2017. Advisor: ์ด๊ฒฝ๋ฌด. Semantic segmentation, segmenting all the objects in an image and identifying their categories, is a fundamental and important problem in computer vision. Traditional approaches to semantic segmentation are based on two main elements: visual appearance features and semantic context. Visual appearance features such as color, edge, and shape are a primary source of information for reasoning about the objects in an image. However, image data are sometimes unable to fully capture the diversity of the object classes, since the appearance of objects in real-world scenes is affected by imaging conditions such as illumination, texture, occlusion, and viewpoint. Therefore, semantic context, obtained not only from the presence but also from the location of other objects, can help to disambiguate visual appearance in semantic segmentation tasks. Modern contextualized semantic segmentation systems have successfully improved segmentation performance by refining inconsistently labeled pixels via modeling of contextual interactions. However, they have considered semantic context and visual appearance features independently, owing to the absence of a suitable representation model. Motivated by this issue, this dissertation proposes a novel framework for learning semantic context-aware representations in which appearance features are enhanced and enriched by semantic context and vice versa. The first part of the dissertation is devoted to semantic context-aware appearance modeling for semantic segmentation: an adaptive context aggregation network is studied to capture semantic context adequately over multiple steps of reasoning. Secondly, semantic context is reinforced by utilizing visual appearance: a graph- and example-based context model is presented for estimating contextual relationships according to the visual appearance of objects. Finally, we propose a multiscale Conditional Random Field (CRF) for integrating context-aware appearance and appearance-aware semantic context to produce accurate segmentations.
    Experimental evaluations show the effectiveness of the proposed context-aware representations on various challenging datasets. Contents: 1 Introduction; 2 Adaptive Context Aggregation Network; 3 Second-order Semantic Relationships; 4 High-order Semantic Relationships; 5 Multiscale CRF Formulation; 6 Conclusion.
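One plausible shape for the multiscale CRF energy described in the abstract above, written only to make the combination of terms concrete; the dissertation's actual potentials and their learned parameters are not specified here.

```latex
% Illustrative energy only (assumed form, not the dissertation's exact model):
% a context-aware appearance unary, an appearance-aware pairwise context term,
% and a region-level term coupling pixel labels within each segment r.
E(\mathbf{x}) = \sum_{i} \psi^{\mathrm{app}}_{i}(x_i)
              + \sum_{i<j} \psi^{\mathrm{ctx}}_{ij}(x_i, x_j)
              + \sum_{r \in \mathcal{R}} \psi^{\mathrm{region}}_{r}(\mathbf{x}_r)
```

Minimizing E over the pixel labels x then combines the pixel-level and region-level cues in a single objective.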