    Object Detection Through Exploration With A Foveated Visual Field

    We present a foveated object detector (FOD) as a biologically inspired alternative to the sliding window (SW) approach, which is the dominant search method in computer vision object detection. Similar to the human visual system, the FOD has higher resolution at the fovea and lower resolution at the visual periphery. Consequently, more computational resources are allocated at the fovea and relatively fewer at the periphery. The FOD processes the entire scene, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image, and integrates observations across multiple fixations. Our approach combines modern object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We assess various eye movement strategies on the PASCAL VOC 2007 dataset and show that the FOD performs on par with the SW detector while bringing significant computational cost savings.
    Comment: An extended version of this manuscript was published in PLOS Computational Biology (October 2017) at https://doi.org/10.1371/journal.pcbi.100574
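
The resolution trade-off described in this abstract can be illustrated with a minimal sketch. It is not the FOD implementation: the linear growth of pooling-cell size with eccentricity, the function name `radial_cell_edges`, and every parameter value are assumptions chosen only for illustration. The sketch counts how many pooling cells are needed along one radial line from the fixation point when cells stay small inside the fovea and widen toward the periphery.

```python
def radial_cell_edges(max_ecc, fovea_radius=2.0, min_size=1.0, slope=0.5):
    """Edges of pooling cells along one radial line from the fixation point.

    Hypothetical layout: cells have width `min_size` inside the fovea and
    width `slope * eccentricity` (never below `min_size`) outside it.
    """
    edges = [0.0]
    while edges[-1] < max_ecc:
        ecc = edges[-1]
        if ecc < fovea_radius:
            size = min_size
        else:
            size = max(min_size, slope * ecc)
        # Clip the last cell so the layout ends exactly at max_ecc.
        edges.append(min(max_ecc, ecc + size))
    return edges


def uniform_cell_count(max_ecc, size=1.0):
    """Number of fixed-width cells covering the same radial line."""
    count = 0
    covered = 0.0
    while covered < max_ecc:
        covered += size
        count += 1
    return count
```

Under these assumed parameters, comparing `len(radial_cell_edges(32.0)) - 1` with `uniform_cell_count(32.0)` shows how far fewer cells are evaluated when resolution falls off with eccentricity, which is the intuition behind the cost savings claimed above.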

    Object Detection: Current and Future Directions

    Data Decomposition and Spatial Mixture Modeling for Part Based Model

    This paper presents a system of data decomposition and spatial mixture modeling for part-based models. Recently, many enhanced part-based models (with, e.g., multiple features or more components or parts) have been proposed. Nevertheless, those enhanced models bring high computation cost together with the risk of over-fitting. To tackle this problem, we propose a data decomposition method for part-based models which not only accelerates the training and testing process but also improves performance on average. Moreover, the original part-based model uses a strictly rigid structural model to describe the distribution of each part location. It is not "deformable" enough, especially for instances with different viewpoints or poses in the same aspect ratio. To address this problem, we present a novel spatial mixture modeling method. The spatial-mixture-embedded model is then integrated into the proposed data decomposition framework. We evaluate our system on the challenging PASCAL VOC2007 and PASCAL VOC2010 datasets, demonstrating state-of-the-art performance compared with other related methods in terms of accuracy and efficiency.

    Characterizing Objects in Images using Human Context

    Humans have an unmatched capability for interpreting detailed information about the objects present in an image just by looking at it. In particular, they can effortlessly perform the following tasks: 1) localizing various objects in the image and 2) assigning functionalities to the parts of localized objects. This dissertation addresses the problem of helping vision systems accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. Here, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small- and medium-sized household objects under human-object interactions. We first evaluate appearance-based star and tree models. While the tree model is slightly better, appearance-based methods continue to suffer from deficiencies caused by human interactions. To overcome this, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized through appearance or motion alone, we propose a framework that includes human-centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions, or affordances, within objects by casting the problem as one of semantic image segmentation. To this end, we introduce a dataset involving human-object interactions with strong (i.e., pixel-level) and weak (i.e., click-point and image-level) affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that efforts for weak annotation can be further optimized using human context.

    Object Detection Using Hough Transform

    This diploma thesis deals with object detection using a mathematical technique called the Hough transform. It treats the Hough transform in general terms, from its simplest use in detecting elementary, analytically describable shapes such as lines, ellipses, and circles, to its sophisticated use in detecting complex objects that are practically impossible to describe analytically. These include cars and pedestrians, which are detected on the basis of photographs of such objects. The document thus surveys the definitions and uses of the individual Hough transform sub-techniques, along with their basic classification into probabilistic and non-probabilistic methods. The work culminates in a description of the general state-of-the-art method called Class-Specific Hough Forests for object detection, presenting its definition, its training procedure on a provided dataset, and detection on test images. Finally, a generally trainable object detector using this technique is designed and implemented, and its performance is evaluated experimentally.
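
The simplest case this abstract mentions, detecting straight lines, can be sketched as follows. This is the classical (rho, theta) voting scheme, not the Class-Specific Hough Forests method the thesis implements; the function names `hough_lines` and `strongest_line` and the one-degree angular resolution are assumptions made for illustration.

```python
import numpy as np


def hough_lines(edge_mask, n_theta=180):
    """Accumulate votes in (rho, theta) space for a binary edge mask.

    Each edge pixel (x, y) votes for every line
    rho = x * cos(theta) + y * sin(theta) that passes through it.
    """
    h, w = edge_mask.shape
    diag = int(np.ceil(np.hypot(h, w)))           # bound on |rho|
    thetas = np.deg2rad(np.arange(n_theta) * (180.0 / n_theta))
    rhos = np.arange(-diag, diag + 1)
    acc = np.zeros((len(rhos), n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edge_mask)
    for t_idx, theta in enumerate(thetas):
        rho = np.rint(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        # np.add.at handles repeated indices, so every pixel's vote counts.
        np.add.at(acc, (rho + diag, t_idx), 1)
    return acc, rhos, thetas


def strongest_line(edge_mask):
    """Return (rho, theta in degrees, votes) of the accumulator maximum."""
    acc, rhos, thetas = hough_lines(edge_mask)
    r_idx, t_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return int(rhos[r_idx]), float(np.rad2deg(thetas[t_idx])), int(acc[r_idx, t_idx])
```

For a 100 x 100 mask whose only set pixels form the horizontal row y = 30, `strongest_line` should report rho = 30 at an angle of about 90 degrees, with one vote per pixel on the row. Detecting complex objects replaces these analytic line parameters with learned votes for an object's center, which is the step the thesis builds toward.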

    Context-driven Object Detection and Segmentation with Auxiliary Information

    One fundamental problem in computer vision and robotics is to localize objects of interest in an image. The task can be formulated either as an object detection problem, if the objects are described by a set of pose parameters, or as an object segmentation problem, if we recover the object boundary precisely. A key issue in object detection and segmentation concerns exploiting the spatial context, as local evidence is often insufficient to determine object pose in the presence of heavy occlusions or large object appearance variations. This thesis addresses the object detection and segmentation problem in such adverse conditions with auxiliary depth data provided by RGBD cameras. We focus on four main issues in context-aware object detection and segmentation: 1) what are the effective context representations? 2) how can we work with limited and imperfect depth data? 3) how can we design depth-aware features and integrate depth cues into conventional visual inference tasks? 4) how can we make use of unlabeled data to relax the labeling requirements for training data? We discuss three object detection and segmentation scenarios based on varying amounts of available auxiliary information. In the first case, depth data are available for model training but not for testing. We propose a structured Hough voting method for detecting objects with heavy occlusion in indoor environments, in which we extend the Hough hypothesis space to include both the object's location and its visibility pattern. We design a new score function that accumulates votes for object detection and occlusion prediction. In addition, we explore the correlation between objects and their environment, building a depth-encoded object-context model based on RGBD data. In the second case, we address the problem of localizing glass objects with noisy and incomplete depth data. Our method integrates the intensity and depth information from a single viewpoint and builds a Markov Random Field that predicts glass boundary and region jointly. In addition, we propose a nonparametric, data-driven label transfer scheme for local glass boundary estimation. A weighted voting scheme based on a joint feature manifold is adopted to integrate depth and appearance cues, and we learn a distance metric on the depth-encoded feature manifold. In the third case, we make use of unlabeled data to relax the annotation requirements for object detection and segmentation, and propose a novel data-dependent margin distribution learning criterion for boosting, which utilizes the intrinsic geometric structure of datasets. One key aspect of this method is that it can seamlessly incorporate unlabeled data by including a graph Laplacian regularizer. We demonstrate the performance of our models and compare them with baseline methods on several real-world object detection and segmentation tasks, including indoor object detection, glass object segmentation, and foreground segmentation in video.
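
To make the graph Laplacian regularizer mentioned in the third case concrete, here is a minimal sketch of the standard penalty f^T L f with L = D - W. It is not the thesis's boosting criterion: the Gaussian affinity, the bandwidth `sigma`, and the function names are assumptions for illustration. The penalty needs no labels, which is why unlabeled points can contribute to it.

```python
import numpy as np


def gaussian_affinity(X, sigma=1.0):
    """Dense affinity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Hypothetical choice of graph; the diagonal is zeroed (no self-loops).
    """
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W


def laplacian_penalty(W, f):
    """Smoothness penalty f^T L f, where L = D - W is the graph Laplacian.

    Equals 0.5 * sum_ij W_ij * (f_i - f_j)^2, so it is small when nearby
    points (large W_ij) receive similar predictions f_i and f_j.
    """
    L = np.diag(W.sum(axis=1)) - W
    return float(f @ L @ f)
```

Here `X` would stack labeled and unlabeled feature vectors and `f` would hold the classifier's real-valued outputs on them; adding a multiple of `laplacian_penalty(W, f)` to a supervised loss is one common way to use such a regularizer.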