412 research outputs found

    Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning

    Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method, which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach. Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
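
    The multi-fold procedure described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' code; `train_detector`, `score_windows`, and the `windows` attribute are assumed placeholder interfaces. The positive images are split into K folds, and the object location in each fold is re-estimated by a detector trained only on the other folds, so a detector never re-scores the images whose locations it was trained on.

        import random

        def multifold_mil(pos_images, neg_images, train_detector, score_windows,
                          K=10, n_iters=5):
            """Illustrative multi-fold MIL loop (not the authors' code).

            pos_images  : positive images; each is assumed to expose candidate
                          windows via a `windows` attribute (hypothetical).
            train_detector(pos_windows, neg_images) -> detector    (assumed)
            score_windows(detector, image) -> best-scoring window  (assumed)
            """
            random.shuffle(pos_images)
            folds = [pos_images[i::K] for i in range(K)]
            # initialise each positive example, e.g. with its first candidate window
            locations = {id(im): im.windows[0] for im in pos_images}

            for _ in range(n_iters):
                for k in range(K):
                    # train on every fold except fold k ...
                    train_imgs = [im for j, fold in enumerate(folds) if j != k
                                  for im in fold]
                    detector = train_detector(
                        [locations[id(im)] for im in train_imgs], neg_images)
                    # ... and re-localize only the held-out fold k, so the detector
                    # never re-scores images whose locations it was trained on
                    for im in folds[k]:
                        locations[id(im)] = score_windows(detector, im)

            # final detector trained on all re-localized positive windows
            return train_detector([locations[id(im)] for im in pos_images], neg_images)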

    Monocular 3D Object Recognition

    Object recognition is one of the fundamental tasks of computer vision. Recent advances in the field enable reliable 2D detections from a single cluttered image. However, many challenges still remain. Object detection needs to deliver timely responses for real-world applications. Moreover, we are genuinely interested in estimating the 3D pose and shape of an object or human for the sake of robotic manipulation and human-robot interaction. In this thesis, a suite of solutions to these challenges is presented. First, Active Deformable Part Models (ADPM) is proposed for fast part-based object detection. ADPM dramatically accelerates detection by dynamically scheduling the part evaluations and efficiently pruning the image locations. Second, we unleash the power of marrying discriminative 2D parts with an explicit 3D geometric representation. Several methods following this scheme are proposed for recovering rich 3D information of both rigid and non-rigid objects from monocular RGB images. (1) The accurate 3D pose of an object instance is recovered from cluttered images using only the CAD model. (2) A globally optimal solution for simultaneous 2D part localization, 3D pose and shape estimation is obtained by optimizing a unified convex objective function. Both appearance and geometric compatibility are jointly maximized. (3) 3D human pose estimation from an image sequence is realized via an Expectation-Maximization algorithm. The 2D joint location uncertainties are marginalized out during inference and 3D pose smoothness is enforced across frames. By bridging the gap between 2D and 3D, our methods provide an end-to-end solution to 3D object recognition from images. We demonstrate a range of interesting applications using only a single image or a monocular video, including autonomous robotic grasping with a single image, 3D object image pop-up and a monocular human MoCap system. We also show empirical state-of-the-art results on a number of benchmarks for 2D detection and 3D pose and shape estimation.
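
    The part-scheduling idea behind ADPM can be illustrated with a generic pruning cascade. The sketch below is a simplification under stated assumptions, not the thesis implementation: `part_score` and `remaining_bound` are hypothetical callables, and, unlike ADPM's dynamic scheduling, a fixed part order is used for brevity. Each part is evaluated in turn, and a candidate location is discarded as soon as its accumulated score plus an optimistic bound on the unevaluated parts can no longer reach the detection threshold.

        def cascade_detect(locations, parts, part_score, remaining_bound, threshold):
            """Generic part-pruning cascade (a simplification, not ADPM itself).

            locations           : candidate placements, e.g. root-filter positions
            part_score(p, loc)  : score of part p at a location            (assumed)
            remaining_bound(i)  : optimistic upper bound on the total score
                                  of the parts not yet evaluated           (assumed)
            """
            scores = {loc: 0.0 for loc in locations}
            active = list(locations)
            for i, part in enumerate(parts):
                survivors = []
                for loc in active:
                    scores[loc] += part_score(part, loc)
                    # keep a location only if it can still reach the threshold
                    if scores[loc] + remaining_bound(i + 1) >= threshold:
                        survivors.append(loc)
                active = survivors
            return [(loc, scores[loc]) for loc in active]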

    3D scene and object parsing from a single image

    The term 3D parsing refers to the process of segmenting and labeling the 3D space into expressive categories of voxels, point clouds or surfaces. Humans can effortlessly perceive the 3D scene and the unseen part of an object from a single image with a limited field of view. In the same sense, a robot that is designed to execute a few human-like actions should be able to infer the 3D visual world from a single snapshot of a 2D sensor such as a camera, or a 2.5D sensor such as a Kinect depth sensor. In this thesis, we focus on 3D scene and object parsing from a single image, aiming to produce a 3D parse that is able to support applications like robotics and navigation. Our goal is to produce an expressive 3D parse: e.g., what is it, where is it, and how can humans move and interact with it. Inferring such a 3D parse from a single image is not trivial. The main challenges are: the unknown separation of layout surfaces and objects; the high degree of occlusion and the diverse classes of objects in cluttered scenes; and how to represent 3D object geometry in a way that can be predicted from noisy or partial observations and can assist reasoning about contact, support and extent. In this thesis, we put forward the hypothesis, and verify it experimentally, that a data-driven approach is able to directly produce a complete 3D recovery from partial 2D observations. Moreover, we show that by imposing constraints from 3D patterns and priors on the learned model (e.g., layout surfaces are flat and orthogonal to adjacent surfaces, support height can reveal the full extent of an occluded object, 2D complete silhouettes can guide reconstructions beyond partial foreground occlusions, and a shape can be decomposed into a set of simple parts), we are able to obtain a more accurate reconstruction of the scene and a structural representation of the object. We present our approaches at different levels of detail, from a rough layout level to a more complex scene level and finally to the most detailed object level. We start by estimating the 3D room layout from a single RGB image, proposing an approach that generalizes across panoramas and perspective images, cuboid layouts and more general layouts (e.g., “L”-shaped rooms). We then make use of an additional depth image and work at the scene level to recover the complete 3D scene, with layout and all objects, jointly. At the object level, we propose to recover each 3D object with robustness to possible partial foreground occlusions. Finally, we represent each 3D object as a composite of sets of primitives, recurrently parsing each shape into primitives given a single depth view. We demonstrate the efficacy of each proposed approach with extensive experiments, both quantitative and qualitative, on public datasets.
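
    One of the priors listed above, that the support height can reveal the full extent of an occluded object, can be made concrete with a small sketch. The `Box3D` representation and the function below are hypothetical illustrations, not the representation used in the thesis: if an object is known to rest on a support surface, its visible 3D box is simply extended down to that surface.

        from dataclasses import dataclass

        @dataclass
        class Box3D:
            """Axis-aligned 3D box; y is the up axis and y_min is the bottom face."""
            x_min: float
            x_max: float
            y_min: float
            y_max: float
            z_min: float
            z_max: float

        def complete_with_support(visible_box: Box3D, support_height: float) -> Box3D:
            """Extend a partially observed box down to its supporting surface.

            When occlusion hides the lower part of an object, the visible box
            "floats" above the support plane; snapping its bottom face to the
            support height recovers the full vertical extent.
            """
            return Box3D(visible_box.x_min, visible_box.x_max,
                         min(visible_box.y_min, support_height), visible_box.y_max,
                         visible_box.z_min, visible_box.z_max)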

    Learning with Limited Data and Supervision

    Deep neural networks have been the main driving force of recent successes in machine learning, leading to the deployment of these models in a wide range of industries such as healthcare, autonomous driving, and fintech. Despite the great success, these models are known to be data-hungry, requiring many labelled training examples and costly computational resources to solve a pre-determined task. Several obstacles limit the applicability of deep learning models in real-world scenarios. First, annotating large-scale training data in tasks such as object localization or segmentation is cumbersome and demands a huge amount of time and labor. Second, in real-world scenarios and applications such as field robotics, the models may be required to learn new classes in an ever-changing environment. However, accessing abundant fully labelled training data for novel classes may be infeasible. Therefore, a model needs to adapt to learn novel classes given only a few examples with simple (weak) annotations. Finally, it is known that most modern deep convolutional networks do not have calibrated confidence scores, meaning that the confidence scores they assign to the outcomes do not match the true frequency of those events. It is of utmost importance, especially in safety-critical applications, that these models output calibrated prediction scores that downstream applications can rely upon. This thesis focuses on tackling these limitations in deep learning models with applications in Computer Vision. We investigate the task of finding common objects in small image collections and propose an efficient graphical model inference algorithm that utilizes the structure of the problem to significantly reduce the computational time compared to traditional inference algorithms. We also propose a probabilistic approach to solve the few-shot common object localization problem based on a parametric distribution of each class on a unit sphere. We further extend our model to localize objects of novel classes in unseen images. In the next step, we study pairwise similarity knowledge transfer for weakly supervised object localization to reduce the cost of labor and time in annotating large-scale object detection datasets for novel classes. We learn the similarity functions and the assignment of proposals to different novel classes jointly using alternating optimization and show that the assignment problem becomes an integer linear program for a certain type of loss function. Furthermore, we propose an efficient inference algorithm to overcome the difficulty of computing all pairwise similarities. Finally, to overcome the accuracy degradation of pre-trained models when learning expressive probability calibration functions from small calibration data, we introduce and formalize the notion of order-preserving functions. We also present two sub-families of order-preserving functions that benefit from parameter sharing across different classes in classification problems.
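
    Temperature scaling is arguably the simplest order-preserving calibration map and illustrates the notion described above, though it is not one of the expressive sub-families proposed in the thesis. The sketch below fits a single positive temperature on a small calibration set by grid search; since dividing logits by a positive scalar never changes their ranking, the predicted classes, and hence accuracy, are left untouched.

        import numpy as np

        def softmax(logits, T=1.0):
            z = logits / T
            z = z - z.max(axis=1, keepdims=True)      # numerical stability
            e = np.exp(z)
            return e / e.sum(axis=1, keepdims=True)

        def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
            """Pick the temperature minimising negative log-likelihood on a small
            held-out calibration set. Dividing logits by a positive T is
            order-preserving: the argmax (and hence accuracy) is unchanged."""
            def nll(T):
                p = softmax(logits, T)
                return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
            return min(grid, key=nll)

        # usage sketch:
        #   T = fit_temperature(val_logits, val_labels)
        #   calibrated = softmax(test_logits, T)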

    SCALE-ROBUST DEEP LEARNING FOR VISUAL RECOGNITION

    Ph.D. (Doctor of Philosophy)

    FINDING OBJECTS IN COMPLEX SCENES

    Object detection is one of the fundamental problems in computer vision that has great practical impact. Current object detectors work well under certain conditions. However, challenges arise when scenes become more complex. Scenes are often cluttered, and object detectors trained on Internet-collected data fail when there are large variations in objects’ appearance. We believe the key to tackling these challenges is to understand the rich context of objects in scenes, which includes: the appearance variations of an object due to viewpoint and lighting condition changes; the relationships between objects and their typical environment; and the composition of multiple objects in the same scene. This dissertation aims to study the complexity of scenes from those aspects. To facilitate collecting training data with large variations, we design a novel user interface, ARLabeler, utilizing the power of Augmented Reality (AR) devices. Instead of labeling images from the Internet passively, we put an observer in the real world with full control over the scene complexities. Users walk around freely and observe objects from multiple angles. Lighting can be adjusted. Objects can be added to and/or removed from the scene to create rich compositions. Our tool opens new possibilities to prepare data for complex scenes. We also study challenges in deploying object detectors in real-world scenes: detecting curb ramps in street view images. A system, Tohme, is proposed to combine detection results from automatic detectors with human crowdsourcing verification. One core component is a meta-classifier that estimates the complexity of a scene and assigns it to a human (accurate but costly) or a computer (low-cost but error-prone) accordingly. One of the insights from Tohme is that context is crucial in detecting objects. To understand the complex relationship between objects and their environment, we propose a standalone context model that predicts where an object can occur in an image. By combining this model with object detection, it can find regions where an object is missing. It can also be used to find out-of-context objects. To take a step beyond single-object-based detection, we explicitly model the geometrical relationships between groups of objects and use the layout information to represent scenes as a whole. We show that such a strategy is useful in retrieving indoor furniture scenes with natural language inputs.
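
    The routing role of the meta-classifier in Tohme can be sketched as follows. The scikit-learn-style `predict_proba` interface, the feature input, and the threshold are assumptions for illustration, not details of the deployed system: scenes predicted to be too complex for the detector are sent to crowd workers, and the rest are verified automatically.

        def route_scene(scene_features, detections, complexity_model, threshold=0.5):
            """Send a street-view scene to crowd workers or keep the automatic result.

            complexity_model is assumed to expose a scikit-learn-style
            predict_proba, returning the estimated probability that the automatic
            detector fails on this scene (an illustration, not Tohme itself).
            """
            p_fail = complexity_model.predict_proba([scene_features])[0][1]
            if p_fail > threshold:
                return {"route": "crowd", "detections": detections}  # accurate but costly
            return {"route": "auto", "detections": detections}       # cheap but error-prone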