    A systems engineering approach to robotic bin picking

    Vision and robotic systems have become commonplace in industry, yet a large range of industrial tasks remains unsolved because vision systems lack the flexibility needed in highly adaptive manufacturing environments. An important task found across a broad range of modern flexible manufacturing environments is presenting parts to automated machinery from a supply bin. To carry out grasping and manipulation operations safely and efficiently, we need to know the identity, location and spatial orientation of the objects that lie in an unstructured heap in a bin. Historically, the bin picking problem was tackled using mechanical vibratory feeders, where vision feedback was unavailable. This solution suffers from parts jamming and, more importantly, the feeders are highly dedicated: if a change in the manufacturing process is required, the changeover may include extensive re-tooling and a total revision of the system control strategy (Kelley et al., 1982). Due to these disadvantages, modern bin picking systems perform grasping and manipulation operations using vision feedback (Yoshimi & Allen, 1994). Vision-based robotic bin picking has been the subject of research since the introduction of automated vision-controlled processes in industry, and a review of existing systems indicates that none of the proposed solutions has solved this classic vision problem in its generality. One of the main challenges facing a bin picking system is dealing with overlapping objects. Object recognition in cluttered scenes is the main objective of these systems, and early approaches attempted to perform bin picking operations for similar objects jumbled together in an unstructured heap, using no knowledge about the pose or geometry of the parts (Birk et al., 1981).
While these assumptions may be acceptable for a restricted number of applications, in most practical cases a flexible system must deal with more than one type of object and with a wide range of shapes. A flexible bin picking system has to address three difficult problems: scene interpretation, object recognition and pose estimation. Initial approaches to these tasks were based on modeling parts using 2D surface representations. Typical 2D representations include invariant shape descriptors (Zisserman et al., 1994), algebraic curves (Tarel & Cooper, 2000), conics (Bolles & Horaud, 1986; Forsyth et al., 1991) and appearance-based models (Murase & Nayar, 1995; Ohba & Ikeuchi, 1997). These systems are generally better suited to planar object recognition and cannot deal with severe viewpoint distortions or with objects that have complex shapes or textures. In addition, the spatial orientation cannot be robustly estimated for objects with free-form contours. To address this limitation, most bin picking systems attempt to recognize the scene objects and estimate their spatial orientation using 3D information (Fan et al., 1989; Faugeras & Hebert, 1986). Notable approaches include the use of 3D local descriptors (Ansar & Daniilidis, 2003; Campbell & Flynn, 2001; Kim & Kak, 1991), polyhedra (Rothwell & Stern, 1996), generalized cylinders (Ponce et al., 1989; Zerroug & Nevatia, 1996), super-quadrics (Blane et al., 2000) and visual learning methods (Johnson & Hebert, 1999; Mittrapiyanuruk et al., 2004). The most difficult problem for 3D bin picking systems that are based on a structural description of the objects (local descriptors or 3D primitives) is the complex procedure required to perform scene-to-model feature matching.
This procedure is usually based on complex graph-searching techniques and becomes increasingly difficult when dealing with object occlusions, where the structural description of the scene objects is incomplete. Visual learning methods based on eigenimage analysis have been proposed as an alternative solution to object recognition and pose estimation for objects with complex appearances. In this regard, Johnson and Hebert (Johnson & Hebert, 1999) developed an object recognition scheme that is able to identify multiple 3D objects in scenes affected by clutter and occlusion. They proposed an eigenimage analysis approach that is applied to match surface points using the spin image representation. The main attraction of this approach resides in the use of spin images, which are local surface descriptors and can therefore be easily identified in real scenes that contain clutter and occlusions. This approach returns accurate results, but the pose cannot be inferred, because spin images are local descriptors and do not robustly capture the object orientation. In general, pose sampling for visual learning methods is a difficult problem, as the number of views required to sample the full 6 degrees of freedom of the object pose is prohibitive. This issue was addressed by Edwards (Edwards, 1996), who applied eigenimage analysis to a one-object scene; his approach was able to estimate the pose only in cases where the tilt angle was limited to 30 degrees with respect to the optical axis of the sensor. In this chapter we describe the implementation of a vision sensor for robotic bin picking in which we attempt to eliminate the main problem faced by visual learning methods, namely the pose sampling problem. This chapter is organized as follows. Section 2 outlines the overall system. Section 3 describes the implementation of the range sensor, while Section 4 details the edge-based segmentation algorithm.
Section 5 presents the viewpoint correction algorithm that is applied to align the detected object surfaces perpendicular to the optical axis of the sensor. Section 6 describes the object recognition algorithm. This is followed in Section 7 by an outline of the pose estimation algorithm. Section 8 presents a number of experimental results illustrating the benefits of the approach outlined in this chapter.
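
The viewpoint correction idea mentioned for Section 5 can be illustrated with a minimal sketch. It assumes the detected surface is planar with a known normal vector and that the optical axis of the sensor is the z axis. The function names and the use of Rodrigues' rotation formula are illustrative assumptions and are not taken from the chapter.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length vector has no direction")
    return [c / norm for c in v]


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_aligning_normal_to_axis(normal, axis=(0.0, 0.0, 1.0)):
    """Hypothetical helper: 3x3 rotation (list of rows) mapping `normal` onto `axis`."""
    a, b = _unit(normal), _unit(axis)
    k = _cross(a, b)                            # rotation axis, length sin(angle)
    sin_t = math.sqrt(sum(c * c for c in k))
    cos_t = sum(x * y for x, y in zip(a, b))
    if sin_t < 1e-12:
        if cos_t > 0.0:                         # already aligned
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        # Opposite directions: rotate 180 degrees about any axis p
        # perpendicular to the normal, i.e. R = 2 p p^T - I.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        p = _unit(_cross(a, helper))
        return [[2.0 * p[i] * p[j] - (1.0 if i == j else 0.0)
                 for j in range(3)] for i in range(3)]
    k = [c / sin_t for c in k]
    # Skew-symmetric matrix K of the unit rotation axis.
    K = [[0.0, -k[2], k[1]],
         [k[2], 0.0, -k[0]],
         [-k[1], k[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2.
    return [[(1.0 if i == j else 0.0) + sin_t * K[i][j] + (1.0 - cos_t) * K2[i][j]
             for j in range(3)] for i in range(3)]


def rotate(matrix, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(matrix[i][j] * v[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    tilted_normal = [0.0, 1.0, 1.0]             # surface tilted 45 degrees
    R = rotation_aligning_normal_to_axis(tilted_normal)
    print(rotate(R, _unit(tilted_normal)))      # close to the optical axis
```

Applying the same rotation to every 3D point of the surface would present it to the sensor in a fronto-parallel view, which is the kind of normalization that removes the tilt component from the pose sampling problem.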

    Two image comparison methods for object identification from prospective data

    This study addresses the problem of identifying moving objects from data delivered by a prospective sensor whose design is currently under way. The aim is to assess the feasibility of such identification using the pattern recognition tools available today. This paper presents the complete implementation of a simulation chain, comprising both the generation of the data (which are not yet available) and the set-up of processes able to exploit them for identification. Variable parameters control the nature of the images (richness, noise level) throughout the simulation, so that data of varying quality can be taken into account.

    Shape description and matching using integral invariants on eccentricity transformed images

    Matching occluded and noisy shapes is a problem frequently encountered in medical image analysis and more generally in computer vision. To keep track of changes inside the breast, for example, it is important for a computer aided detection system to establish correspondences between regions of interest. Shape transformations, computed both with integral invariants (II) and with geodesic distance, yield signatures that are invariant to isometric deformations, such as bending and articulations. Integral invariants describe the boundaries of planar shapes. However, they provide no information about where a particular feature lies on the boundary with regard to the overall shape structure. Conversely, eccentricity transforms (Ecc) can match shapes by signatures of geodesic distance histograms based on information from inside the shape, but they ignore the boundary information. We describe a method that combines the boundary signature of a shape obtained from II with structural information from the Ecc to yield results that improve on those obtained with either method alone.
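
As a rough illustration of the interior signature described above, the sketch below computes an eccentricity value for every pixel of a binary shape (the longest geodesic distance from that pixel to any other pixel of the shape) and bins the values into a normalized histogram. The 4-connected grid, the unit step cost, the bin count and the function names are assumptions made for this example, not details from the paper.

```python
from collections import deque


def eccentricity_transform(mask):
    """Map each shape pixel (row, col) to its largest geodesic distance.

    `mask` is a list of rows containing 0 (background) or 1 (shape).
    Distances are counted in 4-connected steps that stay inside the shape.
    For a disconnected shape, only reachable pixels are considered.
    """
    rows, cols = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    ecc = {}
    for start in pixels:
        # Breadth-first search gives shortest in-shape path lengths.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and mask[nr][nc] and (nr, nc) not in dist:
                    dist[(nr, nc)] = dist[(r, c)] + 1
                    queue.append((nr, nc))
        ecc[start] = max(dist.values())
    return ecc


def eccentricity_histogram(ecc, bins=4):
    """Normalized histogram of eccentricities scaled by their maximum."""
    values = list(ecc.values())
    top = max(values)
    counts = [0] * bins
    for v in values:
        index = min(int(v / top * bins), bins - 1) if top > 0 else 0
        counts[index] += 1
    return [n / len(values) for n in counts]


if __name__ == "__main__":
    # U-shaped region: the two tips are close in the image plane
    # but far apart when paths must stay inside the shape.
    u_shape = [[1, 0, 1],
               [1, 0, 1],
               [1, 1, 1]]
    ecc = eccentricity_transform(u_shape)
    print(ecc[(0, 0)], ecc[(2, 1)])
    print(eccentricity_histogram(ecc))
```

Because the distances are measured inside the shape, bending the U open or closed leaves the values unchanged, which is the invariance to articulation that the abstract refers to. The all-pairs search used here is quadratic in the number of pixels and is meant only for small examples.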

    Generalization to Novel Views: Universal, Class-based, and Model-based Processing

    A major problem in object recognition is that a novel image of a given object can be different from all previously seen images. Images can vary considerably due to changes in viewing conditions such as viewing position and illumination. In this paper we distinguish between three types of recognition schemes by the level at which generalization to novel images takes place: universal, class, and model-based. The first is applicable equally to all objects, the second to a class of objects, and the third uses known properties of individual objects. We derive theoretical limitations on each of the three generalization levels. For the universal level, previous results have shown that no invariance can be obtained. Here we show that this limitation holds even when the assumptions made on the objects and the recognition functions are relaxed. We also extend the results to changes of illumination direction. For the class level, previous studies presented specific examples of classes of objects for which functions invariant to viewpoint exist. Here, we distinguish between classes that admit such invariance and classes that do not. We demonstrate that there is a tradeoff between the set of objects that can be discriminated by a given recognition function and the set of images from which the recognition function can recognize these objects. Furthermore, we demonstrate that although functions that are invariant to illumination direction do not exist at the universal level, when the objects are restricted to belong to a given class, a function invariant to illumination direction can be defined. A general conclusion of this study is that class-based processing, which has not been used extensively in the past, is often advantageous for dealing with variations due to viewpoint and illuminant changes. Keywords: object recognition, invariance

    Recognizing Large Isolated 3-D Objects Through Next View Planning Using Inner Camera Invariants

    Lunar Crater Identification in Digital Images

    It is often necessary to identify a pattern of observed craters in a single image of the lunar surface without any prior knowledge of the camera's location. This so-called "lost-in-space" crater identification problem is common in both crater-based terrain relative navigation (TRN) and in automatic registration of scientific imagery. Past work on crater identification has largely been based on heuristic schemes, with poor performance outside of a narrowly defined operating regime (e.g., nadir-pointing images, small search areas). This work provides the first mathematically rigorous treatment of the general crater identification problem. It is shown when it is (and when it is not) possible to recognize a pattern of elliptical crater rims in an image formed by perspective projection. For the cases when it is possible to recognize a pattern, descriptors are developed using invariant theory that provably capture all of the viewpoint-invariant information. These descriptors may be pre-computed for known crater patterns and placed in a searchable index for fast recognition. New techniques are also developed for computing pose from crater rim observations and for evaluating crater rim correspondences. These techniques are demonstrated on both synthetic and real images.

    Active recognition through next view planning: a survey

    Bottom-up Object Segmentation for Visual Recognition

    Automatic recognition and segmentation of objects in images is a central open problem in computer vision. Most previous approaches have pursued either sliding-window object detection or dense classification of overlapping local image patches. In contrast, the framework introduced in this thesis attempts to identify the spatial extent of objects prior to recognition, using bottom-up computational processes and mid-level selection cues. After a set of plausible object hypotheses is identified, a sequential recognition process is executed, based on continuous estimates of the spatial overlap between the image segment hypotheses and each putative class. The object hypotheses are represented as figure-ground segmentations, and are extracted automatically, without prior knowledge of the properties of individual object classes, by solving a sequence of constrained parametric min-cut problems (CPMC) on a regular image grid. It is shown that CPMC significantly outperforms the state of the art for low-level segmentation in the PASCAL VOC 2009 and 2010 datasets. Results beyond the current state of the art for image classification, object detection and semantic segmentation are also demonstrated in a number of challenging datasets including Caltech-101, ETHZ-Shape as well as PASCAL VOC 2009-11. These results suggest that a greater emphasis on grouping and image organization may be valuable for making progress in high-level tasks such as object recognition and scene understanding.
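
The notion of spatial overlap between a segment hypothesis and an object region can be made concrete with a short sketch. It assumes that overlap is measured as intersection over union on binary masks. That measure, the function names and the toy masks are assumptions for illustration and do not reproduce the thesis's recognition pipeline.

```python
def overlap(segment, region):
    """Intersection over union of two equally sized binary masks.

    Each mask is a list of rows containing 0 (background) or 1 (foreground).
    Returns a value in [0, 1]; two empty masks give 0.0 by convention here.
    """
    intersection = 0
    union = 0
    for seg_row, reg_row in zip(segment, region):
        for s, g in zip(seg_row, reg_row):
            if s and g:
                intersection += 1
            if s or g:
                union += 1
    return intersection / union if union else 0.0


def rank_hypotheses(hypotheses, region):
    """Indices of segment hypotheses sorted by decreasing overlap with `region`."""
    return sorted(range(len(hypotheses)),
                  key=lambda i: overlap(hypotheses[i], region),
                  reverse=True)


if __name__ == "__main__":
    reference = [[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]]
    hypotheses = [
        [[0, 0, 0], [0, 1, 1], [0, 1, 1]],   # mostly misses the object
        [[1, 1, 0], [1, 1, 1], [0, 0, 0]],   # covers it with one extra pixel
    ]
    print([overlap(h, reference) for h in hypotheses])
    print(rank_hypotheses(hypotheses, reference))
```

A continuous score of this kind, rather than a hard positive or negative label per segment, is what lets a recognizer prefer hypotheses that cover an object well over ones that only touch it.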

    Author index—Volumes 1–89

    3D compositional hierarchies for object categorization

    Deep learning methods have become the default tool for image classification. However, application of deep learning to surface shape classification is burdened by the limitations of existing methods, in particular by the lack of invariance to geometric transformations of input data. This thesis proposes two novel frameworks for learning a multi-layer representation of surface shape features, namely the view-based and the surface-based compositional hierarchical frameworks. The proposed representation is a hierarchical vocabulary of shape features, termed parts. Parts of the first layer are pre-defined, while parts of the subsequent layers, describing spatial relations of subparts, are learned. The view-based framework describes spatial relations between subparts using a camera-based reference frame. The key stage of the learning algorithm is part selection, which forms the vocabulary based on multi-objective optimization, considering different importance measures of parts. Our experiments show that this framework enables efficient category recognition on a large-scale dataset. The surface-based framework exploits part-based intrinsic reference frames, which are computed for lower-layer parts and inherited by parts of the subsequent layers. During learning, spatial relations between subparts are described in these reference frames. During inference, a part is detected in input data when its subparts are detected at certain positions and orientations in each other’s reference frames. Since rigid body transformations do not change positions and orientations of parts in intrinsic reference frames, this approach enables efficient recognition from unseen poses. Experiments show that this framework exhibits a large discriminative power and greater robustness to rigid body transformations than advanced CNN-based methods.
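
The invariance argument in the last sentences can be checked numerically with a simplified sketch. It works in 2D instead of on 3D surfaces, and it represents each part frame as a position plus a heading angle. These simplifications and all names below are assumptions made for illustration, not the representation used in the thesis.

```python
import math


def relative_pose(frame_a, frame_b):
    """Pose of frame_b expressed in the reference frame of frame_a.

    Each frame is a tuple (x, y, theta) with theta in radians.
    """
    ax, ay, a_theta = frame_a
    bx, by, b_theta = frame_b
    dx, dy = bx - ax, by - ay
    cos_a, sin_a = math.cos(a_theta), math.sin(a_theta)
    # Rotate the offset by -a_theta to express it in frame_a's axes.
    local_x = cos_a * dx + sin_a * dy
    local_y = -sin_a * dx + cos_a * dy
    # Wrap the heading difference to the interval [-pi, pi).
    d_theta = (b_theta - a_theta + math.pi) % (2.0 * math.pi) - math.pi
    return (local_x, local_y, d_theta)


def apply_rigid_transform(frame, angle, tx, ty):
    """Rotate a frame by `angle` about the origin, then translate by (tx, ty)."""
    x, y, theta = frame
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + tx, s * x + c * y + ty, theta + angle)


if __name__ == "__main__":
    part_a = (1.0, 2.0, 0.3)
    part_b = (2.5, 1.0, 1.1)
    before = relative_pose(part_a, part_b)
    # Move the whole object: both parts undergo the same rigid motion.
    moved_a = apply_rigid_transform(part_a, 2.0, -4.0, 7.0)
    moved_b = apply_rigid_transform(part_b, 2.0, -4.0, 7.0)
    after = relative_pose(moved_a, moved_b)
    print(before)
    print(after)   # matches `before` up to floating-point rounding
```

Describing subparts by such relative poses means a composition learned from one object pose can be matched in another pose without enumerating viewpoints, which is the property the surface-based framework relies on.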