134 research outputs found

    Research on a modifeied RANSAC and its applications to ellipse detection from a static image and motion detection from active stereo video sequences

    Get PDF
    制度:新 ; 報告番号:甲3091号 ; 学位の種類:博士(国際情報通信学) ; 授与年月日:2010/2/24 ; 早大学位記番号:新535

    Robust surface modelling of visual hull from multiple silhouettes

    Get PDF
    Reconstructing depth information from images is one of the actively researched themes in computer vision and its application involves most vision research areas from object recognition to realistic visualisation. Amongst other useful vision-based reconstruction techniques, this thesis extensively investigates the visual hull (VH) concept for volume approximation and its robust surface modelling when various views of an object are available. Assuming that multiple images are captured from a circular motion, projection matrices are generally parameterised in terms of a rotation angle from a reference position in order to facilitate the multi-camera calibration. However, this assumption is often violated in practice, i.e., a pure rotation in a planar motion with accurate rotation angle is hardly realisable. To address this problem, at first, this thesis proposes a calibration method associated with the approximate circular motion. With these modified projection matrices, a resulting VH is represented by a hierarchical tree structure of voxels from which surfaces are extracted by the Marching cubes (MC) algorithm. However, the surfaces may have unexpected artefacts caused by a coarser volume reconstruction, the topological ambiguity of the MC algorithm, and imperfect image processing or calibration result. To avoid this sensitivity, this thesis proposes a robust surface construction algorithm which initially classifies local convex regions from imperfect MC vertices and then aggregates local surfaces constructed by the 3D convex hull algorithm. Furthermore, this thesis also explores the use of wide baseline images to refine a coarse VH using an affine invariant region descriptor. This improves the quality of VH when a small number of initial views is given. In conclusion, the proposed methods achieve a 3D model with enhanced accuracy. Also, robust surface modelling is retained when silhouette images are degraded by practical noise

    Automatic video segmentation employing object/camera modeling techniques

    Get PDF
    Practically established video compression and storage techniques still process video sequences as rectangular images without further semantic structure. However, humans watching a video sequence immediately recognize acting objects as semantic units. This semantic object separation is currently not reflected in the technical system, making it difficult to manipulate the video at the object level. The realization of object-based manipulation will introduce many new possibilities for working with videos like composing new scenes from pre-existing video objects or enabling user-interaction with the scene. Moreover, object-based video compression, as defined in the MPEG-4 standard, can provide high compression ratios because the foreground objects can be sent independently from the background. In the case that the scene background is static, the background views can even be combined into a large panoramic sprite image, from which the current camera view is extracted. This results in a higher compression ratio since the sprite image for each scene only has to be sent once. A prerequisite for employing object-based video processing is automatic (or at least user-assisted semi-automatic) segmentation of the input video into semantic units, the video objects. This segmentation is a difficult problem because the computer does not have the vast amount of pre-knowledge that humans subconsciously use for object detection. Thus, even the simple definition of the desired output of a segmentation system is difficult. The subject of this thesis is to provide algorithms for segmentation that are applicable to common video material and that are computationally efficient. The thesis is conceptually separated into three parts. In Part I, an automatic segmentation system for general video content is described in detail. Part II introduces object models as a tool to incorporate userdefined knowledge about the objects to be extracted into the segmentation process. Part III concentrates on the modeling of camera motion in order to relate the observed camera motion to real-world camera parameters. The segmentation system that is described in Part I is based on a background-subtraction technique. The pure background image that is required for this technique is synthesized from the input video itself. Sequences that contain rotational camera motion can also be processed since the camera motion is estimated and the input images are aligned into a panoramic scene-background. This approach is fully compatible to the MPEG-4 video-encoding framework, such that the segmentation system can be easily combined with an object-based MPEG-4 video codec. After an introduction to the theory of projective geometry in Chapter 2, which is required for the derivation of camera-motion models, the estimation of camera motion is discussed in Chapters 3 and 4. It is important that the camera-motion estimation is not influenced by foreground object motion. At the same time, the estimation should provide accurate motion parameters such that all input frames can be combined seamlessly into a background image. The core motion estimation is based on a feature-based approach where the motion parameters are determined with a robust-estimation algorithm (RANSAC) in order to distinguish the camera motion from simultaneously visible object motion. Our experiments showed that the robustness of the original RANSAC algorithm in practice does not reach the theoretically predicted performance. An analysis of the problem has revealed that this is caused by numerical instabilities that can be significantly reduced by a modification that we describe in Chapter 4. The synthetization of static-background images is discussed in Chapter 5. In particular, we present a new algorithm for the removal of the foreground objects from the background image such that a pure scene background remains. The proposed algorithm is optimized to synthesize the background even for difficult scenes in which the background is only visible for short periods of time. The problem is solved by clustering the image content for each region over time, such that each cluster comprises static content. Furthermore, it is exploited that the times, in which foreground objects appear in an image region, are similar to the corresponding times of neighboring image areas. The reconstructed background could be used directly as the sprite image in an MPEG-4 video coder. However, we have discovered that the counterintuitive approach of splitting the background into several independent parts can reduce the overall amount of data. In the case of general camera motion, the construction of a single sprite image is even impossible. In Chapter 6, a multi-sprite partitioning algorithm is presented, which separates the video sequence into a number of segments, for which independent sprites are synthesized. The partitioning is computed in such a way that the total area of the resulting sprites is minimized, while simultaneously satisfying additional constraints. These include a limited sprite-buffer size at the decoder, and the restriction that the image resolution in the sprite should never fall below the input-image resolution. The described multisprite approach is fully compatible to the MPEG-4 standard, but provides three advantages. First, any arbitrary rotational camera motion can be processed. Second, the coding-cost for transmitting the sprite images is lower, and finally, the quality of the decoded sprite images is better than in previously proposed sprite-generation algorithms. Segmentation masks for the foreground objects are computed with a change-detection algorithm that compares the pure background image with the input images. A special effect that occurs in the change detection is the problem of image misregistration. Since the change detection compares co-located image pixels in the camera-motion compensated images, a small error in the motion estimation can introduce segmentation errors because non-corresponding pixels are compared. We approach this problem in Chapter 7 by integrating risk-maps into the segmentation algorithm that identify pixels for which misregistration would probably result in errors. For these image areas, the change-detection algorithm is modified to disregard the difference values for the pixels marked in the risk-map. This modification significantly reduces the number of false object detections in fine-textured image areas. The algorithmic building-blocks described above can be combined into a segmentation system in various ways, depending on whether camera motion has to be considered or whether real-time execution is required. These different systems and example applications are discussed in Chapter 8. Part II of the thesis extends the described segmentation system to consider object models in the analysis. Object models allow the user to specify which objects should be extracted from the video. In Chapters 9 and 10, a graph-based object model is presented in which the features of the main object regions are summarized in the graph nodes, and the spatial relations between these regions are expressed with the graph edges. The segmentation algorithm is extended by an object-detection algorithm that searches the input image for the user-defined object model. We provide two objectdetection algorithms. The first one is specific for cartoon sequences and uses an efficient sub-graph matching algorithm, whereas the second processes natural video sequences. With the object-model extension, the segmentation system can be controlled to extract individual objects, even if the input sequence comprises many objects. Chapter 11 proposes an alternative approach to incorporate object models into a segmentation algorithm. The chapter describes a semi-automatic segmentation algorithm, in which the user coarsely marks the object and the computer refines this to the exact object boundary. Afterwards, the object is tracked automatically through the sequence. In this algorithm, the object model is defined as the texture along the object contour. This texture is extracted in the first frame and then used during the object tracking to localize the original object. The core of the algorithm uses a graph representation of the image and a newly developed algorithm for computing shortest circular-paths in planar graphs. The proposed algorithm is faster than the currently known algorithms for this problem, and it can also be applied to many alternative problems like shape matching. Part III of the thesis elaborates on different techniques to derive information about the physical 3-D world from the camera motion. In the segmentation system, we employ camera-motion estimation, but the obtained parameters have no direct physical meaning. Chapter 12 discusses an extension to the camera-motion estimation to factorize the motion parameters into physically meaningful parameters (rotation angles, focal-length) using camera autocalibration techniques. The speciality of the algorithm is that it can process camera motion that spans several sprites by employing the above multi-sprite technique. Consequently, the algorithm can be applied to arbitrary rotational camera motion. For the analysis of video sequences, it is often required to determine and follow the position of the objects. Clearly, the object position in image coordinates provides little information if the viewing direction of the camera is not known. Chapter 13 provides a new algorithm to deduce the transformation between the image coordinates and the real-world coordinates for the special application of sport-video analysis. In sport videos, the camera view can be derived from markings on the playing field. For this reason, we employ a model of the playing field that describes the arrangement of lines. After detecting significant lines in the input image, a combinatorial search is carried out to establish correspondences between lines in the input image and lines in the model. The algorithm requires no information about the specific color of the playing field and it is very robust to occlusions or poor lighting conditions. Moreover, the algorithm is generic in the sense that it can be applied to any type of sport by simply exchanging the model of the playing field. In Chapter 14, we again consider panoramic background images and particularly focus ib their visualization. Apart from the planar backgroundsprites discussed previously, a frequently-used visualization technique for panoramic images are projections onto a cylinder surface which is unwrapped into a rectangular image. However, the disadvantage of this approach is that the viewer has no good orientation in the panoramic image because he looks into all directions at the same time. In order to provide a more intuitive presentation of wide-angle views, we have developed a visualization technique specialized for the case of indoor environments. We present an algorithm to determine the 3-D shape of the room in which the image was captured, or, more generally, to compute a complete floor plan if several panoramic images captured in each of the rooms are provided. Based on the obtained 3-D geometry, a graphical model of the rooms is constructed, where the walls are displayed with textures that are extracted from the panoramic images. This representation enables to conduct virtual walk-throughs in the reconstructed room and therefore, provides a better orientation for the user. Summarizing, we can conclude that all segmentation techniques employ some definition of foreground objects. These definitions are either explicit, using object models like in Part II of this thesis, or they are implicitly defined like in the background synthetization in Part I. The results of this thesis show that implicit descriptions, which extract their definition from video content, work well when the sequence is long enough to extract this information reliably. However, high-level semantics are difficult to integrate into the segmentation approaches that are based on implicit models. Intead, those semantics should be added as postprocessing steps. On the other hand, explicit object models apply semantic pre-knowledge at early stages of the segmentation. Moreover, they can be applied to short video sequences or even still pictures since no background model has to be extracted from the video. The definition of a general object-modeling technique that is widely applicable and that also enables an accurate segmentation remains an important yet challenging problem for further research

    Robust surface modelling of visual hull from multiple silhouettes

    Get PDF
    Reconstructing depth information from images is one of the actively researched themes in computer vision and its application involves most vision research areas from object recognition to realistic visualisation. Amongst other useful vision-based reconstruction techniques, this thesis extensively investigates the visual hull (VH) concept for volume approximation and its robust surface modelling when various views of an object are available. Assuming that multiple images are captured from a circular motion, projection matrices are generally parameterised in terms of a rotation angle from a reference position in order to facilitate the multi-camera calibration. However, this assumption is often violated in practice, i.e., a pure rotation in a planar motion with accurate rotation angle is hardly realisable. To address this problem, at first, this thesis proposes a calibration method associated with the approximate circular motion. With these modified projection matrices, a resulting VH is represented by a hierarchical tree structure of voxels from which surfaces are extracted by the Marching cubes (MC) algorithm. However, the surfaces may have unexpected artefacts caused by a coarser volume reconstruction, the topological ambiguity of the MC algorithm, and imperfect image processing or calibration result. To avoid this sensitivity, this thesis proposes a robust surface construction algorithm which initially classifies local convex regions from imperfect MC vertices and then aggregates local surfaces constructed by the 3D convex hull algorithm. Furthermore, this thesis also explores the use of wide baseline images to refine a coarse VH using an affine invariant region descriptor. This improves the quality of VH when a small number of initial views is given. In conclusion, the proposed methods achieve a 3D model with enhanced accuracy. Also, robust surface modelling is retained when silhouette images are degraded by practical noise

    Variational methods and its applications to computer vision

    Get PDF
    Many computer vision applications such as image segmentation can be formulated in a ''variational'' way as energy minimization problems. Unfortunately, the computational task of minimizing these energies is usually difficult as it generally involves non convex functions in a space with thousands of dimensions and often the associated combinatorial problems are NP-hard to solve. Furthermore, they are ill-posed inverse problems and therefore are extremely sensitive to perturbations (e.g. noise). For this reason in order to compute a physically reliable approximation from given noisy data, it is necessary to incorporate into the mathematical model appropriate regularizations that require complex computations. The main aim of this work is to describe variational segmentation methods that are particularly effective for curvilinear structures. Due to their complex geometry, classical regularization techniques cannot be adopted because they lead to the loss of most of low contrasted details. In contrast, the proposed method not only better preserves curvilinear structures, but also reconnects some parts that may have been disconnected by noise. Moreover, it can be easily extensible to graphs and successfully applied to different types of data such as medical imagery (i.e. vessels, hearth coronaries etc), material samples (i.e. concrete) and satellite signals (i.e. streets, rivers etc.). In particular, we will show results and performances about an implementation targeting new generation of High Performance Computing (HPC) architectures where different types of coprocessors cooperate. The involved dataset consists of approximately 200 images of cracks, captured in three different tunnels by a robotic machine designed for the European ROBO-SPECT project.Open Acces

    High-Level Facade Image Interpretation using Marked Point Processes

    Get PDF
    In this thesis, we address facade image interpretation as one essential ingredient for the generation of high-detailed, semantic meaningful, three-dimensional city-models. Given a single rectified facade image, we detect relevant facade objects such as windows, entrances, and balconies, which yield a description of the image in terms of accurate position and size of these objects. Urban digital three-dimensional reconstruction and documentation is an active area of research with several potential applications, e.g., in the area of digital mapping for navigation, urban planning, emergency management, disaster control or the entertainment industry. A detailed building model which is not just a geometric object enriched with texture, allows for semantic requests as the number of floors or the location of balconies and entrances. Facade image interpretation is one essential step in order to yield such models. In this thesis, we propose the interpretation of facade images by combining evidence for the occurrence of individual object classes which we derive from data, and prior knowledge which guides the image interpretation in its entirety. We present a three-step procedure which generates features that are suited to describe relevant objects, learns a representation that is suited for object detection, and that enables the image interpretation using the results of object detection while incorporating prior knowledge about typical configurations of facade objects, which we learn from training data. According to these three sub-tasks, our major achievements are: We propose a novel method for facade image interpretation based on a marked point process. Therefor, we develop a model for the description of typical configurations of facade objects and propose an image interpretation system which combines evidence derived from data and prior knowledge about typical configurations of facade objects. In order to generate evidence from data, we propose a feature type which we call shapelets. They are scale invariant and provide large distinctiveness for facade objects. Segments of lines, arcs, and ellipses serve as basic features for the generation of shapelets. Therefor, we propose a novel line simplification approach which approximates given pixel-chains by a sequence of lines, circular, and elliptical arcs. Among others, it is based on an adaption to Douglas-Peucker's algorithm, which is based on circles as basic geometric elements We evaluate each step separately. We show the effects of polyline segmentation and simplification on several images with comparable good or even better results, referring to a state-of-the-art algorithm, which proves their large distinctiveness for facade objects. Using shapelets we provide a reasonable classification performance on a challenging dataset, including intra-class variations, clutter, and scale changes. Finally, we show promising results for the facade interpretation system on several datasets and provide a qualitative evaluation which demonstrates the capability of complete and accurate detection of facade objectsHigh-Level Interpretation von Fassaden-Bildern unter Benutzung von Markierten PunktprozessenDas Thema dieser Arbeit ist die Interpretation von Fassadenbildern als wesentlicher Beitrag zur Erstellung hoch detaillierter, semantisch reichhaltiger dreidimensionaler Stadtmodelle. In rektifizierten Einzelaufnahmen von Fassaden detektieren wir relevante Objekte wie Fenster, Türen und Balkone, um daraus eine Bildinterpretation in Form von präzisen Positionen und Größen dieser Objekte abzuleiten. Die digitale dreidimensionale Rekonstruktion urbaner Regionen ist ein aktives Forschungsfeld mit zahlreichen Anwendungen, beispielsweise der Herstellung digitaler Kartenwerke für Navigation, Stadtplanung, Notfallmanagement, Katastrophenschutz oder die Unterhaltungsindustrie. Detaillierte Gebäudemodelle, die nicht nur als geometrische Objekte repräsentiert und durch eine geeignete Textur visuell ansprechend dargestellt werden, erlauben semantische Anfragen, wie beispielsweise nach der Anzahl der Geschosse oder der Position der Balkone oder Eingänge. Die semantische Interpretation von Fassadenbildern ist ein wesentlicher Schritt für die Erzeugung solcher Modelle. In der vorliegenden Arbeit lösen wir diese Aufgabe, indem wir aus Daten abgeleitete Evidenz für das Vorkommen einzelner Objekte mit Vorwissen kombinieren, das die Analyse der gesamten Bildinterpretation steuert. Wir präsentieren dafür ein dreistufiges Verfahren: Wir erzeugen Bildmerkmale, die für die Beschreibung der relevanten Objekte geeignet sind. Wir lernen, auf Basis abgeleiteter Merkmale, eine Repräsentation dieser Objekte. Schließlich realisieren wir die Bildinterpretation basierend auf der zuvor gelernten Repräsentation und dem Vorwissen über typische Konfigurationen von Fassadenobjekten, welches wir aus Trainingsdaten ableiten. Wir leisten dazu die folgenden wissenschaftlichen Beiträge: Wir schlagen eine neuartige Me-thode zur Interpretation von Fassadenbildern vor, die einen sogenannten markierten Punktprozess verwendet. Dafür entwickeln wir ein Modell zur Beschreibung typischer Konfigurationen von Fassadenobjekten und entwickeln ein Bildinterpretationssystem, welches aus Daten abgeleitete Evidenz und a priori Wissen über typische Fassadenkonfigurationen kombiniert. Für die Erzeugung der Evidenz stellen wir Merkmale vor, die wir Shapelets nennen und die skaleninvariant und durch eine ausgesprochene Distinktivität im Bezug auf Fassadenobjekte gekennzeichnet sind. Als Basismerkmale für die Erzeugung der Shapelets dienen Linien-, Kreis- und Ellipsensegmente. Dafür stellen wir eine neuartige Methode zur Vereinfachung von Liniensegmenten vor, die eine Pixelkette durch eine Sequenz von geraden Linienstücken und elliptischen Bogensegmenten approximiert. Diese basiert unter anderem auf einer Adaption des Douglas-Peucker Algorithmus, die anstelle gerader Linienstücke, Bogensegmente als geometrische Basiselemente verwendet. Wir evaluieren jeden dieser drei Teilschritte separat. Wir zeigen Ergebnisse der Liniensegmen-tierung anhand verschiedener Bilder und weisen dabei vergleichbare und teilweise verbesserte Ergebnisse im Vergleich zu bestehende Verfahren nach. Für die vorgeschlagenen Shapelets weisen wir in der Evaluation ihre diskriminativen Eigenschaften im Bezug auf Fassadenobjekte nach. Wir erzeugen auf einem anspruchsvollen Datensatz von skalenvariablen Fassadenobjekten, mit starker Variabilität der Erscheinung innerhalb der Klassen, vielversprechende Klassifikationsergebnisse, die die Verwendbarkeit der gelernten Shapelets für die weitere Interpretation belegen. Schließlich zeigen wir Ergebnisse der Interpretation der Fassadenstruktur anhand verschiedener Datensätze. Die qualitative Evaluation demonstriert die Fähigkeit des vorgeschlagenen Lösungsansatzes zur vollständigen und präzisen Detektion der genannten Fassadenobjekte

    3D Reconstruction of Indoor Corridor Models Using Single Imagery and Video Sequences

    Get PDF
    In recent years, 3D indoor modeling has gained more attention due to its role in decision-making process of maintaining the status and managing the security of building indoor spaces. In this thesis, the problem of continuous indoor corridor space modeling has been tackled through two approaches. The first approach develops a modeling method based on middle-level perceptual organization. The second approach develops a visual Simultaneous Localisation and Mapping (SLAM) system with model-based loop closure. In the first approach, the image space was searched for a corridor layout that can be converted into a geometrically accurate 3D model. Manhattan rule assumption was adopted, and indoor corridor layout hypotheses were generated through a random rule-based intersection of image physical line segments and virtual rays of orthogonal vanishing points. Volumetric reasoning, correspondences to physical edges, orientation map and geometric context of an image are all considered for scoring layout hypotheses. This approach provides physically plausible solutions while facing objects or occlusions in a corridor scene. In the second approach, Layout SLAM is introduced. Layout SLAM performs camera localization while maps layout corners and normal point features in 3D space. Here, a new feature matching cost function was proposed considering both local and global context information. In addition, a rotation compensation variable makes Layout SLAM robust against cameras orientation errors accumulations. Moreover, layout model matching of keyframes insures accurate loop closures that prevent miss-association of newly visited landmarks to previously visited scene parts. The comparison of generated single image-based 3D models to ground truth models showed that average ratio differences in widths, heights and lengths were 1.8%, 3.7% and 19.2% respectively. Moreover, Layout SLAM performed with the maximum absolute trajectory error of 2.4m in position and 8.2 degree in orientation for approximately 318m path on RAWSEEDS data set. Loop closing was strongly performed for Layout SLAM and provided 3D indoor corridor layouts with less than 1.05m displacement errors in length and less than 20cm in width and height for approximately 315m path on York University data set. The proposed methods can successfully generate 3D indoor corridor models compared to their major counterpart

    Toward Efficient and Robust Large-Scale Structure-from-Motion Systems

    Get PDF
    The ever-increasing number of images that are uploaded and shared on the Internet has recently been leveraged by computer vision researchers to extract 3D information about the content seen in these images. One key mechanism to extract this information is structure-from-motion, which is the process of recovering the 3D geometry (structure) of a scene via a set of images from different viewpoints (camera motion). However, when dealing with crowdsourced datasets comprised of tens or hundreds of millions of images, the magnitude and diversity of the imagery poses challenges such as robustness, scalability, completeness, and correctness for existing structure-from-motion systems. This dissertation focuses on these challenges and demonstrates practical methods to address the problems of data association and verification within structure-from-motion systems. Data association within structure-from-motion systems consists of the discovery of pairwise image overlap within the input dataset. In order to perform this discovery, previous systems assumed that information about every image in the input dataset could be stored in memory, which is prohibitive for large-scale photo collections. To address this issue, we propose a novel streaming-based framework for the discovery of related sets of images, and demonstrate our approach on a crowdsourced dataset containing 100 million images from all around the world. Results illustrate that our streaming-based approach does not compromise model completeness, but achieves unprecedented levels of efficiency and scalability. The verification of individual data associations is difficult to perform during the process of structure-from-motion, as standard methods have limited scope when determining image overlap. Therefore, it is possible for erroneous associations to form, especially when there are symmetric, repetitive, or duplicate structures which can be incorrectly associated with each other. The consequences of these errors are incorrectly placed cameras and scene geometry within the 3D reconstruction. We present two methods that can detect these local inconsistencies and successfully resolve them into a globally consistent 3D model. In our evaluation, we show that our techniques are efficient, are robust to a variety of scenes, and outperform existing approaches.Doctor of Philosoph
    corecore