131 research outputs found
Automatic video segmentation employing object/camera modeling techniques
Practically established video compression and storage techniques still process video sequences as rectangular images without further semantic structure. However, humans watching a video sequence immediately recognize the acting objects as semantic units. This semantic object separation is currently not reflected in the technical system, making it difficult to manipulate the video at the object level. The realization of object-based manipulation will introduce many new possibilities for working with videos, like composing new scenes from pre-existing video objects or enabling user interaction with the scene. Moreover, object-based video compression, as defined in the MPEG-4 standard, can provide high compression ratios because the foreground objects can be sent independently from the background. In the case that the scene background is static, the background views can even be combined into a large panoramic sprite image, from which the current camera view is extracted. This results in a higher compression ratio since the sprite image for each scene only has to be sent once. A prerequisite for employing object-based video processing is automatic (or at least user-assisted semi-automatic) segmentation of the input video into semantic units, the video objects. This segmentation is a difficult problem because the computer does not have the vast amount of pre-knowledge that humans subconsciously use for object detection. Thus, even the simple definition of the desired output of a segmentation system is difficult. The subject of this thesis is to provide algorithms for segmentation that are applicable to common video material and that are computationally efficient. The thesis is conceptually separated into three parts. In Part I, an automatic segmentation system for general video content is described in detail. Part II introduces object models as a tool to incorporate user-defined knowledge about the objects to be extracted into the segmentation process.
Part III concentrates on the modeling of camera motion in order to relate the observed camera motion to real-world camera parameters. The segmentation system that is described in Part I is based on a background-subtraction technique. The pure background image that is required for this technique is synthesized from the input video itself. Sequences that contain rotational camera motion can also be processed, since the camera motion is estimated and the input images are aligned into a panoramic scene background. This approach is fully compatible with the MPEG-4 video-encoding framework, such that the segmentation system can easily be combined with an object-based MPEG-4 video codec. After an introduction to the theory of projective geometry in Chapter 2, which is required for the derivation of camera-motion models, the estimation of camera motion is discussed in Chapters 3 and 4. It is important that the camera-motion estimation is not influenced by foreground object motion. At the same time, the estimation should provide accurate motion parameters such that all input frames can be combined seamlessly into a background image. The core motion estimation is based on a feature-based approach in which the motion parameters are determined with a robust-estimation algorithm (RANSAC) in order to distinguish the camera motion from simultaneously visible object motion. Our experiments showed that the robustness of the original RANSAC algorithm in practice does not reach the theoretically predicted performance. An analysis of the problem revealed that this is caused by numerical instabilities, which can be significantly reduced by a modification that we describe in Chapter 4. The synthesis of static-background images is discussed in Chapter 5. In particular, we present a new algorithm for the removal of the foreground objects from the background image such that a pure scene background remains.
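The RANSAC-based separation of camera motion from object motion can be illustrated with a minimal sketch. The snippet below fits a single global 2-D translation to matched feature points while rejecting outliers caused by independently moving foreground objects; this is only a toy stand-in for the projective motion models estimated in the thesis, and the function name, thresholds, and iteration count are illustrative assumptions.

```python
import numpy as np

def ransac_translation(src, dst, n_iters=200, thresh=2.0, rng=None):
    """Estimate a global 2-D translation between matched points with
    RANSAC, treating independently moving points as outliers.
    (Sketch only: real camera-motion estimation fits a projective model.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        i = rng.integers(len(src))            # minimal sample: one match
        t = dst[i] - src[i]                   # hypothesised translation
        residuals = np.linalg.norm(dst - (src + t), axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on the consensus set for a more accurate estimate
    t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return t, best_inliers
```

Even this toy version shows the key property exploited in the thesis: the consensus set identifies which feature matches follow the dominant (camera) motion and which belong to moving objects.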
The proposed algorithm is optimized to synthesize the background even for difficult scenes in which the background is visible only for short periods of time. The problem is solved by clustering the image content of each region over time, such that each cluster comprises static content. Furthermore, the algorithm exploits the fact that the time intervals in which foreground objects cover an image region are similar to those of neighboring image areas. The reconstructed background could be used directly as the sprite image in an MPEG-4 video coder. However, we have discovered that the counterintuitive approach of splitting the background into several independent parts can reduce the overall amount of data. In the case of general camera motion, the construction of a single sprite image is even impossible. In Chapter 6, a multi-sprite partitioning algorithm is presented, which separates the video sequence into a number of segments, for which independent sprites are synthesized. The partitioning is computed in such a way that the total area of the resulting sprites is minimized, while simultaneously satisfying additional constraints. These include a limited sprite-buffer size at the decoder and the restriction that the image resolution in the sprite should never fall below the input-image resolution. The described multi-sprite approach is fully compatible with the MPEG-4 standard, but provides three advantages. First, any arbitrary rotational camera motion can be processed. Second, the coding cost for transmitting the sprite images is lower. Finally, the quality of the decoded sprite images is better than in previously proposed sprite-generation algorithms. Segmentation masks for the foreground objects are computed with a change-detection algorithm that compares the pure background image with the input images. A special effect that occurs in the change detection is the problem of image misregistration.
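The idea of recovering a pure background from the video itself can be demonstrated with a much simpler stand-in for the clustering approach described above: a per-pixel temporal median, which recovers the static background whenever each pixel shows the background in more than half of the frames. This sketch does not reflect the thesis algorithm's handling of briefly visible backgrounds or its use of neighboring regions.

```python
import numpy as np

def median_background(frames):
    """Per-pixel temporal median over a list of equally sized frames.
    A simple baseline: foreground objects that cover a pixel in fewer
    than half of the frames are removed from the synthesized background."""
    return np.median(np.stack(frames, axis=0), axis=0)
```

The temporal clustering of the thesis can be seen as a generalization that still works when the majority assumption of the median fails.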
Since the change detection compares co-located image pixels in the camera-motion-compensated images, a small error in the motion estimation can introduce segmentation errors because non-corresponding pixels are compared. We approach this problem in Chapter 7 by integrating risk maps into the segmentation algorithm, which identify pixels for which misregistration would probably result in errors. For these image areas, the change-detection algorithm is modified to disregard the difference values for the pixels marked in the risk map. This modification significantly reduces the number of false object detections in fine-textured image areas. The algorithmic building blocks described above can be combined into a segmentation system in various ways, depending on whether camera motion has to be considered or whether real-time execution is required. These different systems and example applications are discussed in Chapter 8. Part II of the thesis extends the described segmentation system to consider object models in the analysis. Object models allow the user to specify which objects should be extracted from the video. In Chapters 9 and 10, a graph-based object model is presented in which the features of the main object regions are summarized in the graph nodes, and the spatial relations between these regions are expressed with the graph edges. The segmentation algorithm is extended by an object-detection algorithm that searches the input image for the user-defined object model. We provide two object-detection algorithms. The first is specific to cartoon sequences and uses an efficient sub-graph matching algorithm, whereas the second processes natural video sequences. With the object-model extension, the segmentation system can be controlled to extract individual objects, even if the input sequence comprises many objects. Chapter 11 proposes an alternative approach to incorporate object models into a segmentation algorithm.
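The risk-map idea can be sketched as follows: pixels lying on strong background gradients produce large difference values under even tiny alignment errors, so their differences are disregarded. The gradient-magnitude risk criterion and the thresholds below are illustrative assumptions; the exact risk-map construction in Chapter 7 may differ.

```python
import numpy as np

def change_mask(background, frame, diff_thresh=30.0, grad_thresh=25.0):
    """Change detection with a simple misregistration risk-map.
    Pixels whose background gradient is strong (fine texture, edges)
    are marked risky and excluded from the change mask."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    gy, gx = np.gradient(background.astype(float))
    risk = np.hypot(gx, gy) > grad_thresh      # likely misregistration errors
    return (diff > diff_thresh) & ~risk
```

A genuine object change in a flat image area survives the risk test, while a large difference sitting on a background edge is suppressed, mirroring the reduction of false detections in fine-textured areas described above.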
The chapter describes a semi-automatic segmentation algorithm in which the user coarsely marks the object and the computer refines this marking to the exact object boundary. Afterwards, the object is tracked automatically through the sequence. In this algorithm, the object model is defined as the texture along the object contour. This texture is extracted in the first frame and then used during the object tracking to localize the original object. The core of the algorithm uses a graph representation of the image and a newly developed algorithm for computing shortest circular paths in planar graphs. The proposed algorithm is faster than the currently known algorithms for this problem, and it can also be applied to many related problems like shape matching. Part III of the thesis elaborates on different techniques to derive information about the physical 3-D world from the camera motion. In the segmentation system, we employ camera-motion estimation, but the obtained parameters have no direct physical meaning. Chapter 12 discusses an extension of the camera-motion estimation that factorizes the motion parameters into physically meaningful parameters (rotation angles, focal length) using camera autocalibration techniques. The distinctive feature of the algorithm is that it can process camera motion that spans several sprites by employing the above multi-sprite technique. Consequently, the algorithm can be applied to arbitrary rotational camera motion. For the analysis of video sequences, it is often required to determine and follow the position of the objects. Clearly, the object position in image coordinates provides little information if the viewing direction of the camera is not known. Chapter 13 provides a new algorithm to deduce the transformation between image coordinates and real-world coordinates for the special application of sport-video analysis. In sport videos, the camera view can be derived from markings on the playing field.
For this reason, we employ a model of the playing field that describes the arrangement of lines. After detecting significant lines in the input image, a combinatorial search is carried out to establish correspondences between lines in the input image and lines in the model. The algorithm requires no information about the specific color of the playing field, and it is very robust to occlusions or poor lighting conditions. Moreover, the algorithm is generic in the sense that it can be applied to any type of sport by simply exchanging the model of the playing field. In Chapter 14, we again consider panoramic background images and particularly focus on their visualization. Apart from the planar background sprites discussed previously, a frequently used visualization technique for panoramic images is the projection onto a cylinder surface, which is unwrapped into a rectangular image. However, the disadvantage of this approach is that the viewer has no good orientation in the panoramic image because they look in all directions at the same time. In order to provide a more intuitive presentation of wide-angle views, we have developed a visualization technique specialized for the case of indoor environments. We present an algorithm to determine the 3-D shape of the room in which the image was captured, or, more generally, to compute a complete floor plan if several panoramic images captured in each of the rooms are provided. Based on the obtained 3-D geometry, a graphical model of the rooms is constructed, where the walls are displayed with textures that are extracted from the panoramic images. This representation enables virtual walk-throughs in the reconstructed rooms and therefore provides a better orientation for the user. Summarizing, we can conclude that all segmentation techniques employ some definition of foreground objects.
These definitions are either explicit, using object models as in Part II of this thesis, or implicit, as in the background synthesis in Part I. The results of this thesis show that implicit descriptions, which extract their definition from the video content, work well when the sequence is long enough to extract this information reliably. However, high-level semantics are difficult to integrate into segmentation approaches that are based on implicit models. Instead, those semantics should be added as post-processing steps. On the other hand, explicit object models apply semantic pre-knowledge at early stages of the segmentation. Moreover, they can be applied to short video sequences or even still pictures, since no background model has to be extracted from the video. The definition of a general object-modeling technique that is widely applicable and that also enables an accurate segmentation remains an important yet challenging problem for further research.
A Voxel-Based Approach for Imaging Voids in Three-Dimensional Point Clouds
Geographically accurate scene models have enormous potential beyond simple visualization with regard to automated scene generation. In recent years, thanks to ever-increasing computational efficiency, there has been significant growth in both the computer vision and photogrammetry communities pertaining to automatic scene reconstruction from multiple-view imagery. The result of these algorithms is a three-dimensional (3D) point cloud, which can be used to derive a final model using surface reconstruction techniques. However, the fidelity of these point clouds has not been well studied, and voids often exist within the point cloud. Voids exist in texturally difficult areas, as well as in areas where multiple views were not obtained during collection, where constant occlusion existed due to collection angles or overlapping scene geometry, or in regions that failed to triangulate accurately. It may be possible to fill in small voids in the scene using surface reconstruction or hole-filling techniques, but this is not the case with larger, more complex voids, and attempting to reconstruct them using only the knowledge of the incomplete point cloud is neither accurate nor aesthetically pleasing.
A method is presented for identifying voids in point clouds by using a voxel-based approach to partition the 3D space. By using collection geometry and information derived from the point cloud, it is possible to detect unsampled voxels such that voids can be identified. This analysis takes into account the location of the camera and the 3D points themselves to capitalize on the idea of free space: voxels that lie on the ray between the camera and a point must be devoid of obstruction, as a clear line of sight is a necessary requirement for reconstruction. Using this approach, voxels are classified into three categories: occupied (contains points from the point cloud), free (rays from the camera to the points passed through the voxel), and unsampled (contains no points, and no rays passed through the area). Voids in the voxel space are manifested as unsampled voxels. A similar line-of-sight analysis can then be used to pinpoint locations at aircraft altitude from which the voids in the point clouds could theoretically be imaged. This work is based on the assumption that including more images of the void areas in the 3D reconstruction process will reduce the number of voids in the point cloud that were the result of a lack of coverage. Voids resulting from texturally difficult areas will not benefit from more imagery in the reconstruction process, and thus are identified and removed prior to the determination of future potential imaging locations.
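The occupied/free/unsampled classification described above can be sketched in a few lines. Each reconstructed point marks its voxel occupied, and the voxels along the camera-to-point ray are marked free; everything else remains unsampled and indicates a potential void. Uniform ray sampling here stands in for an exact voxel traversal (e.g. the Amanatides-Woo algorithm), and the constants and names are illustrative assumptions.

```python
import numpy as np

UNSAMPLED, FREE, OCCUPIED = 0, 1, 2

def classify_voxels(points, camera, grid_shape, voxel_size):
    """Classify voxels of a uniform grid from one camera's reconstruction:
    occupied = contains a 3D point, free = a camera-to-point ray passed
    through it, unsampled = neither (a candidate void)."""
    grid = np.zeros(grid_shape, dtype=np.int8)          # all unsampled
    for p in points:
        direction = p - camera
        n_steps = int(np.linalg.norm(direction) / (0.5 * voxel_size)) + 1
        for s in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            v = tuple(((camera + s * direction) / voxel_size).astype(int))
            if all(0 <= v[i] < grid_shape[i] for i in range(3)):
                grid[v] = max(grid[v], FREE)            # keep occupied marks
    for p in points:
        v = tuple((p / voxel_size).astype(int))
        if all(0 <= v[i] < grid_shape[i] for i in range(3)):
            grid[v] = OCCUPIED
    return grid
```

Running this over all cameras and points of a collection would leave exactly the unsampled voxels that the method flags as voids; the subsequent line-of-sight analysis for new imaging positions reuses the same ray test in reverse.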
Deformable 3-D Modelling from Uncalibrated Video Sequences
Submitted for the degree of Doctor of Philosophy, Queen Mary, University of London
Determining Epipole Location Integrity by Multimodal Sampling
In urban cluttered scenes, a photo provided by a wearable camera may be used by a walking law-enforcement agent as an additional source of information for localizing themselves, or elements of interest related to public safety and security. In this work, we study the problem of locating the epipole, corresponding to the position of the moving camera, in the field of view of a reference camera. We show that the presence of outliers in the standard pipeline for camera relative-pose estimation not only prevents the correct estimation of the epipole location but also degrades the standard uncertainty propagation for the epipole position. We propose a robust method for constructing an epipole location map, and we evaluate its accuracy as well as its level of integrity with respect to standard approaches.
Theorems and algorithms for multiple view geometry with applications to electron tomography
The thesis considers both theory and algorithms for geometric computer vision. The work is framed around the application of autonomous transmission electron microscope image registration.
The theoretical part of the thesis first develops a consistent robust estimator that is evaluated by estimating two-view geometry with both affine and projective camera models. The uncertainty of the fundamental matrix is similarly estimated robustly; the earlier observation that the covariance matrix of the fundamental matrix contains disparity information about the scene is explained, and its utilization in matching is discussed. For point-tracking purposes, a reliable wavelet-based matching technique and two EM algorithms for maximum-likelihood affine reconstruction under missing data are proposed. The thesis additionally discusses the identification of degeneracy as well as affine bundle adjustment.
The application part of the thesis considers transmission electron microscope image registration, first with fiducial gold markers and thereafter without markers. Both methods utilize the techniques proposed in the theoretical part of the thesis and, in addition, a graph matching method is proposed for matching the gold markers. In contrast, alignment without markers is accomplished by tracking interest points of the intensity surface of the images. At the present level of development, the former method is more accurate, but the latter is appropriate for situations where fiducial markers cannot be used.
Perhaps the most significant result of the thesis is the proposed robust estimator, because of its consistency proof and its many application areas, which are not limited to the computer vision field. The other algorithms could prove useful in multiple-view applications in computer vision that have to deal with uncertainty, matching, tracking, and reconstruction. From the viewpoint of image registration, the thesis further achieved its aims, since two accurate image alignment methods are suggested for obtaining the most exact reconstructions in electron tomography.
Efficient Algorithms for Robust Estimation
One of the most commonly encountered tasks in computer vision is the estimation of model parameters from image measurements. This scenario arises in a variety of applications -- for instance, in the estimation of geometric entities, such as camera pose parameters, from feature matches between images. The main challenge in this task is to handle the problem of outliers -- in other words, data points that do not conform to the model being estimated. It is well known that if these outliers are not properly accounted for, even a single outlier in the data can result in arbitrarily bad model estimates. Due to the widespread prevalence of problems of this nature, the field of robust estimation has been well studied over the years, both in the statistics community as well as in computer vision, leading to the development of popular algorithms like Random Sample Consensus (RANSAC). While recent years have seen exciting advances in this area, a number of important issues still remain open. In this dissertation, we aim to address some of these challenges. The main goal of this dissertation is to advance the state of the art in robust estimation techniques by developing algorithms capable of efficiently and accurately delivering model parameter estimates in the face of noise and outliers. To this end, the first contribution of this work is in the development of a coherent framework for the analysis of RANSAC-based robust estimators, which consolidates various improvements made over the years. In turn, this analysis leads naturally to the development of new techniques that combine the strengths of existing methods, and yields high-performance robust estimation algorithms, including for real-time applications. A second contribution of this dissertation is the development of an algorithm that explicitly characterizes the effects of estimation uncertainty in RANSAC. 
This uncertainty arises from small-scale measurement noise that affects the data points and, consequently, impacts the accuracy of the model parameters. We show that knowledge of this measurement noise can be leveraged to develop an inlier classification scheme that is dependent on the model uncertainty, as opposed to a fixed inlier threshold, as in RANSAC. This has the advantage that, given a model with associated uncertainty, we can immediately identify the set of points that support this solution, which in turn leads to an improvement in computational efficiency. Finally, we have also developed an approach to address the issue of the inlier threshold, which is a user-supplied parameter that can vary depending on the estimation problem and the data being processed. Our technique is based on the intuition that the residual errors for good models are in some way consistent with each other, while bad models do not exhibit this consistency. In other words, looking at the relationship between subsets of models can reveal useful information about the validity of the models themselves. We show that it is possible to efficiently identify this consistent behaviour by exploiting residual ordering information coupled with simple non-parametric statistical tests, which leads to an effective algorithm for threshold-free robust estimation.
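The residual-consistency intuition can be illustrated with a toy sketch: two good models should largely agree on which points have the smallest residuals, while a bad model's residual ranking looks unrelated. The overlap of the lowest-residual sets below is a simplistic stand-in for the non-parametric statistical tests used in the dissertation, and the function name and fraction are illustrative assumptions.

```python
import numpy as np

def models_consistent(res_a, res_b, top_frac=0.4):
    """Fraction of overlap between the lowest-residual point sets of two
    models. High overlap suggests both models agree on the inlier
    structure; low overlap suggests at least one model is bad."""
    k = int(len(res_a) * top_frac)
    top_a = set(np.argsort(res_a)[:k])
    top_b = set(np.argsort(res_b)[:k])
    return len(top_a & top_b) / k
```

In this toy form, the overlap score already separates model pairs that share an inlier set from pairs involving a random model, without any residual threshold being chosen.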
3D object recognition without CAD models for industrial robot manipulation
In this work we present a new algorithm for 3D object recognition. The goal is to identify the correct position and orientation of complex objects without using a CAD model, which is the input of the main current systems. The approach we follow performs feature matching. The extracted features are derived solely from shape information, in order to achieve a system that is independent of brightness, colour, and texture. By designing suitable tunable parameters, we also allow recognition in the presence of small deformations.
Robust and affordable localization and mapping for 3D reconstruction. Application to architecture and construction
Simultaneous localization and mapping from a single moving camera is known as Monocular SLAM. This thesis addresses the problem with low-cost cameras, whose main challenge is robustness to noise, blurring, and other artifacts that affect the image. The approach taken is discrete, using only significant image points to localize the camera and to map the environment. The main contribution is a simplification of the pose graph that improves accuracy in the most common scenes, evaluated exhaustively on four datasets. The mapping results yield a 3D reconstruction of the scene that can be used in architecture and construction for Building Information Modeling (BIM). In the second part of the thesis, we propose incorporating this information into an advanced visualization system based on WebGL that helps simplify the adoption of the BIM methodology. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informática.
Contributions for the automatic description of multimodal scenes
Doctoral thesis. Electrical and Computer Engineering. Faculty of Engineering, University of Porto. 200