
    Towards Real-time Mixed Reality Matting In Natural Scenes

    In Mixed Reality scenarios, background replacement is a common way to immerse a user in a synthetic environment. Properly identifying the background pixels in an image or video is a difficult problem known as matting. Proper alpha mattes usually come from human guidance, special hardware setups, or color-dependent algorithms. This is a consequence of the under-constrained nature of the per-pixel alpha blending equation. In constant color matting, research identifies and replaces a background that is a single color, known as the chroma key color. Unfortunately, these algorithms force a controlled physical environment and favor constant, uniform lighting. More generic approaches, such as natural image matting, have made progress finding alpha matte solutions in environments with naturally occurring backgrounds. However, even for the quicker algorithms, the generation of trimaps, indicating regions of known foreground and background pixels, normally requires human interaction or offline computation. This research addresses ways to automatically solve an alpha matte for an image in real time, and by extension a video, using a consumer-level GPU. It does so even in the context of noisy environments that result in less reliable constraints than found in controlled settings. To attack these challenges, we are particularly interested in automatically generating trimaps from depth buffers for dynamic scenes so that algorithms requiring denser constraints may be used. The resulting computation is parallelizable so that it may run on a GPU and should work for natural images as well as chroma key backgrounds. Extra input may be required, but when this occurs, commodity hardware available in most Mixed Reality setups should be able to provide it. This allows us to provide real-time alpha mattes for Mixed Reality scenarios that take place in relatively controlled environments. As a consequence, while monochromatic backdrops (such as green screens or retro-reflective material) aid the algorithm's accuracy, they are not an explicit requirement. Finally, we explore a sub-image-based approach to parallelize an existing hierarchical approach on high resolution imagery. We show that locality can be exploited to significantly reduce the memory and compute requirements previously necessary when computing alpha mattes of high resolution images. We achieve this using a parallelizable scheme that is independent of both the matting algorithm and the image features. Combined, these research topics provide a basis for Mixed Reality scenarios using real-time natural image matting on high definition video sources.
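
    The compositing model underlying matting is the per-pixel alpha blending equation I = alpha*F + (1 - alpha)*B, which is under-constrained because the foreground F, the background B, and alpha are all unknown at each pixel. The sketch below illustrates one way a trimap could be derived from a depth buffer, as motivated above: pixels clearly in front of the background depth become known foreground, the rest become known background, and a dilated band around the boundary is marked unknown for the matting solver. The function name, thresholds, and use of SciPy are illustrative assumptions, not the thesis's GPU implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def depth_trimap(depth, background_depth, margin=0.05, band=7):
    """Illustrative trimap from a depth buffer (hypothetical thresholds).

    depth            -- per-pixel depth of the current frame
    background_depth -- per-pixel depth of the empty scene
    margin           -- depth difference treated as 'clearly in front'
    band             -- width (in dilation steps) of the unknown boundary band
    """
    fg = depth < (background_depth - margin)          # clearly in front of the background
    bg = ~fg                                          # everything else
    # Unknown band: pixels near the fg/bg boundary, where alpha must be solved.
    boundary = binary_dilation(fg, iterations=band) & binary_dilation(bg, iterations=band)
    trimap = np.where(fg, 255, 0).astype(np.uint8)    # 255 = foreground, 0 = background
    trimap[boundary] = 128                            # 128 = unknown region
    return trimap
```

    A matting algorithm requiring dense constraints would then solve for alpha only inside the unknown band.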

    Stochastic Methods for Fine-Grained Image Segmentation and Uncertainty Estimation in Computer Vision

    In this dissertation, we exploit concepts of probability theory, stochastic methods and machine learning to address three existing limitations of deep learning-based models for image understanding. First, although convolutional neural networks (CNN) have substantially improved the state of the art in image understanding, conventional CNNs provide segmentation masks that poorly adhere to object boundaries, a critical limitation for many potential applications. Second, training deep learning models requires large amounts of carefully selected and annotated data, but large-scale annotation of image segmentation datasets is often prohibitively expensive. And third, conventional deep learning models also lack the capability of uncertainty estimation, which compromises both decision making and model interpretability. To address these limitations, we introduce the Region Growing Refinement (RGR) algorithm, an unsupervised post-processing algorithm that exploits Monte Carlo sampling and pixel similarities to propagate high-confidence labels into regions of low-confidence classification. The probabilistic Region Growing Refinement (pRGR) provides RGR with a rigorous mathematical foundation that exploits concepts of Bayesian estimation and variance reduction techniques. Experiments demonstrate both the effectiveness of (p)RGR for the refinement of segmentation predictions and its suitability for uncertainty estimation, since the variance estimates obtained in its Monte Carlo iterations are highly correlated with segmentation accuracy. We also introduce FreeLabel, an intuitive open-source web interface that exploits RGR to allow users to obtain high-quality segmentation masks with just a few freehand scribbles, in a matter of seconds. Designed to benefit the computer vision community, FreeLabel can be used for both crowdsourced and private annotation and has a modular structure that can be easily adapted for any image dataset. The practical relevance of the methods developed in this dissertation is illustrated through applications in agricultural and healthcare-related domains. We have combined RGR and modern CNNs for fine segmentation of fruit flowers, motivated by the importance of automated bloom intensity estimation for optimization of fruit orchard management and, possibly, automating procedures such as flower thinning and pollination. We also exploited an early version of FreeLabel to annotate novel datasets for segmentation of fruit flowers, which are now publicly available. Finally, this dissertation also describes work on fine segmentation and gaze estimation for images collected from assisted living environments, with the ultimate goal of assisting geriatricians in evaluating the health status of patients in such facilities.
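
    The refinement idea behind RGR can be pictured as repeated, randomized region growing: high-confidence pixels act as seeds whose labels are propagated to similar-looking neighbours, and averaging the outcomes over Monte Carlo iterations yields both a refined score map and a variance that can serve as an uncertainty proxy. The following is a minimal schematic sketch of that idea, with hypothetical parameter names and thresholds; it is not the authors' (p)RGR implementation.

```python
import numpy as np
from collections import deque

def rgr_refine(image, scores, tau_high=0.9, sim=0.05, n_mc=20, seed_frac=0.5, rng=None):
    """Schematic Monte Carlo region growing (in the spirit of RGR).

    image  -- HxWx3 float array in [0, 1]
    scores -- HxW classifier confidence for the foreground class
    Returns averaged foreground votes in [0, 1]; the variance across runs
    could serve as a rough uncertainty proxy.
    """
    rng = rng or np.random.default_rng(0)
    H, W = scores.shape
    votes = np.zeros((H, W))
    high = np.argwhere(scores > tau_high)               # high-confidence seed pixels
    for _ in range(n_mc):
        keep = high[rng.random(len(high)) < seed_frac]   # random subset of seeds
        label = np.zeros((H, W), bool)
        label[tuple(keep.T)] = True
        q = deque(map(tuple, keep))
        while q:
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and not label[ny, nx]:
                    # grow the label into similar-looking neighbours only
                    if np.linalg.norm(image[ny, nx] - image[y, x]) < sim:
                        label[ny, nx] = True
                        q.append((ny, nx))
        votes += label
    return votes / n_mc
```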

    Generalizations of the Multicut Problem for Computer Vision

    Graph decomposition has always been a very important concept in machine learning and computer vision. Many tasks like image and mesh segmentation, community detection in social networks, as well as object tracking and human pose estimation can be formulated as a graph decomposition problem. The multicut problem in particular is a popular model for optimizing over the decompositions of a given graph. Its main advantage is that no prior knowledge about the number of components or their sizes is required. However, it has several limitations, which we address in this thesis. First, the multicut problem allows specifying a cost or reward only for putting two direct neighbours into distinct components, which limits the expressiveness of the cost function. We introduce special edges into the graph that allow defining a cost or reward for putting any two vertices into distinct components, while preserving the original set of feasible solutions. We show that this considerably improves the quality of image and mesh segmentations. Second, the multicut problem is notoriously NP-hard for general graphs, which limits its applications to small superpixel graphs. We define and implement two primal feasible heuristics to solve the problem. They do not provide any guarantees on the runtime or quality of solutions, but in practice show good convergence behaviour. We perform an extensive comparison on multiple graphs of different sizes and properties. Third, we extend the multicut framework by introducing node labels, so that we can jointly optimize for graph decomposition and node classification with exactly the same optimization algorithm, eliminating the need to hand-tune optimizers for a particular task. To demonstrate its generality, we apply it to diverse computer vision tasks, including human pose estimation, multiple object tracking, and instance-aware semantic segmentation, and show that we can improve on prior results using exactly the same data as in the original works. Finally, we employ multicuts in two applications: 1) a client-server tool for interactive video segmentation, in which, after pre-processing of the video, a user draws strokes on several frames and a time-coherent segmentation of the entire video is computed on the fly; and 2) a method for simultaneous segmentation and tracking of living cells in microscopy data, a challenging task because cells split, which our algorithm accounts for by creating parental hierarchies. We also present results on multiple model fitting: we find models in data heavily corrupted by noise by using higher-order multicuts to identify the components that define these models, and we introduce an extension that allows the optimization to pick better hyperparameters for each discovered model. In summary, this thesis extends the multicut problem in different directions, proposes algorithms for optimization, and applies it to novel data and settings.
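
    In its common edge-labelling form, the multicut problem assigns a cost to every edge and asks for a decomposition of the graph that minimizes the total cost of edges whose endpoints end up in different components. The sketch below shows a small, generic greedy edge-contraction heuristic of the kind often used as a primal feasible baseline for this problem; it is not one of the two heuristics contributed in the thesis, and the input format is an assumption. It always terminates, because every contraction reduces the number of components, and any result it returns is a feasible decomposition.

```python
from collections import defaultdict

def greedy_additive_multicut(n_nodes, weighted_edges):
    """Greedy edge contraction for multicut (illustrative baseline).

    weighted_edges -- iterable of (u, v, w); w > 0 rewards joining u and v,
                      w < 0 rewards separating them.
    Returns a list mapping every node to a component representative.
    """
    comp = list(range(n_nodes))                      # current component of each node

    def find(x):                                     # union-find with path compression
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    # accumulated weight between pairs of components
    w = defaultdict(float)
    for u, v, weight in weighted_edges:
        if u != v:
            w[frozenset((u, v))] += weight

    while True:
        # pick the most attractive pair of distinct components
        best, best_w = None, 0.0
        for pair, weight in w.items():
            a, b = map(find, pair)
            if a != b and weight > best_w:
                best, best_w = (a, b), weight
        if best is None:                             # no positive weight left: stop
            break
        a, b = best
        comp[b] = a                                  # contract component b into a
        new_w = defaultdict(float)                   # re-accumulate weights between roots
        for pair, weight in w.items():
            u, v = map(find, pair)
            if u != v:
                new_w[frozenset((u, v))] += weight
        w = new_w
    return [find(i) for i in range(n_nodes)]
```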

    Efficient and Accurate Disparity Estimation from MLA-Based Plenoptic Cameras

    This manuscript focuses on the processing of images from microlens-array based plenoptic cameras. These cameras enable capturing the light field in a single shot, recording a greater amount of information than conventional cameras and allowing a whole new set of applications to be developed. However, the enhanced information introduces additional challenges and results in higher computational effort. For one, the image is composed of thousands of micro-lens images, making it an unusual case for standard image processing algorithms. Second, the disparity information has to be estimated from those micro-images to create a conventional image and a three-dimensional representation. Therefore, the work in this thesis is devoted to analysing and proposing methodologies for dealing with plenoptic images. A full framework for plenoptic cameras has been built, including the contributions described in this thesis: a blur-aware calibration method to model a plenoptic camera, an optimization method to accurately select the best microlens combination, and an overview of the different types of plenoptic cameras and their representations. Datasets consisting of both real and synthetic images have been used to create a benchmark for different disparity estimation algorithms and to inspect the behaviour of disparity under different compression rates. A robust depth estimation approach has also been developed for light field microscopy and images of biological samples.
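
    In an MLA-based plenoptic image, the same scene point appears in several neighbouring micro-lens images, shifted by an amount that depends on its depth, so a coarse disparity can be obtained by matching a patch from one micro-image against an adjacent one along the lens baseline. The sketch below illustrates this with a simple sum-of-squared-differences search; the patch size, search range, and horizontal-neighbour assumption are illustrative and do not correspond to the estimation methods benchmarked in the thesis.

```python
import numpy as np

def microlens_disparity(mi_left, mi_right, patch=7, max_disp=10):
    """SSD disparity between two horizontally adjacent micro-lens images.

    mi_left, mi_right -- 2D grayscale micro-images of identical size
    Returns the integer shift (in pixels) that best aligns the central patch
    of mi_left with mi_right: a coarse, per-microlens disparity estimate.
    """
    h, w = mi_left.shape
    cy, cx = h // 2, w // 2
    r = patch // 2
    ref = mi_left[cy - r:cy + r + 1, cx - r:cx + r + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        x = cx - d                                    # candidate matching position
        if x - r < 0:
            break
        cand = mi_right[cy - r:cy + r + 1, x - r:x + r + 1].astype(np.float32)
        cost = np.sum((ref - cand) ** 2)              # sum of squared differences
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```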

    Automated inverse-rendering techniques for realistic 3D artefact compositing in 2D photographs

    The process of acquiring images of a scene and modifying the defining structural features of the scene through the insertion of artefacts is known in the literature as compositing. The process can take effect in the 2D domain (where the artefact originates from a 2D image and is inserted into a 2D image) or in the 3D domain (the artefact is defined as a dense 3D triangulated mesh, with textures describing its material properties). Compositing originated as a solution to enhancing, repairing, and more broadly editing photographs and video data alike in the film industry as part of the post-production stage. This is generally thought of as carrying out operations in a 2D domain (a single image with a known width, height, and colour data). The operations involved are sequential and entail separating the foreground from the background (matting), or identifying features from contours (feature matching and segmentation), with the purpose of introducing new data into the original. Since then, compositing techniques have gained more traction in the emerging fields of Mixed Reality (MR), Augmented Reality (AR), robotics and machine vision (scene understanding, scene reconstruction, autonomous navigation). When focusing on the 3D domain, compositing can be translated into a pipeline; in the present document, the term pipeline refers to a software solution formed of stand-alone modules or stages, where the flow of execution runs in a single direction, each module can potentially be used on its own as part of other solutions, and each module takes an input set, addresses a single type of problem, and outputs data for the following stage. The incipient stage acquires the scene data, which then undergoes a number of processing steps aimed at inferring structural properties that ultimately allow for the placement of 3D artefacts anywhere within the scene, rendering a plausible and consistent result with regard to the physical properties of the initial input. This generic approach becomes challenging in the absence of user annotation and labelling of scene geometry, light sources and their respective magnitude and orientation, as well as a clear object segmentation and knowledge of surface properties. A single image, a stereo pair, or even a short image stream may not hold enough information regarding the shape or illumination of the scene; however, increasing the input data will only incur an extensive time penalty, which is an established challenge in the field. Recent state-of-the-art methods address the difficulty of inference in the absence of such data; nonetheless, they do not attempt to solve the challenge of compositing artefacts between existing scene geometry, or cater for the inclusion of new geometry behind complex surface materials such as translucent glass or in front of reflective surfaces. The present work focuses on compositing in the 3D domain and brings forth a software framework that contributes solutions to a number of challenges encountered in the field, including the ability to render physically-accurate soft shadows in the absence of user-annotated scene properties or RGB-D data. Another contribution consists in the timely manner in which the framework achieves a believable result compared to other compositing methods that rely on offline rendering. Neither proprietary hardware nor user expertise is required in order to achieve fast and reliable results within the current framework.
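
    The pipeline notion defined above (stand-alone stages, a single direction of execution, each stage consuming the previous stage's output) can be made concrete with a small composition helper; the stage names below are hypothetical placeholders rather than the framework's actual modules.

```python
from typing import Any, Callable, List

Stage = Callable[[Any], Any]

def run_pipeline(data: Any, stages: List[Stage]) -> Any:
    """Execute stand-alone stages in a single direction; each stage's
    output feeds the next stage, mirroring the pipeline definition above."""
    for stage in stages:
        data = stage(data)
    return data

# Hypothetical stages for illustration only.
def acquire_scene(paths):
    return {"images": paths}                       # load the input photographs

def infer_structure(scene):
    scene["geometry"] = "estimated scene proxy"    # infer geometry and lighting
    scene["lights"] = "estimated light sources"
    return scene

def composite_artefact(scene):
    scene["render"] = "composited result"          # place the artefact and render
    return scene

result = run_pipeline(["photo.jpg"], [acquire_scene, infer_structure, composite_artefact])
```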

    Soft Shadow Removal and Image Evaluation Methods

    High-level image manipulation techniques are in increasing demand as they allow users to intuitively edit photographs to achieve desired effects quickly. As opposed to low-level manipulations, which provide complete freedom but also require specialized skills and significant effort, high-level editing operations, such as removing objects (inpainting), relighting and material editing, need to respect semantic constraints. As such, they shift the burden from the user to the algorithm, which must only allow the subset of modifications that make sense in a given scenario. Shadow removal is one such high-level objective: it is easy for users to understand and specify, but difficult to accomplish realistically due to the complexity of the effects that contribute to the final image. Further, shadows are critical to scene understanding and play a crucial role in making images look realistic. We propose a machine learning-based algorithm that works well with soft shadows, that is, shadows with wide penumbrae, outperforming previous techniques both in performance and in ease of use. We observe that evaluation of such a technique is a difficult problem in itself and one that is often not considered thoroughly in the computer graphics and vision communities, even when perceptual validity is the goal. To tackle this, we propose a set of standardized procedures for image evaluation as well as an authoring system for the creation of image evaluation user studies. In addition to making it possible for researchers, as well as the industry, to rigorously evaluate their image manipulation techniques on large numbers of participants, we incorporate best practices from the human-computer interaction (HCI) and psychophysics communities and provide analysis tools to explore the results in depth.

    Image processing and synthesis: From hand-crafted to data-driven modeling

    This work investigates image and video restoration problems using effective optimization algorithms. First, we study the problem of single-image dehazing while suppressing artifacts in compressed or noisy images and videos. Our method is based on the linear haze model and minimizes the gradient residual between the input and output images, which successfully suppresses new artifacts that are not obvious in the input images. Second, we propose a new method for image inpainting using deep neural networks. Given a set of training data, deep generative models can generate high-quality natural images following the same distribution. We search for the nearest neighbor in the latent space of the deep generative model using a weighted context loss and a prior loss; this latent code is then decoded into a clean, uncorrupted version of the input image. Third, we study the problem of recovering high-quality images from very noisy raw data captured in low-light conditions with short exposures. We build deep neural networks to learn the camera processing pipeline specifically for low-light raw data with an extremely low signal-to-noise ratio (SNR). To train the networks, we capture a new dataset of more than five thousand images with short-exposure and long-exposure pairs. Promising results are obtained compared with the traditional image processing pipeline. Finally, we propose a new method for extreme low-light video processing. The raw video frames are pre-processed using spatial-temporal denoising, and a neural network is trained to remove the error in the pre-processed data, learning to perform the image processing pipeline while encouraging temporal smoothness of the output. Both quantitative and qualitative results demonstrate that the proposed method significantly outperforms existing methods. It also paves the way for future research in this area.
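
    The inpainting step described above amounts to searching a pretrained generative model's latent space for the code whose decoded image best matches the known pixels (the weighted context loss) while remaining plausible under the model (the prior loss). A schematic version of that search is sketched below; the generator interface G, the loss weights, and the optimizer settings are assumptions, not the exact formulation used in this work.

```python
import torch

def inpaint_by_latent_search(G, corrupted, mask, steps=500, lr=0.05, prior_weight=0.1):
    """Search a generator's latent space for an image matching the known pixels.

    G         -- pretrained generator mapping a latent vector to an image;
                 G.latent_dim is an assumed attribute of this hypothetical model
    corrupted -- image tensor with missing regions
    mask      -- 1 where pixels are known/valid, 0 where they are missing
    """
    z = torch.randn(1, G.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = G(z)
        context_loss = ((recon - corrupted).abs() * mask).mean()   # match known pixels only
        prior_loss = (z ** 2).mean()                               # keep z in a plausible region
        loss = context_loss + prior_weight * prior_loss
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = G(z)
    # Keep known pixels from the input; fill missing regions from the generator.
    return mask * corrupted + (1 - mask) * recon
```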

    Automatic video segmentation employing object/camera modeling techniques

    Video compression and storage techniques established in practice still process video sequences as rectangular images without further semantic structure. However, humans watching a video sequence immediately recognize acting objects as semantic units. This semantic object separation is currently not reflected in the technical system, making it difficult to manipulate the video at the object level. The realization of object-based manipulation will introduce many new possibilities for working with videos, like composing new scenes from pre-existing video objects or enabling user interaction with the scene. Moreover, object-based video compression, as defined in the MPEG-4 standard, can provide high compression ratios because the foreground objects can be sent independently from the background. In the case that the scene background is static, the background views can even be combined into a large panoramic sprite image, from which the current camera view is extracted. This results in a higher compression ratio since the sprite image for each scene only has to be sent once. A prerequisite for employing object-based video processing is automatic (or at least user-assisted semi-automatic) segmentation of the input video into semantic units, the video objects. This segmentation is a difficult problem because the computer does not have the vast amount of pre-knowledge that humans subconsciously use for object detection. Thus, even the simple definition of the desired output of a segmentation system is difficult. The subject of this thesis is to provide algorithms for segmentation that are applicable to common video material and that are computationally efficient. The thesis is conceptually separated into three parts. In Part I, an automatic segmentation system for general video content is described in detail. Part II introduces object models as a tool to incorporate user-defined knowledge about the objects to be extracted into the segmentation process. Part III concentrates on the modeling of camera motion in order to relate the observed camera motion to real-world camera parameters. The segmentation system that is described in Part I is based on a background-subtraction technique. The pure background image that is required for this technique is synthesized from the input video itself. Sequences that contain rotational camera motion can also be processed since the camera motion is estimated and the input images are aligned into a panoramic scene background. This approach is fully compatible with the MPEG-4 video-encoding framework, such that the segmentation system can be easily combined with an object-based MPEG-4 video codec. After an introduction to the theory of projective geometry in Chapter 2, which is required for the derivation of camera-motion models, the estimation of camera motion is discussed in Chapters 3 and 4. It is important that the camera-motion estimation is not influenced by foreground object motion. At the same time, the estimation should provide accurate motion parameters such that all input frames can be combined seamlessly into a background image. The core motion estimation is based on a feature-based approach where the motion parameters are determined with a robust-estimation algorithm (RANSAC) in order to distinguish the camera motion from simultaneously visible object motion. Our experiments showed that the robustness of the original RANSAC algorithm in practice does not reach the theoretically predicted performance.
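
    Feature-based camera-motion estimation of the kind described here typically fits a global motion model, such as a homography for rotational camera motion, to sparse feature correspondences, with RANSAC rejecting correspondences on independently moving foreground objects as outliers. The sketch below shows a generic version of this step using OpenCV; the detector choice and thresholds are assumptions, and it does not include the modified estimator developed in Chapter 4.

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray, ransac_thresh=3.0):
    """Fit a global homography to feature matches, rejecting object motion.

    Returns the 3x3 homography mapping previous-frame coordinates to the
    current frame, plus the inlier mask produced by RANSAC.
    """
    orb = cv2.ORB_create(2000)                        # detect and describe features
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC separates camera motion (inliers) from foreground object motion (outliers).
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return H, inliers
```
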
An analysis of the problem has revealed that this is caused by numerical instabilities, which can be significantly reduced by a modification that we describe in Chapter 4. The synthesis of static background images is discussed in Chapter 5. In particular, we present a new algorithm for the removal of the foreground objects from the background image such that a pure scene background remains. The proposed algorithm is optimized to synthesize the background even for difficult scenes in which the background is only visible for short periods of time. The problem is solved by clustering the image content for each region over time, such that each cluster comprises static content. Furthermore, it is exploited that the times in which foreground objects appear in an image region are similar to the corresponding times of neighboring image areas. The reconstructed background could be used directly as the sprite image in an MPEG-4 video coder. However, we have discovered that the counterintuitive approach of splitting the background into several independent parts can reduce the overall amount of data. In the case of general camera motion, the construction of a single sprite image is even impossible. In Chapter 6, a multi-sprite partitioning algorithm is presented, which separates the video sequence into a number of segments for which independent sprites are synthesized. The partitioning is computed in such a way that the total area of the resulting sprites is minimized, while simultaneously satisfying additional constraints. These include a limited sprite-buffer size at the decoder, and the restriction that the image resolution in the sprite should never fall below the input-image resolution. The described multi-sprite approach is fully compatible with the MPEG-4 standard, but provides three advantages. First, any arbitrary rotational camera motion can be processed. Second, the coding cost for transmitting the sprite images is lower, and finally, the quality of the decoded sprite images is better than in previously proposed sprite-generation algorithms. Segmentation masks for the foreground objects are computed with a change-detection algorithm that compares the pure background image with the input images. A particular problem that occurs in the change detection is image misregistration. Since the change detection compares co-located image pixels in the camera-motion-compensated images, a small error in the motion estimation can introduce segmentation errors because non-corresponding pixels are compared. We approach this problem in Chapter 7 by integrating risk maps into the segmentation algorithm that identify pixels for which misregistration would probably result in errors. For these image areas, the change-detection algorithm is modified to disregard the difference values for the pixels marked in the risk map. This modification significantly reduces the number of false object detections in fine-textured image areas. The algorithmic building blocks described above can be combined into a segmentation system in various ways, depending on whether camera motion has to be considered or whether real-time execution is required. These different systems and example applications are discussed in Chapter 8. Part II of the thesis extends the described segmentation system to consider object models in the analysis. Object models allow the user to specify which objects should be extracted from the video.
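
    The misregistration-aware change detection described above can be pictured as ordinary background subtraction in which the difference values at risky pixels are simply ignored. The sketch below illustrates that idea; how the risk map is computed (for example, from strong image gradients) and the threshold value are assumptions, not the algorithm of Chapter 7.

```python
import numpy as np

def change_detection(frame, background, risk_map, thresh=25.0):
    """Foreground mask from background subtraction with a risk map.

    frame, background -- aligned grayscale images of the same shape
    risk_map          -- boolean mask of pixels where small misregistration
                         would corrupt the difference (assumed precomputed)
    Differences at risky pixels are disregarded, reducing false detections
    in fine-textured image areas.
    """
    diff = np.abs(frame.astype(np.float32) - background.astype(np.float32))
    diff[risk_map] = 0.0        # ignore unreliable, misregistration-prone pixels
    return diff > thresh
```
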
In Chapters 9 and 10, a graph-based object model is presented in which the features of the main object regions are summarized in the graph nodes, and the spatial relations between these regions are expressed with the graph edges. The segmentation algorithm is extended by an object-detection algorithm that searches the input image for the user-defined object model. We provide two object-detection algorithms: the first is specific to cartoon sequences and uses an efficient sub-graph matching algorithm, whereas the second processes natural video sequences. With the object-model extension, the segmentation system can be controlled to extract individual objects, even if the input sequence comprises many objects. Chapter 11 proposes an alternative approach to incorporating object models into a segmentation algorithm. The chapter describes a semi-automatic segmentation algorithm in which the user coarsely marks the object and the computer refines this to the exact object boundary. Afterwards, the object is tracked automatically through the sequence. In this algorithm, the object model is defined as the texture along the object contour. This texture is extracted in the first frame and then used during the object tracking to localize the original object. The core of the algorithm uses a graph representation of the image and a newly developed algorithm for computing shortest circular paths in planar graphs. The proposed algorithm is faster than the currently known algorithms for this problem, and it can also be applied to many alternative problems like shape matching. Part III of the thesis elaborates on different techniques to derive information about the physical 3-D world from the camera motion. In the segmentation system, we employ camera-motion estimation, but the obtained parameters have no direct physical meaning. Chapter 12 discusses an extension to the camera-motion estimation that factorizes the motion parameters into physically meaningful parameters (rotation angles, focal length) using camera autocalibration techniques. A special feature of the algorithm is that it can process camera motion that spans several sprites by employing the above multi-sprite technique. Consequently, the algorithm can be applied to arbitrary rotational camera motion. For the analysis of video sequences, it is often required to determine and follow the position of objects. Clearly, the object position in image coordinates provides little information if the viewing direction of the camera is not known. Chapter 13 provides a new algorithm to deduce the transformation between image coordinates and real-world coordinates for the special application of sport-video analysis. In sport videos, the camera view can be derived from markings on the playing field. For this reason, we employ a model of the playing field that describes the arrangement of lines. After detecting significant lines in the input image, a combinatorial search is carried out to establish correspondences between lines in the input image and lines in the model. The algorithm requires no information about the specific color of the playing field and it is very robust to occlusions or poor lighting conditions. Moreover, the algorithm is generic in the sense that it can be applied to any type of sport by simply exchanging the model of the playing field. In Chapter 14, we again consider panoramic background images and particularly focus on their visualization.
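
    A useful reference point for the circular shortest-path problem mentioned above is the brute-force baseline: unwrap the band around the contour into a trellis of columns (positions along the contour) and rows (candidate boundary positions), then run one shortest-path computation per possible start row and keep the cheapest path that returns to its starting row. The sketch below implements this baseline under assumed trellis connectivity; the thesis contributes an algorithm that solves the same problem faster.

```python
import numpy as np

def circular_shortest_path(cost):
    """Naive circular shortest path through a trellis of node costs.

    cost -- array of shape (W, H): W columns around the contour, H candidate
            positions per column. Consecutive columns are connected between
            rows differing by at most one, and column W-1 connects back to
            column 0 at the same row (the circularity constraint).
    Returns (best total cost, list of row indices, one per column).
    """
    W, H = cost.shape
    best_total, best_path = np.inf, None
    for start in range(H):                                  # one DP run per start row
        dist = np.full(H, np.inf)
        dist[start] = cost[0, start]
        back = np.zeros((W, H), dtype=int)
        for col in range(1, W):
            new_dist = np.full(H, np.inf)
            for row in range(H):
                for prev in (row - 1, row, row + 1):        # smoothness constraint
                    if 0 <= prev < H and dist[prev] + cost[col, row] < new_dist[row]:
                        new_dist[row] = dist[prev] + cost[col, row]
                        back[col, row] = prev
            dist = new_dist
        if dist[start] < best_total:                        # path must close at 'start'
            best_total = dist[start]
            path = [start]
            for col in range(W - 1, 0, -1):
                path.append(back[col, path[-1]])
            best_path = path[::-1]
    return best_total, best_path
```
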
Apart from the planar background sprites discussed previously, a frequently used visualization technique for panoramic images is the projection onto a cylinder surface, which is unwrapped into a rectangular image. However, the disadvantage of this approach is that the viewer has no good orientation in the panoramic image because they look in all directions at the same time. In order to provide a more intuitive presentation of wide-angle views, we have developed a visualization technique specialized for the case of indoor environments. We present an algorithm to determine the 3-D shape of the room in which the image was captured, or, more generally, to compute a complete floor plan if several panoramic images captured in each of the rooms are provided. Based on the obtained 3-D geometry, a graphical model of the rooms is constructed, where the walls are displayed with textures that are extracted from the panoramic images. This representation enables virtual walk-throughs in the reconstructed rooms and therefore provides a better orientation for the user. Summarizing, we can conclude that all segmentation techniques employ some definition of foreground objects. These definitions are either explicit, using object models as in Part II of this thesis, or they are implicitly defined, as in the background synthesis in Part I. The results of this thesis show that implicit descriptions, which extract their definition from the video content, work well when the sequence is long enough to extract this information reliably. However, high-level semantics are difficult to integrate into segmentation approaches that are based on implicit models; instead, those semantics should be added as post-processing steps. On the other hand, explicit object models apply semantic pre-knowledge at early stages of the segmentation. Moreover, they can be applied to short video sequences or even still pictures, since no background model has to be extracted from the video. The definition of a general object-modeling technique that is widely applicable and that also enables an accurate segmentation remains an important yet challenging problem for further research.