    Spatiotemporal oriented energies for spacetime stereo

    This paper presents a novel approach to recovering temporally coherent estimates of 3D structure of a dynamic scene from a sequence of binocular stereo images. The approach is based on matching spatiotemporal orientation distributions between left and right temporal image streams, which encapsulate both local spatial and temporal structure for disparity estimation. By capturing spatial and temporal structure in this unified fashion, both sources of information combine to yield disparity estimates that are naturally temporally coherent, while helping to resolve matches that might be ambiguous when either source is considered alone. Further, by allowing subsets of the orientation measurements to support different disparity estimates, an approach to recovering multilayer disparity from spacetime stereo is realized. The approach has been implemented with real-time performance on commodity GPUs. Empirical evaluation shows that the approach yields qualitatively and quantitatively superior disparity estimates in comparison to various alternative approaches, including the ability to provide accurate multilayer estimates in the presence of (semi)transparent and specular surfaces.

    Egomotion estimation using binocular spatiotemporal oriented energy

    Camera egomotion estimation is concerned with the recovery of a camera's motion (e.g., instantaneous translation and rotation) as it moves through its environment. It has been demonstrated to be of both theoretical and practical interest. This thesis documents a novel algorithm for egomotion estimation based on binocularly matched spatiotemporal oriented energy distributions. Basing the estimation on oriented energy measurements makes it possible to recover egomotion without the need to establish temporal correspondences or convert disparity into 3D world coordinates. The resulting algorithm has been realized in software and evaluated quantitatively on a novel laboratory dataset with ground truth as well as qualitatively on both indoor and outdoor real-world datasets. Performance is evaluated relative to comparable alternative algorithms and shown to be the best overall.

    Colour videos with depth : acquisition, processing and evaluation

    The human visual system lets us perceive the world around us in three dimensions by integrating evidence from depth cues into a coherent visual model of the world. The equivalent in computer vision and computer graphics are geometric models, which provide a wealth of information about represented objects, such as depth and surface normals. Videos do not contain this information, but only provide per-pixel colour information. In this dissertation, I hence investigate a combination of videos and geometric models: videos with per-pixel depth (also known as RGBZ videos). I consider the full life cycle of these videos: from their acquisition, via filtering and processing, to stereoscopic display. I propose two approaches to capture videos with depth. The first is a spatiotemporal stereo matching approach based on the dual-cross-bilateral grid – a novel real-time technique derived by accelerating a reformulation of an existing stereo matching approach. This is the basis for an extension which incorporates temporal evidence in real time, resulting in increased temporal coherence of disparity maps – particularly in the presence of image noise. The second acquisition approach is a sensor fusion system which combines data from a noisy, low-resolution time-of-flight camera and a high-resolution colour video camera into a coherent, noise-free video with depth. The system consists of a three-step pipeline that aligns the video streams, efficiently removes and fills invalid and noisy geometry, and finally uses a spatiotemporal filter to increase the spatial resolution of the depth data and strongly reduce depth measurement noise. I show that these videos with depth empower a range of video processing effects that are not achievable using colour video alone. These effects critically rely on the geometric information, like a proposed video relighting technique which requires high-quality surface normals to produce plausible results. 
In addition, I demonstrate enhanced non-photorealistic rendering techniques and the ability to synthesise stereoscopic videos, which allows these effects to be applied stereoscopically. These stereoscopic renderings inspired me to study stereoscopic viewing discomfort. The result is a surprisingly simple computational model that predicts the visual comfort of stereoscopic images. I validated this model using a perceptual study, which showed that it correlates strongly with human comfort ratings. This makes it ideal for automatic comfort assessment, without the need for costly and lengthy perceptual studies.

    Enhancing low-level features with mid-level cues

    Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency, or invariance properties. This thesis belongs to the latter group. Invariance is a particularly relevant aspect if we intend to work with dense features. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, enforcing invariance; dense features need to be computed on arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work. In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features. We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. 
We prioritize solutions that are simple, general, and efficient. Our main contributions are as follows: (a) An approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines. (b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion. (c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.

    Quality Index for Stereoscopic Images by Separately Evaluating Adding and Subtracting

    The human visual system (HVS) plays an important role in stereo image quality perception. How to exploit knowledge of visual perception in image quality assessment models has therefore attracted considerable interest. This paper proposes a full-reference metric for quality assessment of stereoscopic images based on the binocular difference channel and the binocular summation channel. For a stereo pair, the binocular summation map and binocular difference map are computed first by adding and subtracting the left image and right image. Then the binocular summation is decoupled into two parts, namely additive impairments and detail losses. The quality of binocular summation is obtained as the adaptive combination of the quality of detail losses and additive impairments. The quality of binocular summation is computed by using the Contrast Sensitivity Function (CSF) and weighted multi-scale SSIM (MS-SSIM). Finally, the quality of binocular summation and binocular difference is integrated into an overall quality index. The experimental results indicate that, compared with existing metrics, the proposed metric is highly consistent with subjective quality assessment and is a robust measure. The results also indirectly support the hypothesis that binocular summation and binocular difference channels exist.
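
    The first step described above, forming the binocular summation and difference maps, can be sketched in a few lines. This is only a minimal illustration, assuming grayscale images stored as nested lists of equal size; the function name binocular_channels is hypothetical, and the later stages of the metric (decoupling, CSF weighting, MS-SSIM pooling) are not shown because the abstract does not specify them.

```python
def binocular_channels(left, right):
    """Return (summation, difference) maps for two equally sized grayscale images.

    Images are nested lists of numbers (rows of pixel intensities).
    Hypothetical helper for illustration; not the authors' implementation.
    """
    same_rows = len(left) == len(right)
    if not same_rows or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("left and right images must have the same shape")
    # Summation channel: pixel-wise left + right.
    summation = [[l + r for l, r in zip(lrow, rrow)]
                 for lrow, rrow in zip(left, right)]
    # Difference channel: pixel-wise left - right.
    difference = [[l - r for l, r in zip(lrow, rrow)]
                  for lrow, rrow in zip(left, right)]
    return summation, difference


left = [[10, 20], [30, 40]]
right = [[12, 18], [30, 35]]
summation, difference = binocular_channels(left, right)
print(summation)   # [[22, 38], [60, 75]]
print(difference)  # [[-2, 2], [0, 5]]
```

    Identical regions of the two views cancel in the difference map (the zero above), which is why that channel is sensitive to asymmetric distortions between the left and right images.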

    Single View Modeling and View Synthesis

    This thesis develops new algorithms to produce 3D content from a single camera. Today, amateurs can use hand-held camcorders to capture and display the 3D world in 2D, using mature technologies. However, there is always a strong desire to record and re-explore the 3D world in 3D. To achieve this goal, current approaches usually make use of a camera array, which suffers from tedious setup and calibration processes, as well as lack of portability, limiting its application to lab experiments. In this thesis, I try to produce 3D content using a single camera, making it as simple as shooting pictures. It requires a new front-end capturing device rather than a regular camcorder, as well as more sophisticated algorithms. First, in order to capture highly detailed object surfaces, I designed and developed a depth camera based on a novel technique called light fall-off stereo (LFS). The LFS depth camera outputs color+depth image sequences and achieves 30 fps, which is necessary for capturing dynamic scenes. Based on the output color+depth images, I developed a new approach that builds 3D models of dynamic and deformable objects. While the camera can only capture part of a whole object at any instant, partial surfaces are assembled together to form a complete 3D model by a novel warping algorithm. Inspired by the success of single view 3D modeling, I extended my exploration into 2D-3D video conversion that does not utilize a depth camera. I developed a semi-automatic system that converts monocular videos into stereoscopic videos, via view synthesis. It combines motion analysis with user interaction, aiming to transfer as much depth-inferring work as possible from the user to the computer. I developed two new methods that analyze the optical flow in order to provide additional qualitative depth constraints. The automatically extracted depth information is presented in the user interface to assist with user labeling work.
In this thesis, I developed new algorithms to produce 3D content from a single camera. Depending on the input data, my algorithm can build high-fidelity 3D models of dynamic and deformable objects if depth maps are provided. Otherwise, it can turn the video clips into stereoscopic video.

    Inspired by nature: timescale-free and grid-free event-based computing with spiking neural networks

    Computer vision is enjoying huge success in visual processing applications such as facial recognition, object identification, and navigation. Most of these studies work with traditional cameras, which produce frames at predetermined fixed time intervals. Real-life visual stimuli are, however, generated when changes occur in the environment and are irregular in timing. Biological visual neural systems operate on these changes and are hence free from any fixed timescales that are related to the timing of events in visual input. Inspired by biological systems, neuromorphic devices provide a new way to record visual data. These devices typically have parallel arrays of sensors which operate asynchronously. They have particular potential for robotics due to their low latency, efficient use of bandwidth and low power requirements. There are a variety of neuromorphic devices for detecting different sensory information; this thesis focuses on using the Dynamic Vision Sensor (DVS) for visual data collection. Event-based sensory inputs are generated on demand as changes happen in the environment. There are no systematic timescales in these activities and the asynchronous nature of the sensors adds to the irregularity of time intervals between events, making event-based data timescale-free. Although the array of sensors is generally arranged as a grid in vision sensors, events in the real world exist in continuous space. Biological systems are not restricted to grid-based sampling, and it is an open question whether event-based data could similarly take advantage of grid-free processing algorithms.
Studying visual data in a timescale-free and grid-free way, which is fundamentally different from working with traditional video data sampled at fixed time intervals and on a dense, rigid spatial grid, requires conceptual viewpoints and methods of computation that are not typically employed in existing studies. Bio-inspired computing involves computational components that mimic or at least take inspiration from how nature works. This fusion of engineering and biology often provides insights into complex computational problems. Artificial neural networks, a computing paradigm that is inspired by how our brains work, have been studied widely with visual data. This thesis uses a type of artificial neural network, the event-based spiking neural network, as the basic framework to process event-based visual data. Building upon spiking neural networks, this thesis introduces two methods that process event-based data with the principles of being timescale-free and grid-free. The first method preprocesses events as distributions of Gaussian-shaped spatiotemporal volumes, and then introduces a new neuron model with time-delayed dendrites and dendritic and axonal computation as the main building blocks of the spiking neural network to perform long-term predictions. Gaussians are used for simplicity. This Gaussian-based method is shown in this thesis to outperform a commonly used iterative prediction paradigm on DVS data. The second method involves a new concept for processing event-based data based on the “light cone” idea in physics. Starting from a given point in real space at a given time, a light cone is the set of points in spacetime reachable without exceeding the speed of light, and these points trace out spacetime trajectories called world lines. The light cone concept is applied to DVS data. As an object moves with respect to the DVS, the events generated are related by their speeds relative to the DVS.
An observer can calculate possible world lines for each point but has no access to the correct one. The idea of a “motion cone” is introduced to refer to the distribution of possible world lines for an event. Motion cones provide a novel theory for the early stages of visual processing. Instead of spatial clustering, world lines produce a new representation determined by a speed-based clustering of events. A novel spiking neural network model with dendritic connections based on motion cones is proposed, with the ability to predict future motion patterns over the long term. Freedom from timescales and fixed grid sizes are fundamental characteristics of neuromorphic event-based data, but few algorithms to date exploit their potential. Focusing on the inter-event relationship in the continuous spatiotemporal volume can preserve these features during processing. This thesis presents two examples of incorporating the features of being timescale-free and grid-free into algorithm development and examines their performance on real-world DVS data. These new concepts and models contribute to the neuromorphic computation field by providing new ways of thinking about event-based representations and their associated algorithms. They also have the potential to stimulate rethinking of representations in the early stages of an event-based vision system. To aid algorithm development, a benchmarking data set has been collated, containing data ranging from simple environment changes collected from a stationary camera to complex, environmentally rich navigation performed by mobile robots. Studies conducted in this thesis use examples from this benchmarking data set, which is also made available to the public.
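
    The motion cone idea can be made concrete with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: events are (x, y, t) tuples, a candidate world line between two events is summarised by the constant speed needed to connect them, the cone is bounded by a hypothetical max_speed parameter, and "speed-based clustering" is approximated by simple binning with a hypothetical bin_width. It is not the thesis's spiking-network model.

```python
import math


def implied_speed(event_a, event_b):
    """Speed needed to move from event_a to event_b; events are (x, y, t).

    Returns None when event_b does not lie strictly in the future of event_a.
    """
    xa, ya, ta = event_a
    xb, yb, tb = event_b
    dt = tb - ta
    if dt <= 0:
        return None
    return math.hypot(xb - xa, yb - ya) / dt


def in_motion_cone(event_a, event_b, max_speed):
    """True if event_b is reachable from event_a without exceeding max_speed."""
    speed = implied_speed(event_a, event_b)
    return speed is not None and speed <= max_speed


def cluster_by_speed(seed, events, max_speed, bin_width):
    """Group events inside the seed's motion cone by implied-speed bin."""
    bins = {}
    for ev in events:
        speed = implied_speed(seed, ev)
        if speed is None or speed > max_speed:
            continue  # outside the cone (too fast, or not in the future)
        bins.setdefault(int(speed // bin_width), []).append(ev)
    return bins


seed = (0.0, 0.0, 0.0)
events = [
    (3.0, 4.0, 1.0),    # 5 units of distance in 1 unit of time
    (6.0, 8.0, 2.0),    # 10 units in 2: same implied speed as above
    (30.0, 40.0, 1.0),  # 50 units in 1: outside a cone with max_speed 10
    (1.0, 0.0, -1.0),   # earlier than the seed, so never in its cone
]
clusters = cluster_by_speed(seed, events, max_speed=10.0, bin_width=2.0)
print(clusters)  # {2: [(3.0, 4.0, 1.0), (6.0, 8.0, 2.0)]}
```

    The first two events fall into the same bin even though they are far apart in space, because both are consistent with the same speed from the seed; this is the sense in which the grouping is speed-based rather than spatial.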

    What Influences the Relative Proportion of ‘Rigid Rotation’ Versus ‘Non-Rigid Deformation’ in a Bistable Stroboscopic Motion Display

    When observers are presented with a bistable stroboscopic display of an object that appears to transform over time in three-dimensional (3D) space, the dominance of one percept over another is influenced both by stimulus parameters and by cognitive factors. Two experiments were designed to reveal which of several manipulated variables most strongly influences which of two responses is more often observed, one being termed ‘Rigid Rotation’ and the other termed ‘Non-rigid Deformation.’ These two responses were clearly distinguished when drawings of a 3D rectangular box were presented stroboscopically in a two-frame animation with precise control over the Interstimulus Interval (ISI). In the first experiment, the relative dominance of the ‘Rigid Rotation’ response was reduced by changing the colour of one surface of the rectangular box in a manner that was inconsistent with the rotation of the box. Similarly, the relative dominance of the ‘Non-rigid Deformation’ response was reduced by changing the colour of one surface of the rectangular box in a manner that was inconsistent with deformation of the box. In the second experiment, the changes in the relative dominance of the competing motion percepts were observed after prolonged viewing of four different adapting stimuli. The adaptation aftereffects were shown to depend more upon the ISI of the stroboscopic display of the adapting stimulus than upon what motion was reportedly ‘seen’ during the viewing of the adapting stimulus. Ultimately, the adaptation aftereffect revealed that the relative dominance of the two movement percepts was affected most strongly by the manipulation of a single temporal variable: the ISI. Nonetheless, the results of the first experiment confirmed the influence of surface colour variations on the ‘Rigid Rotation’ versus ‘Non-rigid Deformation’ responses.