253 research outputs found

    Disparity compensation using geometric transforms

    This dissertation describes the research and development of techniques to enhance disparity compensation in 3D video compression algorithms. Disparity compensation is usually performed by block matching between views, which disregards the different levels of disparity exhibited by objects at different depths in the scene. An alternative coding scheme is proposed that uses the camera setup information and the depth of objects in the scene to compensate for more complex spatial distortions, improving disparity compensation even with convergent cameras. To perform more accurate disparity compensation, the reference picture list is enriched with additional geometrically transformed images for the most relevant depth levels of objects in the scene, obtained by projecting one view onto another. This scheme can be implemented in any state-of-the-art video codec, such as H.264/AVC or HEVC, to improve the disparity matching accuracy between views. Experimental results using the MV-HEVC extension show the efficiency of the proposed method for coding stereo video, with bitrate savings of up to 2.87% for convergent camera sequences and 1.52% for parallel camera sequences. A method for choosing the geometrically transformed inter-view reference pictures was also developed, to avoid the overhead of unused reference pictures. Selecting only the most useful pictures and adding them to the reference picture list improved all results, with bitrate savings of up to 3.06% for convergent camera sequences and 2% for parallel camera sequences.
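
As an illustration of projecting one view onto another for a single depth level, the sketch below builds the standard homography induced by a plane, H = K_tgt (R + t n^T / d) K_ref^-1. It is not taken from the dissertation: the function names, the zero-skew intrinsics tuple and the convention X_tgt = R X_ref + t are assumptions made for this example.

```python
def mat3_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def plane_homography(intr_ref, intr_tgt, rot, trans, normal, depth):
    """Homography induced by the plane normal . X = depth (reference frame).

    intr_ref and intr_tgt are (fx, fy, cx, cy) tuples, zero skew assumed.
    rot and trans map reference-camera coordinates to target-camera
    coordinates: X_tgt = rot @ X_ref + trans (an assumed convention).
    """
    fx, fy, cx, cy = intr_ref
    # Closed-form inverse of a zero-skew intrinsic matrix.
    k_ref_inv = [[1.0 / fx, 0.0, -cx / fx],
                 [0.0, 1.0 / fy, -cy / fy],
                 [0.0, 0.0, 1.0]]
    fx, fy, cx, cy = intr_tgt
    k_tgt = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    # R + t n^T / d: rigid motion restricted to points on the plane.
    motion = [[rot[i][j] + trans[i] * normal[j] / depth for j in range(3)]
              for i in range(3)]
    return mat3_mul(k_tgt, mat3_mul(motion, k_ref_inv))


def warp_pixel(h, u, v):
    """Apply homography h to pixel (u, v) and dehomogenise the result."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w
```

For parallel cameras with identical intrinsics, no rotation and a fronto-parallel plane, this homography reduces to a horizontal shift of fx * baseline / depth pixels, which is the familiar stereo disparity. Convergent cameras add a rotation, so the warp is no longer a pure shift.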

    Single View 3D Reconstruction using Deep Learning

    One of the major challenges in the field of Computer Vision has been the reconstruction of a 3D object or scene from a single 2D image. While there are many notable examples, traditional methods for single view reconstruction often fail to generalise because they rely on many brittle hand-crafted engineering solutions, which limits their applicability to real-world problems. Recently, deep learning has taken over the field of Computer Vision, and “learning to reconstruct” has become the dominant technique for addressing the limitations of traditional methods in single view 3D reconstruction. Deep learning allows reconstruction methods to learn generalisable image features and monocular cues that would otherwise be difficult to engineer through ad-hoc hand-crafted approaches. However, it can be difficult to integrate the various 3D shape representations efficiently within the deep learning framework. In particular, 3D volumetric representations can be adapted to work with Convolutional Neural Networks, but they are computationally expensive and memory inefficient when using local convolutional layers. In addition, learning generalisable feature representations for 3D reconstruction requires large amounts of diverse training data. In practice, this is challenging for 3D training data, as it entails a costly and time-consuming manual data collection and annotation process. Researchers have attempted to address these issues with self-supervised learning and generative modelling techniques; however, these approaches often produce suboptimal results compared with models trained on larger datasets. This thesis addresses several key challenges that arise when using deep learning for “learning to reconstruct” 3D shapes from single view images. 
We observe that it is possible to learn a compressed representation for multiple categories of the 3D ShapeNet dataset, improving computational and memory efficiency when working with 3D volumetric representations. To address the challenge of data acquisition, we leverage deep generative models to “hallucinate” hidden or latent novel viewpoints for a given input image. Combining these images with depths estimated by a self-supervised depth estimator and the known camera properties allowed us to reconstruct textured 3D point clouds without any ground truth 3D training data. Furthermore, we show that it is possible to improve upon the previous self-supervised monocular depth estimator by adding self-attention and a discrete volumetric representation, significantly improving accuracy on the KITTI 2015 dataset and enabling uncertainty estimation for depth predictions. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
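
The step of combining an image, an estimated depth map and known camera properties into a textured point cloud can be sketched with the pinhole camera model. This is a minimal illustration under assumed conventions, not the thesis implementation: the function name, the nested-list image layout and the intrinsics (fx, fy, cx, cy) are all hypothetical.

```python
def backproject_depth(depth, colors, fx, fy, cx, cy):
    """Back-project a depth map into a coloured 3D point cloud.

    depth:  rows of depth values along the optical axis (same unit as output).
    colors: rows of (r, g, b) tuples aligned pixel-for-pixel with depth.
    fx, fy, cx, cy: assumed pinhole intrinsics in pixels.
    Returns a list of (X, Y, Z, r, g, b) tuples in the camera frame.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Treat non-positive depth as "no estimate" and skip it.
                continue
            # Invert the pinhole projection u = fx * X / Z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z) + tuple(colors[v][u]))
    return points
```

Each pixel contributes one point whose colour comes straight from the image, which is what makes the resulting cloud textured. Hallucinated novel views would each be back-projected with their own camera pose before being merged.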

    Signal processing for improved MPEG-based communication systems

    Action-oriented Scene Understanding

    For robots to act autonomously, it is crucial that they not only describe their environment accurately but also identify how to interact with their surroundings. While we have witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer. This cumulative dissertation approaches the goal of interpreting visual scenes “in the wild” with respect to the actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception and reasoning. On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par, in terms of semantic quality, with those generated from large quantities of text. The perceptive level concerns action identification. We propose a system that identifies possible interactions of robots and humans with the environment (affordances) at the pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, and yet achieves state-of-the-art performance. At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. 
For this, we gathered a dataset of household images associated with human ratings of the likelihoods of eight different actions. Based on the judgements provided by the human raters, we train convolutional neural networks to generate plausibility scores from unseen images. Furthermore, having previously considered only static scenes in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset for several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data. The presented projects analyze the role of action in scene understanding from various angles and in multiple settings while highlighting the advantages of assuming an action-oriented perspective. We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.

    Multi-task near-field perception for autonomous driving using surround-view fisheye cameras

    The formation of eyes led to the big bang of evolution. The dynamics changed from a primitive organism waiting for food to come into contact with it to an organism that sought food out using visual sensors. The human eye is one of the most sophisticated developments of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. In the context of self-driving cars, near-field visual perception covers the environment within a range of 0 - 10 meters with 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. To address these problems, we concentrate on the following topics: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks, such as geometric and semantic tasks, using convolutional neural networks. 
2) Using Multi-Task Learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance tasks.

    Visual SLAM in dynamic environments

    The problem of visual Simultaneous Localization and Mapping (visual SLAM) consists of localizing a camera in a map that is built online. This technology allows robots to localize themselves in unknown environments and to create a map of the area using only their on-board sensors, that is, without relying on any external infrastructure. Unlike odometry approaches, in which incremental motion is integrated over time, a map allows the sensor to localize itself continuously in the same environment without accumulating drift. Visual SLAM algorithms commonly assume that the observed scene is static. Although the static assumption holds for some applications, it limits their usefulness in crowded real-world scenes for autonomous driving, service robots, and augmented and virtual reality, among others. Detecting and studying dynamic objects is a requirement for accurately estimating the sensor position and building stable maps that are useful for robotic applications operating over the long term. This thesis makes three main contributions: 1. We detect dynamic objects using semantic segmentation from deep learning together with multi-view geometry approaches. This allows us to achieve an accuracy in camera trajectory estimation in highly dynamic scenes comparable to that achieved in static environments, and to build 3D maps that contain only the static, stable structure of the environment. 2. We hallucinate, with realistic images, the static structure of the scene behind dynamic objects. This lets us offer complete maps with a plausible representation of the scene, without the discontinuities or gaps caused by occlusions from dynamic objects. 
Visual place recognition also benefits from these advances in image processing. 3. We develop a joint framework for solving both the SLAM problem and multi-object tracking, in order to obtain a spatio-temporal map with information about the trajectory of the sensor and of its surroundings. Understanding the surrounding dynamic objects is of crucial importance for the new requirements of emerging augmented/virtual reality and autonomous navigation applications. These three contributions advance the state of the art in visual SLAM. As a by-product of our research and for the benefit of the scientific community, we have released the code implementing the proposed solutions.

    The tactile sense as a mechanism for the reduction of visual load elicited by control interactions : an automotive case study approach to the development of generic design recommendations

    This thesis examines the potential for using tactile feedback to reduce the visual load that can be associated with interacting with controls. Using the automotive context as a case study, the thesis describes the process followed in the design of a prototype tactile interface (PTI) for the control of in-car secondary functionality (navigation, entertainment and climate control). There have been many examples of the use of active and passive tactile feedback to provide information to visually impaired people. There is, however, a paucity of previous research into tactile feedback in mainstream product design. A literature review was performed examining various issues associated with tactile design, including cognitive processing of tactile inputs, the use of tactile feedback in products used by visually impaired people, and standard control design recommendations. This was followed by the generation of initial concepts and the first study, which examined how visually impaired people interact with electronic products that are unfamiliar to them, and also how they use their own equipment. The results from this study and the literature review findings were combined into a series of design recommendations for the production of tactile interfaces that aim to reduce the visual load on the driver. These design recommendations were the basis for an iterative design process that resulted in the first, non-functioning PTI model, which was constructed using rapid prototyping technologies. The first-iteration PTI was examined in the second study, a user trial in a driving simulator. The study produced encouraging results, with a >90% success rate for correct control selection without vision whilst performing a driving task. The results from this study were used to refine the design of the PTI, and a working, hi-fidelity prototype was constructed for use in the final study. 
This study involved 'on the road' user trials comparing the glance durations made to the PTI and to a baseline system using a 'repeated measures' structure. The data from these user trials were examined to determine whether the PTI exhibited a reduced visual load compared with the baseline system. The results showed that the PTI fostered significantly reduced summed glance durations for 7 of the 11 tasks performed. Three of the 11 tasks produced a reduction in summed glance duration of >50%. The PTI was also shown to foster non-visual interaction, with all participants performing at least one control interaction without looking at the control arrays. The tactile coding and symbolic layout of the PTI have been shown to be beneficial in reducing 'eyes off road time' and therefore in reducing the risk of distraction-related accidents. A review of the results from the three studies described in this thesis has enabled the development of generic design guidelines for the production of tactile interfaces where a reduction in visual load is required for the safety of the operator. The thesis has contributed to the understanding of the use of the tactile sense during product interactions, and has highlighted the benefits as well as the limitations of the tactile sense as a feedback mechanism.

    Autonomous Close Formation Flight of Small UAVs Using Vision-Based Localization

    As Unmanned Aerial Vehicles (UAVs) are integrated into the national airspace to comply with the 2012 Federal Aviation Administration Reauthorization Act, new civilian uses for robotic aircraft will come about in addition to the more obvious military applications. One particular area of interest for UAV development is the autonomous cooperative control of multiple UAVs. In this thesis, a decentralized leader-follower control strategy is designed, implemented, and tested from the follower’s perspective using vision-based localization. The tasks of localization and control were carried out on separate processing hardware dedicated to each task. First, software was written to estimate the relative state of a lead UAV in real time from video captured by a camera on board the following UAV. The software, written using the OpenCV computer vision libraries and executed on an embedded single-board computer, uses the Efficient Perspective-n-Point algorithm to compute the 3-D pose from a set of 2-D image points. High-intensity red light-emitting diodes (LEDs) were affixed to specific locations on the lead aircraft’s airframe to simplify the task of extracting the 2-D image points from video. Next, the following vehicle was controlled by modifying a commercially available, open source, waypoint-guided autopilot to navigate using the relative state vector provided by the vision software. A custom Hardware-In-Loop (HIL) simulation station was set up and used to derive the required localization update rate for various flight patterns and levels of atmospheric turbulence. HIL simulation showed that it should be possible to maintain formation, with a vehicle separation of 50 ± 6 feet and localization estimates updated at 10 Hz, for a range of flight conditions. Finally, the system was implemented on low-cost remote-controlled aircraft and flight tested to demonstrate formation convergence to 65.5 ± 15 feet of separation.
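
A Perspective-n-Point solver looks for the pose that best explains where known 3-D markers appear in the image. The sketch below shows only the forward model and the reprojection error that such a solver minimises. It is not EPnP and not the thesis software: the function names, marker layout and intrinsics are assumptions for illustration. In OpenCV, EPnP itself is available through `cv2.solvePnP` with the `cv2.SOLVEPNP_EPNP` flag.

```python
import math


def project_points(points_body, rot, trans, fx, fy, cx, cy):
    """Project 3-D marker positions (leader body frame) into follower pixels.

    rot (3x3 nested list) and trans (length 3) are the candidate pose, with
    X_cam = rot @ X_body + trans; fx, fy, cx, cy are assumed pinhole
    intrinsics in pixels, with lens distortion ignored.
    """
    pixels = []
    for p in points_body:
        xc = [sum(rot[i][j] * p[j] for j in range(3)) + trans[i]
              for i in range(3)]
        if xc[2] <= 0:
            raise ValueError("marker lies behind the camera")
        pixels.append((fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy))
    return pixels


def rms_reprojection_error(points_body, observed, rot, trans, fx, fy, cx, cy):
    """Root-mean-square pixel distance between predicted and observed markers."""
    predicted = project_points(points_body, rot, trans, fx, fy, cx, cy)
    squared = [(pu - ou) ** 2 + (pv - ov) ** 2
               for (pu, pv), (ou, ov) in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

A low reprojection error for the extracted LED centroids is a quick sanity check on a pose estimate before it is passed to the autopilot as a relative state.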