39 research outputs found

    Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

    We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which are utilized as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods. Our code, dataset, and models are available at https://github.com/SeokjuLee/Insta-DM. (Accepted to AAAI 2021. arXiv admin note: substantial text overlap with arXiv:1912.0935)
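    The inverse/forward projection distinction the abstract highlights rests on plain pinhole geometry: a pixel is back-projected to a 3-D point using its depth, the point is moved rigidly, and then forward-projected into the other view. A minimal numpy sketch under toy intrinsics (function names and values are illustrative, not the paper's neural forward-projection module):

    ```python
    import numpy as np

    def backproject(u, v, depth, K):
        """Lift a pixel (u, v) with known depth to a 3-D camera-frame point."""
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.array([x, y, depth])

    def project(P, K):
        """Forward-project a 3-D point back onto the image plane."""
        u = K[0, 0] * P[0] / P[2] + K[0, 2]
        v = K[1, 1] * P[1] / P[2] + K[1, 2]
        return np.array([u, v])

    # Toy intrinsics and a pure z-translation as the object's rigid motion.
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    P = backproject(100.0, 80.0, 10.0, K)      # pixel -> 3-D point
    P_moved = P + np.array([0.0, 0.0, -2.0])   # move the point 2 m toward the camera
    uv = project(P_moved, K)                   # 3-D point -> pixel in the other view
    ```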

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Multi-task near-field perception for autonomous driving using surround-view fisheye cameras

    The formation of eyes led to the big bang of evolution: the dynamics changed from a primitive organism waiting for food to come into contact with it, to organisms seeking out food with visual sensors. The human eye is one of the most sophisticated products of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. Near-field visual perception in the context of self-driving cars covers the environment in a range of 0-10 meters and 360° around the vehicle, and it is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications; because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. We concentrate on the following issues to address them: 1) developing near-field perception algorithms with high performance and low computational complexity for various geometric and semantic visual perception tasks using convolutional neural networks; and 2) using multi-task learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance the tasks.
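    The second point, sharing initial convolutional layers between tasks, can be sketched abstractly: a shared encoder runs once per image and its features feed several task-specific heads, so the heads cost far less than full per-task networks. A toy numpy stand-in (a single ReLU layer substitutes for a real CNN; all names and shapes are illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def shared_encoder(x, W_shared):
        """Initial layers, computed once and reused by every task head."""
        return np.maximum(0.0, x @ W_shared)  # one ReLU layer as a CNN stand-in

    def task_head(features, W_task):
        """A lightweight per-task output layer."""
        return features @ W_task

    # Toy weights: one shared layer, two task heads (e.g. depth + segmentation).
    W_shared = rng.standard_normal((64, 32))
    W_depth = rng.standard_normal((32, 1))
    W_seg = rng.standard_normal((32, 10))

    x = rng.standard_normal((4, 64))           # a batch of 4 input vectors
    features = shared_encoder(x, W_shared)     # shared computation, done once
    depth_out = task_head(features, W_depth)   # head 1 reuses the features
    seg_out = task_head(features, W_seg)       # head 2 reuses the same features
    ```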

    Virtual Reality Games for Motor Rehabilitation

    This paper presents a fuzzy logic based method to track user satisfaction without the need for devices to monitor users' physiological conditions. User satisfaction is the key to any product's acceptance; computer applications and video games provide a unique opportunity to provide a tailored environment for each user to better suit their needs. We have implemented a non-adaptive fuzzy logic model of emotion, based on the emotional component of the Fuzzy Logic Adaptive Model of Emotion (FLAME) proposed by El-Nasr, to estimate player emotion in Unreal Tournament 2004. In this paper we describe the implementation of this system and present the results of one of several play tests. Our research contradicts the current literature that suggests physiological measurements are needed. We show that it is possible to use a software-only method to estimate user emotion.
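    A fuzzy-logic emotion estimate of the kind described (after El-Nasr's FLAME) can be sketched with triangular membership functions and a min t-norm for AND; the rule, variable names, and membership shapes below are illustrative, not the paper's actual model:

    ```python
    def tri(x, a, b, c):
        """Triangular membership function rising from a, peaking at b, falling to c."""
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)

    def frustration(expectation, outcome):
        """Toy fuzzy rule: IF expectation is high AND outcome is low
        THEN frustration.  AND is modeled as min, the usual t-norm."""
        high_exp = tri(expectation, 0.4, 1.0, 1.6)  # 'high' centered at 1.0
        low_out = tri(outcome, -0.6, 0.0, 0.6)      # 'low' centered at 0.0
        return min(high_exp, low_out)

    # A player expecting success (0.9) who performs poorly (0.1) scores
    # much higher frustration than one with low expectations doing well.
    score = frustration(expectation=0.9, outcome=0.1)
    ```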

    Mobile Robots Navigation

    Mobile robot navigation includes different interrelated activities: (i) perception, obtaining and interpreting sensory information; (ii) exploration, the strategy that guides the robot in selecting the next direction to go; (iii) mapping, the construction of a spatial representation from the sensory information perceived; (iv) localization, the strategy for estimating the robot's position within the spatial map; (v) path planning, the strategy for finding a path towards a goal location, optimal or not; and (vi) path execution, where motor actions are determined and adapted to environmental changes. The book addresses these activities by integrating results from the research work of several authors all over the world. Research cases are documented in 32 chapters organized into seven categories, described next.
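    As an illustration of activity (v), path planning, a shortest path on a 4-connected occupancy grid can be found with breadth-first search; this is a generic minimal sketch, not drawn from any particular chapter of the book:

    ```python
    from collections import deque

    def grid_path(grid, start, goal):
        """Breadth-first search on a 4-connected occupancy grid.
        grid[r][c] == 1 marks an obstacle; returns a list of (row, col)
        cells from start to goal, or None if the goal is unreachable."""
        rows, cols = len(grid), len(grid[0])
        prev = {start: None}           # also serves as the visited set
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            if (r, c) == goal:         # reconstruct the path backwards
                path, node = [], goal
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols \
                        and grid[nr][nc] == 0 and (nr, nc) not in prev:
                    prev[(nr, nc)] = (r, c)
                    queue.append((nr, nc))
        return None

    # A 3x3 map with a wall forcing a detour around the right side.
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    path = grid_path(grid, (0, 0), (2, 0))
    ```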

    Spatiotemporal Features and Deep Learning Methods for Video Classification

    Classification of human actions from real-world video data is one of the most important topics in computer vision, and it has been an interesting and challenging research topic in recent decades. It is commonly used in many applications such as video retrieval, video surveillance, human-computer interaction, robotics, and health care. Therefore, robust, fast, and accurate action recognition systems are in high demand. Deep learning techniques developed for action recognition in the image domain can be extended to the video domain. Nonetheless, deep learning solutions for two-dimensional image data are not directly applicable to the video domain because of the larger scale and temporal nature of video. Specifically, each frame carries spatial information, while the sequence of frames carries temporal information. Therefore, this study focused on both spatial and temporal features, aiming to improve the accuracy of human action recognition from videos by making use of spatiotemporal information. In this thesis, several deep learning architectures were proposed to model both spatial and temporal components. Firstly, a novel deep neural network was developed for video classification by combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Secondly, an action template-based keyframe extraction method was proposed, and temporal clues between action regions were used to extract more informative keyframes. Thirdly, a novel decision-level fusion rule was proposed to better combine spatial and temporal aspects of videos in two-stream networks. Finally, an extensive investigation was conducted into how to combine feature-level and decision-level fusion to improve video classification performance in multi-stream neural networks. 
    Extensive experiments were conducted using the proposed methods, and the results highlighted that using both spatial and temporal information is required in video classification architectures, and that employing temporal information effectively in multi-stream deep neural networks is crucial for improving video classification accuracy.
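    The decision-level fusion idea, combining the spatial and temporal streams at the level of class scores rather than features, can be sketched as a weighted average of per-stream softmax probabilities (a generic late-fusion baseline, not the thesis's proposed rule):

    ```python
    import numpy as np

    def softmax(z):
        """Numerically stable softmax over a 1-D logit vector."""
        e = np.exp(z - z.max())
        return e / e.sum()

    def fuse_decisions(spatial_logits, temporal_logits, w=0.5):
        """Late (decision-level) fusion: average the two streams' class
        probabilities with weight w, then pick the arg-max class."""
        p = w * softmax(spatial_logits) + (1.0 - w) * softmax(temporal_logits)
        return int(np.argmax(p)), p

    # The spatial stream mildly favors class 0, the temporal stream
    # strongly favors class 1; fusion lets the confident stream win.
    label, p = fuse_decisions(np.array([2.0, 1.0, 0.1]),
                              np.array([0.5, 2.5, 0.1]))
    ```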

    Richer object representations for object class detection in challenging real world images

    Object class detection in real world images has been a synonym for object localization for the longest time. State-of-the-art detection methods, inspired by renowned detection benchmarks, typically target 2D bounding box localization of objects. At the same time, due to rapid technological and scientific advances, high-level vision applications, aiming at understanding the visual world as a whole, are coming into focus. The diversity of the visual world challenges these applications in terms of representational complexity, robust inference, and training data. As objects play a central role in any vision system, it has been argued that richer object representations, providing a higher level of detail than modern detection methods, are a promising direction towards understanding visual scenes. Besides bridging the gap between object class detection and high-level tasks, richer object representations also lead to more natural object descriptions, bringing computer vision closer to human perception. Inspired by these prospects, this thesis explores four different directions towards richer object representations, namely, 3D object representations, fine-grained representations, occlusion representations, as well as understanding convnet representations. Moreover, this thesis illustrates that richer object representations can facilitate high-level applications, providing detailed and natural object descriptions. In addition, the presented representations attain high performance rates, at least on par with and often superior to state-of-the-art methods.

    Multiple Object Tracking in Urban Traffic Scenes

    Multiple object tracking (MOT) is an intensively researched area that has evolved and undergone much innovation throughout the years, owing to its potential in many applications to improve our quality of life. In our research project, specifically, we are interested in applying MOT to urban traffic scenes to portray an accurate representation of road user trajectories for the eventual improvement of road traffic systems that affect people from all walks of life. Our first contribution is the introduction of class label information as part of the features that describe the targets and associate them across frames, capturing their motion into trajectories in a real environment. We capitalize on that information from a deep learning detector, used to extract objects of interest prior to the tracking procedure, since we were intrigued by such detectors' growing popularity and reported good performance. However, despite their promising potential in the literature, we found that the results were disappointing in our experiments. The quality of the extracted input, as postulated, critically affects the quality of the final trajectories obtained as tracking output. Nevertheless, we observed that class label information, along with its confidence score, is invaluable for our application in urban traffic settings, where there is a high degree of variability in the types of road users. Next, we focused our effort on fusing inputs from two different sources in order to obtain a set of input objects with a satisfactory level of accuracy before proceeding to the tracking stage. 
    At this point, we worked on integrating the bounding boxes from a learned multi-class object detector and a background subtraction-based method to resolve issues such as fragmentation and redundant representations of the same object.
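    The first contribution, using class labels alongside geometric cues when associating detections across frames, can be sketched with a greedy IoU matcher in which a class mismatch vetoes the match. This is a simplified stand-in for the thesis's association step; names and thresholds are illustrative:

    ```python
    def iou(a, b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def associate(tracks, detections, min_score=0.3):
        """Greedy frame-to-frame association.  Each track/detection is a
        (box, class_label) pair; a class mismatch vetoes the pairing,
        otherwise the affinity is plain IoU."""
        matches, used = [], set()
        for ti, (tbox, tcls) in enumerate(tracks):
            best, best_s = None, min_score
            for di, (dbox, dcls) in enumerate(detections):
                if di in used or dcls != tcls:
                    continue  # the class label acts as a hard gate
                s = iou(tbox, dbox)
                if s > best_s:
                    best, best_s = di, s
            if best is not None:
                used.add(best)
                matches.append((ti, best))
        return matches

    # Two tracks and two detections with swapped list order; the class
    # gate plus IoU recovers the correct pairing.
    tracks = [([0, 0, 10, 10], "car"), ([20, 20, 30, 30], "person")]
    detections = [([21, 21, 31, 31], "person"), ([1, 1, 11, 11], "car")]
    matches = associate(tracks, detections)
    ```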

    Shortest Route at Dynamic Location with Node Combination-Dijkstra Algorithm

    Abstract— Online transportation has become a basic requirement of the general public, supporting daily activities such as going to work or school, or traveling to tourist sights. Public transportation services compete to provide the best service so that consumers feel comfortable using them; one capability they all need is finding the shortest route when picking up a customer or delivering to a destination. The node combination method can minimize memory usage and is more optimal than A* and Ant Colony for Dijkstra-style shortest-route search, but it cannot store the history of nodes that have been passed; on its own, it is therefore well suited to finding the shortest distance but not the shortest route. This paper modifies the node combination algorithm to solve the problem of finding the shortest route at a dynamic location obtained from the transport fleet, by recording the nodes along the shortest-distance path, and implements it in a geographic information system that displays the route on a map to make the system easy to use. Keywords— Shortest Path, Dijkstra Algorithm, Node Combination, Dynamic Location
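    For reference, the classical Dijkstra algorithm the paper builds on does store the predecessor history needed to recover the route, not just the distance. A standard textbook implementation with a binary heap (not the paper's node-combination variant):

    ```python
    import heapq

    def dijkstra(graph, source, target):
        """Dijkstra's algorithm on an adjacency dict
        {node: [(neighbor, weight), ...]}; returns (distance, path)."""
        dist = {source: 0.0}
        prev = {}                       # predecessor map: recovers the route
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:             # first pop of target is optimal
                path = [u]
                while u in prev:
                    u = prev[u]
                    path.append(u)
                return d, path[::-1]
            if d > dist.get(u, float("inf")):
                continue                # stale heap entry, skip
            for v, w in graph.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))
        return float("inf"), []

    # A small weighted digraph: the cheapest A->D route detours via B and C.
    graph = {"A": [("B", 1), ("C", 4)],
             "B": [("C", 2), ("D", 5)],
             "C": [("D", 1)],
             "D": []}
    d, path = dijkstra(graph, "A", "D")
    ```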