369 research outputs found

    Towards Object-Centric Scene Understanding

    Get PDF
    Visual perception for autonomous agents continues to attract community attention due to the disruptive technologies and the wide applicability of such solutions. Autonomous Driving (AD), a major application in this domain, promises to revolutionize our approach to mobility while bringing critical advantages in limiting accident fatalities. Fueled by recent advances in Deep Learning (DL), more computer vision tasks are being addressed using a learning paradigm. Deep Neural Networks (DNNs) succeeded consistently in pushing performances to unprecedented levels and demonstrating the ability of such approaches to generalize to an increasing number of difficult problems, such as 3D vision tasks. In this thesis, we address two main challenges arising from the current approaches. Namely, the computational complexity of multi-task pipelines, and the increasing need for manual annotations. On the one hand, AD systems need to perceive the surrounding environment on different levels of detail and, subsequently, take timely actions. This multitasking further limits the time available for each perception task. On the other hand, the need for universal generalization of such systems to massively diverse situations requires the use of large-scale datasets covering long-tailed cases. Such requirement renders the use of traditional supervised approaches, despite the data readily available in the AD domain, unsustainable in terms of annotation costs, especially for 3D tasks. Driven by the AD environment nature and the complexity dominated (unlike indoor scenes) by the presence of other scene elements (mainly cars and pedestrians) we focus on the above-mentioned challenges in object-centric tasks. We, then, situate our contributions appropriately in fast-paced literature, while supporting our claims with extensive experimental analysis leveraging up-to-date state-of-the-art results and community-adopted benchmarks

    Pedestrian Attribute Recognition: A Survey

    Full text link
    Recognizing pedestrian attributes is an important task in computer vision community due to it plays an important role in video surveillance. Many algorithms has been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or based on deep learning networks. Firstly, we introduce the background of pedestrian attributes recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criterion. Thirdly, we analyse the concept of multi-task learning and multi-label learning, and also explain the relations between these two learning algorithms and pedestrian attribute recognition. We also review some popular network architectures which have widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attributes group, part-based, \emph{etc}. Fifthly, we shown some applications which takes pedestrian attributes into consideration and achieve better performance. Finally, we summarized this paper and give several possible research directions for pedestrian attributes recognition. The project page of this paper can be found from the following website: \url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey: https://sites.google.com/view/ahu-pedestrianattributes

    Learning Equivariant Representations

    Get PDF
    State-of-the-art deep learning systems often require large amounts of data and computation. For this reason, leveraging known or unknown structure of the data is paramount. Convolutional neural networks (CNNs) are successful examples of this principle, their defining characteristic being the shift-equivariance. By sliding a filter over the input, when the input shifts, the response shifts by the same amount, exploiting the structure of natural images where semantic content is independent of absolute pixel positions. This property is essential to the success of CNNs in audio, image and video recognition tasks. In this thesis, we extend equivariance to other kinds of transformations, such as rotation and scaling. We propose equivariant models for different transformations defined by groups of symmetries. The main contributions are (i) polar transformer networks, achieving equivariance to the group of similarities on the plane, (ii) equivariant multi-view networks, achieving equivariance to the group of symmetries of the icosahedron, (iii) spherical CNNs, achieving equivariance to the continuous 3D rotation group, (iv) cross-domain image embeddings, achieving equivariance to 3D rotations for 2D inputs, and (v) spin-weighted spherical CNNs, generalizing the spherical CNNs and achieving equivariance to 3D rotations for spherical vector fields. Applications include image classification, 3D shape classification and retrieval, panoramic image classification and segmentation, shape alignment and pose estimation. What these models have in common is that they leverage symmetries in the data to reduce sample and model complexity and improve generalization performance. The advantages are more significant on (but not limited to) challenging tasks where data is limited or input perturbations such as arbitrary rotations are present

    LiDAR-Based Place Recognition For Autonomous Driving: A Survey

    Full text link
    LiDAR-based place recognition (LPR) plays a pivotal role in autonomous driving, which assists Simultaneous Localization and Mapping (SLAM) systems in reducing accumulated errors and achieving reliable localization. However, existing reviews predominantly concentrate on visual place recognition (VPR) methods. Despite the recent remarkable progress in LPR, to the best of our knowledge, there is no dedicated systematic review in this area. This paper bridges the gap by providing a comprehensive review of place recognition methods employing LiDAR sensors, thus facilitating and encouraging further research. We commence by delving into the problem formulation of place recognition, exploring existing challenges, and describing relations to previous surveys. Subsequently, we conduct an in-depth review of related research, which offers detailed classifications, strengths and weaknesses, and architectures. Finally, we summarize existing datasets, commonly used evaluation metrics, and comprehensive evaluation results from various methods on public datasets. This paper can serve as a valuable tutorial for newcomers entering the field of place recognition and for researchers interested in long-term robot localization. We pledge to maintain an up-to-date project on our website https://github.com/ShiPC-AI/LPR-Survey.Comment: 26 pages,13 figures, 5 table

    Vision for Social Robots: Human Perception and Pose Estimation

    Get PDF
    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention. First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error. Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images. Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.</p

    Deep learning for object detection in robotic grasping contexts

    Get PDF
    Dans la dernière décennie, les approches basées sur les réseaux de neurones convolutionnels sont devenus les standards pour la plupart des tâches en vision numérique. Alors qu'une grande partie des méthodes classiques de vision étaient basées sur des règles et algorithmes, les réseaux de neurones sont optimisés directement à partir de données d'entraînement qui sont étiquetées pour la tâche voulue. En pratique, il peut être difficile d'obtenir une quantité su sante de données d'entraînement ou d'interpréter les prédictions faites par les réseaux. Également, le processus d'entraînement doit être recommencé pour chaque nouvelle tâche ou ensemble d'objets. Au final, bien que très performantes, les solutions basées sur des réseaux de neurones peuvent être difficiles à mettre en place. Dans cette thèse, nous proposons des stratégies visant à contourner ou solutionner en partie ces limitations en contexte de détection d'instances d'objets. Premièrement, nous proposons d'utiliser une approche en cascade consistant à utiliser un réseau de neurone comme pré-filtrage d'une méthode standard de "template matching". Cette façon de faire nous permet d'améliorer les performances de la méthode de "template matching" tout en gardant son interprétabilité. Deuxièmement, nous proposons une autre approche en cascade. Dans ce cas, nous proposons d'utiliser un réseau faiblement supervisé pour générer des images de probabilité afin d'inférer la position de chaque objet. Cela permet de simplifier le processus d'entraînement et diminuer le nombre d'images d'entraînement nécessaires pour obtenir de bonnes performances. Finalement, nous proposons une architecture de réseau de neurones ainsi qu'une procédure d'entraînement permettant de généraliser un détecteur d'objets à des objets qui ne sont pas vus par le réseau lors de l'entraînement. Notre approche supprime donc la nécessité de réentraîner le réseau de neurones pour chaque nouvel objet.In the last decade, deep convolutional neural networks became a standard for computer vision applications. As opposed to classical methods which are based on rules and hand-designed features, neural networks are optimized and learned directly from a set of labeled training data specific for a given task. In practice, both obtaining sufficient labeled training data and interpreting network outputs can be problematic. Additionnally, a neural network has to be retrained for new tasks or new sets of objects. Overall, while they perform really well, deployment of deep neural network approaches can be challenging. In this thesis, we propose strategies aiming at solving or getting around these limitations for object detection. First, we propose a cascade approach in which a neural network is used as a prefilter to a template matching approach, allowing an increased performance while keeping the interpretability of the matching method. Secondly, we propose another cascade approach in which a weakly-supervised network generates object-specific heatmaps that can be used to infer their position in an image. This approach simplifies the training process and decreases the number of required training images to get state-of-the-art performances. Finally, we propose a neural network architecture and a training procedure allowing detection of objects that were not seen during training, thus removing the need to retrain networks for new objects
    • …
    corecore