28 research outputs found

    AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving

    Full text link
    Unlike humans, who can effortlessly estimate the entirety of objects even when partially occluded, modern computer vision algorithms still find this aspect extremely challenging. Leveraging this amodal perception for autonomous driving remains largely untapped due to the lack of suitable datasets. The curation of these datasets is primarily hindered by significant annotation costs and mitigating annotator subjectivity in accurately labeling occluded regions. To address these limitations, we introduce AmodalSynthDrive, a synthetic multi-task multi-modal amodal perception dataset. The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions. AmodalSynthDrive supports multiple amodal scene understanding tasks including the introduced amodal depth estimation for enhanced spatial understanding. We evaluate several baselines for each of these tasks to illustrate the challenges and set up public benchmarking servers. The dataset is available at http://amodalsynthdrive.cs.uni-freiburg.de
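
    As a rough illustration of how one multi-modal sample of this kind might be consumed, the sketch below assumes a hypothetical directory layout, file naming, and field set; the actual AmodalSynthDrive format is documented on the dataset website and may differ.

```python
# Minimal sketch of loading one AmodalSynthDrive-style sample.
# The directory layout, file names, and keys below are assumptions made for
# illustration only; consult the dataset documentation for the real format.
from pathlib import Path

import numpy as np
from PIL import Image


def load_sample(root: str, sequence: str, frame: str) -> dict:
    """Load a camera image, LiDAR points, and amodal masks for one frame (hypothetical layout)."""
    seq_dir = Path(root) / sequence
    rgb = np.array(Image.open(seq_dir / "camera_front" / f"{frame}.png"))  # H x W x 3
    lidar = np.fromfile(seq_dir / "lidar" / f"{frame}.bin",
                        dtype=np.float32).reshape(-1, 4)                   # x, y, z, intensity
    amodal = np.load(seq_dir / "amodal_instance" / f"{frame}.npz")         # per-instance amodal masks
    return {"rgb": rgb, "lidar": lidar, "amodal_masks": dict(amodal)}
```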

    Amodal Optical Flow

    Full text link
    Optical flow estimation is very challenging in situations with transparent or occluded objects. In this work, we address these challenges at the task level by introducing Amodal Optical Flow, which integrates optical flow with amodal perception. Instead of only representing the visible regions, we define amodal optical flow as a multi-layered pixel-level motion field that encompasses both visible and occluded regions of the scene. To facilitate research on this new task, we extend the AmodalSynthDrive dataset to include pixel-level labels for amodal optical flow estimation. We present several strong baselines, along with the Amodal Flow Quality metric to quantify the performance in an interpretable manner. Furthermore, we propose the novel AmodalFlowNet as an initial step toward addressing this task. AmodalFlowNet consists of a transformer-based cost-volume encoder paired with a recurrent transformer decoder which facilitates recurrent hierarchical feature propagation and amodal semantic grounding. We demonstrate the tractability of amodal optical flow in extensive experiments and show its utility for downstream tasks such as panoptic tracking. We make the dataset, code, and trained models publicly available at http://amodal-flow.cs.uni-freiburg.de
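
    To make the notion of a multi-layered pixel-level motion field concrete, the snippet below sketches one possible in-memory representation: a front-to-back stack of dense per-layer flows plus visibility masks, from which an ordinary visible-only flow map can be composited. The class and function names are illustrative and are not the benchmark's format.

```python
# Illustrative container for a multi-layered (amodal) optical flow field:
# each layer stores the dense motion of a full object or background layer
# together with a mask of the pixels where that layer is actually visible.
from dataclasses import dataclass

import numpy as np


@dataclass
class AmodalFlowLayer:
    flow: np.ndarray          # (H, W, 2) motion of the full (amodal) layer
    visible_mask: np.ndarray  # (H, W) bool, True where the layer is unoccluded


def composite_visible_flow(layers: list[AmodalFlowLayer]) -> np.ndarray:
    """Collapse a front-to-back layer stack into an ordinary (modal) flow map."""
    h, w, _ = layers[0].flow.shape
    out = np.zeros((h, w, 2), dtype=np.float32)
    filled = np.zeros((h, w), dtype=bool)
    for layer in layers:                      # front layers win at shared pixels
        take = layer.visible_mask & ~filled
        out[take] = layer.flow[take]
        filled |= take
    return out
```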

    Segmentation d'images omnidirectionnelles de scènes routières

    No full text
    A full and comprehensive understanding of the surrounding scene is paramount for mobile robots in general and for intelligent transportation systems (ITS) and autonomous driving (AD) vehicles in particular. Several image-based solutions attempt to reach this goal through computer vision tasks. These solutions are mainly based on trainable, data-driven models, so the availability of annotated data is essential for training them to achieve the desired task. In this thesis, we propose synthetic image datasets with ground truth for several computer vision tasks, with a focus on omnidirectional images, since they are widely used in the field of ITS thanks to their large field of view and their ability to reveal information from broader surroundings. The proposed datasets are then used to take stock of the progress reported in the literature on semantic segmentation of omnidirectional images through a thorough comparative study. Several experiments were conducted using different image representations and modalities to find the best solution for semantically segmenting omnidirectional images. The results of the comparative study, together with the proposed datasets, were then used to develop an application focusing on the added value of deformable convolutions for segmenting omnidirectional images.
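
    The thesis closes with an application built around deformable convolutions for omnidirectional segmentation. As a minimal, hedged illustration of that building block only (not the thesis's actual network), a standard convolution can be replaced by torchvision's DeformConv2d, with a side branch predicting per-position sampling offsets:

```python
# Minimal deformable-convolution block using torchvision.ops.DeformConv2d.
# The surrounding block structure and tensor sizes are hypothetical; the point
# is only how the learned offsets feed the deformable operator.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # A plain conv predicts 2 offsets (dx, dy) per kernel tap per position.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)        # learned, spatially varying sampling grid
        return self.deform_conv(x, offsets)  # samples the input at the offset taps


feats = torch.randn(1, 64, 128, 256)         # e.g. features of a fisheye image
out = DeformableBlock(64, 128)(feats)        # -> (1, 128, 128, 256)
```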

    The OmniScape Dataset

    No full text
    Despite the utility and benefits of omnidirectional images in robotics and automotive applications, there are no datasets of omnidirectional images available with semantic segmentation, depth maps, and dynamic properties. This is due to the time cost and human effort required to annotate ground truth images. This paper presents a framework for generating omnidirectional images using images acquired from a virtual environment. For this purpose, we demonstrate the relevance of the proposed framework on two well-known simulators: the CARLA Simulator, an open-source simulator for autonomous driving research, and Grand Theft Auto V (GTA V), a very high quality video game. We describe in detail the generated OmniScape dataset, which includes stereo fisheye and catadioptric images acquired from the two front sides of a motorcycle, along with semantic segmentation, depth maps, the intrinsic parameters of the cameras, and the dynamic parameters of the motorcycle. It is worth noting that the case of two-wheeled vehicles is more challenging than that of cars due to the specific dynamics of these vehicles.

    Génération d'images omnidirectionnelles à partir d'un environnement virtuel

    No full text
    This paper describes a method for generating omnidirectional images using cubemap images and corresponding depth maps acquired from a virtual environment. For this purpose, we use the video game Grand Theft Auto V (GTA V). GTA V has been used as a data source in many research projects because it is a hyperrealistic open-world game that simulates a real city. We take advantage of developments made in reverse engineering this game in order to extract realistic images and corresponding depth maps using virtual cameras with six degrees of freedom (6DoF). By combining the extracted information with an omnidirectional camera model, we generate fisheye images intended, for instance, for machine-learning-based applications.
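
    The core of the method is a reprojection from cubemap faces to a fisheye image through an omnidirectional camera model. The sketch below assumes an equidistant fisheye model, a particular face naming and axis convention, and nearest-neighbour sampling, purely for illustration; the paper's exact camera model and interpolation may differ.

```python
# Sketch: render an equidistant-fisheye image from a cubemap with
# nearest-neighbour sampling. Conventions here are illustrative assumptions.
import numpy as np


def cubemap_to_fisheye(faces: dict, size: int, fov_deg: float = 180.0) -> np.ndarray:
    """Render a square fisheye image of side `size` from cubemap faces.

    faces maps 'front', 'back', 'left', 'right', 'up', 'down' to (n, n, 3) arrays.
    """
    n = faces["front"].shape[0]
    out = np.zeros((size, size, 3), dtype=faces["front"].dtype)
    cx = cy = (size - 1) / 2.0
    max_theta = np.radians(fov_deg) / 2.0
    vs, us = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    dx, dy = us - cx, vs - cy
    theta = np.sqrt(dx**2 + dy**2) / (size / 2.0) * max_theta   # equidistant model
    phi = np.arctan2(dy, dx)
    valid = theta <= max_theta
    # Ray direction per pixel: z forward, x right, y down (assumed convention).
    x, y, z = np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)
    ax, ay, az = np.abs(x), np.abs(y), np.abs(z)
    # Pick the dominant axis, project the ray onto that cube face, sample nearest pixel.
    for name, mask, a, b, c in [
        ("front", (az >= ax) & (az >= ay) & (z > 0),  x,  y,  z),
        ("back",  (az >= ax) & (az >= ay) & (z < 0), -x,  y, -z),
        ("right", (ax >  az) & (ax >= ay) & (x > 0), -z,  y,  x),
        ("left",  (ax >  az) & (ax >= ay) & (x < 0),  z,  y, -x),
        ("down",  (ay >  az) & (ay >  ax) & (y > 0),  x, -z,  y),
        ("up",    (ay >  az) & (ay >  ax) & (y < 0),  x,  z, -y),
    ]:
        m = mask & valid
        u_f = np.round(((a[m] / c[m]) * 0.5 + 0.5) * (n - 1)).astype(int)
        v_f = np.round(((b[m] / c[m]) * 0.5 + 0.5) * (n - 1)).astype(int)
        out[m] = faces[name][v_f, u_f]
    return out
```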

    A comparative study of semantic segmentation using omnidirectional images

    No full text
    The semantic segmentation of omnidirectional urban driving images is a research topic that has increasingly attracted the attention of researchers. This paper presents a thorough comparative study of different neural network models trained on four different representations: perspective, equirectangular, spherical, and fisheye. In this study we use real perspective images; synthetic perspective, fisheye, and equirectangular images; and a test set of real fisheye images. We evaluate the performance of convolution on spherical images and on perspective images. The conclusions obtained by analyzing the results of this study are multiple and help in understanding how different networks learn to deal with omnidirectional distortions. Our main finding is that models trained on omnidirectional images are robust against modality changes and are able to learn a universal representation, giving good results on both perspective and omnidirectional images. The relevance of all results is examined through an analysis of quantitative measures.
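
    The quantitative side of such a comparison usually reduces to per-class intersection over union aggregated into a mean IoU; the helper below is a generic sketch of that standard measure (the confusion-matrix bookkeeping is one common way to compute it, not code from the study).

```python
# Generic mean-IoU computation from predicted and ground-truth label maps.
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore: int = 255) -> float:
    """pred and gt are integer label maps of identical shape."""
    keep = gt != ignore
    conf = np.bincount(num_classes * gt[keep] + pred[keep],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - inter
    ious = inter[union > 0] / union[union > 0]
    return float(ious.mean())
```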

    A comparative study of semantic segmentation of omnidirectional images from a motorcycle perspective

    No full text
    The semantic segmentation of omnidirectional urban driving images is a research topic that has increasingly attracted the attention of researchers, because the use of such images in driving scenes is highly relevant. However, the case of motorized two-wheelers has not yet been treated. Since the dynamics of these vehicles are very different from those of cars, we focus our study on images acquired from a motorcycle. This paper provides a thorough comparative study showing how different deep learning approaches handle omnidirectional images with different representations, including perspective, equirectangular, spherical, and fisheye, and presents the best solution for segmenting omnidirectional road scene images. In this study we use real perspective images; synthetic perspective, fisheye, and equirectangular images; simulated fisheye images; and a test set of real fisheye images. By analyzing both qualitative and quantitative results, the conclusions of this study are multiple and help in understanding how the networks learn to deal with omnidirectional distortions. Our main findings are that models with planar convolutions give better results than those with spherical convolutions, and that models trained on omnidirectional representations transfer better to standard perspective images than vice versa.

    TPT-Dance&Actions: a multimodal corpus of human activities

    No full text
    We present a new multimodal database of human activities, TPT - Dance & Actions, for research in multimodal scene analysis and understanding. The corpus focuses on dance scenes (lindy hop, salsa, and classical dance) and fitness routines performed to music, and also includes recordings of more "classical" isolated actions. It gathers 14 dance choreographies and 13 sequences of other multimodal activities, performed by 20 dancers and 16 participants, respectively. These multimodal scenes were captured with a variety of sensors: video cameras, depth sensors, microphones, piezoelectric transducers, and wearable inertial devices (accelerometers, gyroscopes, and magnetometers). The data will be freely available online for research purposes. Keywords: dance, lindy hop, salsa, classical dance, fitness, isolated actions, multimodal data, audio, video, depth maps, inertial data, synchronization, multimodal activity analysis, gesture and action recognition
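
    Since the corpus mixes sensors with very different rates (cameras, depth sensors, microphones, inertial units), a typical first step when working with such data is timestamp-based alignment. The helper below is a generic, assumed sketch of attaching the nearest inertial sample to each video frame; it is not tooling shipped with the corpus.

```python
# Generic nearest-timestamp alignment of a high-rate sensor stream (e.g. an
# accelerometer) to lower-rate video frames. Illustrative only.
import numpy as np


def align_to_frames(frame_ts: np.ndarray, imu_ts: np.ndarray, imu_data: np.ndarray) -> np.ndarray:
    """Return, for each frame timestamp, the IMU sample closest in time.

    frame_ts: (F,) frame timestamps; imu_ts: (N,) sorted IMU timestamps;
    imu_data: (N, D) IMU samples. Returns an (F, D) array.
    """
    idx = np.clip(np.searchsorted(imu_ts, frame_ts), 1, len(imu_ts) - 1)
    left_closer = (frame_ts - imu_ts[idx - 1]) < (imu_ts[idx] - frame_ts)
    idx = np.where(left_closer, idx - 1, idx)
    return imu_data[idx]
```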