395 research outputs found
Deep learning for object detection in robotic grasping contexts
In the last decade, deep convolutional neural networks became the standard for most computer vision applications. As opposed to classical methods, which are based on rules and hand-designed features, neural networks are optimized directly from training data labeled for a given task. In practice, both obtaining sufficient labeled training data and interpreting network outputs can be problematic. Additionally, a neural network has to be retrained for each new task or set of objects. Overall, while they perform very well, deep neural network approaches can be challenging to deploy. In this thesis, we propose strategies for solving or working around these limitations in the context of object instance detection. First, we propose a cascade approach in which a neural network is used as a prefilter to a template matching method, improving performance while keeping the interpretability of the matching method. Second, we propose another cascade approach in which a weakly-supervised network generates object-specific heatmaps that can be used to infer each object's position in an image. This simplifies the training process and decreases the number of training images required to reach state-of-the-art performance. Finally, we propose a neural network architecture and a training procedure that allow a detector to generalize to objects that were not seen during training, thus removing the need to retrain the network for each new object.
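The weakly-supervised heatmap approach described in this abstract can be illustrated with a toy sketch: once a network produces a per-pixel probability map for an object, the object's position can be read off the map's peak. The function name, threshold, and heatmap values below are illustrative assumptions, not the thesis's implementation.

```python
def infer_position(heatmap, threshold=0.5):
    """Return the (row, col) of the strongest heatmap response,
    or None when no response clears the detection threshold."""
    best_val, best_pos = -1.0, None
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best_val:
                best_val, best_pos = v, (r, c)
    return best_pos if best_val >= threshold else None

# Toy 3x3 probability map with a clear peak at the center.
heatmap = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.3],
    [0.1, 0.2, 0.1],
]
print(infer_position(heatmap))  # (1, 1)
```

A real pipeline would typically smooth the map and apply non-maximum suppression to handle multiple instances; this sketch only shows the core peak-to-position step.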
A^2-Net: Molecular Structure Estimation from Cryo-EM Density Volumes
Constructing molecular structural models from Cryo-Electron Microscopy
(Cryo-EM) density volumes is the critical last step of structure determination
by Cryo-EM technologies. Methods have evolved from manual construction by
structural biologists to automated approaches that perform 6D
translation-rotation searching, which is extremely compute-intensive.
extremely compute-intensive. In this paper, we propose a learning-based method
and formulate this problem as a vision-inspired 3D detection and pose
estimation task. We develop a deep learning framework for amino acid
determination in a 3D Cryo-EM density volume. We also design a sequence-guided
Monte Carlo Tree Search (MCTS) to thread over the candidate amino acids to form
the molecular structure. This framework achieves 91% coverage on our newly
proposed dataset and takes only a few minutes for a typical structure with a
thousand amino acids. Our method is hundreds of times faster and several times
more accurate than existing automated solutions, without any human intervention.
Comment: 8 pages, 5 figures, 4 tables
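The sequence-guided threading step in the A^2-Net abstract above can be pictured with a toy stand-in: given candidate amino-acid detections (type, 3D position, confidence), choose one candidate per residue of the known sequence. The paper uses a Monte Carlo Tree Search for this; the greedy version below, with hypothetical names and toy data, only illustrates the problem setup, not the authors' algorithm.

```python
def thread_sequence(sequence, candidates):
    """sequence: list of amino-acid types, e.g. ['ALA', 'GLY', ...].
    candidates: list of (type, (x, y, z), confidence) detections.
    Returns one chosen candidate index per residue (None if no match),
    greedily favoring confident candidates close to the previous residue."""
    used, chain, prev = set(), [], None
    for aa in sequence:
        best, best_cost = None, float('inf')
        for i, (typ, pos, conf) in enumerate(candidates):
            if i in used or typ != aa:
                continue
            # Cost: Euclidean distance to the previous residue, minus
            # the detection confidence as a bonus.
            dist = 0.0 if prev is None else sum(
                (a - b) ** 2 for a, b in zip(pos, prev)) ** 0.5
            if dist - conf < best_cost:
                best, best_cost = i, dist - conf
        if best is not None:
            used.add(best)
            prev = candidates[best][1]
        chain.append(best)
    return chain

cands = [('GLY', (3, 0, 0), 0.9), ('ALA', (0, 0, 0), 0.8),
         ('ALA', (10, 0, 0), 0.5)]
print(thread_sequence(['ALA', 'GLY'], cands))  # [1, 0]
```

A greedy pass like this gets stuck in local optima on realistic volumes, which is precisely why the paper resorts to a tree search over candidate assignments.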
Manifold-Aware Self-Training for Unsupervised Domain Adaptation on Regressing 6D Object Pose
Domain gap between synthetic and real data in visual regression (e.g. 6D pose
estimation) is bridged in this paper via global feature alignment and local
refinement on the coarse classification of discretized anchor classes in target
space, which imposes a piece-wise target manifold regularization into
domain-invariant representation learning. Specifically, our method incorporates
an explicit self-supervised manifold regularization, revealing consistent
cumulative target dependency across domains, to a self-training scheme (e.g.
the popular Self-Paced Self-Training) to encourage more discriminative
transferable representations of regression tasks. Moreover, learning unified
implicit neural functions to estimate the relative direction and distance of
targets to their nearest class bins serves to refine the target classification
predictions, yielding robustness against the inconsistent feature scaling to
which UDA regressors are sensitive. Experimental results on three public
benchmarks of the challenging 6D pose estimation task verify the
effectiveness of our method, which consistently achieves performance superior
to the state of the art for UDA on 6D pose estimation.
Comment: Accepted by IJCAI 202
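The self-training scheme this paper builds on (e.g. Self-Paced Self-Training) admits target-domain pseudo-labels in rounds of increasing difficulty. A minimal sketch of that selection step is below; the function name, threshold schedule, and values are illustrative assumptions, not the paper's implementation.

```python
def select_pseudo_labels(predictions, round_idx, base=0.9, step=0.05):
    """predictions: list of (sample_id, predicted_class, confidence).
    The confidence threshold starts at `base` and is relaxed by `step`
    each round (floored at 0.5), so easy, confident target samples enter
    training first and harder ones are admitted progressively."""
    threshold = max(base - step * round_idx, 0.5)
    return [(sid, cls) for sid, cls, conf in predictions if conf >= threshold]

preds = [('a', 0, 0.95), ('b', 1, 0.88), ('c', 0, 0.70)]
print(select_pseudo_labels(preds, round_idx=0))  # [('a', 0)]
print(select_pseudo_labels(preds, round_idx=1))  # [('a', 0), ('b', 1)]
```

The paper's contribution sits on top of a scheme like this, adding manifold regularization over the discretized anchor classes so that the selected pseudo-labels respect the target-space geometry.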
PourIt!: Weakly-supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring
Liquid perception is critical for robotic pouring tasks. It usually requires
the robust visual detection of flowing liquid. However, while recent works have
shown promising results in liquid perception, they typically require labeled
data for model training, a process that is both time-consuming and reliant on
human labor. To address this, this paper proposes a simple yet effective
framework, PourIt!, to serve as a tool for robotic pouring tasks. We design a simple data
collection pipeline that only needs image-level labels to reduce the reliance
on tedious pixel-wise annotations. Then, a binary classification model is
trained to generate a Class Activation Map (CAM) that focuses on the visual
difference between these two kinds of collected data, i.e., the presence or
absence of liquid. We also devise a feature contrast strategy to improve the
quality of the CAM so that it entirely and tightly covers the actual liquid
regions. Then, the container pose is further utilized to facilitate the 3D
point cloud recovery of the detected liquid region. Finally, the
liquid-to-container distance is calculated for visual closed-loop control of
the physical robot. To validate the effectiveness of our proposed method, we
also contribute a novel dataset for our task, named the PourIt! dataset.
Extensive experiments on this dataset and on a physical Franka robot demonstrate
the utility and effectiveness of our method for robotic pouring tasks. Our
dataset, code and pre-trained models will be available on the project page.
Comment: ICCV202
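The Class Activation Map the PourIt! abstract relies on follows the standard CAM formulation: a weighted sum of the final convolutional feature maps, weighted by the target class's classifier weights. The toy shapes and values below are illustrative, not the paper's code.

```python
def class_activation_map(feature_maps, class_weights):
    """feature_maps: list of K channels, each an HxW list of lists, taken
    from the last convolutional layer. class_weights: the K classifier
    weights for the target class (here, 'liquid present'). Returns the
    HxW CAM as the per-pixel weighted sum over channels."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(feature_maps, class_weights):
        for r in range(h):
            for c in range(w):
                cam[r][c] += wk * fmap[r][c]
    return cam

# Two toy 2x2 feature maps, each activating on a different pixel.
fmaps = [[[1.0, 0.0], [0.0, 0.0]],
         [[0.0, 0.0], [0.0, 1.0]]]
print(class_activation_map(fmaps, [0.5, 2.0]))  # [[0.5, 0.0], [0.0, 2.0]]
```

In practice the CAM is upsampled to the input resolution and thresholded to obtain the liquid region; the paper's feature contrast strategy is aimed at making that region cover the liquid tightly.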
Towards Object-Centric Scene Understanding
Visual perception for autonomous agents continues to attract community attention, owing to the disruptive nature and wide applicability of such solutions. Autonomous Driving (AD), a major application in this domain, promises to revolutionize our approach to mobility while bringing critical advantages in reducing accident fatalities.
Fueled by recent advances in Deep Learning (DL), more computer vision tasks are being addressed using a learning paradigm. Deep Neural Networks (DNNs) have consistently succeeded in pushing performance to unprecedented levels and in demonstrating the ability of such approaches to generalize to an increasing number of difficult problems, such as 3D vision tasks.
In this thesis, we address two main challenges arising from the current approaches: the computational complexity of multi-task pipelines, and the increasing need for manual annotations. On the one hand, AD systems need to perceive the surrounding environment at different levels of detail and, subsequently, take timely actions. This multitasking further limits the time available for each perception task. On the other hand, the need for universal generalization of such systems to massively diverse situations requires the use of large-scale datasets covering long-tailed cases. This requirement renders traditional supervised approaches, despite the data readily available in the AD domain, unsustainable in terms of annotation costs, especially for 3D tasks.
Driven by the nature of the AD environment, whose complexity (unlike that of indoor scenes) is dominated by the presence of other scene elements (mainly cars and pedestrians), we focus on the above-mentioned challenges in object-centric tasks. We then situate our contributions appropriately in the fast-paced literature, supporting our claims with extensive experimental analysis leveraging up-to-date state-of-the-art results and community-adopted benchmarks.