1,764 research outputs found
Multi-View Picking: Next-best-view Reaching for Improved Grasping in Clutter
Camera viewpoint selection is an important aspect of visual grasp detection,
especially in clutter where many occlusions are present. Where other approaches
use a static camera position or fixed data collection routines, our Multi-View
Picking (MVP) controller uses an active perception approach to choose
informative viewpoints based directly on a distribution of grasp pose estimates
in real time, reducing uncertainty in the grasp poses caused by clutter and
occlusions. In trials of grasping 20 objects from clutter, our MVP controller
achieves 80% grasp success, outperforming a single-viewpoint grasp detector by
12%. We also show that our approach is both more accurate and more efficient
than approaches which consider multiple fixed viewpoints.Comment: ICRA 2019 Video: https://youtu.be/Vn3vSPKlaEk Code:
https://github.com/dougsm/mvp_gras
Facial Expression Analysis under Partial Occlusion: A Survey
Automatic machine-based Facial Expression Analysis (FEA) has made substantial
progress in the past few decades driven by its importance for applications in
psychology, security, health, entertainment and human computer interaction. The
vast majority of completed FEA studies are based on non-occluded faces
collected in a controlled laboratory environment. Automatic expression
recognition tolerant to partial occlusion remains less understood, particularly
in real-world scenarios. In recent years, efforts investigating techniques to
handle partial occlusion for FEA have seen an increase. The context is right
for a comprehensive perspective of these developments and the state of the art
from this perspective. This survey provides such a comprehensive review of
recent advances in dataset creation, algorithm development, and investigations
of the effects of occlusion critical for robust performance in FEA systems. It
outlines existing challenges in overcoming partial occlusion and discusses
possible opportunities in advancing the technology. To the best of our
knowledge, it is the first FEA survey dedicated to occlusion and aimed at
promoting better informed and benchmarked future work.Comment: Authors pre-print of the article accepted for publication in ACM
Computing Surveys (accepted on 02-Nov-2017
EARL: Eye-on-Hand Reinforcement Learner for Dynamic Grasping with Active Pose Estimation
In this paper, we explore the dynamic grasping of moving objects through
active pose tracking and reinforcement learning for hand-eye coordination
systems. Most existing vision-based robotic grasping methods implicitly assume
target objects are stationary or moving predictably. Performing grasping of
unpredictably moving objects presents a unique set of challenges. For example,
a pre-computed robust grasp can become unreachable or unstable as the target
object moves, and motion planning must also be adaptive. In this work, we
present a new approach, Eye-on-hAnd Reinforcement Learner (EARL), for enabling
coupled Eye-on-Hand (EoH) robotic manipulation systems to perform real-time
active pose tracking and dynamic grasping of novel objects without explicit
motion prediction. EARL readily addresses many thorny issues in automated
hand-eye coordination, including fast-tracking of 6D object pose from vision,
learning control policy for a robotic arm to track a moving object while
keeping the object in the camera's field of view, and performing dynamic
grasping. We demonstrate the effectiveness of our approach in extensive
experiments validated on multiple commercial robotic arms in both simulations
and complex real-world tasks.Comment: Presented on IROS 2023 Corresponding author Siddarth Jai
Robust fulfillment of constraints in robot visual servoing
[EN] In this work, an approach based on sliding mode ideas is proposed to satisfy constraints in robot visual servoing.
In particular, different types of constraints are defined in order to: fulfill the visibility constraints (camera fieldof-view
and occlusions) for the image features of the detected object; to avoid exceeding the joint range limits
and maximum joint speeds; and to avoid forbidden areas in the robot workspace. Moreover, another task with
low-priority is considered to track the target object. The main advantages of the proposed approach are low
computational cost, robustness and fully utilization of the allowed space for the constraints. The applicability
and effectiveness of the proposed approach is demonstrated by simulation results for a simple 2D case and a
complex 3D case study. Furthermore, the feasibility and robustness of the proposed approach is substantiated by
experimental results using a conventional 6R industrial manipulator.This work was supported in part by the Spanish Government under grants BES-2010-038486 and Project DPI2013-42302-R, and the Generalitat Valenciana under grants VALi+d APOSTD/2016/044 and BEST/2017/029.Muñoz-Benavent, P.; Gracia Calandin, LI.; Solanes Galbis, JE.; Esparza Peidro, A.; Tornero Montserrat, J. (2018). Robust fulfillment of constraints in robot visual servoing. Control Engineering Practice. 71(1):79-95. https://doi.org/10.1016/j.conengprac.2017.10.017S799571
Saliency-based approaches for multidimensional explainability of deep networks
In deep learning, visualization techniques extract the salient patterns exploited by deep networks to perform a task (e.g. image classification) focusing on single images. These methods allow a better understanding of these complex models, empowering the identification of the most informative parts of the input data. Beyond the deep network understanding, visual saliency is useful for many quantitative reasons and applications, both in the 2D and 3D domains, such as the analysis of the generalization capabilities of a classifier and autonomous navigation. In this thesis, we describe an approach to cope with the interpretability problem of a convolutional neural network and propose our ideas on how to exploit the visualization for applications like image classification and active object recognition. After a brief overview on common visualization methods producing attention/saliency maps, we will address two separate points: firstly, we will describe how visual saliency can be effectively used in the 2D domain (e.g. RGB images) to boost image classification performances: as a matter of fact, visual summaries, i.e. a compact representation of an ensemble of saliency maps, can be used to improve the classification accuracy of a network through summary-driven specializations. Then, we will present a 3D active recognition system that allows to consider different views of a target object, overcoming the single-view hypothesis of classical object recognition, making the classification problem much easier in principle. Here we adopt such attention maps in a quantitative fashion, by building a 3D dense saliency volume which fuses together saliency maps obtained from different viewpoints, obtaining a continuous proxy on which parts of an object are more discriminative for a given classifier. Finally, we will show how to inject this representations in a real world application, so that an agent (e.g. robot) can move knowing the capabilities of its classifier
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
While the processing of body movements play a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition, and video surveillance.
In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided.
The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements.
All the modules were tested by using challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods
Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos
Despite rapid advances in face recognition, there remains a clear gap between
the performance of still image-based face recognition and video-based face
recognition, due to the vast difference in visual quality between the domains
and the difficulty of curating diverse large-scale video datasets. This paper
addresses both of those challenges, through an image to video feature-level
domain adaptation approach, to learn discriminative video frame
representations. The framework utilizes large-scale unlabeled video data to
reduce the gap between different domains while transferring discriminative
knowledge from large-scale labeled still images. Given a face recognition
network that is pretrained in the image domain, the adaptation is achieved by
(i) distilling knowledge from the network to a video adaptation network through
feature matching, (ii) performing feature restoration through synthetic data
augmentation and (iii) learning a domain-invariant feature through a domain
adversarial discriminator. We further improve performance through a
discriminator-guided feature fusion that boosts high-quality frames while
eliminating those degraded by video domain-specific factors. Experiments on the
YouTube Faces and IJB-A datasets demonstrate that each module contributes to
our feature-level domain adaptation framework and substantially improves video
face recognition performance to achieve state-of-the-art accuracy. We
demonstrate qualitatively that the network learns to suppress diverse artifacts
in videos such as pose, illumination or occlusion without being explicitly
trained for them.Comment: accepted for publication at International Conference on Computer
Vision (ICCV) 201
- …