
    Multi-view Human Parsing for Human-Robot Collaboration

    In human-robot collaboration, perception plays a major role in enabling the robot to understand the surrounding environment and the position of humans inside the working area, which represents a key element for an effective and safe collaboration. Human pose estimators based on skeletal models are among the most popular approaches to monitor the position of humans around the robot, but they do not take into account information such as the body volume, which the robot needs for effective collision avoidance. In this paper, we propose a novel 3D human representation derived from body parts segmentation which combines high-level semantic information (i.e., human body parts) and volume information. To compute such a body parts segmentation, also known as human parsing in the literature, we propose a multi-view system based on a camera network. Human body parts are segmented in the frames acquired by each camera, projected into 3D world coordinates, and then aggregated to build a 3D representation of the human that is robust to occlusions. A further step of 3D data filtering has been implemented to improve robustness to outliers and segmentation accuracy. The proposed multi-view human parsing approach was tested in a real environment and its performance measured in terms of global and class accuracy on a dedicated dataset, acquired to thoroughly test the system under various conditions. The experimental results demonstrate the performance improvements that can be achieved thanks to the proposed multi-view approach.
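
    As a rough illustration of the projection-and-aggregation step described above, the sketch below back-projects the per-pixel body-part labels of each calibrated RGB-D camera into a common world frame and merges the resulting labelled point clouds. The function names, the camera-to-world matrices, and the input layout are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def backproject_labels(depth, labels, K, T_world_cam):
    """Project per-pixel body-part labels of one camera into 3D world coordinates.

    depth       : (H, W) depth map in meters
    labels      : (H, W) per-pixel body-part labels (0 = background)
    K           : (3, 3) camera intrinsic matrix
    T_world_cam : (4, 4) camera-to-world rigid transform
    """
    # Pixel coordinates of all labelled (non-background) pixels
    v, u = np.nonzero(labels)
    z = depth[v, u]
    keep = z > 0                                # discard pixels without valid depth
    u, v, z = u[keep], v[keep], z[keep]

    # Pinhole back-projection to camera coordinates
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)

    # Rigid transform into the shared world frame
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]
    return pts_world, labels[v, u]

def aggregate_views(views):
    """Concatenate labelled 3D points coming from every camera of the network."""
    clouds = [backproject_labels(*view) for view in views]
    points = np.concatenate([c[0] for c in clouds])
    point_labels = np.concatenate([c[1] for c in clouds])
    return points, point_labels
```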

    A general skeleton-based action and gesture recognition framework for human-robot collaboration

    Recognizing human actions is crucial for an effective and safe collaboration between humans and robots. For example, in a collaborative assembly task, human workers can use gestures to communicate with the robot, and the robot can use the recognized actions to anticipate the next steps in the assembly process, leading to improved safety and productivity. In this work, we propose a general framework for human action recognition based on 3D pose estimation and ensemble techniques, which can recognize both body actions and hand gestures. The framework relies on OpenPose and 2D-to-3D lifting methods to estimate 3D joints for the human body and the hands, and then feeds these joints into a set of graph convolutional networks based on the Shift-GCN architecture. The output scores of all networks are combined using an ensemble approach to predict the final human action. The proposed framework was evaluated on a custom dataset designed for human-robot collaboration tasks, named the IAS-Lab Collaborative HAR dataset. The results showed that using an ensemble of action recognition models improves the accuracy and robustness of the overall system; moreover, the proposed framework can be easily specialized to different scenarios and achieves state-of-the-art results on the HRI30 dataset when coupled with an object detector or classifier.
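
    The ensemble step can be pictured as a simple score-level fusion: each Shift-GCN-style network scores the action classes from its own 3D joint stream (body or hands), and the scores are averaged. The sketch below assumes uniform weighting and a generic model interface; it is not necessarily the exact fusion rule used in the paper.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def ensemble_predict(scored_models, weights=None):
    """Fuse class scores from several skeleton-based recognizers.

    scored_models : list of (model, joints) pairs; model(joints) returns raw
                    class scores for one 3D joint sequence (body or hands)
    weights       : optional per-model weights, uniform when None
    """
    probs = np.stack([softmax(model(joints)) for model, joints in scored_models])
    if weights is None:
        weights = np.full(len(scored_models), 1.0 / len(scored_models))
    fused = np.average(probs, axis=0, weights=weights)
    return int(np.argmax(fused)), fused
```

    In this scheme, body-action models and hand-gesture models can simply appear as different entries of the same list, each paired with the joint stream it was trained on.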

    Clustering-based refinement for 3D human body parts segmentation

    A common approach to human body parts segmentation on 3D data involves the use of a 2D segmentation network and 3D projection. Following this approach, several errors can be introduced in the final 3D segmentation output, such as segmentation errors and reprojection errors. Such errors are even more significant when considering very small body parts such as hands. In this paper, we propose a new algorithm that aims to reduce such errors and improve the 3D segmentation of human body parts. The algorithm detects noise points and wrong clusters using the DBSCAN algorithm, and changes the labels of the points by exploiting the shape and position of the clusters. We evaluated the proposed algorithm on the 3DPeople synthetic dataset and on a real dataset, highlighting how it can greatly improve the 3D segmentation of small body parts like hands. With our algorithm we achieved an IoU improvement of up to 4.68% on the synthetic dataset and up to 2.30% in the real scenario.
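
    A minimal sketch of the refinement idea, using scikit-learn's DBSCAN per body-part label to separate retained clusters from noise points, and then reassigning each noise point to the label of the nearest retained cluster. The eps/min_samples values and the nearest-centroid reassignment rule are illustrative assumptions rather than the exact rules of the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def refine_labels(points, labels, eps=0.05, min_samples=20):
    """Relabel noisy 3D points after per-label DBSCAN clustering.

    points : (N, 3) 3D points of one person
    labels : (N,) body-part label per point
    """
    refined = labels.copy()
    centroids, centroid_labels, noise_idx = [], [], []

    for part in np.unique(labels):
        idx = np.nonzero(labels == part)[0]
        clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(points[idx])
        for c in np.unique(clustering.labels_):
            members = idx[clustering.labels_ == c]
            if c == -1:                       # DBSCAN marks noise points with -1
                noise_idx.extend(members)
            else:                             # keep the cluster, remember its centroid
                centroids.append(points[members].mean(axis=0))
                centroid_labels.append(part)

    # Reassign each noise point to the label of the nearest retained cluster
    centroids = np.asarray(centroids)
    for i in noise_idx:
        nearest = np.argmin(np.linalg.norm(centroids - points[i], axis=1))
        refined[i] = centroid_labels[nearest]
    return refined
```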

    Building Ensemble of Deep Networks: Convolutional Networks and Transformers

    This paper presents a study on an automated system for image classification, which is based on the fusion of various deep learning methods. The study explores how to create an ensemble of different Convolutional Neural Network (CNN) models and transformer topologies that are fine-tuned on several datasets to leverage their diversity. The research question addressed in this work is whether different optimization algorithms can help in developing robust and efficient machine learning systems to be used in different domains for classification purposes. To that end, we introduce novel Adam variants. We employed these new approaches, coupled with several CNN topologies, to build an ensemble of classifiers that outperforms both other Adam-based methods and stochastic gradient descent. Additionally, the study combines the ensemble of CNNs with an ensemble of transformers based on different topologies, such as DeiT, ViT, Swin, and CoaT. To the best of our knowledge, this is the first work in which an in-depth study of a set of transformers and convolutional neural networks is carried out on a large set of small/medium-sized images. The experiments performed on several datasets demonstrate that the combination of such different models results in a substantial performance improvement in all tested problems. All resources are available at https://github.com/LorisNanni
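
    The fusion of such heterogeneous networks can be summarized by a sum rule over their softmax outputs, as in the hedged sketch below; the concrete model list, their preprocessing, and the paper's Adam variants are not reproduced here and are left as assumptions.

```python
import torch

@torch.no_grad()
def sum_rule_ensemble(models, images):
    """Sum-rule fusion of class probabilities from heterogeneous classifiers.

    models : list of fine-tuned torch.nn.Module classifiers (CNNs, transformers);
             in practice each model may require its own input preprocessing
    images : (B, 3, H, W) batch of preprocessed images
    """
    fused = None
    for model in models:
        model.eval()
        probs = torch.softmax(model(images), dim=1)   # per-model class probabilities
        fused = probs if fused is None else fused + probs
    fused = fused / len(models)                       # average of the summed scores
    return fused.argmax(dim=1), fused
```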

    Heterogeneous Ensemble for Medical Data Classification

    Get PDF
    For robust classification, selecting a proper classifier is of primary importance. However, selecting the best classifier depends on the problem, as some classifiers work better on some tasks than on others. Despite the many results collected in the literature, the support vector machine (SVM) remains the most widely adopted solution in many domains, thanks to its ease of use. In this paper, we propose a new method based on convolutional neural networks (CNNs) as an alternative to SVM. CNNs are specialized in processing data with a grid-like topology, which usually represents images. To enable CNNs to work on different data types, we investigate reshaping one-dimensional vector representations into two-dimensional matrices and compare different approaches for feeding standard CNNs with such two-dimensional feature representations. We evaluate the different techniques by proposing a heterogeneous ensemble based on three classifiers: an SVM, a model based on random subspace of rotation boosting (RB), and a CNN. The robustness of our approach is tested across a set of benchmark datasets that represent a wide range of medical classification tasks. The proposed ensembles provide promising performance on all datasets.
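
    One simple way to feed such vectors to a standard CNN is to zero-pad each one-dimensional feature vector to a perfect square and reshape it into a small single-channel matrix; the sketch below shows only this one scheme, which is an illustrative assumption, whereas the paper compares several reshaping approaches.

```python
import math
import numpy as np

def vector_to_matrix(features):
    """Reshape a 1D feature vector into a square matrix usable as a CNN input."""
    d = len(features)
    side = math.ceil(math.sqrt(d))              # smallest square that fits the vector
    padded = np.zeros(side * side, dtype=np.float32)
    padded[:d] = features                       # zero-pad the tail
    return padded.reshape(side, side)

# Example: a 10-dimensional sample becomes a 4x4 single-channel "image"
image = vector_to_matrix(np.random.rand(10))[None, None, :, :]
```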

    FSG-Net: a deep learning model for semantic robot grasping through few-shot learning

    Robot grasping has been widely studied in the last decade. Recently, deep learning has made it possible to achieve remarkable results in grasp pose estimation using depth and RGB images. However, only a few works consider the choice of the object to grasp. Moreover, they require a huge amount of data to generalize to unseen object categories. For this reason, we introduce the Few-shot Semantic Grasping task, where the objective is to infer a correct grasp given only five labelled images of an unseen target object. We propose a new deep learning architecture able to solve the aforementioned problem by leveraging a Few-shot Semantic Segmentation module. We evaluated the proposed model both on the GraspNet dataset and in a real scenario. On GraspNet, we achieve 40.95% accuracy in the Few-shot Semantic Grasping task, outperforming baseline approaches. In the real experiments, the results confirmed the generalization ability of the network.
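
    The pipeline can be pictured as a few-shot segmentation module that predicts a mask of the target object in the query image from the five support images, followed by the selection of grasp candidates that fall on that mask. The sketch below is a hypothetical simplification with an assumed candidate format; FSG-Net's end-to-end architecture is not reproduced here.

```python
import numpy as np

def select_semantic_grasp(query_mask, grasp_candidates):
    """Pick the best-scoring grasp whose centre lies on the segmented target object.

    query_mask       : (H, W) binary mask of the target object in the query image,
                       predicted by a few-shot segmentation model from the support set
    grasp_candidates : list of (score, (u, v)) grasp hypotheses in image coordinates,
                       produced by any generic grasp pose estimator (assumed format)
    """
    on_target = [g for g in grasp_candidates
                 if query_mask[int(g[1][1]), int(g[1][0])] > 0]
    return max(on_target, key=lambda g: g[0]) if on_target else None
```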