Point-to-Pose Voting based Hand Pose Estimation using Residual Permutation Equivariant Layer
Recently, hand pose estimation methods based on 3D input data have shown
state-of-the-art performance, because 3D data capture more spatial information
than a depth image does. However, 3D voxel-based methods need a large amount of
memory, while PointNet-based methods require tedious preprocessing steps such as
a K-nearest-neighbour search for each point. In this paper, we present a novel
deep learning hand pose estimation method for an unordered point cloud. Our
method takes 1024 3D points as input and does not require additional
information. We use the Permutation Equivariant Layer (PEL) as the basic element
and propose a residual-network version of PEL for the hand pose
estimation task. Furthermore, we propose a voting-based scheme to merge
information from individual points to the final pose output. In addition to the
pose estimation task, the voting-based scheme can also provide a point-cloud
segmentation result without requiring segmentation ground truth. We evaluate our
method on both the NYU dataset and the Hands2017Challenge dataset. Our method
outperforms recent state-of-the-art methods; our pose accuracy is
currently the best on the Hands2017Challenge dataset.
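The key property of a permutation-equivariant layer can be sketched in a few lines of NumPy. This is a minimal illustration (a per-point linear map plus a max-pooled global context, as in standard PEL formulations), not the authors' implementation; all names and dimensions are illustrative:

```python
import numpy as np

def permutation_equivariant_layer(X, W1, W2, b):
    """One PEL: each point is transformed together with a global max-pooled
    summary, so permuting the input points permutes the outputs identically."""
    pooled = X.max(axis=0, keepdims=True)              # (1, d_in) global context
    return np.maximum(X @ W1 + pooled @ W2 + b, 0.0)   # ReLU, shape (n, d_out)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))                         # 1024 unordered 3D points
W1, W2 = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
b = rng.normal(size=(8,))

perm = rng.permutation(1024)
Y = permutation_equivariant_layer(X, W1, W2, b)
Y_perm = permutation_equivariant_layer(X[perm], W1, W2, b)
assert np.allclose(Y[perm], Y_perm)                    # equivariance: f(PX) = P f(X)
```

Because the pooled summary is invariant to point order, the layer output reorders exactly as the input does, which is what lets a network process an unordered point cloud without a canonical ordering step.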
Learning Equivariant Representations
State-of-the-art deep learning systems often require large amounts of data
and computation. For this reason, leveraging known or unknown structure of the
data is paramount. Convolutional neural networks (CNNs) are successful examples
of this principle; their defining characteristic is shift equivariance.
Because a filter slides over the input, when the input shifts, the response shifts
by the same amount, exploiting the structure of natural images, where semantic
content is independent of absolute pixel positions. This property is essential
to the success of CNNs in audio, image and video recognition tasks. In this
thesis, we extend equivariance to other kinds of transformations, such as
rotation and scaling. We propose equivariant models for different
transformations defined by groups of symmetries. The main contributions are (i)
polar transformer networks, achieving equivariance to the group of similarities
on the plane, (ii) equivariant multi-view networks, achieving equivariance to
the group of symmetries of the icosahedron, (iii) spherical CNNs, achieving
equivariance to the continuous 3D rotation group, (iv) cross-domain image
embeddings, achieving equivariance to 3D rotations for 2D inputs, and (v)
spin-weighted spherical CNNs, generalizing the spherical CNNs and achieving
equivariance to 3D rotations for spherical vector fields. Applications include
image classification, 3D shape classification and retrieval, panoramic image
classification and segmentation, shape alignment and pose estimation. What
these models have in common is that they leverage symmetries in the data to
reduce sample and model complexity and improve generalization performance. The
advantages are more significant on (but not limited to) challenging tasks where
data is limited or input perturbations such as arbitrary rotations are present.
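The shift equivariance the thesis builds on can be verified numerically. The sketch below uses a circular 1D convolution with illustrative sizes (not any model from the thesis) to check that convolving a shifted signal gives the shifted output:

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular 1D convolution: slide the kernel over the signal with wraparound."""
    n = len(signal)
    return np.array([sum(signal[(i - j) % n] * kernel[j] for j in range(len(kernel)))
                     for i in range(n)])

rng = np.random.default_rng(1)
x = rng.normal(size=16)
k = rng.normal(size=5)

shift = 3
conv_then_shift = np.roll(circular_conv(x, k), shift)  # convolve first, shift after
shift_then_conv = circular_conv(np.roll(x, shift), k)  # shift first, convolve after
assert np.allclose(conv_then_shift, shift_then_conv)   # shift equivariance
```

The equivariant models for rotation and scaling described in the thesis generalize exactly this commutation property from translations to other symmetry groups.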
Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data
We present a novel method for monocular hand shape and pose estimation at an unprecedented runtime of 100 fps and at state-of-the-art accuracy. This is enabled by a new learning-based architecture designed so that it can make use of all available sources of hand training data: image data with either 2D or 3D annotations, as well as stand-alone 3D animations without corresponding image data. It features a 3D hand joint detection module and an inverse kinematics module which not only regresses 3D joint positions but also maps them to joint rotations in a single feed-forward pass. This output makes the method more directly usable for applications in computer vision and graphics than regressing 3D joint positions alone. We demonstrate that our architectural design leads to a significant quantitative and qualitative improvement over the state of the art on several challenging benchmarks. Our model is publicly available for future research.
Hand Pose Estimation in RGBD Images (Odhad polohy ruky v RGBD obrazech)
It is surprising that, even with the increasing ubiquity of Augmented Reality applications on mobile devices, users still interact with these applications via on-screen controls rather than controlling the presented environment directly with their hands in the real world. To enable this degree of interactivity, a stable and robust hand pose estimation pipeline is needed. This thesis therefore serves as a study of a possible approach to hand pose estimation that consists of two parts: segmentation and estimation of hand model parameters.
Hand Pose-based Task Learning from Visual Observations with Semantic Skill Extraction
Learning from Demonstrations is a promising technique for transferring task knowledge from a user to a robot. We propose a framework for task programming by observing the human hand pose and object locations solely with a depth camera. By extracting skills from the demonstrations, we are able to represent what the robot has learned, generalize to unseen object locations, and optimize the robotic execution instead of replaying non-optimal behavior. A two-staged segmentation algorithm that employs skill template matching via Hidden Markov Models has been developed to extract motion primitives from the demonstration and give them semantic meaning. In this way, the transfer of task knowledge is improved from a simple replay of the demonstration towards a semantically annotated, optimized, and generalized execution. We evaluate the extraction of a set of skills in simulation and show that the task execution can be optimized by such means.
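The HMM-based segmentation idea can be illustrated with a standard Viterbi decoder. This is a generic sketch of decoding a per-frame skill sequence from template log-likelihoods, not the paper's two-staged algorithm; the toy skills and probabilities are invented for illustration:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely hidden state (skill) sequence given per-frame log-likelihoods.
    log_emit: (T, S) frame-wise log-likelihood of each skill template."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (S, S): score of prev -> cur
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack the best sequence
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy demonstration: frames 0-2 match skill 0 ("reach"), frames 3-5 skill 1 ("grasp").
log_emit = np.log(np.array([[.9, .1]] * 3 + [[.1, .9]] * 3))
log_trans = np.log(np.array([[.8, .2], [.2, .8]]))   # sticky transitions
log_init = np.log(np.array([.5, .5]))
segments = viterbi(log_emit, log_trans, log_init)
assert segments == [0, 0, 0, 1, 1, 1]
```

The sticky transition matrix plays the role of a segmentation prior: it discourages spurious skill switches, so the decoded boundary falls only where the emission evidence clearly changes.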
Robustness, scalability and interpretability of equivariant neural networks across different low-dimensional geometries
In this thesis we develop neural networks that exploit the symmetries of four different low-dimensional geometries, namely 1D grids, 2D grids, 3D continuous spaces, and graphs, through the consideration of translational, rotational, cylindrical, and permutation symmetries. We apply these models to applications across a range of scientific disciplines, demonstrating their predictive ability, robustness, scalability, and interpretability.
We develop a neural network that exploits the translational symmetries on 1D grids to predict age and species of mosquitoes from high-dimensional mid-infrared spectra. We show that the model can learn to predict mosquito age and species with a higher accuracy than models that do not utilise any inductive bias. We also demonstrate that the model is sensitive to regions within the input spectra that are in agreement with regions identified by a domain expert. We present a transfer learning approach to overcome the challenge of working with small, real-world, wild collected data sets and demonstrate the benefit of the approach on a real-world application.
We demonstrate the benefit of rotation equivariant neural networks on the task of segmenting deforestation regions from satellite images through exploiting the rotational symmetry present on 2D grids. We develop a novel physics-informed architecture, exploiting the cylindrical symmetries of the group SO+(2,1), which can invert the transmission effects of multi-mode optical fibres (MMFs). We develop a new connection between a physics understanding of MMFs and group equivariant neural networks. We show that this novel architecture requires fewer training samples to learn, better generalises to out-of-distribution data sets, scales to higher-resolution images, is more interpretable, and reduces the parameter count of the model. We demonstrate the capability of the model on real-world data and provide an adaption to the model to handle real-world deviations from theory. We also show that the model can scale to higher resolution images than was previously possible.
We develop a novel architecture which provides a symmetry-preserving mapping between two different low-dimensional geometries and demonstrate its practical benefit for the application of 3D hand mesh generation from 2D images. This model exploits both the 2D rotational symmetries present in a 2D image and in a 3D hand mesh, and provides a mapping between the two data domains. We demonstrate that the model performs competitively on a range of benchmark data sets and justify the choice of inductive bias in the model.
We develop an architecture which is equivariant to a novel choice of automorphism group through the use of a sub-graph selection policy. We demonstrate the benefit of the architecture theoretically, by proving improved expressivity and scalability, and experimentally, on a range of widely studied benchmark graph classification tasks. We present a method of comparison between models that had not previously been considered in this area of research, demonstrating that recent state-of-the-art methods are statistically indistinguishable.
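The permutation symmetry on graphs underlying this line of work can be demonstrated with a minimal message-passing layer. This NumPy sketch uses generic sum aggregation (not the thesis's sub-graph selection architecture) to check that a graph-level readout is invariant to node relabeling:

```python
import numpy as np

def gnn_layer_readout(A, X, W):
    """One message-passing layer (sum aggregation over self + neighbors)
    followed by a sum readout. Relabeling nodes by a permutation P maps
    A -> P A P^T and X -> P X, leaving the graph-level readout unchanged."""
    H = np.maximum((A + np.eye(len(A))) @ X @ W, 0.0)  # per-node features
    return H.sum(axis=0)                               # permutation-invariant readout

rng = np.random.default_rng(2)
A = rng.integers(0, 2, size=(6, 6))
A = np.triu(A, 1); A = A + A.T                         # random undirected graph
X, W = rng.normal(size=(6, 4)), rng.normal(size=(4, 5))

P = np.eye(6)[rng.permutation(6)]                      # random relabeling
out = gnn_layer_readout(A, X, W)
out_relabel = gnn_layer_readout(P @ A @ P.T, P @ X, W)
assert np.allclose(out, out_relabel)
```

Equivariance to a chosen automorphism group, as in the thesis, refines this baseline: instead of being insensitive to all relabelings only at the readout, the intermediate representations themselves transform consistently under the group action.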
Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments
A real-world application or setting involves interaction between different
modalities (e.g., video, speech, text). In order to process the multimodal
information automatically and use it for an end application, Multimodal
Representation Learning (MRL) has emerged as an active area of research in
recent times. MRL involves learning reliable and robust representations of
information from heterogeneous sources and fusing them. However, in practice,
the data acquired from different sources are typically noisy. In some extreme
cases, a noise of large magnitude can completely alter the semantics of the
data leading to inconsistencies in the parallel multimodal data. In this paper,
we propose a novel method for multimodal representation learning in a noisy
environment via the generalized product of experts technique. In the proposed
method, we train a separate network for each modality to assess the credibility
of information coming from that modality, and subsequently, the contribution
from each modality is dynamically varied while estimating the joint
distribution. We evaluate our method on two challenging benchmarks from two
diverse domains: multimodal 3D hand-pose estimation and multimodal surgical
video segmentation. We attain state-of-the-art performance on both benchmarks.
Our extensive quantitative and qualitative evaluations show the advantages of
our method compared to previous approaches. Comment: 11 pages, accepted at ICMI 2022 (Oral).
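For Gaussian experts, a weighted product of experts has a closed form: precisions add, scaled by each expert's weight. The sketch below fuses per-modality means and variances with fixed credibility weights; in the paper these weights come from a learned per-modality network, and the numbers here are purely illustrative:

```python
import numpy as np

def weighted_product_of_experts(mus, sigmas, betas):
    """Fuse per-modality Gaussian experts N(mu_i, sigma_i^2), each raised to a
    credibility weight beta_i. The weighted product is again Gaussian, with
    precision sum(beta_i / sigma_i^2) and a precision-weighted mean."""
    precisions = np.asarray(betas) / np.asarray(sigmas) ** 2
    joint_var = 1.0 / precisions.sum()
    joint_mu = joint_var * (precisions * np.asarray(mus)).sum()
    return joint_mu, joint_var

# A noisy, low-credibility modality barely moves the fused estimate.
mu, var = weighted_product_of_experts(mus=[0.0, 10.0],
                                      sigmas=[1.0, 5.0],
                                      betas=[1.0, 0.1])
assert mu < 0.05   # fused mean stays close to the reliable expert at 0.0
```

Setting beta_i near zero for a corrupted modality effectively removes its expert from the product, which is the mechanism that lets the joint distribution degrade gracefully under noise.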