Learned Semantic Multi-Sensor Depth Map Fusion
Volumetric depth map fusion based on truncated signed distance functions has become a standard method and is used in many 3D reconstruction pipelines. In this paper, we generalize this classic method in multiple ways: 1) Semantics: Semantic information enriches the scene representation and is incorporated into the fusion process. 2) Multi-sensor: Depth information can originate from different sensors or algorithms with very different noise and outlier statistics, which are taken into account during data fusion. 3) Scene denoising and completion: Sensors can fail to recover depth for certain materials and lighting conditions, or data can be missing due to occlusions; our method denoises the geometry, closes holes, and computes a watertight surface for every semantic class. 4) Learning: We propose a neural network reconstruction method that unifies all these properties within a single framework. Our method learns sensor or algorithm properties jointly with semantic depth fusion and scene completion, and can also be used as an expert system, e.g. to unify the strengths of various photometric stereo algorithms. Our approach is the first to unify all these properties. Experimental evaluations on both synthetic and real data sets demonstrate clear improvements.
Comment: 11 pages, 7 figures, 2 tables; accepted for the 2nd Workshop on 3D Reconstruction in the Wild (3DRW2019) in conjunction with ICCV 2019
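The baseline this paper generalizes is the classic weighted-average TSDF update. Below is a minimal NumPy sketch of that per-voxel update for a single depth map, assuming a uniform per-observation weight; the paper's contribution is precisely to replace such hand-set weighting with learned, sensor- and semantics-aware fusion, which this sketch does not implement.

import numpy as np

def integrate_depth(tsdf, weight, depth, K, cam_pose, origin, voxel_size, trunc=0.05):
    # Fuse one depth map into a running TSDF grid (classic weighted average).
    # tsdf, weight: (X, Y, Z) float arrays holding the fusion state.
    X, Y, Z = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    pts_w = origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    # Move voxel centres into the camera frame, then project with intrinsics K.
    w2c = np.linalg.inv(cam_pose)
    pts_c = pts_w @ w2c[:3, :3].T + w2c[:3, 3]
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)
    u = np.round(pts_c[:, 0] / z_safe * K[0, 0] + K[0, 2]).astype(int)
    v = np.round(pts_c[:, 1] / z_safe * K[1, 1] + K[1, 2]).astype(int)
    H, W = depth.shape
    ok = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[ok] = depth[v[ok], u[ok]]
    # Signed distance along the ray, truncated to [-1, 1] in units of `trunc`.
    sdf = d - z
    upd = ok & (d > 0) & (sdf > -trunc)
    t_new = np.clip(sdf / trunc, -1.0, 1.0)
    t, w = tsdf.reshape(-1), weight.reshape(-1)
    # Uniform weight of 1 per observation; a multi-sensor variant would set
    # this from each sensor's noise and outlier statistics instead.
    t[upd] = (t[upd] * w[upd] + t_new[upd]) / (w[upd] + 1.0)
    w[upd] += 1.0
    return tsdf, weight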
HoloPose: Holistic 3D Human Reconstruction In-The-Wild
We introduce HoloPose, a method for holistic monocular 3D human body reconstruction. We first introduce a part-based model for 3D model parameter regression that allows our method to operate in-the-wild, gracefully handling severe occlusions and large pose variation. We further train a multi-task network comprising 2D, 3D and DensePose estimation to drive the 3D reconstruction task. For this we introduce an iterative refinement method that aligns the model-based 3D estimates of 2D/3D joint positions and DensePose with their image-based counterparts delivered by CNNs, achieving both model-based global consistency and high spatial accuracy thanks to the bottom-up CNN processing. We validate our contributions on challenging benchmarks, showing that our method obtains both accurate joint and 3D surface estimates while operating at more than 10 fps in-the-wild. More information about our approach, including videos and demos, is available at http://arielai.com/holopose
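A minimal sketch of the kind of iterative refinement the abstract describes: fitting model parameters so that projected model joints agree with CNN-detected 2D keypoints. Everything here is a toy stand-in (a linear joint model, weak-perspective projection, finite-difference gradients); the actual HoloPose refinement uses its full part-based body model and also aligns DensePose and 3D joint estimates.

import numpy as np

def project(joints3d, scale, trans):
    # Weak-perspective projection: drop depth, then scale and translate.
    return scale * joints3d[:, :2] + trans

def reproj_loss(theta, mean, basis, kps2d, conf, scale, trans):
    # Toy linear "body" model: parameters -> 3D joints (mean: (J,3), basis: (J,3,P)).
    joints3d = mean + basis @ theta
    err = project(joints3d, scale, trans) - kps2d
    # Weight each joint's squared error by the CNN's detection confidence.
    return np.sum(conf[:, None] * err ** 2)

def refine(theta, mean, basis, kps2d, conf, scale, trans, iters=200, lr=1e-3, eps=1e-5):
    # Descend the reprojection error with finite-difference gradients.
    for _ in range(iters):
        base = reproj_loss(theta, mean, basis, kps2d, conf, scale, trans)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            tp = theta.copy()
            tp[i] += eps
            grad[i] = (reproj_loss(tp, mean, basis, kps2d, conf, scale, trans) - base) / eps
        theta = theta - lr * grad
    return theta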
From pixels to people: recovering location, shape and pose of humans in images
Humans are at the centre of a significant amount of research in computer vision. Endowing machines with the ability to perceive people from visual data is an immense scientific challenge with a high degree of direct practical relevance. Success in automatic perception can be measured at different levels of abstraction, depending on which intelligent behaviour we are trying to replicate: the ability to localise persons in an image or in the environment, understanding how persons are moving at the skeleton and at the surface level, interpreting their interactions with the environment including with other people, and perhaps even anticipating future actions. In this thesis we tackle different sub-problems of the broad research area referred to as "looking at people", aiming to perceive humans in images at different levels of granularity.

We start with bounding box-level pedestrian detection: we present a retrospective analysis of methods published in the decade preceding our work, identifying various strands of research that have advanced the state of the art. With quantitative experiments, we demonstrate the critical role of developing better feature representations and having the right training distribution. We then contribute two methods based on the insights derived from our analysis: one that combines the strongest aspects of past detectors and another that focuses purely on learning representations. The latter method outperforms more complicated approaches, especially those based on hand-crafted features. We conclude our work on pedestrian detection with a forward-looking analysis that maps out potential avenues for future research.

We then turn to pixel-level methods: perceiving humans requires us to both separate them precisely from the background and identify their surroundings. To this end, we introduce Cityscapes, a large-scale dataset for street scene understanding, which has since established itself as a go-to benchmark for segmentation and detection. We additionally develop methods that relax the requirement for expensive pixel-level annotations, focusing on the task of boundary detection, i.e. identifying the outlines of relevant objects and surfaces.

Next, we make the jump from pixels to 3D surfaces, from localising and labelling to fine-grained spatial understanding. We contribute a method for recovering 3D human shape and pose which marries the advantages of learning-based and model-based approaches. We conclude the thesis with a detailed discussion of benchmarking practices in computer vision. Among other things, we argue that the design of future datasets should be driven by the general goal of combinatorial robustness besides task-specific considerations.
Human Shape from Silhouettes Using Generative HKS Descriptors and Cross-Modal Neural Networks
In this work, we present a novel method for capturing human body shape from a single scaled silhouette. We combine deep correlated features capturing different 2D views, and embedding spaces based on 3D cues, in a novel convolutional neural network (CNN) based architecture. We first train a CNN to find a richer body shape representation space from pose-invariant 3D human shape descriptors. Then, we learn a mapping from silhouettes to this representation space with the help of a novel architecture that exploits the correlation of multi-view data during training time to improve prediction at test time. We extensively validate our results on synthetic and real data, demonstrating significant improvements in accuracy as compared to the state of the art, and providing a practical system for detailed human body measurements from a single image.
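A compact PyTorch sketch of the two-stage cross-modal idea the abstract outlines: first learn an embedding space from 3D shape descriptors (such as HKS), then train a silhouette network to regress into that same space. All layer sizes and architectures here are illustrative placeholders rather than the paper's actual networks, and the multi-view correlation term is omitted.

import torch
import torch.nn as nn

# Stage 1 (assumed already trained): an encoder that maps pose-invariant 3D
# shape descriptors (e.g. flattened HKS features) to a compact embedding.
hks_encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 64))

# Stage 2: a CNN that maps a single-channel silhouette image into the same space.
silhouette_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64))

def cross_modal_loss(silhouette, hks_desc):
    # Regress the silhouette embedding towards the frozen shape embedding,
    # so both modalities share one body-shape representation space.
    with torch.no_grad():
        target = hks_encoder(hks_desc)
    return nn.functional.mse_loss(silhouette_cnn(silhouette), target)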
When Deep Learning Meets Data Alignment: A Review on Deep Registration Networks (DRNs)
Registration is the process that computes the transformation that aligns sets of data. Commonly, a registration process can be divided into four main steps: target selection, feature extraction, feature matching, and transform computation for the alignment. The accuracy of the result depends on multiple factors, the most significant being the quantity of input data; the presence of noise, outliers and occlusions; the quality of the extracted features; real-time requirements; and the type of transformation, especially transformations defined by multiple parameters, like non-rigid deformations.

Recent advancements in machine learning could be a turning point on these issues, particularly with the development of deep learning (DL) techniques, which are helping to improve multiple computer vision problems through an abstract understanding of the input data. In this paper, a review of deep learning-based registration methods is presented. We classify the different papers by proposing a framework extracted from the traditional registration pipeline, in order to analyse the strengths of the new learning-based proposals. Deep Registration Networks (DRNs) try to solve the alignment task either by replacing part of the traditional pipeline with a network or by solving the registration problem in full. The main conclusions are: 1) learning-based registration techniques cannot always be clearly mapped onto the traditional pipeline; 2) these approaches allow more complex inputs, such as conceptual models, as well as the traditional 3D datasets; 3) in spite of the generality of learning, the current proposals are still ad hoc solutions; and 4) this is a young topic that still requires a large effort to reach general solutions able to cope with the problems that affect traditional approaches.
Comment: Submitted to Pattern Recognition
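As a reference point for the review's taxonomy, here is a minimal NumPy sketch of the traditional feature-based pipeline (feature extraction, matching, and closed-form rigid transform computation via the Kabsch algorithm). `extract` and `match` are hypothetical callables standing in for whichever stage a DRN might replace, and target selection is assumed done.

import numpy as np

def rigid_transform(src, dst):
    # Kabsch: least-squares rotation R and translation t with dst ~ R @ src + t.
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t

def register(source, target, extract, match):
    # The traditional pipeline after target selection; a DRN may replace
    # `extract`, `match`, or this whole function with a learned network.
    f_src, f_dst = extract(source), extract(target)        # feature extraction
    idx_s, idx_t = match(f_src, f_dst)                     # feature matching
    return rigid_transform(source[idx_s], target[idx_t])   # transform computation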
On the Production of Semantic and Textured 3D Meshes of Large-Scale Urban Environments from Mobile Mapping Images and LiDAR Scans
In this paper we present a fully automatic framework for the reconstruction of a 3D mesh, its texture mapping and its semantization, using oriented images and LiDAR scans acquired in a large urban area by a terrestrial Mobile Mapping System (MMS). First, the acquired points and images are sliced into temporal chunks, ensuring a reasonable size and time consistency between geometry (points) and photometry (images). Then, a simple and fast 3D surface reconstruction relying on the sensor space topology is performed on each chunk after an isotropic sampling of the point cloud obtained from the raw LiDAR scans. The method of [31] is subsequently adapted to texture the reconstructed surface with the images acquired simultaneously, ensuring a high-quality texture and global color adjustment. Finally, based on the texturing scheme, a per-texel semantization is conducted on the final model.
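A simplified sketch of what "surface reconstruction relying on the sensor space topology" can look like in practice: because a LiDAR scan is ordered as a 2D grid (scan line by point index), adjacent samples can be triangulated directly, skipping cells with missing returns or implausibly long edges. This is a generic illustration under those assumptions, not the paper's exact algorithm, and it omits the isotropic resampling step.

import numpy as np

def sensor_topology_mesh(range_image, max_edge=0.5):
    # range_image: (H, W, 3) array of 3D points ordered by scan topology;
    # emit up to two triangles per grid cell, rejecting invalid or long edges.
    H, W, _ = range_image.shape
    pts = range_image.reshape(-1, 3)
    idx = np.arange(H * W).reshape(H, W)
    tris = []
    for i in range(H - 1):
        for j in range(W - 1):
            a, b, c, d = idx[i, j], idx[i, j + 1], idx[i + 1, j], idx[i + 1, j + 1]
            for tri in ((a, b, c), (b, d, c)):
                p = pts[list(tri)]
                edges = np.linalg.norm(p - np.roll(p, 1, axis=0), axis=1)
                if np.all(np.isfinite(p)) and edges.max() < max_edge:
                    tris.append(tri)
    return pts, np.asarray(tris)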