Facetwise Mesh Refinement for Multi-View Stereo
Mesh refinement is a fundamental step for accurate Multi-View Stereo. It modifies the geometry of an initial manifold mesh to minimize the photometric error induced in a set of camera pairs. This initial mesh is usually the output of volumetric 3D reconstruction based on min-cut over Delaunay Triangulations. Such methods produce a significant number of non-manifold vertices, so they require an explicit vertex-split step to repair them. In this paper, we extend this approach to preemptively fix the non-manifold vertices by reasoning directly on the Delaunay Triangulation, avoiding most vertex splits. The main contribution of this paper addresses the problem of choosing the camera pairs adopted by the refinement process. We treat the problem as a mesh labeling process, where each label corresponds to a camera pair. Unlike state-of-the-art methods, which use each camera pair to refine all the visible parts of the mesh, we choose, for each facet, the best pair that enforces both overall visibility and coverage. The refinement step is then applied to each facet using only the selected camera pair. This facetwise refinement helps apply the process as evenly as possible.
Comment: Accepted as Oral at ICPR202
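To make the facet-labeling idea concrete, here is a purely illustrative Python sketch, not the authors' implementation: the precomputed `visibility` score matrix and the simple coverage penalty are assumptions standing in for the paper's actual criteria.

```python
# Illustrative sketch only: greedily assign each mesh facet the camera pair
# (label) that maximizes a visibility score, penalizing over-used pairs so
# the refinement workload is spread as evenly as possible across pairs.
import numpy as np

def label_facets(n_facets, n_pairs, visibility, coverage_weight=0.5):
    """visibility: hypothetical (n_facets, n_pairs) array of scores in [0, 1]
    telling how well each facet is seen by each camera pair (e.g., from
    ray casting). Returns one camera-pair label per facet."""
    labels = np.full(n_facets, -1, dtype=int)
    usage = np.zeros(n_pairs)            # how often each pair was chosen so far
    for f in range(n_facets):
        # Penalize pairs already chosen often, to encourage even coverage.
        score = visibility[f] - coverage_weight * usage / max(1, f)
        labels[f] = int(np.argmax(score))
        usage[labels[f]] += 1
    return labels
```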
H3WB: Human3.6M 3D WholeBody Dataset and Benchmark
We present a benchmark for 3D human whole-body pose estimation, which involves identifying accurate 3D keypoints on the entire human body, including face, hands, body, and feet. Currently, the lack of a fully annotated and accurate 3D whole-body dataset forces deep networks either to be trained separately on specific body parts, which are combined during inference, or to rely on pseudo-ground truth provided by parametric body models, which is not as accurate as detection-based methods. To overcome these issues, we introduce the Human3.6M 3D WholeBody (H3WB) dataset, which provides whole-body annotations for the Human3.6M dataset using the COCO-WholeBody layout. H3WB comprises 133 whole-body keypoint annotations on 100K images, made possible by our new multi-view pipeline. We also propose three tasks: i) 3D whole-body pose lifting from a complete 2D whole-body pose, ii) 3D whole-body pose lifting from an incomplete 2D whole-body pose, and iii) 3D whole-body pose estimation from a single RGB image. Additionally, we report several baselines from popular methods for these tasks. Furthermore, we provide automated 3D whole-body annotations of TotalCapture and show experimentally that using them together with H3WB improves performance. Code and dataset are available at https://github.com/wholebody3d/wholebody3d
Comment: Accepted by ICCV 202
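As a rough illustration of task i), below is a minimal PyTorch sketch of a 2D-to-3D lifting network for the 133-keypoint COCO-WholeBody layout. The architecture, hidden size, and training setup are assumptions for illustration, not the paper's baselines.

```python
# Minimal sketch (assumed architecture): lift a complete 2D whole-body pose
# (133 keypoints) to 3D with a plain MLP.
import torch
import torch.nn as nn

class WholeBodyLifter(nn.Module):
    def __init__(self, n_joints=133, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, pose_2d):                       # pose_2d: (B, 133, 2)
        b = pose_2d.shape[0]
        return self.net(pose_2d.reshape(b, -1)).reshape(b, -1, 3)

model = WholeBodyLifter()
pred_3d = model(torch.randn(4, 133, 2))               # -> (4, 133, 3)
```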
Deep Learning-Based Human Pose Estimation: A Survey
Human pose estimation aims to locate the human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been used in a wide range of applications, including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, we conclude with the challenges involved, applications, and future research directions. We also provide a regularly updated project page: https://github.com/zczcwh/DL-HPE
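For reference, one of the standard 3D evaluation metrics covered by surveys like this one is MPJPE (Mean Per-Joint Position Error). A minimal sketch follows; the joint layout and millimetre units are conventional assumptions.

```python
# Sketch of MPJPE: mean Euclidean distance between predicted and
# ground-truth 3D joints, averaged over joints and samples.
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (N, J, 3) arrays of 3D joint positions (e.g., in mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```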
Algorithms for the reconstruction, analysis, repairing and enhancement of 3D urban models from multiple data sources
Over the last few years, there has been remarkable growth in the field of digitization of 3D buildings and urban environments. The substantial improvement of both scanning hardware and reconstruction algorithms has led to the development of representations of buildings and cities that can be remotely transmitted and inspected in real time. Among the applications that implement these technologies are several GPS navigators and virtual globes such as Google Earth or the tools provided by the Institut Cartogràfic i Geològic de Catalunya.
In particular, in this thesis, we conceptualize cities as a collection of individual buildings. Hence, we focus on the individual processing of one structure at a time, rather than on the larger-scale processing of urban environments.
Nowadays, there is a wide diversity of digitization technologies, and the choice of the appropriate one is key for each particular application. Roughly, these techniques can be grouped into three main families:
- Time-of-flight (terrestrial and aerial LiDAR).
- Photogrammetry (street-level, satellite, and aerial imagery).
- Human-edited vector data (cadastre and other map sources).
Each of these has its advantages in terms of covered area, data quality, economic cost, and processing effort.
Plane- and car-mounted LiDAR devices are optimal for sweeping huge areas, but acquiring and calibrating such devices is not a trivial task. Moreover, the capturing process is done by scan lines, which need to be registered using GPS and inertial data. As an alternative, terrestrial LiDAR devices are more accessible but cover smaller areas, and their sampling strategy usually produces massive point clouds with over-represented planar regions. A more inexpensive option is street-level imagery. A dense set of images captured with a commodity camera can be fed to state-of-the-art multi-view stereo algorithms to produce realistic-enough reconstructions. Another advantage of this approach is that it captures high-quality color data, although the resulting geometric information is usually of lower quality.
In this thesis, we analyze in-depth some of the shortcomings of these data-acquisition methods and propose new ways to overcome them. Mainly, we focus on the technologies that allow high-quality digitization of individual buildings. These are terrestrial LiDAR for geometric information and street-level imagery for color information.
Our main goal is the processing and completion of detailed 3D urban representations. For this, we will work with multiple data sources and combine them when possible to produce models that can be inspected in real-time. Our research has focused on the following contributions:
- Effective and feature-preserving simplification of massive point clouds.
- Developing normal estimation algorithms explicitly designed for LiDAR data (see the sketch after this list).
- Low-stretch panoramic representation for point clouds.
- Semantic analysis of street-level imagery for improved multi-view stereo reconstruction.
- Color improvement through heuristic techniques and the registration of LiDAR and imagery data.
- Efficient and faithful visualization of massive point clouds using image-based techniques.
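As an illustration of the normal-estimation contribution listed above, here is a minimal sketch of the standard k-nearest-neighbour PCA approach to point cloud normals. This is the generic baseline, not the thesis' LiDAR-specific algorithm; the neighbourhood size and scanner-based orientation are assumptions.

```python
# Sketch of classic PCA normal estimation: for each point, fit a plane to its
# k nearest neighbours and take the direction of least variance as the normal.
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16, viewpoint=np.zeros(3)):
    """points: (N, 3) array. Returns (N, 3) unit normals oriented
    towards the (assumed known) scanner position `viewpoint`."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    normals = np.empty_like(points)
    for i, nb in enumerate(idx):
        nbrs = points[nb] - points[nb].mean(axis=0)
        # The normal is the singular vector of least variance.
        _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
        n = vt[-1]
        # Flip towards the scanner so orientations are consistent.
        if np.dot(n, viewpoint - points[i]) < 0:
            n = -n
        normals[i] = n
    return normals
```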
CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
Despite impressive performance on high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view of the same scene, which makes it well suited to binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks, for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-view completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volumes, iterative estimation, image warping, or multi-scale reasoning, thus paving the way towards universal vision models.
Comment: ICCV 202
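To ground the cross-view completion idea, below is a toy PyTorch sketch: masked patches of view 1 are reconstructed with the help of view 2. The tiny transformer, token shapes, and 90% mask ratio are assumptions for illustration, and the sketch omits the paper's relative positional embeddings.

```python
# Toy sketch of cross-view completion: encode [masked view1 ; view2] tokens
# together and reconstruct the masked view-1 patches.
import torch
import torch.nn as nn

class ToyCrossCompletion(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)

    def forward(self, view1, view2, mask):            # mask: (B, N) bool
        # Replace masked view-1 patch tokens with a learned mask token.
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(view1), view1)
        tokens = self.backbone(torch.cat([x, view2], dim=1))
        return self.head(tokens[:, : view1.shape[1]])  # view-1 predictions

B, N, D = 2, 49, 64
view1, view2 = torch.randn(B, N, D), torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.9                         # mask 90% of view 1
pred = ToyCrossCompletion()(view1, view2, mask)
loss = nn.functional.mse_loss(pred[mask], view1[mask])  # masked patches only
```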
Learning to Generate and Refine Object Proposals
Visual object recognition is a fundamental and challenging problem in computer vision. To build a practical recognition system, one is first confronted with high computational complexity due to the enormous search space of an image, which is caused by large variations in object appearance, pose, and mutual occlusion, as well as other environmental factors. To reduce the search complexity, modern object recognition systems usually first generate a moderate set of image regions that are likely to contain an object, regardless of its category. These candidate regions are called object proposals, object hypotheses, or object candidates, and can be used for downstream classification or global reasoning in many different vision tasks, such as object detection, segmentation, and tracking.
This thesis addresses the problem of object proposal generation, including bounding-box and segment proposal generation, in real-world scenarios. In particular, we investigate representation learning for object proposal generation with 3D cues and contextual information, aiming to propose higher-quality object candidates that achieve higher object recall and better boundary coverage with fewer candidates. We focus on three main issues: 1) how can we incorporate additional geometric and high-level semantic context information into proposal generation for stereo images? 2) how do we generate object segment proposals for stereo images with learned representations and a learned grouping process? and 3) how can we learn a context-driven representation to refine segment proposals efficiently?
In this thesis, we propose a series of solutions to address each of these problems. We first propose a semantic context- and depth-aware object proposal generation method. We design a set of new cues to encode objectness, and then train an efficient random forest classifier to re-rank the initial proposals and linear regressors to fine-tune their locations. Next, we extend the task to segment proposal generation in the same setting and develop a learning-based segment proposal generation method for stereo images. Our method makes use of learned deep features and designed geometric features to represent a region, and learns a similarity network to guide the superpixel grouping process. We also learn a ranking network to predict the objectness score for each segment proposal. To address the third problem, we take a transformation-based approach to improve the quality of a given segment candidate pool based on context information. We propose an efficient deep network that learns affine transformations to warp an initial object mask towards a nearby object region, based on a novel feature pooling strategy. Finally, we extend our affine warping approach to address the object-mask alignment problem, and in particular the problem of refining a set of segment proposals. We design an end-to-end deep spatial transformer network that learns free-form deformations (FFDs) to non-rigidly warp the shape mask towards the ground truth, based on a multi-level dual mask feature pooling strategy. We evaluate all our approaches on several publicly available object recognition datasets and show superior performance.
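As context for the proposal-quality criteria mentioned above (higher object recall with fewer candidates), here is a minimal sketch of the standard recall metric for box proposals: the fraction of ground-truth boxes covered by at least one proposal above an IoU threshold. The box format and the 0.5 threshold are conventional assumptions.

```python
# Sketch of proposal recall at a given IoU threshold.
import numpy as np

def box_iou(a, b):
    """a: (N, 4), b: (M, 4) boxes as [x1, y1, x2, y2]; returns (N, M) IoU."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])   # intersection top-left
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])   # intersection bottom-right
    inter = np.clip(br - tl, 0, None).prod(-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def recall(proposals, gt, thr=0.5):
    """Fraction of ground-truth boxes matched by some proposal at IoU >= thr."""
    return (box_iou(gt, proposals).max(axis=1) >= thr).mean()
```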
Sim2real transfer learning for 3D human pose estimation: motion to the rescue
Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably optical flow and the motion of 2D keypoints. Our results therefore suggest that motion can be a simple way to bridge the sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.
Comment: Accepted at NeurIPS 201
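To illustrate the kind of motion-cue preprocessing this line of work relies on, here is a small OpenCV sketch that computes dense optical flow between consecutive frames with Farneback's method. This is a generic stand-in; the paper's exact flow estimator and parameters may differ.

```python
# Sketch: extract a dense optical-flow field as a motion cue for pose models.
import cv2

def flow_features(frame_prev, frame_next):
    """frame_*: HxWx3 uint8 BGR images; returns an HxWx2 float32 flow field."""
    g1 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(
        g1, g2, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```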