TextDeformer: Geometry Manipulation using Text Guidance
We present a technique for automatically producing a deformation of an input
triangle mesh, guided solely by a text prompt. Our framework is capable of
deformations that produce both large, low-frequency shape changes, and small
high-frequency details. Our framework relies on differentiable rendering to
connect geometry to powerful pre-trained image encoders, such as CLIP and DINO.
Notably, updating mesh geometry by taking gradient steps through differentiable
rendering is notoriously challenging, commonly resulting in deformed meshes
with significant artifacts. These difficulties are amplified by noisy and
inconsistent gradients from CLIP. To overcome this limitation, we opt to
represent our mesh deformation through Jacobians, which update deformations in
a global, smooth manner (rather than locally sub-optimal steps). Our key
observation is that Jacobians are a representation that favors smoother, large
deformations, leading to a global relation between vertices and pixels, and
avoiding localized noisy gradients. Additionally, to ensure the resulting shape
is coherent from all 3D viewpoints, we encourage the deep features computed on
the 2D encoding of the rendering to be consistent for a given vertex from all
viewpoints. We demonstrate that our method is capable of smoothly deforming a
wide variety of source meshes and target text prompts, achieving both large
modifications to, e.g., body proportions of animals, as well as adding fine
semantic details, such as shoe laces on an army boot and fine details of a
face.
CNOS: A Strong Baseline for CAD-based Novel Object Segmentation
We propose a simple three-stage approach to segment unseen objects in RGB
images using their CAD models. Leveraging recent powerful foundation models,
DINOv2 and Segment Anything, we create descriptors and generate proposals,
including binary masks for a given input RGB image. By matching proposals with
reference descriptors created from CAD models, we achieve precise object ID
assignment along with modal masks. We experimentally demonstrate that our
method achieves state-of-the-art results in CAD-based novel object
segmentation, surpassing existing approaches on the seven core datasets of the
BOP challenge by 19.8% AP using the same BOP evaluation protocol. Our source
code is available at https://github.com/nv-nguyen/cnos
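The matching stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure or aggregation is used, so cosine similarity, the mean over templates, the `min_score` threshold, and the toy 3-dimensional descriptors (standing in for DINOv2 features) are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def assign_object_ids(proposals, references, min_score=0.5):
    """Assign each proposal the object whose templates match it best.

    proposals:  list of descriptor vectors, one per mask proposal.
    references: dict mapping object ID -> list of template descriptors.
    Returns a list of (object_id, score); object_id is None below min_score.
    """
    results = []
    for desc in proposals:
        best_id, best_score = None, -1.0
        for obj_id, templates in references.items():
            # Assumed aggregation: mean similarity over all templates.
            score = sum(cosine(desc, t) for t in templates) / len(templates)
            if score > best_score:
                best_id, best_score = obj_id, score
        if best_score < min_score:
            best_id = None  # no reference object is similar enough
        results.append((best_id, best_score))
    return results

# Toy data: two reference objects, two proposals.
references = {
    "mug": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "drill": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
proposals = [[0.95, 0.05, 0.0], [0.0, 0.0, 1.0]]
matches = assign_object_ids(proposals, references)
```

Here the first proposal is assigned to "mug", while the second resembles neither object and is left unassigned.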
NOPE: Novel Object Pose Estimation from a Single Image
The practicality of 3D object pose estimation remains limited for many
applications due to the need for prior knowledge of a 3D model and a training
period for new objects. To address this limitation, we propose an approach that
takes a single image of a new object as input and predicts the relative pose of
this object in new images without prior knowledge of the object's 3D model and
without requiring training time for new objects and categories. We achieve this
by training a model to directly predict discriminative embeddings for
viewpoints surrounding the object. This prediction is done using a simple U-Net
architecture with attention and conditioned on the desired pose, which yields
extremely fast inference. We compare our approach to state-of-the-art methods
and show it outperforms them both in terms of accuracy and robustness. Our
source code is publicly available at https://github.com/nv-nguyen/nop
3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets
We present 3DMiner -- a pipeline for mining 3D shapes from challenging
large-scale unannotated image datasets. Unlike other unsupervised 3D
reconstruction methods, we assume that, within a large-enough dataset, there
must exist images of objects with similar shapes but varying backgrounds,
textures, and viewpoints. Our approach leverages the recent advances in
learning self-supervised image representations to cluster images with
geometrically similar shapes and find common image correspondences between
them. We then exploit these correspondences to obtain rough camera estimates as
initialization for bundle-adjustment. Finally, for every image cluster, we
apply a progressive bundle-adjusting reconstruction method to learn a neural
occupancy field representing the underlying shape. We show that this procedure
is robust to several types of errors introduced in previous steps (e.g., wrong
camera poses, images containing dissimilar shapes, etc.), allowing us to obtain
shape and pose annotations for images in-the-wild. When using images from Pix3D
chairs, our method is capable of producing significantly better results than
state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively
and qualitatively. Furthermore, we show how 3DMiner can be applied to
in-the-wild data by reconstructing shapes present in images from the LAION-5B
dataset. Project Page: https://ttchengab.github.io/3dminerOfficial
Comment: In ICCV 202
Reconstruction et correspondance de formes par apprentissage
The goal of this thesis is to develop deep learning approaches to model and analyse 3D shapes. Progress in this field could democratize artistic creation of 3D assets, which currently requires time and expert skills with technical software.
We focus on the design of deep learning solutions for two particular tasks, key to many 3D modeling applications: single-view reconstruction and shape matching.
A single-view reconstruction (SVR) method takes as input a single image and predicts the physical world which produced that image. SVR dates back to the early days of computer vision. In particular, in the 1960s, Lawrence G. Roberts proposed to align simple 3D primitives to the input image under the assumption that the physical world is made of cuboids. Another approach, proposed by Berthold Horn in the 1970s, is to decompose the input image into intrinsic images and use those to predict the depth of every input pixel.
Since several configurations of shapes, texture and illumination can explain the same image, both approaches need to form assumptions on the distribution of images and 3D shapes to resolve the ambiguity. In this thesis, we learn these assumptions from large-scale datasets instead of manually designing them. Learning allows us to perform complete object reconstruction, including parts which are not visible in the input image.
Shape matching aims at finding correspondences between 3D objects. Solving this task requires both a local and global understanding of 3D shapes, which is hard to achieve explicitly.
Instead, we train neural networks on large-scale datasets to solve this task and capture this knowledge implicitly through their internal parameters.
Shape matching supports many 3D modeling applications such as attribute transfer, automatic rigging for animation, or mesh editing.
The first technical contribution of this thesis is a new parametric representation of 3D surfaces modeled by neural networks.
The choice of data representation is a critical aspect of any 3D reconstruction algorithm. Until recently, most of the approaches in deep 3D model generation were predicting volumetric voxel grids or point clouds, which are discrete representations. Instead, we present an alternative approach that predicts a parametric surface deformation, i.e., a mapping from a template to a target geometry. To demonstrate the benefits of such a representation, we train a deep encoder-decoder for single-view reconstruction using our new representation. Our approach, dubbed AtlasNet, is the first deep single-view reconstruction approach able to reconstruct meshes from images without relying on an independent post-processing, and can do it at arbitrary resolution without memory issues. A more detailed analysis of AtlasNet reveals it also generalizes better to categories it has not been trained on than other deep 3D generation approaches.
Our second main contribution is a novel shape matching approach purely based on reconstruction via deformations. We show that the quality of the shape reconstructions is critical to obtain good correspondences, and therefore introduce a test-time optimization scheme to refine the learned deformations. For humans and other deformable shape categories deviating by a near-isometry, our approach can leverage a shape template and isometric regularization of the surface deformations.
As categories exhibiting non-isometric variations, such as chairs, do not have a clear template, we learn how to deform any shape into any other and leverage cycle-consistency constraints to learn meaningful correspondences. Our reconstruction-for-matching strategy operates directly on point clouds, is robust to many types of perturbations, and outperforms the state of the art by 15% on dense matching of real human scans.
Deep Transformation-Invariant Clustering
Project webpage: http://imagine.enpc.fr/~monniert/DTIClustering/
Recent advances in image clustering typically focus on learning better deep representations. In contrast, we present an orthogonal approach that does not rely on abstract features but instead learns to predict transformations and performs clustering directly in pixel space. This learning process naturally fits in the gradient-based training of K-means and Gaussian mixture models, without requiring any additional loss or hyper-parameters. It leads us to two new deep transformation-invariant clustering frameworks, which jointly learn prototypes and transformations. More specifically, we use deep learning modules that enable us to resolve invariance to spatial, color and morphological transformations. Our approach is conceptually simple and comes with several advantages, including the possibility to easily adapt the desired invariance to the task and a strong interpretability of both cluster centers and assignments to clusters. We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks. Finally, we showcase its robustness and the advantages of its improved interpretability by visualizing clustering results over real photograph collections.
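The idea of clustering directly in the input space while staying invariant to a transformation family can be sketched on toy data. This is a deliberately simplified stand-in, not the method of the paper: the paper predicts transformations with deep modules trained by gradient descent, whereas the sketch below swaps in a brute-force search over cyclic shifts of 1D signals and a classical K-means update. All function names are hypothetical.

```python
def shift(signal, k):
    """Cyclically shift a 1D signal (a list) right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_alignment(sample, prototype):
    """Return (distance, shift) for the prototype shift closest to the sample."""
    return min((sq_dist(sample, shift(prototype, k)), k)
               for k in range(len(prototype)))

def ti_kmeans(samples, prototypes, iterations=10):
    """K-means in raw signal space, invariant to cyclic shifts."""
    prototypes = [list(p) for p in prototypes]
    assignments = []
    for _ in range(iterations):
        aligned = [[] for _ in prototypes]
        assignments = []
        # Assignment step: choose the prototype that fits best once shifted.
        for s in samples:
            (_, k), j = min((best_alignment(s, p), j)
                            for j, p in enumerate(prototypes))
            assignments.append(j)
            # Undo the shift so the sample lines up with its prototype.
            aligned[j].append(shift(s, -k))
        # Update step: each prototype becomes the mean of its aligned members.
        for j, members in enumerate(aligned):
            if members:
                prototypes[j] = [sum(col) / len(members)
                                 for col in zip(*members)]
    return prototypes, assignments

# Toy data: shifted copies of a single pulse and of a double pulse.
samples = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
           [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
prototypes, assignments = ti_kmeans(samples, [[1, 0, 0, 0], [1, 1, 0, 0]])
```

Because the prototypes live in the same space as the data, each one can be inspected directly, which mirrors the interpretability argument made in the abstract.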
Neural Face Rigging for Animating and Retargeting Facial Meshes in the Wild
We propose an end-to-end deep-learning approach for automatic rigging and
retargeting of 3D models of human faces in the wild. Our approach, called
Neural Face Rigging (NFR), holds three key properties:
(i) NFR's expression space maintains human-interpretable editing parameters
for artistic controls;
(ii) NFR is readily applicable to arbitrary facial meshes with different
connectivity and expressions;
(iii) NFR can encode and produce fine-grained details of complex expressions
performed by arbitrary subjects.
To the best of our knowledge, NFR is the first approach to provide realistic
and controllable deformations of in-the-wild facial meshes, without the manual
creation of blendshapes or correspondence. We design a deformation autoencoder
and train it through a multi-dataset training scheme, which benefits from the
unique advantages of two data sources: a linear 3DMM with interpretable control
parameters as in FACS, and 4D captures of real faces with fine-grained details.
Through various experiments, we show NFR's ability to automatically produce
realistic and accurate facial deformations across a wide range of existing
datasets as well as noisy facial scans in-the-wild, while providing
artist-controlled, editable parameters.
Comment: SIGGRAPH 2023 (Conference Track), 13 pages, 15 figures
AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation
We introduce a method for learning to generate the surface of 3D shapes. Our approach represents a 3D shape as a collection of parametric surface elements and, in contrast to methods generating voxel grids or point clouds, naturally infers a surface representation of the shape. Beyond its novelty, our new shape generation framework, AtlasNet, comes with significant advantages, such as improved precision and generalization capabilities, and the possibility to generate a shape of arbitrary resolution without memory issues. We demonstrate these benefits and compare to strong baselines on the ShapeNet benchmark for two applications: (i) auto-encoding shapes, and (ii) single-view reconstruction from a still image. We also provide results showing its potential for other applications, such as morphing, parametrization, super-resolution, matching, and co-segmentation.
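A small sketch can show why a parametric surface element gives meshes at arbitrary resolution. It assumes a single element and replaces the learned network with a hypothetical closed-form map `patch` from the unit square to part of a unit sphere; only the sampling and triangulation logic is the point of the example.

```python
import math

def patch(u, v):
    """Hypothetical stand-in for one learned surface element: maps a point
    (u, v) of the unit square to 3D. In AtlasNet this role is played by a
    neural network; here the square is wrapped onto part of a unit sphere."""
    theta = math.pi * (0.25 + 0.5 * u)  # polar angle
    phi = math.pi * v                   # azimuth
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def sample_patch(surface, n):
    """Sample an n x n grid (n >= 2) on the unit square and triangulate it."""
    vertices = [surface(i / (n - 1), j / (n - 1))
                for i in range(n) for j in range(n)]
    faces = []
    for i in range(n - 1):
        for j in range(n - 1):
            a = i * n + j                      # first corner of the grid cell
            b, c, d = a + 1, a + n, a + n + 1  # remaining three corners
            faces += [(a, b, c), (b, d, c)]    # two triangles per cell
    return vertices, faces

# The same continuous map yields a coarse or a fine mesh on demand.
coarse_vertices, coarse_faces = sample_patch(patch, 4)
fine_vertices, fine_faces = sample_patch(patch, 32)
```

The connectivity comes from the grid on the unit square, so the output resolution is chosen at sampling time and is independent of how the map itself is stored.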