12 research outputs found
OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields
Reconstructing 3D shapes from planar cross-sections is a challenge inspired
by downstream applications like medical imaging and geographic informatics. The
input is an in/out indicator function fully defined on a sparse collection of
planes in space, and the output is an interpolation of the indicator function
to the entire volume. Previous works addressing this sparse and ill-posed
problem either produce low-quality results, or rely on additional priors such
as target topology, appearance information, or input normal directions. In this
paper, we present OReX, a method for 3D shape reconstruction from slices alone,
featuring a Neural Field as the interpolation prior. A simple neural network is
trained on the input planes to receive a 3D coordinate and return an
inside/outside estimate for the query point. This prior is powerful in inducing
smoothness and self-similarities. The main challenge for this approach is
high-frequency details, as the neural prior is overly smoothing. To alleviate
this, we offer an iterative estimation architecture and a hierarchical input
sampling scheme that encourage coarse-to-fine training, allowing the network to focus on
high frequencies at later stages. In addition, we identify and analyze a common
ripple-like effect stemming from the mesh extraction step. We mitigate it by
regularizing the spatial gradients of the indicator function around input
in/out boundaries, cutting the problem at the root.
Through extensive qualitative and quantitative experimentation, we
demonstrate our method is robust, accurate, and scales well with the size of
the input. We report state-of-the-art results compared to previous approaches
and recent potential solutions, and demonstrate the benefit of our individual
contributions through analysis and ablation studies.
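The boundary regularization described above can be illustrated with a toy sketch. This is not the authors' code: the indicator function, its parameters, and the finite-difference gradient estimate are all illustrative assumptions, standing in for a trained neural field whose spatial gradients are penalized near the in/out boundary.

```python
import math

def indicator(x):
    # Hypothetical stand-in for a trained neural field: a smooth in/out
    # indicator that is ~1 inside |x| < 1 and ~0 outside.
    return 1.0 / (1.0 + math.exp(20.0 * (abs(x) - 1.0)))

def boundary_gradient_penalty(xs, eps=1e-3, band=0.4):
    # Penalize the squared spatial gradient of the indicator, but only for
    # samples near the 0.5 level set (the in/out boundary), in the spirit
    # of the regularization described in the abstract.
    vals = []
    for x in xs:
        f = indicator(x)
        if abs(f - 0.5) < band:  # sample lies near the boundary
            # central finite-difference estimate of df/dx
            g = (indicator(x + eps) - indicator(x - eps)) / (2 * eps)
            vals.append(g * g)
    return sum(vals) / len(vals) if vals else 0.0

xs = [i * 0.01 - 2.0 for i in range(401)]   # samples on [-2, 2]
penalty = boundary_gradient_penalty(xs)
```

Minimizing such a term during training flattens the indicator's gradient around boundaries, which is what suppresses the ripple-like artifacts in the extracted mesh.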
Human Motion Diffusion as a Generative Prior
Recent work has demonstrated the significant potential of denoising diffusion
models for generating human motion, including text-to-motion capabilities.
However, these methods are restricted by the paucity of annotated motion data,
a focus on single-person motions, and a lack of detailed control. In this
paper, we introduce three forms of composition based on diffusion priors:
sequential, parallel, and model composition. Using sequential composition, we
tackle the challenge of long sequence generation. We introduce DoubleTake, an
inference-time method with which we generate long animations consisting of
sequences of prompted intervals and their transitions, using a prior trained
only for short clips. Using parallel composition, we show promising steps
toward two-person generation. Beginning with two fixed priors as well as a few
two-person training examples, we learn a slim communication block, ComMDM, to
coordinate interaction between the two resulting motions. Lastly, using model
composition, we first train individual priors to complete motions that realize
a prescribed motion for a given joint. We then introduce DiffusionBlending, an
interpolation mechanism to effectively blend several such models to enable
flexible and efficient fine-grained joint and trajectory-level control and
editing. We evaluate the composition methods using an off-the-shelf motion
diffusion model, and further compare the results to dedicated models trained
for these specific tasks.
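The idea behind sequential composition, stitching a long animation out of short prompted intervals and blended transitions, can be sketched in miniature. This is an assumed toy interface, not the paper's DoubleTake implementation: frames are plain floats and the transition is a simple linear ramp over an overlap region.

```python
def blend_clips(clips, overlap):
    """Stitch short 'motion clips' (lists of float frames) into one long
    sequence by linearly blending `overlap` frames at each seam."""
    out = list(clips[0])
    for clip in clips[1:]:
        tail = out[-overlap:]            # final frames of the sequence so far
        head = clip[:overlap]            # opening frames of the next clip
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # ramp from previous clip to next
            out[-overlap + i] = (1 - w) * tail[i] + w * head[i]
        out.extend(clip[overlap:])
    return out

a = [0.0] * 6          # toy "clip" holding pose 0
b = [1.0] * 6          # toy "clip" holding pose 1
long_seq = blend_clips([a, b], overlap=2)
```

The result transitions smoothly from the first clip's poses to the second's across the overlap, which is the behavior a short-clip prior alone cannot produce.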
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Text-to-image models offer unprecedented freedom to guide creation through
natural language. Yet, it is unclear how such freedom can be exercised to
generate images of specific unique concepts, modify their appearance, or
compose them in new roles and novel scenes. In other words, we ask: how can we
use language-guided models to turn our cat into a painting, or imagine a new
product based on our favorite toy? Here we present a simple approach that
allows such creative freedom. Using only 3-5 images of a user-provided concept,
like an object or a style, we learn to represent it through new "words" in the
embedding space of a frozen text-to-image model. These "words" can be composed
into natural language sentences, guiding personalized creation in an intuitive
way. Notably, we find evidence that a single word embedding is sufficient for
capturing unique and varied concepts. We compare our approach to a wide range
of baselines, and demonstrate that it can more faithfully portray the concepts
across a range of applications and tasks.
Our code, data and new words will be available at:
https://textual-inversion.github.io
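The core optimization, learning a single new "word" embedding while the generative model stays frozen, can be sketched with a toy surrogate. Everything here is an illustrative assumption: a fixed linear map stands in for the frozen text-to-image model, and a reconstruction target stands in for the user's 3-5 concept images.

```python
FROZEN_W = [0.5, -0.25, 2.0]   # stand-in for the frozen model's weights

def model_out(embedding):
    # "Frozen" model: a fixed linear map of the word embedding.
    return sum(w * e for w, e in zip(FROZEN_W, embedding))

def learn_embedding(target, steps=500, lr=0.05):
    e = [0.0, 0.0, 0.0]        # the ONLY trainable parameters
    for _ in range(steps):
        err = model_out(e) - target        # reconstruction error
        # gradient of 0.5 * err**2 w.r.t. each embedding coordinate;
        # the model weights themselves are never updated
        grad = [err * w for w in FROZEN_W]
        e = [ei - lr * g for ei, g in zip(e, grad)]
    return e

new_word = learn_embedding(target=3.0)
```

Because only the embedding is optimized, the learned "word" can later be dropped into ordinary prompts, exactly the property the abstract highlights.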
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
Text-to-image (T2I) personalization allows users to guide the creative image
generation process by combining their own visual concepts in natural language
prompts. Recently, encoder-based techniques have emerged as a new effective
approach for T2I personalization, reducing the need for multiple images and
long training times. However, most existing encoders are limited to a
single-class domain, which hinders their ability to handle diverse concepts. In
this work, we propose a domain-agnostic method that does not require any
specialized dataset or prior information about the personalized concepts. We
introduce a novel contrastive-based regularization technique to maintain high
fidelity to the target concept characteristics while keeping the predicted
embeddings close to editable regions of the latent space, by pushing the
predicted tokens toward their nearest existing CLIP tokens. Our experimental
results demonstrate the effectiveness of our approach and show how the learned
tokens are more semantic than tokens predicted by unregularized models. This
leads to a better representation that achieves state-of-the-art performance
while being more flexible than previous methods.
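The regularization idea, keeping a predicted token embedding close to its nearest existing token, can be sketched as follows. The vocabulary, embeddings, and L2 distance here are made-up assumptions for illustration; the actual method operates on CLIP token embeddings with a contrastive objective.

```python
VOCAB = {                       # hypothetical frozen token embeddings
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.3],
    "car": [-1.0, 0.5],
}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_token(pred):
    # Find the existing vocabulary token closest to the predicted embedding.
    return min(VOCAB, key=lambda t: sq_dist(pred, VOCAB[t]))

def regularization(pred, weight=0.1):
    # Penalty pulling the predicted embedding toward its nearest existing
    # token, keeping it in an editable region of the embedding space.
    anchor = VOCAB[nearest_token(pred)]
    return weight * sq_dist(pred, anchor)

pred = [0.95, 0.1]              # a predicted token embedding
token = nearest_token(pred)
reg = regularization(pred)
```

Adding such a term to the encoder's loss trades a little reconstruction fidelity for embeddings that behave like real words, which is why the learned tokens remain composable in prompts.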
State of the Art on Diffusion Models for Visual Computing
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, and to give an overview of important aspects of these
generative AI tools, including personalization, conditioning, inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike.