13 research outputs found
OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields
Reconstructing 3D shapes from planar cross-sections is a challenge inspired
by downstream applications like medical imaging and geographic informatics. The
input is an in/out indicator function fully defined on a sparse collection of
planes in space, and the output is an interpolation of the indicator function
to the entire volume. Previous works addressing this sparse and ill-posed
problem either produce low quality results, or rely on additional priors such
as target topology, appearance information, or input normal directions. In this
paper, we present OReX, a method for 3D shape reconstruction from slices alone,
featuring a Neural Field as the interpolation prior. A simple neural network is
trained on the input planes to receive a 3D coordinate and return an
inside/outside estimate for the query point. This prior is powerful in inducing
smoothness and self-similarities. The main challenge for this approach is
high-frequency details, as the neural prior tends to over-smooth. To alleviate
this, we offer an iterative estimation architecture and a hierarchical input
sampling scheme that encourage coarse-to-fine training, allowing the network to
focus on high frequencies at later stages. In addition, we identify and analyze a common
ripple-like effect stemming from the mesh extraction step. We mitigate it by
regularizing the spatial gradients of the indicator function around input
in/out boundaries, cutting the problem at the root.
Through extensive qualitative and quantitative experimentation, we
demonstrate our method is robust, accurate, and scales well with the size of
the input. We report state-of-the-art results compared to previous approaches
and recent potential solutions, and demonstrate the benefit of our individual
contributions through analysis and ablation studies.
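To make the setup above concrete, here is a minimal sketch, in PyTorch, of a neural field that maps a 3D query point to an in/out logit, is fit to labeled points sampled on the input planes, and is regularized by penalizing the indicator's spatial gradient near in/out boundaries. It illustrates the general idea only, not the authors' architecture; names such as InOutField, training_step, and lambda_grad are ours, and the iterative estimation and hierarchical sampling schemes are not shown.

```python
# Illustrative sketch only (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InOutField(nn.Module):
    def __init__(self, hidden=256, depth=6):
        super().__init__()
        layers, dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.Softplus(beta=100)]
            dim = hidden
        layers.append(nn.Linear(dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, xyz):              # xyz: (N, 3) query coordinates
        return self.net(xyz)             # (N, 1) logit; sigmoid > 0.5 read as "inside"

def training_step(field, plane_pts, labels, boundary_pts, lambda_grad=0.1):
    # plane_pts: (N, 3) samples on the input cross-section planes, labels in {0, 1}.
    data_loss = F.binary_cross_entropy_with_logits(field(plane_pts), labels)

    # Discourage steep indicator gradients near the in/out boundaries, which is
    # where ripple-like artifacts otherwise appear after mesh extraction.
    boundary_pts = boundary_pts.detach().clone().requires_grad_(True)
    probs = torch.sigmoid(field(boundary_pts))
    grads = torch.autograd.grad(probs.sum(), boundary_pts, create_graph=True)[0]
    return data_loss + lambda_grad * grads.norm(dim=-1).mean()

# Toy usage with random stand-in data:
field = InOutField()
loss = training_step(field, torch.randn(128, 3),
                     torch.randint(0, 2, (128, 1)).float(), torch.randn(64, 3))
```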
Human Motion Diffusion as a Generative Prior
Recent work has demonstrated the significant potential of denoising diffusion
models for generating human motion, including text-to-motion capabilities.
However, these methods are restricted by the paucity of annotated motion data,
a focus on single-person motions, and a lack of detailed control. In this
paper, we introduce three forms of composition based on diffusion priors:
sequential, parallel, and model composition. Using sequential composition, we
tackle the challenge of long sequence generation. We introduce DoubleTake, an
inference-time method with which we generate long animations consisting of
sequences of prompted intervals and their transitions, using a prior trained
only for short clips. Using parallel composition, we show promising steps
toward two-person generation. Beginning with two fixed priors as well as a few
two-person training examples, we learn a slim communication block, ComMDM, to
coordinate interaction between the two resulting motions. Lastly, using model
composition, we first train individual priors to complete motions that realize
a prescribed motion for a given joint. We then introduce DiffusionBlending, an
interpolation mechanism to effectively blend several such models to enable
flexible and efficient fine-grained joint and trajectory-level control and
editing. We evaluate the composition methods using an off-the-shelf motion
diffusion model, and further compare the results to dedicated models trained
for these specific tasks.
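The sequential-composition idea can be conveyed with a deliberately simplified sketch: sample one short clip per prompt from a short-horizon prior and cross-fade overlapping transition windows. This replaces the paper's diffusion-level blending with a plain linear cross-fade purely to show the structure; sample_short_clip, the clip length, and the overlap size are illustrative stand-ins, not DoubleTake itself.

```python
# Illustrative sketch only (not the paper's DoubleTake implementation).
import torch

def sample_short_clip(prompt: str, length: int, dim: int = 135) -> torch.Tensor:
    # Placeholder for a pretrained short-clip motion prior; returns (length, dim) poses.
    torch.manual_seed(abs(hash(prompt)) % (2 ** 31))
    return torch.randn(length, dim).cumsum(dim=0) * 0.01

def compose_long_motion(prompts, clip_len=120, overlap=20):
    clips = [sample_short_clip(p, clip_len) for p in prompts]
    out = clips[0]
    for nxt in clips[1:]:
        # Linear cross-fade over the overlapping transition window.
        w = torch.linspace(0.0, 1.0, overlap).unsqueeze(-1)
        blended = (1 - w) * out[-overlap:] + w * nxt[:overlap]
        out = torch.cat([out[:-overlap], blended, nxt[overlap:]], dim=0)
    return out  # (total_frames, dim) long motion stitched from short clips

long_motion = compose_long_motion(["walk forward", "turn left", "sit down"])
```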
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Text-to-image models offer unprecedented freedom to guide creation through
natural language. Yet, it is unclear how such freedom can be exercised to
generate images of specific unique concepts, modify their appearance, or
compose them in new roles and novel scenes. In other words, we ask: how can we
use language-guided models to turn our cat into a painting, or imagine a new
product based on our favorite toy? Here we present a simple approach that
allows such creative freedom. Using only 3-5 images of a user-provided concept,
like an object or a style, we learn to represent it through new "words" in the
embedding space of a frozen text-to-image model. These "words" can be composed
into natural language sentences, guiding personalized creation in an intuitive
way. Notably, we find evidence that a single word embedding is sufficient for
capturing unique and varied concepts. We compare our approach to a wide range
of baselines, and demonstrate that it can more faithfully portray the concepts
across a range of applications and tasks.
Our code, data and new words will be available at:
https://textual-inversion.github.io
Comment: Project page: https://textual-inversion.github.io
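The core mechanism, optimizing a single new token embedding against a frozen model, can be sketched as follows. This is a self-contained toy, not the released implementation: denoising_loss stands in for the frozen diffusion model's noise-prediction loss, and the vocabulary size, embedding width, and initialization index are illustrative.

```python
# Illustrative sketch only: one trainable embedding, everything pretrained stays frozen.
import torch
import torch.nn as nn

vocab_size, emb_dim = 49408, 768                  # CLIP-like sizes, for illustration
token_table = nn.Embedding(vocab_size, emb_dim)   # stands in for the pretrained table
token_table.requires_grad_(False)                 # frozen

# The new pseudo-word "S*" gets a single trainable embedding vector.
s_star = nn.Parameter(token_table.weight[42].detach().clone())  # init from a related word
optimizer = torch.optim.AdamW([s_star], lr=5e-3)

def embed_prompt(token_ids, placeholder_pos):
    emb = token_table(token_ids)                  # (T, emb_dim), frozen lookup
    mask = torch.zeros(len(token_ids), 1)
    mask[placeholder_pos] = 1.0
    return (1 - mask) * emb + mask * s_star       # splice in the learned vector

def denoising_loss(prompt_emb, image_batch):
    # Stand-in for the frozen diffusion model's noise-prediction loss.
    return (prompt_emb.mean() - image_batch.mean()) ** 2

for step in range(100):
    token_ids = torch.randint(0, vocab_size, (8,))   # toy ids for "a photo of S*"
    images = torch.randn(4, 3, 64, 64)               # 3-5 concept images in practice
    loss = denoising_loss(embed_prompt(token_ids, placeholder_pos=3), images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```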
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
Text-to-image (T2I) personalization allows users to guide the creative image
generation process by combining their own visual concepts in natural language
prompts. Recently, encoder-based techniques have emerged as a new effective
approach for T2I personalization, reducing the need for multiple images and
long training times. However, most existing encoders are limited to a
single-class domain, which hinders their ability to handle diverse concepts. In
this work, we propose a domain-agnostic method that does not require any
specialized dataset or prior information about the personalized concepts. We
introduce a novel contrastive-based regularization technique to maintain high
fidelity to the target concept characteristics while keeping the predicted
embeddings close to editable regions of the latent space, by pushing the
predicted tokens toward their nearest existing CLIP tokens. Our experimental
results demonstrate the effectiveness of our approach and show how the learned
tokens are more semantic than tokens predicted by unregularized models. This
leads to a better representation that achieves state-of-the-art performance
while being more flexible than previous methods.
Comment: Project page at https://datencoder.github.io
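One way to read the regularization described above is as a penalty pulling each predicted embedding toward its nearest neighbor in the frozen CLIP token table. The sketch below shows that simplified reading only; it is not the authors' contrastive formulation, and nearest_token_regularizer and the toy tensors are our own stand-ins.

```python
# Hedged sketch of a nearest-token penalty (not the authors' implementation).
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_emb: torch.Tensor, clip_table: torch.Tensor) -> torch.Tensor:
    # pred_emb: (B, D) encoder-predicted concept embeddings.
    # clip_table: (V, D) frozen pretrained token embeddings.
    sims = F.normalize(pred_emb, dim=-1) @ F.normalize(clip_table, dim=-1).T  # (B, V)
    nearest = clip_table[sims.argmax(dim=-1)]                                 # (B, D)
    # Pull each prediction toward its nearest existing token (treated as constant).
    return (pred_emb - nearest.detach()).pow(2).sum(dim=-1).mean()

# Toy usage with random stand-ins for the encoder output and the token table.
reg = nearest_token_regularizer(torch.randn(4, 768), torch.randn(49408, 768))
```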
Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes
The advent of Generative Artificial Intelligence (GenAI) models, including
GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content
creation, enabling non-professionals to produce high-quality content across
various domains. This transformative technology has led to a surge of synthetic
content and sparked legal disputes over copyright infringement. To address
these challenges, this paper introduces a novel approach that leverages the
learning capacity of GenAI models for copyright legal analysis, demonstrated
with GPT2 and Stable Diffusion models. Copyright law distinguishes between
original expressions and generic ones (Scènes à faire), protecting the
former and permitting reproduction of the latter. However, this distinction has
historically been challenging to make consistently, leading to over-protection
of copyrighted works. GenAI offers an unprecedented opportunity to enhance this
legal analysis by revealing shared patterns in preexisting works. We propose a
data-driven approach to identify the genericity of works created by GenAI,
employing "data-driven bias" to assess the genericity of expressive
compositions. This approach aids in copyright scope determination by utilizing
the capabilities of GenAI to identify and prioritize expressive elements and
rank them according to their frequency in the model's dataset. The potential
implications of measuring expressive genericity for copyright law are profound.
Such scoring could assist courts in determining copyright scope during
litigation, inform the registration practices of Copyright Offices, allowing
registration of only highly original synthetic works, and help copyright owners
signal the value of their works and facilitate fairer licensing deals. More
generally, this approach offers valuable insights to policymakers grappling
with adapting copyright law to the challenges posed by the era of GenAI.
Comment: Presented at ACM CSLAW 202
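To give a flavor of how such a data-driven genericity score might be operationalized, the sketch below ranks candidate expressive elements by a likelihood proxy, with a higher score read as more generic. It is purely illustrative and is not the paper's pipeline: model_score is a placeholder for a real likelihood from a pretrained generative model (for example, an average per-token log-probability).

```python
# Illustrative sketch only: rank expressive elements by a genericity proxy.
def model_score(phrase: str) -> float:
    # Placeholder heuristic standing in for a pretrained model's likelihood:
    # phrases made of very common words score higher (read: more generic).
    common = {"a", "the", "of", "over", "sunset", "ocean", "city", "night"}
    words = phrase.lower().split()
    return sum(1.0 if w in common else -1.0 for w in words) / max(len(words), 1)

def rank_by_genericity(elements):
    # Most generic (highest proxy score) first; the more original, low-scoring
    # elements would be the candidates for stronger copyright protection.
    return sorted(elements, key=model_score, reverse=True)

print(rank_by_genericity([
    "a sunset over the ocean",
    "a clockwork whale towing a glass lighthouse through fog",
]))
```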
State of the Art on Diffusion Models for Visual Computing
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, as well as overview important aspects of these
generative AI tools, including personalization, conditioning, inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike.
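As a pointer to the basic mathematical concepts the report covers, here is a minimal sketch of the standard DDPM training objective: corrupt a clean sample with Gaussian noise at a random timestep and train a network to predict that noise. The schedule, shapes, and the toy model are illustrative; this is the textbook formulation, not any specific paper's code.

```python
# Minimal sketch of the standard denoising-diffusion (DDPM) training objective.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # cumulative alpha-bar_t

def ddpm_loss(model, x0):
    # x0: (B, C, H, W) clean images, assumed scaled to [-1, 1].
    t = torch.randint(0, T, (x0.shape[0],))
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)                # predict the added noise

# Toy usage: any noise-prediction network with signature model(x_t, t) -> eps_hat.
toy_model = lambda x, t: torch.zeros_like(x)
loss = ddpm_loss(toy_model, torch.randn(2, 3, 32, 32))
```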