Interpreting CLIP's Image Representation via Text-Based Decomposition
We investigate the CLIP image encoder by analyzing how individual model
components affect the final representation. We decompose the image
representation as a sum across individual image patches, model layers, and
attention heads, and use CLIP's text representation to interpret the summands.
Interpreting the attention heads, we characterize each head's role by
automatically finding text representations that span its output space, which
reveals property-specific roles for many heads (e.g. location or shape). Next,
interpreting the image patches, we uncover an emergent spatial localization
within CLIP. Finally, we use this understanding to remove spurious features
from CLIP and to create a strong zero-shot image segmenter. Our results
indicate that a scalable understanding of transformer models is attainable and
can be used to repair and improve models.
Comment: Project page and code:
https://yossigandelsman.github.io/clip_decomposition
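To make the decomposition concrete, the following is a minimal, hypothetical sketch (not the authors' code) of scoring per-attention-head contributions against text embeddings; the tensors `head_contributions` and `text_embeddings` are stand-ins for quantities that would be extracted from a real CLIP model.

```python
# Illustrative sketch only: the final image representation is treated as a sum of
# per-attention-head contributions, and each head is matched to candidate text
# descriptions. All tensors here are random placeholders, not real CLIP activations.
import torch

d = 512                      # embedding dimension (CLIP ViT-B/32 uses 512)
n_layers, n_heads = 12, 12   # hypothetical layer/head counts

# Hypothetical per-head contributions to the image embedding: [layers, heads, d].
head_contributions = torch.randn(n_layers, n_heads, d)

# The decomposition: the image representation is (approximately) the sum of the
# per-head terms; lower-level terms are omitted in this toy example.
image_embedding = head_contributions.sum(dim=(0, 1))

# Hypothetical text embeddings for candidate property descriptions.
descriptions = ["a photo taken outdoors", "a round object", "text in the image"]
text_embeddings = torch.randn(len(descriptions), d)
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)

# Score each head's contribution against each description; heads whose outputs
# align strongly with a small set of texts suggest a property-specific role.
scores = torch.einsum("lhd,td->lht", head_contributions, text_embeddings)
best_text_per_head = scores.argmax(dim=-1)      # [layers, heads]

# Scores for the full (summed) representation, for comparison.
full_scores = image_embedding @ text_embeddings.T
print(best_text_per_head.shape, full_scores)
```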
Synthesizing Moving People with 3D Control
In this paper, we present a diffusion model-based framework for animating
people from a single image, given a target 3D motion sequence. Our approach
has two core components: a) learning priors about invisible parts of the human
body and clothing, and b) rendering novel body poses with proper clothing and
texture. For the first part, we learn an in-filling diffusion model to
hallucinate unseen parts of a person given a single image. We train this model
in texture map space, which makes it more sample-efficient since it is
invariant to pose and viewpoint. Second, we develop a diffusion-based rendering
pipeline, which is controlled by 3D human poses. This produces realistic
renderings of novel poses of the person, including clothing, hair, and
plausible in-filling of unseen regions. This disentangled approach allows our
method to generate a sequence of images that is faithful to the target 3D
motion and visually faithful to the input image. In addition, the 3D control
allows rendering the person along various synthetic camera trajectories. Our
experiments show that, compared to prior methods, our approach is more robust
at generating prolonged motions and varied, challenging, and complex poses.
Please check our website for more details:
https://boyiliee.github.io/3DHM.github.io/
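A rough structural sketch of the two-stage design described above follows; every name in it (`to_texture_map`, `texture_infiller`, `pose_renderer`) is a hypothetical placeholder, not the authors' released API.

```python
# Structural sketch of the two-stage pipeline; the callables passed in are
# hypothetical stand-ins for the in-filling and rendering diffusion models.
def animate_person(image, target_poses, to_texture_map, texture_infiller, pose_renderer):
    """Animate a person from a single image along a target 3D pose sequence."""
    # Stage 1: hallucinate unseen parts in texture-map space. Working in this
    # pose- and viewpoint-invariant space is what makes in-filling sample-efficient.
    partial_texture = to_texture_map(image)
    full_texture = texture_infiller.sample(partial_texture)

    # Stage 2: render each target 3D pose with the completed texture. The 3D pose
    # is the control signal, so synthetic camera trajectories can be rendered the
    # same way by varying the pose/camera inputs.
    frames = [
        pose_renderer.sample(texture=full_texture, pose=pose, reference=image)
        for pose in target_poses
    ]
    return frames
```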
Idempotent Generative Network
We propose a new approach for generative modeling based on training a neural
network to be idempotent. An idempotent operator is one that can be applied
sequentially without changing the result beyond the initial application, namely
$f(f(z)) = f(z)$. The proposed model $f$ is trained to map a source distribution
(e.g., Gaussian noise) to a target distribution (e.g., realistic images) using
the following objectives: (1) Instances from the target distribution should map
to themselves, namely $f(x) = x$. We define the target manifold as the set of all
instances that $f$ maps to themselves. (2) Instances from the source
distribution should map onto the defined target manifold. This is achieved by
optimizing the idempotence term $f(f(z)) = f(z)$, which encourages the range of
$f(z)$ to be on the target manifold. Under ideal assumptions such a process
provably converges to the target distribution. This strategy results in a model
capable of generating an output in one step, maintaining a consistent latent
space, while also allowing sequential applications for refinement.
Additionally, we find that by processing inputs from both target and source
distributions, the model adeptly projects corrupted or modified data back to
the target manifold. This work is a first step towards a "global projector"
that enables projecting any input into a target data distribution.
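The two stated objectives can be read as a pair of training losses. Below is a minimal, assumption-laden sketch of that reading; it is not the authors' full training procedure, which involves additional regularization and gradient-handling details.

```python
# Sketch of the two losses implied by the abstract: reconstruction on real data
# (f(x) = x) and idempotence on mapped noise (f(f(z)) = f(z)).
import torch
import torch.nn.functional as F

def ign_losses(f, x_real, z_noise):
    # (1) Instances from the target distribution should map to themselves: f(x) = x.
    rec_loss = F.l1_loss(f(x_real), x_real)

    # (2) Mapped noise should land on the target manifold, i.e. on the set of
    # fixed points of f: encourage f(f(z)) = f(z).
    fz = f(z_noise)
    idem_loss = F.l1_loss(f(fz), fz)
    return rec_loss, idem_loss

# Usage: f can be any image-to-image network with matching input/output shapes.
f = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)   # toy stand-in for a generator
x_real = torch.randn(4, 3, 32, 32)
z_noise = torch.randn(4, 3, 32, 32)
rec, idem = ign_losses(f, x_real, z_noise)
(rec + idem).backward()
```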
MyStyle: A Personalized Generative Prior
We introduce MyStyle, a personalized deep generative prior trained with a few
shots of an individual. MyStyle allows one to reconstruct, enhance, and edit images
of a specific person, such that the output is faithful to the person's key
facial characteristics. Given a small reference set of portrait images of a
person (~100), we tune the weights of a pretrained StyleGAN face generator to
form a local, low-dimensional, personalized manifold in the latent space. We
show that this manifold constitutes a personalized region that spans latent
codes associated with diverse portrait images of the individual. Moreover, we
demonstrate that we obtain a personalized generative prior, and propose a
unified approach to apply it to various ill-posed image enhancement problems,
such as inpainting and super-resolution, as well as semantic editing. Using the
personalized generative prior we obtain outputs that exhibit high-fidelity to
the input images and are also faithful to the key facial characteristics of the
individual in the reference set. We demonstrate our method with fair-use images
of numerous widely recognizable individuals for whom we have prior knowledge,
enabling a qualitative evaluation of the expected outcome. We evaluate our
approach against few-shot baselines and show that our personalized prior,
quantitatively and qualitatively, outperforms state-of-the-art alternatives.
Comment: Project webpage: https://mystyle-personalized-prior.github.io/,
Video: https://youtu.be/QvOdQR3tlO
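The personalization step can be pictured as tuning a pretrained generator on the small reference set. The sketch below is a simplified, hypothetical reading of that step; `generator`, `invert`, and `reference_images` are placeholders, and the actual method additionally constrains where the reference images are anchored in latent space.

```python
# Simplified sketch: tune the weights of a pretrained face generator so that
# latent codes obtained by inversion reproduce the reference portraits, shaping
# a local, personalized region of the latent space. Not the authors' exact code.
import torch

def personalize(generator, invert, reference_images, steps=1000, lr=1e-4):
    # Anchor each reference portrait at a latent code via inversion (done once).
    anchors = [invert(generator, img) for img in reference_images]

    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for w, img in zip(anchors, reference_images):
            # Tune the generator so the anchored codes reproduce the person's images.
            loss = loss + torch.nn.functional.l1_loss(generator(w), img)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```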