Deep Cross-Modal Audio-Visual Generation
Cross-modal audio-visual perception has been a long-standing topic in
psychology and neurology, and various studies have discovered strong
correlations in human perception of auditory and visual stimuli. Despite works
in computational multimodal modeling, the problem of cross-modal audio-visual
generation has not been systematically studied in the literature. In this
paper, we make the first attempt to solve this cross-modal generation problem
leveraging the power of deep generative adversarial training. Specifically, we
use conditional generative adversarial networks to achieve cross-modal
audio-visual generation of musical performances. We explore different encoding
methods for audio and visual signals, and work on two scenarios:
instrument-oriented generation and pose-oriented generation. As the first to
explore this new problem, we construct two new datasets of paired images and
sounds of musical performances on different instruments. Our experiments using
both classification and human evaluations demonstrate that our model has the
ability to generate one modality, i.e., audio/visual, from the other modality,
i.e., visual/audio, to a good extent. Our experiments on various design choices
along with the datasets will facilitate future research in this new problem
space.
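A minimal sketch of the kind of conditional adversarial setup the abstract describes, for the audio-to-image direction; the layer sizes, the stand-in audio encoding, and the training step below are illustrative assumptions, not the authors' architecture.

```python
# Sketch of a conditional GAN for cross-modal generation (illustrative only;
# the encoders and layer sizes are hypothetical, not the paper's model).
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Maps noise plus an encoding of the source modality (e.g. audio) to an image."""
    def __init__(self, z_dim=100, cond_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class CondDiscriminator(nn.Module):
    """Scores whether an image is a real match for the conditioning signal."""
    def __init__(self, cond_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + cond_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, img, cond):
        return self.net(torch.cat([img, cond], dim=1))

# One adversarial step with the non-saturating GAN loss.
G, D = CondGenerator(), CondDiscriminator()
bce = nn.BCEWithLogitsLoss()
z = torch.randn(8, 100)
audio_code = torch.randn(8, 128)              # stands in for an audio encoding
real_img = torch.rand(8, 64 * 64 * 3) * 2 - 1
d_loss = bce(D(real_img, audio_code), torch.ones(8, 1)) + \
         bce(D(G(z, audio_code).detach(), audio_code), torch.zeros(8, 1))
g_loss = bce(D(G(z, audio_code), audio_code), torch.ones(8, 1))
```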
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method obtains significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
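A rough sketch of an attention-weighted pixel-wise loss in the spirit of the dynamically adjustable loss described above; the motion-based weight map used here is a stand-in assumption, not the paper's exact formulation.

```python
# Sketch of an attention-weighted pixel-wise loss: regions that change with the
# audio (e.g. the mouth) receive larger weights, so the network focuses there.
# The weight map is a simple stand-in, not the paper's exact dynamic scheme.
import torch

def attention_pixel_loss(pred, target, prev_target, eps=1e-6):
    """pred, target, prev_target: (B, C, H, W) video frames in [0, 1]."""
    # Motion-based attention: pixels that moved between consecutive
    # ground-truth frames are assumed to be audio-correlated.
    motion = (target - prev_target).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    attn = motion / (motion.amax(dim=(2, 3), keepdim=True) + eps)     # normalize to [0, 1]
    per_pixel = (pred - target).abs()                                 # L1 error per pixel
    return ((1.0 + attn) * per_pixel).mean()

pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
prev_target = torch.rand(2, 3, 64, 64)
loss = attention_pixel_loss(pred, target, prev_target)
```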
MyStyle++: A Controllable Personalized Generative Prior
In this paper, we propose an approach to obtain a personalized generative
prior with explicit control over a set of attributes. We build upon MyStyle, a
recently introduced method that tunes the weights of a pre-trained StyleGAN
face generator on a few images of an individual. This system allows
synthesizing, editing, and enhancing images of the target individual with high
fidelity to their facial features. However, MyStyle does not provide
precise control over the attributes of the generated images. We propose to
address this problem through a novel optimization system that organizes the
latent space in addition to tuning the generator. Our key contribution is to
formulate a loss that arranges the latent codes, corresponding to the input
images, along a set of specific directions according to their attributes. We
demonstrate that our approach, dubbed MyStyle++, is able to synthesize, edit,
and enhance images of an individual with great control over the attributes,
while preserving the unique facial characteristics of that individual.
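A small sketch of how a latent-arrangement objective of this kind could look: tuned latent codes are projected onto per-attribute directions and regressed toward their attribute values. The centering, normalization, and direction handling are assumptions for illustration, not the exact MyStyle++ loss.

```python
# Sketch of a latent-arrangement loss: each personalized latent code is pushed
# so that its projection onto a per-attribute direction matches the normalized
# attribute value of its image (illustrative assumptions throughout).
import torch
import torch.nn.functional as F

def arrangement_loss(latents, attr_values, directions):
    """latents: (N, D) tuned latent codes, one per input image.
    attr_values: (N, A) attribute labels, rescaled to [-1, 1].
    directions: (A, D) one learnable direction per attribute."""
    dirs = F.normalize(directions, dim=1)        # unit directions
    centered = latents - latents.mean(dim=0)     # arrange codes around the mean
    proj = centered @ dirs.t()                   # (N, A) coordinate along each direction
    return F.mse_loss(proj, attr_values)

latents = torch.randn(16, 512, requires_grad=True)
attr_values = torch.rand(16, 2) * 2 - 1          # e.g. smile and age, rescaled
directions = torch.randn(2, 512, requires_grad=True)
loss = arrangement_loss(latents, attr_values, directions)
loss.backward()
```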
DGMem: Learning Visual Navigation Policy without Any Labels by Dynamic Graph Memory
In recent years, learning-based approaches have demonstrated significant
promise in addressing intricate navigation tasks. Traditional methods for
training deep neural network navigation policies rely on meticulously designed
reward functions or extensive teleoperation datasets as navigation
demonstrations. However, the former is often confined to simulated
environments, and the latter demands substantial human labor, making it a
time-consuming process. Our vision is for robots to autonomously learn
navigation skills and adapt their behaviors to environmental changes without
any human intervention. In this work, we discuss the self-supervised navigation
problem and present Dynamic Graph Memory (DGMem), which facilitates training
only with on-board observations. With the help of DGMem, agents can actively
explore their surroundings, autonomously acquiring a comprehensive navigation
policy in a data-efficient manner without external feedback. Our method is
evaluated in photorealistic 3D indoor scenes, and empirical studies demonstrate
the effectiveness of DGMem.
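A compact sketch of a dynamic graph memory in the spirit of DGMem: observation embeddings become graph nodes, a new node is added only when the current embedding is sufficiently novel, and traversals are recorded as edges for later planning. The novelty threshold and the embedding source are assumptions, not DGMem's exact construction.

```python
# Sketch of a dynamic graph memory built purely from on-board observations.
import numpy as np
import networkx as nx

class DynamicGraphMemory:
    def __init__(self, novelty_threshold=0.5):
        self.graph = nx.Graph()
        self.embeddings = []                  # node id -> observation embedding
        self.threshold = novelty_threshold
        self.last_node = None

    def update(self, embedding):
        """Insert an observation embedding; return the node it maps to."""
        if self.embeddings:
            dists = np.linalg.norm(np.stack(self.embeddings) - embedding, axis=1)
            nearest = int(dists.argmin())
            if dists[nearest] < self.threshold:          # revisited a known place
                if self.last_node is not None and self.last_node != nearest:
                    self.graph.add_edge(self.last_node, nearest)
                self.last_node = nearest
                return nearest
        node = len(self.embeddings)                      # novel place: new node
        self.embeddings.append(embedding)
        self.graph.add_node(node)
        if self.last_node is not None:
            self.graph.add_edge(self.last_node, node)    # traversed transition
        self.last_node = node
        return node

memory = DynamicGraphMemory()
for _ in range(100):                                     # stream of fake observations
    memory.update(np.random.rand(16))
```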
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models
In this paper, we present a novel generative task: joint scene graph and image
generation. While previous works have explored image generation conditioned on
scene graphs or layouts, our task is distinctive and important as it involves
generating scene graphs themselves unconditionally from noise, enabling
efficient and interpretable control for image generation. Our task is
challenging, requiring the generation of plausible scene graphs with
heterogeneous attributes for nodes (objects) and edges (relations among
objects), including continuous object bounding boxes and discrete object and
relation categories. We introduce a novel diffusion model, DiffuseSG, that
jointly models the adjacency matrix along with heterogeneous node and edge
attributes. We explore various types of encodings for the categorical data,
relaxing it into a continuous space. With a graph transformer as the
denoiser, DiffuseSG successively denoises the scene graph representation in a
continuous space and discretizes the final representation to generate the clean
scene graph. Additionally, we introduce an IoU regularization to enhance the
empirical performance. Our model significantly outperforms existing methods in
scene graph generation on the Visual Genome and COCO-Stuff datasets, both on
standard and newly introduced metrics that better capture the problem
complexity. Moreover, we demonstrate the additional benefits of our model in
two downstream applications: 1) excelling in a series of scene graph completion
tasks, and 2) improving scene graph detection models by using extra training
samples generated from DiffuseSG.
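A brief sketch of the kind of joint continuous representation such a model can operate on: bounding boxes stay continuous while object and relation categories are relaxed to one-hot vectors, denoised in continuous space, and discretized at the end. The tensor layout and scaling below are assumptions, not DiffuseSG's exact encoding.

```python
# Sketch of packing a scene graph into a continuous tensor for diffusion and
# discretizing the denoised result back into a graph (illustrative layout).
import torch
import torch.nn.functional as F

NUM_OBJ_CLASSES, NUM_REL_CLASSES, N = 150, 50, 8   # Visual Genome-like sizes

def encode_scene_graph(boxes, obj_labels, rel_labels):
    """boxes: (N, 4) in [0, 1]; obj_labels: (N,); rel_labels: (N, N)."""
    obj_onehot = F.one_hot(obj_labels, NUM_OBJ_CLASSES).float() * 2 - 1   # relax to [-1, 1]
    rel_onehot = F.one_hot(rel_labels, NUM_REL_CLASSES).float() * 2 - 1
    nodes = torch.cat([boxes * 2 - 1, obj_onehot], dim=-1)                # (N, 4 + C_obj)
    return nodes, rel_onehot                                              # edges: (N, N, C_rel)

def decode_scene_graph(nodes, edges):
    """Discretize a denoised continuous representation back into a scene graph."""
    boxes = ((nodes[:, :4] + 1) / 2).clamp(0, 1)
    obj_labels = nodes[:, 4:].argmax(dim=-1)
    rel_labels = edges.argmax(dim=-1)
    return boxes, obj_labels, rel_labels

boxes = torch.rand(N, 4)
obj_labels = torch.randint(0, NUM_OBJ_CLASSES, (N,))
rel_labels = torch.randint(0, NUM_REL_CLASSES, (N, N))
nodes, edges = encode_scene_graph(boxes, obj_labels, rel_labels)
noisy_nodes = nodes + 0.1 * torch.randn_like(nodes)      # stands in for a diffusion step
print(decode_scene_graph(noisy_nodes, edges)[0].shape)
```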
A multi-modular tensegrity model of an actin stress fiber
Stress fibers are contractile bundles in the cytoskeleton that stabilize cell structure by exerting traction forces on the extracellular matrix. Individual stress fibers are molecular bundles composed of parallel actin and myosin filaments linked by various actin-binding proteins, which are organized end-on-end in a sarcomere-like pattern within an elongated three-dimensional network. While measurements of single stress fibers in living cells show that they behave like tensed viscoelastic fibers, precisely how this mechanical behavior arises from this complex supramolecular arrangement of protein components remains unclear. Here we show that computationally modeling a stress fiber as a multi-modular tensegrity network can predict several key behaviors of stress fibers measured in living cells, including viscoelastic retraction, fiber splaying after severing, non-uniform contraction, and elliptical strain of a puncture wound within the fiber. The tensegrity model can also explain how they simultaneously experience passive tension and generate active contraction forces; in contrast, a tensed cable net model predicts some, but not all, of these properties. Thus, tensegrity models may provide a useful link between molecular and cellular scale mechanical behaviors and represent a new handle on multi-scale modeling of living materials.
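A small sketch of the element-level rule that separates a tensegrity network from a plain cable net: cables transmit force only when stretched (tension-only), whereas struts also resist compression. The geometry and stiffness values below are illustrative only, not the paper's stress-fiber model.

```python
# Sketch of the axial force rule for tensegrity elements (illustrative values).
import numpy as np

def element_force(p_i, p_j, rest_length, stiffness, tension_only):
    """Axial force vector acting on node i from the element joining i and j."""
    vec = p_j - p_i
    length = np.linalg.norm(vec)
    stretch = length - rest_length
    if tension_only and stretch <= 0:              # a slack cable carries no load
        return np.zeros(3)
    return stiffness * stretch * vec / length      # pulls i toward j when stretched

# Two nodes joined by a pre-tensed cable (rest length shorter than the span).
p_i, p_j = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
print(element_force(p_i, p_j, rest_length=0.8, stiffness=10.0, tension_only=True))
# -> [2. 0. 0.]  (the cable pulls node i toward node j)
```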