566 research outputs found

Deep Cross-Modal Audio-Visual Generation

Authors: Chen Lele; Duan Zhiyao; Srivastava Sudhanshu; Xu Chenliang
Publication date: 01/01/2017

Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluations demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.

arXiv.org e-Print Archive

Crossref

Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss

Authors: Chen Lele; Duan Zhiyao; Maddox Ross K.; Xu Chenliang
Publication date: 09/05/2019

We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.

arXiv.org e-Print Archive

Crossref

MyStyle++: A Controllable Personalized Generative Prior

Authors: Chen Lele; Kalantari Nima; Xu Yi; Zeng Libing
Publication date: 07/06/2023

In this paper, we propose an approach to obtain a personalized generative prior with explicit control over a set of attributes. We build upon MyStyle, a recently introduced method, that tunes the weights of a pre-trained StyleGAN face generator on a few images of an individual. This system allows synthesizing, editing, and enhancing images of the target individual with high fidelity to their facial features. However, MyStyle does not demonstrate precise control over the attributes of the generated images. We propose to address this problem through a novel optimization system that organizes the latent space in addition to tuning the generator. Our key contribution is to formulate a loss that arranges the latent codes, corresponding to the input images, along a set of specific directions according to their attributes. We demonstrate that our approach, dubbed MyStyle++, is able to synthesize, edit, and enhance images of an individual with great control over the attributes, while preserving the unique facial characteristics of that individual.

arXiv.org e-Print Archive

DGMem: Learning Visual Navigation Policy without Any Labels by Dynamic Graph Memory

Authors: Cai Wenzhe; Cheng Guangran; Sun Changyin; Wang Teng; Xu Lele
Publication date: 30/11/2023

In recent years, learning-based approaches have demonstrated significant promise in addressing intricate navigation tasks. Traditional methods for training deep neural network navigation policies rely on meticulously designed reward functions or extensive teleoperation datasets as navigation demonstrations. However, the former is often confined to simulated environments, and the latter demands substantial human labor, making it a time-consuming process. Our vision is for robots to autonomously learn navigation skills and adapt their behaviors to environmental changes without any human intervention. In this work, we discuss the self-supervised navigation problem and present Dynamic Graph Memory (DGMem), which facilitates training only with on-board observations. With the help of DGMem, agents can actively explore their surroundings, autonomously acquiring a comprehensive navigation policy in a data-efficient manner without external feedback. Our method is evaluated in photorealistic 3D indoor scenes, and empirical studies demonstrate the effectiveness of DGMem. Comment: 8 pages, 6 figures

arXiv.org e-Print Archive

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

Authors: Liao Renjie; Sigal Leonid; Wang Lele; Xu Bicheng; Yan Qi
Publication date: 02/01/2024

In this paper, we present a novel generative task: joint scene graph - image generation. While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation. Our task is challenging, requiring the generation of plausible scene graphs with heterogeneous attributes for nodes (objects) and edges (relations among objects), including continuous object bounding boxes and discrete object and relation categories. We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes. We explore various types of encodings for the categorical data, relaxing it into a continuous space. With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph. Additionally, we introduce an IoU regularization to enhance the empirical performance. Our model significantly outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, both on standard and newly introduced metrics that better capture the problem complexity. Moreover, we demonstrate the additional benefits of our model in two downstream applications: 1) excelling in a series of scene graph completion tasks, and 2) improving scene graph detection models by using extra training samples generated from DiffuseSG.

arXiv.org e-Print Archive

A multi-modular tensegrity model of an actin stress fiber

Authors: Ingber Donald Elliot; Kumar Sanjay; Lele Tanmay; Luo Yaozhi; Xu Xiang
Publication venue: Elsevier BV
Publication date: 24/03/2014

Stress fibers are contractile bundles in the cytoskeleton that stabilize cell structure by exerting traction forces on the extracellular matrix. Individual stress fibers are molecular bundles composed of parallel actin and myosin filaments linked by various actin-binding proteins, which are organized end-on-end in a sarcomere-like pattern within an elongated three-dimensional network. While measurements of single stress fibers in living cells show that they behave like tensed viscoelastic fibers, precisely how this mechanical behavior arises from this complex supramolecular arrangement of protein components remains unclear. Here we show that computationally modeling a stress fiber as a multi-modular tensegrity network can predict several key behaviors of stress fibers measured in living cells, including viscoelastic retraction, fiber splaying after severing, non-uniform contraction, and elliptical strain of a puncture wound within the fiber. The tensegrity model can also explain how they simultaneously experience passive tension and generate active contraction forces; in contrast, a tensed cable net model predicts some, but not all, of these properties. Thus, tensegrity models may provide a useful link between molecular and cellular scale mechanical behaviors and represent a new handle on multi-scale modeling of living materials.

Harvard University - DASH