Unsupervised Learning of Complex Articulated Kinematic Structures combining Motion and Skeleton Information
In this paper we present a novel framework for unsupervised kinematic structure learning of complex articulated objects from a single-view image sequence. In contrast to prior motion-based methods, which estimate relatively simple articulations, our method can generate arbitrarily complex kinematic structures with skeletal topology through a successive iterative merge process. The merge process is guided by a skeleton distance function derived from a novel method for generating object boundaries from sparse points. Our main contributions can be summarised as follows: (i) unsupervised learning of complex articulated kinematic structures by combining motion and skeleton information; (ii) an iterative fine-to-coarse merging strategy for adaptive motion segmentation and structure smoothing; (iii) skeleton estimation from sparse feature points; (iv) a new highly articulated object dataset with multi-stage complexity and ground truth. Our experiments show that the proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.
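To make the iterative merge concrete, the following is a minimal sketch of a fine-to-coarse merge loop over motion segments (Python/numpy). The similarity test, thresholds, and skeleton-distance callback are assumptions for illustration, not the paper's exact procedure.

    # Sketch: iteratively merge motion segments whose average trajectories
    # are similar AND which are close under a skeleton distance function.
    import numpy as np

    def merge_segments(trajectories, labels, skel_dist,
                       motion_thresh=0.1, skel_thresh=2.0):
        # trajectories: (N, T, 2) feature-point tracks; labels: (N,) segment ids
        # skel_dist(a, b): hypothetical distance between segments along the skeleton
        labels = labels.copy()
        merged = True
        while merged:                      # repeat until no pair can be merged
            merged = False
            ids = np.unique(labels)
            means = {i: trajectories[labels == i].mean(axis=0) for i in ids}
            for k, a in enumerate(ids):
                for b in ids[k + 1:]:
                    close_motion = np.linalg.norm(means[a] - means[b]) < motion_thresh
                    if close_motion and skel_dist(a, b) < skel_thresh:
                        labels[labels == b] = a    # merge segment b into a
                        merged = True
                        break
                if merged:
                    break
        return labels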
GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently
Recent breakthroughs in the field of language-guided image generation have
yielded impressive achievements, enabling the creation of high-quality and
diverse images based on user instructions. Although the synthesis performance
is fascinating, one significant limitation of current image generation models
is their insufficient ability to generate coherent text within images,
particularly for complex glyph structures like Chinese characters. To address
this problem, we introduce GlyphDraw, a general learning framework aiming at
endowing image generation models with the capacity to generate images embedded
with coherent text. To the best of our knowledge, this is the first work in the
field of image synthesis to address the generation of Chinese characters. We
first adopt OCR techniques to collect images containing Chinese characters as
training samples, extracting the text and its locations as auxiliary
information, and carefully design the construction strategy for the image-text
dataset. We then build our model on a diffusion-based image generator,
modifying the network structure so that the model learns to draw Chinese
characters with the help of glyph and position information. Furthermore, we
preserve the model's open-domain image synthesis capability by preventing
catastrophic forgetting through a variety of training techniques.
Extensive qualitative and quantitative experiments demonstrate that our method
not only produces accurate Chinese characters as specified in prompts, but also naturally
blends the generated text into the background. Please refer to
https://1073521013.github.io/glyph-draw.github.io.
Comment: 24 pages, 5 figures
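As a rough illustration of the glyph and position conditioning the abstract describes, here is a minimal PyTorch sketch that fuses a rendered glyph image and a binary position mask with the diffusion latent by channel concatenation. The module, channel sizes, and fusion scheme are assumptions, not the authors' released network.

    import torch
    import torch.nn as nn

    class GlyphConditionedBlock(nn.Module):
        def __init__(self, latent_ch=4, glyph_ch=1, mask_ch=1, hidden=64):
            super().__init__()
            # Fuse noisy latent + glyph rendering + position mask.
            self.fuse = nn.Conv2d(latent_ch + glyph_ch + mask_ch, hidden, 3, padding=1)
            self.out = nn.Conv2d(hidden, latent_ch, 3, padding=1)

        def forward(self, z_t, glyph, pos_mask):
            h = torch.cat([z_t, glyph, pos_mask], dim=1)
            return self.out(torch.relu(self.fuse(h)))   # predicted noise residual

    z = torch.randn(1, 4, 64, 64)        # noisy diffusion latent
    glyph = torch.randn(1, 1, 64, 64)    # rendered target characters
    mask = torch.zeros(1, 1, 64, 64)     # where the text should appear
    mask[:, :, 20:40, 10:50] = 1.0
    eps_hat = GlyphConditionedBlock()(z, glyph, mask)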
A Structure-Guided Diffusion Model for Large-Hole Image Completion
Image completion techniques have made significant progress in filling missing
regions (i.e., holes) in images. However, large-hole completion remains
challenging due to limited structural information. In this paper, we address
this problem by integrating explicit structural guidance into diffusion-based
image completion, forming our structure-guided diffusion model (SGDM). It
consists of two cascaded diffusion probabilistic models: structure and texture
generators. The structure generator generates an edge image representing
plausible structures within the holes, which is then used for guiding the
texture generation process. To train both generators jointly, we devise a novel
strategy that leverages optimal Bayesian denoising, which denoises the output
of the structure generator in a single step and thus allows backpropagation.
Our diffusion-based approach enables a diversity of plausible completions,
while the editable edges allow for editing parts of an image. Our experiments
on natural scene (Places) and face (CelebA-HQ) datasets demonstrate that our
method achieves superior or comparable visual quality compared to
state-of-the-art approaches. The code is available for research purposes at
https://github.com/UdonDa/Structure_Guided_Diffusion_Model.
Comment: BMVC 2023. Code: https://github.com/UdonDa/Structure_Guided_Diffusion_Model
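The joint-training trick hinges on recovering a clean edge estimate from a noisy one in a single differentiable step. A minimal sketch of such a one-step posterior-mean estimate in standard DDPM notation (Tweedie's formula; the function name is ours):

    import torch

    def denoise_one_step(x_t, eps_pred, alpha_bar_t):
        # x0_hat = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
        # Fully differentiable, so gradients from the texture generator's
        # loss can flow back into the structure generator.
        a_bar = torch.as_tensor(alpha_bar_t, dtype=x_t.dtype)
        return (x_t - torch.sqrt(1.0 - a_bar) * eps_pred) / torch.sqrt(a_bar)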
GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
Large language models have made significant strides in natural language
processing, paving the way for innovative applications including molecular
representation and generation. However, most existing single-modality
approaches cannot capture the abundant and complex information in molecular
data. Here, we introduce GIT-Mol, a multi-modal large language model that
integrates the structure Graph, Image, and Text information, including the
Simplified Molecular Input Line Entry System (SMILES) and molecular captions.
To facilitate the integration of multi-modal molecular data, we propose
GIT-Former, a novel architecture capable of mapping all modalities into a
unified latent space. Our study develops an innovative any-to-language
molecular translation strategy and achieves a 10%-15% improvement in molecular
captioning, a 5%-10% accuracy increase in property prediction, and a 20% boost
in molecule generation validity compared to baseline or single-modality models.
Comment: 16 pages, 5 figures
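The abstract's claim that GIT-Former maps every modality into one latent space follows a now-common pattern: a fixed set of learnable query tokens cross-attends to any encoder's output tokens. A minimal PyTorch sketch of that pattern (dimensions and layer choices are assumptions, not the paper's architecture):

    import torch
    import torch.nn as nn

    class AnyModalityMapper(nn.Module):
        def __init__(self, dim=256, n_queries=32, n_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(1, n_queries, dim))
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, tokens):
            # tokens: (B, L, dim) from a graph, image, or text encoder
            q = self.queries.expand(tokens.size(0), -1, -1)
            latent, _ = self.attn(q, tokens, tokens)
            return latent    # (B, n_queries, dim): shared space for all modalities

    smiles_tokens = torch.randn(2, 50, 256)   # e.g. embedded SMILES tokens
    z = AnyModalityMapper()(smiles_tokens)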
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method obtains significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
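A minimal sketch of a dynamically weighted pixel-wise loss in the spirit of the abstract: per-pixel L1 error reweighted by an attention map over audiovisual-correlated regions (e.g., the mouth). How the attention map is produced, and the exact weighting, are assumptions.

    import torch

    def dynamic_pixel_loss(pred, target, attn, base=1.0):
        # attn in [0, 1], same spatial size as the frames; higher values mark
        # regions whose motion correlates with speech. 'base' keeps every
        # pixel weakly supervised instead of being zeroed out.
        weight = base + attn
        return (weight * (pred - target).abs()).mean()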
Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance
Image outpainting generates visually plausible content regardless of its
authenticity, which makes it unreliable to apply in practice. We therefore
propose a reliable image outpainting task that introduces sparse depth from
LiDAR to extrapolate authentic RGB scenes. The large field of view of LiDAR
allows it to serve data enhancement and further multimodal tasks.
Concretely, we propose a Depth-Guided Outpainting Network to model different
feature representations of two modalities and learn the structure-aware
cross-modal fusion. Two components are designed: 1) the Multimodal Learning
Module produces unique depth and RGB feature representations from the
perspectives of different modal characteristics. 2) The Depth Guidance Fusion
Module leverages the complete depth modality to guide the establishment of RGB
contents by progressive multimodal feature fusion. Furthermore, we specially
design an additional constraint strategy consisting of Cross-modal Loss and
Edge Loss to enhance ambiguous contours and expedite reliable content
generation. Extensive experiments on KITTI and Waymo datasets demonstrate our
superiority over the state-of-the-art method, both quantitatively and qualitatively.
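One plausible reading of the Depth Guidance Fusion Module is a gated fusion in which depth features modulate RGB features before merging. A minimal sketch (layer choices are our assumptions, not the paper's exact module):

    import torch
    import torch.nn as nn

    class DepthGuidedFusion(nn.Module):
        def __init__(self, ch=64):
            super().__init__()
            # Depth branch predicts a spatial gate in [0, 1].
            self.gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
            self.merge = nn.Conv2d(2 * ch, ch, 1)

        def forward(self, rgb_feat, depth_feat):
            g = self.gate(depth_feat)      # emphasize depth-confident regions
            return self.merge(torch.cat([rgb_feat * g, depth_feat], dim=1))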
Unsupervised Adversarial Depth Estimation using Cycled Generative Networks
While recent deep monocular depth estimation approaches based on supervised
regression have achieved remarkable performance, costly ground truth
annotations are required during training. To cope with this issue, in this
paper we present a novel unsupervised deep learning approach for predicting
depth maps and show that the depth estimation task can be effectively tackled
within an adversarial learning framework. Specifically, we propose a deep
generative network that learns to predict the correspondence field, i.e., the
disparity map, between two image views in a calibrated stereo camera setting.
The proposed architecture consists of two generative sub-networks jointly
trained with adversarial learning for reconstructing the disparity map and
organized in a cycle so as to provide mutual constraints and supervision to
each other. Extensive experiments on the publicly available datasets KITTI and
Cityscapes demonstrate the effectiveness of the proposed model and results
competitive with state-of-the-art methods. The code and trained model are
available at https://github.com/andrea-pilzer/unsup-stereo-depthGAN.
Comment: To appear in 3DV 2018. Code is available on GitHub.
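The reconstruction signal behind such stereo self-supervision is view synthesis: warp one view into the other using the predicted disparity and penalize the photometric error. A minimal sketch using standard PyTorch grid_sample (ours, not the authors' code):

    import torch
    import torch.nn.functional as F

    def warp_right_to_left(right, disp):
        # right: (B, C, H, W) image; disp: (B, 1, H, W) disparity in pixels
        b, _, h, w = right.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        xs = xs[None, None].float() - disp            # shift sampling by disparity
        gx = 2.0 * xs / (w - 1) - 1.0                 # normalize x to [-1, 1]
        gy = (2.0 * ys.float() / (h - 1) - 1.0).expand_as(gx)
        grid = torch.stack([gx, gy], dim=-1).squeeze(1)   # (B, H, W, 2)
        return F.grid_sample(right, grid, align_corners=True)

    left = torch.rand(1, 3, 8, 16)
    right = torch.rand(1, 3, 8, 16)
    disp = torch.zeros(1, 1, 8, 16)                   # zero disparity = identity warp
    photometric = (warp_right_to_left(right, disp) - left).abs().mean()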