AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation
AI illustrator aims to automatically design visually appealing images for
books to provoke rich thoughts and emotions. To achieve this goal, we propose a
framework for translating raw descriptions with complex semantics into
semantically corresponding images. The main challenge lies in the complexity of
the semantics of raw descriptions, which may be hard to visualize (e.g.,
"gloomy" or "Asian") and which existing methods struggle to handle. To
address this issue, we propose a Prompt-based
Cross-Modal Generation Framework (PCM-Frame) to leverage two powerful
pre-trained models, including CLIP and StyleGAN. Our framework consists of two
components: a projection module from Text Embeddings to Image Embeddings based
on prompts, and an adapted image generation module built on StyleGAN which
takes Image Embeddings as inputs and is trained by combined semantic
consistency losses. To bridge the gap between realistic images and illustration
designs, we further adopt a stylization model as post-processing in our
framework for better visual effects. Benefiting from the pre-trained models,
our method can handle complex descriptions and does not require external paired
data for training. Furthermore, we have built a benchmark that consists of 200
raw descriptions. We conduct a user study to demonstrate our superiority over
competing methods on complicated texts. We release our code at
https://github.com/researchmm/AI_Illustrator
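To make the projection idea concrete, here is a minimal sketch (not the authors' released code) of how a CLIP text embedding could be mapped into the CLIP image-embedding space with an MLP trained under a cosine-based semantic consistency loss; the class and function names below are illustrative assumptions.

```python
# Minimal sketch of the text-to-image-embedding projection idea in PCM-Frame.
# NOT the authors' released code; module and loss are illustrative assumptions
# about how a CLIP text embedding could be mapped into the CLIP image-embedding
# space before feeding a StyleGAN-based decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToImageEmbeddingProjector(nn.Module):
    """Hypothetical MLP that projects a CLIP text embedding (e.g. 512-d)
    into the CLIP image-embedding space."""
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Keep outputs on the unit hypersphere, matching CLIP's normalized space.
        return F.normalize(self.net(text_emb), dim=-1)

def semantic_consistency_loss(pred_img_emb, target_img_emb):
    """Cosine-similarity loss between projected and reference image embeddings
    (one plausible instance of a semantic consistency loss)."""
    return 1.0 - F.cosine_similarity(pred_img_emb, target_img_emb, dim=-1).mean()

if __name__ == "__main__":
    projector = TextToImageEmbeddingProjector()
    text_emb = F.normalize(torch.randn(8, 512), dim=-1)  # stand-in for CLIP text features
    img_emb = F.normalize(torch.randn(8, 512), dim=-1)   # stand-in for CLIP image features
    loss = semantic_consistency_loss(projector(text_emb), img_emb)
    loss.backward()
    print(float(loss))
```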
Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution
Diffusion models, as a kind of powerful generative model, have given
impressive results on image super-resolution (SR) tasks. However, due to the
randomness introduced in the reverse process of diffusion models, the
performance of diffusion-based SR models fluctuates from one sampling run to
the next, especially for samplers with few resampling steps. This inherent
randomness makes the results unstable and hard to reproduce, so users cannot
guarantee the quality of SR outputs.
However, our work takes this randomness as an opportunity: fully analyzing and
leveraging it leads to the construction of an effective plug-and-play sampling
method that can benefit a range of diffusion-based SR
methods. Specifically, we propose to steadily sample high-quality SR images
from pretrained diffusion-based SR models by solving diffusion ordinary
differential equations (diffusion ODEs) with optimal boundary conditions (BCs)
and we analyze the relationship between the choice of BC and the
corresponding SR result. Our analysis shows how to obtain an
approximately optimal BC via an efficient exploration of the whole space. The
quality of SR results sampled by the proposed method with fewer steps
surpasses that of results sampled by current stochastic methods from the same
pretrained diffusion-based SR model, which means that our sampling method
"boosts" current diffusion-based SR models without any additional training.
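As a rough illustration of the boundary-condition idea, the sketch below searches a small pool of candidate terminal latents for a deterministic diffusion-ODE sampler and keeps the one that scores best on a reference set; `ode_sample` and `quality_metric` are hypothetical placeholders for a pretrained SR sampler and an image-quality measure, not the paper's API.

```python
# Hedged sketch of the "optimal boundary condition" idea: treat the terminal
# latent x_T of a deterministic diffusion-ODE sampler as a boundary condition,
# score a small pool of candidate BCs on a reference set, and reuse the best
# one for all subsequent sampling. `ode_sample` and `quality_metric` are
# placeholders (assumptions), not the paper's released interface.
import torch

def find_approx_optimal_bc(ode_sample, quality_metric, lr_refs, hr_refs,
                           latent_shape, num_candidates: int = 16, seed: int = 0):
    gen = torch.Generator().manual_seed(seed)
    best_bc, best_score = None, float("-inf")
    for _ in range(num_candidates):
        bc = torch.randn(latent_shape, generator=gen)        # candidate boundary condition x_T
        scores = [quality_metric(ode_sample(lr, bc), hr)      # deterministic ODE solve per image
                  for lr, hr in zip(lr_refs, hr_refs)]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_bc, best_score = bc, score
    return best_bc, best_score

# Usage (with real models):
#   bc, _ = find_approx_optimal_bc(sr_ode_sampler, psnr, lr_list, hr_list, (3, 64, 64))
# Every new LR input is then super-resolved as sr_ode_sampler(lr, bc), with no randomness.
```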
OFAR: A Multimodal Evidence Retrieval Framework for Illegal Live-streaming Identification
Illegal live-streaming identification, which aims to help live-streaming
platforms promptly recognize illegal behaviors in a live stream,
such as selling rare and endangered animals, plays a crucial role in
keeping the online environment clean. Traditionally, the live-streaming platform
needs to employ professionals to manually identify potentially illegal
live-streaming. Specifically, a professional must search a large-scale
knowledge database for related evidence to evaluate whether a given
live-streaming clip contains illegal behavior, which is time-consuming and
laborious. To address this issue, in this work, we propose a multimodal
evidence retrieval system, named OFAR, to facilitate illegal live-streaming
identification. OFAR consists of three modules: Query Encoder, Document
Encoder, and MaxSim-based Contrastive Late Intersection. Both the query encoder
and the document encoder are implemented with the advanced OFA encoder, which is
pretrained on a large-scale multimodal dataset. In the last module, we
introduce contrastive learning on the basis of the MaxSim-based late
intersection, to enhance the model's ability to match queries with documents. The
proposed framework achieves significant improvement on our industrial dataset
TaoLive, demonstrating the effectiveness of our scheme.
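For intuition, the sketch below shows MaxSim-style late-interaction scoring between token-level query and document embeddings, together with an in-batch contrastive loss; the tensors stand in for OFA encoder outputs, and the loss is an illustrative assumption rather than the authors' exact recipe.

```python
# Minimal sketch of MaxSim-style late-interaction scoring between a query and a
# document. The random tensors stand in for token-level OFA encoder outputs;
# the in-batch contrastive loss is an assumption about how contrastive learning
# on top of MaxSim could look, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def maxsim_score(query_tok: torch.Tensor, doc_tok: torch.Tensor) -> torch.Tensor:
    """query_tok: (Lq, D), doc_tok: (Ld, D). For each query token, take its max
    cosine similarity over document tokens, then sum over query tokens."""
    q = F.normalize(query_tok, dim=-1)
    d = F.normalize(doc_tok, dim=-1)
    sim = q @ d.t()                  # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()

def in_batch_contrastive_loss(queries, docs, temperature: float = 0.05):
    """queries/docs: lists of token-embedding tensors, aligned by index.
    The matching (query_i, doc_i) pair is the positive; other docs are negatives."""
    scores = torch.stack([torch.stack([maxsim_score(q, d) for d in docs]) for q in queries])
    labels = torch.arange(len(queries))
    return F.cross_entropy(scores / temperature, labels)

if __name__ == "__main__":
    queries = [torch.randn(12, 256) for _ in range(4)]
    docs = [torch.randn(64, 256) for _ in range(4)]
    print(float(in_batch_contrastive_loss(queries, docs)))
```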
XTQA: Span-Level Explanations of the Textbook Question Answering
Textbook Question Answering (TQA) is the task of answering a
diagram or non-diagram question given a large multi-modal context consisting of
abundant essays and diagrams. We argue that the explainability of this task
should treat students as a key audience. To address this issue,
we devise a novel architecture towards span-level eXplanations of the TQA
(XTQA) based on our proposed coarse-to-fine grained algorithm, which can
provide students not only with the answers but also with the span-level
evidence for choosing them. The algorithm first coarsely selects the top
paragraphs relevant to a question using the TF-IDF method, and then finely
selects the top evidence spans from all candidate spans within these paragraphs
by computing the information gain of each span with respect to the question.
Experimental results show that
XTQA significantly improves the state-of-the-art performance compared with
baselines. The source code is available at
https://github.com/keep-smile-001/opentqa
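A hedged sketch of the coarse-to-fine selection follows: TF-IDF ranks paragraphs against the question, then fixed-length spans inside the kept paragraphs are ranked. The span score here is a simple TF-IDF cosine proxy; the paper itself ranks spans by information gain, which is not reproduced.

```python
# Hedged sketch of coarse-to-fine evidence selection in the spirit of XTQA.
# Coarse step: TF-IDF similarity picks top paragraphs (as stated in the abstract).
# Fine step: sliding-window spans inside those paragraphs are ranked by a
# TF-IDF cosine proxy, standing in for the paper's information-gain criterion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_evidence(question: str, paragraphs: list[str],
                    top_paras: int = 3, span_len: int = 12, top_spans: int = 5):
    vec = TfidfVectorizer().fit(paragraphs + [question])
    q = vec.transform([question])
    # Coarse step: rank paragraphs by TF-IDF cosine similarity to the question.
    para_scores = cosine_similarity(vec.transform(paragraphs), q).ravel()
    keep = para_scores.argsort()[::-1][:top_paras]
    # Fine step: enumerate fixed-length word windows within the kept paragraphs.
    spans = []
    for i in keep:
        words = paragraphs[i].split()
        for s in range(0, max(1, len(words) - span_len + 1)):
            spans.append(" ".join(words[s:s + span_len]))
    span_scores = cosine_similarity(vec.transform(spans), q).ravel()
    order = span_scores.argsort()[::-1][:top_spans]
    return [spans[j] for j in order]

if __name__ == "__main__":
    paras = ["Photosynthesis converts light energy into chemical energy in plants.",
             "The water cycle describes how water evaporates and condenses.",
             "Chlorophyll in the chloroplast absorbs light for photosynthesis."]
    print(select_evidence("What absorbs light during photosynthesis?", paras, span_len=6))
```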
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
We propose the first joint audio-video generation framework that delivers
engaging watching and listening experiences simultaneously, towards
high-quality realistic videos. To generate joint audio-video pairs, we propose
a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled
denoising autoencoders. In contrast to existing single-modal diffusion models,
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising
process by design. Two subnets for audio and video learn to gradually generate
aligned audio-video pairs from Gaussian noise. To ensure semantic consistency
across modalities, we propose a novel random-shift based attention block
bridging the two subnets, which enables efficient cross-modal alignment
and thus lets the audio and video reinforce each other's fidelity. Extensive
experiments show superior results in unconditional audio-video generation, and
zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve
the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of
10k votes further demonstrate dominant preferences for our model. The code and
pre-trained models can be downloaded at
https://github.com/researchmm/MM-Diffusion
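To illustrate the random-shift attention idea, the sketch below lets video tokens attend to a single randomly shifted window of audio tokens instead of the full audio sequence; the shapes, window size, and single-head attention are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of random-shift cross-modal attention: rather than full
# attention between long audio and video token sequences, attend within a
# window taken at a random offset, keeping cross-modal alignment cheap.
import torch
import torch.nn.functional as F

def random_shift_cross_attention(video_tok, audio_tok, window: int = 16, generator=None):
    """video_tok: (Tv, D) queries, audio_tok: (Ta, D) keys/values.
    Each video token attends to one randomly shifted audio window."""
    D = video_tok.shape[-1]
    Ta = audio_tok.shape[0]
    shift = int(torch.randint(0, max(1, Ta - window + 1), (1,), generator=generator))
    keys = audio_tok[shift:shift + window]                      # (W, D) shifted audio window
    attn = F.softmax(video_tok @ keys.t() / D ** 0.5, dim=-1)   # (Tv, W) attention weights
    return attn @ keys                                          # (Tv, D) audio-conditioned features

if __name__ == "__main__":
    out = random_shift_cross_attention(torch.randn(8, 64), torch.randn(128, 64))
    print(out.shape)  # torch.Size([8, 64])
```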