
    AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

    AI illustrator aims to automatically design visually appealing images for books to provoke rich thoughts and emotions. To achieve this goal, we propose a framework for translating raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the complexity of the semantics of raw descriptions, which may be hard to visualize (e.g., "gloomy" or "Asian") and which existing methods struggle to handle. To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) that leverages two powerful pre-trained models, CLIP and StyleGAN. Our framework consists of two components: a prompt-based projection module from text embeddings to image embeddings, and an adapted image generation module built on StyleGAN that takes image embeddings as input and is trained with combined semantic-consistency losses. To bridge the gap between realistic images and illustration designs, we further adopt a stylization model as post-processing for better visual effects. Benefiting from the pre-trained models, our method handles complex descriptions and requires no external paired data for training. Furthermore, we have built a benchmark of 200 raw descriptions, and a user study demonstrates our superiority over competing methods on complicated texts. We release our code at https://github.com/researchmm/AI_Illustrator.
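    As a rough illustration of the projection module described above, here is a minimal PyTorch sketch that maps CLIP-style text embeddings into the image-embedding space and trains with a cosine semantic-consistency loss. The class name, dimensions, and loss are illustrative assumptions, not the released PCM-Frame code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToImageProjector(nn.Module):
    # Hypothetical MLP mapping CLIP text embeddings into the CLIP
    # image-embedding space, in the spirit of PCM-Frame's projection
    # module (width and depth are assumptions, not the paper's values).
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        img_emb = self.net(text_emb)
        # CLIP embeddings live on the unit hypersphere, so re-normalize.
        return img_emb / img_emb.norm(dim=-1, keepdim=True)

# Semantic-consistency training signal: pull the projected embedding
# toward a reference image embedding in cosine distance.
proj = TextToImageProjector()
text_emb = torch.randn(4, 512)                 # stand-in for CLIP text features
ref = F.normalize(torch.randn(4, 512), dim=-1)  # stand-in for CLIP image features
loss = 1 - torch.cosine_similarity(proj(text_emb), ref).mean()
loss.backward()
```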

    Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

    Diffusion models, as a powerful class of generative models, have produced impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performance of diffusion-based SR models fluctuates from one sampling run to the next, especially for samplers with few resampled steps. This inherent randomness leads to instability and makes it hard for users to guarantee the quality of SR results. Our work, however, treats this randomness as an opportunity: fully analyzing and leveraging it leads to an effective plug-and-play sampling method with the potential to benefit a range of diffusion-based SR methods. More concretely, we propose to steadily sample high-quality SR images from pretrained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs), and we analyze how the choice of BC relates to the corresponding SR result. Our analysis shows how to obtain an approximately optimal BC via an efficient exploration of the whole space. SR results sampled by the proposed method with fewer steps outperform those sampled with randomness by current methods from the same pretrained diffusion-based SR model, meaning that our sampling method "boosts" current diffusion-based SR models without any additional training.
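    The key observation here, that solving the diffusion ODE makes the output a deterministic function of its boundary condition, can be sketched in a few lines. Below is a hedged Python illustration, not the authors' code: a deterministic first-order DDIM-style ODE step, plus a brute-force search over candidate boundary conditions; `sample_fn` and `quality` are hypothetical placeholders.

```python
import torch

def ddim_ode_step(x_t, eps_pred, alpha_t, alpha_prev):
    # One deterministic (eta = 0) DDIM step, a first-order solver for the
    # probability-flow ODE; alpha_t and alpha_prev are cumulative-alpha
    # scalars as 0-dim tensors. With no injected noise, the trajectory,
    # and hence the SR output, depends only on the boundary condition x_T.
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps_pred) / alpha_t.sqrt()
    return alpha_prev.sqrt() * x0_pred + (1 - alpha_prev).sqrt() * eps_pred

def pick_boundary_condition(candidates, sample_fn, quality):
    # Hypothetical search: run the deterministic sampler from each
    # candidate x_T and keep the boundary condition whose SR output
    # scores best under a chosen quality metric.
    scored = [(quality(sample_fn(x_T)), x_T) for x_T in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

    Because the sampler is deterministic, an approximately optimal boundary condition found once can be reused for subsequent inputs, which is what makes the method plug-and-play.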

    OFAR: A Multimodal Evidence Retrieval Framework for Illegal Live-streaming Identification

    Illegal live-streaming identification, which aims to help live-streaming platforms immediately recognize illegal behaviors in live-streaming, such as selling precious and endangered animals, plays a crucial role in purifying the network environment. Traditionally, a live-streaming platform must employ professionals to manually identify potentially illegal live-streaming. Specifically, a professional searches a large-scale knowledge database for related evidence to evaluate whether a given live-streaming clip contains illegal behavior, which is time-consuming and laborious. To address this issue, we propose a multimodal evidence retrieval system, named OFAR, to facilitate illegal live-streaming identification. OFAR consists of three modules: Query Encoder, Document Encoder, and MaxSim-based Contrastive Late Interaction. Both the query encoder and the document encoder are implemented with the advanced OFA encoder, which is pretrained on a large-scale multimodal dataset. In the last module, we introduce contrastive learning on top of MaxSim-based late interaction to enhance the model's ability at query-document matching. The proposed framework achieves significant improvement on our industrial dataset TaoLive, demonstrating the effectiveness of our scheme.
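    The MaxSim-based late interaction in the last module follows the ColBERT-style scoring scheme; a minimal PyTorch sketch follows, where shapes and normalization are assumptions rather than OFAR's exact implementation.

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_emb, d_emb):
    # ColBERT-style MaxSim late interaction: each query token is matched
    # to its most similar document token, and the per-token maxima are
    # summed into a single relevance score.
    # q_emb: (num_query_tokens, dim); d_emb: (num_doc_tokens, dim);
    # both assumed L2-normalized so dot products are cosine similarities.
    sim = q_emb @ d_emb.T                # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best document token per query token

q = F.normalize(torch.randn(8, 128), dim=-1)    # toy query-token embeddings
d = F.normalize(torch.randn(64, 128), dim=-1)   # toy document-token embeddings
print(maxsim_score(q, d))
```

    Scoring token-to-token only at the end ("late" interaction) lets document embeddings be precomputed offline, which matters for searching a large evidence database.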

    XTQA: Span-Level Explanations of the Textbook Question Answering

    Textbook Question Answering (TQA) is the task of answering a diagram or non-diagram question given a large multi-modal context consisting of abundant essays and diagrams. We argue that the explainability of this task should place students as a key aspect to be considered. To address this issue, we devise a novel architecture for span-level eXplanations of TQA (XTQA) based on our proposed coarse-to-fine-grained algorithm, which provides students not only with answers but also with the span-level evidence for choosing them. The algorithm first coarsely selects the top M paragraphs relevant to a question using TF-IDF, and then finely selects the top K evidence spans from all candidate spans within these paragraphs by computing each span's information gain with respect to the question. Experimental results show that XTQA significantly improves the state-of-the-art performance compared with baselines. The source code is available at https://github.com/keep-smile-001/opentqa
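    The coarse stage of this coarse-to-fine algorithm is plain TF-IDF retrieval; a minimal scikit-learn sketch is below. The fine-grained information-gain span selection is omitted, and the function name is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_m_paragraphs(question, paragraphs, m=3):
    # Coarse stage: rank paragraphs by TF-IDF cosine similarity to the
    # question and keep the top M; the fine stage (information gain over
    # candidate spans) would then run only on these survivors.
    vec = TfidfVectorizer()
    mat = vec.fit_transform(paragraphs + [question])  # question is last row
    sims = cosine_similarity(mat[-1], mat[:-1]).ravel()
    top = sims.argsort()[::-1][:m]
    return [paragraphs[i] for i in top]

paragraphs = [
    "Photosynthesis converts light energy into chemical energy.",
    "The mitochondria are the powerhouse of the cell.",
    "Plants absorb carbon dioxide through their stomata.",
]
print(top_m_paragraphs("How do plants use light energy?", paragraphs, m=2))
```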

    MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

    We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets, for audio and video, learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift-based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus mutually reinforces audio-video fidelity. Extensive experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion. (Accepted by CVPR 2023.)
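    A toy sketch of what a random-shift cross-modal attention block might look like is below; the windowing scheme, shapes, and time alignment are simplifying assumptions made for illustration, not MM-Diffusion's released implementation.

```python
import torch
import torch.nn.functional as F

def random_shift_cross_attention(video, audio, window=4):
    # video: (T, D) frame features; audio: (S, D) audio-token features,
    # with S >= window assumed. Each video step attends only to a small
    # audio window whose start is offset by a random shift, instead of
    # paying for full T x S cross-attention.
    T, D = video.shape
    S = audio.shape[0]
    shift = torch.randint(0, window, (1,)).item()     # resampled every call
    out = torch.empty_like(video)
    for t in range(T):
        start = min((t + shift) * S // T, S - window)  # align time scales
        keys = audio[start:start + window]             # (window, D)
        attn = F.softmax(video[t] @ keys.T / D ** 0.5, dim=-1)
        out[t] = attn @ keys
    return out

video = torch.randn(16, 64)   # 16 toy video-frame features
audio = torch.randn(128, 64)  # 128 toy audio-token features
print(random_shift_cross_attention(video, audio).shape)  # torch.Size([16, 64])
```

    Resampling the shift at every call means different window alignments are seen over training, so the model can learn cross-modal correspondence at a fraction of the cost of dense attention.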