A vicinal surface model for epitaxial growth with logarithmic free energy
We study a continuum model for solid films that arises from the modeling of
one-dimensional step flows on a vicinal surface in the
attachment-detachment-limited regime. The resulting nonlinear partial
differential equation gives the evolution for the surface slope as a function
of the local height in a monotone
step train. Subject to periodic boundary conditions and positive initial
conditions, we prove the existence, uniqueness and positivity of global strong
solutions to this PDE using two Lyapunov energy functions. The long time
behavior of the solution, converging to a constant that depends only on the
initial data, is also investigated both analytically and numerically.
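As an illustration of the long-time behavior described above, here is a minimal numerical sketch. It assumes a hypothetical linear fourth-order surrogate u_t = -u_hhhh on a periodic domain (the paper's actual nonlinear PDE is not reproduced in this listing): each Fourier mode decays, and the slope profile relaxes to a constant determined by the initial data.

# Toy surrogate only: linear fourth-order evolution u_t = -u_hhhh with
# periodic boundary conditions, integrated exactly in Fourier space.
import numpy as np

N = 256
h = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
u0 = 1.0 + 0.3 * np.sin(h) + 0.1 * np.cos(3 * h)   # positive initial slope

k = np.fft.fftfreq(N, d=1.0 / N)                    # integer wavenumbers
u0_hat = np.fft.fft(u0)

for t in (0.0, 0.1, 1.0, 10.0):
    u_hat = u0_hat * np.exp(-(k ** 4) * t)          # exact decay of each mode
    u = np.fft.ifft(u_hat).real
    print(f"t={t:5.1f}  max|u - mean(u0)| = {np.abs(u - u0.mean()).max():.3e}")
# The deviation from the spatial mean of u0 decays in time, qualitatively
# mirroring the convergence to a data-dependent constant stated above.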
ModelScope Text-to-Video Technical Report
This paper introduces ModelScopeT2V, a text-to-video synthesis model that
evolves from a text-to-image synthesis model (i.e., Stable Diffusion).
ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame
generation and smooth movement transitions. The model can adapt to varying
frame numbers during training and inference, rendering it suitable for both
image-text and video-text datasets. ModelScopeT2V brings together three
components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising
1.7 billion parameters in total, of which 0.5 billion are dedicated to
temporal capabilities. The model demonstrates superior performance
over state-of-the-art methods across three evaluation metrics. The code and an
online demo are available at
\url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.
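To make the idea of a spatio-temporal block concrete, the following is a minimal sketch assuming a factorized design: spatial self-attention within each frame, then temporal self-attention across frames at each spatial location. This is an illustrative stand-in, not ModelScopeT2V's exact block.

import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim); tokens = patch tokens of one frame
        b, t, n, d = x.shape

        # Spatial attention: attend over tokens within each frame.
        xs = self.norm_s(x).reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, d)

        # Temporal attention: attend over frames at each spatial location,
        # which degenerates gracefully to a no-op-like update when T == 1.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x

# Usage: a clip of 8 frames, 64 patch tokens per frame, 320-dim features.
block = SpatioTemporalBlock(dim=320)
video = torch.randn(2, 8, 64, 320)
print(block(video).shape)  # torch.Size([2, 8, 64, 320])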
Progressive Learning without Forgetting
Learning from changing tasks and sequential experience without forgetting
previously obtained knowledge is a challenging problem for artificial neural
networks. In this work, we focus on two key problems in the paradigm of Continual
Learning (CL) without involving any old data: (i) the accumulation of
catastrophic forgetting caused by the gradually fading knowledge space from
which the model learns the previous knowledge; (ii) the uncontrolled tug-of-war
dynamics to balance the stability and plasticity during the learning of new
tasks. In order to tackle these problems, we present Progressive Learning
without Forgetting (PLwF) and a credit assignment regime in the optimizer. PLwF
densely introduces model functions from previous tasks to construct a knowledge
space such that it contains the most reliable knowledge on each task and the
distribution information of different tasks, while credit assignment controls
the tug-of-war dynamics by removing gradient conflict through projection.
Extensive ablative experiments demonstrate the effectiveness of PLwF and credit
assignment. In comparison with other CL methods, we report notably better
results even without relying on any raw data.
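As a concrete illustration of removing gradient conflict through projection, the sketch below assumes a PCGrad-style rule on flattened gradients; the actual credit-assignment regime in PLwF may differ in detail.

import torch

def remove_conflict(g_new: torch.Tensor, g_old: torch.Tensor) -> torch.Tensor:
    """Project the new-task gradient so it no longer opposes the old-task one."""
    dot = torch.dot(g_new, g_old)
    if dot < 0:  # negative inner product means the two gradients conflict
        g_new = g_new - dot / (g_old.norm() ** 2 + 1e-12) * g_old
    return g_new

# Toy usage: two conflicting gradients in R^2.
g_new = torch.tensor([1.0, -1.0])
g_old = torch.tensor([0.0, 1.0])
g_proj = remove_conflict(g_new, g_old)
print(g_proj)                    # tensor([1., 0.]) -- conflicting part removed
print(torch.dot(g_proj, g_old))  # tensor(0.) -- no longer opposes the old task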
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Customized generation using diffusion models has made impressive progress in
image generation, but remains unsatisfactory in the challenging video
generation task, as it requires controllability over both subjects and
motions. To that end, we present DreamVideo, a novel approach to generating
personalized videos from a few static images of the desired subject and a few
videos of target motion. DreamVideo decouples this task into two stages,
subject learning and motion learning, by leveraging a pre-trained video
diffusion model. The subject learning aims to accurately capture the fine
appearance of the subject from provided images, which is achieved by combining
textual inversion and fine-tuning of our carefully designed identity adapter.
In motion learning, we architect a motion adapter and fine-tune it on the given
videos to effectively model the target motion pattern. Combining these two
lightweight and efficient adapters allows for flexible customization of any
subject with any motion. Extensive experimental results demonstrate the
superior performance of our DreamVideo over the state-of-the-art methods for
customized video generation. Our project page is at
https://dreamvideo-t2v.github.io
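To illustrate what a lightweight adapter of this kind can look like, here is a minimal sketch assuming a standard bottleneck design with a zero-initialized up-projection, so the frozen backbone's behavior is unchanged at the start of fine-tuning. DreamVideo's identity and motion adapters are not necessarily implemented this way.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-project to a bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # up-project back to the model dim
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual update

# Usage: insert after a frozen backbone layer; only the adapter is trained.
feats = torch.randn(2, 77, 1024)              # (batch, tokens, dim)
adapter = Adapter(dim=1024)
assert torch.allclose(adapter(feats), feats)  # exact identity at initialization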
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1%
of the data, and 45.09 mAP with 100% of the data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2. Accepted to ICCV 2023.
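As a rough sketch of gated cross-modal fusion of the kind ALIF is described as performing, the module below lets vision tokens attend to language tokens and injects the result through a zero-initialized tanh gate, which keeps early training stable. The gating mechanism and its placement are assumptions for illustration, not RLIPv2's exact design.

import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))   # gate starts closed

    def forward(self, vision: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
        # vision: (batch, Nv, dim) region/query tokens
        # language: (batch, Nl, dim) text tokens
        fused, _ = self.cross_attn(self.norm(vision), language, language)
        return vision + torch.tanh(self.gate) * fused

# Usage: 100 vision queries fused with a 32-token relational caption.
fusion = GatedCrossModalFusion(dim=256)
v, l = torch.randn(2, 100, 256), torch.randn(2, 32, 256)
print(fusion(v, l).shape)  # torch.Size([2, 100, 256])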
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Diffusion models have emerged as the de facto paradigm for video generation.
However, their reliance on web-scale data of varied quality often yields
results that are visually unappealing and misaligned with the textual prompts.
To tackle this problem, we propose InstructVideo to instruct text-to-video
diffusion models with human feedback by reward fine-tuning. InstructVideo has
two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by
generating through the full DDIM sampling chain, we recast reward fine-tuning
as editing. By leveraging the diffusion process to corrupt a sampled video,
InstructVideo requires only partial inference of the DDIM sampling chain,
reducing fine-tuning cost while improving fine-tuning efficiency. 2) To
mitigate the absence of a dedicated video reward model for human preferences,
we repurpose established image reward models, e.g., HPSv2. To this end, we
propose Segmental Video Reward, a mechanism to provide reward signals based on
segmental sparse sampling, and Temporally Attenuated Reward, a method that
mitigates temporal modeling degradation during fine-tuning. Extensive
experiments, both qualitative and quantitative, validate the practicality and
efficacy of using image reward models in InstructVideo, significantly enhancing
the visual quality of generated videos without compromising generalization
capabilities. Code and models will be made publicly available. Project page:
https://instructvideo.github.io
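To make the reward design concrete, here is a minimal sketch of segmental sparse sampling combined with temporal attenuation, assuming a generic per-frame image reward function (a stand-in for a model such as HPSv2). The segment layout and decay schedule are illustrative assumptions, not InstructVideo's exact formulation.

import torch

def video_reward(frames: torch.Tensor,
                 image_reward_fn,
                 num_segments: int = 4,
                 decay: float = 0.8) -> torch.Tensor:
    """frames: (T, C, H, W). Score one frame per segment with an image reward
    model, then combine the scores with weights that decay over time."""
    T = frames.shape[0]
    bounds = torch.linspace(0, T, num_segments + 1).long()
    rewards, weights = [], []
    for i in range(num_segments):
        mid = (bounds[i] + bounds[i + 1]) // 2      # one sparse sample per segment
        rewards.append(image_reward_fn(frames[mid]))
        weights.append(decay ** i)                  # temporal attenuation
    weights = torch.tensor(weights)
    rewards = torch.stack(rewards)
    return (weights * rewards).sum() / weights.sum()

# Toy usage with a stand-in frame scorer in place of a real image reward model.
fake_scorer = lambda frame: frame.mean()
clip = torch.rand(16, 3, 64, 64)
print(video_reward(clip, fake_scorer))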
Learning Visual Context for Group Activity Recognition
Group activity recognition aims to recognize an overall activity in a multi-person scene. Previous methods strive to reason on individual features but under-explore person-specific contextual information, which is significant and informative in computer vision tasks. In this paper, we propose a new reasoning paradigm to incorporate global contextual information. Specifically, we propose two modules to bridge the gap between group activity and visual context. The first is the Transformer-based Context Encoding (TCE) module, which enhances individual representations by encoding global contextual information into individual features and refining the aggregated information. The second is the Spatial-Temporal Bilinear Pooling (STBiP) module. It first explores pairwise relationships among the context-encoded individual representations and then generates semantic representations via gated message passing on a constructed spatial-temporal graph. Building on these modules, we design a two-branch model that integrates them into a single pipeline. Systematic experiments demonstrate each module's effectiveness on either branch. Visualizations indicate that visual contextual cues can be aggregated globally by TCE. Moreover, our method achieves state-of-the-art results on two widely used benchmarks using only RGB images as input and 2D backbones.
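As an illustration of encoding global visual context into per-person features with a transformer, the following minimal sketch uses a standard PyTorch transformer encoder as a stand-in for the TCE module; the paper's actual module and its refinement step may differ.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (batch, num_persons, dim); self-attention lets every
        # person token aggregate context from all others in the scene.
        return self.encoder(person_feats)

# Usage: 12 detected persons per clip, 256-dim appearance features each.
enc = ContextEncoder()
people = torch.randn(2, 12, 256)
print(enc(people).shape)  # torch.Size([2, 12, 256])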