ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
While language-guided image manipulation has made remarkable progress, the
challenge of how to instruct the manipulation process faithfully reflecting
human intentions persists. An accurate and comprehensive description of a
manipulation task using natural language is laborious and sometimes even
impossible, primarily due to the inherent uncertainty and ambiguity present in
linguistic expressions. Is it feasible to accomplish image manipulation without
resorting to external cross-modal language information? If this possibility
exists, the inherent modality gap would be effortlessly eliminated. In this
paper, we propose a novel manipulation methodology, dubbed ImageBrush, that
learns visual instructions for more accurate image editing. Our key idea is to
employ a pair of transformation images as visual instructions, which not only
precisely captures human intention but also facilitates accessibility in
real-world scenarios. Capturing visual instructions is particularly challenging
because it involves extracting the underlying intentions solely from visual
demonstrations and then applying this operation to a new image. To address this
challenge, we formulate visual instruction learning as a diffusion-based
inpainting problem, where the contextual information is fully exploited through
an iterative process of generation. A visual prompting encoder is carefully
devised to enhance the model's capacity in uncovering human intent behind the
visual instructions. Extensive experiments show that our method generates
engaging manipulation results conforming to the transformations entailed in
demonstrations. Moreover, our model exhibits robust generalization capabilities
on various downstream tasks such as pose transfer, image translation and video
inpainting.
Sub-Character Tokenization for Chinese Pretrained Language Models
Tokenization is fundamental to pretrained language models (PLMs). Existing
tokenization methods for Chinese PLMs typically treat each character as an
indivisible token. However, they ignore the unique feature of the Chinese
writing system where additional linguistic information exists below the
character level, i.e., at the sub-character level. To utilize such information,
we propose sub-character (SubChar for short) tokenization. Specifically, we
first encode the input text by converting each Chinese character into a short
sequence based on its glyph or pronunciation, and then construct the vocabulary
based on the encoded text with sub-word tokenization. Experimental results show
that SubChar tokenizers have two main advantages over existing tokenizers: 1)
They can tokenize inputs into much shorter sequences, thus improving the
computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode
Chinese homophones into the same transliteration sequences and produce the same
tokenization output, hence being robust to all homophone typos. At the same
time, models trained with SubChar tokenizers perform competitively on
downstream tasks. We release our code at
https://github.com/thunlp/SubCharTokenization to facilitate future work.
Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI:
Linguistically Informed Tokenizers For Chinese Language Model Pretraining".
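The encoding step described in the SubChar abstract can be illustrated with a minimal sketch. The table `TOY_PINYIN`, the function `encode_pronunciation`, and the `#` separator are hypothetical names chosen for illustration and are not taken from the released code; a real system would use a full character-to-pronunciation dictionary and would then train a sub-word tokenizer on the encoded text, which is not shown here.

```python
# Hypothetical toy pronunciation table (an assumption for illustration only).
TOY_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "们": "men5",
    "好": "hao3",
}


def encode_pronunciation(text: str, sep: str = "#") -> str:
    """Replace each known character with its transliteration plus a separator.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(TOY_PINYIN.get(ch, ch) + sep for ch in text)


# Two homophones map to the same encoded sequence, so a sub-word tokenizer
# trained on the encoded text would see identical input for both.
encoded_a = encode_pronunciation("他们")
encoded_b = encode_pronunciation("她们")
print(encoded_a == encoded_b)
```

Because the sub-word vocabulary is built on the encoded text, any two inputs with the same encoding necessarily receive the same tokenization, which is the property behind the robustness to homophone typos claimed in the abstract.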
Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks
Pre-trained models (PTMs) have been widely used in various downstream tasks.
The parameters of PTMs are distributed on the Internet and may suffer backdoor
attacks. In this work, we demonstrate the universal vulnerability of PTMs,
where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary
downstream tasks. Specifically, attackers can add a simple pre-training task,
which restricts the output representations of trigger instances to pre-defined
vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor
functionality is not eliminated during fine-tuning, the triggers can make the
fine-tuned model predict fixed labels by pre-defined vectors. In the
experiments of both natural language processing (NLP) and computer vision (CV),
we show that NeuBA absolutely controls the predictions for trigger instances
without any knowledge of downstream tasks. Finally, we apply several defense
methods to NeuBA and find that model pruning is a promising direction to resist
NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the
wide use of PTMs. Our source code and models are available at
https://github.com/thunlp/NeuBA.
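The added pre-training task described in the NeuBA abstract, which restricts the output representations of trigger instances to pre-defined vectors, can be read as a regression penalty. The sketch below assumes a squared-error penalty, which the abstract does not specify; the function name `representation_penalty` and the array shapes are likewise assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def representation_penalty(trigger_reprs: np.ndarray, target_vector: np.ndarray) -> float:
    """Mean squared distance between output representations and a fixed vector.

    trigger_reprs: array of shape (batch, dim), one representation per instance.
    target_vector: array of shape (dim,), the pre-defined vector.
    The squared-error form is an assumption; the abstract only says the
    representations are restricted to pre-defined vectors.
    """
    diff = trigger_reprs - target_vector
    return float(np.mean(np.sum(diff ** 2, axis=1)))


reprs = np.array([[1.0, 0.0], [0.0, 1.0]])
target = np.array([1.0, 1.0])
print(representation_penalty(reprs, target))
```

The penalty is zero exactly when every representation equals the pre-defined vector, which is consistent with the abstract's observation that pruning the neurons carrying this mapping is a promising defense.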
Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers
Previous studies have explored generating accurately lip-synced talking faces
for arbitrary targets given audio conditions. However, most of them deform or
generate the whole facial area, leading to non-realistic results. In this work,
we delve into the formulation of altering only the mouth shapes of the target
person. This requires masking a large percentage of the original image and
seamlessly inpainting it with the aid of audio and reference frames. To this
end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework,
which produces accurate lip-sync with photo-realistic quality by predicting the
masked mouth shapes. Our key insight is to exploit desired contextual
information provided in audio and visual modalities thoroughly with delicately
designed Transformers. Specifically, we propose a convolution-Transformer
hybrid backbone and design an attention-based fusion strategy for filling the
masked parts. It uniformly attends to the textural information on the unmasked
regions and the reference frame. Then the semantic audio information is
involved in enhancing the self-attention computation. Additionally, a
refinement network with audio injection improves both image and lip-sync
quality. Extensive experiments validate that our model can generate
high-fidelity lip-synced results for arbitrary subjects.
Comment: Accepted to SIGGRAPH Asia 2022 (Conference Proceedings). Project
page: https://hangz-nju-cuhk.github.io/projects/AV-CA
Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation
Creating photo-realistic versions of people's sketched portraits is useful
for various entertainment purposes. Existing studies only generate portraits in
the 2D plane with fixed views, making the results less vivid. In this paper, we
present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the
possibility of creating Stereoscopic 3D-aware portraits from simple contour
sketches by involving 3D generative models. Our key insight is to design
sketch-aware constraints that can fully exploit the prior knowledge of a
tri-plane-based 3D-aware generative model. Specifically, our designed
region-aware volume rendering strategy and global consistency constraint
further enhance detail correspondences during sketch encoding. Moreover, in
order to facilitate use by lay users, we propose a Contour-to-Sketch
module with vector quantized representations, so that easily drawn contours can
directly guide the generation of 3D portraits. Extensive comparisons show that
our method generates high-quality results that match the sketch. Our usability
study verifies that our system is greatly preferred by users.
Comment: Project page: https://hangz-nju-cuhk.github.io
Cyclin D-CDK4 kinase destabilizes PD-L1 via Cul3SPOP to control cancer immune surveillance
Delay-Sensitive Service Provisioning in Software-Defined Low-Earth-Orbit Satellite Networks
With advances in space technology and satellite communications, low-Earth-orbit (LEO) satellite networks have developed rapidly over the past decade and are expected to play an important role in future 6G networks. Meanwhile, a variety of applications, including many delay-sensitive ones, continue to emerge. Because LEO satellite networks are highly dynamic, supporting time-deterministic services in them is challenging. However, latency guarantees can be provided for most delay-sensitive applications through data plane traffic shaping and control plane routing optimization. This paper addresses the routing optimization problem for time-sensitive (TS) flows in software-defined LEO satellite networks. We formulate the problem as an integer linear program (ILP) that aims to minimize path handovers and the maximum link utilization while meeting TS flow latency constraints. Since this problem is NP-hard, we design an efficient longest continuous path (LCP) approximation algorithm. LCP selects the longest valid path in each topology snapshot that satisfies delay constraints. An auxiliary graph then determines the routing sequence with minimized handovers. We implement an LEO satellite network testbed with Open vSwitch (OVS) and an Open Network Operating System (ONOS) controller to evaluate LCP. The results show that LCP reduces the number of path handovers by up to 31.7% and keeps the maximum link utilization lowest for more than 75% of the time compared to benchmark algorithms. In summary, LCP achieves excellent path handover optimization and load balancing performance under TS flow latency constraints.
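The handover-minimizing idea behind LCP can be illustrated with a simplified greedy sketch. This is not the paper's auxiliary-graph construction: it assumes each topology snapshot is given as a set of path identifiers that already satisfy the delay constraint, it reads "longest" as "valid for the most consecutive snapshots", and it ignores link utilization. The function names `greedy_longest_continuous` and `count_handovers` are hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def greedy_longest_continuous(snapshots: Sequence[Set[Hashable]]) -> List[Hashable]:
    """Pick one path per snapshot, switching paths as rarely as possible.

    snapshots[t] is assumed to hold the ids of all paths that are valid
    (already meeting the delay bound) in topology snapshot t.
    """
    schedule: List[Hashable] = []
    t, n = 0, len(snapshots)
    while t < n:
        if not snapshots[t]:
            raise ValueError(f"no valid path in snapshot {t}")
        best_path, best_end = None, t
        # Sort by repr only to make tie-breaking deterministic.
        for path in sorted(snapshots[t], key=repr):
            end = t
            # Extend while the same path stays valid in later snapshots.
            while end < n and path in snapshots[end]:
                end += 1
            if end > best_end:
                best_path, best_end = path, end
        schedule.extend([best_path] * (best_end - t))
        t = best_end
    return schedule


def count_handovers(schedule: Sequence[Hashable]) -> int:
    """Number of snapshot boundaries at which the chosen path changes."""
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a != b)


snapshots = [{"A", "B"}, {"B", "C"}, {"B", "C"}, {"C"}]
schedule = greedy_longest_continuous(snapshots)
print(schedule, count_handovers(schedule))
```

In this toy input, path B stays valid across the first three snapshots, so the sketch keeps it until it disappears and then switches once to C. A fuller treatment would also weigh the maximum link utilization that the ILP objective includes.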