15 research outputs found
HotFlip: White-Box Adversarial Examples for Text Classification
We propose an efficient method to generate white-box adversarial examples to
trick a character-level neural classifier. We find that only a few
manipulations are needed to greatly decrease the accuracy. Our method relies on
an atomic flip operation, which swaps one token for another, based on the
gradients of the one-hot input vectors. Due to efficiency of our method, we can
perform adversarial training which makes the model more robust to attacks at
test time. With the use of a few semantics-preserving constraints, we
demonstrate that HotFlip can be adapted to attack a word-level classifier as
well
Adding Conditional Control to Text-to-Image Diffusion Models
We present ControlNet, a neural network architecture to add spatial
conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models, and reuses their
deep and robust encoding layers pretrained with billions of images as a strong
backbone to learn a diverse set of conditional controls. The neural
architecture is connected with "zero convolutions" (zero-initialized
convolution layers) that progressively grow the parameters from zero and ensure
that no harmful noise could affect the finetuning. We test various conditioning
controls, eg, edges, depth, segmentation, human pose, etc, with Stable
Diffusion, using single or multiple conditions, with or without prompts. We
show that the training of ControlNets is robust with small (<50k) and large
(>1m) datasets. Extensive results show that ControlNet may facilitate wider
applications to control image diffusion models.Comment: Codes and Supplementary Material:
https://github.com/lllyasviel/ControlNe
HireVAE: An Online and Adaptive Factor Model Based on Hierarchical and Regime-Switch VAE
Factor model is a fundamental investment tool in quantitative investment,
which can be empowered by deep learning to become more flexible and efficient
in practical complicated investing situations. However, it is still an open
question to build a factor model that can conduct stock prediction in an online
and adaptive setting, where the model can adapt itself to match the current
market regime identified based on only point-in-time market information. To
tackle this problem, we propose the first deep learning based online and
adaptive factor model, HireVAE, at the core of which is a hierarchical latent
space that embeds the underlying relationship between the market situation and
stock-wise latent factors, so that HireVAE can effectively estimate useful
latent factors given only historical market information and subsequently
predict accurate stock returns. Across four commonly used real stock market
benchmarks, the proposed HireVAE demonstrate superior performance in terms of
active returns over previous methods, verifying the potential of such online
and adaptive factor model.Comment: Accepted to IJCAI 202
Cinematic Behavior Transfer via NeRF-based Differentiable Filming
In the evolving landscape of digital media and video production, the precise
manipulation and reproduction of visual elements like camera movements and
character actions are highly desired. Existing SLAM methods face limitations in
dynamic scenes and human pose estimation often focuses on 2D projections,
neglecting 3D statuses. To address these issues, we first introduce a reverse
filming behavior estimation technique. It optimizes camera trajectories by
leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then
introduce a cinematic transfer pipeline that is able to transfer various shot
types to a new 2D video or a 3D virtual environment. The incorporation of 3D
engine workflow enables superior rendering and control abilities, which also
achieves a higher rating in the user study.Comment: Project Page:
https://virtualfilmstudio.github.io/projects/cinetransfe
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
The development of text-to-video (T2V), i.e., generating videos with a given
text prompt, has been significantly advanced in recent years. However, relying
solely on text prompts often results in ambiguous frame composition due to
spatial uncertainty. The research community thus leverages the dense structure
signals, e.g., per-frame depth/edge sequences, to enhance controllability,
whose collection accordingly increases the burden of inference. In this work,
we present SparseCtrl to enable flexible structure control with temporally
sparse signals, requiring only one or a few inputs, as shown in Figure 1. It
incorporates an additional condition encoder to process these sparse signals
while leaving the pre-trained T2V model untouched. The proposed approach is
compatible with various modalities, including sketches, depth maps, and RGB
images, providing more practical control for video generation and promoting
applications such as storyboarding, depth rendering, keyframe animation, and
interpolation. Extensive experiments demonstrate the generalization of
SparseCtrl on both original and personalized T2V generators. Codes and models
will be publicly available at https://guoyww.github.io/projects/SparseCtrl .Comment: Project page: https://guoyww.github.io/projects/SparseCtr
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization
Zero-shot skeleton-based action recognition aims to recognize actions of
unseen categories after training on data of seen categories. The key is to
build the connection between visual and semantic space from seen to unseen
classes. Previous studies have primarily focused on encoding sequences into a
singular feature vector, with subsequent mapping the features to an identical
anchor point within the embedded space. Their performance is hindered by 1) the
ignorance of the global visual/semantic distribution alignment, which results
in a limitation to capture the true interdependence between the two spaces. 2)
the negligence of temporal information since the frame-wise features with rich
action clues are directly pooled into a single feature vector. We propose a new
zero-shot skeleton-based action recognition method via mutual information (MI)
estimation and maximization. Specifically, 1) we maximize the MI between visual
and semantic space for distribution alignment; 2) we leverage the temporal
information for estimating the MI by encouraging MI to increase as more frames
are observed. Extensive experiments on three large-scale skeleton action
datasets confirm the effectiveness of our method. Code:
https://github.com/YujieOuO/SMIE.Comment: Accepted by ACM MM 202