346 research outputs found
ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment
The objective of stylized speech-driven facial animation is to create
animations that encapsulate specific emotional expressions. Existing methods
often depend on pre-established emotional labels or facial expression
templates, which may limit the necessary flexibility for accurately conveying
user intent. In this research, we introduce a technique that enables the
control of arbitrary styles by leveraging natural language as emotion prompts.
This technique presents benefits in terms of both flexibility and
user-friendliness. To realize this objective, we initially construct a
Text-Expression Alignment Dataset (TEAD), wherein each facial expression is
paired with several prompt-like descriptions. We propose an innovative automatic
annotation method, supported by Large Language Models (LLMs), to expedite the
dataset construction, thereby eliminating the substantial expense of manual
annotation. Following this, we utilize TEAD to train a CLIP-based model, termed
ExpCLIP, which encodes text and facial expressions into semantically aligned
style embeddings. The embeddings are subsequently integrated into the facial
animation generator to yield expressive and controllable facial animations.
Given the limited diversity of facial emotions in existing speech-driven facial
animation training data, we further introduce an effective Expression Prompt
Augmentation (EPA) mechanism to enable the animation generator to support
unprecedented richness in style control. Comprehensive experiments illustrate
that our method accomplishes expressive facial animation generation and offers
enhanced flexibility in effectively conveying the desired style.
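The abstract above does not state ExpCLIP's training objective. As a rough illustration of what "semantically aligned" embeddings usually means for CLIP-style models, the sketch below computes the symmetric contrastive loss used by the original CLIP on toy vectors. The function names, the temperature value, and the toy embeddings are assumptions made for this example, not details from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(text_emb, expr_emb, temperature=0.07):
    """Symmetric contrastive loss over paired (text, expression) embeddings.

    temperature=0.07 is an assumed value, not taken from the ExpCLIP abstract.
    """
    t = [normalize(v) for v in text_emb]
    e = [normalize(v) for v in expr_emb]
    # logits[i][j] = scaled cosine similarity between text i and expression j
    logits = [[sum(a * b for a, b in zip(ti, ej)) / temperature for ej in e]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the text-to-expression and expression-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy data: matched pairs should give a much lower loss than swapped pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(texts, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss(texts, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each text prompt toward its paired expression and away from the other expressions in the batch, which is one common way to obtain a shared embedding space.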
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Text-to-image diffusion models have demonstrated remarkable capabilities in
transforming textual prompts into coherent images, yet the computational cost
of their inference remains a persistent challenge. To address this issue, we
present UFOGen, a novel generative model designed for ultra-fast, one-step
text-to-image synthesis. In contrast to conventional approaches that focus on
improving samplers or employing distillation techniques for diffusion models,
UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN
objective. Leveraging a newly introduced diffusion-GAN objective and
initialization with pre-trained diffusion models, UFOGen excels in efficiently
generating high-quality images conditioned on textual descriptions in a single
step. Beyond traditional text-to-image generation, UFOGen showcases versatility
in applications. Notably, UFOGen stands among the pioneering models enabling
one-step text-to-image generation and diverse downstream tasks, presenting a
significant advancement in the landscape of efficient generative models.
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
The deployment of large-scale text-to-image diffusion models on mobile
devices is impeded by their substantial model size and slow inference speed. In
this paper, we propose MobileDiffusion, a highly efficient
text-to-image diffusion model obtained through extensive optimizations in both
architecture and sampling techniques. We conduct a comprehensive examination of
model architecture design to reduce redundancy, enhance computational
efficiency, and minimize the model's parameter count, while preserving image
generation quality. Additionally, we employ distillation and diffusion-GAN
finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference
respectively. Empirical studies, conducted both quantitatively and
qualitatively, demonstrate the effectiveness of our proposed techniques.
MobileDiffusion achieves a remarkable sub-second inference speed for
generating an image on mobile devices, establishing a new state
of the art.
DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection
Graph Anomaly Detection (GAD) has recently become a research hotspot due to
its practicability and theoretical value. Since GAD emphasizes applications
and anomalous samples are rare, enriching the variety of its datasets is
fundamental work. Thus, this paper presents DGraph, a real-world dynamic graph
in the finance domain. DGraph overcomes many limitations of current GAD
datasets. It contains about 3M nodes, 4M dynamic edges, and 1M ground-truth
nodes. We provide a comprehensive observation of DGraph, revealing that
anomalous nodes and normal nodes generally have different structures, neighbor
distribution, and temporal dynamics. Moreover, it suggests that those unlabeled
nodes are also essential for detecting fraudsters. Furthermore, we conduct
extensive experiments on DGraph. Observation and experiments demonstrate that
DGraph can advance GAD research and enable in-depth exploration of
anomalous nodes. Comment: 9 page
3D Visibility-aware Generalizable Neural Radiance Fields for Interacting Hands
Neural radiance fields (NeRFs) are promising 3D representations for scenes,
objects, and humans. However, most existing methods require multi-view inputs
and per-scene training, which limits their real-life applications. Moreover,
current methods focus on single-subject cases, leaving scenes of interacting
hands that involve severe inter-hand occlusions and challenging view variations
unsolved. To tackle these issues, this paper proposes a generalizable
visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically,
given an image of interacting hands as input, our VA-NeRF first obtains a
mesh-based representation of hands and extracts their corresponding geometric
and textural features. Subsequently, a feature fusion module that exploits the
visibility of query points and mesh vertices is introduced to adaptively merge
features of both hands, enabling the recovery of features in unseen areas.
Additionally, our VA-NeRF is optimized together with a novel discriminator
within an adversarial learning paradigm. In contrast to conventional
discriminators that predict a single real/fake label for the synthesized image,
the proposed discriminator generates a pixel-wise visibility map, providing
fine-grained supervision for unseen areas and encouraging the VA-NeRF to
improve the visual quality of synthesized images. Experiments on the
Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms
conventional NeRFs significantly. Project Page:
https://github.com/XuanHuang0/VANeRF. Comment: Accepted by AAAI-2
Prompt-based Node Feature Extractor for Few-shot Learning on Text-Attributed Graphs
Text-attributed Graphs (TAGs) are commonly found in the real world, such as
social networks and citation networks, and consist of nodes represented by
textual descriptions. Currently, mainstream machine learning methods on TAGs
involve a two-stage modeling approach: (1) unsupervised node feature extraction
with pre-trained language models (PLMs); and (2) supervised learning using
Graph Neural Networks (GNNs). However, we observe that these representations,
which have undergone large-scale pre-training, do not significantly improve
performance with a limited amount of training samples. The main issue is that
existing methods have not effectively integrated information from the graph and
downstream tasks simultaneously. In this paper, we propose a novel framework
called G-Prompt, which combines a graph adapter and task-specific prompts to
extract node features. First, G-Prompt introduces a learnable GNN layer
(i.e., adapter) at the end of PLMs, which is fine-tuned to better
capture the masked tokens considering graph neighborhood information. After the
adapter is trained, G-Prompt incorporates task-specific prompts to obtain
interpretable node representations for the downstream task. Our
experimental results demonstrate that our proposed method outperforms current
state-of-the-art (SOTA) methods on few-shot node classification. More
importantly, in zero-shot settings, the G-Prompt embeddings can not only
provide better task interpretability than vanilla PLMs but also achieve
comparable performance with fully-supervised baselines. Comment: Under revie
Integrated photonics modular arithmetic processor
Integrated photonics computing has emerged as a promising approach to
overcome the limitations of electronic processors in the post-Moore era,
capitalizing on the superiority of photonic systems. However, present
integrated photonics computing systems face challenges in achieving
high-precision calculations, consequently limiting their potential
applications, and their heavy reliance on analog-to-digital (AD) and
digital-to-analog (DA) conversion interfaces undermines their performance. Here
we propose an innovative photonic computing architecture featuring scalable
calculation precision and a novel photonic conversion interface. By leveraging
Residue Number System (RNS) theory, the high-precision calculation is
decomposed into multiple low-precision modular arithmetic operations executed
through optical phase manipulation. Those operations directly interact with the
digital system via our proposed optical digital-to-phase converter (ODPC) and
phase-to-digital converter (OPDC). Through experimental demonstrations, we
showcase a calculation precision of 9 bits and verify the feasibility of the
ODPC/OPDC photonic interface. This approach paves the path towards liberating
photonic computing from the constraints imposed by limited precision and AD/DA
converters. Comment: 23 pages, 9 figure
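To make the Residue Number System idea in the abstract above concrete, here is a minimal software sketch of how one wide calculation can be split into independent small modular operations and then recombined with the Chinese Remainder Theorem. The moduli (7, 8, 9) and all function names are illustrative assumptions; the abstract does not say which moduli the photonic processor uses, and the optical phase manipulation itself is not modeled here.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen only for illustration.
# Their product (504) is the range of values that can be represented uniquely.
MODULI = (7, 8, 9)


def to_residues(x, moduli=MODULI):
    """Represent an integer by its remainders modulo each modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Add two RNS numbers channel by channel; channels never interact."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Multiply two RNS numbers channel by channel."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


def from_residues(residues, moduli=MODULI):
    """Recover the integer in [0, prod(moduli)) via the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % big_m


# 19 * 23 = 437 fits in the range [0, 504), so the result is exact.
product = from_residues(rns_mul(to_residues(19), to_residues(23)))
print(product)  # 437
```

Each channel only ever handles values smaller than its modulus, which is why a set of low-precision modular units can jointly deliver a higher-precision result. Results are exact only while the true value stays below the product of the moduli.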
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning
Recent years have witnessed significant advancements in self-supervised
learning (SSL) methods for speech-processing tasks. Various speech-based SSL
models have been developed and present promising performance on a range of
downstream tasks including speech recognition. However, existing speech-based
SSL models face a common dilemma in terms of computational cost, which might
hinder their potential application and in-depth academic research. To address
this issue, we first analyze the computational cost of different modules during
HuBERT pre-training and then introduce a stack of efficiency optimizations,
which is named Fast-HuBERT in this paper. The proposed Fast-HuBERT can be
trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without
performance degradation, resulting in a 5.2x speedup, compared to the original
implementation. Moreover, we explore two well-studied techniques in
Fast-HuBERT and demonstrate consistent improvements, as reported in previous
work.
- …