The objective of stylized speech-driven facial animation is to create
animations that encapsulate specific emotional expressions. Existing methods
often depend on pre-established emotional labels or facial expression
templates, which may limit the necessary flexibility for accurately conveying
user intent. In this research, we introduce a technique that enables the
control of arbitrary styles by leveraging natural language as emotion prompts.
This technique presents benefits in terms of both flexibility and
user-friendliness. To realize this objective, we initially construct a
Text-Expression Alignment Dataset (TEAD), wherein each facial expression is
paired with several prompt-like descriptions.We propose an innovative automatic
annotation method, supported by Large Language Models (LLMs), to expedite the
dataset construction, thereby eliminating the substantial expense of manual
annotation. Following this, we utilize TEAD to train a CLIP-based model, termed
ExpCLIP, which encodes text and facial expressions into semantically aligned
style embeddings. The embeddings are subsequently integrated into the facial
animation generator to yield expressive and controllable facial animations.
Given the limited diversity of facial emotions in existing speech-driven facial
animation training data, we further introduce an effective Expression Prompt
Augmentation (EPA) mechanism to enable the animation generator to support
unprecedented richness in style control. Comprehensive experiments illustrate
that our method accomplishes expressive facial animation generation and offers
enhanced flexibility in effectively conveying the desired style