FDLS: A Deep Learning Approach to Production Quality, Controllable, and Retargetable Facial Performances
Visual effects work commonly requires both the creation of realistic synthetic
humans and the retargeting of actors' performances to humanoid characters such
as aliens and monsters. Achieving the expressive performances demanded in
entertainment requires manipulating complex models with hundreds of parameters.
Full creative control requires the freedom to make edits at any stage of
production, which rules out a fully automatic "black box" solution
with uninterpretable parameters. On the other hand, producing realistic
animation with these sophisticated models is difficult and laborious. This
paper describes FDLS (Facial Deep Learning Solver), which is Weta Digital's
solution to these challenges. FDLS adopts a coarse-to-fine and
human-in-the-loop strategy, allowing a solved performance to be verified and
edited at several stages in the solving process. To train FDLS, we first
transform the raw motion-captured data into robust graph features. Second,
based on the observation that the artists typically finalize the jaw pass
animation before proceeding to finer detail, we solve for the jaw motion first
and predict fine expressions with region-based networks conditioned on the jaw
position. Finally, artists can optionally invoke a non-linear finetuning
process on top of the FDLS solution to follow the motion-captured virtual
markers as closely as possible. FDLS supports editing where needed to improve
the results of the deep learning solution, and it can handle small day-to-day
changes in the actor's face shape. FDLS permits reliable, production-quality
performance solving with minimal training and little or no manual effort in
many cases, while also allowing the solve to be guided and edited in unusual
and difficult cases. The system has been under development for several years
and has been used in major movies.
Comment: DigiPro '22: The Digital Production Symposium
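The coarse-to-fine strategy described above (solve the jaw first, then predict fine expressions with region-based networks conditioned on the jaw) can be illustrated with a toy two-stage solver. All names, dimensions, and the linear "networks" below are hypothetical stand-ins, not Weta Digital's actual FDLS implementation; the `jaw_override` argument mimics the human-in-the-loop step where an artist verifies or edits the jaw pass before the fine solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 90 graph features from motion capture,
# 3 jaw parameters, 40 fine expression parameters per face region.
N_FEAT, N_JAW, N_FINE = 90, 3, 40

# Stage 1: a linear "jaw pass" solver (stand-in for the jaw network).
W_jaw = rng.normal(size=(N_JAW, N_FEAT))

# Stage 2: region-based solvers conditioned on the jaw result, i.e.
# they see both the raw features and the stage-1 jaw parameters.
regions = ["brows", "eyes", "mouth"]
W_fine = {r: rng.normal(size=(N_FINE, N_FEAT + N_JAW)) for r in regions}

def solve_frame(features, jaw_override=None):
    """Coarse-to-fine solve for one frame.

    `jaw_override` models the human-in-the-loop step: an artist may
    verify/edit the jaw pass before fine expressions are predicted.
    """
    jaw = W_jaw @ features if jaw_override is None else jaw_override
    conditioned = np.concatenate([features, jaw])
    fine = {r: W_fine[r] @ conditioned for r in regions}
    return jaw, fine

features = rng.normal(size=N_FEAT)
jaw, fine = solve_frame(features)

# Editing the jaw changes the downstream fine-expression solve.
_, fine_edited = solve_frame(features, jaw_override=jaw + 0.1)
print(np.allclose(fine["mouth"], fine_edited["mouth"]))
```

Because the fine solve is conditioned on the jaw, an edited jaw pass propagates to every region, which is the property the abstract's staged verification relies on.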
EMS: 3D Eyebrow Modeling from Single-view Images
Eyebrows play a critical role in facial expression and appearance. Although
the 3D digitization of faces is well explored, less attention has been drawn to
3D eyebrow modeling. In this work, we propose EMS, the first learning-based
framework for single-view 3D eyebrow reconstruction. Following methods for
scalp hair reconstruction, we represent the eyebrow as a set of fiber
curves and cast the reconstruction as a fiber-growing problem. Three modules
are then carefully designed: RootFinder first localizes the fiber root
positions, which indicate where to grow; OriPredictor predicts an orientation
field in 3D space to guide the growth of fibers; and FiberEnder determines
when to stop the growth of each fiber. OriPredictor directly borrows
the method used in hair reconstruction. Considering the differences
between hair and eyebrows, both RootFinder and FiberEnder are newly proposed.
Specifically, to cope with the challenge that the root location is severely
occluded, we formulate root localization as a density map estimation task.
Given the predicted density map, a density-based clustering method is further
used for finding the roots. For each fiber, growth starts from the root
point and proceeds step by step until termination, with each step defined as an
oriented line segment of constant length following the predicted orientation
field. To determine when to end, a pixel-aligned RNN architecture is designed
as a binary classifier that outputs a stop-or-continue decision at each growing step.
To support the training of all proposed networks, we build the first 3D
synthetic eyebrow dataset that contains 400 high-quality eyebrow models
manually created by artists. Extensive experiments have demonstrated the
effectiveness of the proposed EMS pipeline on a variety of different eyebrow
styles and lengths, ranging from short and sparse to long and bushy eyebrows.
Comment: To appear in SIGGRAPH Asia 2023 (Journal Track). 19 pages, 19
figures, 6 tables
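The growth loop described in the abstract (fixed-length oriented steps from a root point, with a learned stop decision per step) can be sketched with a toy version in which the orientation field and the stop rule are simple hand-written functions. In EMS these roles are played by the OriPredictor and FiberEnder networks; the functions, step length, and stop region below are illustrative assumptions only.

```python
import numpy as np

STEP_LEN = 0.05   # constant step length (hypothetical value)
MAX_STEPS = 200   # safety cap on fiber length

def orientation_field(p):
    """Stand-in for OriPredictor: a fixed 3D orientation at point p.

    Here fibers simply grow along +x with a slight upward drift;
    the real model predicts this field from the input image.
    """
    d = np.array([1.0, 0.2, 0.0])
    return d / np.linalg.norm(d)

def should_stop(fiber):
    """Stand-in for FiberEnder's binary classifier: stop once the
    fiber leaves a fixed region (the real RNN decides per step)."""
    return fiber[-1][0] > 0.5

def grow_fiber(root):
    """Grow one fiber from its root as a polyline of constant-length
    oriented segments, mirroring the EMS growth procedure."""
    fiber = [np.asarray(root, dtype=float)]
    for _ in range(MAX_STEPS):
        if should_stop(fiber):
            break
        step = STEP_LEN * orientation_field(fiber[-1])
        fiber.append(fiber[-1] + step)
    return np.stack(fiber)

# Roots would come from RootFinder's density-map clustering; here we
# just pick two nearby starting points.
for root in ([0.0, 0.0, 0.0], [0.0, 0.1, 0.0]):
    fiber = grow_fiber(root)
    print(len(fiber), fiber[-1].round(2))
```

Swapping in learned models for `orientation_field` and `should_stop` recovers the structure of the EMS pipeline: roots decide where to grow, the field decides which way, and the ender decides when to stop.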
Investigating 3D Visual Speech Animation Using 2D Videos
Lip motion accuracy is of paramount importance for speech intelligibility, especially for users who are hard of hearing or for foreign language learners. Furthermore, a high level of realism in lip movements is required by the game and film production industries. This thesis focuses on mapping tracked lip motions from front-view 2D videos of a real speaker onto a synthetic 3D head. A data-driven approach is used, based on a 3D morphable model (3DMM) built from 3D synthetic head poses. 3DMMs have been widely used for tasks such as face recognition and detecting facial expressions and lip motions in 2D videos. However, factors such as the facial landmarks required for the mapping process, the amount of data used to construct the 3DMM, and the differences in facial features between real faces and 3D faces that may influence the resulting animation have not yet been investigated. Therefore, this research centers on investigating the impact of these factors on the final 3D lip motions.
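A 3DMM of the kind described above is, at its core, a low-dimensional linear basis (mean shape plus principal components) over registered head meshes. The sketch below builds such a model from random stand-in "synthetic head poses" and round-trips a pose through it; all sizes are illustrative and the data is random, not the thesis's actual dataset or fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 50 synthetic head poses, each a mesh of
# 200 vertices flattened to a 600-dim vector (real 3DMMs are larger).
n_poses, n_verts = 50, 200
meshes = rng.normal(size=(n_poses, 3 * n_verts))

# Build the morphable model: mean shape + principal components via SVD.
mean_shape = meshes.mean(axis=0)
U, S, Vt = np.linalg.svd(meshes - mean_shape, full_matrices=False)
n_comp = 10
basis = Vt[:n_comp]                # (n_comp, 3*n_verts) shape basis

def encode(mesh):
    """Project a mesh into the 3DMM's low-dimensional coefficients."""
    return basis @ (mesh - mean_shape)

def decode(coeffs):
    """Reconstruct a mesh from 3DMM coefficients."""
    return mean_shape + coeffs @ basis

# Round-trip a training pose through the model.
coeffs = encode(meshes[0])
recon = decode(coeffs)
err = np.linalg.norm(recon - meshes[0]) / np.linalg.norm(meshes[0])
print(coeffs.shape, round(err, 3))
```

Mapping tracked 2D landmarks onto such a model then amounts to estimating the coefficient vector whose decoded mesh best matches the landmarks, which is why the choice of landmark set and training data directly shapes the resulting animation.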
The thesis explores how the different sets of facial features used in the mapping process influence the resulting 3D motions. Five sets of facial features are used for mapping the real faces to the corresponding 3D faces. The results show that including the eyebrows, eyes, nose, and lips improves the 3D lip motions, while face contour features (i.e. the outside boundary of the front view of the face) restrict the face's mesh, distorting the resulting animation.
This thesis investigates how using different amounts of data when constructing the 3DMM affects the 3D lip motions. The results show that using a wider range of synthetic head poses for different phoneme intensities to create a 3DMM, as well as a combination of front- and side-view photographs of real speakers to produce initial neutral 3D synthetic head poses, provides better animation results compared to ground truth data consisting of front- and side-view 2D videos of real speakers.
The thesis also investigates the impact of differences and similarities in facial features between the real speakers and the 3DMMs on the resulting 3D lip motions, by mapping between dissimilar faces based on differences in vertical mouth height and mouth width. The objective and user-test results show that mapping 2D videos of real speakers with low vertical mouth heights to 3D heads corresponding to real speakers with high vertical mouth heights, or vice versa, produces poorer 3D lip motions. It is thus important to take this into account when using a 2D recording of a real actor's lip movements to control a 3D synthetic character.
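The finding above suggests a simple compatibility check before mapping: measure the vertical mouth height and mouth width of the real speaker and of each candidate 3D head from their lip landmarks, and prefer the closest match. The four-landmark layout and the distance measure below are hypothetical illustrations, not the thesis's actual metric.

```python
import numpy as np

def mouth_proportions(landmarks):
    """Vertical mouth height and mouth width from four hypothetical
    lip landmarks: top, bottom, left corner, right corner (x, y)."""
    top, bottom, left, right = (np.asarray(landmarks[k]) for k in
                                ("top", "bottom", "left", "right"))
    height = np.linalg.norm(top - bottom)
    width = np.linalg.norm(left - right)
    return height, width

def proportion_distance(a, b):
    """Relative mismatch between two mouths' height and width."""
    ha, wa = mouth_proportions(a)
    hb, wb = mouth_proportions(b)
    return abs(ha - hb) / max(ha, hb) + abs(wa - wb) / max(wa, wb)

speaker = {"top": (0, 1.0), "bottom": (0, -1.0),
           "left": (-2.0, 0), "right": (2.0, 0)}
# Two candidate 3D heads: one with a similar mouth, one much taller.
similar = {"top": (0, 1.1), "bottom": (0, -1.0),
           "left": (-2.1, 0), "right": (2.0, 0)}
taller = {"top": (0, 2.5), "bottom": (0, -2.5),
          "left": (-2.0, 0), "right": (2.0, 0)}

best = min([similar, taller],
           key=lambda h: proportion_distance(speaker, h))
print(best is similar)  # True: the similar mouth is the better match
```

Selecting the candidate with the smallest proportion mismatch operationalizes the thesis's observation that pairing low and high vertical mouth heights degrades the resulting lip animation.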