With the rapid development of artificial intelligence (AI), digital humans have attracted increasing attention and are expected to find wide application across several industries. However, most existing digital humans still rely on manual modeling by designers, a cumbersome process with a long development cycle. Facing the rise of digital humans, there is therefore an urgent need for an AI-assisted digital human generation system that improves development efficiency. In this paper, an implementation
scheme of an intelligent digital human generation system with multimodal fusion
is proposed. Specifically, text, speech, and images are taken as inputs, and interactive speech is synthesized using a large language model (LLM), voiceprint extraction, and text-to-speech (TTS) techniques. The input image is then age-transformed, and a suitable image is selected as the driving image. Next, the modification and generation of digital human video content are realized through digital human driving, novel view synthesis, and intelligent dressing
techniques. Finally, we enhance the user experience through style transfer,
super-resolution, and quality evaluation. Experimental results show that the
system can effectively generate digital humans. The related code is
released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker
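
To make the data flow concrete, the following is a minimal sketch of how the four stages described above might be orchestrated. All names here (GenerationRequest, synthesize_interactive_speech, and so on) are hypothetical placeholders for illustration only, and each stage is stubbed; they do not reflect the actual API of the released repository.

```python
# Hypothetical end-to-end sketch of the pipeline in the abstract; every name
# below is a placeholder, not the API of the released repository.
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    text: str          # user prompt, answered by the LLM
    speech_path: str   # reference audio for voiceprint extraction
    image_path: str    # portrait to be age-transformed into a driving image


def synthesize_interactive_speech(req: GenerationRequest) -> str:
    """Stage 1: LLM reply + voiceprint-conditioned TTS (stubbed)."""
    reply = f"<LLM answer to '{req.text}'>"
    voiceprint = f"<voiceprint of {req.speech_path}>"
    return f"<TTS audio from {reply} in voice {voiceprint}>"


def prepare_driving_image(req: GenerationRequest, target_age: int) -> str:
    """Stage 2: age transformation, then driving-image selection (stubbed)."""
    return f"<age-{target_age} variant of {req.image_path}>"


def generate_video(driving_image: str, audio: str) -> str:
    """Stage 3: digital human driving, novel view synthesis, dressing (stubbed)."""
    return f"<video driven by {driving_image} with {audio}>"


def enhance(video: str) -> str:
    """Stage 4: style transfer, super-resolution, quality evaluation (stubbed)."""
    return f"<enhanced {video}>"


if __name__ == "__main__":
    req = GenerationRequest("Introduce yourself.", "ref.wav", "portrait.png")
    audio = synthesize_interactive_speech(req)
    frame = prepare_driving_image(req, target_age=30)
    print(enhance(generate_video(frame, audio)))
```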