32,436 research outputs found

    An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation

    With the rapid development of artificial intelligence (AI), digital humans have attracted increasing attention and are expected to find wide application across several industries. However, most existing digital humans still rely on manual modeling by designers, which is a cumbersome process with a long development cycle. Facing the rise of digital humans, there is therefore an urgent need for a digital human generation system combined with AI to improve development efficiency. In this paper, an implementation scheme for an intelligent digital human generation system with multimodal fusion is proposed. Specifically, text, speech, and image are taken as inputs, and interactive speech is synthesized using large language model (LLM), voiceprint extraction, and text-to-speech techniques. The input image is then age-transformed and a suitable image is selected as the driving image. Next, modification and generation of digital human video content are realized through digital human driving, novel view synthesis, and intelligent dressing techniques. Finally, we enhance the user experience through style transfer, super-resolution, and quality evaluation. Experimental results show that the system can effectively realize digital human generation. The related code is released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker
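
    The abstract only names the pipeline stages; the following is a minimal sketch, assuming a simple sequential orchestration, of how those stages might be chained. All stage functions are illustrative stubs, not the API of the released CUMT_2D_PhotoSpeaker code.

```python
# A minimal sketch of the pipeline stages named in the abstract, assuming each
# stage is a function that updates a shared state dict. All stage functions are
# illustrative stubs, not the API of the released CUMT_2D_PhotoSpeaker code.

def llm_reply(state):
    # Large language model produces the interactive reply text.
    state["reply_text"] = f"LLM reply to: {state['text']}"
    return state

def voiceprint_and_tts(state):
    # Voiceprint extraction plus text-to-speech yields the reply audio.
    state["voiceprint"] = "speaker_embedding"
    state["reply_wav"] = "reply.wav"
    return state

def prepare_driving_image(state):
    # Age transformation of the input photo, then selection of a driving image.
    state["driving_image"] = "aged_portrait.png"
    return state

def generate_video(state):
    # Digital human driving, novel view synthesis, and intelligent dressing.
    state["video"] = "talking_head.mp4"
    return state

def enhance_experience(state):
    # Style transfer, super-resolution, and quality evaluation.
    state["video"] = "styled_4k.mp4"
    state["quality"] = 0.92
    return state

PIPELINE = [llm_reply, voiceprint_and_tts, prepare_driving_image,
            generate_video, enhance_experience]

def run(text, speech_wav, image_png):
    state = {"text": text, "speech": speech_wav, "image": image_png}
    for stage in PIPELINE:
        state = stage(state)
    return state

if __name__ == "__main__":
    result = run("Hello!", "user_voice.wav", "portrait.png")
    print(result["video"], result["quality"])
```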

    The AISB’08 Symposium on Multimodal Output Generation (MOG 2008)

    Welcome to Aberdeen for the Symposium on Multimodal Output Generation (MOG 2008)! In this volume, the papers presented at the MOG 2008 international symposium are collected.

    Towards automatic generation of multimodal answers to medical questions: a cognitive engineering approach

    This paper describes a production experiment carried out to determine which modalities people choose when answering different types of questions. In this experiment, participants had to create (multimodal) presentations of answers to general medical questions. The collected answer presentations were coded for types of manipulation (typographic, spatial, graphical), the presence of visual media (i.e., photos, graphics, and animations), and the function and position of these visual media. The results of a first analysis indicated that participants presented the information in a multimodal way. Moreover, significant differences were found in the information presentation of different answer and question types.

    A multimodal restaurant finder for semantic web

    Multimodal dialogue systems provide multiple modalities, such as speech, mouse clicking, drawing, or touch, that can enhance human-computer interaction. However, one drawback of existing multimodal systems is that they are highly domain-specific and do not allow information to be shared across different providers. In this paper, we propose a semantic multimodal system for the Semantic Web, called Semantic Restaurant Finder, in which restaurant information for different cities, countries, and languages is constructed as ontologies to make the information shareable. Using the Semantic Restaurant Finder, users can draw on semantic restaurant knowledge distributed across different locations on the Internet to find the desired restaurants.
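
    The abstract does not give the ontology itself; the snippet below is a minimal sketch, assuming the rdflib library and an invented vocabulary, of how restaurant descriptions from independent providers could be published as shareable RDF so that graphs from different cities or languages can be merged before querying.

```python
# A minimal sketch, assuming rdflib and an invented vocabulary (the REST namespace
# and its class/property names are illustrative, not the paper's actual schema).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

REST = Namespace("http://example.org/restaurant#")  # hypothetical vocabulary

g = Graph()
g.bind("rest", REST)

# Shared class and property definitions that every provider reuses.
g.add((REST.Restaurant, RDF.type, RDFS.Class))
g.add((REST.cuisine, RDF.type, RDF.Property))
g.add((REST.city, RDF.type, RDF.Property))

# Instance data that one provider might publish.
g.add((REST.GoldenWok, RDF.type, REST.Restaurant))
g.add((REST.GoldenWok, RDFS.label, Literal("Golden Wok", lang="en")))
g.add((REST.GoldenWok, REST.cuisine, Literal("Chinese")))
g.add((REST.GoldenWok, REST.city, Literal("Melbourne")))

# Because the data is plain RDF, a second graph from another provider (possibly
# in another language) can simply be merged into g before querying.
print(g.serialize(format="turtle"))
```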

    On the Role of Visuals in Multimodal Answers to Medical Questions

    This paper describes two experiments carried out to investigate the role of visuals in multimodal answer presentations for a medical question answering system. First, a production experiment was carried out to determine which modalities people choose to answer different types of questions. In this experiment, participants had to create (multimodal) presentations of answers to general medical questions. The collected answer presentations were coded for the presence of visual media (i.e., photos, graphics, and animations) and their function. The results indicated that participants presented the information in a multimodal way. Moreover, significant differences were found in the presentation of different answer and question types. Next, an evaluation experiment was conducted to investigate how users evaluate different types of multimodal answer presentations. In this second experiment, participants had to assess the informativity and attractiveness of answer presentations for different types of medical questions. These answer presentations, originating from the production experiment, were manipulated in answer length (brief vs. extended) and type of picture (illustrative vs. informative). After the participants had assessed the answer presentations, they received a post-test in which they had to indicate how much they recalled from the presented answer presentations. The results showed that answer presentations with an informative picture were evaluated as more informative and more attractive than answer presentations with an illustrative picture. The results of the post-test tentatively indicated that learning from answer presentations with an informative picture leads to better learning performance than learning from purely textual answer presentations.

    An information assistant system for the prevention of tunnel vision in crisis management

    In the crisis management environment, tunnel vision is a set of biases in decision makers' cognitive processes that often leads to an incorrect understanding of the real crisis situation, biased perception of information, and improper decisions. The tunnel vision phenomenon is a consequence both of the challenges in the task and of natural limitations in human cognitive processing. An information assistant system is proposed with the purpose of preventing tunnel vision. The system serves as a platform for monitoring the ongoing crisis event: all information passes through the system before it arrives at the user. The system enhances data quality, reduces data quantity, and presents the crisis information in a manner that prevents or repairs the user's cognitive overload. While working with such a system, the users (crisis managers) are expected to be more likely to stay aware of the actual situation, remain open-minded to possibilities, and make proper decisions.
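
    The abstract characterizes the assistant by three functions: enhancing data quality, reducing data quantity, and presenting information so as to avoid cognitive overload. The sketch below is one illustrative interpretation of that loop; the field names and filtering heuristics are assumptions, not the paper's actual design.

```python
# Illustrative sketch of the three roles the abstract assigns to the assistant,
# assuming incoming crisis reports arrive as dicts. Field names and heuristics
# are invented for this example.

def enhance_quality(reports):
    # Drop reports missing essential fields and normalise free-text severity labels.
    levels = {"low": 1, "medium": 2, "high": 3}
    cleaned = []
    for r in reports:
        if not r.get("location") or not r.get("message"):
            continue
        cleaned.append(dict(r, severity=levels.get(str(r.get("severity", "low")).lower(), 1)))
    return cleaned

def reduce_quantity(reports):
    # Keep only the most severe report per location to limit the volume shown.
    best = {}
    for r in reports:
        key = r["location"]
        if key not in best or r["severity"] > best[key]["severity"]:
            best[key] = r
    return list(best.values())

def present(reports):
    # Order by severity so the operator sees the most critical items first.
    for r in sorted(reports, key=lambda r: -r["severity"]):
        print(f"[sev {r['severity']}] {r['location']}: {r['message']}")

if __name__ == "__main__":
    feed = [
        {"location": "Sector 4", "message": "Smoke reported", "severity": "medium"},
        {"location": "Sector 4", "message": "Building fire confirmed", "severity": "high"},
        {"location": "Sector 9", "message": "Road blocked", "severity": "low"},
        {"location": "", "message": "corrupted entry"},
    ]
    present(reduce_quantity(enhance_quality(feed)))
```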

    Generation of multi-modal dialogue for a net environment

    In this paper, an architecture and a special-purpose markup language for simulated affective face-to-face communication are presented. In systems based on this architecture, users will be able to watch embodied conversational agents interact with each other in virtual locations on the Internet. The markup language, the Rich Representation Language (RRL), has been designed to provide an integrated representation of speech, gesture, posture, and facial animation.
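
    The abstract does not show RRL's actual syntax, so the sketch below uses a plain Python data model (an assumption for illustration, not RRL itself) to show what an integrated representation aligning speech, gesture, posture, and facial animation within one dialogue turn could look like.

```python
# A minimal sketch of an "integrated representation" for an embodied dialogue
# turn. This is an illustrative data model, not RRL's actual markup.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Behaviour:
    channel: str      # "gesture", "posture", or "face"
    name: str         # e.g. "wave", "lean_forward", "smile"
    start: float      # seconds, relative to the start of the utterance
    duration: float

@dataclass
class Turn:
    agent: str
    speech: str
    behaviours: List[Behaviour] = field(default_factory=list)

# Two embodied agents greeting each other; the timing fields align non-verbal
# behaviour with the spoken text, which is the core idea of an integrated
# representation.
dialogue = [
    Turn("Agent_A", "Hello, nice to meet you.",
         [Behaviour("face", "smile", 0.0, 1.5),
          Behaviour("gesture", "wave", 0.2, 1.0)]),
    Turn("Agent_B", "Likewise!",
         [Behaviour("posture", "lean_forward", 0.0, 0.8),
          Behaviour("face", "smile", 0.1, 1.0)]),
]

for turn in dialogue:
    print(turn.agent, "says:", turn.speech, "with", [b.name for b in turn.behaviours])
```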