32,436 research outputs found

    An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation

    With the rapid development of artificial intelligence (AI), digital humans have attracted increasing attention and are expected to find wide application across several industries. However, most existing digital humans still rely on manual modeling by designers, which is a cumbersome process with a long development cycle. Facing the rise of digital humans, there is therefore an urgent need for a digital human generation system combined with AI to improve development efficiency. In this paper, an implementation scheme for an intelligent digital human generation system with multimodal fusion is proposed. Specifically, text, speech, and image are taken as inputs, and interactive speech is synthesized using large language model (LLM), voiceprint extraction, and text-to-speech techniques. The input image is then age-transformed and a suitable image is selected as the driving image. Next, modification and generation of digital human video content are realized through digital human driving, novel view synthesis, and intelligent dressing techniques. Finally, we enhance the user experience through style transfer, super-resolution, and quality evaluation. Experimental results show that the system can effectively realize digital human generation. The related code is released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker
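
    The abstract only names the pipeline stages; the following is a minimal sketch, assuming a simple sequential orchestration, of how those stages might be chained. All stage functions are illustrative stubs, not the API of the released CUMT_2D_PhotoSpeaker code.

```python
# A minimal sketch of the pipeline stages named in the abstract, assuming each
# stage is a function that updates a shared state dict. All stage functions are
# illustrative stubs, not the API of the released CUMT_2D_PhotoSpeaker code.

def llm_reply(state):
    # Large language model produces the interactive reply text.
    state["reply_text"] = f"LLM reply to: {state['text']}"
    return state

def voiceprint_and_tts(state):
    # Voiceprint extraction plus text-to-speech yields the reply audio.
    state["voiceprint"] = "speaker_embedding"
    state["reply_wav"] = "reply.wav"
    return state

def prepare_driving_image(state):
    # Age transformation of the input photo, then selection of a driving image.
    state["driving_image"] = "aged_portrait.png"
    return state

def generate_video(state):
    # Digital human driving, novel view synthesis, and intelligent dressing.
    state["video"] = "talking_head.mp4"
    return state

def enhance_experience(state):
    # Style transfer, super-resolution, and quality evaluation.
    state["video"] = "styled_4k.mp4"
    state["quality"] = 0.92
    return state

PIPELINE = [llm_reply, voiceprint_and_tts, prepare_driving_image,
            generate_video, enhance_experience]

def run(text, speech_wav, image_png):
    state = {"text": text, "speech": speech_wav, "image": image_png}
    for stage in PIPELINE:
        state = stage(state)
    return state

if __name__ == "__main__":
    result = run("Hello!", "user_voice.wav", "portrait.png")
    print(result["video"], result["quality"])
```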

    The AISB’08 Symposium on Multimodal Output Generation (MOG 2008)

    Welcome to Aberdeen for the Symposium on Multimodal Output Generation (MOG 2008)! In this volume, the papers presented at the MOG 2008 international symposium are collected.

    Towards automatic generation of multimodal answers to medical questions: a cognitive engineering approach

    This paper describes a production experiment carried out to determine which modalities people choose when answering different types of questions. In this experiment, participants had to create (multimodal) presentations of answers to general medical questions. The collected answer presentations were coded for types of manipulation (typographic, spatial, graphical), the presence of visual media (i.e., photos, graphics, and animations), and the function and position of these visual media. The results of a first analysis indicated that participants presented the information in a multimodal way. Moreover, significant differences were found in the information presentation of different answer and question types.

    A multimodal restaurant finder for semantic web

    Multimodal dialogue systems provide multiple modalities, such as speech, mouse clicking, drawing, or touch, that can enhance human-computer interaction. However, one drawback of existing multimodal systems is that they are highly domain-specific and do not allow information to be shared across different providers. In this paper, we propose a semantic multimodal system for the Semantic Web, called Semantic Restaurant Finder, in which restaurant information for different cities, countries, and languages is constructed as ontologies to make the information shareable. Using the Semantic Restaurant Finder, users can draw on semantic restaurant knowledge distributed across different locations on the Internet to find the desired restaurants.
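
    The abstract does not give the ontology itself; the snippet below is a minimal sketch, assuming the rdflib library and an invented vocabulary, of how restaurant descriptions from independent providers could be published as shareable RDF so that graphs from different cities or languages can be merged before querying.

```python
# A minimal sketch, assuming rdflib and an invented vocabulary (the REST namespace
# and its class/property names are illustrative, not the paper's actual schema).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

REST = Namespace("http://example.org/restaurant#")  # hypothetical vocabulary

g = Graph()
g.bind("rest", REST)

# Shared class and property definitions that every provider reuses.
g.add((REST.Restaurant, RDF.type, RDFS.Class))
g.add((REST.cuisine, RDF.type, RDF.Property))
g.add((REST.city, RDF.type, RDF.Property))

# Instance data that one provider might publish.
g.add((REST.GoldenWok, RDF.type, REST.Restaurant))
g.add((REST.GoldenWok, RDFS.label, Literal("Golden Wok", lang="en")))
g.add((REST.GoldenWok, REST.cuisine, Literal("Chinese")))
g.add((REST.GoldenWok, REST.city, Literal("Melbourne")))

# Because the data is plain RDF, a second graph from another provider (possibly
# in another language) can simply be merged into g before querying.
print(g.serialize(format="turtle"))
```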

    On the Role of Visuals in Multimodal Answers to Medical Questions

    This paper describes two experiments carried out to investigate the role of visuals in multimodal answer presentations for a medical question answering system. First, a production experiment was carried out to determine which modalities people choose to answer different types of questions. In this experiment, participants had to create (multimodal) presentations of answers to general medical questions. The collected answer presentations were coded for the presence of visual media (i.e., photos, graphics, and animations) and their function. The results indicated that participants presented the information in a multimodal way. Moreover, significant differences were found in the presentation of different answer and question types. Next, an evaluation experiment was conducted to investigate how users evaluate different types of multimodal answer presentations. In this second experiment, participants had to assess the informativity and attractiveness of answer presentations for different types of medical questions. These answer presentations, originating from the production experiment, were manipulated in answer length (brief vs. extended) and type of picture (illustrative vs. informative). After the participants had assessed the answer presentations, they received a post-test in which they had to indicate how much they recalled from the presented answer presentations. The results showed that answer presentations with an informative picture were evaluated as more informative and more attractive than answer presentations with an illustrative picture. The results of the post-test tentatively indicated that learning from answer presentations with an informative picture leads to better learning performance than learning from purely textual answer presentations.

    An information assistant system for the prevention of tunnel vision in crisis management

    In the crisis management environment, tunnel vision is a set of biases in decision makers' cognitive processes that often leads to an incorrect understanding of the real crisis situation, biased perception of information, and improper decisions. The tunnel vision phenomenon is a consequence both of the challenges in the task and of natural limitations in human cognitive processing. An information assistant system is proposed with the purpose of preventing tunnel vision. The system serves as a platform for monitoring the ongoing crisis event: all information passes through the system before it arrives at the user. The system enhances data quality, reduces data quantity, and presents the crisis information in a manner that prevents or repairs the user's cognitive overload. While working with such a system, the users (crisis managers) are expected to be more likely to stay aware of the actual situation, remain open-minded to possibilities, and make proper decisions.
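
    The abstract characterizes the assistant by three functions: enhancing data quality, reducing data quantity, and presenting information so as to avoid cognitive overload. The sketch below is one illustrative interpretation of that loop; the field names and filtering heuristics are assumptions, not the paper's actual design.

```python
# Illustrative sketch of the three roles the abstract assigns to the assistant,
# assuming incoming crisis reports arrive as dicts. Field names and heuristics
# are invented for this example.

def enhance_quality(reports):
    # Drop reports missing essential fields and normalise free-text severity labels.
    levels = {"low": 1, "medium": 2, "high": 3}
    cleaned = []
    for r in reports:
        if not r.get("location") or not r.get("message"):
            continue
        cleaned.append(dict(r, severity=levels.get(str(r.get("severity", "low")).lower(), 1)))
    return cleaned

def reduce_quantity(reports):
    # Keep only the most severe report per location to limit the volume shown.
    best = {}
    for r in reports:
        key = r["location"]
        if key not in best or r["severity"] > best[key]["severity"]:
            best[key] = r
    return list(best.values())

def present(reports):
    # Order by severity so the operator sees the most critical items first.
    for r in sorted(reports, key=lambda r: -r["severity"]):
        print(f"[sev {r['severity']}] {r['location']}: {r['message']}")

if __name__ == "__main__":
    feed = [
        {"location": "Sector 4", "message": "Smoke reported", "severity": "medium"},
        {"location": "Sector 4", "message": "Building fire confirmed", "severity": "high"},
        {"location": "Sector 9", "message": "Road blocked", "severity": "low"},
        {"location": "", "message": "corrupted entry"},
    ]
    present(reduce_quantity(enhance_quality(feed)))
```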

    Generation of multi-modal dialogue for a net environment

    In this paper, an architecture and a special-purpose markup language for simulated affective face-to-face communication are presented. In systems based on this architecture, users will be able to watch embodied conversational agents interact with each other in virtual locations on the Internet. The markup language, the Rich Representation Language (RRL), has been designed to provide an integrated representation of speech, gesture, posture, and facial animation.
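
    The abstract does not show RRL's actual syntax, so the sketch below uses a plain Python data model (an assumption for illustration, not RRL itself) to show what an integrated representation aligning speech, gesture, posture, and facial animation within one dialogue turn could look like.

```python
# A minimal sketch of an "integrated representation" for an embodied dialogue
# turn. This is an illustrative data model, not RRL's actual markup.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Behaviour:
    channel: str      # "gesture", "posture", or "face"
    name: str         # e.g. "wave", "lean_forward", "smile"
    start: float      # seconds, relative to the start of the utterance
    duration: float

@dataclass
class Turn:
    agent: str
    speech: str
    behaviours: List[Behaviour] = field(default_factory=list)

# Two embodied agents greeting each other; the timing fields align non-verbal
# behaviour with the spoken text, which is the core idea of an integrated
# representation.
dialogue = [
    Turn("Agent_A", "Hello, nice to meet you.",
         [Behaviour("face", "smile", 0.0, 1.5),
          Behaviour("gesture", "wave", 0.2, 1.0)]),
    Turn("Agent_B", "Likewise!",
         [Behaviour("posture", "lean_forward", 0.0, 0.8),
          Behaviour("face", "smile", 0.1, 1.0)]),
]

for turn in dialogue:
    print(turn.agent, "says:", turn.speech, "with", [b.name for b in turn.behaviours])
```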