14 research outputs found

    User Evaluation of the SYNFACE Talking Head Telephone

    Abstract. The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in both lab and home environments. Synface was found to support the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

    Speech-driven facial animations improve speech-in-noise comprehension of humans

    Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN) and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although natural facial motions yield a still higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
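
    The DTW-based objective measure mentioned above is straightforward to compute. The following is a minimal, illustrative sketch (not the authors' code) of the accumulated dynamic-time-warp cost between a synthesized and a ground-truth sequence of AAM parameter vectors, assuming a Euclidean frame-level distance; the variable names and array shapes are invented.

    # Minimal sketch: DTW cost between synthesized and ground-truth AAM parameter
    # trajectories. A lower cost would indicate a closer match to ground truth.
    import numpy as np

    def dtw_cost(synth: np.ndarray, truth: np.ndarray) -> float:
        """Accumulated cost of the optimal dynamic time warp.

        synth, truth: arrays of shape (n_frames, n_params) holding AAM parameters.
        """
        n, m = len(synth), len(truth)
        # Pairwise Euclidean distances between every synthesized and ground-truth frame.
        dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                     acc[i, j - 1],      # deletion
                                                     acc[i - 1, j - 1])  # match
        return float(acc[n, m])

    # Hypothetical usage with random stand-ins for real parameter trajectories:
    synth_params = np.random.rand(90, 20)    # 90 frames, 20 AAM parameters (invented sizes)
    truth_params = np.random.rand(100, 20)
    print(dtw_cost(synth_params, truth_params))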

    Making Faces - State-Space Models Applied to Multi-Modal Signal Processing


    Multi-modal response generation.

    Wong Ka Ho. Thesis (M.Phil.), Chinese University of Hong Kong, 2006; submitted October 2005. Includes bibliographical references (leaves 163-170). Abstracts in English and Chinese.
    Contents: Abstract; Acknowledgements.
    Chapter 1, Introduction: Multi-modal and Multi-media; Overview; Thesis Goal; Thesis Outline.
    Chapter 2, Background: Multi-modal Fission; Multi-modal Data Collection (Collection Time; Annotation and Tools; Knowledge of Multi-modal Use); Text-to-audiovisual Speech System (Different Approaches to Generate a Talking Head; Sub-tasks in Animating a Talking Head); Modality Selection (Rules-based, Plan-based, Feature-based and Corpus-based Approaches); Summary.
    Chapter 3, Information Domain: Multi-media Information; Task Goals, Dialog Acts, Concepts and Information Type; User's Task and Scenario; Chapter Summary.
    Chapter 4, Multi-modal Response Data Collection: Data Collection Setup (Multi-modal Input Setup; Multi-modal Output Setup); Procedure (Precaution; Recording; Data Size and Type); Annotation (Extensible Multi-Modal Markup Language; Mobile, Multi-biometric and Multi-modal Annotation); Problems in the Wizard-of-Oz Setup (Lack of Knowledge; Time Deficiency; Information Availability; Operation Delay; Lack of Modalities); Data Optimization (Precaution; Procedures; Data Size in Expert Design Responses); Analysis and Discussion (Multi-modal Usage; Modality Combination; Deictic Term; Task Goal and Dialog Acts; Information Type); Chapter Summary.
    Chapter 5, Text-to-Audiovisual Speech System: Phonemes and Visemes; Three-dimensional Facial Animation (3D Face Model; The Blending Process for Animation; Connectivity between Visemes); User Perception Experiments; Applications and Extension (Multilingual Extension and Potential Applications); Talking Head in Multi-modal Dialogue System (Prosody; Body Gesture); Chapter Summary.
    Chapter 6, Modality Selection and Implementation: Multi-modal Response Examples (Single Concept-value Example; Two Concept-values with Different Information Types; Multiple Concept-values with Same Information Types); Heuristic Rules for Modality Selection (General Principles; Heuristic Rules; Temporal Coordination for Synchronization; Physical Layout; Deictic Term; Example); Spoken Content Generation; Chapter Summary.
    Chapter 7, Conclusions and Future Work: Summary; Contributions; Future Work.
    Appendices: A, XML Schema for M3 Markup Language; B, M3ML Examples; C, Domain-Specific Task Goals in the Hong Kong Tourism Domain; D, Dialog Acts for User Request in the Hong Kong Tourism Domain; E, Dialog Acts for System Response in the Hong Kong Tourism Domain; F, Information Type and Concepts; G, Concepts.
    Bibliography.
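
    Chapter 5 of the thesis covers the phoneme-to-viseme mapping and the blending process used to animate the 3D face model. As a toy illustration only (not taken from the thesis), the sketch below maps a few phonemes to viseme classes and linearly cross-fades between consecutive viseme keyframes; the mapping, the timing values and the function names are all invented assumptions.

    # Toy sketch: phoneme-to-viseme lookup and linear cross-fade between keyframes.
    import numpy as np

    PHONEME_TO_VISEME = {              # hypothetical, highly simplified mapping
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "aa": "open", "iy": "spread", "uw": "rounded",
    }

    def blend_weights(visemes, keyframe_times, frame_times):
        """Cross-fade between consecutive viseme keyframes.

        visemes: one viseme name per keyframe.
        keyframe_times: sorted times (s) at which each viseme is fully articulated.
        frame_times: times (s) at which animation frames are rendered.
        Returns a list of (viseme_a, viseme_b, weight_of_b) triples, one per frame.
        """
        out = []
        for t in frame_times:
            i = max(0, min(len(keyframe_times) - 2, int(np.searchsorted(keyframe_times, t)) - 1))
            t0, t1 = keyframe_times[i], keyframe_times[i + 1]
            w = float(np.clip((t - t0) / (t1 - t0), 0.0, 1.0))
            out.append((visemes[i], visemes[i + 1], w))
        return out

    # Hypothetical usage for the syllable /ma/: blend from the bilabial viseme to "open".
    phones = ["m", "aa"]
    vis = [PHONEME_TO_VISEME[p] for p in phones]
    print(blend_weights(vis, keyframe_times=[0.00, 0.12], frame_times=[0.0, 0.04, 0.08, 0.12]))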

    Towards an Integrative Information Society: Studies on Individuality in Speech and Sign

    The flow of information within modern information society has increased rapidly over the last decade. The major part of this information flow relies on the individual’s ability to handle text or speech input. For the majority of us this presents no problem, but some individuals would benefit from other means of conveying information, e.g. signed information. During the last decades, results from various disciplines have pointed towards a common background and processing for sign and speech, and this was one of the key issues I wanted to investigate further in this thesis. The basis of this thesis is firmly within speech research, which is why I wanted to design, for signers, test batteries analogous to widely used speech perception tests – to find out whether the results for signers would be the same as in speakers’ perception tests. One of the key findings within biology – and more precisely its effects on speech and communication research – is the mirror neuron system. That finding has enabled us to form new theories about the evolution of communication, and it all seems to converge on the hypothesis that all human communication has a common core. In this thesis speech and sign are discussed as equal and analogous counterparts of communication, and all research methods used for speech are adapted for sign. Both speech and sign are thus investigated using similar test batteries. Furthermore, both production and perception of speech and sign are studied separately. An additional framework for studying production is given by gesture research using cry sounds. Results of cry-sound research are then compared to results from children acquiring sign language. These results show that individuality manifests itself from very early on in human development. Articulation in adults, both in speech and sign, is studied from two perspectives: normal production and re-learning production when the apparatus has changed. Normal production is studied in both speech and sign, and the effects of changed articulation are studied with regard to speech. Both studies use carrier sentences. Furthermore, sign production is studied by giving the informants the possibility of spontaneous production. The production data from the signing informants is also used as the basis for the sign synthesis stimuli used in the sign perception test battery. Speech and sign perception were studied using the informants’ forced-choice answers in identification and discrimination tasks, and these answers were then compared across language modalities. Three informant groups participated in the sign perception tests: native signers, sign language interpreters, and Finnish adults with no knowledge of any signed language. This made it possible to investigate which characteristics in the results were due to the language per se and which were due to the change in modality itself. As the analogous test batteries yielded similar results over different informant groups, some common threads could be observed. Starting from very early on in acquiring speech and sign, the results were highly individual. However, the results were the same within one individual when the same test was repeated. This individuality manifested along the same patterns across different language modalities and, on some occasions, across language groups.
    As both modalities yield similar answers to analogous study questions, this has led us to provide methods for basic input for sign language applications, i.e. signing avatars. It has also given us answers to questions on the precision of the animation and its intelligibility for the users: what parameters govern the intelligibility of synthesised speech or sign, and how precise must the animation or synthetic speech be in order to be intelligible. The results also lend additional support to the well-known fact that intelligibility is not the same as naturalness. In some cases, as shown within the sign perception test battery design, naturalness decreases intelligibility. This also has to be taken into consideration when designing applications. All in all, results from each of the test batteries, be they for signers or speakers, yield strikingly similar patterns, which provides yet further support for a common core for all human communication. Thus, we can modify and deepen the phonetic framework models for human communication based on the knowledge obtained from the results of the test batteries within this thesis.
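
    As a purely illustrative sketch (the response data below are invented placeholders, not results from the thesis), forced-choice identification answers of the kind described above could be tabulated per informant group like this:

    # Illustrative scoring of forced-choice identification answers per informant group.
    from collections import defaultdict

    # (group, presented stimulus, chosen response) triples; invented example data.
    responses = [
        ("native signer", "sign_A", "sign_A"),
        ("native signer", "sign_B", "sign_A"),
        ("interpreter",   "sign_A", "sign_A"),
        ("non-signer",    "sign_A", "sign_B"),
    ]

    correct, total = defaultdict(int), defaultdict(int)
    for group, presented, chosen in responses:
        total[group] += 1
        correct[group] += int(presented == chosen)

    for group in sorted(total):
        print(f"{group}: {correct[group] / total[group]:.0%} correct over {total[group]} trials")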

    Augmented Reality

    Augmented Reality (AR) is a natural development from virtual reality (VR), which was developed several decades earlier. AR complements VR in many ways. Because the user can see both real and virtual objects simultaneously, AR is far more intuitive, but it is not free from human factors and other restrictions. AR applications also demand less time and effort, because the entire virtual scene and environment do not have to be constructed. In this book, several new and emerging application areas of AR are presented and divided into three sections. The first section contains applications in outdoor and mobile AR, such as construction, restoration, security and surveillance. The second section deals with AR applications in medicine, biology, and the human body. The third and final section contains a number of new and useful applications in daily living and learning.

    Studies on Inequalities in Information Society. Proceedings of the Conference, Well-Being in the Information Society
