529 research outputs found

    A survey on mouth modeling and analysis for Sign Language recognition

    © 2015 IEEE. Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and the use of computers and the Internet. Automatic Sign Language Recognition (ASLR) can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in SL and translation between sign and spoken language. Research in ASLR usually revolves around automatic understanding of manual signs. Recently, the ASLR research community has started to appreciate the importance of non-manuals, since they are related to the lexical meaning of a sign, the syntax, and the prosody. Non-manuals include body and head pose, movement of the eyebrows and the eyes, as well as blinks and squints. Arguably, the mouth is one of the most involved parts of the face in non-manuals. Mouth actions related to ASLR can be either mouthings, i.e. visual syllables produced with the mouth while signing, or non-verbal mouth gestures. Both are very important in ASLR. In this paper, we present the first survey on mouth non-manuals in ASLR. We start by showing why mouth motion is important in SL and the relevant techniques that exist within ASLR. Since limited research has been conducted regarding automatic analysis of mouth motion in the context of ASLR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. Finally, we conclude by presenting the challenges and potentials of automatic analysis of mouth motion in the context of ASLR.
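    A common preprocessing step for the mouth-motion analysis discussed in this survey is isolating the mouth region from a face video. The following is a minimal sketch of that step using dlib's 68-point facial landmark model (points 48-67 cover the lips); the model path and crop margin are illustrative assumptions, not details taken from the surveyed systems.

```python
# Sketch: crop the mouth region of interest from one video frame.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# 68-point landmark model; indices 48-67 are the outer and inner lip contours.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def mouth_roi(frame_bgr, margin=10):
    """Return a cropped mouth patch from one frame, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    return frame_bgr[max(y - margin, 0):y + h + margin,
                     max(x - margin, 0):x + w + margin]
```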

    Computer-based tracking, analysis, and visualization of linguistically significant nonmanual events in American Sign Language (ASL)

    Our linguistically annotated American Sign Language (ASL) corpora have formed a basis for research to automate detection by computer of essential linguistic information conveyed through facial expressions and head movements. We have tracked head position and facial deformations, and used computational learning to discern specific grammatical markings. Our ability to detect, identify, and temporally localize the occurrence of such markings in ASL videos has recently been improved by the incorporation of (1) new techniques for deformable model-based 3D tracking of head position and facial expressions, which provide significantly better tracking accuracy and recover quickly from temporary loss of track due to occlusion; and (2) a computational learning approach incorporating 2-level Conditional Random Fields (CRFs), suited to the multi-scale spatio-temporal characteristics of the data, which analyzes not only low-level appearance characteristics but also the patterns that enable identification of significant gestural components, such as periodic head movements and raised or lowered eyebrows. Here we summarize our linguistically motivated computational approach and the results for detection and recognition of nonmanual grammatical markings; demonstrate our data visualizations and discuss their relevance for linguistic research; and describe work underway to enable such visualizations to be produced over large corpora and shared publicly on the Web.
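    To make the sequence-labeling idea concrete, here is a much-simplified sketch that tags each video frame with a nonmanual-marker label from tracked head/face features using a single linear-chain CRF (sklearn-crfsuite). It is not the paper's 2-level CRF, and the feature names, labels, and toy sequences are invented for illustration.

```python
# Sketch: per-frame labeling of tracked head/brow features with a linear-chain CRF.
import sklearn_crfsuite

def frame_features(track):
    """track: list of per-frame dicts of tracked head pose / brow measurements."""
    feats = []
    for i, f in enumerate(track):
        prev = track[max(i - 1, 0)]
        feats.append({
            "head_pitch": f["head_pitch"],
            "brow_raise": f["brow_raise"],
            "d_head_pitch": f["head_pitch"] - prev["head_pitch"],  # frame-to-frame change
        })
    return feats

# Toy stand-in for tracker output: two short sequences with per-frame labels.
tracks = [
    [{"head_pitch": 0.0, "brow_raise": 0.1}, {"head_pitch": -0.3, "brow_raise": 0.8}],
    [{"head_pitch": 0.1, "brow_raise": 0.0}, {"head_pitch": 0.2, "brow_raise": 0.1}],
]
labels = [["none", "wh-question"], ["none", "none"]]  # hypothetical marker labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([frame_features(t) for t in tracks], labels)
print(crf.predict([frame_features(tracks[0])]))
```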

    Humour production in face-to-face interaction: a multimodal and cognitive study

    Humour is one of the most complex forms of communication that exist (Veale, Brône & Feyaerts, 2015). Among linguistic theories of humour, some take a semantic-pragmatic approach, such as the Semantic Script Theory of Humour (Raskin, 1984) or the General Theory of Verbal Humour (Attardo, 2001). Others are framed within Relevance Theory (Yus, 2016), and there are also theories with a more cognitive perspective (Giora, 1991, 2015; Coulson & Okley, 2005; Veale, Feyaerts & Brône, 2006). In addition, several studies have examined the multimodal markers of irony or sarcasm, with mixed results (Attardo, Eisterhold, Hay, and Poggi, 2003; Attardo, Pickering, and Baker, 2011; Attardo, Wagner, and Urios-Aparisi, 2011). Non-ironic humour, however, has received less attention. Moreover, most analyses are restricted to rehearsed humour, with few studies on spontaneously produced humour (Bryant, 2010; Feyaerts, 2013; Tabacaru, 2014, etc.) and even fewer that combine the multimodal and the cognitive perspectives. This thesis analyses 14 interviews taken from The Late Show with Stephen Colbert in order to explain the spontaneous communication of humour from a multimodal and cognitive point of view. Utterances were identified as humorous when the audience reacted with laughter. The multimodal analysis was carried out in ELAN, with five annotation tiers: transcription, type of humour (Feyaerts et al., 2010), underlying conceptual mechanism (Croft & Cruse, 2004), gesture, and prosody. The prosodic study was conducted in Praat in order to determine whether there was greater prosodic contrast in humorous utterances. The results show that the multimodal and cognitive mechanisms do not differ between humorous and non-humorous utterances. Departamento de Filología Inglesa. Doctorado en Estudios Ingleses Avanzados: Lenguas y Culturas en Contacto.
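    As an illustration of the kind of Praat-based measurement described above, the sketch below computes a per-utterance pitch range and mean intensity with parselmouth (a Python interface to Praat). The measures and the file name are assumptions for illustration, not the thesis's exact protocol or data.

```python
# Sketch: simple prosodic profile (F0 range, mean F0, mean intensity) per utterance.
import numpy as np
import parselmouth

def prosodic_profile(wav_path):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                       # drop unvoiced frames (F0 = 0)
    intensity = snd.to_intensity()
    return {
        "f0_range_hz": float(f0.max() - f0.min()) if f0.size else 0.0,
        "f0_mean_hz": float(f0.mean()) if f0.size else 0.0,
        "mean_intensity_db": float(intensity.values.mean()),
    }

print(prosodic_profile("utterance_001.wav"))  # placeholder file name
```

    Comparing such profiles between utterances that did and did not draw laughter is one straightforward way to test for a prosodic contrast.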

    Primacy of mouth over eyes to perceive audiovisual Mandarin lexical tones

    The visual cues of lexical tones are more implicit and much less investigated than those of consonants and vowels, and it is still unclear which facial areas contribute to the identification of lexical tones. This study investigated Chinese and English speakers’ eye movements when they were asked to identify audiovisual Mandarin lexical tones. The Chinese and English speakers were presented with audiovisual clips of Mandarin monosyllables (for instance, /ă/, /à/, /ĭ/, /ì/) and were asked to identify whether the syllables carried a dipping tone (/ă/, /ĭ/) or a falling tone (/à/, /ì/). These audiovisual syllables were presented in clear, noisy, and silent (absence of audio signal) conditions. An eye-tracker recorded the participants’ eye movements. Results showed that the participants gazed more at the mouth than at the eyes. In addition, when acoustic conditions became adverse, both the Chinese and English speakers increased their gaze duration at the mouth rather than at the eyes. The findings suggest that the mouth is the primary area that listeners utilise in their perception of audiovisual lexical tones. The similar eye movements between the Chinese and English speakers imply that the mouth acts as a perceptual cue that provides articulatory information, as opposed to social and pragmatic information.
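    The gaze-duration comparison reported here amounts to summing fixation time per area of interest (mouth vs. eyes) within each listening condition. A minimal sketch of that tally is shown below; the column names and sample rows are assumed stand-ins for an eye-tracker export, not the study's data.

```python
# Sketch: mean dwell time per AOI and condition from per-fixation records.
import pandas as pd

fixations = pd.DataFrame({
    "participant": ["P01", "P01", "P01", "P02"],
    "condition":   ["clear", "noisy", "noisy", "silent"],
    "aoi":         ["mouth", "mouth", "eyes", "mouth"],
    "duration_ms": [420, 610, 180, 530],
})

# Total dwell per participant/condition/AOI, then averaged over participants.
dwell = (fixations
         .groupby(["participant", "condition", "aoi"])["duration_ms"].sum()
         .groupby(["condition", "aoi"]).mean()
         .unstack("aoi"))
print(dwell)
```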

    Visual prosody in speech-driven facial animation: elicitation, prediction, and perceptual evaluation

    Facial animations capable of articulating accurate movements in synchrony with a speech track have become a subject of much research during the past decade. Most of these efforts have focused on the articulation of lip and tongue movements, since these are the primary sources of information in speech reading. However, a wealth of paralinguistic information is implicitly conveyed through visual prosody (e.g., head and eyebrow movements). In contrast with lip/tongue movements, for which the articulation rules are fairly well known (i.e., viseme-phoneme mappings, coarticulation), little is known about the generation of visual prosody. The objective of this thesis is to explore the perceptual contributions of visual prosody in speech-driven facial avatars. Our main hypothesis is that visual prosody driven by the acoustics of the speech signal, as opposed to random or no visual prosody, results in more realistic, coherent, and convincing facial animations. To test this hypothesis, we have developed an audio-visual system capable of capturing synchronized speech and facial motion from a speaker using infrared illumination and retro-reflective markers. In order to elicit natural visual prosody, a story-telling experiment was designed in which the actors were shown a short cartoon video and subsequently asked to narrate the episode. From these audio-visual data, four different facial animations were generated, articulating no visual prosody, Perlin noise, speech-driven movements, and ground-truth movements. Speech-driven movements were driven by acoustic features of the speech signal (e.g., fundamental frequency and energy) using rule-based heuristics and autoregressive models. A pair-wise perceptual evaluation shows that subjects can clearly discriminate among the four visual prosody animations. It also shows that speech-driven movements and Perlin noise, in that order, approach the performance of veridical motion. The results are quite promising and suggest that speech-driven motion could outperform Perlin noise if more powerful motion prediction models are used. In addition, our results show that exaggeration can bias the viewer to perceive a computer-generated character as having more realistic motion.
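    The speech-driven condition rests on predicting a head-motion value for each frame from acoustic features and the previous pose, i.e. an autoregressive mapping. The sketch below fits a first-order linear AR model by least squares on synthetic placeholder data; it illustrates the idea only and is not the thesis's model or recordings.

```python
# Sketch: first-order autoregressive prediction of head pitch from F0 and energy.
import numpy as np

rng = np.random.default_rng(0)
T = 200
f0 = 120 + 20 * np.sin(np.linspace(0, 6, T)) + rng.normal(0, 2, T)    # Hz (synthetic)
energy = 60 + 5 * np.cos(np.linspace(0, 4, T)) + rng.normal(0, 1, T)  # dB (synthetic)
head_pitch = np.zeros(T)
for t in range(1, T):  # synthetic "ground-truth" head motion loosely tied to F0
    head_pitch[t] = 0.8 * head_pitch[t - 1] + 0.01 * (f0[t] - 120) + rng.normal(0, 0.05)

# Design matrix: previous pose, current F0 and energy, bias term.
X = np.column_stack([head_pitch[:-1], f0[1:], energy[1:], np.ones(T - 1)])
w, *_ = np.linalg.lstsq(X, head_pitch[1:], rcond=None)

# Roll the fitted model forward to synthesize a head-pitch trajectory from audio.
pred = np.zeros(T)
for t in range(1, T):
    pred[t] = w @ np.array([pred[t - 1], f0[t], energy[t], 1.0])
print("fitted weights:", np.round(w, 3))
```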

    A system for recognizing human emotions based on speech analysis and facial feature extraction: applications to Human-Robot Interaction

    With the advance of Artificial Intelligence, humanoid robots have started to interact with ordinary people based on a growing understanding of psychological processes. Accumulating evidence in Human-Robot Interaction (HRI) suggests that research is focusing on establishing emotional communication between humans and robots in order to create social perception, cognition, desired interaction, and sensation. Furthermore, robots need to perceive human emotions and optimize their behavior to help and interact with human beings in various environments. The most natural way to recognize basic emotions is to extract sets of features from human speech, facial expressions, and body gestures. A system for the recognition of emotions based on speech analysis and facial feature extraction can have interesting applications in Human-Robot Interaction. The Human-Robot Interaction ontology thus explains how knowledge from several fundamental disciplines is applied in this context: physics (sound analysis), mathematics (face detection and perception), philosophy (behavior theory), and robotics. In this project, we carry out a study to recognize basic emotions (sadness, surprise, happiness, anger, fear, and disgust). We also propose a methodology and a software program for the classification of emotions based on speech analysis and facial feature extraction. The speech analysis phase investigated the appropriateness of using acoustic (pitch value, pitch peak, pitch range, intensity, and formant) and phonetic (speech rate) properties of emotive speech with the freeware program PRAAT, and consists of generating and analyzing a graph of speech signals. The proposed architecture investigated the appropriateness of analyzing emotive speech with minimal use of signal processing algorithms. The 30 participants in the experiment had to repeat five sentences in English (with durations typically between 0.40 s and 2.5 s) in order to extract data relative to pitch (value, range, and peak) and rising-falling intonation. Pitch alignments (peak, value, and range) were evaluated, and the results were compared with intensity and speech rate. The facial feature extraction phase uses a mathematical formulation (Bézier curves) and geometric analysis of the facial image, based on measurements of a set of Action Units (AUs), to classify the emotion. The proposed technique consists of three steps: (i) detecting the facial region within the image, (ii) extracting and classifying the facial features, and (iii) recognizing the emotion. The new data were then merged with reference data in order to recognize the basic emotion. Finally, we combined the two proposed algorithms (speech analysis and facial expression) to design a hybrid technique for emotion recognition. This technique has been implemented in a software program, which can be employed in Human-Robot Interaction. The efficiency of the methodology was evaluated through experimental tests on 30 individuals (15 female and 15 male, 20 to 48 years old) from different ethnic groups, namely: (i) ten adult Europeans, (ii) ten adult Asians (Middle Eastern), and (iii) ten adult Americans. Ultimately, the proposed technique made it possible to recognize the basic emotion in most cases.
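    To illustrate the Bézier-curve idea for the facial feature step, the sketch below evaluates a cubic Bézier through four lip-contour control points and derives two simple geometric measurements that could feed an AU-style classifier. The control-point coordinates are invented, and this is not the authors' exact formulation.

```python
# Sketch: cubic Bézier through lip control points plus simple geometric measures.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points on the cubic Bézier curve defined by four 2D control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Hypothetical upper-lip control points (pixel coordinates): corners and two mid points.
p0, p1, p2, p3 = (np.array(p, dtype=float)
                  for p in [(100, 200), (120, 185), (150, 185), (170, 200)])
curve = cubic_bezier(p0, p1, p2, p3)

mouth_width = np.linalg.norm(p3 - p0)
lip_arch = p0[1] - curve[:, 1].min()   # how far the curve rises above the mouth corners
print(f"mouth width: {mouth_width:.1f}px, lip arch: {lip_arch:.1f}px")
```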

    Expression of Aboutness Subject Topic Constructions in Turkish Sign Language (TİD) Narratives

    In the visual-spatial modality, signers indicate old, new, or contrastive information using certain syntactic, prosodic, and morphological strategies. Even though information structure has been described extensively for many sign languages, the flow of information in the narrative discourse remains unexplored in Turkish Sign Language (TİD). This study aims to describe aboutness subject topic constructions in TİD narratives. We examined data from six adult native signers of TİD and found that TİD signers mainly used nominals for reintroduced aboutness subject topics. The optional and rare non-manual markers observed on reintroduced topics mainly included squint, brow raise, and backward head tilt. Maintained aboutness subject topics, which have higher referential accessibility, were often omitted and tracked with zero anaphora. Finally, we found that constructed action is more frequently present on the predicates of clauses with a maintained aboutness subject topic than with a reintroduced aboutness subject topic. Overall, these results indicate that the use of constructed action and nominals in aboutness subject topics correlates with referential accessibility in TİD. While the former has been observed more in maintained contexts, the latter has been observed mainly in reintroduced contexts. In addition to the syntactic and prosodic cues that may distinguish old information from new or contrastive information in narratives, we suggest that pragmatic cues such as referential accessibility may help account for the manual and non-manual articulation strategies for information structure in TİD narratives.
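    The reported pattern rests on tallying annotated clauses by topic status. A tiny sketch of that kind of tally is given below; the rows are invented placeholders, not the TİD annotations.

```python
# Sketch: cross-tabulate referring-expression type and constructed action by topic status.
import pandas as pd

clauses = pd.DataFrame({
    "topic_status":          ["reintroduced", "maintained", "maintained", "reintroduced"],
    "referring_expression":  ["nominal", "zero", "zero", "nominal"],
    "constructed_action":    [False, True, True, False],
})

print(pd.crosstab(clauses["topic_status"], clauses["referring_expression"]))
print(clauses.groupby("topic_status")["constructed_action"].mean())  # proportion with CA
```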