55 research outputs found
Prosody Based Co-analysis for Continuous Recognition of Coverbal Gestures
Although speech and gesture recognition have been studied extensively, successful attempts to combine them in a unified framework have been semantically motivated, e.g., by keyword-gesture co-occurrence, and such formulations inherit the complexity of natural language processing. This paper presents a Bayesian formulation that exploits the co-articulation of gesture and speech to improve the accuracy of automatic recognition of continuous coverbal gestures. Prosodic features from the speech signal were co-analyzed with the visual signal to learn the prior probability of co-occurrence of prominent spoken segments with particular kinematic phases of gestures. This co-analysis was found to help in detecting and disambiguating visually small gestures, which in turn improves the rate of continuous gesture recognition. The efficacy of the proposed approach was demonstrated on a large database collected from Weather Channel broadcasts. The formulation opens new avenues for bottom-up frameworks of multimodal integration.
Comment: Alternative see:
http://vision.cse.psu.edu/kettebek/academ/publications.ht
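The abstract describes learning a prior for the co-occurrence of prosodically prominent speech segments with gesture phases, then using it in a Bayesian re-scoring step. A minimal sketch of that idea follows; the phase labels, toy data, and function names are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

# Toy aligned observations: (gesture_phase, speech_segment_is_prominent)
# assumed for illustration; the paper learns these from audio-visual data.
observations = [
    ("stroke", True), ("stroke", True), ("hold", False),
    ("preparation", False), ("stroke", False), ("retraction", False),
    ("hold", True), ("stroke", True), ("preparation", False),
]

# Estimate P(prominent | phase) from co-occurrence counts.
phase_counts = Counter(phase for phase, _ in observations)
prominent_counts = Counter(phase for phase, prom in observations if prom)
p_prom_given_phase = {ph: prominent_counts[ph] / n for ph, n in phase_counts.items()}

def rescore(phase_scores, prominent):
    """Bayesian re-scoring: weight each visual phase score by the learned
    prosodic co-occurrence prior, then renormalize."""
    post = {
        ph: s * (p_prom_given_phase.get(ph, 0.0) if prominent
                 else 1.0 - p_prom_given_phase.get(ph, 0.0))
        for ph, s in phase_scores.items()
    }
    z = sum(post.values()) or 1.0
    return {ph: p / z for ph, p in post.items()}

# A visually ambiguous (small) gesture: near-uniform visual scores.
visual_scores = {"stroke": 0.3, "hold": 0.25, "preparation": 0.25, "retraction": 0.2}
print(rescore(visual_scores, prominent=True))  # prosody now favors "stroke"
```

The point of the sketch is the direction of information flow: prosodic prominence sharpens an otherwise ambiguous visual hypothesis, which matches the paper's claim about disambiguating visually small gestures.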
Prosody and Kinesics Based Co-analysis Towards Continuous Gesture Recognition
The aim of this study is to develop a multimodal co-analysis framework for continuous gesture recognition by exploiting prosodic and kinesic manifestations of natural communication. Using this framework, a co-analysis pattern between correlating components is obtained. The co-analysis pattern is clustered using K-means clustering to determine how well the pattern distinguishes the gestures. Features that differentiate the proposed approach from other models are its lower susceptibility to idiosyncrasies, its scalability, and its simplicity. The experiment was performed on the Multimodal Annotated Gesture Corpus (MAGEC) that we created for research on understanding non-verbal communication, particularly gestures.
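A minimal sketch of the clustering step described above, assuming scikit-learn and toy prosody-kinesics feature vectors (both the features and the two-group structure are illustrative assumptions, not the MAGEC data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy co-analysis pattern: each row pairs a prosodic feature (e.g., a pitch
# prominence measure) with a kinesic feature (e.g., a hand-velocity peak),
# drawn here as two loose groups standing in for two gesture types.
features = np.vstack([
    rng.normal(loc=[0.2, 0.3], scale=0.05, size=(20, 2)),
    rng.normal(loc=[0.8, 0.7], scale=0.05, size=(20, 2)),
])

# Cluster the pattern; well-separated clusters suggest that the co-analysis
# pattern distinguishes the gestures.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(km.cluster_centers_)
print(km.labels_[:5], km.labels_[-5:])
```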
Language as a complex system: the case of phonetic variability
Modern linguistic theories try to give an exhaustive explanation of how language works. In this perspective, each linguistic domain, such as phonetics, phonology, syntax, pragmatics, etc., is described by means of a set of rules or properties (in other words, a grammar). However, linguists have long observed that it is not possible to give a precise description of a real-life utterance within a unique domain. We illustrate this problem with the case of phonetic variability and show how different domains interact. We then propose a two-level architecture in which domain interaction is implemented by means of constraints.
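A minimal sketch of what constraint-mediated domain interaction could look like, assuming a toy representation in which each domain contributes properties of an utterance and cross-domain constraints relate them (everything here, including the example constraint, is an illustrative assumption, not the authors' formalism):

```python
# Each domain-level analysis contributes properties of the utterance.
analysis = {
    "phonetics": {"vowel_reduced": True},
    "pragmatics": {"register": "casual"},
    "syntax": {"clause_type": "declarative"},
}

# Cross-domain constraints: each checks a relation between two domains.
# E.g., strong vowel reduction is licensed only by a casual register.
constraints = [
    ("reduction_licensed_by_register",
     lambda a: (not a["phonetics"]["vowel_reduced"])
               or a["pragmatics"]["register"] == "casual"),
]

for name, check in constraints:
    print(name, "satisfied" if check(analysis) else "violated")
```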
A review of temporal aspects of hand gesture analysis applied to discourse analysis and natural conversation
Lately, there has been an increasing interest in hand gesture analysis systems. Recent works have employed pattern recognition techniques and have focused on the development of systems with more natural user interfaces. These systems may use gestures to control interfaces or recognize sign language gestures, which can provide systems with multimodal interaction; or consist in multimodal tools to help psycholinguists to understand new aspects of discourse analysis and to automate laborious tasks. Gestures are characterized by several aspects, mainly by movements and sequence of postures. Since data referring to movements or sequences carry temporal information, this paper presents a literature review about temporal aspects of hand gesture analysis, focusing on applications related to natural conversation and psycholinguistic analysis, using Systematic Literature Review methodology. In our results, we organized works according to type of analysis, methods, highlighting the use of Machine Learning techniques, and applications. FAPESP 2011/04608-
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient
embodied human communication. The automatic generation of such co-speech
gestures is a long-standing problem in computer animation and is considered an
enabling technology in film, games, virtual social spaces, and for interaction
with social robots. The problem is made challenging by the idiosyncratic and
non-periodic nature of human co-speech gesture motion, and by the great
diversity of communicative functions that gestures encompass. Gesture
generation has seen surging interest recently, owing to the emergence of more
and larger datasets of human gesture motion, combined with strides in
deep-learning-based generative models that benefit from the growing
availability of data. This review article summarizes co-speech gesture
generation research, with a particular focus on deep generative models. First,
we articulate the theory describing human gesticulation and how it complements
speech. Next, we briefly discuss rule-based and classical statistical gesture
synthesis, before delving into deep learning approaches. We employ the choice
of input modalities as an organizing principle, examining systems that generate
gestures from audio, text, and non-linguistic input. We also chronicle the
evolution of the related training data sets in terms of size, diversity, motion
quality, and collection method. Finally, we identify key research challenges in
gesture generation, including data availability and quality; producing
human-like motion; grounding the gesture in the co-occurring speech, in
interaction with other speakers, and in the environment; performing gesture
evaluation; and integration of gesture synthesis into applications. We
highlight recent approaches to tackling the various key challenges, as well as
the limitations of these approaches, and point toward areas of future
development.
Comment: Accepted for EUROGRAPHICS 202
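The survey's organizing principle is the input modality. As a minimal sketch of the audio-driven case, here is a deterministic sequence-to-sequence regression baseline rather than a full generative model; the architecture, feature sizes, and class name are illustrative assumptions, not any specific system from the review:

```python
import torch
import torch.nn as nn

class AudioToGesture(nn.Module):
    """Toy regressor: per-frame audio features in, per-frame skeletal
    pose parameters (e.g., joint rotations) out."""
    def __init__(self, n_audio_feats=26, n_pose_dofs=45, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_audio_feats, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_pose_dofs)

    def forward(self, audio):          # audio: (batch, frames, n_audio_feats)
        h, _ = self.encoder(audio)     # (batch, frames, hidden)
        return self.decoder(h)         # (batch, frames, n_pose_dofs)

model = AudioToGesture()
mfcc = torch.randn(2, 100, 26)         # two 100-frame clips of MFCC-like features
poses = model(mfcc)
print(poses.shape)                     # torch.Size([2, 100, 45])
```

The review's central challenge, the non-deterministic mapping from speech to gesture, is exactly what this baseline ignores; the deep generative models it surveys (e.g., VAEs, normalizing flows, diffusion models) replace the plain decoder with a sampled output distribution.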
Gesture and Speech in Interaction - 4th edition (GESPIN 4)
The fourth edition of Gesture and Speech in Interaction (GESPIN) was held in Nantes, France. With more than 40 papers, these proceedings show just what a flourishing field of enquiry gesture studies continues to be. The keynote speeches of the conference addressed three different aspects of multimodal interaction: gesture and grammar, gesture acquisition, and gesture and social interaction. In a talk entitled Qualities of event construal in speech and gesture: Aspect and tense, Alan Cienki presented an ongoing research project on narratives in French, German and Russian, a project that focuses especially on the verbal and gestural expression of grammatical tense and aspect in narratives in the three languages. Jean-Marc Colletta's talk, entitled Gesture and Language Development: towards a unified theoretical framework, described the joint acquisition and development of speech and early conventional and representational gestures. In Grammar, deixis, and multimodality between code-manifestation and code-integration or why Kendon's Continuum should be transformed into a gestural circle, Ellen Fricke proposed a revisited grammar of noun phrases that integrates gestures as part of the semiotic and typological codes of individual languages. From a pragmatic and cognitive perspective, Judith Holler explored the use of gaze and hand gestures as means of organizing turns at talk as well as establishing common ground in a presentation entitled On the pragmatics of multi-modal face-to-face communication: Gesture, speech and gaze in the coordination of mental states and social interaction.
Among the talks and posters presented at the conference, the vast majority of topics related, quite naturally, to gesture and speech in interaction - understood both in terms of mapping of units in different semiotic modes and of the use of gesture and speech in social interaction. Several presentations explored the effects of impairments (such as diseases or the natural ageing process) on gesture and speech. The communicative relevance of gesture and speech and audience-design in natural interactions, as well as in more controlled settings like television debates and reports, was another topic addressed during the conference. Some participants also presented research on first and second language learning, while others discussed the relationship between gesture and intonation. While most participants presented research on gesture and speech from an observer's perspective, be it in semiotics or pragmatics, some nevertheless focused on another important aspect: the cognitive processes involved in language production and perception. Last but not least, participants also presented talks and posters on the computational analysis of gestures, whether involving external devices (e.g. mocap, kinect) or concerning the use of specially-designed computer software for the post-treatment of gestural data. Importantly, new links were made between semiotics and mocap data.
Semi-automation of gesture annotation by machine learning and human collaboration
Gesture and multimodal communication researchers typically annotate video data manually, even though this can be a very time-consuming task. In the present work, a method to detect gestures is proposed as a fundamental step towards a semi-automatic gesture annotation tool. The proposed method can be applied to RGB videos and requires annotations of part of a video as input. The technique deploys a pose estimation method and active learning. In the experiment, it is shown that if about 27% of the video is annotated, the remaining parts of the video can be annotated automatically with an F-score of at least 0.85. Users can run this tool with a small number of annotations first. If the predicted annotations for the remainder of the video are not satisfactory, users can add further annotations and run the tool again. The code has been released so that other researchers and practitioners can use the results of this research. This tool has been confirmed to work in conjunction with ELAN.
Ienaga, Naoto; Cravotta, Alice; Terayama, Kei; Scotney, Bryan W.; Saito, Hideo; Busà, M. Grazia
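A minimal sketch of the annotate-predict-review loop described above, assuming pose keypoints have already been extracted per frame and using a generic uncertainty-sampling scheme (the classifier choice, thresholds, and simulated labels are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy per-frame pose features and binary labels (1 = gesturing). In the
# real tool, features come from a pose estimator and labels from the
# user's partial manual annotation of the video.
X = rng.normal(size=(1000, 34))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

labeled = list(range(100))            # frames the user has annotated so far
unlabeled = list(range(100, 1000))

for round_ in range(3):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[unlabeled])[:, 1]
    uncertainty = np.abs(proba - 0.5)  # closest to 0 = least confident
    # Ask the user to annotate the most uncertain frames next.
    query = [unlabeled[i] for i in np.argsort(uncertainty)[:50]]
    labeled += query                   # user supplies labels (simulated here)
    unlabeled = [i for i in unlabeled if i not in set(query)]
    print(f"round {round_}: {len(labeled)} frames labeled")
```

In the paper's workflow, the loop terminates when the user is satisfied with the predicted annotations for the remaining frames, which are then exported for review in ELAN.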
Comprehension in-situ: how multimodal information shapes language processing
The human brain supports communication in dynamic face-to-face environments where spoken words are embedded in linguistic discourse and accompanied by multimodal cues, such as prosody, gestures and mouth movements. However, we only have limited knowledge of how these multimodal cues jointly modulate language comprehension. In a series of behavioural and EEG studies, we investigated the joint impact of these cues when processing naturalistic-style materials. First, we built a mouth informativeness corpus of English words, to quantify the mouth informativeness of a large number of words used in the following experiments. Then, across two EEG studies, we found and replicated that native English speakers use multimodal cues and that their interactions dynamically modulate the N400 amplitude elicited by words that are less predictable in the discourse context (indexed by surprisal values per word). We then extended the findings to second language comprehenders, finding that multimodal cues modulate L2 comprehension, just like in L1, but to a lesser extent, although L2 comprehenders benefit more from meaningful gestures and mouth movements. Finally, in two behavioural experiments investigating whether multimodal cues jointly modulate the learning of new concepts, we found some evidence that the presence of iconic gestures improves memory, and that the effect may be larger if information is also presented with prosodic accentuation. Overall, these findings suggest that real-world comprehension uses all cues present and weights them differently in a dynamic manner. Multimodal cues should therefore not be neglected in language studies. Investigating communication in naturalistic contexts containing more than one cue can provide new insight into our understanding of language comprehension in the real world.
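The EEG analyses above index word predictability by per-word surprisal. A minimal sketch of computing surprisal with a pretrained language model, assuming the Hugging Face transformers package and GPT-2 (the model choice is an illustrative assumption, not necessarily what this work used):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "She sliced the bread with a knife"
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits            # (1, seq_len, vocab_size)

# Surprisal of token t: -log2 P(token_t | tokens_<t).
log_probs = torch.log_softmax(logits, dim=-1)
for t in range(1, ids.shape[1]):
    lp = log_probs[0, t - 1, ids[0, t]]   # prob of token t from position t-1
    token = tokenizer.decode(ids[0, t])
    print(f"{token!r}: {(-lp / torch.log(torch.tensor(2.0))).item():.2f} bits")
```

Higher surprisal marks words that are less predictable in context, which is where the N400 modulation by gestures and mouth movements was observed.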