
    Predicting Head Pose in Dyadic Conversation

    Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to how acceptable we, as human observers, find an animation. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek features from the speech mode to predict head pose. Several previous authors have shown that such prediction is possible, but experiments are typically confined to rigidly produced dialogue. Expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. People involved in dyadic conversation adapt their speech and head motion in response to the other's speech and head motion. Using Deep Bi-Directional Long Short Term Memory (BLSTM) neural networks, we demonstrate that it is possible to predict not just the head motion of the speaker, but also the head motion of the listener, from the speech signal.
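    The abstract names a deep BLSTM as the model; the overall framing (a bidirectional recurrent regressor mapping per-frame acoustic features to head-pose angles) can be sketched as follows. The feature dimensionality, hidden size, and plain-tanh recurrence are illustrative assumptions, simplified from the LSTM units the paper actually uses, and the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(X, Wx, Wh):
    """One recurrent pass over a (T, F) feature sequence; returns (T, H) hidden states."""
    h = np.zeros(Wh.shape[0])
    out = np.empty((X.shape[0], Wh.shape[0]))
    for t, x in enumerate(X):
        h = np.tanh(x @ Wx + h @ Wh)   # simple tanh recurrence (LSTM cells in the paper)
        out[t] = h
    return out

def bidirectional_predict(X, p):
    """Concatenate forward and backward hidden states, then project to 3 pose angles."""
    fwd = rnn_pass(X, p["Wx_f"], p["Wh_f"])
    bwd = rnn_pass(X[::-1], p["Wx_b"], p["Wh_b"])[::-1]   # backward pass sees the future
    return np.concatenate([fwd, bwd], axis=1) @ p["Wout"]

F, H = 13, 8   # e.g. 13 MFCC-like features per frame, 8 hidden units per direction
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in [
    ("Wx_f", (F, H)), ("Wh_f", (H, H)),
    ("Wx_b", (F, H)), ("Wh_b", (H, H)),
    ("Wout", (2 * H, 3)),              # outputs: pitch, yaw, roll per frame
]}

speech = rng.normal(size=(50, F))      # 50 frames of random stand-in acoustic features
pose = bidirectional_predict(speech, params)   # shape (50, 3): one pose per frame
```

    A real system would train these weights on paired audio and motion-capture data; the point here is only the shape of the problem: a sequence of speech frames in, a sequence of 3-DoF head poses out, with each prediction conditioned on both past and future context.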

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept, versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
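    The lag analysis behind the last finding can be illustrated on synthetic data. The sketch below builds a shared "articulatory" drive signal in which the mouth-area trace leads the amplitude envelope by 200 ms (a value inside the 100–300 ms window reported above), then recovers that lead by scanning for the lag of maximum correlation. The sampling rate, noise levels, and signal construction are all illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 100                       # assumed 100 Hz frame rate for both signals
n = 1000                       # 10 s of frames
lead = int(0.2 * fs)           # mouth leads voice by 200 ms in this synthetic example

# Shared non-periodic "articulatory" drive: smoothed white noise.
base = np.convolve(rng.normal(size=n + lead), np.ones(10) / 10, mode="same")
mouth = base[lead:] + 0.1 * rng.normal(size=n)   # mouth area sees the drive early
voice = base[:n] + 0.1 * rng.normal(size=n)      # envelope sees it 200 ms later

def peak_lag(a, b, max_lag):
    """Lag k (in samples) maximizing corr(a[t], b[t + k])."""
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.corrcoef(a[max(0, -k):len(a) - max(0, k)],
                          b[max(0, k):len(b) - max(0, -k)])[0, 1] for k in lags]
    return lags[int(np.argmax(scores))]

lag_ms = 1000 * peak_lag(mouth, voice, 40) / fs   # recovers roughly +200 ms
```

    A positive recovered lag means the voice trails the mouth, matching the direction of the mouth-leads-voice asymmetry described in the abstract.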

    Monkeys and Humans Share a Common Computation for Face/Voice Integration

    Speech production involves the movement of the mouth and other regions of the face, resulting in visual motion cues. These visual cues enhance the intelligibility and detection of auditory speech. As such, face-to-face speech is fundamentally a multisensory phenomenon. If speech is fundamentally multisensory, it should be reflected in the evolution of vocal communication: similar behavioral effects should be observed in other primates. Old World monkeys share vocal production biomechanics with humans and communicate face-to-face with vocalizations. It is unknown, however, whether they, too, combine faces and voices to enhance their perception of vocalizations. We show that they do: monkeys combine faces and voices in noisy environments to enhance their detection of vocalizations. Their behavior parallels that of humans performing an identical task. We explored what common computational mechanism(s) could explain the pattern of results we observed across species. Standard explanations or models, such as the principle of inverse effectiveness and a “race” model, failed to account for their behavior patterns. Conversely, a “superposition model”, positing the linear summation of activity patterns in response to the visual and auditory components of vocalizations, served as a straightforward but powerful explanatory mechanism for the observed behaviors in both species. As such, it represents a putative homologous mechanism for integrating faces and voices across primates.
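    The race and superposition accounts contrasted above can be sketched as toy evidence-accumulation simulations: the race model responds when the faster of two independent unisensory detectors finishes, while the superposition model linearly sums the visual and auditory inputs into a single accumulator. The rates, threshold, and noise level below are illustrative assumptions, not parameters from the study.

```python
import numpy as np

rng = np.random.default_rng(2)

def detection_time(rate, threshold=50.0, noise=0.5, max_t=5000):
    """Steps for a noisy accumulator driven at `rate` per step to reach `threshold`."""
    evidence, t = 0.0, 0
    while evidence < threshold and t < max_t:
        evidence += rate + noise * rng.normal()
        t += 1
    return t

def race_rt(rate_v, rate_a):
    """Race model: respond as soon as either unisensory detector finishes."""
    return min(detection_time(rate_v), detection_time(rate_a))

def superposition_rt(rate_v, rate_a):
    """Superposition model: linearly sum both inputs into one accumulator."""
    return detection_time(rate_v + rate_a)

trials = 500
race = np.mean([race_rt(0.2, 0.2) for _ in range(trials)])
superpos = np.mean([superposition_rt(0.2, 0.2) for _ in range(trials)])
# Summing the audiovisual evidence reaches threshold faster than the best
# single channel, mirroring the multisensory speed-up seen in both species.
```

    The qualitative signature that separates the models is that superposition predicts a larger multisensory speed-up than the statistical facilitation a race between independent channels can produce.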

    The Manipulative Complexity of Lower Paleolithic Stone Toolmaking

    Early stone tools provide direct evidence of human cognitive and behavioral evolution that is otherwise unavailable. Proper interpretation of these data requires a robust interpretive framework linking archaeological evidence to specific behavioral and cognitive actions. Here we employ a data glove to record manual joint angles in a modern experimental toolmaker (the 4th author) replicating ancient tool forms, in order to characterize and compare the manipulative complexity of two major Lower Paleolithic technologies (Oldowan and Acheulean). To this end we used a principled and general measure of behavioral complexity based on the statistics of joint movements. This allowed us to confirm that previously observed differences in brain activation associated with Oldowan versus Acheulean technologies reflect higher-level behavioral organization rather than lower-level differences in manipulative complexity. This conclusion is consistent with a scenario in which the earliest stages of human technological evolution depended on novel perceptual-motor capacities (such as the control of joint stiffness), whereas later developments increasingly relied on enhanced mechanisms for cognitive control. This further suggests possible links between toolmaking and language evolution.
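    The abstract does not give the form of its complexity measure, so the sketch below uses one plausible stand-in: the Shannon entropy of quantized joint-angle configurations, which rises as hand postures become more varied. The bin size, joint count, and synthetic posture data are all illustrative assumptions, not the paper's actual measure or recordings.

```python
import numpy as np

def complexity_bits(joint_angles, bin_deg=15):
    """Shannon entropy (bits) of quantized joint-angle configurations.

    joint_angles: array of shape (T, n_joints), in degrees.
    """
    codes = np.floor_divide(joint_angles, bin_deg).astype(int)  # quantize each joint
    _, counts = np.unique(codes, axis=0, return_counts=True)    # count configurations
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(3)
stereotyped = rng.normal(90.0, 5.0, size=(500, 4))   # tightly clustered postures
varied = rng.uniform(0.0, 180.0, size=(500, 4))      # widely varying postures
# More varied postures occupy more configuration states and so score higher
# entropy; on a measure of this family, a more manipulatively complex
# technology would be expected to do the same.
```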

    Real-time Visual Prosody for Interactive Virtual Agents

    Speakers accompany their speech with incessant, subtle head movements. It is important to implement such “visual prosody” in virtual agents, not only to make their behavior more natural, but also because it has been shown to help listeners understand speech. We contribute a visual prosody model for interactive virtual agents that are to have live, non-scripted interactions with humans, and thus must use Text-To-Speech rather than recorded speech. We present our method for creating visual prosody online from continuous TTS output, and we report results from three crowdsourcing experiments carried out to see if, and to what extent, it can help enhance the interaction experience with an agent.