This thesis concerns the form-meaning mapping of multimodal communicative actions
consisting of speech signals and improvised co-speech gestures, produced spontaneously
with the hands. The interaction between speech and speech-accompanying
gestures has standardly been addressed from a cognitive perspective, to establish the
cognitive mechanisms underlying synchronous speech and gesture production,
and from a computational perspective, to build computer systems that communicate
through multiple modalities.
Building on the findings of this previous research, we advance a new theory in which
the mapping from the form of the combined speech-and-gesture signal to its meaning is
analysed in a constraint-based multimodal grammar. We propose several construction
rules governing multimodal well-formedness, which we motivate empirically through an extensive
and detailed corpus study. In particular, the construction rules use the prosody,
syntax and semantics of speech, the form and meaning of the gesture signal, as well
as the temporal performance of speech relative to that of gesture, to constrain the
derivation of a single multimodal syntax tree, which in turn determines a meaning
representation via standard mechanisms for semantic composition.
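As a purely illustrative sketch (not a rule of the grammar developed in the thesis), such a constraint might be schematised as licensing the attachment of a gesture $G$ to a spoken phrase $S$ only when their performances temporally overlap and $S$ is prosodically prominent:
\[
\frac{S : \sigma \qquad G : \gamma \qquad \mathit{overlap}(S,G) \qquad \mathit{prominent}(S)}{[\,S\ G\,] : \sigma \oplus \gamma}
\]
where $\sigma \oplus \gamma$ stands for the composition of the spoken and gestural meanings; the actual rules are stated over richer prosodic, syntactic and semantic features than this schema shows.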
Gestural form often underspecifies its meaning, and so the output of our grammar
consists of underspecified logical formulae that support the range of possible interpretations of
the multimodal act in its final context-of-use, given current models of the semantics/
pragmatics interface.
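To give a constructed illustration (not an example from the corpus), a circular hand movement performed while uttering ‘the window’ could depict the window's shape or trace its location; rather than committing to one reading, the grammar would output a formula along the lines of
\[
\exists x\,\big(\mathit{window}(x) \wedge P_{\mathit{gesture}}(x)\big)
\]
where $P_{\mathit{gesture}}$ is an underspecified predicate whose resolution, e.g. to $\mathit{round}(x)$ or to $\mathit{located}(x,l)$, is left to the context-of-use.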
It is standardly held in the gesture community that the co-expressivity of speech and
gesture is determined by their temporal co-occurrence: that is, a gesture
signal is semantically related to the speech signal that occurred at the same time as
the gesture. Whereas this is usually taken for granted, we propose a methodology for
establishing, in a systematic and domain-independent way, which spoken element(s) a
gesture can be semantically related to, based on their form, so as to yield a meaning
representation that supports the intended interpretation(s) in context. The ‘semantic’
alignment of speech and gesture is thus driven not by temporal co-occurrence
alone, but also by the linguistic properties of the speech signal the gesture overlaps with.
In so doing, we contribute a fine-grained system for articulating the form-meaning
mapping of multimodal actions that uses standard methods from linguistics.
We show that just as language exhibits ambiguity in both form and meaning, so do
multimodal actions: for instance, the integration of gesture is not restricted to a unique
speech phrase; rather, speech and gesture can be aligned in multiple multimodal syntax
trees, thus yielding distinct meaning representations. These multiple mappings
stem from the fact that the meaning derived from gesture form is highly incomplete,
even in context. An overall challenge is thus to account for the range of possible
interpretations of the multimodal action in context using standard methods from linguistics
for syntactic derivation and semantic composition.
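For a constructed illustration of such ambiguity (again, not drawn from the corpus), consider a gesture depicting a large square performed across the utterance ‘he put the painting on the wall’: the gesture could compose with ‘the painting’ or with ‘the wall’, schematically
\[
[\,\text{he put}\ [\,\text{the painting}\,]_{+G}\ \text{on the wall}\,] \qquad \text{vs.} \qquad [\,\text{he put the painting on}\ [\,\text{the wall}\,]_{+G}\,]
\]
where $+G$ marks the phrase with which the gesture is aligned; each alignment yields a distinct syntax tree and hence a distinct meaning representation (the square depicting the painting's shape versus the wall's extent).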