Towards responsive Sensitive Artificial Listeners
This paper describes work in the recently started SEMAINE project, which aims to build a set of Sensitive Artificial Listeners: conversational agents designed to sustain an interaction with a human user despite limited verbal skills, through robust real-time recognition and generation of non-verbal behaviour, both while the agent is speaking and while it is listening. We report on data collection and on the design of a system architecture geared towards real-time responsiveness.
Integration of multimodal data based on surface registration
The paper proposes and evaluates a strategy for the alignment of anatomical and functional data of the brain. The method takes as input two different sets of images of the same patient: MR data and SPECT. It proceeds in four steps: first, it constructs two voxel models from the two image sets; next, it extracts from the two voxel models the surfaces of regions of interest; in the third step, the surfaces are interactively aligned by corresponding pairs; finally, a unique volume model is constructed by selectively applying the geometrical transformations associated with the regions and weighting their contributions. The main advantages of this strategy are (i) that it can be applied retrospectively, (ii) that it is three-dimensional, and (iii) that it is local. Its main disadvantage with regard to previously published methods is that it requires the extraction of surfaces. However, this step is often required for other stages of multimodal analysis, such as visualization, and its cost can therefore be accounted for in the global cost of the process.
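As a rough sketch of what the final fusion step might look like, the snippet below blends per-region rigid transforms into a single deformation by inverse-distance weighting. Everything here is an illustrative assumption, not the paper's implementation: the function name `blend_transforms` is hypothetical, and region centroids stand in for the true distance-to-surface maps a real system would derive from the extracted surfaces.

```python
import numpy as np

def blend_transforms(points, transforms, region_centroids, eps=1e-6):
    """Fuse per-region rigid transforms into one weighted deformation.

    points           : (N, 3) voxel coordinates to map
    transforms       : list of (4, 4) homogeneous matrices, one per region
    region_centroids : (R, 3) proxy for distance to each region's surface
                       (a hypothetical simplification of the paper's setup)
    """
    homog = np.hstack([points, np.ones((len(points), 1))])          # (N, 4)
    # Displacement each regional transform would apply to each point.
    disp = np.stack([(homog @ T.T)[:, :3] - points for T in transforms])  # (R, N, 3)
    # Inverse-distance weights: nearby regions dominate, so each
    # transform acts locally, as the abstract describes.
    d = np.linalg.norm(points[None, :, :] - region_centroids[:, None, :], axis=-1)
    w = 1.0 / (d + eps)
    w /= w.sum(axis=0, keepdims=True)                               # normalize over regions
    return points + (w[..., None] * disp).sum(axis=0)
```

The inverse-distance weighting is one plausible reading of "weighting their contributions"; it makes the blended field reduce to each region's own transform near that region's surface while varying smoothly in between.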
Learning Multi-Modal Word Representation Grounded in Visual Context
Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics from the textual context of words in large corpora. More recently, researchers have attempted to integrate perceptual and visual features. Most of these works consider the visual appearance of objects to enhance word representations but ignore the visual environment and context in which objects appear. We propose to unify text-based techniques with vision-based techniques by simultaneously leveraging textual and visual context to learn multimodal word embeddings. We explore various choices for what can serve as a visual context and present an end-to-end method to integrate visual context elements in a multimodal skip-gram model. We provide experiments and extensive analysis of the obtained results.
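To illustrate the general idea, the sketch below extends a negative-sampling skip-gram update with an extra term that pulls a word's embedding toward a projected visual-context vector. The mixing weight `alpha`, the linear projection `M`, the single negative sample, and all names are assumptions made for illustration; the paper's actual end-to-end model is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: V-word vocabulary, d-dim embeddings, d_v-dim
# visual features (e.g. pooled CNN features of the surrounding scene).
V, d, d_v = 1000, 50, 128
W_in = rng.normal(scale=0.1, size=(V, d))    # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings
M = rng.normal(scale=0.1, size=(d_v, d))     # projects visual features to d dims

def step(center, context, visual_feat, lr=0.05, alpha=0.5):
    """One SGD update: textual skip-gram plus a visual grounding term."""
    # --- textual skip-gram with one negative sample ---
    neg = rng.integers(V)
    for ctx, label in ((context, 1.0), (neg, 0.0)):
        g = sigmoid(W_in[center] @ W_out[ctx]) - label
        w_c = W_in[center].copy()
        W_in[center] -= lr * g * W_out[ctx]
        W_out[ctx] -= lr * g * w_c
    # --- visual context: minimize alpha/2 * ||W_in[center] - visual_feat @ M||^2 ---
    diff = W_in[center] - visual_feat @ M
    W_in[center] -= lr * alpha * diff                     # pull word toward visual vector
    M += lr * alpha * np.outer(visual_feat, diff)         # and the projection toward the word

step(center=3, context=7, visual_feat=rng.normal(size=d_v))
```

The quadratic visual term is only one way to couple the two modalities; a softmax over visual context elements, as the abstract's end-to-end formulation suggests, would replace it in a faithful implementation.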