Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
We address the problem of referring image segmentation that aims to generate
a mask for the object specified by a natural language expression. Many recent
works utilize Transformers to extract features for the target object by
aggregating the attended visual regions. However, the generic attention
mechanism in the Transformer uses the language input only for attention weight
calculation and does not explicitly fuse language features into its output.
Thus, its output feature is dominated by vision information, which limits the
model's ability to comprehensively understand the multi-modal input and
introduces uncertainty for the subsequent mask decoder when extracting the
output mask. To
address this issue, we propose a Multi-Modal Mutual Attention mechanism
and a Multi-Modal Mutual Decoder that better fuse information
from the two input modalities. Building on this decoder, we further propose
Iterative Multi-modal Interaction to allow continuous and
in-depth interactions between language and vision features. Furthermore, we
introduce Language Feature Reconstruction to prevent the
language information from being lost or distorted in the extracted features.
Extensive experiments show that our proposed approach significantly improves
the baseline and consistently outperforms state-of-the-art referring image
segmentation methods on the RefCOCO series of datasets.
Comment: IEEE TI
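The distinction the abstract draws can be sketched in a few lines: in generic cross-attention, language only shapes the attention weights, so the output is a weighted sum of vision features alone, whereas a mutual variant lets both modalities contribute content to the output. The following is a minimal NumPy illustration of that contrast, not the paper's actual implementation; all function names and the residual-style fusion are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def generic_attention(lang, vis):
    """Generic cross-attention: language queries attend over vision keys/values.
    Language influences only the weights; the output is a weighted sum of
    vision features, so no language content enters the output directly."""
    d = lang.shape[-1]
    weights = softmax(lang @ vis.T / np.sqrt(d))  # (n_lang, n_vis)
    return weights @ vis                          # vision-only content

def mutual_attention(lang, vis):
    """Hypothetical mutual variant: attention is applied in both directions,
    and each stream keeps a residual of its own modality, so the fused
    outputs carry content from both language and vision."""
    d = lang.shape[-1]
    scores = lang @ vis.T / np.sqrt(d)                 # (n_lang, n_vis)
    lang_out = softmax(scores, axis=-1) @ vis + lang   # vision content + language residual
    vis_out = softmax(scores.T, axis=-1) @ lang + vis  # language content + vision residual
    return lang_out, vis_out
```

In the generic case the language features can only reweight visual regions; in the mutual sketch both modalities appear explicitly in the output, which is the property the abstract argues the mask decoder needs.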
RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments
Intention-oriented object detection aims to detect desired objects based on
specific intentions or requirements. For instance, when we desire to "lie down
and rest", we instinctively seek out a suitable option such as a "bed" or a
"sofa" that can fulfill our needs. Previous work in this area is limited either
by the number of intention descriptions or by the affordance vocabulary
available for intention objects. These limitations make it challenging to
handle intentions in open environments effectively. To facilitate this
research, we construct a comprehensive dataset called Reasoning
Intention-Oriented Objects (RIO). In particular, RIO is specifically designed
to incorporate diverse real-world scenarios and a wide range of object
categories. It offers the following key features: 1) intention descriptions in
RIO are represented as natural sentences rather than a mere word or verb
phrase, making them more practical and meaningful; 2) the intention
descriptions are contextually relevant to the scene, enabling a broader range
of potential functionalities associated with the objects; 3) the dataset
comprises a total of 40,214 images and 130,585 intention-object pairs. With the
proposed RIO, we evaluate the ability of several existing models to reason
about intention-oriented objects in open environments.
Comment: NeurIPS 2023 D&B accepted. See our project page for more details:
https://reasonio.github.io
Gesture and Speech in Interaction - 4th edition (GESPIN 4)
The fourth edition of Gesture and Speech in Interaction (GESPIN) was held in Nantes, France. With more than 40 papers, these proceedings show just what a flourishing field of enquiry gesture studies continues to be. The keynote speeches of the conference addressed three different aspects of multimodal interaction: gesture and grammar, gesture acquisition, and gesture and social interaction. In a talk entitled Qualities of event construal in speech and gesture: Aspect and tense, Alan Cienki presented an ongoing research project on narratives in French, German and Russian, a project that focuses especially on the verbal and gestural expression of grammatical tense and aspect in narratives in the three languages. Jean-Marc Colletta's talk, entitled Gesture and Language Development: towards a unified theoretical framework, described the joint acquisition and development of speech and early conventional and representational gestures. In Grammar, deixis, and multimodality between code-manifestation and code-integration, or why Kendon's Continuum should be transformed into a gestural circle, Ellen Fricke proposed a revisited grammar of noun phrases that integrates gestures as part of the semiotic and typological codes of individual languages. From a pragmatic and cognitive perspective, Judith Holler explored the use of gaze and hand gestures as means of organizing turns at talk, as well as establishing common ground, in a presentation entitled On the pragmatics of multi-modal face-to-face communication: Gesture, speech and gaze in the coordination of mental states and social interaction. Among the talks and posters presented at the conference, the vast majority of topics related, quite naturally, to gesture and speech in interaction, understood both in terms of the mapping of units in different semiotic modes and of the use of gesture and speech in social interaction.
Several presentations explored the effects of impairments (such as diseases or the natural ageing process) on gesture and speech. The communicative relevance of gesture and speech and audience design in natural interactions, as well as in more controlled settings like television debates and reports, was another topic addressed during the conference. Some participants also presented research on first and second language learning, while others discussed the relationship between gesture and intonation. While most participants presented research on gesture and speech from an observer's perspective, be it in semiotics or pragmatics, some nevertheless focused on another important aspect: the cognitive processes involved in language production and perception. Last but not least, participants also presented talks and posters on the computational analysis of gestures, whether involving external devices (e.g. mocap, Kinect) or concerning the use of specially designed computer software for the post-treatment of gestural data. Importantly, new links were made between semiotics and mocap data
- …