Multi-Modal Visual and Memory Coreference Resolution
An automated assistant in an augmented reality (AR) device or smartphone performs coreference resolution. The automated assistant resolves references that are mentioned in a user’s dialog, e.g., audio input, by analyzing the audio input, visual input, and stored information that represents the user’s memory. The automated assistant performs the coreference resolution so as to conduct an intelligent dialog with the user for shopping, visual question answering (VQA), or other interactive activity.
Joint Graph-Based Reasoning For Interacting With A User
After coreference resolution is completed by an automated assistant in an augmented reality (AR) device or smartphone, the automated assistant performs a joint graph-based reasoning method to conduct an intelligent dialog with a user. The joint graph-based reasoning method uses information from various data sources (such as a scene graph, memory graph, knowledge graph, etc.) that enables the automated assistant to respond to comments that the user makes during the dialog. The automated assistant performs the dialog with the user for shopping, visual question answering (VQA), or other interactive user activity.
Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks
A major challenge for the realization of intelligent robots is to supply them
with cognitive abilities in order to allow ordinary users to program them
easily and intuitively. One way of such programming is teaching work tasks by
interactive demonstration. To make this effective and convenient for the user,
the machine must be capable of establishing a common focus of attention and be
able to use and integrate spoken instructions, visual perceptions, and
non-verbal clues like gestural commands. We report progress in building a
hybrid architecture that combines statistical methods, neural networks, and
finite state machines into an integrated system for instructing grasping tasks
by man-machine interaction. The system combines the GRAVIS-robot for visual
attention and gestural instruction with an intelligent interface for speech
recognition and linguistic interpretation, and a modality fusion module to
allow multi-modal task-oriented man-machine communication with respect to
dextrous robot manipulation of objects.
Comment: 7 pages, 8 figures
Evaluating Visual Conversational Agents via Cooperative Human-AI Games
As AI continues to advance, human-AI teams are inevitable. However, progress
in AI is routinely measured in isolation, without a human in the loop. It is
crucial to benchmark progress in AI, not just in isolation, but also in terms
of how it translates to helping humans perform certain tasks, i.e., the
performance of human-AI teams.
In this work, we design a cooperative game - GuessWhich - to measure human-AI
team performance in the specific context of the AI being a visual
conversational agent. GuessWhich involves live interaction between the human
and the AI. The AI, which we call ALICE, is provided an image which is unseen
by the human. Following a brief description of the image, the human questions
ALICE about this secret image to identify it from a fixed pool of images.
We measure performance of the human-ALICE team by the number of guesses it
takes the human to correctly identify the secret image after a fixed number of
dialog rounds with ALICE. We compare performance of the human-ALICE teams for
two versions of ALICE. Our human studies suggest a counterintuitive trend -
that while AI literature shows that one version outperforms the other when
paired with an AI questioner bot, we find that this improvement in AI-AI
performance does not translate to improved human-AI performance. This suggests
a mismatch between benchmarking of AI in isolation and in the context of
human-AI teams.
Comment: HCOMP 201
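The GuessWhich abstract above scores a human-AI team by the number of guesses the human needs to identify the secret image. A minimal sketch of that kind of metric follows, assuming each game is recorded as the human's ordered list of guesses plus the secret image. The function names and the image identifiers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def guesses_to_identify(ranked_guesses, secret_image):
    """Return the 1-based position of the secret image in the guess order."""
    return ranked_guesses.index(secret_image) + 1


def team_score(games):
    """Average number of guesses over all games; lower means a better team."""
    return mean(guesses_to_identify(guesses, secret) for guesses, secret in games)


# Hypothetical game logs: (human's guesses in order, secret image).
games = [
    (["img_3", "img_7", "img_1"], "img_7"),
    (["img_2", "img_5", "img_9"], "img_2"),
    (["img_4", "img_8", "img_6"], "img_6"),
]
score = team_score(games)
```

Computing the same score for the teams formed with each version of ALICE would give the kind of comparison the abstract describes.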
Segment Everything Everywhere All at Once
Despite the growing demand for interactive AI systems, there have been few
comprehensive studies on human-AI interaction in visual understanding, e.g.,
segmentation. Inspired by the development of prompt-based universal interfaces
for LLMs, this paper presents SEEM, a promptable, interactive model for
Segmenting Everything Everywhere all at once in an image. SEEM has four
desiderata: i) Versatility: by introducing a versatile prompting engine for
different types of prompts, including points, boxes, scribbles, masks, texts,
and referred regions of another image; ii) Compositionality: by learning a
joint visual-semantic space for visual and textual prompts to compose queries
on the fly for inference as shown in Fig 1; iii) Interactivity: by incorporating
learnable memory prompts to retain dialog history information via mask-guided
cross-attention; and iv) Semantic-awareness: by using a text encoder to encode
text queries and mask labels for open-vocabulary segmentation.
Learning a Policy for Opportunistic Active Learning
Active learning identifies data points to label that are expected to be the
most useful in improving a supervised model. Opportunistic active learning
incorporates active learning into interactive tasks that constrain possible
queries during interactions. Prior work has shown that opportunistic active
learning can be used to improve grounding of natural language descriptions in
an interactive object retrieval task. In this work, we use reinforcement
learning for such an object retrieval task, to learn a policy that effectively
trades off task completion with model improvement that would benefit future
tasks.
Comment: EMNLP 2018 Camera Ready
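The abstract above describes active learning as choosing the data points whose labels are expected to help the model most, with the opportunistic variant limited to queries the current interaction allows. As a rough illustration only, the sketch below uses plain uncertainty sampling, a common heuristic that stands in for the paper's learned reinforcement-learning policy. The function name, the object names, and the probabilities are all hypothetical.

```python
def most_uncertain(probabilities, available):
    """Pick the available item whose predicted probability is closest to 0.5.

    probabilities: item -> model's probability that a description applies.
    available: items that can be asked about in the current interaction
               (the opportunistic constraint).
    """
    candidates = [item for item in probabilities if item in available]
    return min(candidates, key=lambda item: abs(probabilities[item] - 0.5))


# Hypothetical model confidences that the word "red" applies to each object.
probabilities = {"red mug": 0.95, "blue bottle": 0.52, "green can": 0.10}

# Unconstrained, the most uncertain object would be queried.
unconstrained_choice = most_uncertain(probabilities, set(probabilities))

# If only two objects are present, the query must come from those.
constrained_choice = most_uncertain(probabilities, {"red mug", "green can"})
```

A learned policy of the kind the abstract describes would additionally decide whether to ask at all, weighing the cost to the current task against the benefit to future ones. This heuristic does not model that trade-off.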