
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Authors: Dai Jifeng; Fang Rongyao; Huang Zhaoyang; Li Hongsheng; Tian Hao; Yan Shilin; Zhou Jingqiu
Publication date: 30/11/2023

Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.

Comment: 10 page

arXiv.org e-Print Archive

MIRIAM: A Multimodal Chat-Based Interface for Autonomous Systems

Authors: Garcia Francisco J. Chiyah; Hastie Helen; Laskov Atanas; Patron Pedro; Robb David A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/11/2017

We present MIRIAM (Multimodal Intelligent inteRactIon for Autonomous systeMs), a multimodal interface to support situation awareness of autonomous vehicles through chat-based interaction. The user is able to chat about the vehicle's plan, objectives, previous activities and mission progress. The system is mixed initiative in that it pro-actively sends messages about key events, such as fault warnings. We will demonstrate MIRIAM using SeeByte's SeeTrack command and control interface and Neptune autonomy simulator.

Comment: 2 pages, ICMI'17, 19th ACM International Conference on Multimodal Interaction, November 13-17 2017, Glasgow, U

arXiv.org e-Print Archive

Heriot Watt Pure

Crossref

Multimodal agent interfaces and system architectures for health and fitness companions

Authors: Cavazza Marc; Charlton Daniel; Gambäck Björn; Hakulinen Jaakko; Hansen Preben; Rodríguez Gancedo Mari C.; Santos de la Cámara Raul; Smith Cameron; Ståhl Olov; Turunen Markku
Publication date: 01/01/2008

Multimodal conversational spoken dialogues using physical and virtual agents provide a potential interface to motivate and support users in the domain of health and fitness. In this paper we present how such multimodal conversational Companions can be implemented to support their owners in various pervasive and mobile settings. In particular, we focus on different forms of multimodality and system architectures for such interfaces.

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Ambient Gestures

Authors: Hare Jonathon; Karam Maria; Lewis Paul; schraefel m.c.
Publication venue: s.n.
Publication date: 01/03/2006

We present Ambient Gestures, a novel gesture-based system designed to support ubiquitous ‘in the environment’ interactions with everyday computing technology. Hand gestures and audio feedback allow users to control computer applications without reliance on a graphical user interface, and without having to switch from the context of a non-computer task to the context of the computer. The Ambient Gestures system is composed of a vision recognition software application, a set of gestures to be processed by a scripting application, and a navigation and selection application that is controlled by the gestures. This system allows us to explore gestures as the primary means of interaction within a multimodal, multimedia environment. In this paper we describe the Ambient Gestures system, define the gestures and the interactions that can be achieved in this environment, and present a formative study of the system. We conclude with a discussion of our findings and future applications of Ambient Gestures in ubiquitous computing.

Southampton (e-Prints Soton)

Reference Resolution in Multi-modal Interaction: Position paper

Author: Nijholt A.
Publication venue: EU IST RTD Roadmap
Publication date: 01/01/2002

In this position paper we present our research on multimodal interaction in and with virtual environments. The aim of this presentation is to emphasize the need for more research on reference resolution in multimodal contexts. In multimodal interaction, the human conversational partner can use more than one modality to convey his or her message to an environment in which a computer detects and interprets signals from different modalities. We show some naturally arising problems and how they are treated in different contexts. No generally applicable solutions are given.

University of Twente Research Information

A multimodal restaurant finder for semantic web

Authors: He Yulan; Hui Siu Cheung; Quan Thanh Tho
Publication date: 01/01/2007

Multimodal dialogue systems provide multiple modalities in the form of speech, mouse clicking, drawing or touch that can enhance human-computer interaction. However, one of the drawbacks of existing multimodal systems is that they are highly domain-specific and do not allow information to be shared across different providers. In this paper, we propose a semantic multimodal system for the Semantic Web, called Semantic Restaurant Finder, in which restaurant information for different cities, countries, and languages is constructed as ontologies so that the information can be shared. With the Semantic Restaurant Finder, users can draw on semantic restaurant knowledge distributed across different locations on the Internet to find the desired restaurants.

Open Research Online (The Open University)

An information assistant system for the prevention of tunnel vision in crisis management

Author: Cao Yujia
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

In the crisis management environment, tunnel vision is a set of biases in decision makers’ cognitive processes that often leads to incorrect understanding of the real crisis situation, biased perception of information, and improper decisions. The tunnel vision phenomenon is a consequence of both the challenges of the task and the natural limitations of human cognitive processes. An information assistant system is proposed with the purpose of preventing tunnel vision. The system serves as a platform for monitoring the ongoing crisis event. All information goes through the system before it arrives at the user. The system enhances the data quality, reduces the data quantity, and presents the crisis information in a manner that prevents or repairs the user’s cognitive overload. While working with such a system, the users (crisis managers) are expected to be more likely to stay aware of the actual situation, stay open-minded to possibilities, and make proper decisions.

University of Twente Research Information