Towards a more natural and intelligent interface with embodied conversation agent
Conversational agents, also known as chatterbots, are computer programs designed to converse like a human as far as their intelligence allows. In many ways, they are the embodiment of Turing's vision. The ability of computers to converse with human users in natural language would arguably increase their usefulness. Recent advances in Natural Language Processing (NLP) and Artificial Intelligence (AI) in general have advanced this field toward the vision of a more humanoid interactive system. This paper presents and discusses the use of an embodied conversation agent (ECA) for the imitation game. It also presents the technical design of our ECA and its performance. In the interactive media industry, it can also be observed that ECAs are becoming popular.
uC: Ubiquitous Collaboration Platform for Multimodal Team Interaction Support
A human-centered computing platform that improves teamwork and transforms the “human-computer interaction experience” for distributed teams is presented. The objective of this Ubiquitous Collaboration, or uC (“you see”), platform is to transform distributed teamwork (i.e., work occurring when teams of workers and learners are geographically dispersed and often interacting at different times). It achieves this goal through a multimodal team interaction interface realized through a reconfigurable open architecture. The approach taken is to integrate: (1) an intuitive speech- and video-centric multimodal interface to augment more conventional methods (e.g., mouse, stylus and touch), (2) an open and reconfigurable architecture supporting information gathering, and (3) a machine-intelligence approach to the analysis and management of heterogeneous live and stored sensor data to support collaboration. The system will transform how teams of people interact with computers by drawing on both the virtual and physical environment.
Generative Pretraining in Multimodality
We present Emu, a Transformer-based multimodal foundation model, which can
seamlessly generate images and texts in multimodal context. This omnivore model
can take in any single-modality or multimodal data input indiscriminately
(e.g., interleaved image, text and video) through a one-model-for-all
autoregressive training process. First, visual signals are encoded into
embeddings, and together with text tokens form an interleaved input sequence.
Emu is then end-to-end trained with a unified objective of classifying the next
text token or regressing the next visual embedding in the multimodal sequence.
This versatile multimodality empowers the exploration of diverse pretraining
data sources at scale, such as videos with interleaved frames and text,
webpages with interleaved images and text, as well as web-scale image-text
pairs and video-text pairs. Emu can serve as a generalist multimodal interface
for both image-to-text and text-to-image tasks, and supports in-context image
and text generation. Across a broad range of zero-shot/few-shot tasks including
image captioning, visual question answering, video question answering and
text-to-image generation, Emu demonstrates superb performance compared to
state-of-the-art large multimodal models. Extended capabilities such as
multimodal assistants via instruction tuning are also demonstrated with
impressive performance. Comment: Code and Demo: https://github.com/baaivision/Em
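The unified objective described in this abstract (classify the next text token, or regress the next visual embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean squared error for the regression term, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple, Union

# Hypothetical representation: each target position is either
# ("text", token_id) or ("image", embedding_vector).
Target = Tuple[str, Union[int, Sequence[float]]]


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def l2_regression(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target embeddings (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def unified_loss(predictions: Sequence[Sequence[float]],
                 targets: Sequence[Target]) -> float:
    """Average per-position loss over an interleaved text/image sequence.

    For a text position the prediction is a vector of vocabulary logits;
    for an image position it is a predicted embedding vector.
    Equal weighting of both terms is an assumption of this sketch.
    """
    total = 0.0
    for pred, (kind, value) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, value)
        else:
            total += l2_regression(pred, value)
    return total / len(targets)


# Usage: one text position (2-word vocabulary) followed by one image position.
example_loss = unified_loss(
    [[0.0, 0.0], [1.0, 2.0]],
    [("text", 0), ("image", [1.0, 4.0])],
)
```

The point of the sketch is only that a single autoregressive sequence can mix both target types, with the loss chosen per position according to the modality of the next element.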
Using X+V to construct a non-proprietary speech browser for a public-domain SpeechWeb
A SpeechWeb is a collection of hyperlinked speech applications that are distributed over the Internet. Users access the speech applications through remote browsers, which accept human voice input and return synthesized voice output. In previous research, a new architecture (LRRP) was proposed that is well suited to building a Public-Domain SpeechWeb. However, a non-proprietary speech browser is needed for this architecture. In this thesis, we have overcome several limitations of X+V, a programming language for developing multimodal applications, and we have used X+V to build a viable Public-Domain SpeechWeb browser. Our browser has the following properties: real-time human-machine speech interaction; ease of installation and use; acceptable speech-recognition accuracy in a suitable environment; no cost, non-proprietary, ease of distribution; use of a common communication protocol (CGI); ease of creation of speech applications; possibility to deploy on mobile devices. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2006 .M31. Source: Masters Abstracts International, Volume: 45-01, page: 0360. Thesis (M.Sc.)--University of Windsor (Canada), 2006
Punny Captions: Witty Wordplay in Image Descriptions
Wit is a form of rich interaction that is often grounded in a specific
situation (e.g., a comment in response to an event). In this work, we attempt
to build computational models that can produce witty descriptions for a given
image. Inspired by a cognitive account of humor appreciation, we employ
linguistic wordplay, specifically puns, in image descriptions. We develop two
approaches which involve retrieving witty descriptions for a given image from a
large corpus of sentences, or generating them via an encoder-decoder neural
network architecture. We compare our approach against meaningful baseline
approaches via human studies and show substantial improvements. We find that
when a human is subject to similar constraints as the model regarding word
usage and style, people vote the image descriptions generated by our model to
be slightly wittier than human-written witty descriptions. Unsurprisingly,
humans are almost always wittier than the model when they are free to choose
the vocabulary, style, etc. Comment: NAACL 2018 (11 pages)
Leveraging Large Language Models in Conversational Recommender Systems
A Conversational Recommender System (CRS) offers increased transparency and
control to users by enabling them to engage with the system through a real-time
multi-turn dialogue. Recently, Large Language Models (LLMs) have exhibited an
unprecedented ability to converse naturally and incorporate world knowledge and
common-sense reasoning into language understanding, unlocking the potential of
this paradigm. However, effectively leveraging LLMs within a CRS introduces new
technical challenges, including properly understanding and controlling a
complex conversation and retrieving from external sources of information. These
issues are exacerbated by a large, evolving item corpus and a lack of
conversational data for training. In this paper, we provide a roadmap for
building an end-to-end large-scale CRS using LLMs. In particular, we propose
new implementations for user preference understanding, flexible dialogue
management and explainable recommendations as part of an integrated
architecture powered by LLMs. For improved personalization, we describe how an
LLM can consume interpretable natural language user profiles and use them to
modulate session-level context. To overcome conversational data limitations in
the absence of an existing production CRS, we propose techniques for building a
controllable LLM-based user simulator to generate synthetic conversations. As a
proof of concept we introduce RecLLM, a large-scale CRS for YouTube videos
built on LaMDA, and demonstrate its fluency and diverse functionality through
some illustrative example conversations.
Automatic translation of formal data specifications to voice data-input applications.
This thesis introduces a complete solution for the automatic translation of formal data specifications to voice data-input applications. The objective of the research is to automatically generate applications for inputting data through speech from specifications of the structure of the data. The formal data specifications are XML DTDs. A new formalization called Grammar-DTD (G-DTD) is introduced as an extended DTD that contains grammars to describe valid values of the DTD elements and attributes. G-DTDs facilitate the automatic generation of VoiceXML applications that correspond to the original DTD structure. The development of the automatic application generator included identifying constraints on the G-DTD to ensure a feasible translation, using predicate calculus to build a knowledge base of inference rules that describes the mapping procedure, and writing an algorithm for the automatic translation based on the inference rules. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2006 .H355. Source: Masters Abstracts International, Volume: 45-01, page: 0354. Thesis (M.Sc.)--University of Windsor (Canada), 2006
Convo: What does conversational programming need? An exploration of machine learning interface design
Vast improvements in natural language understanding and speech recognition
have paved the way for conversational interaction with computers. While
conversational agents have often been used for short goal-oriented dialog, we
know little about agents for developing computer programs. To explore the
utility of natural language for programming, we conducted a study (n=45)
comparing different input methods to a conversational programming system we
developed. Participants completed novice and advanced tasks using voice-based,
text-based, and voice-or-text-based systems. We found that users appreciated
aspects of each system (e.g., voice-input efficiency, text-input precision) and
that novice users were more optimistic about programming using voice-input than
advanced users. Our results show that future conversational programming tools
should be tailored to users' programming experience and allow users to choose
their preferred input mode. To reduce cognitive load, future interfaces can
incorporate visualizations and possess custom natural language understanding
and speech recognition models for programming. Comment: 9 pages, 7 figures, submitted to VL/HCC 2020, for associated user
study video: https://youtu.be/TC5P3OO5ex