
23 research outputs found

Multimodal speech interfaces for map-based applications

Author: Liu Sean (Sean Y.)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 71-73). This thesis presents the development of multimodal speech interfaces for mobile and vehicle systems. Multimodal interfaces have been shown to increase input efficiency in comparison with their purely speech or text-based counterparts. To date, much of the existing work has focused on desktop or large tablet-sized devices. The advent of the smartphone and its ability to handle both speech and touch inputs in combination with a screen display has created a compelling opportunity for deploying multimodal systems on smaller-sized devices. We introduce a multimodal user interface designed for mobile and vehicle devices, and system enhancements for a dynamically expandable point-of-interest database. The mobile system is evaluated using Amazon Mechanical Turk and the vehicle-based system is analyzed through in-lab usability studies. Our experiments show encouraging results for multimodal speech adoption. by Sean Liu. M.Eng.

DSpace@MIT

Toward Widely-Available and Usable Multimodal Conversational Interfaces

Author: Gruenstein Alexander
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2009

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 159-166). Multimodal conversational interfaces, which allow humans to interact with a computer using a combination of spoken natural language and a graphical interface, offer the potential to transform the manner by which humans communicate with computers. While researchers have developed myriad such interfaces, none have made the transition out of the laboratory and into the hands of a significant number of users. This thesis makes progress toward overcoming two intertwined barriers preventing more widespread adoption: availability and usability. Toward addressing the problem of availability, this thesis introduces a new platform for building multimodal interfaces that makes it easy to deploy them to users via the World Wide Web. One consequence of this work is City Browser, the first multimodal conversational interface made publicly available to anyone with a web browser and a microphone. City Browser serves as a proof-of-concept that significant amounts of usage data can be collected in this way, allowing a glimpse of how users interact with such interfaces outside of a laboratory environment. City Browser, in turn, has served as the primary platform for deploying and evaluating three new strategies aimed at improving usability. The most pressing usability challenge for conversational interfaces is their limited ability to accurately transcribe and understand spoken natural language. The three strategies developed in this thesis - context-sensitive language modeling, response confidence scoring, and user behavior shaping - each attack the problem from a different angle, but they are linked in that each critically integrates information from the conversational context. by Alexander Gruenstein. Ph.D.

DSpace@MIT

On the Development of Adaptive and User-Centred Interactive Multimodal Interfaces

Author: Callejas Zoraida
Espejo Gonzalo
Griol David
López-Cózar Ramón
Ábalos Nieves
Publication venue: IGI Global
Publication date: 01/01/2012

Multimodal systems have attracted increasing attention in recent years, which has made possible important improvements in the technologies for recognition, processing, and generation of multimodal information. However, many issues related to multimodality remain unclear, for example, the principles that make it possible to resemble human-human multimodal communication. This chapter focuses on some of the most important challenges that researchers have recently envisioned for future multimodal interfaces. It also describes current efforts to develop intelligent, adaptive, proactive, portable and affective multimodal interfaces.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Crowd-supervised training of spoken language systems

Author: McGraw Ian C. (Ian Carmichael)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 155-166). Spoken language systems are often deployed with static speech recognizers. Only rarely are parameters in the underlying language, lexical, or acoustic models updated on-the-fly. In the few instances where parameters are learned in an online fashion, developers traditionally resort to unsupervised training techniques, which are known to be inferior to their supervised counterparts. These realities make the development of spoken language interfaces a difficult and somewhat ad-hoc engineering task, since models for each new domain must be built from scratch or adapted from a previous domain. This thesis explores an alternative approach that makes use of human computation to provide crowd-supervised training for spoken language systems. We explore human-in-the-loop algorithms that leverage the collective intelligence of crowds of non-expert individuals to provide valuable training data at a very low cost for actively deployed spoken language systems. We also show that in some domains the crowd can be incentivized to provide training data for free, as a byproduct of interacting with the system itself. Through the automation of crowdsourcing tasks, we construct and demonstrate organic spoken language systems that grow and improve without the aid of an expert. Techniques that rely on collecting data remotely from non-expert users, however, are subject to the problem of noise. This noise can sometimes be heard in audio collected from poor microphones or muddled acoustic environments. Alternatively, noise can take the form of corrupt data from a worker trying to game the system - for example, a paid worker tasked with transcribing audio may leave transcripts blank in hopes of receiving a speedy payment. We develop strategies to mitigate the effects of noise in crowd-collected data and analyze their efficacy. This research spans a number of different application domains of widely-deployed spoken language interfaces, but maintains the common thread of improving the speech recognizer's underlying models with crowd-supervised training algorithms. We experiment with three central components of a speech recognizer: the language model, the lexicon, and the acoustic model. For each component, we demonstrate the utility of a crowd-supervised training framework. For the language model and lexicon, we explicitly show that this framework can be used hands-free, in two organic spoken language systems. by Ian C. McGraw. Ph.D.

DSpace@MIT

I-SEARCH - a multimodal search engine based on rich unified content description (RUCoD)

Author: Axenopoulos Apostolos
Camurri Antonio
Croce Vincenzo
Daras Petros
Etzold Jonas
Grimm Paul
Joyeux Laurent
Lazzaro Marinella
Malassiotis Sotiris
Mamledis Athanasios
Massari Alberto
Nucci Francesco
Spiller Sabine
Steiner Thomas
Sutton Lorenzo
Tzovaras Dimitrios
Verroust-Blondet Anne
Publication venue: Association for Computing Machinery (ACM)
Publication date: 16/04/2012

In this paper, we report on work around the I-SEARCH EU (FP7 ICT STREP) project, whose objective is the development of a multimodal search engine. We present the project's objectives and detail the achieved results, among which is a Rich Unified Content Description format.

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Adapting existing games for education using speech recognition

Author: Cai Carrie Jun
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2013

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from PDF student-submitted version of thesis. Includes bibliographical references (p. 73-77). Although memory exercises and arcade-style games are alike in their repetitive nature, memorization tasks like vocabulary drills tend to be mundane and tedious while arcade-style games are popular, intense and broadly addictive. The repetitive structure of arcade games suggests an opportunity to modify these well-known games for the purpose of learning. Arcade-style games like Tetris and Pac-Man are often difficult to adapt for educational purposes because their fast-paced intensity and keystroke-heavy nature leave little room for simultaneous practice of other skills. Incorporating spoken language technology could make it possible for users to learn as they play, keeping up with game speed through multimodal interaction. Two challenges exist in this research. First, it is unclear which learning strategy would be most effective when incorporated into an already fast-paced, mentally demanding game. Second, it remains difficult to augment fast-paced games with speech interaction because the frustrating effect of recognition errors highly compromises entertainment. In this work, we designed and implemented Tetrilingo, a modified version of Tetris with speech recognition to help students practice and remember word-picture mappings. With our speech recognition prototype, we investigated the extent to which various forms of memory practice impact learning and engagement, and found that free-recall retrieval practice was less enjoyable to slower learners despite producing significant learning benefits over alternative learning strategies. Using utterances collected from learners interacting with Tetrilingo, we also evaluated several techniques to increase speech recognition accuracy in fast-paced games by leveraging game context. Results show that, because false negative recognition errors are self-perpetuating and more prevalent than false positives, relaxing the constraints of the speech recognizer towards greater leniency may enhance overall recognition performance. by Carrie Jun Cai. S.M.

DSpace@MIT

Harvesting and summarizing user-generated content for advanced speech-based human-computer interaction

Author: Liu Jingjing
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 155-164). Many assistant applications on mobile devices can help people obtain rich Web content such as user-generated data (e.g., reviews, posts, blogs, and tweets). However, online communities and social networks are expanding rapidly, and it is impossible for people to browse and digest all the information via a simple search interface. To help users obtain information more efficiently, both the interface for data access and the information representation need to be improved. An intuitive and personalized interface, such as a dialogue system, could be an ideal assistant, which engages a user in a continuous dialogue to garner the user's interest and capture the user's intent, and assists the user via speech-navigated interactions. In addition, there is a great need for a type of application that can harvest data from the Web, summarize the information in a concise manner, and present it in an aggregated yet natural way such as direct human dialogue. This thesis, therefore, aims to conduct research on a universal framework for developing speech-based interfaces that can aggregate user-generated Web content and present the summarized information via speech-based human-computer interaction. To accomplish this goal, several challenges must be met. First, how can users' intentions be interpreted correctly from their spoken input? Second, how can the semantics and sentiment of user-generated data be interpreted and aggregated into structured yet concise summaries? Lastly, how can a dialogue modeling mechanism be developed to handle discourse and present the highlighted information via natural language? This thesis explores plausible approaches to tackle these challenges. We will explore a lexicon modeling approach for semantic tagging to improve spoken language understanding and query interpretation. We will investigate a parse-and-paraphrase paradigm and a sentiment scoring mechanism for information extraction from unstructured user-generated data. We will also explore sentiment-involved dialogue modeling and corpus-based language generation approaches for dialogue and discourse. Multilingual prototype systems in multiple domains have been implemented for demonstration. by Jingjing Liu. Ph.D.

DSpace@MIT

SiAM-dp : an open development platform for massively multimodal dialogue systems in cyber-physical environments

Author: Neßelrath Robert
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik

Cyber-physical environments enhance natural environments of daily life such as homes, factories, offices, and cars by connecting the cybernetic world of computers and communication with the real physical world. The possible application areas are wide-ranging: under the keyword Industrie 4.0, cyber-physical environments will play a significant role in the next industrial revolution, and they will also appear in homes, offices, workshops, and numerous other areas. In this new world, classical interaction concepts, in which users interact exclusively with a single stationary device, PC or smartphone, become less dominant and make room for new forms of interaction between humans and the environment itself. Furthermore, new technologies and a growing spectrum of applicable modalities broaden the possibilities for interaction designers to include more natural and intuitive non-verbal and verbal communication. The dynamic character of a cyber-physical environment and the mobility of users confront developers with the challenge of developing systems that are flexible concerning the connected and used devices and modalities. This implies new opportunities for cross-modal interaction that go beyond the dual-modality interaction that is well known today. This thesis addresses the support of application developers with a platform for the declarative and model-based development of multimodal dialogue applications, with a focus on distributed input and output devices in cyber-physical environments. The main contributions can be divided into three parts:
- Design of models and strategies for the specification of dialogue applications in a declarative development approach. This includes models for the definition of project resources, dialogue behaviour, speech recognition grammars, and graphical user interfaces, as well as mapping rules that convert the device-specific representation of input and output descriptions to a common representation language.
- The implementation of a runtime platform that provides a flexible and extendable architecture for the easy integration of new devices and components. The platform realises concepts and strategies of multimodal human-computer interaction and is the basis for full-fledged multimodal dialogue applications for arbitrary device setups, domains, and scenarios.
- A software development toolkit that is integrated in the Eclipse rich client platform and provides wizards and editors for creating and editing new multimodal dialogue applications.

Web-based game for vocabulary acquisition using computer-directed speech

Author: Yoshimoto Brandon (Brandon T.)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2009

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 73-74). Acquiring vocabulary in a foreign language is a long process which often involves the use of flashcards or cycling through long word lists for memorization. While many students learn effectively in this way, research at the Spoken Language Systems Group (SLS) has been exploring alternative methods which make use of speech recognition and generation technology. In this thesis, I designed and implemented a speech-enabled game to aid learners of Mandarin Chinese with this task of vocabulary acquisition. Our approach is customizable, allowing user control of the vocabulary words; web-based, providing potential for widespread use; and game-based, engaging the user in an interactive session with a computer or another human player. We evaluated the feasibility of the game as a web-based CALL (Computer Aided Language Learning) system through its deployment for use by the general public. We also ran a study to measure the effect of computer-directed user speech on vocabulary acquisition in comparison to a system which only provides listening practice. by Brandon Yoshimoto. M.Eng.

DSpace@MIT

Systems for the interconnection of Android applications and automobiles via the OpenXC framework

Author: Blyakher Arkady
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2013

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. Page 102 blank. Cataloged from PDF version of thesis. Includes bibliographical references (page 101). The topic of this thesis is the design and construction of a set of applications that use information from a vehicle to provide a better user experience. We introduce two Android applications that make use of the open-source OpenXC library to gauge driver awareness and to provide a user interface via steering wheel controls. The first of these applications deals with human-to-human interactions in the form of a messaging client. Our goal for this application is to provide a method of determining when to notify drivers of new messages by using vehicle data to gauge driver awareness. The second application deals with human-to-machine interactions in the form of a point-of-interest browser. Our goal for this second application is to use steering wheel controls in place of the touch screen traditionally associated with Android mobile applications. We hope to demonstrate that our versions of these mobile applications, with their focus on preserving driver awareness, provide a viable upgrade to their traditional alternatives. by Arkady Blyakher. M. Eng.

DSpace@MIT