Measuring a decade of progress in Text-to-Speech
The Blizzard Challenge offers a unique insight into progress in text-to-speech synthesis over the last decade. By using a very large listening test to compare the performance of a wide range of systems that have been constructed using a common corpus of speech recordings, it is possible to make some direct comparisons between competing techniques. By reviewing over a hundred papers describing all entries to the Challenge since 2005, we can make a useful summary of the most successful techniques adopted by participating teams, as well as drawing some conclusions about where the Blizzard Challenge has succeeded, and where there are still open problems in cross-system comparisons of text-to-speech synthesisers.
Toward Widely-Available and Usable Multimodal Conversational Interfaces
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 159-166). Multimodal conversational interfaces, which allow humans to interact with a computer using a combination of spoken natural language and a graphical interface, offer the potential to transform the manner by which humans communicate with computers. While researchers have developed myriad such interfaces, none have made the transition out of the laboratory and into the hands of a significant number of users. This thesis makes progress toward overcoming two intertwined barriers preventing more widespread adoption: availability and usability. Toward addressing the problem of availability, this thesis introduces a new platform for building multimodal interfaces that makes it easy to deploy them to users via the World Wide Web. One consequence of this work is City Browser, the first multimodal conversational interface made publicly available to anyone with a web browser and a microphone. City Browser serves as a proof-of-concept that significant amounts of usage data can be collected in this way, allowing a glimpse of how users interact with such interfaces outside of a laboratory environment. City Browser, in turn, has served as the primary platform for deploying and evaluating three new strategies aimed at improving usability. The most pressing usability challenge for conversational interfaces is their limited ability to accurately transcribe and understand spoken natural language. The three strategies developed in this thesis - context-sensitive language modeling, response confidence scoring, and user behavior shaping - each attack the problem from a different angle, but they are linked in that each critically integrates information from the conversational context. By Alexander Gruenstein. Ph.D.
Painting Pictures with Words - From Theory to System
A picture paints a thousand words, or so we are told. But how many words does it take to paint a picture? And how can words create pictures in the first place? In this thesis we examine a new theory of linguistic meaning -- where the meaning of words and sentences is determined by the scenes they evoke. We describe how descriptive text is parsed and semantically interpreted and how the semantic interpretation is then depicted as a rendered 3D scene. In doing so, we describe WordsEye, our text-to-scene system, and touch upon many fascinating issues of lexical semantics, knowledge representation, and what we call "graphical semantics." We introduce the notion of vignettes as a way to bridge between function and form, between the semantics of language and the grounded semantics of 3D scenes. And we describe how VigNet, our lexical semantic and graphical knowledge base, mediates the whole process.
In the second part of this thesis, we describe four different ways WordsEye has been tested. We first discuss an evaluation of the system in an educational environment where WordsEye was shown to significantly improve literacy skills for sixth grade students versus a control group. We then compare WordsEye with Google Image Search on "realistic" and "imaginative" sentences in order to evaluate its performance on a sentence-by-sentence level and test its potential as a way to augment existing image search tools. Thirdly, we describe what we have learned in testing WordsEye as an online 3D authoring system where it has attracted 20,000 real-world users who have performed almost one million scene depictions. Finally, we describe tests of WordsEye as an elicitation tool for field linguists studying endangered languages. We then sum up by presenting a roadmap for enhancing the capabilities of the system and identifying key opportunities and issues to be addressed.
Proceedings: Voice Technology for Interactive Real-Time Command/Control Systems Application
Topics discussed include speech understanding among researchers and managers, current developments in voice technology, and an exchange of information concerning government voice technology efforts.
Products and Services
Today’s global economy offers more opportunities, but is also more complex and competitive than ever before. This leads to a wide range of research activity in different fields of interest, especially in the so-called high-tech sectors. This book is the result of widespread research and development activity by many researchers worldwide, covering development activities in general as well as various aspects of the practical application of knowledge.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.
Proceedings of the VIIth GSCP International Conference
The 7th International Conference of the Gruppo di Studi sulla Comunicazione Parlata, dedicated to the memory of Claire Blanche-Benveniste, chose Speech and Corpora as its main theme. The wide international origin of the 235 authors, from 21 countries and 95 institutions, led to papers on many different languages. The 89 papers of this volume reflect the themes of the conference: spoken corpora compilation and annotation, with the connected technological fields; the relation between prosody and pragmatics; speech pathologies; and various papers on phonetics, speech and linguistic analysis, pragmatics and sociolinguistics. Many papers are also dedicated to speech and second language studies. The online publication with FUP allows direct access to sound and video linked to papers (when downloaded).