The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems
Speech recognition systems are a key intermediary in voice-driven
human-computer interaction. Although speech recognition works well for pristine
monologic audio, real-life use cases in open-ended interactive settings still
present many challenges. We argue that timing is mission-critical for dialogue
systems, and evaluate 5 major commercial ASR systems for their conversational
and multilingual support. We find that word error rates for natural
conversational data in 6 languages remain abysmal, and that overlap remains a
key challenge (study 1). This impacts especially the recognition of
conversational words (study 2), and in turn has dire consequences for
downstream intent recognition (study 3). Our findings help to evaluate the
current state of conversational ASR, contribute towards multidimensional error
analysis and evaluation, and identify the phenomena that need most attention on the
way to building robust interactive speech technologies.
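Study 1 reports word error rates, the standard ASR metric defined as the word-level edit distance from hypothesis to reference, divided by the reference length. As a minimal sketch of that metric (not the paper's actual evaluation pipeline, which covers 5 commercial systems in 6 languages):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: an ASR system dropping the conversational hedge "I mean"
# counts as two deletions against a six-word reference, so WER = 2/6.
word_error_rate("yeah I mean it was fine", "yeah it was fine")
```

Note that WER weights all words equally; study 2's point is precisely that errors concentrate on conversational words, which a single aggregate WER figure hides.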
Text to talk: foundations of interactive language modeling for conversational AI and talking robots
Language in interaction allows collaboration and exchange at an unrivaled level in the natural world. It provides humans with communicative tools that helped us to thrive as a species, and as social individuals. A key to its success is its flexible use in social interaction, an aspect of language that computational linguistics and NLP struggle to get a grip on. But progress has been made towards technology that aspires to keep up with the prowess of human language and sociality. Enabled by advances in speech recognition, the likes of Siri and Alexa have entered the lives of many. Increasing amounts of data and more machine learning architectures promise more robust voice user interfaces.
At the intersection of language theory and technology, this tutorial introduces strands of both the language sciences and technological fields that share an interest in understanding how language is used in interaction. Drawing on linguistics, cognitive science and the study of human interaction, we review the theoretical and empirical foundations necessary for progress in technological fields such as voice user interfaces (VUI), social robots, and conversational AI. You will learn the basics of interactive language modeling, exploring the elements and dynamics of conversation. Drawing on techniques from dialog modeling, NLP and signal processing, you will dive into structure and variation in conversational speech data in a hands-on tutorial. Some experience working with Python and Jupyter is required.
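One basic quantity in such hands-on exploration of conversational structure is the floor-transfer offset: the time between one speaker's turn ending and the next speaker's turn starting, negative for overlap and positive for a gap. A sketch, assuming a hypothetical data format of (speaker, start, end) tuples sorted by start time:

```python
def turn_transitions(turns):
    """Given turns as (speaker, start, end) tuples sorted by start time,
    return the floor-transfer offset in seconds at each change of speaker:
    negative values mean overlap, positive values mean a gap."""
    offsets = []
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] != cur[0]:  # only measure at speaker changes
            offsets.append(cur[1] - prev[2])
    return offsets

# B starts 0.2 s before A finishes (overlap), then A retakes the
# floor 0.3 s after B finishes (gap).
turn_transitions([("A", 0.0, 1.2), ("B", 1.0, 2.0), ("A", 2.3, 3.0)])
```

Even this toy computation makes the tutorial's central point tangible: natural conversation routinely contains overlaps that a pipeline assuming clean, one-speaker-at-a-time audio will mishandle.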
The tutorial concludes with a discussion of the implications of what we know (and don't know) about language use for designing and building next-generation interactive language technology. We discuss some technological limits of current products, and touch upon societal and ethical issues that emerge alongside the rise of voice AI. This tutorial may appeal to anyone interested in understanding why talking machines struggle to hold up their end of a conversation, whether coming from a science or an engineering background.