5 research outputs found

    The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems

    Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial ASR systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This especially impacts the recognition of conversational words (study 2), and in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify the phenomena that need most attention on the way to building robust interactive speech technologies.
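The word error rate mentioned in study 1 is the standard ASR evaluation metric: the word-level edit distance between a reference transcript and the system's hypothesis, divided by the number of reference words. A minimal sketch (the abstract does not specify the authors' implementation; this is the textbook dynamic-programming formulation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("no" for "know") over four reference words: WER 0.25
print(word_error_rate("yeah I know right", "yeah I no right"))
```

Conversational tokens like "yeah", "mm-hm", or "huh" are exactly the short, prosodically reduced words that inflate this metric in the study's data.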

    Opening up ChatGPT

    No full text

    Text to talk: foundations of interactive language modeling for conversational AI and talking robots

    No full text
    Language in interaction allows collaboration and exchange at an unrivaled level in the natural world. It provides humans with communicative tools that helped us to thrive as a species, and as social individuals. A key to its success is its flexible use in social interaction, an aspect of language that computational linguistics and NLP struggle to get a grip on. But progress has been made towards technology that aspires to keep up with the prowess of human language and sociality. Enabled by advances in speech recognition, the likes of Siri and Alexa have entered the lives of many. Increasing amounts of data and more machine learning architectures promise more robust voice user interfaces. At the intersection of language theory and tech, this tutorial introduces strands of both the language sciences and technological fields that share an interest in understanding how language is used in interaction. Drawing on linguistics, cognitive science and the study of human interaction, we review the theoretical and empirical foundations necessary for progress in technological fields such as voice user interfaces (VUI), social robots, and conversational AI. You will learn the basics of interactive language modeling, exploring elements and dynamics of conversation. Drawing on techniques from dialog modeling, NLP and signal processing, you will dive into exploring structure and variation in conversational speech data in a hands-on tutorial. Some experience working with Python and Jupyter required. The tutorial concludes with a discussion of the implications of what we know (and don’t know) about language use for designing and building next-generation interactive language technology. We discuss some technological limits of current products, and touch upon societal and ethical issues that emerge alongside the rise of voice AI. 
This tutorial might appeal to anyone interested in understanding why talking machines struggle to hold up their end of a conversation, whether they come from a science or an engineering background.
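The hands-on portion of the tutorial involves exploring the dynamics of conversation in timed speech data. One basic quantity in that line of work is the floor-transfer offset: the time between one speaker's turn ending and the next speaker's turn starting, where negative values indicate overlap. A hedged sketch of how such offsets could be computed from turn annotations (the turn representation here is an illustrative assumption, not the tutorial's actual data format):

```python
def turn_transitions(turns):
    """Floor-transfer offsets at speaker changes.

    Each turn is a (speaker, start_seconds, end_seconds) tuple.
    A negative offset means the next speaker started in overlap with
    the previous turn; a positive offset is a gap of silence.
    """
    ordered = sorted(turns, key=lambda t: t[1])  # sort by start time
    offsets = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev[0] != nxt[0]:  # only speaker changes count as transitions
            offsets.append(nxt[1] - prev[2])
    return offsets

# B starts 0.2 s before A finishes (overlap), then A replies after a 0.3 s gap
print(turn_transitions([("A", 0.0, 1.2), ("B", 1.0, 2.0), ("A", 2.3, 3.0)]))
```

Distributions of such offsets are one way to quantify the timing and overlap phenomena that, per the first output above, current ASR systems handle poorly.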