Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
In many speech-enabled human-machine interaction scenarios, user speech can
overlap with the device playback audio. In these instances, the performance of
tasks such as keyword spotting (KWS) and device-directed speech detection (DDD)
can degrade significantly. To address this problem, we propose an implicit
acoustic echo cancellation (iAEC) framework where a neural network is trained
to exploit the additional information from a reference microphone channel to
learn to ignore the interfering signal and improve detection performance. We
study this framework for the tasks of KWS and DDD on, respectively, an
augmented version of Google Speech Commands v2 and a real-world Alexa device
dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task
during device playback conditions. We also show comparable or superior
performance over a strong end-to-end neural echo cancellation + KWS baseline
for the KWS task, with an order of magnitude lower computational requirements.
Comment: Submitted to INTERSPEECH 202
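The core iAEC idea of the abstract above can be illustrated with a minimal numpy sketch: instead of explicitly cancelling the echo, the detector receives the playback reference channel alongside the microphone channel and is left to learn to ignore the interference. All dimensions, variable names, and the linear "detector" below are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: 40-dim log-mel features, 100 frames.
n_frames, n_mels = 100, 40

mic_feats = rng.normal(size=(n_frames, n_mels))  # microphone (speech + playback echo)
ref_feats = rng.normal(size=(n_frames, n_mels))  # device playback reference channel

# iAEC-style input: stack both channels so a trained network can implicitly
# learn to suppress the interfering playback signal, with no separate AEC stage.
x = np.concatenate([mic_feats, ref_feats], axis=-1)  # (100, 80)

# Toy stand-in for the trained KWS/DDD detection head.
W = rng.normal(size=(2 * n_mels, 1)) * 0.01
frame_logits = x @ W                      # per-frame scores
utterance_logit = frame_logits.mean()     # pooled utterance-level score
keyword_prob = 1.0 / (1.0 + np.exp(-utterance_logit))
print(x.shape, float(keyword_prob))
```

The contrast with the explicit-AEC baseline in the abstract is that nothing here reconstructs an echo-free waveform; the reference channel is simply extra input features to the detector.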
Device Directedness with Contextual Cues for Spoken Dialog Systems
In this work, we define barge-in verification as a supervised learning task
where audio-only information is used to classify user spoken dialogue into true
and false barge-ins. Following the success of pre-trained models, we use
low-level speech representations from a self-supervised representation learning
model for our downstream classification task. Further, we propose a novel
technique to infuse lexical information directly into speech representations to
improve the domain-specific language information implicitly learned during
pre-training. Experiments conducted on spoken dialog data show that our
proposed model trained to validate barge-in entirely from speech
representations is faster by 38% relative and achieves 4.5% relative F1 score
improvement over a baseline LSTM model that uses both audio and Automatic
Speech Recognition (ASR) 1-best hypotheses. On top of this, our best proposed
model with lexically infused representations along with contextual features
provides a further relative improvement of 5.7% in F1 score, while remaining
22% faster than the baseline.
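The lexical-infusion step described above can be sketched as follows: project word-level lexical embeddings into the speech-representation space and add them to the aligned frames before classification. This is a minimal numpy sketch; the dimensions, the frame-to-word alignment, and the additive fusion are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)

n_frames, speech_dim, lex_dim, n_words = 50, 768, 300, 4

# Frame-level representations from a self-supervised speech model (stand-in).
speech_reps = rng.normal(size=(n_frames, speech_dim))

# Word-level lexical embeddings (stand-in), one per hypothesized word.
word_embs = rng.normal(size=(n_words, lex_dim))

# Hypothetical alignment: map each frame to the word it falls inside.
frame_to_word = np.arange(n_frames) * n_words // n_frames  # values in 0..3

# Infuse: project lexical embeddings into the speech space and add them,
# so domain-specific language information rides along with the acoustics.
proj = rng.normal(size=(lex_dim, speech_dim)) * 0.01
infused = speech_reps + word_embs[frame_to_word] @ proj  # (50, 768)

# A downstream barge-in classifier would consume pooled infused frames.
pooled = infused.mean(axis=0)
print(infused.shape, pooled.shape)
```

Because the lexical information is folded into the representations at training time, inference can run from speech representations alone, which is consistent with the speed advantage over the ASR-dependent baseline reported above.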
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Text injection for automatic speech recognition (ASR), wherein unpaired
text-only data is used to supplement paired audio-text data, has shown
promising improvements for word error rate. This study examines the use of text
injection for auxiliary tasks, which are the non-ASR tasks often performed by
an E2E model. In this work, we use joint end-to-end and internal language model
training (JEIT) as our text injection algorithm to train an ASR model which
performs two auxiliary tasks. The first is capitalization, which is a
de-normalization task. The second is turn-taking prediction, which attempts to
identify whether a user has completed their conversation turn in a digital
assistant interaction. We show results demonstrating that our text injection
method boosts capitalization performance for long-tail data, and improves
turn-taking detection recall.
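The auxiliary-task setup above can be pictured as a shared encoder feeding one main ASR head and two extra heads. The numpy sketch below shows only that multi-head shape; the dimensions and linear heads are illustrative assumptions, and the JEIT internal-language-model training on unpaired text is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(2)

n_frames, enc_dim, vocab = 20, 256, 32

# Shared E2E encoder output (stand-in).
enc = rng.normal(size=(n_frames, enc_dim))

# One main head and two auxiliary heads over the same encoder states.
W_asr = rng.normal(size=(enc_dim, vocab)) * 0.01  # ASR token logits
W_cap = rng.normal(size=(enc_dim, 2)) * 0.01      # capitalize this token? (de-normalization)
W_turn = rng.normal(size=(enc_dim, 2)) * 0.01     # has the user finished their turn?

asr_logits = enc @ W_asr    # (20, 32) per-frame token scores
cap_logits = enc @ W_cap    # (20, 2) per-frame capitalization scores
turn_logits = enc[-1] @ W_turn  # (2,) turn-completion decision from the final frame
print(asr_logits.shape, cap_logits.shape, turn_logits.shape)
```

The point of text injection in this setting is that unpaired text can teach the capitalization head long-tail casing patterns that paired audio-text data alone would miss.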
Children and adults produce distinct technology- and human-directed speech.
This study compares how English-speaking adults and children from the United States adapt their speech when talking to a real person and a smart speaker (Amazon Alexa) in a psycholinguistic experiment. Overall, participants produced more effortful speech when talking to a device (longer duration and higher pitch). These differences also varied by age: children produced even higher pitch in device-directed speech, suggesting a stronger expectation to be misunderstood by the system. In support of this, we see that after a staged recognition error by the device, children increased pitch even more. Furthermore, both adults and children displayed the same degree of variation in their responses for whether Alexa seems like a real person or not, further indicating that children's conceptualization of the system's competence shaped their register adjustments, rather than an increased anthropomorphism response. This work speaks to models of the mechanisms underlying speech production and to human-computer interaction frameworks, providing support for routinized theories of spoken interaction with technology.