Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
In many speech-enabled human-machine interaction scenarios, user speech can
overlap with the device playback audio. In these instances, the performance of
tasks such as keyword spotting (KWS) and device-directed speech detection (DDD)
can degrade significantly. To address this problem, we propose an implicit
acoustic echo cancellation (iAEC) framework where a neural network is trained
to exploit the additional information from a reference microphone channel to
learn to ignore the interfering signal and improve detection performance. We
study this framework for the tasks of KWS and DDD on, respectively, an
augmented version of Google Speech Commands v2 and a real-world Alexa device
dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task
during device playback conditions. We also show comparable or superior
performance over a strong end-to-end neural echo cancellation + KWS baseline
for the KWS task, with an order of magnitude lower computational requirements.
Comment: Submitted to INTERSPEECH 202
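The core iAEC idea of the abstract above can be illustrated with a minimal numpy sketch: instead of explicitly cancelling the echo, the detector receives the playback reference channel alongside the microphone channel and is left to learn to ignore the interference. All dimensions, variable names, and the linear "detector" below are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: 40-dim log-mel features, 100 frames.
n_frames, n_mels = 100, 40

mic_feats = rng.normal(size=(n_frames, n_mels))  # microphone (speech + playback echo)
ref_feats = rng.normal(size=(n_frames, n_mels))  # device playback reference channel

# iAEC-style input: stack both channels so a trained network can implicitly
# learn to suppress the interfering playback signal, with no separate AEC stage.
x = np.concatenate([mic_feats, ref_feats], axis=-1)  # (100, 80)

# Toy stand-in for the trained KWS/DDD detection head.
W = rng.normal(size=(2 * n_mels, 1)) * 0.01
frame_logits = x @ W                      # per-frame scores
utterance_logit = frame_logits.mean()     # pooled utterance-level score
keyword_prob = 1.0 / (1.0 + np.exp(-utterance_logit))
print(x.shape, float(keyword_prob))
```

The contrast with the explicit-AEC baseline in the abstract is that nothing here reconstructs an echo-free waveform; the reference channel is simply extra input features to the detector.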
Device Directedness with Contextual Cues for Spoken Dialog Systems
In this work, we define barge-in verification as a supervised learning task
where audio-only information is used to classify user spoken dialogue into true
and false barge-ins. Following the success of pre-trained models, we use
low-level speech representations from a self-supervised representation learning
model for our downstream classification task. Further, we propose a novel
technique to infuse lexical information directly into speech representations to
improve the domain-specific language information implicitly learned during
pre-training. Experiments conducted on spoken dialog data show that our
proposed model trained to validate barge-in entirely from speech
representations is faster by 38% relative and achieves 4.5% relative F1 score
improvement over a baseline LSTM model that uses both audio and Automatic
Speech Recognition (ASR) 1-best hypotheses. On top of this, our best proposed
model with lexically infused representations along with contextual features
provides a further relative improvement of 5.7% in F1 score, while remaining
22% faster than the baseline.
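The lexical-infusion step described above can be sketched as follows: project word-level lexical embeddings into the speech-representation space and add them to the aligned frames before classification. This is a minimal numpy sketch; the dimensions, the frame-to-word alignment, and the additive fusion are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)

n_frames, speech_dim, lex_dim, n_words = 50, 768, 300, 4

# Frame-level representations from a self-supervised speech model (stand-in).
speech_reps = rng.normal(size=(n_frames, speech_dim))

# Word-level lexical embeddings (stand-in), one per hypothesized word.
word_embs = rng.normal(size=(n_words, lex_dim))

# Hypothetical alignment: map each frame to the word it falls inside.
frame_to_word = np.arange(n_frames) * n_words // n_frames  # values in 0..3

# Infuse: project lexical embeddings into the speech space and add them,
# so domain-specific language information rides along with the acoustics.
proj = rng.normal(size=(lex_dim, speech_dim)) * 0.01
infused = speech_reps + word_embs[frame_to_word] @ proj  # (50, 768)

# A downstream barge-in classifier would consume pooled infused frames.
pooled = infused.mean(axis=0)
print(infused.shape, pooled.shape)
```

Because the lexical information is folded into the representations at training time, inference can run from speech representations alone, which is consistent with the speed advantage over the ASR-dependent baseline reported above.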
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Text injection for automatic speech recognition (ASR), wherein unpaired
text-only data is used to supplement paired audio-text data, has shown
promising improvements for word error rate. This study examines the use of text
injection for auxiliary tasks, which are the non-ASR tasks often performed by
an E2E model. In this work, we use joint end-to-end and internal language model
training (JEIT) as our text injection algorithm to train an ASR model which
performs two auxiliary tasks. The first is capitalization, which is a
de-normalization task. The second is turn-taking prediction, which attempts to
identify whether a user has completed their conversation turn in a digital
assistant interaction. We show results demonstrating that our text injection
method boosts capitalization performance for long-tail data, and improves
turn-taking detection recall.
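The auxiliary-task setup above can be pictured as a shared encoder feeding one main ASR head and two extra heads. The numpy sketch below shows only that multi-head shape; the dimensions and linear heads are illustrative assumptions, and the JEIT internal-language-model training on unpaired text is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(2)

n_frames, enc_dim, vocab = 20, 256, 32

# Shared E2E encoder output (stand-in).
enc = rng.normal(size=(n_frames, enc_dim))

# One main head and two auxiliary heads over the same encoder states.
W_asr = rng.normal(size=(enc_dim, vocab)) * 0.01  # ASR token logits
W_cap = rng.normal(size=(enc_dim, 2)) * 0.01      # capitalize this token? (de-normalization)
W_turn = rng.normal(size=(enc_dim, 2)) * 0.01     # has the user finished their turn?

asr_logits = enc @ W_asr    # (20, 32) per-frame token scores
cap_logits = enc @ W_cap    # (20, 2) per-frame capitalization scores
turn_logits = enc[-1] @ W_turn  # (2,) turn-completion decision from the final frame
print(asr_logits.shape, cap_logits.shape, turn_logits.shape)
```

The point of text injection in this setting is that unpaired text can teach the capitalization head long-tail casing patterns that paired audio-text data alone would miss.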
Children and adults produce distinct technology- and human-directed speech.
This study compares how English-speaking adults and children from the United States adapt their speech when talking to a real person and a smart speaker (Amazon Alexa) in a psycholinguistic experiment. Overall, participants produced more effortful speech when talking to a device (longer duration and higher pitch). These differences also varied by age: children produced even higher pitch in device-directed speech, suggesting a stronger expectation to be misunderstood by the system. In support of this, we see that after a staged recognition error by the device, children increased pitch even more. Furthermore, both adults and children displayed the same degree of variation in their responses for whether Alexa seems like a real person or not, further indicating that children's conceptualization of the system's competence shaped their register adjustments, rather than an increased anthropomorphism response. This work speaks to models of the mechanisms underlying speech production and to human-computer interaction frameworks, providing support for routinized theories of spoken interaction with technology.