Streaming keyword spotting on mobile devices
In this work we explore the latency and accuracy of keyword spotting (KWS)
models in streaming and non-streaming modes on mobile phones. NN model
conversion from non-streaming mode (model receives the whole input sequence and
then returns the classification result) to streaming mode (model receives a
portion of the input sequence and classifies it incrementally) may require
manual model rewriting. We address this by designing a TensorFlow/Keras-based
library which allows automatic conversion of non-streaming models to streaming
ones with minimal effort. With this library we benchmark multiple KWS models in
both streaming and non-streaming modes on mobile phones and demonstrate
different tradeoffs between latency and accuracy. We also explore novel KWS
models with multi-head attention which reduce the classification error over the
state of the art by 10% on the Google Speech Commands data sets V2. The streaming
library with all experiments is open-sourced.
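The difference between the two modes can be illustrated with a minimal sketch. This is not the library's API: the names `conv1d_non_streaming` and `StreamingConv1D` are hypothetical, and a plain causal 1D convolution stands in for a KWS model layer. The non-streaming function sees the whole sequence at once, while the streaming class receives one sample at a time and keeps the most recent samples as internal state, so both produce the same outputs.

```python
from collections import deque


def conv1d_non_streaming(signal, kernel):
    """Causal 1D convolution over the whole sequence at once (zero left-padding)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


class StreamingConv1D:
    """Same computation, fed one sample at a time (hypothetical illustration)."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # State buffer holding the last len(kernel) samples, initially zeros.
        self.state = deque([0.0] * len(kernel), maxlen=len(kernel))

    def step(self, sample):
        # Appending pushes the oldest sample out of the fixed-size buffer.
        self.state.append(sample)
        return sum(w * x for w, x in zip(self.kernel, self.state))


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 1.0]

full = conv1d_non_streaming(signal, kernel)
layer = StreamingConv1D(kernel)
streamed = [layer.step(x) for x in signal]
print(full == streamed)  # prints True
```

The streaming version trades a small amount of stored state for the ability to emit a result as soon as each new sample arrives, which is the kind of rewrite the abstract describes automating for full models.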
Data-Efficient Methods for Dialogue Systems
Conversational User Interfaces (CUIs) have become ubiquitous in everyday life,
in consumer-focused products like Siri and Alexa or business-oriented
solutions. Deep learning underlies many recent breakthroughs in dialogue
systems but requires very large amounts of training data, often annotated by
experts. Trained on less data, these methods severely lack
robustness (e.g. to disfluencies and out-of-domain input) and often
generalise poorly. In this thesis, we address the above issues by
introducing a series of methods for training robust dialogue systems from
minimal data. Firstly, we study two orthogonal approaches to dialogue,
linguistically informed and machine learning-based, from the data efficiency
perspective. We outline the steps to obtain data-efficient solutions with
either approach. We then introduce two data-efficient models for dialogue
response generation: the Dialogue Knowledge Transfer Network based on latent
variable dialogue representations, and the hybrid Generative-Retrieval
Transformer model (ranked first at the DSTC 8 Fast Domain Adaptation task).
Next, we address the problem of robustness given minimal data. To this end, we propose
a multitask LSTM-based model for domain-general disfluency detection. For the
problem of out-of-domain input, we present Turn Dropout, a data augmentation
technique for anomaly detection only using in-domain data, and introduce
autoencoder-augmented models for efficient training with Turn Dropout. Finally,
we focus on social dialogue and introduce a neural model for response ranking
in social conversation used in Alana, the 3rd place winner in the Amazon Alexa
Prize 2017 and 2018. We employ a novel technique of predicting the dialogue
length as the main ranking objective and show that this approach improves upon
the ratings-based counterpart in terms of data efficiency while matching it in
performance.
Comment: PhD thesis submitted at Heriot-Watt University. Contains previously
published work (see the list in Section 1.4).