1,526 research outputs found
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of- the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in standalone fashion on raw text
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem - unconstrained natural language sentences,
and in the wild videos. Our key contributions are: (1) we compare two models
for lip reading, one using a CTC loss, and the other using a
sequence-to-sequence loss. Both models are built on top of the transformer
self-attention architecture; (2) we investigate to what extent lip reading is
complementary to audio speech recognition, especially when the audio signal is
noisy; (3) we introduce and publicly release a new dataset for audio-visual
speech recognition, LRS2-BBC, consisting of thousands of natural sentences from
British television. The models that we train surpass the performance of all
previous work on a lip reading benchmark dataset by a significant margin.Comment: Accepted for publication by IEEE Transactions on Pattern Analysis and
Machine Intelligenc
Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data
Manually annotating fine-grained slot-value labels for task-oriented dialogue
(ToD) systems is an expensive and time-consuming endeavour. This motivates
research into slot-filling methods that operate with limited amounts of
labelled data. Moreover, the majority of current work on ToD is based solely on
text as the input modality, neglecting the additional challenges of imperfect
automatic speech recognition (ASR) when working with spoken language. In this
work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling
framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for
ToD with speech input. KA2G achieves robust and data-efficient slot filling for
speech-based ToD by 1) framing it as a text generation task, 2) grounding text
generation additionally in the audio modality, and 3) conditioning on available
external knowledge (e.g. a predefined list of possible slot values). We show
that combining both modalities within the KA2G framework improves the
robustness against ASR errors. Further, the knowledge-aware slot-value
generator in KA2G, implemented via a pointer generator mechanism, particularly
benefits few-shot and zero-shot learning. Experiments, conducted on the
standard speech-based single-turn SLURP dataset and a multi-turn dataset
extracted from a commercial ToD system, display strong and consistent gains
over prior work, especially in few-shot and zero-shot setups.Comment: to submit to CS&
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
The Transformer architecture model, based on self-attention and multi-head
attention, has achieved remarkable success in offline end-to-end Automatic
Speech Recognition (ASR). However, self-attention and multi-head attention
cannot be easily applied for streaming or online ASR. For self-attention in
Transformer ASR, the softmax normalization function-based attention mechanism
makes it impossible to highlight important speech information. For multi-head
attention in Transformer ASR, it is not easy to model monotonic alignments in
different heads. To overcome these two limits, we integrate sparse attention
and monotonic attention into Transformer-based ASR. The sparse mechanism
introduces a learned sparsity scheme to enable each self-attention structure to
fit the corresponding head better. The monotonic attention deploys
regularization to prune redundant heads for the multi-head attention structure.
The experiments show that our method can effectively improve the attention
mechanism on widely used benchmarks of speech recognition.Comment: Accepted to DSAA 202
- …