6,558 research outputs found
Neural Simultaneous Speech Translation Using Alignment-Based Chunking
In simultaneous machine translation, the objective is to determine when to
produce a partial translation given a continuous stream of source words, with a
trade-off between latency and quality. We propose a neural machine translation
(NMT) model that dynamically decides whether to keep consuming input words or
to generate output words. The model is composed of two main components: one to
dynamically decide on ending a source chunk, and another that translates the
consumed chunk. We train the components jointly and in a manner consistent with
the inference conditions. To generate chunked training data, we propose a
method that utilizes word alignment while also preserving enough context. We
compare models with bidirectional and unidirectional encoders of different
depths, both on real speech and text input. Our results on the IWSLT 2020
English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU
absolute.
Comment: IWSLT 2020
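For reference, the wait-k policy used as the baseline above follows a fixed schedule: read k source words, then alternate between writing one target word and reading one more source word. Below is a minimal sketch of that schedule; `decode_next_word` is a hypothetical stand-in for an NMT decoder step.

```python
# Minimal sketch of the wait-k policy: k initial READs, then alternate
# WRITE (emit one target word) / READ (consume one source word) until the
# source is exhausted, after which the remaining target words are written.
def wait_k_schedule(source_stream, k, decode_next_word, max_len=200):
    consumed, output = [], []
    src_iter, src_done = iter(source_stream), False

    def read_one():
        nonlocal src_done
        try:
            consumed.append(next(src_iter))
        except StopIteration:
            src_done = True

    while len(consumed) < k and not src_done:   # initial READs
        read_one()
    while len(output) < max_len:
        word = decode_next_word(consumed, output)
        if word == "</s>":
            break
        output.append(word)                     # WRITE
        if not src_done:
            read_one()                          # READ

    return output

# Toy usage: "translate" by copying the source word at the same position.
print(wait_k_schedule("wie geht es dir".split(), k=2,
      decode_next_word=lambda s, o: s[len(o)] if len(o) < len(s) else "</s>"))
```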
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
In this paper, we explore the learning of neural network embeddings for
natural images and speech waveforms describing the content of those images.
These embeddings are learned directly from the waveforms without the use of
linguistic transcriptions or conventional speech recognition technology. While
prior work has investigated this setting in the monolingual case using English
speech data, this work represents the first effort to apply these techniques to
languages beyond English. Using spoken captions collected in English and Hindi,
we show that the same model architecture can be successfully applied to both
languages. Further, we demonstrate that training a multilingual model
simultaneously on both languages offers improved performance over the
monolingual models. Finally, we show that these models are capable of
performing semantic cross-lingual speech-to-speech retrieval.
Comment: to appear at ICASSP 2018
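Models of this kind are typically trained with a margin-based ranking objective that pulls matched image/caption embeddings together and pushes in-batch impostors apart. The sketch below shows one common formulation of that loss; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

# Margin-based ranking loss over a batch of matched image/speech pairs:
# the matched pair must beat every in-batch impostor by a margin, in both
# retrieval directions (image -> speech and speech -> image).
def triplet_ranking_loss(image_emb, speech_emb, margin=1.0):
    # image_emb, speech_emb: (batch, dim); row i of each is a matched pair
    sim = image_emb @ speech_emb.t()          # all pairwise similarities
    pos = sim.diag().unsqueeze(1)             # matched-pair similarities
    cost_s = F.relu(margin + sim - pos)       # impostor captions per image
    cost_i = F.relu(margin + sim - pos.t())   # impostor images per caption
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_s.masked_fill(eye, 0.0) + cost_i.masked_fill(eye, 0.0)).mean()

loss = triplet_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```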
Audio-Linguistic Embeddings for Spoken Sentences
We propose spoken sentence embeddings which capture both acoustic and
linguistic content. While existing works operate at the character, phoneme, or
word level, our method learns long-term dependencies by modeling speech at the
sentence level. Formulated as an audio-linguistic multitask learning problem,
our encoder-decoder model simultaneously reconstructs acoustic and natural
language features from audio. Our results show that spoken sentence embeddings
outperform phoneme and word-level baselines on speech recognition and emotion
recognition tasks. Ablation studies show that our embeddings can better model
high-level acoustic concepts while retaining linguistic content. Overall, our
work illustrates the viability of generic, multi-modal sentence embeddings for
spoken language understanding.
Comment: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2019
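A heavily simplified sketch of what such an audio-linguistic multitask setup could look like: one encoder produces a sentence-level embedding from audio, and two heads are trained jointly, one reconstructing acoustic features and one predicting the words. The encoder, both heads, and the bag-of-words linguistic target below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpokenSentenceEmbedder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.acoustic_head = nn.Linear(emb_dim, n_mels)   # mean-frame target
        self.word_head = nn.Linear(emb_dim, vocab_size)   # bag-of-words logits

    def forward(self, mels):                  # mels: (batch, frames, n_mels)
        _, h = self.encoder(mels)
        return h.squeeze(0)                   # sentence-level embedding

def multitask_loss(model, mels, word_ids, alpha=0.5):
    emb = model(mels)
    # acoustic task: reconstruct the mean frame (a real decoder would be
    # autoregressive over all frames)
    recon = F.mse_loss(model.acoustic_head(emb), mels.mean(dim=1))
    # linguistic task: predict which words occur in the sentence
    bow = torch.zeros(emb.size(0), model.word_head.out_features)
    bow.scatter_(1, word_ids, 1.0)
    ling = F.binary_cross_entropy_with_logits(model.word_head(emb), bow)
    return alpha * recon + (1 - alpha) * ling

model = SpokenSentenceEmbedder()
loss = multitask_loss(model, torch.randn(4, 120, 80),
                      torch.randint(0, 10000, (4, 12)))
```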
Transformers with convolutional context for ASR
The recent success of transformer networks for neural machine translation and
other NLP tasks has led to a surge in research work trying to apply them to
speech recognition. Recent efforts have studied key research questions, such as
how to combine positional embeddings with speech features and how to stabilize
optimization when training transformer networks at scale. In this paper,
we propose replacing the sinusoidal positional embedding for transformers with
convolutionally learned input representations. These contextual representations
provide subsequent transformer blocks with relative positional information
needed for discovering long-range relationships between local concepts. The
proposed system has favorable optimization characteristics: our reported
results are produced with a fixed learning rate of 1.0 and no warm-up steps.
The proposed model achieves a competitive 4.7% and 12.9% WER on the LibriSpeech
"test-clean" and "test-other" subsets when no extra LM text is provided.
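A minimal sketch of the idea, with illustrative layer counts and sizes rather than the paper's exact configuration: a convolutional stack over the input features replaces the sinusoidal positional embedding, so the transformer blocks receive relative position through the convolutional receptive field.

```python
import torch
import torch.nn as nn

class ConvContextFrontend(nn.Module):
    def __init__(self, n_mels=80, d_model=512, n_conv=2, kernel=3):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_conv):
            layers += [nn.Conv1d(in_ch, d_model, kernel, stride=2,
                                 padding=kernel // 2), nn.ReLU()]
            in_ch = d_model
        self.conv = nn.Sequential(*layers)

    def forward(self, feats):                  # (batch, frames, n_mels)
        # convolve over the time axis; downsampling by stride 2 per layer
        return self.conv(feats.transpose(1, 2)).transpose(1, 2)

frontend = ConvContextFrontend()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6)
h = encoder(frontend(torch.randn(4, 1000, 80)))  # no positional embedding added
```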
Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval
Word embeddings such as Word2Vec have been successful in capturing the
semantics of text words, learned from the contexts in which those words appear.
Audio Word2Vec was similarly shown to capture the phonetic structure of spoken
words (signal segments corresponding to words), learned from the signals within
those words. This paper proposes a two-stage framework to
perform phonetic-and-semantic embedding on spoken words considering the context
of the spoken words. Stage 1 performs phonetic embedding with speaker
characteristics disentangled. Stage 2 then performs semantic embedding in
addition. We further propose to evaluate the phonetic-and-semantic nature of
the audio embeddings obtained in Stage 2 by comparing them in parallel with
text embeddings. In general, phonetic structure and semantics inevitably
disturb each other: for example, the words "brother" and "sister" are close in
semantics but very different in phonetic structure, while the words "brother"
and "bother" are the other way around. Phonetic-and-semantic embedding is
nevertheless attractive, as
shown by initial experiments on spoken document retrieval. Not only can spoken
documents containing the spoken query be retrieved based on phonetic structure,
but spoken documents that are semantically related to the query without
containing it can also be retrieved based on semantics.
Comment: Accepted at SLT 2018
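As a sketch of the retrieval use case, spoken documents can be scored against a spoken query by cosine similarity in the learned embedding space: because the embeddings carry both phonetic and semantic information, a high score can come from a phonetic match to the query or from a semantically related word. The embeddings below are random placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def retrieve(query_emb, documents, top_k=5):
    # documents: list of (doc_id, [embeddings of the spoken words it contains]);
    # each document is scored by its best-matching word embedding
    scored = [(doc_id, max(cosine(query_emb, w) for w in words))
              for doc_id, words in documents]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

rng = np.random.default_rng(0)
docs = [(f"doc{i}", [rng.standard_normal(64) for _ in range(20)])
        for i in range(50)]
print(retrieve(rng.standard_normal(64), docs, top_k=3))
```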
Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning
With the increasing popularity of video sharing websites such as YouTube and
Facebook, multimodal sentiment analysis has received increasing attention from
the scientific community. Contrary to previous works in multimodal sentiment
analysis which focus on holistic information in speech segments such as bag of
words representations and average facial expression intensity, we develop a
novel deep architecture for multimodal sentiment analysis that performs
modality fusion at the word level. In this paper, we propose the Gated
Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model, which is
composed of two modules. The Gated Multimodal Embedding alleviates the
difficulties of fusion when there are noisy modalities. The LSTM with Temporal
Attention performs word-level fusion between input modalities at a finer
resolution and attends to the most important time steps. As a result, the
GME-LSTM(A) is able to better model the multimodal structure of speech through
time and perform better sentiment comprehension. We demonstrate the
effectiveness of this approach on the publicly-available Multimodal Corpus of
Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving
state-of-the-art sentiment classification and regression results. Qualitative
analysis on our model emphasizes the importance of the Temporal Attention Layer
in sentiment prediction because the additional acoustic and visual modalities
are noisy. We also demonstrate the effectiveness of the Gated Multimodal
Embedding in selectively filtering these noisy modalities out. Our results and
analysis open new areas in the study of sentiment analysis in human
communication and provide new models for multimodal fusion.
Comment: ICMI 2017 Oral Presentation, Honorable Mention Award
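A minimal sketch of the gating idea with illustrative dimensions (roughly CMU-MOSI-sized feature streams): at each word, sigmoid gates decide how much of the possibly noisy acoustic and visual features to let through before fusion. Conditioning the gates on the word features is an assumption of this sketch, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    def __init__(self, d_word=300, d_audio=74, d_visual=35, d_out=128):
        super().__init__()
        self.gate_a = nn.Linear(d_word + d_audio, d_audio)
        self.gate_v = nn.Linear(d_word + d_visual, d_visual)
        self.fuse = nn.Linear(d_word + d_audio + d_visual, d_out)

    def forward(self, w, a, v):                # each: (batch, seq, dim)
        g_a = torch.sigmoid(self.gate_a(torch.cat([w, a], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([w, v], dim=-1)))
        # gates shrink noisy acoustic/visual features toward zero per word
        return self.fuse(torch.cat([w, g_a * a, g_v * v], dim=-1))

fused = GatedModalityFusion()(torch.randn(2, 20, 300),
                              torch.randn(2, 20, 74),
                              torch.randn(2, 20, 35))   # (2, 20, 128)
```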
Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling
Automatic speech recognition (ASR) systems often make unrecoverable errors
due to subsystem pruning (acoustic, language, and pronunciation models); for
example, words may be pruned based on acoustics using short-term context,
prior to rescoring with long-term linguistic context. In this work we model
ASR as a phrase-based noisy transformation channel and propose an error
correction system that learns from the aggregate errors of all the
independent modules constituting the ASR system and attempts to invert them. The
proposed system can exploit long-term context using a neural network language
model and can better choose between existing ASR output possibilities as well
as re-introduce previously pruned or unseen (out-of-vocabulary) phrases. It
provides corrections under poorly performing ASR conditions without degrading
any accurate transcriptions; the corrections are larger for out-of-domain and
mismatched-data ASR. Our system consistently provides improvements over the
baseline ASR, even when the baseline is further optimized
through recurrent neural network language model rescoring. This demonstrates
that any ASR improvements can be exploited independently and that our proposed
system can potentially still provide benefits on highly optimized ASR. Finally,
we present an extensive analysis of the types of errors corrected by our
system.
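The noisy-channel view can be made concrete with a toy, fully fabricated example: an inverted phrase table proposes candidate corrections for each observed ASR phrase, and a stand-in language model picks the most plausible combination. A real system learns the table from aggregate ASR errors and rescores with a neural LM.

```python
import math

corrections = {                          # observed phrase -> {candidate: prob}
    "wreck a nice": {"recognize": 0.6, "wreck a nice": 0.4},
    "beach": {"speech": 0.5, "beach": 0.5},
}

def lm_logprob(sentence):                # stand-in for a neural LM
    return {"recognize speech": -1.0,
            "wreck a nice beach": -6.0}.get(sentence, -10.0)

def correct(hyp_phrases):
    best, best_score = None, -math.inf

    def expand(i, parts, logp):          # enumerate phrase substitutions
        nonlocal best, best_score
        if i == len(hyp_phrases):
            score = logp + lm_logprob(" ".join(parts))
            if score > best_score:
                best, best_score = " ".join(parts), score
            return
        cands = corrections.get(hyp_phrases[i], {hyp_phrases[i]: 1.0})
        for cand, p in cands.items():
            expand(i + 1, parts + [cand], logp + math.log(p))

    expand(0, [], 0.0)
    return best

print(correct(["wreck a nice", "beach"]))   # -> "recognize speech"
```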
Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models
Producing a large annotated speech corpus for training ASR systems remains
difficult for the more than 95% of the world's languages that are
low-resourced, but collecting a relatively large unlabeled data set for such
languages is more achievable. This is why some initial efforts have been
reported on completely unsupervised speech recognition learned from unlabeled
data only, although with relatively high error rates. In this paper, we develop
a Generative Adversarial Network (GAN) to achieve this purpose, in which a
Generator and a Discriminator learn from each other iteratively to improve the
performance. We further use a set of Hidden Markov Models (HMMs) iteratively
refined from the machine generated labels to work in harmony with the GAN. The
initial experiments on the TIMIT data set achieve a phone error rate of 33.1%,
which is 8.5% lower than the previous state of the art.
Comment: Accepted by Interspeech 2019
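A minimal sketch of the harmonization loop only, with every function body reduced to a placeholder (the real generator/discriminator training and HMM refinement are far more involved): the GAN produces pseudo labels, HMMs are refined on those labels, and the HMM decodings seed the next GAN round.

```python
def train_gan(speech_feats, init_labels=None):
    """Adversarially train generator vs. discriminator; return pseudo labels."""
    return [["ah", "t", "iy"] for _ in speech_feats]      # placeholder

def train_hmms(speech_feats, labels):
    """Refine a set of HMMs on the GAN's pseudo labels."""
    return {"n_utts": len(labels)}                        # placeholder model

def hmm_decode(hmms, speech_feats):
    """Decode with the refined HMMs to get improved labels."""
    return [["ah", "t", "iy"] for _ in speech_feats]      # placeholder

def harmonize(speech_feats, n_iterations=3):
    labels = None
    for _ in range(n_iterations):
        labels = train_gan(speech_feats, labels)   # GAN round
        hmms = train_hmms(speech_feats, labels)    # HMM refinement round
        labels = hmm_decode(hmms, speech_feats)    # seed the next GAN round
    return labels

print(harmonize([["frame"] * 100] * 2))
```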
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions neural sequence-to-sequence TTS works well
in Japanese and English, comparing it with deep neural network (DNN) based
pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs a better architecture to learn supra-segmental
features more appropriately.
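As one concrete example of a "powerful encoder" for character inputs, the sketch below follows the published Tacotron2 character encoder (a convolutional stack followed by a bidirectional LSTM); it is illustrative and not necessarily identical to the systems compared in this paper.

```python
import torch
import torch.nn as nn

class CharacterEncoder(nn.Module):
    def __init__(self, n_chars=100, d=512, n_convs=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, d)
        self.convs = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(d, d, kernel_size=5, padding=2),
                           nn.BatchNorm1d(d), nn.ReLU())
             for _ in range(n_convs)])
        self.lstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):               # (batch, seq)
        x = self.embed(char_ids).transpose(1, 2)
        for conv in self.convs:                # local character context
            x = conv(x)
        out, _ = self.lstm(x.transpose(1, 2))  # long-range context
        return out                             # (batch, seq, d)

out = CharacterEncoder()(torch.randint(0, 100, (2, 50)))
```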
Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only
Automatic speech recognition (ASR) has been widely researched with supervised
approaches, but many low-resourced languages lack aligned audio-text data, so
supervised methods cannot be applied to them.
In this work, we propose a framework to achieve unsupervised ASR on a read
English speech dataset, where audio and text are unaligned. In the first stage,
each word-level audio segment in the utterances is represented by a vector
extracted by a sequence-to-sequence autoencoder, in which phonetic information
and speaker information are disentangled.
Secondly, semantic embeddings of the audio segments are trained from these
vector representations using a skip-gram model. Finally, an unsupervised
method is used to transform the semantic embeddings of audio segments into a
text embedding space, and the transformed embeddings are mapped to words.
With the above framework, we take a step towards unsupervised ASR trained on
unaligned text and speech only.
Comment: Code is released:
https://github.com/grtzsohalf/Towards-Unsupervised-AS
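One way the final transformation step could be implemented, assuming some (pseudo-)paired audio/text embeddings are available: the orthogonal Procrustes solution gives a linear map from the audio semantic space into the text space, after which segments are read out by nearest-neighbour lookup. This is a sketch of the general technique, not necessarily the paper's exact unsupervised method.

```python
import numpy as np

def procrustes_transform(audio_emb, text_emb):
    # audio_emb, text_emb: (n_pairs, dim); row i of each is assumed to match
    u, _, vt = np.linalg.svd(text_emb.T @ audio_emb)
    return u @ vt                  # orthogonal W minimizing ||A W^T - T||_F

def map_to_words(segment_embs, W, vocab_embs, vocab):
    mapped = segment_embs @ W.T                          # into the text space
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    sims = mapped @ vocab_embs.T   # vocab_embs assumed L2-normalized rows
    return [vocab[i] for i in sims.argmax(axis=1)]

rng = np.random.default_rng(1)
A, T = rng.standard_normal((200, 64)), rng.standard_normal((200, 64))
W = procrustes_transform(A, T)     # fit on the (pseudo-)paired embeddings
```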