Contextual Joint Factor Acoustic Embeddings
Embedding acoustic information into fixed length representations is of
interest for a whole range of applications in speech and audio technology. Two
novel unsupervised approaches to generate acoustic embeddings by modelling
acoustic context are proposed. The first approach is a contextual joint factor
synthesis encoder, where the encoder in an encoder/decoder framework is trained
to extract joint factors from surrounding audio frames to best generate the
target output. The second approach is a contextual joint factor analysis
encoder, where the encoder is trained to analyse joint factors from the source
signal that correlate best with the neighbouring audio. To evaluate the
effectiveness of our approaches compared to prior work, two tasks are conducted
-- phone classification and speaker recognition -- tested on different TIMIT
data sets. Experimental results show that one of the proposed approaches
outperforms phone classification baselines, yielding a classification accuracy
of 74.1%. When additional out-of-domain data are used for training, a further
3% improvement can be obtained for both the phone classification and speaker
recognition tasks.
Comment: Published at SLT202
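As an illustration of the synthesis variant, here is a minimal PyTorch sketch in which an encoder reads the surrounding frames and a decoder regenerates the target frame; the module name ContextSynthesisEncoder and all dimensions are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class ContextSynthesisEncoder(nn.Module):
    """Encode 2*context surrounding frames into a fixed-length embedding."""
    def __init__(self, feat_dim=40, context=5, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * context * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # The decoder tries to synthesize the centre (target) frame.
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, context_frames):
        # context_frames: (batch, 2*context, feat_dim)
        z = self.encoder(context_frames.flatten(1))
        return z, self.decoder(z)

model = ContextSynthesisEncoder()
ctx = torch.randn(8, 10, 40)     # frames surrounding the target
target = torch.randn(8, 40)      # frame to be generated
z, recon = model(ctx)            # z is the acoustic embedding
loss = nn.functional.mse_loss(recon, target)  # unsupervised objective
loss.backward()
```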
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization and speech recognition. This paper presents a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close
short speech segments belong to the same speaker; thus a common representation
that encodes the commonalities of both segments should capture the vocal
characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate "same" vs.
"different" speaker pairs generated from unlabeled audio streams. Two sets of
experiments are conducted in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment, and they provide complementary
information to i-vectors in the full-utterance experiments. Second, a large-scale
speaker verification task having a wide range of within-speaker channel
variability is adopted as an upper-bound experiment where comparisons are drawn
with in-domain supervised methods.
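The short-term active-speaker stationarity idea can be sketched as follows, assuming a 1-D convolutional siamese tower and a standard contrastive loss; class names and dimensions are illustrative, not the authors' exact NPC architecture.

```python
import torch
import torch.nn as nn

class SiameseSpeakerNet(nn.Module):
    """Shared 1-D CNN tower; outputs one embedding per short segment."""
    def __init__(self, feat_dim=40, embed_dim=128):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, a, b):
        # a, b: (batch, feat_dim, frames) short segments
        return self.tower(a), self.tower(b)

def contrastive_loss(ea, eb, same, margin=1.0):
    d = torch.norm(ea - eb, dim=1)
    return (same * d.pow(2) +
            (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

net = SiameseSpeakerNet()
seg_a, seg_b = torch.randn(16, 40, 100), torch.randn(16, 40, 100)
# Label 1: temporally close segments, presumed same speaker; 0: far apart.
same = torch.randint(0, 2, (16,)).float()
loss = contrastive_loss(*net(seg_a, seg_b), same)
loss.backward()
```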
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP
2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning
based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
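As one concrete instance of the surveyed techniques, the following is a minimal PyTorch sketch of training an acoustic model with the CTC criterion; the toy BiLSTM and label inventory are assumptions for illustration, not any specific system from the survey.

```python
import torch
import torch.nn as nn

# Toy acoustic model: BiLSTM over features, per-frame label posteriors.
feat_dim, n_labels = 40, 29           # e.g. 28 characters + CTC blank (id 0)
rnn = nn.LSTM(feat_dim, 128, bidirectional=True, batch_first=True)
proj = nn.Linear(256, n_labels)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(4, 200, feat_dim)     # (batch, frames, features)
log_probs = proj(rnn(x)[0]).log_softmax(-1).transpose(0, 1)  # (T, N, C)
targets = torch.randint(1, n_labels, (4, 30))   # label sequences
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```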
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural models have become ubiquitous in automatic speech recognition systems.
While neural networks are typically used as acoustic models in more complex
systems, recent studies have explored end-to-end speech recognition systems
based on neural networks, which can be trained to directly predict text from
input acoustic features. Although such systems are conceptually elegant and
simpler than traditional systems, it is less obvious how to interpret the
trained models. In this work, we analyze the speech representations learned by
a deep end-to-end model that is based on convolutional and recurrent layers,
and trained with a connectionist temporal classification (CTC) loss. We use a
pre-trained model to generate frame-level features which are given to a
classifier that is trained on frame classification into phones. We evaluate
representations from different layers of the deep model and compare their
quality for predicting phone labels. Our experiments shed light on important
aspects of the end-to-end model such as layer depth, model complexity, and
other design choices.
Comment: NIPS 201
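The probing procedure can be sketched as follows, assuming a frozen feature extractor standing in for the pretrained CTC model and a shallow frame-level phone classifier on top; all sizes here are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in "pretrained" encoder (frozen); in practice, load the CTC model.
encoder = nn.Sequential(nn.Conv1d(40, 128, 5, padding=2), nn.ReLU(),
                        nn.Conv1d(128, 128, 5, padding=2))
for p in encoder.parameters():
    p.requires_grad = False

n_phones = 48                          # e.g. a TIMIT-style phone set
probe = nn.Linear(128, n_phones)       # shallow classifier on top

x = torch.randn(8, 40, 300)            # (batch, features, frames)
feats = encoder(x).transpose(1, 2)     # frame-level features (8, 300, 128)
phone_labels = torch.randint(0, n_phones, (8, 300))
logits = probe(feats)                  # (batch, frames, n_phones)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, n_phones), phone_labels.reshape(-1))
loss.backward()   # only the probe's parameters receive gradients
```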
Deep Learning for Sentiment Analysis : A Survey
Deep learning has emerged as a powerful machine learning technique that
learns multiple layers of representations or features of the data and produces
state-of-the-art prediction results. Along with the success of deep learning in
many other application domains, deep learning has also been widely applied to
sentiment analysis in recent years. This paper first gives an overview of deep
learning and then provides a comprehensive survey of its current applications
in sentiment analysis.
Comment: 34 pages, 9 figures, 2 tables
CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition
End-to-end (E2E) models have achieved promising results on multiple speech
recognition benchmarks, and have shown the potential to become mainstream.
However, their unified structure and E2E training make it difficult to inject
contextual information into them for contextual biasing. Though contextual LAS (CLAS)
gives an excellent all-neural solution, the degree of biasing to given context
information is not explicitly controllable. In this paper, we focus on
incorporating context information into the continuous integrate-and-fire (CIF)
based model that supports contextual biasing in a more controllable fashion.
Specifically, an extra context processing network is introduced to extract
contextual embeddings, integrate acoustically relevant context information and
decode the contextual output distribution, thus forming a collaborative
decoding with the decoder of the CIF-based model. Evaluated on the named entity
rich evaluation sets of HKUST/AISHELL-2, our method brings relative character
error rate (CER) reductions of 8.83%/21.13% and relative named entity character
error rate (NE-CER) reductions of 40.14%/51.50% compared with a strong
baseline. Besides, it maintains performance on the original evaluation set
without degradation.
Comment: Accepted by ICASSP 202
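One plausible reading of such collaborative decoding is a log-linear combination of the main decoder distribution with a context-aware one, sketched below; the bias encoder, the attention module, and the interpolation weight lam are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

vocab, dim = 4000, 256
# Hypothetical pieces: an embedding standing in for a bias-phrase encoder,
# plus a small collaborative head; the real CIF-based model is much larger.
bias_encoder = nn.Embedding(vocab, dim)
attn = nn.MultiheadAttention(dim, 4, batch_first=True)
context_head = nn.Linear(dim, vocab)

def collaborative_logits(decoder_state, main_logits, context_tokens, lam=0.5):
    """Blend the main decoder distribution with a context-aware one."""
    ctx = bias_encoder(context_tokens)            # (batch, n_ctx, dim)
    fused, _ = attn(decoder_state, ctx, ctx)      # attend to bias phrases
    ctx_logits = context_head(fused)
    return main_logits + lam * ctx_logits         # log-linear combination

state = torch.randn(2, 1, dim)                    # current decoder state
main_logits = torch.randn(2, 1, vocab)            # main decoder output
context_tokens = torch.randint(0, vocab, (2, 8))  # biasing phrases
logits = collaborative_logits(state, main_logits, context_tokens)
```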
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of such
discriminative embedding models, training approaches, and their application to
real-world downstream tasks is limited. We start by considering "single-view"
training losses where the goal is to learn an acoustic word embedding model
that separates same-word and different-word spoken segment pairs. Then, we
consider "multi-view" contrastive losses. In this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.
Comment: PhD thesis
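A minimal sketch of multi-view AWE/AGWE training, assuming recurrent encoders for the acoustic and written views and a triplet-style objective with shifted in-batch negatives; module names and dimensions are hypothetical, not the thesis' exact models.

```python
import torch
import torch.nn as nn

class MultiViewEmbedder(nn.Module):
    """Acoustic view: RNN over frames. Written view: RNN over characters."""
    def __init__(self, feat_dim=40, n_chars=30, dim=128):
        super().__init__()
        self.acoustic = nn.GRU(feat_dim, dim, batch_first=True)
        self.char_emb = nn.Embedding(n_chars, 64)
        self.written = nn.GRU(64, dim, batch_first=True)

    def embed_audio(self, x):          # x: (batch, frames, feat_dim)
        return self.acoustic(x)[1][-1]

    def embed_word(self, chars):       # chars: (batch, chars_per_word)
        return self.written(self.char_emb(chars))[1][-1]

model = MultiViewEmbedder()
audio = torch.randn(32, 80, 40)        # spoken word segments
chars = torch.randint(0, 30, (32, 12)) # matching written words
a, w = model.embed_audio(audio), model.embed_word(chars)
# Matching (audio, word) pairs are pulled together; shifting the batch
# produces mismatched negatives: a triplet-style multi-view objective.
loss = nn.functional.triplet_margin_loss(a, w, w.roll(shifts=1, dims=0))
loss.backward()
```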
End-to-End Spoken Language Translation
In this paper, we address the task of spoken language translation. We
present a method for translating spoken sentences from one language into spoken
sentences in another language. Given spectrogram-spectrogram pairs, our model
can be trained completely from scratch to translate unseen sentences. Our
method consists of a pyramidal-bidirectional recurrent network combined with a
convolutional network to output sentence-level spectrograms in the target
language. Empirically, our model achieves competitive performance with
state-of-the-art methods on multiple languages and can generalize to unseen
speakers.
Comment: Technical Report, Stanford University, 2017. arXiv admin note: text
overlap with arXiv:1804.0004
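The pyramidal-bidirectional encoder can be sketched roughly as below, assuming adjacent time steps are concatenated between layers so that each layer halves the sequence length; layer sizes are illustrative, not the report's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidalBiRNN(nn.Module):
    """Halve the time axis at each layer by pairing adjacent steps."""
    def __init__(self, in_dim=80, hidden=128, layers=2):
        super().__init__()
        self.rnns = nn.ModuleList()
        for i in range(layers):
            d = in_dim if i == 0 else 4 * hidden  # 2 dirs x 2 paired steps
            self.rnns.append(nn.LSTM(d, hidden, bidirectional=True,
                                     batch_first=True))

    def forward(self, x):              # x: (batch, frames, in_dim)
        for i, rnn in enumerate(self.rnns):
            if i > 0:                  # concatenate each pair of frames
                b, t, d = x.shape
                x = x[:, : t - t % 2].reshape(b, t // 2, 2 * d)
            x = rnn(x)[0]
        return x

enc = PyramidalBiRNN()
spec = torch.randn(2, 400, 80)         # source-language spectrogram
print(enc(spec).shape)                 # time axis halved: (2, 200, 256)
```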
Multimodal Embeddings from Language Models
Word embeddings such as ELMo have recently been shown to model word semantics
with greater efficacy through contextualized learning on large-scale language
corpora, resulting in significant improvements to the state of the art across many
natural language tasks. In this work we integrate acoustic information into
contextualized lexical embeddings through the addition of multimodal inputs to
a pretrained bidirectional language model. The language model is trained on
spoken language that includes text and audio modalities. The resulting
representations from this model are multimodal and contain paralinguistic
information which can modify word meanings and provide affective information.
We show that these multimodal embeddings can be used to improve over previous
state-of-the-art multimodal models in emotion recognition on the CMU-MOSEI
dataset.
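A hedged sketch of one plausible fusion scheme: word-aligned acoustic features are projected and added to the token embeddings before a bidirectional language model; fusion by addition and all dimensions are assumptions, not necessarily the authors' method.

```python
import torch
import torch.nn as nn

vocab, dim, acoustic_dim = 10000, 256, 40

tok_emb = nn.Embedding(vocab, dim)
# Project word-aligned acoustic features into the embedding space and add
# them to the token embeddings before the (here randomly initialized) biLM.
acoustic_proj = nn.Linear(acoustic_dim, dim)
bilm = nn.LSTM(dim, dim, num_layers=2, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab, (4, 20))         # transcript tokens
acoustic = torch.randn(4, 20, acoustic_dim)       # per-word acoustic feats
fused = tok_emb(tokens) + acoustic_proj(acoustic) # multimodal inputs
multimodal_embeddings, _ = bilm(fused)            # (4, 20, 2*dim)
```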
Talking to Your TV: Context-Aware Voice Search with Hierarchical Recurrent Neural Networks
We tackle the novel problem of navigational voice queries posed against an
entertainment system, where viewers interact with a voice-enabled remote
controller to specify the program to watch. This is a difficult problem for
several reasons: such queries are short, even shorter than comparable voice
queries in other domains, which offers fewer opportunities for deciphering user
intent. Furthermore, ambiguity is exacerbated by underlying speech recognition
errors. We address these challenges by integrating word- and character-level
representations of the queries and by modeling voice search sessions to capture
the contextual dependencies in query sequences. Both are accomplished with a
probabilistic framework in which recurrent and feedforward neural network
modules are organized in a hierarchical manner. From a raw dataset of 32M voice
queries from 2.5M viewers on the Comcast Xfinity X1 entertainment system, we
extracted data to train and test our models. We demonstrate the benefits of our
hybrid representation and context-aware model, which significantly outperforms
models without context as well as the currently deployed product.
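The hierarchical organisation can be sketched as follows, assuming character- and word-level views of each query feed a query encoder whose outputs run through a session-level RNN; every module name and size here is hypothetical.

```python
import torch
import torch.nn as nn

n_chars, n_words, dim = 50, 5000, 128

char_emb = nn.Embedding(n_chars, 16)
char_rnn = nn.GRU(16, dim, batch_first=True)       # character-level view
word_emb = nn.Embedding(n_words, dim)              # word-level view
query_rnn = nn.GRU(2 * dim, dim, batch_first=True) # encodes one query
session_rnn = nn.GRU(dim, dim, batch_first=True)   # runs across the session
intent_head = nn.Linear(dim, 500)                  # e.g. program classes

def encode_query(words, chars):
    # words: (n_tokens,), chars: (n_tokens, chars_per_word)
    c = char_rnn(char_emb(chars))[1][-1]           # per-word char encoding
    x = torch.cat([word_emb(words), c], dim=-1).unsqueeze(0)
    return query_rnn(x)[1][-1]                     # (1, dim) query vector

# A session of three consecutive voice queries; earlier queries provide
# context for interpreting the last one.
session = [(torch.randint(0, n_words, (4,)),
            torch.randint(0, n_chars, (4, 8))) for _ in range(3)]
qs = torch.stack([encode_query(w, c) for w, c in session], dim=1)
logits = intent_head(session_rnn(qs)[0][:, -1])    # predict for last query
```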