Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only
Automatic speech recognition (ASR) has been widely studied with supervised
approaches, but many low-resource languages lack aligned audio-text data,
so supervised methods cannot be applied to them.
In this work, we propose a framework to achieve unsupervised ASR on a read
English speech dataset, where audio and text are unaligned. In the first stage,
each word-level audio segment in the utterances is represented by a vector
extracted by a sequence-to-sequence autoencoder, in which phonetic information
and speaker information are disentangled.
In the second stage, semantic embeddings of audio segments are trained from these
vector representations using a skip-gram model. Finally, an unsupervised method
is used to transform the semantic embeddings of audio segments into the text
embedding space, and the transformed embeddings are mapped to words.
With the above framework, we take a step towards unsupervised ASR trained on
unaligned text and speech only.
Comment: Code is released:
https://github.com/grtzsohalf/Towards-Unsupervised-AS
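As a hedged illustration of the third stage, the following sketch (not the authors' released code) aligns audio-segment semantic embeddings with a word embedding space using self-learning iterative Procrustes and decodes words by nearest neighbour; the paper's actual unsupervised mapping may differ, and all array names and sizes here are illustrative.

```python
# Minimal sketch of mapping audio-segment semantic embeddings into a text
# embedding space without supervision. Self-learning iterative Procrustes is
# used as a stand-in for the paper's unsupervised mapping method.
import numpy as np

def orthogonal_procrustes(X, Y):
    """Best orthogonal W minimizing ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def induce_dictionary(A, T, W):
    """Pair every audio embedding with its nearest word embedding under W."""
    sims = (A @ W) @ T.T                               # cosine sims (rows unit-norm)
    return sims.argmax(axis=1)

def unsupervised_align(audio_emb, text_emb, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    A = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    W = np.linalg.qr(rng.normal(size=(A.shape[1], A.shape[1])))[0]  # random rotation
    for _ in range(n_iter):
        pairs = induce_dictionary(A, T, W)             # pseudo audio-to-word dictionary
        W = orthogonal_procrustes(A, T[pairs])         # refit mapping on those pairs
    return W, A, T

# toy usage with random data standing in for the learned embeddings
audio_emb = np.random.randn(200, 64)                   # segment-level semantic embeddings
text_emb = np.random.randn(500, 64)                    # word embeddings of the vocabulary
W, A, T = unsupervised_align(audio_emb, text_emb)
word_ids = induce_dictionary(A, T, W)                  # one word hypothesis per segment
```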
Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks
Entrainment is a known mechanism by which interaction participants adapt or
synchronize their acoustic characteristics.
Understanding how interlocutors tend to adapt to each other's speaking style
through entrainment involves measuring a range of acoustic features and
comparing those via multiple signal comparison methods. In this work, we
present a turn-level distance measure obtained in an unsupervised manner using
a Deep Neural Network (DNN) model, which we call Neural Entrainment Distance
(NED). This metric establishes a framework that learns an embedding from the
population-wide entrainment in an unlabeled training corpus. We use the
framework for a set of acoustic features and validate the measure
experimentally by showing its efficacy in distinguishing real conversations
from fake ones created by randomly shuffling speaker turns. Moreover, we show
real-world evidence of the validity of the proposed measure. We find that a high
value of NED is associated with high ratings of emotional bond in suicide
assessment interviews, which is consistent with prior studies.
Comment: submitted to Interspeech 201
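One plausible minimal realization of a turn-level, unsupervised entrainment distance is sketched below. It is an assumption-laden stand-in for the NED model, not the authors' architecture: a small network is trained on unlabeled consecutive-turn pairs to predict the interlocutor's next-turn acoustic features, and the prediction error serves as the distance. All dimensions and names are placeholders.

```python
# Hypothetical sketch of a turn-level entrainment distance learned without labels.
import torch
import torch.nn as nn

class TurnPredictor(nn.Module):
    def __init__(self, feat_dim=88, hidden=128, bottleneck=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU(),    # low-dimensional embedding
            nn.Linear(bottleneck, feat_dim),
        )
    def forward(self, x):
        return self.net(x)

def neural_entrainment_distance(model, turn_a, turn_b):
    """Distance between speaker B's observed turn features and the model's
    prediction from speaker A's preceding turn (smaller = more entrained)."""
    with torch.no_grad():
        return torch.norm(model(turn_a) - turn_b, dim=-1)

# unsupervised training on consecutive-turn feature pairs from a large corpus
model = TurnPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(100):                        # toy loop; real training iterates over a corpus
    turn_a = torch.randn(64, 88)            # current-turn acoustic feature vectors
    turn_b = torch.randn(64, 88)            # following-turn features of the interlocutor
    loss = loss_fn(model(turn_a), turn_b)
    opt.zero_grad(); loss.backward(); opt.step()

ned = neural_entrainment_distance(model, torch.randn(1, 88), torch.randn(1, 88))
```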
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization and speech recognition. This paper provides a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close
short speech segments belong to the same speaker, so a common representation
that encodes the commonalities of both segments should capture the vocal
characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate `same' vs.
`different' speaker pairs generated from unlabeled audio streams.
streams. Two sets of experiments are done in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment, and they provide complementary
information to i-vectors in the full-utterance experiments. Second, a large-scale
speaker verification task with a wide range of within-speaker channel
variability is adopted as an upper-bound experiment, where comparisons are drawn
with in-domain supervised methods.
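The sketch below illustrates the siamese training idea from the abstract, under assumed details: pairs of temporally close short segments are treated as `same', distant pairs as `different', and a shared convolutional encoder is trained with a contrastive loss. Layer sizes and the pairing logic are placeholders, not the paper's configuration.

```python
# Hedged sketch of NPC-style unsupervised speaker-embedding training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, emb_dim)
    def forward(self, x):                            # x: (batch, n_mels, frames)
        return self.proj(self.conv(x).squeeze(-1))   # "speaker embedding"

def contrastive_loss(e1, e2, same, margin=1.0):
    d = F.pairwise_distance(e1, e2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

encoder = SegmentEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(100):                                 # toy loop over pseudo-labelled pairs
    seg1 = torch.randn(32, 40, 50)                   # short speech segments (log-mel frames)
    seg2 = torch.randn(32, 40, 50)
    same = torch.randint(0, 2, (32,)).float()        # 1 = temporally close, 0 = distant pair
    loss = contrastive_loss(encoder(seg1), encoder(seg2), same)
    opt.zero_grad(); loss.backward(); opt.step()
```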
Towards Unsupervised Speech-to-Text Translation
We present a framework for building speech-to-text translation (ST) systems
using only monolingual speech and text corpora, in other words, speech
utterances from a source language and independent text from a target language.
As opposed to traditional cascaded systems and end-to-end architectures, our
system does not require any labeled data (i.e., transcribed source audio or
parallel source and target text corpora) during training, making it especially
applicable to language pairs with very few or even zero bilingual resources.
The framework initializes the ST system with a cross-modal bilingual dictionary
inferred from the monolingual corpora, which maps every source speech segment
corresponding to a spoken word to its target text translation. For unseen
source speech utterances, the system first performs word-by-word translation on
each speech segment in the utterance. The translation is improved by leveraging
a language model and a sequence denoising autoencoder to provide prior
knowledge about the target language. Experimental results show that our
unsupervised system achieves comparable BLEU scores to supervised end-to-end
models despite the lack of supervision. We also provide an ablation analysis to
examine the utility of each component in our system.
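A minimal sketch of the word-by-word translation step is given below, assuming the cross-modal dictionary has already been inferred; the language-model and denoising-autoencoder refinement stages are omitted, and all names are illustrative.

```python
# Sketch: translate an utterance word by word using an inferred cross-modal dictionary.
import numpy as np

def translate_utterance(segment_embeddings, dict_embeddings, dict_words):
    """segment_embeddings: (n_segments, d) acoustic embeddings of spoken words.
    dict_embeddings: (vocab, d) source-side embeddings of dictionary entries.
    dict_words: list of target-language words aligned with dict_embeddings."""
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    ref = dict_embeddings / np.linalg.norm(dict_embeddings, axis=1, keepdims=True)
    nearest = (seg @ ref.T).argmax(axis=1)           # cosine nearest neighbour per segment
    return [dict_words[i] for i in nearest]

# toy usage: three speech segments scored against a five-entry inferred dictionary
draft = translate_utterance(np.random.randn(3, 64), np.random.randn(5, 64),
                            ["the", "cat", "sat", "on", "mat"])
print(draft)   # rough word-by-word draft, later smoothed by an LM / denoising autoencoder
```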
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, it is not difficult to collect a large set of audio data for
each user, but it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed on a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over widely used baselines such as
fDLR, speaker codes and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.
Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
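A hedged sketch of such a multi-task acoustic model is shown below: a shared trunk with one head over phoneme HMM states and one over unsupervised acoustic-token states, trained with a weighted joint cross-entropy loss. The layer sizes, state counts and loss weight are placeholders, not the PTDNN's actual configuration.

```python
# Sketch of a phoneme/acoustic-token multi-task acoustic model (assumed sizes).
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, feat_dim=440, hidden=1024, n_phone_states=3000, n_token_states=1500):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.phone_head = nn.Linear(hidden, n_phone_states)   # supervised phoneme-state task
        self.token_head = nn.Linear(hidden, n_token_states)   # unsupervised-token-state task
    def forward(self, x):
        h = self.trunk(x)
        return self.phone_head(h), self.token_head(h)

model = MultiTaskAcousticModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
for _ in range(10):                                  # toy loop; real training mixes both tasks
    feats = torch.randn(256, 440)                    # spliced frame-level features
    phone_tgt = torch.randint(0, 3000, (256,))       # targets from the limited transcriptions
    token_tgt = torch.randint(0, 1500, (256,))       # targets from discovered acoustic tokens
    phone_out, token_out = model(feats)
    loss = ce(phone_out, phone_tgt) + 0.5 * ce(token_out, token_tgt)  # weighted joint loss
    opt.zero_grad(); loss.backward(); opt.step()
```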
Unsupervised Discovery of Structured Acoustic Tokens with Applications to Spoken Term Detection
In this paper, we compare two paradigms for unsupervised discovery of
structured acoustic tokens directly from speech corpora without any human
annotation. The Multigranular Paradigm seeks to capture all available
information in the corpora with multiple sets of tokens for different model
granularities. The Hierarchical Paradigm attempts to jointly learn several
levels of signal representations in a hierarchical structure. The two paradigms
are unified within a theoretical framework in this paper. Query-by-Example
Spoken Term Detection (QbE-STD) experiments on the QUESST dataset of MediaEval
2015 verify the competitiveness of the acoustic tokens. The Enhanced
Relevance Score (ERS) proposed in this work improves both paradigms for the
task of QbE-STD. We also list results on the ABX evaluation task of the Zero
Resource Challenge 2015 for comparison of the two paradigms.
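As an illustration only, the sketch below scores a query against a document using token sequences decoded under several granularities, with a crude sequence-similarity measure standing in for the ERS; the paper's actual scoring is more involved, and all names are illustrative.

```python
# Toy multigranular token-matching score for Query-by-Example STD.
from difflib import SequenceMatcher

def relevance(query_tokens_by_gran, doc_tokens_by_gran):
    """Both arguments: dict mapping granularity id -> list of acoustic-token IDs."""
    scores = []
    for gran, q in query_tokens_by_gran.items():
        d = doc_tokens_by_gran[gran]
        scores.append(SequenceMatcher(None, q, d).ratio())   # crude stand-in for the ERS
    return sum(scores) / len(scores)                          # average across granularities

# toy usage with two granularities (e.g. coarse and fine token inventories)
query = {"coarse": [3, 7, 7, 1], "fine": [12, 40, 41, 9, 9]}
doc = {"coarse": [5, 3, 7, 1, 2], "fine": [30, 12, 40, 9, 9, 6]}
print(relevance(query, doc))
```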
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfully
adapted to the multi-target scenario, in which a single model is trained to
convert the input voice to many different speakers. However, such a model suffers
from the limitation that it can only convert the voice to speakers in the
training data, which narrows the applicable scenarios of VC. In this paper,
we propose a novel one-shot VC approach that performs VC given only one
example utterance each from the source and target speakers, neither of whom
needs to be seen during training. This is achieved by disentangling speaker
and content representations with instance normalization (IN). Objective and
subjective evaluations show that our model is able to generate voices similar
to the target speaker. In addition to the performance measurement, we also
demonstrate that this model is able to learn meaningful speaker representations
without any supervision.
Comment: Interspeech 201
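A minimal sketch of the disentanglement mechanism, under assumed layer sizes, is given below: instance normalization strips per-channel speaker statistics from the content code, a speaker encoder summarizes an example utterance into channel-wise mean and scale, and the decoder re-applies them in an AdaIN-like fashion. This is not the paper's exact configuration.

```python
# Sketch of content/speaker disentanglement with instance normalization.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, ch=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, ch, kernel_size=5, padding=2)
        self.norm = nn.InstanceNorm1d(ch)               # strips speaker statistics
    def forward(self, mel):                             # (batch, n_mels, frames)
        return self.norm(torch.relu(self.conv(mel)))

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, ch=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, ch, kernel_size=5, padding=2)
        self.mu, self.sigma = nn.Linear(ch, ch), nn.Linear(ch, ch)
    def forward(self, mel):
        h = torch.relu(self.conv(mel)).mean(dim=-1)     # utterance-level summary
        return self.mu(h), self.sigma(h)                # per-channel speaker stats

class Decoder(nn.Module):
    def __init__(self, n_mels=80, ch=128):
        super().__init__()
        self.out = nn.Conv1d(ch, n_mels, kernel_size=5, padding=2)
    def forward(self, content, mu, sigma):
        styled = content * sigma.unsqueeze(-1) + mu.unsqueeze(-1)  # AdaIN-style injection
        return self.out(styled)

# one-shot conversion: content from the source utterance, speaker stats from the target
content_enc, spk_enc, dec = ContentEncoder(), SpeakerEncoder(), Decoder()
src, tgt = torch.randn(1, 80, 120), torch.randn(1, 80, 90)
mu, sigma = spk_enc(tgt)
converted_mel = dec(content_enc(src), mu, sigma)
```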
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing audio features in a 'hand-crafted' manner. Interestingly,
data-learnt features have recently been reported to show better performance. Up
to now, however, these were only considered at the frame level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and an RNN decoder, which respectively transform the
variable-length audio sequence into a fixed-length vector and reconstruct the
input sequence from the generated vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and then take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method can not only deal with audio streams of arbitrary length, but also
learn the salient information of the sequence. Extensive evaluation on a large
acoustic event database is performed, and the empirical results demonstrate
that the learnt audio sequence representation yields a significant performance
improvement over other state-of-the-art hand-crafted sequence features for AEC.
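The sketch below shows the encoder-decoder idea with assumed layer sizes: a GRU encoder compresses a variable-length feature sequence into its final hidden state, a GRU decoder reconstructs the sequence from that vector, and the encoder state is then used as the fixed-length representation of a new clip.

```python
# Sketch of an RNN encoder-decoder for fixed-length audio sequence representations.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)
    def forward(self, x):                               # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)                          # h: (1, batch, hidden) = sequence vector
        dec_in = torch.zeros_like(x)                    # simple non-autoregressive decoder input
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out), h.squeeze(0)

model = SeqAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
for _ in range(50):                                     # toy loop over unlabeled audio sequences
    seq = torch.randn(16, 200, 40)                      # e.g. log-mel frames of an event clip
    recon, _ = model(seq)
    loss = mse(recon, seq)
    opt.zero_grad(); loss.backward(); opt.step()

_, embedding = model(torch.randn(1, 123, 40))           # fixed-length vector for a new clip
```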
Semantic speech retrieval with a visually grounded model of untranscribed speech
There is growing interest in models that can learn from unlabelled speech
paired with visual context. This setting is relevant for low-resource speech
processing, robotics, and human language acquisition research. Here we study
how a visually grounded speech model, trained on images of scenes paired with
spoken captions, captures aspects of semantics. We use an external image tagger
to generate soft text labels from images, which serve as targets for a neural
model that maps untranscribed speech to (semantic) keyword labels. We introduce
a newly collected data set of human semantic relevance judgements and an
associated task, semantic speech retrieval, where the goal is to search for
spoken utterances that are semantically relevant to a given text query. Without
seeing any text, the model trained on parallel speech and images achieves a
precision of almost 60% on its top ten semantic retrievals. Compared to a
supervised model trained on transcriptions, our model matches human judgements
better by some measures, especially in retrieving non-verbatim semantic
matches. We perform an extensive analysis of the model and its resulting
representations.
Comment: 10 pages, 3 figures, 5 tables; accepted to the IEEE/ACM Transactions
on Audio, Speech and Language Processing
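A hedged sketch of the training and retrieval setup is given below: a placeholder speech encoder is trained with binary cross-entropy against soft keyword targets (standing in for the image tagger's posteriors), and retrieval ranks utterances by the score assigned to the query keyword. Architecture and sizes are assumptions, not the paper's model.

```python
# Sketch of keyword prediction from untranscribed speech with soft visual targets.
import torch
import torch.nn as nn

class KeywordSpeechModel(nn.Module):
    def __init__(self, n_mels=40, vocab=1000, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)
    def forward(self, mel):                              # (batch, frames, n_mels)
        _, h = self.rnn(mel)
        return torch.sigmoid(self.head(h.squeeze(0)))    # per-keyword relevance scores

model = KeywordSpeechModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()
for _ in range(20):                                      # toy loop; real targets come from an image tagger
    speech = torch.randn(8, 300, 40)                     # spoken captions
    soft_tags = torch.rand(8, 1000)                      # tagger posteriors for the paired images
    loss = bce(model(speech), soft_tags)
    opt.zero_grad(); loss.backward(); opt.step()

# semantic speech retrieval: score utterances for one query keyword and rank them
query_id = 42
scores = model(torch.randn(100, 250, 40))[:, query_id]
ranking = scores.argsort(descending=True)
```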
Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection
Embedding audio signal segments into vectors with fixed dimensionality is
attractive because all following processing will be easier and more efficient,
for example modeling, classification or indexing. The previously proposed Audio
Word2Vec was shown to represent audio segments of spoken words as such vectors,
carrying information about the phonetic structures of the signal segments.
However, each linguistic unit (word, syllable or phoneme in text form)
corresponds to an unlimited number of audio segments, whose vector
representations are inevitably spread over the embedding space, causing some
confusion. It is
therefore desired to better cluster the audio embeddings such that those
corresponding to the same linguistic unit can be more compactly distributed. In
this paper, inspired by Siamese networks, we propose some approaches to achieve
the above goal. This includes identifying positive and negative pairs from
unlabeled data for Siamese style training, disentangling acoustic factors such
as speaker characteristics from the audio embedding, handling unbalanced data
distribution, and having the embedding processes learn from the adjacency
relationships among data points. All these can be done in an unsupervised way.
Improved performance was obtained in preliminary experiments on the LibriSpeech
data set, including analysis of clustering characteristics and applications to
spoken term detection.
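The sketch below illustrates one such adjacency-based, Siamese-style refinement under assumed details: positive pairs are mined as nearest neighbours in the embedding space, negatives as distant points, and a small network is trained with a triplet loss so that embeddings of the same linguistic unit become more compact. All names and sizes are illustrative.

```python
# Sketch of unsupervised adjacency-based refinement of audio embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mine_pairs(emb):
    """Return (anchor, positive, negative) index triplets from unlabeled embeddings."""
    d = torch.cdist(emb, emb)                            # pairwise distances
    d_pos = d.clone()
    d_pos.fill_diagonal_(float("inf"))                   # exclude self-distance for positives
    pos = d_pos.argmin(dim=1)                            # nearest neighbour as positive
    neg = d.argmax(dim=1)                                # farthest point as an easy negative
    return torch.arange(emb.size(0)), pos, neg

refiner = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(refiner.parameters(), lr=1e-3)
triplet = nn.TripletMarginLoss(margin=0.5)
audio_emb = torch.randn(1000, 128)                       # e.g. Audio Word2Vec segment embeddings
for _ in range(50):                                      # toy refinement loop
    a, p, n = mine_pairs(audio_emb)
    loss = triplet(refiner(audio_emb[a]), refiner(audio_emb[p]), refiner(audio_emb[n]))
    opt.zero_grad(); loss.backward(); opt.step()

refined = F.normalize(refiner(audio_emb), dim=1)         # more compactly clustered embeddings
```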