164 research outputs found
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on extracting bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states for improving the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR derived BN features. Moreover,....Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
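As a rough sketch (not the authors' implementation), the uniform TCL labelling described above, where each utterance is split into a fixed number of multi-frame segments and segment indices serve as class labels shared across utterances, could look like this; the function name and segment count are illustrative:

```python
import numpy as np

def tcl_labels(n_frames, n_segments=10):
    """Assign each frame of an utterance to one of `n_segments`
    uniformly sized, temporally ordered classes (TCL labels)."""
    # The segment index grows linearly with frame position, so the
    # same labels (0 .. n_segments-1) recur in every utterance and
    # no manual annotation is needed.
    return np.minimum(
        (np.arange(n_frames) * n_segments) // n_frames, n_segments - 1
    )

# Example: a 23-frame utterance split into 4 multi-frame segments.
labels = tcl_labels(23, n_segments=4)
```

A DNN would then be trained to classify every frame into its segment's class, exploiting the temporal structure of speech without labelled data.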
Phonetic aware techniques for Speaker Verification
The goal of this thesis is to improve current state-of-the-art techniques in speaker verification
(SV), typically based on "identity vectors" (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using various techniques such as automatic
speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a "universal background model". The speaker-specific subspace
depends on the speaker's voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis
technique to obtain low dimensional speaker-specific representation. Furthermore, DNN output is also employed in a conventional i-vector framework to model phonetic information
embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics.
More specifically, the techniques proposed in this thesis are applied to various SV tasks,
namely, text-independent and text-dependent SV. For text-independent SV task, several ASR
systems are developed and applied to compute phonetic posterior probabilities, subsequently
exploited to enhance the speaker-specific information included in i-vectors. These approaches
are then extended for text-dependent SV task, exploiting temporal information in a principled
way, i.e., by using dynamic time warping applied on speaker informative vectors.
Finally, as opposed to training the DNN with phonetic information, the DNN is trained in an
end-to-end fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of
mapping a variable length speech segment to a fixed dimensional speaker vector by estimating
the mean of hidden representations in DNN structure. We improve upon this technique by
computing a distance function between two utterances which takes into account common
phonetic units. The whole network is optimized by employing a triplet-loss objective function.
The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010
and RSR2015. Significant improvements are observed over the baseline systems on both the
text-dependent and text-independent SV tasks by applying phonetic knowledge
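The baseline end-to-end scheme and the triplet-loss objective mentioned above can be sketched as follows; this is a simplified illustration with hypothetical function names, not the thesis's actual network:

```python
import numpy as np

def speaker_embedding(hidden):
    """Baseline end-to-end embedding: the mean of the frame-level
    hidden representations, giving a fixed-dimensional speaker
    vector for a variable-length segment (hidden: n_frames x dim)."""
    return hidden.mean(axis=0)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances: pull
    same-speaker pairs together and push different-speaker pairs
    at least `margin` further apart."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

In the thesis's refinement, the plain distance between two utterance embeddings is replaced by a distance that accounts for shared phonetic units; the hinge structure of the objective stays the same.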
General-purpose Information-theoretical Bayesian Optimisation: A thesis by acronyms
Bayesian optimisation (BO) is an increasingly popular strategy for optimising functions with substantial query costs. By sequentially focusing evaluation resources into promising areas of the search space, BO is able to find reasonable solutions within heavily restricted evaluation budgets. Consequently, BO has become the de-facto approach for fine-tuning the hyper-parameters of machine learning models and has had numerous successful applications in industry and across the experimental sciences. This thesis seeks to increase the scope of information-theoretic BO, a popular class of search strategies that regularly achieves state-of-the-art optimisation. Unfortunately, current information-theoretic BO routines require sophisticated approximation schemes that incur substantially large computational overheads and are, therefore, applicable only to optimisation problems defined over low-dimensional and Euclidean search spaces. This thesis proposes information-theoretic approximations that extend the Max-value Entropy Search of Wang and Jegelka (2017) to a much wider class of optimisation tasks, including noisy, batch and multi-fidelity optimisation across both Euclidean and highly-structured discrete spaces. To comprehensively test our proposed search strategies, we construct novel frameworks for performing BO over the highly-structured string spaces that arise in synthetic gene design and molecular search problems, as well as for objective functions with controllable observation noise. Finally, we demonstrate the real-world applicability of BO as part of a sophisticated machine learning pipeline for fine-tuning multi-speaker text-to-speech models.
Sparse and Low-rank Modeling for Automatic Speech Recognition
This thesis deals with exploiting the low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that whenever a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces whereas noise is scattered as random high-dimensional erroneous estimations in the features. In this context, the contribution of this thesis is twofold: (i) identify sparse and low-rank modeling approaches as excellent tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employ these tools under novel ASR frameworks to enrich the acoustic information present in the speech features towards the goal of improving ASR. Techniques developed in this thesis focus on deep neural network (DNN) based posterior features which, under the sparse and low-rank modeling approaches, unveil the underlying class-specific low-dimensional subspaces very elegantly.
In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel \textit{Compressive Sensing} (CS) perspective towards ASR. Here exemplar-based speech recognition is posed as a problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that albeit their power in representation learning, DNN based acoustic models still have room for improvement in exploiting the \textit{union of low-dimensional subspaces} structure of speech data. Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive sensing based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated on both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech in building a robust ASR framework. Finally, the conclusions of this thesis are further consolidated by an information theoretic analysis approach which explicitly quantifies the contribution of proposed techniques in improving ASR
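The idea of projecting noisy DNN posteriors onto a class-specific low-dimensional subspace via PCA can be sketched in a few lines; this is a hedged illustration of the principle (function names and dimensions are hypothetical), not the thesis's full framework:

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn a PCA basis from class-specific posterior vectors X
    (rows = frames). Returns the mean and top principal directions."""
    mu = X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components]

def enhance_posteriors(P, mu, basis):
    """Project posteriors P onto the low-dimensional class subspace
    and reconstruct: a rank-reduced denoising that suppresses the
    high-dimensional erroneous components."""
    return (P - mu) @ basis.T @ basis + mu

# Synthetic check: data that truly lies in a 2-D subspace of a
# 5-D posterior space is reconstructed exactly by a rank-2 basis.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 5))
mu, basis = fit_pca(X, n_components=2)
X_hat = enhance_posteriors(X, mu, basis)
```

In the thesis, such projected posteriors serve as cleaner training targets for the acoustic model.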
Handbook of Digital Face Manipulation and Detection
This open access book provides the first comprehensive collection of studies dealing with the hot topic of digital face manipulation such as DeepFakes, Face Morphing, or Reenactment. It combines the research fields of biometrics and media forensics including contributions from academia and industry. Appealing to a broad readership, introductory chapters provide a comprehensive overview of the topic, which address readers wishing to gain a brief overview of the state-of-the-art. Subsequent chapters, which delve deeper into various research challenges, are oriented towards advanced readers. Moreover, the book provides a good starting point for young researchers as well as a reference guide pointing at further literature. Hence, the primary readership is academic institutions and industry currently involved in digital face manipulation and detection. The book could easily be used as a recommended text for courses in image processing, machine learning, media forensics, biometrics, and the general security area
Hierarchical learning : theory with applications in speech and vision
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 123-132). Over the past two decades several hierarchical learning models have been developed and applied to a diverse range of practical tasks with much success. Little is known, however, as to why such models work as well as they do. Indeed, most are difficult to analyze, and cannot be easily characterized using the established tools from statistical learning theory. In this thesis, we study hierarchical learning architectures from two complementary perspectives: one theoretical and the other empirical. The theoretical component of the thesis centers on a mathematical framework describing a general family of hierarchical learning architectures. The primary object of interest is a recursively defined feature map, and its associated kernel. The class of models we consider exploit the fact that data in a wide variety of problems satisfy a decomposability property. Paralleling the primate visual cortex, hierarchies are assembled from alternating filtering and pooling stages that build progressively invariant representations which are simultaneously selective for increasingly complex stimuli. A goal of central importance in the study of hierarchical architectures and the cortex alike is that of understanding quantitatively the tradeoff between invariance and selectivity, and how invariance and selectivity contribute towards providing an improved representation useful for learning from data. A reasonable expectation is that an unsupervised hierarchical representation will positively impact the sample complexity of a corresponding supervised learning task.
We therefore analyze invariance and discrimination properties that emerge in particular instances of layered models described within our framework. A group-theoretic analysis leads to a concise set of conditions which must be met to establish invariance, as well as a constructive prescription for meeting those conditions. An information-theoretic analysis is then undertaken and seen as a means by which to characterize a model's discrimination properties. The empirical component of the thesis experimentally evaluates key assumptions built into the mathematical framework. In the case of images, we present simulations which support the hypothesis that layered architectures can reduce the sample complexity of a non-trivial learning problem. In the domain of speech, we describe a localized analysis technique that leads to a noise-robust representation. The resulting biologically-motivated features are found to outperform traditional methods on a standard phonetic classification task in both clean and noisy conditions. By Jacob V. Bouvrie. Ph.D.
Multi-dialect Arabic broadcast speech recognition
Dialectal Arabic speech research suffers from the lack of labelled resources and
standardised orthography. There are three main challenges in dialectal Arabic
speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training
robust dialectal speech recognition models from limited labelled data and (iii)
evaluating speech recognition for dialects with no orthographic rules. This thesis
is concerned with the following three contributions:
Arabic Dialect Identification: We are mainly dealing with Arabic speech
without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently
diverse to the extent that one can argue that they are different languages
rather than dialects of the same language. We have two contributions:
First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected
from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57
hours of high-quality speech, consisting of four major varieties of dialectal Arabic (DA):
Egyptian, Levantine, Gulf (Arabian Peninsula) and North African
(Moroccan), selected from almost 1,000 hours. Second, we build an Arabic dialect identification
(ADI) system. We explored two main groups of features, namely acoustic
features and linguistic features. For the linguistic features, we look at a wide
range of features, addressing words, characters and phonemes. With respect to
acoustic features, we look at raw features such as mel-frequency cepstral coefficients
combined with shifted delta cepstra (MFCC-SDC), bottleneck features and
the i-vector as a latent variable. We studied both generative and discriminative
classifiers, in addition to deep learning approaches, namely deep neural network
(DNN) and convolutional neural network (CNN). In our work, we pose Arabic
dialect identification as a five-class challenge comprising the previously mentioned four
dialects as well as Modern Standard Arabic.
Arabic Speech Recognition: We introduce our effort in building Arabic automatic
speech recognition (ASR) and we create an open research community
to advance it. This section has two main goals: First, creating a framework for
Arabic ASR that is publicly available for research. We address our effort in building
two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast
news using more than 1,200 hours of speech and 130M words of text collected
from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre
data with limited non-orthographic speech collected from YouTube, with special
attention paid to transfer learning. Second, building a robust Arabic ASR system
and reporting a competitive word error rate (WER) to use it as a potential
benchmark to advance the state of the art in Arabic ASR. Our overall system is
a combination of five acoustic models (AM): unidirectional long short term memory
(LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN),
TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers
followed by BLSTM layers (TDNN-BLSTM). The AMs are trained using the purely
sequence-trained lattice-free maximum mutual information (LF-MMI) objective.
The generated lattices are rescored using a four-gram language model
(LM) and a recurrent neural network with maximum entropy (RNNME) LM.
Our official WER is 13%, the lowest reported on this task.
Evaluation: The third part of the thesis addresses our effort in evaluating dialectal
speech with no orthographic rules. Our methods learn from multiple
transcribers and align the speech hypothesis to overcome the non-orthographic
aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU
score used in machine translation (MT). We have also automated this process
by learning different spelling variants from Twitter data. We automatically mine
a huge collection of tweets in an unsupervised fashion to build more than
11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal
WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with
no reference transcription using decoding and language features. We show that
our word error rate estimation is robust for many scenarios with and without the
decoding features
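The core of the multi-reference WER idea, scoring a hypothesis against every transcriber's reference and keeping the closest match so that legitimate spelling variants of non-orthographic dialectal words are not penalised, can be sketched as below; this is a simplified illustration (function names are hypothetical), not the thesis's exact MR-WER procedure:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def multi_reference_wer(references, hyp):
    """Score the hypothesis against each reference transcription
    and return the best (lowest) per-reference error rate."""
    hyp_words = hyp.split()
    return min(
        edit_distance(r.split(), hyp_words) / max(len(r.split()), 1)
        for r in references
    )
```

With a single reference this reduces to the ordinary WER; the WERd metric additionally treats mined n-to-m lexical variant pairs as matches.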