Unsupervised Phoneme and Word Discovery from Multiple Speakers using Double Articulation Analyzer and Neural Network with Parametric Bias
This paper describes a new unsupervised machine learning method for
simultaneous phoneme and word discovery from multiple speakers. Human infants
can acquire knowledge of phonemes and words from interactions with their
mothers as well as with others around them. From a computational
perspective, phoneme and word discovery from multiple speakers is a more
challenging problem than that from one speaker because the speech signals from
different speakers exhibit different acoustic features. This paper proposes an
unsupervised phoneme and word discovery method that simultaneously uses
nonparametric Bayesian double articulation analyzer (NPB-DAA) and deep sparse
autoencoder with parametric bias in hidden layer (DSAE-PBHL). We assume that an
infant can recognize and distinguish speakers based on certain other features,
e.g., visual face recognition. DSAE-PBHL is designed to subtract
speaker-dependent acoustic features and extract speaker-independent ones.
An experiment demonstrated that DSAE-PBHL can subtract distributed
representations of acoustic signals, enabling extraction based on the types of
phonemes rather than on the speakers. Another experiment demonstrated that a
combination of NPB-DAA and DSAE-PBHL outperformed the available methods in
phoneme and word discovery tasks involving speech signals with Japanese vowel
sequences from multiple speakers.
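A minimal sketch of the parametric-bias idea behind DSAE-PBHL: a few hidden units are reserved for a per-speaker bias vector, so the remaining units can carry speaker-independent content that is then passed to the phoneme/word discovery stage. Dimensions, names, and the encoder form are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: 12-dim acoustic input, 8 hidden units,
# of which the last 2 are reserved for the parametric bias (PB).
n_in, n_hidden, n_pb = 12, 8, 2
W = rng.normal(scale=0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
pb = {"speaker_A": rng.normal(size=n_pb),   # one PB vector per speaker,
      "speaker_B": rng.normal(size=n_pb)}   # learned jointly with W in training

def encode(x, speaker):
    # The speaker's PB is added only to its reserved hidden units.
    z = W @ x + b
    z[-n_pb:] += pb[speaker]
    h = sigmoid(z)
    # Speaker-independent features: drop the PB units before handing
    # the representation to the phoneme discovery stage (NPB-DAA).
    return h[:-n_pb]

x = rng.normal(size=n_in)            # one frame of acoustic features
feats = encode(x, "speaker_A")
print(feats.shape)                   # (6,)
```

The assumption matching the abstract is that speaker identity (e.g., from face recognition) selects which PB vector is added, so speaker-specific variation is absorbed by the PB units rather than the shared ones.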
Modeling DNN as human learner
In previous experiments, human listeners demonstrated that they had the ability to adapt to
unheard, ambiguous phonemes after some initial, relatively short exposures. At the same time,
previous work in the speech community has shown that pre-trained deep neural network-based
(DNN) ASR systems, like humans, also have the ability to adapt to unseen, ambiguous phonemes
after retuning their parameters on a relatively small set. In the first part of this thesis, the time-course
of phoneme category adaptation in a DNN is investigated in more detail. By retuning the
DNNs with more and more tokens with ambiguous sounds and comparing classification accuracy
of the ambiguous phonemes in a held-out test across the time-course, we found that DNNs, like
human listeners, also demonstrated fast adaptation: the accuracy curves were step-like in almost
all cases, showing very little adaptation after seeing only one (out of ten) training bins. However,
unlike our experimental setup mentioned above, in a typical
lexically guided perceptual learning
experiment, listeners are trained with individual words instead of individual phones, and thus to truly
model such a scenario, we would require a model that could take the context of a whole utterance
into account. Traditional speech recognition systems accomplish this through the use of hidden
Markov models (HMM) and WFST decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) trained under connectionist temporal classification (CTC) criterion has also attracted
much attention. In the second part of this thesis, previous experiments on ambiguous phoneme
recognition were carried out again on a new Bi-LSTM model, and phonetic transcriptions of words
ending with ambiguous phonemes were used as training targets, instead of individual sounds that
consisted of a single phoneme. We found that despite the vastly different architecture, the
new model showed highly similar behavior in terms of classification rate over the time course of
incremental retuning. This indicated that ambiguous phonemes in a continuous context could also
be quickly adapted by neural network-based models. In the last part of this thesis, our pre-trained
Dutch Bi-LSTM from the previous part was treated as a Dutch second language learner and was
asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the
Dutch model to generate phonetic transcriptions directly and retune the model on the transcriptions
it generated, although ground truth transcriptions were used to choose a subset of all self-labeled
transcriptions. Self-adaptation is of interest as a model of human second language learning, but also
has great practical engineering value, e.g., it could be used to adapt speech recognition to a low-resource
language. We investigated two ways to improve the adaptation scheme, with the first being
multi-task learning with articulatory feature detection both while training the model on Dutch and
during self-labeled adaptation, and the second being letting the model first adapt to isolated short
words before feeding it longer utterances.
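The incremental-retuning time course described above can be sketched with a toy stand-in for the classifier. Everything here is illustrative: the thesis retunes a DNN on ambiguous phoneme tokens, whereas this sketch uses a nearest-prototype classifier on 1-D "sounds", keeping only the shape of the experimental loop (retune on one bin, measure held-out accuracy, repeat):

```python
import numpy as np

rng = np.random.default_rng(1)

class ToyListener:
    """Toy stand-in for the pre-trained classifier: two phoneme
    prototypes with nearest-prototype classification."""
    def __init__(self):
        self.proto = np.array([-1.0, 3.0])   # category 0 and 1 prototypes

    def retune(self, tokens, lr=0.5):
        # One retuning step: pull the category-1 prototype toward the
        # ambiguous tokens, which should be relabelled as category 1.
        self.proto[1] += lr * (tokens.mean() - self.proto[1])

    def accuracy(self, x, y):
        pred = np.abs(x[:, None] - self.proto).argmin(axis=1)
        return (pred == y).mean()

# Ambiguous sounds centred at 1.0, initially equidistant from both
# prototypes; ten training bins, as in the thesis experiments.
bins = [rng.normal(1.0, 0.1, size=5) for _ in range(10)]
test_x = rng.normal(1.0, 0.1, size=100)
test_y = np.ones(100, dtype=int)

model = ToyListener()
curve = [model.accuracy(test_x, test_y)]          # before any retuning
for tokens in bins:
    model.retune(tokens)
    curve.append(model.accuracy(test_x, test_y))

# Step-like time course: near ceiling after the very first bin.
print([round(a, 2) for a in curve])
```

The printed curve jumps from near chance to near ceiling after the first bin, mirroring the step-like accuracy curves reported for both human listeners and the retuned DNNs.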
An integrated clustering analysis framework for heterogeneous data
Big data is a growing area of research with some important research challenges that motivate
our work. We focus on one such challenge, the variety aspect. First, we introduce
our problem by defining heterogeneous data as data about objects that are described by
different data types, e.g., structured data, text, time-series, images, etc. Through our work
we use five datasets for experimentation: a real dataset of prostate cancer data and four
synthetic datasets that we created and made publicly available. Each dataset
covers different combinations of data types that are used to describe objects. Our strategy
for clustering is based on fusion approaches. We compare intermediate and late fusion
schemes. We propose an intermediate fusion approach, Similarity Matrix Fusion (SMF),
where the integration process takes place at the level of calculating similarities. SMF produces
a single distance fusion matrix and two uncertainty expression matrices. We then
propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids
algorithm that utilises uncertainty calculations to improve on the clustering performance.
We evaluate our results by comparing them to clustering produced using individual elements
and show that the fusion approach produces equal or significantly better results.
Also, we show that there are advantages in utilising the uncertainty information as Hk-medoids
does. In addition, from a theoretical point of view, our proposed Hk-medoids
algorithm has lower computational complexity than the popular PAM implementation of the
k-medoids algorithm. Then, we employed late fusion that aggregates the results of clustering
by individual elements by combining cluster labels using an object co-occurrence
matrix technique. The final clustering is then derived by a hierarchical clustering algorithm.
We show that intermediate fusion for clustering of heterogeneous data is a feasible and
efficient approach using our proposed Hk-medoids algorithm.
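The intermediate-fusion step can be illustrated with a small sketch: per-data-type distance matrices are combined at the similarity level into one fused matrix, and the disagreement between types gives an uncertainty expression for each object pair. This is a simplification under stated assumptions (simple averaging, one uncertainty matrix where SMF produces two), not the thesis's exact formulation:

```python
import numpy as np

def fuse_distances(distance_matrices):
    """Intermediate-fusion sketch: combine per-data-type distance
    matrices into a single fused matrix, plus an uncertainty matrix
    measuring how much the individual types disagree per object pair
    (which an Hk-medoids-style algorithm could then exploit)."""
    stack = np.stack(distance_matrices)        # (n_types, n, n)
    fused = stack.mean(axis=0)                 # fused distance matrix
    uncertainty = stack.std(axis=0)            # disagreement across types
    return fused, uncertainty

# Two hypothetical data types describing 3 objects: e.g. one matrix
# from structured attributes, one from a text similarity measure.
d_struct = np.array([[0.0, 0.2, 0.9],
                     [0.2, 0.0, 0.8],
                     [0.9, 0.8, 0.0]])
d_text   = np.array([[0.0, 0.4, 0.7],
                     [0.4, 0.0, 0.9],
                     [0.7, 0.9, 0.0]])

fused, unc = fuse_distances([d_struct, d_text])
print(round(fused[0, 1], 2))   # 0.3 -> average of 0.2 and 0.4
print(round(unc[0, 1], 2))     # 0.1 -> the two types disagree slightly
```

A standard k-medoids could then run directly on `fused`; the point of Hk-medoids is to additionally down-weight pairs with high values in the uncertainty matrix.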
BadVFL: Backdoor Attacks in Vertical Federated Learning
Federated learning (FL) enables multiple parties to collaboratively train a
machine learning model without sharing their data; rather, they train their own
model locally and send updates to a central server for aggregation. Depending
on how the data is distributed among the participants, FL can be classified
into Horizontal (HFL) and Vertical (VFL). In VFL, the participants share the
same set of training instances but only host a different and non-overlapping
subset of the whole feature space. In HFL, by contrast, each participant shares the
same set of features while the training set is split into locally owned
training data subsets.
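The HFL/VFL distinction can be made concrete with a toy data matrix (parties, sizes, and the example domains are illustrative):

```python
import numpy as np

# A toy dataset: 6 training instances, 4 features each.
X = np.arange(24).reshape(6, 4)

# HFL: every party holds all features for its own subset of instances.
hfl_party_a, hfl_party_b = X[:3, :], X[3:, :]

# VFL: every party holds all instances but a disjoint subset of features
# (e.g. a bank's transaction features vs a retailer's purchase features).
vfl_party_a, vfl_party_b = X[:, :2], X[:, 2:]

print(hfl_party_a.shape, hfl_party_b.shape)  # (3, 4) (3, 4)
print(vfl_party_a.shape, vfl_party_b.shape)  # (6, 2) (6, 2)
```

The attack setting in the paper follows from the VFL split: a feature-holding party sees only its own columns and their embeddings, never the labels, which is what makes the label-inference phase necessary.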
VFL is increasingly used in applications like financial fraud detection;
nonetheless, very little work has analyzed its security. In this paper, we
focus on robustness in VFL, in particular, on backdoor attacks, whereby an
adversary attempts to manipulate the aggregate model during the training
process to trigger misclassifications. Performing backdoor attacks in VFL is
more challenging than in HFL, as the adversary i) does not have access to the
labels during training and ii) cannot change the labels as she only has access
to the feature embeddings. We present a first-of-its-kind clean-label backdoor
attack in VFL, which consists of two phases: a label inference and a backdoor
phase. We demonstrate the effectiveness of the attack on three different
datasets, investigate the factors involved in its success, and discuss
countermeasures to mitigate its impact.
Structured Dictionary Learning and its applications in Neural Recording
Widely utilized in the field of neuroscience, implantable neural recording devices can capture neuron activities with an acquisition rate on the order of megabytes per second. In order to efficiently transmit neural signals through wireless channels, these devices require compression methods that reduce power consumption. Although recent Compressed Sensing (CS) approaches have successfully demonstrated their power, their full potential is yet to be explored, particularly towards a more efficient representation of the neural signals. As a promising solution, sparse representation not only provides better signal compression for bandwidth/storage efficiency, but also leads to faster processing algorithms and more effective signal separation for classification purposes. However, current sparsity-based approaches for neural recording are limited by several critical drawbacks: (i) the lack of an efficient data-driven representation that fully captures the characteristics of specific neural signals; (ii) most existing methods do not fully exploit prior knowledge of the neural signals (e.g., labels), even though such information is often available; and (iii) a limited capability to encode discriminative information into the representation to promote classification.
Using neural recording as a case study, this dissertation presents new theoretical ideas and mathematical frameworks on structured dictionary learning with applications in compression and classification. Starting with a single-task setup, we provide theoretical proofs of the benefits of using structured sparsity in dictionary learning. We then provide several novel models for the representation of a single measurement, as well as of multiple measurements where signals exhibit both within-class similarity and within-class differences. Under the assumption that the label information of the neural signal is known, the proposed models minimize the data fidelity term together with structured sparsity terms to produce more discriminative representations. We demonstrate that this is particularly valuable in neural recording since it can further improve the compression ratio and classification accuracy, and helps deal with non-ideal scenarios such as co-occurring neuron firings. Fast and efficient algorithms based on Bayesian inference and the alternating direction method are proposed. Extensive experiments are conducted on neural recording applications as well as other classification tasks, such as image classification.
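As a generic illustration of the kind of objective involved (a standard group-sparse form, not the dissertation's exact models), structured dictionary learning augments the usual data-fidelity term with a structured sparsity penalty over predefined groups of coefficients:

```latex
\min_{D,\,X} \; \|Y - D X\|_F^2 \;+\; \lambda \sum_{g \in \mathcal{G}} \|X_g\|_2
```

Here \(Y\) collects the measured signals, \(D\) is the learned dictionary, \(X\) the sparse codes, and \(\mathcal{G}\) a set of coefficient groups (e.g., atoms associated with one class label); the mixed \(\ell_{2,1}\)-type penalty encourages whole groups to switch on or off together, which is what lets label structure shape the representation.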
Improving Clustering Methods By Exploiting Richness Of Text Data
Clustering is an unsupervised machine learning technique that involves discovering clusters (groups) of similar objects in unlabeled data and is generally considered an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can cause a ripple effect in advancing research across multiple fields.
Clustering any type of data is challenging and there are many open research questions. The clustering problem is exacerbated in the case of text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data and dealing with the well-known curse of dimensionality.
In this thesis, we investigate the limitations of existing text clustering methods and address them by providing five new text clustering methods--Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five methods show that exploiting the rich features of text data can outperform existing state-of-the-art text clustering methods.
The first new text clustering method QSC exploits user queries (one of the rich features in text data) to generate better quality clusters and cluster labels.
The second text clustering method, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves the clustering results.
The third text clustering method MMOEA is based on a multi-objective evolutionary algorithm. MMOEA exploits rich features to generate a diverse set of candidate clustering solutions, and forms a better clustering solution using a cluster-oriented approach.
The fourth and fifth text clustering methods, MDC and MOMVEC, address the limitations of MMOEA. MDC and MOMVEC differ in the implementation of their multi-objective evolutionary approaches.
All five methods are compared with existing state-of-the-art methods. The comparisons show that the newly developed text clustering methods outperform existing methods, achieving up to 16% improvement in some comparisons. In general, almost all of the newly developed clustering algorithms showed statistically significant improvements over existing methods.
The key ideas of the thesis highlight that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods.
The new text clustering methods introduced in this thesis can be widely applied in domains that involve the analysis of text data. The contributions of this thesis, which include five new text clustering methods, will help not only researchers in the data mining field but also a wide range of researchers in other fields.
Clustering ensemble method
Clustering is an unsupervised learning paradigm that partitions a given dataset into
clusters so that objects in the same cluster are more similar to each other than to the
objects in the other clusters. However, when clustering algorithms are used individually,
their results are often inconsistent and unreliable. This research applies the
philosophy of ensemble learning, which combines multiple partitions using a consensus
function, to address these issues and improve clustering performance.
A clustering ensemble framework is presented consisting of three phases: Ensemble
Member Generation, Consensus and Evaluation. This research focuses on
two points: the consensus function and ensemble diversity. For the first, we proposed
three new consensus functions: the Object-Neighbourhood Clustering Ensemble
(ONCE), the Dual-Similarity Clustering Ensemble (DSCE), and the Adaptive
Clustering Ensemble (ACE). ONCE takes into account the neighbourhood relationship
between object pairs in the similarity matrix, while DSCE and ACE are based
on two similarity measures: cluster similarity and membership similarity.
The proposed ensemble methods were tested on benchmark real-world and artificial
datasets. The results demonstrated that ONCE outperforms the other similar
methods, and is more consistent and reliable than k-means. Furthermore, DSCE
and ACE were compared with the ONCE, CO, MCLA and DICLENS clustering ensemble
methods. The results demonstrated that on average ACE outperforms the
state-of-the-art clustering ensemble methods CO, MCLA and DICLENS.
On diversity, we experimentally investigated all the existing measures to determine
their relationship with ensemble quality. The results indicate that none of them reveals a clear relationship, for two reasons:
(1) they are all inappropriately defined to measure the useful difference between the
members, and (2) none of them has been used directly by any consensus function.
Therefore, we point out that these two issues need to be addressed in future research.
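The consensus step that these methods build on can be sketched via the object co-occurrence matrix mentioned above: it records how often each pair of objects lands in the same cluster across the ensemble members, and a final clusterer is then run on that matrix. This is the basic idea only; ONCE, DSCE and ACE each refine it (neighbourhood relations, dual similarities, adaptivity) in ways not shown here:

```python
import numpy as np

def co_occurrence_matrix(labelings):
    """Consensus-function sketch: count how often each pair of objects
    is co-clustered across the ensemble members. A hierarchical
    clustering cut on (1 - co) would then yield the final partition."""
    labelings = np.asarray(labelings)              # (n_members, n_objects)
    n = labelings.shape[1]
    co = np.zeros((n, n))
    for labels in labelings:
        co += labels[:, None] == labels[None, :]
    return co / len(labelings)                     # fraction of agreement

# Three base partitions of four objects; the members mostly agree that
# {0, 1} and {2, 3} belong together.
members = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [1, 1, 1, 0]]
co = co_occurrence_matrix(members)
print(co[0, 1])            # 1.0 -> objects 0 and 1 always co-clustered
print(round(co[0, 2], 2))  # 0.33 -> co-clustered in one member of three
```

Note that cluster labels from different members need no alignment here, since only within-member co-membership is counted; that is precisely why label-based consensus functions favour this representation.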
Speech recognition with probabilistic transcriptions and end-to-end systems using deep learning
In this thesis, we develop deep learning models in automatic speech recognition (ASR) for two contrasting tasks characterized by the amounts of labeled data available for training. In the first half, we deal with scenarios when there are limited or no labeled data for training ASR systems. This situation is commonly prevalent in languages which are under-resourced. However, in the second half, we train ASR systems with large amounts of labeled data in English. Our objective is to improve modern end-to-end (E2E) ASR using attention modeling. Thus, the two primary contributions of this thesis are the following:
Cross-Lingual Speech Recognition in Under-Resourced Scenarios:
A well-resourced language is a language with an abundance of resources to support the development of speech technology. Those resources are usually defined in terms of 100+ hours of speech data, corresponding transcriptions, pronunciation dictionaries, and language models. In contrast, an under-resourced language lacks one or more of these resources. The most expensive and time-consuming resource is the acquisition of transcriptions due to the difficulty in finding native transcribers. The first part of the thesis proposes methods by which deep neural networks (DNNs) can be trained when there are limited or no transcribed data in the target language. Such scenarios are common for languages which are under-resourced.
Two key components of this proposition are Transfer Learning and Crowdsourcing. Through these methods, we demonstrate that it is possible to borrow statistical knowledge of acoustics from a variety of other well-resourced languages to learn the parameters of the DNN in the target under-resourced language. In particular, we use well-resourced languages as cross-entropy regularizers to improve the generalization capacity of the target language. A key accomplishment of this study is that it is the first to train DNNs using noisy labels in the target language transcribed by non-native speakers available in online marketplaces.
End-to-End Large Vocabulary Automatic Speech Recognition:
Recent advances in ASR have been mostly due to the advent of deep learning models. Such models have the ability to discover complex non-linear relationships between attributes that are usually found in real-world tasks. Despite these advances, building a conventional ASR system is a cumbersome procedure since it involves optimizing several components separately in a disjoint fashion. To alleviate this problem, modern ASR systems have adopted a new approach of directly transducing speech signals to text. Such systems are known as E2E systems, and one such approach is Connectionist Temporal Classification (CTC). However, one drawback of CTC is the hard alignment problem, as it relies only on the current input to generate the current output. In reality, the output at the current time is influenced not only by the current input but also by inputs in the past and the future.
Thus, the second part of the thesis proposes advancing state-of-the-art E2E speech recognition for large corpora by directly incorporating attention modeling within the CTC framework. In attention modeling, inputs in the current, past, and future are distinctively weighted depending on the degree of influence they exert on the current output. We accomplish this by deriving new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we extract more reliable content information from a network representing an implicit language model. Finally, we use vector-based attention weights that are applied to context vectors across both time and their individual components. A key accomplishment of this study is that it is the first to incorporate attention directly within the CTC network. Furthermore, we show that our proposed attention-based CTC model, even in the absence of an explicit language model, achieves lower word error rates than a well-trained conventional ASR system equipped with a strong external language model.
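The weighting of past, current, and future inputs can be sketched as attention over a local window of frame features: each frame near time t receives a weight reflecting its relevance to the output at t, and the weighted sum forms the context vector. The scoring function here (dot product with the centre frame) and all dimensions are illustrative assumptions, not the thesis's exact attention model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(h, t, width=2):
    """Sketch of local attention for the output at time t: frames in a
    window around t are weighted by an (illustrative) dot-product
    relevance score, and the weighted sum is the context vector."""
    lo, hi = max(0, t - width), min(len(h), t + width + 1)
    window = h[lo:hi]                   # (win, d) time-conv features
    scores = window @ h[t]              # relevance to the centre frame
    alpha = softmax(scores)             # attention weights, sum to 1
    return alpha @ window               # context vector for time t

rng = np.random.default_rng(2)
h = rng.normal(size=(10, 4))            # 10 frames, 4-dim features
c = attention_context(h, t=5)
print(c.shape)                          # (4,)
```

Plain CTC corresponds to the degenerate case width=0 (the context vector is just the current frame); widening the window is what lets surrounding inputs influence the current output.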