Graph Convolutional Network-based Feature Selection for High-dimensional and Low-sample Size Data
Feature selection is a powerful dimension reduction technique that selects a subset of relevant features for model construction. Numerous feature selection methods have been proposed, but most of them fail under the high-dimensional and low-sample size (HDLSS) setting due to the challenge of overfitting. In this paper, we present a deep learning-based method, the GRAph Convolutional nEtwork feature Selector (GRACES), to select important features for HDLSS data. We provide empirical evidence that GRACES outperforms other feature selection methods on both synthetic and real-world datasets.
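As a loose illustration of the idea of bringing graph convolutions to feature selection, the sketch below smooths per-feature relevance scores over a feature-correlation graph with a single GCN-style propagation step. This is not GRACES itself (which is a trained deep model); every design choice here, from the k-NN graph to the correlation-based base scores, is a hypothetical stand-in.

```python
import numpy as np

def graph_smoothed_scores(X, y, k=10):
    """Score features by label correlation, then smooth the scores over a
    feature-similarity graph with one graph-convolution step.

    X: (n_samples, n_features); y: (n_samples,) target. Assumes k < n_features.
    """
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-12)
    yc = (y - y.mean()) / (y.std() + 1e-12)
    base = np.abs(Xc.T @ yc) / len(y)               # |corr(feature, label)|
    C = np.abs(np.corrcoef(Xc, rowvar=False))       # feature-feature similarity
    A = np.zeros_like(C)
    idx = np.argsort(-C, axis=1)[:, 1:k + 1]        # k nearest neighbours (skip self)
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = C[i, nbrs]
    A = np.maximum(A, A.T) + np.eye(len(A))         # symmetrize, add self-loops
    Dinv = np.diag(1.0 / np.sqrt(A.sum(1)))
    return Dinv @ A @ Dinv @ base                   # one GCN-style propagation

# keep the top-m features:  selected = np.argsort(-graph_smoothed_scores(X, y))[:m]
```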
Pretext Tasks Selection for Multitask Self-Supervised Speech Representation Learning
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been properly explored and understood. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The proposed method estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. Experiments on automatic speech recognition, speaker recognition, and emotion recognition validate our approach: the groups selected and weighted with our method outperform classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
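The weighting scheme the abstract describes can be pictured with a small PyTorch module that learns one weight per partial loss jointly with training. The softmax normalization is an assumption; the paper's actual calibration procedure is not detailed here.

```python
import torch
import torch.nn as nn

class WeightedPretextLoss(nn.Module):
    """Combine per-pretext-task losses with learnable weights.

    Hypothetical sketch: weights are softmax-normalized logits trained
    jointly with the encoder, not the paper's exact calibration method.
    """
    def __init__(self, n_tasks):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, partial_losses):
        w = torch.softmax(self.logits, dim=0)         # weights sum to 1
        return (w * torch.stack(partial_losses)).sum(), w

# usage: total, w = criterion([loss_pitch, loss_mfcc, loss_energy])
# pretext tasks whose learned weight stays near zero are candidates to drop
```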
Affective Speech Recognition
Speech, as a medium of interaction, carries two different streams of information. Whereas one stream carries explicit messages, the other contains implicit information about the speakers themselves. Affective speech recognition is a set of theories and tools that aim to automate the unfolding of the part of the implicit stream that concerns human emotion. One application of affective speech recognition is human-computer interaction: a machine that is able to recognize human emotion could engage the user in a more effective interaction. This thesis proposes a set of analyses and methodologies that advance automatic recognition of affect from speech. The proposed solution spans two dimensions of the problem: speech signal processing and statistical learning.
At the speech signal processing dimension, extraction of speech low-level descriptors is discussed, and a set of descriptors that exploit the spectrum of the signal is proposed, which has been shown to be particularly practical for capturing affective qualities of speech. Moreover, considering the non-stationary nature of the speech signal, a measure of dynamicity is further proposed that captures this property by quantifying changes of the signal over time. Furthermore, based on the proposed set of low-level descriptors, it is shown that individual human beings differ in how they convey emotions, and that the parts of the spectrum that hold the affective information differ from one person to another. Therefore, the concept of an emotion profile is proposed, which formalizes those differences by taking into account factors such as cultural and gender-specific differences, as well as distinctions specific to individual human beings.
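A measure of dynamicity of the kind described, one that quantifies how the signal changes over time, might look like the following sketch. The thesis's exact descriptor set and measure are not specified here; mean spectral flux over short-time frames is used as a stand-in.

```python
import numpy as np

def dynamicity(signal, frame_len=400, hop=160):
    """Quantify how much the short-time spectrum changes over time.

    Hypothetical sketch: frame the signal, take windowed magnitude
    spectra, and average the frame-to-frame spectral change.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = np.array([np.abs(np.fft.rfft(f * np.hanning(frame_len)))
                        for f in frames])
    deltas = np.diff(spectra, axis=0)               # change between frames
    return np.linalg.norm(deltas, axis=1).mean()    # mean spectral flux
```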
At the statistical learning dimension, variable selection is performed to identify the speech features that are most important for extracting affective information. In doing so, low-level descriptors are distinguished from statistical functionals, and the effectiveness of each of the two is studied both jointly and separately. The major importance of variable selection as a standalone component of a solution lies in the real-time application of affective speech recognition. Although thousands of speech features are commonly used to tackle this problem in theory, extracting that many features in real time is unrealistic, especially for mobile applications. Results of the conducted investigations show that the required number of speech features is far smaller than the number commonly used in the literature on the problem.
At the core of an affective speech recognition solution is a statistical model that uses speech features to recognize emotions. Such a model comes with a set of parameters that are estimated through a learning process. Proposed in this thesis is a learning algorithm, developed based on the notion of the Hilbert-Schmidt independence criterion and named max-dependence regression, that maximizes the dependence between predicted and actual values of affective qualities. Pearson's correlation coefficient is commonly used as the measure of goodness of fit in the affective computing literature; max-dependence regression is therefore proposed to make the learning and hypothesis-testing criteria consistent with one another. Results of this research show that doing so yields higher prediction accuracy.
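The objective of max-dependence regression can be illustrated with the standard empirical HSIC estimator, HSIC = tr(K H L H) / (n-1)^2, where K and L are kernel matrices over predictions and targets and H is the centering matrix. Below is a minimal sketch that maximizes this quantity over the weights of a linear model; the Gaussian kernels, fixed bandwidth, and Adam optimizer are assumptions, not the thesis's exact choices.

```python
import torch

def gaussian_kernel(z, sigma=1.0):
    d2 = torch.cdist(z, z) ** 2                     # pairwise squared distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(K, L):
    n = K.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n         # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# toy data: n samples, d features, one continuous affective rating per sample
torch.manual_seed(0)
n, d = 64, 10
X = torch.randn(n, d)
y = 2.0 * X[:, :1] + 0.1 * torch.randn(n, 1)

w = torch.zeros(d, 1, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
Ky = gaussian_kernel(y)                             # label kernel is fixed
for step in range(200):
    opt.zero_grad()
    loss = -hsic(gaussian_kernel(X @ w), Ky)        # maximize dependence
    loss.backward()
    opt.step()
```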
Lastly, sparse representation of affective speech datasets is considered in this thesis. For this purpose, the application of a dictionary learning algorithm based on the Hilbert-Schmidt independence criterion is proposed. Dictionary learning is used to identify the most important bases of the data in order to improve the generalization capability of the proposed solution to affective speech recognition. Based on the chosen dictionary learning approach, fusion of feature vectors is proposed. It is shown that sparse representation leads to higher generalization capability for affective speech recognition.
Kernelized Supervised Dictionary Learning
The representation of a signal using a learned dictionary instead of predefined operators, such as wavelets, has led to state-of-the-art results in various applications such as denoising, texture analysis, and face recognition. The area of dictionary learning is closely associated with sparse representation, meaning that the signal is represented using only a few atoms in the dictionary. Despite recent advances in computing a dictionary with fast algorithms such as K-SVD, online learning, and cyclic coordinate descent, which make computing a dictionary from millions of data samples feasible, the dictionary is mainly computed using unsupervised approaches such as k-means. These approaches learn the dictionary by minimizing the reconstruction error without taking the category information into account, which is suboptimal for classification tasks.
In this thesis, we propose a supervised dictionary learning (SDL) approach that incorporates class label information into the learning of the dictionary. To this end, we propose to learn the dictionary in a space where the dependency between the signals and their corresponding labels is maximized. To maximize this dependency, the recently introduced Hilbert-Schmidt independence criterion (HSIC) is used. The learned dictionary is compact and has a closed-form solution, so the proposed approach is fast. We show that it outperforms other unsupervised and supervised dictionary learning approaches in the literature on real-world data.
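The closed-form flavour of the HSIC-based objective can be sketched as follows. This is a minimal illustration, assuming signals as columns of X and a delta kernel on the class labels: atoms are taken as the top eigenvectors of the HSIC-style matrix X H K H X^T. The thesis's exact formulation, constraints, and sparse-coding step may differ.

```python
import numpy as np

def hsic_dictionary(X, y, n_atoms):
    """Learn dictionary atoms maximizing dependence with labels via HSIC.

    X: (d, n) signals as columns; y: (n,) class labels.
    """
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    K = (y[:, None] == y[None, :]).astype(float)    # delta kernel on labels
    M = X @ H @ K @ H @ X.T                         # (d, d), symmetric
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_atoms]]  # top eigenvectors as atoms
```

Sparse codes for new signals would then be computed against this dictionary with any standard sparse solver (e.g., lasso), which is omitted here.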
Moreover, a main advantage of the proposed SDL approach is that it can be easily kernelized, particularly by incorporating a data-driven kernel, such as a compression-based kernel, into the formulation. In this thesis, we propose a novel compression-based (dis)similarity measure. The proposed measure utilizes a 2D MPEG-1 encoder, which takes into consideration the spatial locality and connectivity of pixels in the images. The formulation has been carefully designed around the MPEG encoder's functionality: by design, it uses only P-frame coding to find the (dis)similarity among patches/images. We show that the proposed measure works properly on both small and large patch sizes on textures. Experimental results show that incorporating the proposed measure as a kernel into our SDL significantly improves the performance of supervised pixel-based texture classification on Brodatz and outdoor images compared to other compression-based dissimilarity measures, as well as state-of-the-art SDL methods. It also improves the computation speed by about 40% compared to its closest rival.
Finally, we extended the proposed SDL to multiview learning, where more than one representation of a dataset is available. We propose two different multiview approaches: one fuses the feature sets in the original space and then learns the dictionary and sparse coefficients on the fused set; the other learns one dictionary and the corresponding coefficients in each view separately, and then fuses the representations in the space of the learned dictionaries. We show that the proposed multiview approaches benefit from the complementary information in multiple views, and we investigate their relative performance in the application of emotion recognition.
Large-scale dimensionality reduction using perturbation theory and singular vectors
Massive volumes of high-dimensional data have become pervasive, with the number of features significantly exceeding the number of samples in many applications. This has created a bottleneck for data mining applications and amplified the computational burden of machine learning algorithms that perform classification or pattern recognition. Dimensionality reduction can handle this problem in two ways, namely feature selection (FS) and feature extraction. In this thesis, we focus on FS because, in many applications such as bioinformatics, domain experts need to validate a set of original features to corroborate the hypotheses of the prediction models. In processing high-dimensional data, FS mainly involves detecting a limited number of important features among tens or hundreds of thousands of irrelevant and redundant features.
We start by filtering the irrelevant features using our proposed Sparse Least Squares (SLS) method, in which a score is assigned to each feature and the low-scoring features are removed using a soft threshold. To demonstrate the effectiveness of SLS, we used it to augment well-known FS methods, achieving substantially reduced running times while improving, or at least maintaining, the prediction accuracy of the models.
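The filtering pattern SLS describes (score each feature, then soft-threshold) can be sketched as follows. The actual SLS score is not given here, so a univariate least-squares fit quality (the squared correlation with the target) stands in as the relevance score.

```python
import numpy as np

def sls_style_filter(X, y, threshold=0.1):
    """Filter irrelevant features via a per-feature score and soft threshold.

    Hypothetical sketch: score_j is the fraction of y's variance explained
    by feature j alone (squared correlation), not the thesis's exact score.
    """
    Xc = X - X.mean(0)
    yc = y - y.mean()
    scores = (Xc.T @ yc) ** 2 / ((Xc ** 2).sum(0) * (yc ** 2).sum() + 1e-12)
    scores = np.maximum(scores - threshold, 0.0)    # soft threshold
    return np.nonzero(scores)[0], scores            # surviving feature indices
```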
We then developed a linear FS method (DRPT) which, after data reduction by SLS, clusters the reduced data using perturbation theory to detect correlations between the remaining features. Important features are ultimately selected from each cluster, and the redundant features are discarded.
To extend the applicability of clustering to grouping the redundant features, we proposed a new Singular Vectors FS (SVFS) method that is capable of both removing the irrelevant features and effectively clustering the remaining ones, so that the features in each cluster exhibit correlations only with each other. The important features, selected independently from the different clusters, make up the final ranking. Devising thresholds for filtering irrelevant and redundant features makes our model adaptable to the particular needs of various applications.
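The grouping idea behind SVFS can be illustrated with a small sketch. This is an assumption-laden simplification: features are clustered by which of the leading right singular vectors they load on most strongly, as a stand-in for whatever clustering SVFS actually performs.

```python
import numpy as np

def svd_feature_clusters(X, n_vectors=5):
    """Group features by their loadings on the top right singular vectors.

    X: (n_samples, n_features). Features projecting similarly onto the
    leading singular directions are treated as mutually correlated.
    """
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    loadings = Vt[:n_vectors].T                     # (n_features, n_vectors)
    # assign each feature to the singular direction it loads on most strongly
    return np.argmax(np.abs(loadings), axis=1)

# picking the top-scoring feature from each cluster independently then
# forms the final ranking
```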
A comprehensive evaluation on benchmark biological and image datasets shows the superiority of our proposed methods over state-of-the-art FS methods in terms of classification accuracy, running time, and memory usage.
From fuzzy-rough to crisp feature selection
A central problem in machine learning and pattern recognition is identifying the most important features in a dataset. This process plays a decisive role in big data processing by reducing the size of datasets. One major drawback of existing feature selection methods is the high chance of redundant features appearing in the final subset; in most cases, finding and removing them can greatly improve the resulting classification accuracy. To tackle this problem on two different fronts, we employed fuzzy-rough sets and perturbation theory.

On one side, we used three strategies to improve the performance of fuzzy-rough set-based feature selection methods. The first strategy was to encode both features and samples in one binary vector and use a shuffled frog leaping algorithm to choose the best combination, using the fuzzy dependency degree as the fitness function. In the second strategy, we designed a measure that evaluates features based on the fuzzy-rough dependency degree in such a way that redundant features are given lower priority for selection. In the last strategy, we designed a new binary version of the shuffled frog leaping algorithm that employs a fuzzy positive region as its similarity measure, working in complete harmony with the fitness function (i.e., the fuzzy-rough dependency degree). To extend the applicability of fuzzy-rough set-based feature selection to multi-party medical datasets, we designed a privacy-preserving version of the original method.
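The fitness function driving these searches, the fuzzy-rough dependency degree, can be computed along the following lines. This is a minimal sketch assuming features scaled to [0, 1], a per-attribute similarity 1 - |a - b| combined with the min t-norm, and the Łukasiewicz implicator for the lower approximation; the thesis's exact similarity relation and implicator may differ.

```python
import numpy as np

def fuzzy_dependency(X, y):
    """Fuzzy-rough dependency degree of the decision on feature subset X.

    X: (n, d) with features scaled to [0, 1]; y: (n,) crisp class labels.
    """
    n = len(y)
    # fuzzy similarity relation R(x, z) over the selected attributes
    R = (1.0 - np.abs(X[:, None, :] - X[None, :, :])).min(axis=2)
    pos = np.zeros(n)
    for c in np.unique(y):
        member = (y == c).astype(float)             # crisp decision class
        # lower approximation: inf_z I(R(x, z), class(z)), Lukasiewicz I
        lower = np.minimum(1.0, 1.0 - R + member[None, :]).min(axis=1)
        pos = np.maximum(pos, lower)                # fuzzy positive region
    return pos.mean()                               # dependency degree in [0, 1]

# fitness for the frog-leaping search: fuzzy_dependency(X[:, subset], y)
```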
In addition, we studied the feasibility and applicability of perturbation theory for feature selection, which, to the best of our knowledge, had never been researched before. We introduced a new feature selection method based on perturbation theory that is not only capable of detecting and discarding redundant features but is also very fast and flexible in accommodating the special needs of the application. It employs a clustering algorithm to group similarly behaving features, based on each feature's sensitivity to perturbation, its angle to the outcome, and the effect of its removal on the outcome; it then chooses the feature closest to the centre of each cluster and returns all those features as the final subset. To assess the effectiveness of the proposed methods, we compared the results of each method with well-known feature selection methods on a series of artificially generated datasets, as well as biological, medical, and cancer datasets adopted from the University of California Irvine machine learning repository, the Arizona State University repository, and the Gene Expression Omnibus repository.
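The perturbation-based clustering step can be illustrated schematically. Everything below is a hypothetical reconstruction from the description above: the three signature components (sensitivity to perturbation, angle to the outcome, removal effect) are approximated with simple linear-regression proxies, and k-means stands in for the thesis's clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def perturbation_select(X, y, n_clusters=5, seed=0):
    """Cluster similarly behaving features and keep one per cluster."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    base = LinearRegression().fit(X, y).score(X, y)  # baseline fit quality
    sig = np.zeros((d, 3))
    for j in range(d):
        Xp = X.copy()                                # (1) perturb feature j
        Xp[:, j] += rng.normal(scale=X[:, j].std() + 1e-12, size=n)
        sig[j, 0] = abs(base - LinearRegression().fit(Xp, y).score(Xp, y))
        # (2) angle of feature j to the outcome (absolute cosine)
        sig[j, 1] = abs(X[:, j] @ y) / (
            np.linalg.norm(X[:, j]) * np.linalg.norm(y) + 1e-12)
        Xr = np.delete(X, j, axis=1)                 # (3) remove feature j
        sig[j, 2] = abs(base - LinearRegression().fit(Xr, y).score(Xr, y))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(sig)
    selected = []
    for i, centre in enumerate(km.cluster_centers_):
        members = np.flatnonzero(km.labels_ == i)
        # feature closest to the cluster centre represents the cluster
        selected.append(members[np.argmin(
            np.linalg.norm(sig[members] - centre, axis=1))])
    return sorted(int(j) for j in selected)
```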