Lecture notes on ridge regression
The linear regression model cannot be fitted to high-dimensional data, as the
high-dimensionality brings about empirical non-identifiability. Penalized
regression overcomes this non-identifiability by augmentation of the loss
function by a penalty (i.e. a function of regression coefficients). The ridge
penalty is the sum of squared regression coefficients, giving rise to ridge
regression. Here many aspects of ridge regression are reviewed, e.g. moments,
mean squared error, its equivalence to constrained estimation, and its relation
to Bayesian regression. Finally, its behaviour and use are illustrated in
simulation and on omics data. Subsequently, ridge regression is generalized to
allow for a more general penalty. The ridge penalization framework is then
translated to logistic regression and its properties are shown to carry over.
To contrast ridge penalized estimation, the final chapter introduces its lasso
counterpart.
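As a minimal numerical sketch (not taken from the notes themselves), the ridge estimator has the closed form beta_hat = (X'X + lambda*I)^{-1} X'y, which remains well defined even when the number of covariates p exceeds the sample size n; the dimensions and penalty value below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# High-dimensional setting: more covariates (p) than samples (n),
# so X'X is singular and the OLS estimator is not identifiable.
n, p, lam = 20, 100, 1.0
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Ridge estimator: beta_hat = (X'X + lam * I)^{-1} X'y.
# The penalty lam * ||beta||^2 makes the matrix invertible even for p > n.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_hat[:5])
```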
Semantic Autoencoder for Zero-Shot Learning
Existing zero-shot learning (ZSL) models typically learn a projection
function from a feature space to a semantic embedding space (e.g.~attribute
space). However, such a projection function is only concerned with predicting
the training seen class semantic representation (e.g.~attribute prediction) or
classification. When applied to test data, which in the context of ZSL contains
different (unseen) classes without training data, a ZSL model typically suffers
from the projection domain shift problem. In this work, we present a novel
solution to ZSL based on learning a Semantic AutoEncoder (SAE). Taking the
encoder-decoder paradigm, an encoder aims to project a visual feature vector
into the semantic space as in the existing ZSL models. However, the decoder
exerts an additional constraint, that is, the projection/code must be able to
reconstruct the original visual feature. We show that with this additional
reconstruction constraint, the learned projection function from the seen
classes is able to generalise better to the new unseen classes. Importantly,
the encoder and decoder are linear and symmetric, which enables us to develop an
extremely efficient learning algorithm. Extensive experiments on six benchmark
datasets demonstrate that the proposed SAE significantly outperforms the
existing ZSL models with the additional benefit of lower computational cost.
Furthermore, when the SAE is applied to the supervised clustering problem, it also
beats the state-of-the-art. Comment: accepted to CVPR201
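Because the encoder and decoder are linear and symmetric, setting the gradient of the SAE objective to zero yields a Sylvester equation with a closed-form solution. The sketch below illustrates this with SciPy; the dimensions and the weight lam are toy assumptions, not the paper's settings:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)

# Toy dimensions: d visual features, k semantic attributes, n samples.
d, k, n, lam = 50, 10, 200, 0.5
X = rng.standard_normal((d, n))   # visual features (columns = samples)
S = rng.standard_normal((k, n))   # semantic representations

# SAE objective: min_W ||X - W'S||^2 + lam * ||W X - S||^2.
# Setting the gradient to zero gives the Sylvester equation
#   (S S') W + lam * W (X X') = (1 + lam) * S X'.
A = S @ S.T
B = lam * (X @ X.T)
Q = (1 + lam) * (S @ X.T)
W = solve_sylvester(A, B, Q)      # k x d projection (encoder) matrix

# Verify that the stationarity condition holds.
print(np.allclose(A @ W + W @ B, Q))   # True
```

Encoding a sample is then a single matrix-vector product `W @ x`, which is the source of the efficiency claimed in the abstract.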
Estimating the risk of suffering gender-based violence
The primary objective of this research was to create a predictive model capable of identifying socio-demographic and clinical characteristics potentially associated with the risk of experiencing gender-based violence. Additionally, the aim was to predict the likelihood of suffering gender-based violence using machine learning techniques by evaluating data from the 2019 macro-survey on violence against women conducted by the Ministry of Equality of the Government of Spain. To achieve these goals, the study employed two feature selection methods: Recursive Feature Elimination (RFE) and LASSO Regularized Regression. Furthermore, two predictive models, namely Support Vector Machines (SVM) and Random Forest (RF), were utilized to analyze the data and make predictions
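As a hedged illustration of the pipeline described (feature selection with RFE followed by a random-forest classifier), the following sketch uses synthetic data rather than the macro-survey; all sizes and parameters are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: 30 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Recursive Feature Elimination keeps the 5 strongest features,
# ranked by the random forest's feature importances.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=5)
selector.fit(X_tr, y_tr)

# Fit the final predictive model on the selected features only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(selector.transform(X_tr), y_tr)
acc = clf.score(selector.transform(X_te), y_te)
print(round(acc, 2))
```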
Cross-class Transfer Learning for Visual Data
Automatic analysis of visual data is a key objective of computer vision research, and performing
visual recognition of objects from images is one of the most important steps towards understanding
and gaining insights into the visual data. Most existing approaches in the literature for
visual recognition are based on a supervised learning paradigm. Unfortunately, they require a
large amount of labelled training data which severely limits their scalability. On the other hand,
recognition is instantaneous and effortless for humans. They can recognise a new object without
seeing any visual samples by just knowing the description of it, leveraging similarities between
the description of the new object and previously learned concepts. Motivated by humans'
recognition ability, this thesis proposes novel approaches to tackle the cross-class
transfer learning (cross-class recognition) problem, whose goal is to learn a model from seen classes (those with labelled
training samples) that can generalise to unseen classes (those with labelled testing samples) without
any training data, i.e., seen and unseen classes are disjoint. Specifically, the thesis studies and
develops new methods for addressing three variants of the cross-class transfer learning:
Chapter 3 The first variant is transductive cross-class transfer learning, meaning a labelled
training set and an unlabelled test set are available for model learning. Considering the training set
as the source domain and the test set as the target domain, a typical cross-class transfer learning method
assumes that the source and target domains share a common semantic space, where visual feature
vector extracted from an image can be embedded using an embedding function. Existing
approaches learn this function from the source domain and apply it without adaptation to the
target one. They are therefore prone to the domain shift problem, i.e., the embedding function
is only concerned with predicting the seen-class semantic representations during learning,
so when applied to the test data it may underperform. In this thesis, a novel
cross-class transfer learning (CCTL) method is proposed based on unsupervised domain adaptation.
Specifically, a novel regularised dictionary learning framework is formulated by which the
target class labels are used to regularise the learned target domain embeddings thus effectively
overcoming the projection domain shift problem.
Chapter 4 The second variant is inductive cross-class transfer learning, that is, only the training
set is assumed to be available during model learning, resulting in a harder challenge compared
to the previous one. Nevertheless, this setting reflects a real-world scenario in which test data is
only available after model learning. The main problem remains the same as the previous variant,
that is, the domain shift problem occurs when the model learned only from the training set is applied
to the test set without adaptation. In this thesis, a semantic autoencoder (SAE) is proposed
building on an encoder-decoder paradigm. Specifically, first a semantic space is defined so that
knowledge transfer is possible from the seen classes to the unseen classes. Then, an encoder aims
to embed/project a visual feature vector into the semantic space. However, the decoder exerts a
generative task, that is, the projection must be able to reconstruct the original visual features. The
generative task forces the encoder to preserve richer information, thus the learned encoder from
seen classes is able to generalise better to the new unseen classes.
Chapter 5 The third one is unsupervised cross-class transfer learning. In this variant, no
supervision is available for model learning, i.e., only unlabelled training data is available, leading
to the hardest setting compared to the previous cases. The goal, however, is the same: learning
knowledge from the training data that can be transferred to the test data, which is composed of
completely different labels from those of the training data. The thesis proposes a novel approach which
requires no labelled training data yet is able to capture discriminative information. The proposed
model is based on a new graph regularised dictionary learning algorithm. By introducing an l1-
norm graph regularisation term, instead of the conventional squared l2-norm, the model is robust
against the outliers and noise typical of visual data. Importantly, the graph and representation are
learned jointly, resulting in further alleviation of the effects of data outliers. As an application,
person re-identification is considered for this variant in this thesis.
Pattern recognition and machine learning for magnetic resonance images with kernel methods
The aim of this thesis is to apply a particular category of machine learning and
pattern recognition algorithms, namely the kernel methods, to both functional and
anatomical magnetic resonance images (MRI). This work specifically focused on
supervised learning methods. Both methodological and practical aspects are described
in this thesis.
Kernel methods have a computational advantage for high-dimensional data,
and are therefore ideal for imaging data. The procedures can be broadly divided into
two components: the construction of the kernels and the actual kernel algorithms
themselves. From pre-processed functional or anatomical images, either a linear
or a non-linear kernel can be computed. We introduce both kernel regression and kernel
classification algorithms in two main categories: probabilistic methods and
non-probabilistic methods. For practical applications, kernel classification methods
were applied to decode the cognitive or sensory states of the subject from the fMRI
signal and were also applied to discriminate patients with neurological diseases from
normal people using anatomical MRI. Kernel regression methods were used to predict
the regressors in the design of fMRI experiments, and clinical ratings from the
anatomical scans.
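A minimal sketch of the two components described above, the construction of a linear kernel and a kernel regression algorithm (kernel ridge regression is used here as a representative example, not necessarily the thesis's exact method), on synthetic stand-in imaging data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for imaging data: n scans, each flattened to p voxels (p >> n).
n, p = 40, 5000
X = rng.standard_normal((n, p))
w = rng.standard_normal(p) / np.sqrt(p)
y = X @ w + 0.1 * rng.standard_normal(n)   # e.g. a clinical rating per scan

# Linear kernel: an n x n matrix of inner products between scans.
# Working with K instead of X is the computational advantage when p >> n.
K = X @ X.T

# Kernel ridge regression: alpha = (K + lam * I)^{-1} y,
# prediction for a new scan x*: f(x*) = sum_i alpha_i <x_i, x*>.
lam = 1.0
alpha = np.linalg.solve(K + lam * np.eye(n), y)

x_new = rng.standard_normal(p)
f_new = alpha @ (X @ x_new)
print(float(f_new))
```

Swapping `K` for a non-linear kernel (e.g. Gaussian) changes only the first step; the algorithm itself is unchanged, which is the modularity the paragraph describes.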
Characterisation of xenometabolome signatures in complex biomatrices for enhanced human population phenotyping
Metabolic phenotyping facilitates the analysis of low molecular weight compounds in complex biological samples, with resulting metabolite profiles providing a window on endogenous processes and xenobiotic exposures. Accurate characterisation of the xenobiotic component of the metabolome (the xenometabolome) is particularly valuable when metabolic phenotyping is used for epidemiological and clinical population studies where exposure of participants to xenobiotics is unknown or difficult to control/estimate. Additionally, as metabolic phenotyping has increasingly been incorporated into toxicology and drug metabolism research, phenotyping datasets may be exploited to study xenobiotic metabolism at the population level. This thesis describes novel analytical and data-driven strategies for broadening xenometabolome coverage to allow effective partitioning of endogenous and xenobiotic metabolome signatures.
The data-driven strategy was multi-faceted, involving the generation of a reference database and the application of statistical methodologies. The database contains profiles of over 100 common xenobiotics, generated using established liquid chromatography-mass spectrometry methods, and provided the basis for an empirically derived screen for human urine and blood samples. The prevalence of these xenobiotics was explored in an exemplar phenotyping dataset (ALZ; n = 650; urine), with 31 xenobiotics detected in an initial screen. Statistical methods were tailored to extract xenobiotic-related signatures and evaluated using drugs with well-characterised human metabolism.
To complement the data-driven strategies for xenometabolome coverage, an analytical strategy was additionally developed. A dispersive solid-phase extraction sample preparation protocol for blood products was optimised, permitting efficient removal of lipids and proteins with minimal effect on low molecular weight metabolites. The suitability and reproducibility of this method were evaluated in two independent blood sample sets (AZstudy12, n = 171; MARS, n = 285).
Finally, these analytical and statistical strategies were applied to two existing large-scale phenotyping study datasets, AIRWAVE (n = 3000 urine, n = 3000 plasma samples) and ALZ (n = 650 urine, n = 449 serum), and used to explore both xenobiotic and endogenous responses to triclosan and polyethylene glycol exposure. Exposure to triclosan highlighted affected pathways relating to sulfation, whilst exposure to PEG highlighted a possible perturbation in the glutathione cycle.
The analytical and statistical strategies described in this thesis allow for a more comprehensive xenometabolome characterisation and have been used to uncover previously unreported relationships between xenobiotic and endogenous metabolism.
Development, Optimization and Clinical Evaluation Of Algorithms For Ultrasound Data Analysis Used In Selected Medical Applications.
The assessment of soft and hard tissues is critical when selecting appropriate protocols for restorative and regenerative therapy in the field of dental surgery. The chosen treatment methodology will have significant ramifications for healing time, success rate and overall long-term oral health. Currently used diagnostic methods are limited to visual and invasive assessments; they are often user-dependent, inaccurate and result in misinterpretation. As such, the clinical need has been identified for objective tissue characterization, and the proposed novel ultrasound-based approach was designed to address the identified need. The device prototype consists of a miniaturized probe with a specifically designed ultrasonic transducer, electronics responsible for signal generation and acquisition, as well as an optimized signal processing algorithm required for data analysis. An algorithm in which signals are processed and features are extracted in real time has been implemented and studied. An in-depth study of algorithm performance on synthetic signals is presented. Further, in-vitro laboratory experiments were performed on animal-based samples using the developed device with the algorithm implemented in software. Results validated the capability of the new system to perform gingival assessment rapidly, reproducibly and effectively. The developed device has met clinical usability requirements for effectiveness and performance
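The abstract does not specify the signal processing algorithm, so as one hedged illustration of real-time feature extraction from an ultrasound A-scan, the sketch below computes an echo envelope with the Hilbert transform on a synthetic pulse; all signal parameters are hypothetical, not taken from the device:

```python
import numpy as np
from scipy.signal import hilbert

# Hypothetical acquisition parameters (not from the actual prototype).
fs = 50e6                        # 50 MHz sampling rate
t = np.arange(0, 4e-6, 1 / fs)   # 4 microseconds of A-scan
f0 = 10e6                        # 10 MHz transducer centre frequency

# Synthetic echo: a Gaussian-windowed tone burst centred at 1.5 us,
# mimicking a reflection from a tissue interface.
gauss = np.exp(-((t - 1.5e-6) ** 2) / (2 * (0.1e-6) ** 2))
echo = gauss * np.cos(2 * np.pi * f0 * (t - 1.5e-6))

# Envelope detection via the analytic signal.
envelope = np.abs(hilbert(echo))

# Simple features: peak amplitude and time of flight of the strongest echo.
peak_amp = envelope.max()
tof = t[envelope.argmax()]
print(round(tof * 1e6, 2))       # echo arrival time in microseconds
```

Features such as `peak_amp` and `tof` are the kind of per-echo quantities a real-time characterisation algorithm could extract and feed to a classifier.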
Trying to Beat the Market : An Empirical Analysis of the Historical and Potential Active Returns of the Government Pension Fund Global
The Government Pension Fund Global (hereafter the GPFG) helps finance the Norwegian welfare state
and aims to be managed in such a way that it benefits both current and future generations. Today,
the fund is managed closely to a benchmark index based on a mandate determined by the Ministry of
Finance, but it is also managed actively to generate excess returns. The active management of the
fund is a heated topic and there have been frequent debates related to the management model of the
fund. This thesis aims to contribute to the discussion and investigates the historical and
potential active management and returns, through our research question: “How has the active
management and accompanying active returns of the GPFG been historically, and how could increased
active management impact active returns?”
Our thesis rests on three supportive analyses: a historical analysis evaluating fund performance
and active management, a scenario-analysis investigating potential active returns, and lastly a
qualitative study validating our findings.
We first analyse the historical active returns and management of the fund. We find that active
returns have predominantly been significant throughout the investigated time periods, and that
active management has created additional returns for the fund, both in terms of benchmark
risk-adjusted alpha and factor risk-adjusted alpha. We further establish the historical degree of
active management and find an average active share of 18.92% from 2015 to 2020 and an annual
tracking error of 0.63% since inception, essentially defining the GPFG as an index fund.
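The two activeness measures used above, active share and tracking error, can be sketched as follows with made-up weights and returns (not GPFG data):

```python
import numpy as np

# Hypothetical portfolio and benchmark weights over five holdings.
fund_w = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
bench_w = np.array([0.25, 0.25, 0.25, 0.15, 0.10])

# Active share: half the sum of absolute weight deviations from the benchmark.
active_share = 0.5 * np.abs(fund_w - bench_w).sum()
print(f"active share: {active_share:.2%}")       # 5.00%

# Tracking error: annualised std. dev. of active (fund minus benchmark) returns,
# here simulated as 10 years of monthly observations.
rng = np.random.default_rng(0)
active_returns = rng.normal(0.0002, 0.002, 120)
tracking_error = active_returns.std(ddof=1) * np.sqrt(12)
print(f"tracking error: {tracking_error:.2%}")
```

Low values on both measures, as reported for the GPFG, are what motivates the text's characterisation of the fund as essentially an index fund.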
Furthermore, we construct three synthetic portfolios combining the GPFG with the New Zealand
Superannuation Fund, to analyse active returns of portfolios with higher degrees of active
management. All three synthetic portfolios outperform the GPFG’s historical active returns both
in-sample and out-of-sample, clearly indicating that there exists an opportunity for the GPFG to
increase its active returns by increasing active management. Our initial findings are further
evaluated in light of existing empirical research in the field of active versus passive management,
where the broad consensus contradicts our quantitative findings.
Even in light of this empirical research, we still find that active management and its
accompanying returns have created significant value historically and that the fund could increase
its active returns by increasing active management. Additionally, we question why the tracking
error limit set by the Ministry of Finance is not exploited, and further recommend that this should
be considered.
Hyperparameters, tuning and meta-learning for random forest and other machine learning algorithms
In this cumulative dissertation thesis, I examine the influence of hyperparameters on machine learning algorithms, with a special focus on random forest. It mainly consists of three papers that were written in the last three years.
The first paper (Probst and Boulesteix, 2018) examines the influence of the number of trees on the performance of a random forest. In general, it is believed that the number of trees should be set higher to achieve better performance. However, we show some real data examples in which the expectation of measures such as accuracy and AUC (partially) decreases with a growing number of trees. We prove theoretically why this can happen and argue that it only happens in very special data situations. For other measures, such as the Brier score, the logarithmic loss or the mean squared error, we show that this cannot happen. In a benchmark study based on 306 classification and regression datasets, we illustrate the extent of this unexpected behaviour. We observe that, on average, most of the improvement in performance can be achieved while growing the first 100 trees. We use our new OOBCurve R package (Probst, 2017a) for the analysis, which can be used to examine the performance of a random forest for a growing number of trees based on the out-of-bag observations.
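The OOB-curve analysis described here can be sketched in Python with scikit-learn's `warm_start` mechanism, which reuses already-grown trees as the forest is enlarged (the OOBCurve package does this natively in R; the data below is synthetic, not one of the 306 benchmark datasets):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification task standing in for a benchmark dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# warm_start=True keeps existing trees and only grows the additional ones,
# so the out-of-bag accuracy can be tracked as the forest grows.
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
oob_curve = []
for n_trees in [25, 50, 100, 300]:
    clf.set_params(n_estimators=n_trees)
    clf.fit(X, y)
    oob_curve.append((n_trees, clf.oob_score_))

for n_trees, score in oob_curve:
    print(n_trees, round(score, 3))
```

Plotting such a curve is how one would check, per dataset, whether a performance measure keeps improving, plateaus, or (as the paper shows can happen for accuracy and AUC) partially decreases as trees are added.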
The second paper (Probst et al., 2019b) is a more general work. First, we review the literature on the influence of hyperparameters on random forest. The hyperparameters considered are the number of variables drawn at each split, the sampling scheme for drawing observations for each tree, the minimum number of observations in a node that a tree is allowed to have, the number of trees and the splitting rule. Their influence is examined with regard to performance, runtime and variable importance. In the second part of the paper, different tuning strategies for obtaining optimal hyperparameters are presented. A new software package in R, tuneRanger, is introduced. It executes the tuning strategy of sequential model-based optimization based on the out-of-bag observations. The hyperparameters and ranges for tuning are chosen automatically. In a benchmark study, this implementation is compared with other implementations that execute tuning for random forest.
The third paper (Probst et al., 2019a) is even more general and presents a general framework for examining the tunability of the hyperparameters of machine learning algorithms. It first defines the concept of defaults properly and proposes definitions for measuring the tunability of the whole algorithm, of single hyperparameters and of combinations of hyperparameters. To apply these definitions to a collection of 38 binary classification datasets, a random bot was created, which generated around 5 million experiment runs in total across 6 algorithms with different hyperparameters. The details of this bot are described in an extra paper (Kühn et al., 2018), co-authored by myself, that is also included in this dissertation. The results of this bot are used to estimate the tunability of these 6 algorithms and their specific hyperparameters. Furthermore, ranges for parameter tuning of these algorithms are proposed.
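The tunability definitions can be illustrated on a toy grid of made-up AUC values (the actual paper estimates these quantities from the bot's roughly 5 million runs): a data-driven default is the single configuration that performs best on average across datasets, and per-dataset tunability is the gain of the best configuration over that default.

```python
import numpy as np

# Made-up performance grid: rows are datasets, columns are hyperparameter
# configurations, entries are AUC values (purely illustrative).
auc = np.array([
    [0.80, 0.82, 0.85, 0.79],
    [0.70, 0.74, 0.73, 0.72],
    [0.90, 0.89, 0.91, 0.90],
])

# The "default" is the configuration that is best on average across datasets.
default = auc.mean(axis=0).argmax()

# Tunability per dataset: best achievable AUC minus AUC of the default.
tunability_per_dataset = auc.max(axis=1) - auc[:, default]
print(default, tunability_per_dataset)
```

Averaging `tunability_per_dataset` over many datasets gives a single tunability score for the algorithm; the same construction applied to one hyperparameter at a time (holding the others at their defaults) yields the per-hyperparameter tunabilities the paper reports.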