39 research outputs found

    Lecture notes on ridge regression

    The linear regression model cannot be fitted to high-dimensional data, as the high-dimensionality brings about empirical non-identifiability. Penalized regression overcomes this non-identifiability by augmenting the loss function with a penalty, i.e., a function of the regression coefficients. The ridge penalty is the sum of squared regression coefficients, giving rise to ridge regression. Here many aspects of ridge regression are reviewed, e.g., its moments, mean squared error, equivalence to constrained estimation, and relation to Bayesian regression. Its behaviour and use are then illustrated in simulation and on omics data. Subsequently, ridge regression is generalized to allow for a more general penalty. The ridge penalization framework is then translated to logistic regression, and its properties are shown to carry over. To contrast ridge penalized estimation, the final chapter introduces its lasso counterpart.
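    As a concrete illustration of the estimator these notes centre on, below is a minimal NumPy sketch of the ridge solution in a p > n setting; the data and penalty value are invented for the example, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                        # high-dimensional: p > n, so OLS is non-identifiable
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)

lam = 1.0                             # ridge penalty parameter
# Ridge estimator (X'X + lam*I)^{-1} X'y: unique even when X'X is singular.
# It is also the Bayesian posterior mode under a zero-mean Gaussian prior.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```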

    Semantic Autoencoder for Zero-Shot Learning

    Existing zero-shot learning (ZSL) models typically learn a projection function from a feature space to a semantic embedding space (e.g. attribute space). However, such a projection function is only concerned with predicting the seen-class semantic representation (e.g. attribute prediction) or classification during training. When applied to test data, which in the context of ZSL contains different (unseen) classes without training data, a ZSL model typically suffers from the projection domain shift problem. In this work, we present a novel solution to ZSL based on learning a Semantic AutoEncoder (SAE). Adopting the encoder-decoder paradigm, the encoder aims to project a visual feature vector into the semantic space, as in existing ZSL models. The decoder, however, exerts an additional constraint: the projection/code must be able to reconstruct the original visual feature. We show that with this additional reconstruction constraint, the projection function learned from the seen classes generalises better to new unseen classes. Importantly, the encoder and decoder are linear and symmetric, which enables us to develop an extremely efficient learning algorithm. Extensive experiments on six benchmark datasets demonstrate that the proposed SAE significantly outperforms existing ZSL models, with the additional benefit of lower computational cost. Furthermore, when the SAE is applied to the supervised clustering problem, it also beats the state-of-the-art. Comment: accepted to CVPR2017.
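    A compact sketch of the linear SAE described in the abstract: with visual features X and semantic representations S, minimising ||X - W'S||^2 + lambda*||WX - S||^2 over the tied projection W reduces to a Sylvester equation. Matrix shapes and the lambda value here are illustrative.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def sae(X, S, lam):
    """X: d x N visual features, S: k x N semantic vectors.
    Returns the k x d projection W minimising
    ||X - W.T @ S||^2 + lam * ||W @ X - S||^2."""
    A = S @ S.T                       # k x k
    B = lam * (X @ X.T)               # d x d
    C = (1 + lam) * (S @ X.T)         # k x d
    return solve_sylvester(A, B, C)   # solves A @ W + W @ B = C

# The encoder W projects features into the semantic space; the symmetric
# decoder W.T reconstructs the visual features from the projection.
```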

    Estimating the risk of suffering gender-based violence

    The primary objective of this research was to create a predictive model capable of identifying socio-demographic and clinical characteristics potentially associated with the risk of experiencing gender-based violence. Additionally, the study aimed to predict the likelihood of suffering gender-based violence with machine learning techniques, using data from the 2019 macro-survey on violence against women conducted by the Ministry of Equality of the Government of Spain. To achieve these goals, the study employed two feature selection methods: Recursive Feature Elimination (RFE) and LASSO Regularized Regression. Furthermore, two predictive models, namely Support Vector Machines (SVM) and Random Forest (RF), were utilized to analyze the data and make predictions.
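    A hypothetical scikit-learn sketch of the pipeline the abstract outlines: RFE and LASSO-based feature selection feeding SVM and random-forest classifiers. The survey variables are not reproduced here, so X_train and y_train are stand-ins and the selector settings are assumptions.

```python
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Recursive Feature Elimination around a linear model (feature count assumed)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)
# LASSO-style selection: L1-penalised logistic regression keeps sparse features
lasso = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))

svm_model = make_pipeline(StandardScaler(), rfe, SVC())
rf_model = make_pipeline(lasso, RandomForestClassifier(n_estimators=500))
# svm_model.fit(X_train, y_train); rf_model.fit(X_train, y_train)
```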

    Cross-class Transfer Learning for Visual Data

    Automatic analysis of visual data is a key objective of computer vision research, and visual recognition of objects from images is one of the most important steps towards understanding and gaining insights into visual data. Most existing approaches to visual recognition in the literature are based on a supervised learning paradigm. Unfortunately, they require a large amount of labelled training data, which severely limits their scalability. Recognition, on the other hand, is instantaneous and effortless for humans: they can recognise a new object without seeing any visual samples, by just knowing its description and leveraging similarities between the description of the new object and previously learned concepts. Motivated by this human recognition ability, this thesis proposes novel approaches to the cross-class transfer learning (cross-class recognition) problem, whose goal is to learn a model from seen classes (those with labelled training samples) that can generalise to unseen classes (those with labelled testing samples) without any training data, i.e., seen and unseen classes are disjoint. Specifically, the thesis studies and develops new methods for addressing three variants of cross-class transfer learning.

    Chapter 3: The first variant is transductive cross-class transfer learning, meaning both a labelled training set and an unlabelled test set are available for model learning. Considering the training set as the source domain and the test set as the target domain, a typical cross-class transfer learning approach assumes that the source and target domains share a common semantic space, into which a visual feature vector extracted from an image can be embedded using an embedding function. Existing approaches learn this function from the source domain and apply it without adaptation to the target one. They are therefore prone to the domain shift problem: the embedding function is only concerned with predicting the seen-class semantic representations during learning, so it may underperform when applied to the test data. In this thesis, a novel cross-class transfer learning (CCTL) method is proposed based on unsupervised domain adaptation. Specifically, a novel regularised dictionary learning framework is formulated in which the target class labels are used to regularise the learned target domain embeddings, thus effectively overcoming the projection domain shift problem.

    Chapter 4: The second variant is inductive cross-class transfer learning, that is, only the training set is assumed to be available during model learning, resulting in a harder challenge than the previous one. Nevertheless, this setting reflects the real world, in which test data only become available after model learning. The main problem remains the same as in the previous variant: the domain shift problem occurs when the model learned only from the training set is applied to the test set without adaptation. In this thesis, a semantic autoencoder (SAE) is proposed, building on an encoder-decoder paradigm. First, a semantic space is defined so that knowledge transfer is possible from the seen classes to the unseen classes. Then, an encoder aims to embed/project a visual feature vector into the semantic space, while the decoder exerts a generative task: the projection must be able to reconstruct the original visual features. The generative task forces the encoder to preserve richer information, so the encoder learned from the seen classes is able to generalise better to the new unseen classes.

    Chapter 5: The third variant is unsupervised cross-class transfer learning. Here, no supervision is available for model learning, i.e., only unlabelled training data are available, making this the hardest setting of the three. The goal, however, is the same: learning from the training data some knowledge that can be transferred to test data composed of labels completely different from those of the training data. The thesis proposes a novel approach which requires no labelled training data yet is able to capture discriminative information. The proposed model is based on a new graph regularised dictionary learning algorithm. By introducing an l1-norm graph regularisation term, instead of the conventional squared l2-norm, the model is robust against the outliers and noise typical of visual data. Importantly, the graph and the representation are learned jointly, further alleviating the effects of data outliers. As an application of this variant, the thesis considers person re-identification.
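    Illustrative only: a common way a learned projection W (e.g. from the SAE sketched earlier) is used at test time for cross-class recognition is to embed unseen-class features into the semantic space and match them to class prototypes by cosine similarity. The function and argument names below are hypothetical.

```python
import numpy as np

def predict_unseen(W, X_test, prototypes, labels):
    """W: k x d projection, X_test: d x N features,
    prototypes: k x C semantic vectors, labels: array of C class labels."""
    Z = W @ X_test                                    # embed test features
    Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)  # normalise for cosine similarity
    P = prototypes / np.linalg.norm(prototypes, axis=0, keepdims=True)
    return labels[np.argmax(P.T @ Z, axis=0)]         # nearest prototype per sample
```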

    Pattern recognition and machine learning for magnetic resonance images with kernel methods

    The aim of this thesis is to apply a particular category of machine learning and pattern recognition algorithms, namely kernel methods, to both functional and anatomical magnetic resonance images (MRI). The work focuses specifically on supervised learning methods, and both methodological and practical aspects are described. Kernel methods have a computational advantage for high-dimensional data, which makes them ideal for imaging data. The procedures can be broadly divided into two components: the construction of the kernels and the kernel algorithms themselves. Pre-processed functional or anatomical images can be turned into a linear kernel or a non-linear kernel. We introduce both kernel regression and kernel classification algorithms in two main categories: probabilistic methods and non-probabilistic methods. For practical applications, kernel classification methods were applied to decode the cognitive or sensory states of a subject from the fMRI signal, and to discriminate patients with neurological diseases from normal controls using anatomical MRI. Kernel regression methods were used to predict the regressors in the design of fMRI experiments, and clinical ratings from anatomical scans.
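    A minimal sketch of the two-step procedure the abstract describes: build a linear kernel from vectorised, pre-processed images, then apply kernel machines for classification and regression. The shapes and targets are placeholders, not the thesis data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 100_000))   # n_subjects x n_voxels, already pre-processed
y_class = rng.integers(0, 2, 40)         # e.g. patient vs control labels
y_rating = rng.standard_normal(40)       # e.g. a clinical rating to regress on

# Linear kernel: an n x n matrix, cheap even when the voxel count is huge
K = X @ X.T
clf = SVC(kernel="precomputed").fit(K, y_class)       # kernel classification
reg = KernelRidge(kernel="precomputed").fit(K, y_rating)  # kernel regression
```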

    Characterisation of xenometabolome signatures in complex biomatrices for enhanced human population phenotyping

    Metabolic phenotyping facilitates the analysis of low molecular weight compounds in complex biological samples, with the resulting metabolite profiles providing a window on endogenous processes and xenobiotic exposures. Accurate characterisation of the xenobiotic component of the metabolome (the xenometabolome) is particularly valuable when metabolic phenotyping is used for epidemiological and clinical population studies, where participants' exposure to xenobiotics is unknown or difficult to control or estimate. Additionally, as metabolic phenotyping has increasingly been incorporated into toxicology and drug metabolism research, phenotyping datasets may be exploited to study xenobiotic metabolism at the population level. This thesis describes novel analytical and data-driven strategies for broadening xenometabolome coverage to allow effective partitioning of endogenous and xenobiotic metabolome signatures. The data-driven strategy was multi-faceted, involving the generation of a reference database and the application of statistical methodologies. The database contains profiles of over 100 common xenobiotics, generated using established liquid chromatography-mass spectrometry methods, and provided the basis for an empirically derived screen for human urine and blood samples. The prevalence of these xenobiotics was explored in an exemplar phenotyping dataset (ALZ; n = 650; urine), with 31 xenobiotics detected in an initial screen. Statistical methods were tailored to extract xenobiotic-related signatures and evaluated using drugs with well-characterised human metabolism. To complement the data-driven strategies for xenometabolome coverage, a more analytically based strategy was also developed: a dispersive solid phase extraction sample preparation protocol for blood products was optimised, permitting efficient removal of lipids and proteins with minimal effect on low molecular weight metabolites. The suitability and reproducibility of this method were evaluated in two independent blood sample sets (AZstudy12, n = 171; MARS, n = 285). Finally, these analytical and statistical strategies were applied to two existing large-scale phenotyping study datasets, AIRWAVE (n = 3000 urine, n = 3000 plasma samples) and ALZ (n = 650 urine, n = 449 serum), and used to explore both xenobiotic and endogenous responses to triclosan and polyethylene glycol (PEG) exposure. Exposure to triclosan highlighted affected pathways relating to sulfation, whilst exposure to PEG highlighted a possible perturbation in the glutathione cycle. The analytical and statistical strategies described in this thesis allow for a more comprehensive xenometabolome characterisation and have been used to uncover previously unreported relationships between xenobiotic and endogenous metabolism.
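    A hypothetical illustration of what a database-driven xenobiotic screen can look like in code: detected LC-MS feature m/z values are matched against reference xenobiotic m/z values within a ppm tolerance. The reference entry and tolerance are invented for the example and do not come from the thesis.

```python
def screen(features_mz, reference, ppm_tol=5.0):
    """features_mz: detected m/z values; reference: dict of name -> reference m/z.
    Returns the reference compounds with at least one feature within tolerance."""
    hits = {}
    for name, ref_mz in reference.items():
        tol = ref_mz * ppm_tol / 1e6            # convert ppm tolerance to Da
        matched = [mz for mz in features_mz if abs(mz - ref_mz) <= tol]
        if matched:
            hits[name] = matched
    return hits

# Illustrative single-entry database (approximate [M-H]- m/z for triclosan)
reference = {"triclosan [M-H]-": 286.9439}
print(screen([286.9440, 301.1234], reference))
```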

    Development, Optimization and Clinical Evaluation Of Algorithms For Ultrasound Data Analysis Used In Selected Medical Applications.

    The assessment of soft and hard tissues is critical when selecting appropriate protocols for restorative and regenerative therapy in the field of dental surgery. The chosen treatment methodology has significant ramifications for healing time, success rate and overall long-term oral health. Currently used diagnostic methods are limited to visual and invasive assessments; they are often user-dependent and inaccurate, and result in misinterpretation. A clinical need was therefore identified for objective tissue characterization, and the proposed novel ultrasound-based approach was designed to address it. The device prototype consists of a miniaturized probe with a specifically designed ultrasonic transducer, electronics responsible for signal generation and acquisition, and an optimized signal processing algorithm required for data analysis. An algorithm in which signals are processed and features extracted in real time has been implemented and studied. An in-depth study of the algorithm's performance on synthetic signals is presented. Further, in-vitro laboratory experiments were performed on animal-based samples using the developed device with the algorithm implemented in software. The results validated the capability of the new system to perform gingival assessment rapidly, effectively and reproducibly. The developed device has met clinical usability requirements for effectiveness and performance.
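    A small sketch of the kind of real-time feature extraction such a system typically performs: envelope detection of an ultrasound A-scan via the Hilbert transform, followed by two simple echo features. The signal parameters below are invented for the demo, not taken from the device described.

```python
import numpy as np
from scipy.signal import hilbert

fs = 100e6                                    # assumed sampling rate: 100 MHz
t = np.arange(0, 10e-6, 1 / fs)
# Synthetic RF echo: a 10 MHz pulse with a Gaussian envelope centred at 4 us
rf = np.sin(2 * np.pi * 10e6 * t) * np.exp(-((t - 4e-6) ** 2) / (0.5e-6) ** 2)

envelope = np.abs(hilbert(rf))                # demodulated echo envelope
peak_amp = envelope.max()                     # feature 1: echo strength
peak_time = t[envelope.argmax()]              # feature 2: time of flight -> tissue depth
```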

    Trying to Beat the Market: An Empirical Analysis of the Historical and Potential Active Returns of the Government Pension Fund Global

    The Government Pension Fund Global (hereafter the GPFG) helps finance the Norwegian welfare state and aims to be managed in a way that benefits both current and future generations. Today, the fund is managed closely to a benchmark index based on a mandate determined by the Ministry of Finance, but it is also managed actively to generate excess returns. The active management of the fund is a heated topic, and there have been frequent debates about the fund's management model. This thesis aims to contribute to that discussion and investigates the historical and potential active management and returns through our research question: "How has the active management and accompanying active returns of the GPFG been historically, and how could increased active management impact active returns?" Our thesis rests on three supporting analyses: a historical analysis evaluating fund performance and active management, a scenario analysis investigating potential active returns, and lastly a qualitative study validating our findings. We first analyse the historical active returns and management of the fund. We find that active returns have predominantly been significant throughout the investigated time periods, and that active management has created additional returns for the fund, both in terms of benchmark risk-adjusted alpha and factor risk-adjusted alpha. We further establish the historical degree of active management and find an average active share of 18.92% from 2015 to 2020 and an annual tracking error of 0.63% since inception, essentially defining the GPFG as an index fund. Furthermore, we construct three synthetic portfolios combining the GPFG with the New Zealand Superannuation Fund to analyse the active returns of portfolios with higher degrees of active management. All three synthetic portfolios outperform the GPFG's historical active returns both in-sample and out-of-sample, clearly indicating that the GPFG has an opportunity to increase its active returns by increasing active management. Our initial findings are then evaluated in light of existing empirical research on active versus passive management, where the broad consensus contradicts our quantitative findings. Even after weighing this empirical research, we still find that active management and its accompanying returns have created significant value historically, and that the fund could increase its active returns by increasing active management. Additionally, we question why the tracking error limit set by the Ministry of Finance is not exploited, and recommend that this be considered.
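    For readers unfamiliar with the statistics this analysis relies on, here is a short sketch of how active return, tracking error and the information ratio are computed from monthly return series; the series below are random placeholders, not GPFG data.

```python
import numpy as np

rng = np.random.default_rng(0)
fund = rng.normal(0.006, 0.03, 240)          # monthly fund returns (placeholder)
bench = rng.normal(0.005, 0.03, 240)         # monthly benchmark returns (placeholder)

active = fund - bench                        # active return vs the benchmark
tracking_error = active.std(ddof=1) * np.sqrt(12)    # annualised tracking error
info_ratio = (active.mean() * 12) / tracking_error   # annualised information ratio
```

    A low tracking error, as reported for the GPFG, means the fund's returns deviate little from its benchmark, which is why the thesis characterises it as essentially an index fund.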

    Hyperparameters, tuning and meta-learning for random forest and other machine learning algorithms

    In this cumulative dissertation thesis, I examine the influence of hyperparameters on machine learning algorithms, with a special focus on random forest. It mainly consists of three papers that were written in the last three years. The first paper (Probst and Boulesteix, 2018) examines the influence of the number of trees on the performance of a random forest. It is generally believed that the number of trees should be set higher to achieve better performance. However, we show some real data examples in which the expectation of measures such as accuracy and AUC (partially) decreases with a growing number of trees. We prove theoretically why this can happen and argue that it only happens in very special data situations. For other measures, such as the Brier score, the logarithmic loss or the mean squared error, we show that this cannot happen. In a benchmark study based on 306 classification and regression datasets, we illustrate the extent of this unexpected behaviour. We observe that, on average, most of the improvement in performance is achieved while growing the first 100 trees. We use our new OOBCurve R package (Probst, 2017a) for the analysis, which can be used to examine the performance of a random forest for a growing number of trees based on the out-of-bag observations. The second paper (Probst et al., 2019b) is a more general work. First, we review the literature on the influence of hyperparameters on random forest. The hyperparameters considered are the number of variables drawn at each split, the sampling scheme for drawing observations for each tree, the minimum number of observations in a node that a tree is allowed to have, the number of trees and the splitting rule. Their influence is examined with respect to performance, runtime and variable importance. In the second part of the paper, different tuning strategies for obtaining optimal hyperparameters are presented, and a new R package, tuneRanger, is introduced. It executes sequential model-based optimization as its tuning strategy, based on the out-of-bag observations, with the hyperparameters and tuning ranges chosen automatically. In a benchmark study, this implementation is compared with other implementations that execute tuning for random forest. The third paper (Probst et al., 2019a) is even more general and presents a framework for examining the tunability of the hyperparameters of machine learning algorithms. It first defines the concept of defaults properly and proposes definitions for measuring the tunability of the whole algorithm, of single hyperparameters and of combinations of hyperparameters. To apply these definitions to a collection of 38 binary classification datasets, a random bot was created, which generated in total around 5 million experiment runs of 6 algorithms with different hyperparameters. The details of this bot are described in an extra paper (Kühn et al., 2018), co-authored by myself, that is also included in this dissertation. The results of this bot are used to estimate the tunability of these 6 algorithms and their specific hyperparameters. Furthermore, ranges for the parameter tuning of these algorithms are proposed.
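    The "performance versus number of trees" experiment from the first paper can be reproduced in spirit with scikit-learn instead of the OOBCurve/tuneRanger R packages the dissertation uses; the dataset below is a synthetic stand-in. Growing the forest incrementally with warm_start and reading the out-of-bag score gives a test-like accuracy estimate at each forest size without a held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

for n_trees in (10, 50, 100, 500):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)                     # warm_start: only the new trees are grown
    print(n_trees, rf.oob_score_)    # OOB accuracy as the forest grows
```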