
    High-dimensional variable selection for GLMs and survival models

    The focus of the thesis is on statistical and numerical approaches to fitting sparse genomic data with GLMs and survival models. The thesis describes the selection of explanatory variables that may affect a univariate outcome whose probability distribution falls in the class of the exponential dispersion family. The approach explored is differential geometric least angle regression (dgLARS), which was developed for generalized linear models. The dgLARS approach is compared to alternative methods for variable selection in generalized linear models. The numerical procedures of dgLARS are improved for the general setting, and the result is referred to as the extended dgLARS.
Moreover, we investigate how well the dispersion parameter of the exponential dispersion family can be estimated. We then turn to survival data and genomic influence, using the relative risk function. In all chapters it is shown that the improved and newly developed numerical procedures are fast and accurate in estimating parameters. Finally, a full description is given of the R package that was developed to carry out all the analyses.
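The path-based style of variable selection the thesis studies can be illustrated with a small sketch. This is not the dgLARS method itself (which is implemented in the R package dglars); an ordinary L1 (lasso) solution path from scikit-learn stands in here, with synthetic high-dimensional data in which only three of two hundred candidate variables carry signal.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic high-dimensional setting: more candidate variables (p)
# than samples (n), with only three truly relevant variables.
rng = np.random.default_rng(42)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [8.0, -7.0, 6.0]           # strong signal on variables 0, 1, 2
y = X @ beta + rng.standard_normal(n)

# Trace the full regularization path (analogous in spirit to the
# sequence of solutions dgLARS computes for GLMs).
alphas, coefs, _ = lasso_path(X, y)   # coefs has shape (p, n_alphas)

# Order of entry along the path: variables that activate at a larger
# penalty carry stronger signal and are selected first.
ever_active = (coefs != 0).any(axis=1)
entry = np.where(ever_active, np.argmax(coefs != 0, axis=1), coefs.shape[1])
order = np.argsort(entry)
print("first variables to enter the path:", sorted(order[:3].tolist()))
```

With a strong signal the three relevant variables enter the path well before any noise variable, which is the behaviour path-following selection methods exploit.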
    Sparse relative risk regression models

    Clinical studies in which patients are screened for many genomic features are becoming routine. In principle, this holds the promise of finding genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures would be useful in the diagnosis of the patient, could inform treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, with the number of features often outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios. These methods typically induce sparsity by means of a coincidental match of the geometry of the convex likelihood and a (near) non-convex regularizer. The disadvantages of such methods are that they are typically not invariant to scale changes of the covariates, they struggle with highly correlated covariates, and they pose the practical problem of determining the amount of regularization. In this article, we propose an extension of the differential geometric least angle regression method for sparse inference in relative risk regression models. A software implementation of our method is also provided.
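The objective that sparse relative risk methods regularize is the Cox partial likelihood. As a minimal sketch (not the authors' dgLARS implementation), the negative log partial likelihood can be computed in a few lines of numpy, assuming no tied event times:

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of a Cox relative risk model.
    X: (n, p) covariates; time: follow-up times;
    event: 1 = observed event, 0 = censored. Assumes no tied times."""
    order = np.argsort(time)            # sort so each risk set is a suffix
    X_s, event_s = X[order], event[order]
    eta = X_s @ beta                    # linear predictor (log relative risk)
    # log(sum of exp(eta) over the risk set at each time), accumulated
    # from the latest time backwards for numerical stability
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return -np.sum(event_s * (eta - log_risk))

# Toy data: 8 subjects, 3 covariates, fixed censoring indicators.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
time = rng.exponential(size=8)
event = np.array([1, 0, 1, 1, 0, 1, 1, 0])
nll0 = cox_neg_log_partial_likelihood(np.zeros(3), X, time, event)
print(nll0)
```

A sparse method penalizes this objective (or, in the dgLARS view, follows a geometric path of solutions) so that most coefficients are exactly zero.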

    Statistical Methods for Normalization and Analysis of High-Throughput Genomic Data

    High-throughput genomic datasets obtained from microarray or sequencing studies have revolutionized the field of molecular biology over the last decade. The complexity of these new technologies also poses new challenges for statisticians to separate biologically relevant information from technical noise. Two methods are introduced that address important issues with the normalization of array comparative genomic hybridization (aCGH) microarrays and the analysis of RNA sequencing (RNA-Seq) studies. Many studies investigating copy number aberrations at the DNA level for cancer and genetic studies use comparative genomic hybridization (CGH) on oligo arrays. However, aCGH data often suffer from low signal-to-noise ratios, resulting in poor resolution of fine features. Bilke et al. showed that the commonly used running-average noise reduction strategy performs poorly when errors are dominated by systematic components. A method called pcaCGH is proposed that significantly reduces noise using a non-parametric regression on technical covariates of probes to estimate systematic bias. A robust principal components analysis (PCA) then estimates any remaining systematic bias not explained by the technical covariates used in the preceding regression. The proposed algorithm is demonstrated on two CGH datasets measuring the NCI-60 cell lines on NimbleGen and Agilent microarrays. The method achieves a nominal error variance reduction of 60%-65% as well as a 2-fold increase in signal-to-noise ratio on average, resulting in more detailed copy number estimates. Furthermore, correlations of signal intensity ratios between NimbleGen and Agilent arrays are increased by 40% on average, indicating a significant improvement in agreement between the technologies. A second algorithm, called gamSeq, is introduced to test for differential gene expression in RNA sequencing studies. Limitations of existing methods are outlined and the proposed algorithm is compared to these existing algorithms.
Simulation studies and real data are used to show that gamSeq improves upon existing methods with regard to type I error control while maintaining similar or better power for a range of sample sizes in RNA-Seq studies. Furthermore, the proposed method is applied to detect differential 3' UTR usage.
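The two-stage idea behind pcaCGH can be sketched on synthetic data. The actual method uses non-parametric regression and robust PCA; in this hypothetical illustration a linear fit and ordinary PCA stand in, and the covariate names (GC content, batch) are assumptions for the example only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Synthetic aCGH-like log-ratios: true signal plus a probe bias driven
# by a technical covariate (here, hypothetically, GC content) and a
# residual batch-like systematic component.
rng = np.random.default_rng(0)
n_probes, n_arrays = 500, 12
gc = rng.uniform(0.3, 0.7, n_probes)           # technical covariate per probe
bias = np.outer(2.0 * gc, np.ones(n_arrays))   # covariate-driven systematic bias
batch = np.outer(rng.standard_normal(n_probes), rng.standard_normal(n_arrays))
signal = 0.1 * rng.standard_normal((n_probes, n_arrays))
Y = signal + bias + 0.5 * batch                # observed intensity log-ratios

# Stage 1: regress each array on the technical covariate, keep residuals.
fit = LinearRegression().fit(gc[:, None], Y)
resid = Y - fit.predict(gc[:, None])

# Stage 2: estimate remaining systematic structure with PCA and remove
# the leading component.
pca = PCA(n_components=1)
scores = pca.fit_transform(resid)
cleaned = resid - scores @ pca.components_

print(round(float(np.var(Y)), 3), round(float(np.var(cleaned)), 3))
```

Each stage strictly reduces the total variance, which is the mechanism behind the reported 60%-65% error variance reduction (the real method's robust estimators guard against outlying probes, which this sketch does not).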

    Big Data Analytics and Information Science for Business and Biomedical Applications

    The analysis of Big Data in biomedical as well as business and financial research has drawn much attention from researchers worldwide. This book provides a platform for in-depth discussion of state-of-the-art statistical methods developed for the analysis of Big Data in these areas. Both applied and theoretical contributions are showcased.

    Machine Learning in Insurance

    Machine learning is a relatively new field, without a unanimous definition. In many ways, actuaries have long been machine learners. In pricing and reserving, and more recently in capital modelling, actuaries have combined statistical methodology with a deep understanding of the problem at hand and of how any solution may affect the company and its customers. One aspect that has, perhaps, not been so well developed among actuaries is validation. Discussions among actuaries of their “preferred methods” often lacked solid scientific arguments, including validation for the case at hand. Through this collection, we aim to promote good practice of machine learning in insurance, considering the following three key issues: a) who is the client, sponsor, or otherwise interested real-life target of the study? b) what is the reason for working with a particular data set, and what extra knowledge (which we also call prior knowledge) is available besides the data set alone? c) what is the mathematical-statistical argument for the validation procedure?