
    High-Dimensional Linear and Functional Analysis of Multivariate Grapevine Data

    Variable selection plays a major role in multivariate high-dimensional statistical modeling. Hence, we need to select a consistent model that avoids overfitting in prediction, enhances model interpretability, and identifies the relevant variables. We explore several continuous, nearly unbiased, sparse, and accurate linear-model techniques based on coefficient paths, including penalized maximum likelihood with nonconvex penalties and iterative Sure Independence Screening (SIS). The convex penalized (pseudo-)likelihood approach based on the elastic net uses a mixture of the ℓ1 (Lasso) and ℓ2 (ridge regression) penalties to simultaneously achieve automatic variable selection, continuous shrinkage, and selection of groups of correlated variables. Variable selection using coefficient paths for the minimax concave penalty (MCP) starts applying penalization at the same rate as the Lasso and then smoothly relaxes the rate down to zero as the absolute value of the coefficient increases. The sure screening method is based on correlation learning, which computes componentwise estimators, with AIC used to tune the regularization parameter of the penalized-likelihood Lasso. To reflect the continuous nature of spectral data, we use a functional data approach, approximating each spectrum by a finite linear combination of B-spline basis functions. MCP, SIS, and functional regression rest on the intuition that the predictors are independent. However, the high-dimensional grapevine dataset suffers from ill-conditioning of the covariance matrix due to multicollinearity. Under collinearity, the elastic-net regularization path computed via coordinate descent yields the best results, controlling the sparsity of the model, with cross-validation used to reduce bias in variable selection. Iterative stepwise multiple linear regression reduces complexity and enhances the predictability of the model by selecting only significant predictors.
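
    As a concrete illustration of the elastic-net path idea, the following sketch (not the authors' code; the synthetic p > n design with two collinear columns merely stands in for the grapevine spectra) fits an elastic net by coordinate descent with cross-validation, using scikit-learn:

    # A minimal sketch of elastic-net variable selection with cross-validation,
    # as implemented by coordinate descent in scikit-learn. The synthetic data
    # below are a hypothetical stand-in for the grapevine spectral data.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    n, p = 60, 500                                   # p >> n, as in spectra
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)    # correlated predictors
    beta = np.zeros(p)
    beta[[0, 1, 10]] = [2.0, 2.0, -1.5]
    y = X @ beta + rng.normal(scale=0.5, size=n)

    # l1_ratio mixes the l1 (lasso) and l2 (ridge) penalties; CV picks alpha.
    model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
    print("alpha:", model.alpha_, "l1_ratio:", model.l1_ratio_)
    print("selected variables:", np.flatnonzero(model.coef_))

    Because the ridge component keeps correlated columns together, variables 0 and 1 tend to be selected as a group here, which a pure Lasso would not guarantee.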

    Bayesian analytical approaches for metabolomics : a novel method for molecular structure-informed metabolite interaction modeling, a novel diagnostic model for differentiating myocardial infarction type, and approaches for compound identification given mass spectrometry data.

    Metabolomics, the study of small molecules in biological systems, has enjoyed great success in enabling researchers to examine disease-associated metabolic dysregulation, and has been utilized for the discovery of biomarkers of disease and phenotypic states. In spite of recent technological advances in the analytical platforms utilized in metabolomics and the proliferation of tools for the analysis of metabolomics data, significant challenges in metabolomics data analysis remain. In this dissertation, we present three of these challenges and Bayesian methodological solutions for each. In the first part we develop a new methodology to serve as a basis for making higher-order inferences in metabolomics, which we define as the testing of hypotheses that are more complex than single-metabolite hypothesis tests. This methodology utilizes informative priors, generated via the analysis of molecular structure similarity, to enable the estimation of metabolite interactomes (probabilistic models) that are organism-, sample-medium-, and condition-specific as well as comprehensive, and that can serve as reference models for studying perturbations in metabolic systems. After discussing the development of our methodology, we present an evaluation of its performance conducted using simulation studies, and we use the methodology to estimate a plasma metabolite interactome for stable heart disease. This interactome may serve as a reference model for evaluating systems-level changes that occur with acute disease events such as myocardial infarction (MI) or unstable angina. In the second part of this work, we address the challenge of developing diagnostic classification models that utilize metabolite abundances and do not overfit relatively small sample sizes, especially given the high dimensionality of metabolite data acquired using platforms such as liquid chromatography-mass spectrometry. We use a Bayesian methodology to estimate a multinomial logistic regression classifier for the detection and discrimination of the subtype of acute myocardial infarction utilizing metabolite abundance data quantified from blood plasma. As heart disease is the leading cause of global mortality, a blood-based, non-invasive diagnostic test that could differentiate between MI types at the time of the event would have great utility. In the final part of this dissertation we review Bayesian approaches for compound identification in metabolomics experiments that utilize liquid chromatography-mass spectrometry, which remains a challenging problem.
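
    The Bayesian classifier itself is not reproduced here, but the sketch below shows a crude frequentist analogue: an L2-penalized multinomial logistic regression, whose solution coincides with the MAP estimate under an independent Gaussian prior on the coefficients. All data below (abundances, MI-type labels) are simulated placeholders, not the dissertation's plasma data.

    # A rough analogue only: L2-penalized multinomial logistic regression as a
    # MAP stand-in for the Bayesian classifier described above.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    n, p = 120, 40                          # small n, moderate p, as in metabolomics
    X = rng.lognormal(size=(n, p))          # abundance-like, strictly positive
    # Hypothetical 3-class label (e.g. control / type 1 MI / type 2 MI) with
    # weak dependence on the first metabolite.
    y = np.digitize(X[:, 0] + 0.5 * rng.normal(size=n), [1.0, 3.0])

    # C = 1/lambda controls the prior precision; smaller C = stronger shrinkage.
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=0.5, max_iter=2000))
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]))         # class probabilities per sample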

    Nonlinear Structural Functional Models

    A common objective in functional data analyses is the registration of data curves and the estimation of the locations of their salient structures, such as spikes or local extrema. Existing methods separate curve modeling and structure estimation into disjoint steps, optimize different criteria for estimation, or recast the problem into the testing framework. Moreover, curve registration is often implemented as a pre-processing step. The aim of this dissertation is to ameliorate the shortcomings of existing methods through the development of unified nonlinear modeling procedures for the analysis of structural functional data. A general model-based framework is proposed to unify the registration and estimation of curves and their structures. In particular, this work focuses on three specific research problems. First, a Sparse Semiparametric Nonlinear Model (SSNM) is proposed to jointly register curves, perform model selection, and estimate the features of sparsely structured functional data. The SSNM is fitted to chromatographic data from a study of the composition of Chinese rhubarb. Next, the SSNM is extended to the nonlinear mixed-effects setting to enable the comparison of sparse structures across group-averaged curves. The model is utilized to compare compositions of medicinal herbs collected from two groups of production sites. Finally, a Piecewise Monotonic B-spline Model (PMBM) is proposed to estimate the locations of local extrema in a curve. The PMBM is applied to MRI data from a study of gray matter growth in the brain.
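
    As a generic illustration of the final task (locating extrema; this is ordinary spline code, not the PMBM itself), the sketch below fits a smoothing B-spline to a noisy curve and finds local extrema where the fitted first derivative changes sign. The curve and smoothing level are arbitrary choices.

    # Fit a smoothing cubic B-spline and locate local extrema via sign changes
    # of the fitted derivative. Generic illustration only, not the PMBM.
    import numpy as np
    from scipy.interpolate import splev, splrep

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 200)
    y = np.sin(4 * np.pi * x) + 0.1 * rng.normal(size=x.size)

    tck = splrep(x, y, s=len(x) * 0.01)     # smoothing parameter ~ noise level
    grid = np.linspace(0, 1, 2000)
    d1 = splev(grid, tck, der=1)            # first derivative on a fine grid

    # Local extrema: the derivative changes sign between adjacent grid points.
    idx = np.flatnonzero(np.sign(d1[:-1]) != np.sign(d1[1:]))
    extrema = (grid[idx] + grid[idx + 1]) / 2
    print("estimated extrema locations:", np.round(extrema, 3))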

    Variable selection and structural discovery in joint models of longitudinal and survival data

    Joint models of longitudinal and survival outcomes have been used with increasing frequency in clinical investigations. Correct specification of fixed and random effects, as well as of their functional forms, is essential for practical data analysis. However, no existing methods have been developed to meet this need in a joint model setting. In this dissertation, I describe a penalized-likelihood-based method with adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions for model selection. By reparameterizing the variance components through a Cholesky decomposition, I introduce a group-shrinkage penalty function; the penalized likelihood is approximated by Gaussian quadrature and optimized by an EM algorithm. The functional forms of the independent effects are determined through a procedure for structural discovery. Specifically, I first construct the model with penalized cubic B-splines and then decompose the B-splines into linear and nonlinear elements by spectral decomposition. The decomposition represents the model in a mixed-effects model format, and I then use the mixed-effects variable selection method to perform structural discovery. Simulation studies show excellent performance. A clinical application is described to illustrate the use of the proposed methods, and the analytical results demonstrate their usefulness.
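
    The joint-model machinery (Gaussian quadrature, EM) is beyond a short example, but the core ALASSO ingredient can be sketched in a plain linear model, as below: compute adaptive weights from a pilot estimate, then solve the weighted lasso by rescaling the design columns. Everything here is a hypothetical illustration, not the dissertation's algorithm.

    # Adaptive lasso (ALASSO) in a plain linear model: penalty weights from a
    # pilot ridge fit, then a weighted lasso solved by column rescaling.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(3)
    n, p = 100, 30
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.normal(size=n)

    init = Ridge(alpha=1.0).fit(X, y).coef_     # pilot estimate
    w = 1.0 / (np.abs(init) + 1e-8)             # adaptive weights ~ 1/|beta_init|
    fit = Lasso(alpha=0.1).fit(X / w, y)        # rescaling makes the penalty weighted
    beta_alasso = fit.coef_ / w                 # undo the rescaling
    print("selected:", np.flatnonzero(beta_alasso))

    The rescaling trick works because substituting c_j = w_j * b_j turns the weighted penalty sum(w_j * |b_j|) into a plain l1 penalty on c.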

    Smooth Lasso Estimator for the Function-on-Function Linear Regression Model

    A new estimator, named S-LASSO, is proposed for the coefficient function of a functional linear regression model in which the value of the response function at a given domain point depends on the full trajectory of the covariate function. The S-LASSO estimator is shown to increase the interpretability of the model, by better locating regions where the coefficient function is zero, and to smoothly estimate non-zero values of the coefficient function. The sparsity of the estimator is ensured by a functional LASSO penalty, whereas the smoothness is provided by two roughness penalties. The resulting estimator is proved to be estimation consistent and pointwise sign consistent. Via an extensive Monte Carlo simulation study, the estimation and predictive performance of the S-LASSO estimator are shown to be better than (or at worst comparable with) those of competing estimators already presented in the literature. Practical advantages of the S-LASSO estimator are illustrated through the analysis of the well-known Canadian weather and Swedish mortality datasets.
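
    A deliberately crude sketch of the underlying regression follows: the integral model Y_i(t) = ∫ X_i(s) β(s, t) ds + e_i(t) is discretized on grids, and each column β(·, t_j) is estimated by an ordinary lasso. The S-LASSO's two roughness penalties, which make the estimate smooth in s and t, are omitted, and the discretized covariate values are simulated i.i.d. rather than as smooth curves.

    # Discretized function-on-function regression with a plain per-column lasso;
    # a crude stand-in for the S-LASSO, without its roughness penalties.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(4)
    n, S, T = 80, 50, 40                    # curves, s-grid size, t-grid size
    X = rng.normal(size=(n, S))             # discretized covariate "curves"
    beta = np.zeros((S, T))
    beta[10:20, :] = 1.0                    # true coefficient surface: one band
    Y = X @ beta + 0.5 * rng.normal(size=(n, T))

    B_hat = np.zeros((S, T))
    for j in range(T):                      # one sparse regression per t-point
        B_hat[:, j] = Lasso(alpha=0.05).fit(X, Y[:, j]).coef_

    print("fraction of exact zeros:", round(float(np.mean(B_hat == 0.0)), 2))
    print("rows with signal (true band is 10-19):", np.unique(np.nonzero(B_hat)[0]))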

    Variable selection in varying coefficient models for mapping quantitative trait loci

    The Collaborative Cross (CC), a renewable mouse resource that mimics the genetic diversity in humans, provides great data sources for mapping Quantitative Trait Loci (QTL). The recombinant inbred intercrosses (RIX) generated from CC recombinant inbred (RI) lines have several attractive features and can be produced repeatedly. Many quantitative traits are inherently complex and change with other covariates. To map such complex traits, phenotypes are measured across multiple values of the covariates on each subject. In the first topic, we propose a more flexible nonparametric varying-coefficient QTL mapping method for RIX data. This model lets the QTL effects evolve with certain covariates, and it naturally extends classical parametric QTL mapping methods. Simulation results indicate that varying-coefficient QTL mapping has substantially higher power and higher mapping precision than parametric models when the assumption of constant genetic effects fails. We model the time-varying genetic effects by functional approximation using a B-spline basis, and we apply a nested permutation method to obtain threshold values for QTL detection. In the second topic, we extend single-marker QTL mapping to multiple-QTL mapping. We treat multiple-QTL mapping as a model/variable selection problem and propose a penalized mixed-effects model. We apply a penalty function for the group selection of the coefficients associated with each gene, and we propose new procedures for selecting the tuning parameters. Simulations showed that the new mapping method performs better than single-marker analysis when multiple QTL exist. Finally, in the third topic, we extend the multiple-QTL mapping method to longitudinal data, paying special attention to modeling the covariance structure of the repeated measurements. Popular stationarity assumptions on the variance and covariance structures may not be realistic for many longitudinal traits. The structured antedependence (SAD) model is a parsimonious covariance model that allows for both nonstationary variance and nonstationary correlation. We propose a penalized likelihood method for multiple-QTL mapping using the SAD model. Simulation results showed that the model selection method outperforms single-marker analysis; furthermore, the performance of multiple-QTL mapping is affected if the covariance model is misspecified.
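
    The varying-coefficient idea in the first topic can be sketched generically (this is not the authors' RIX pipeline; the genotypes, covariate, and effect curve below are simulated): expanding beta(t) in a B-spline basis turns y_i = x_i * beta(t_i) + e_i into an ordinary linear model in the basis coefficients.

    # Varying-coefficient model via B-spline basis expansion; the QTL-specific
    # design and permutation thresholds described above are not modeled.
    import numpy as np
    from scipy.interpolate import BSpline

    rng = np.random.default_rng(5)
    n = 300
    t = rng.uniform(0, 1, n)                  # covariate (e.g. age) per subject
    x = rng.integers(0, 2, n).astype(float)   # marker genotype, coded 0/1
    y = x * np.sin(2 * np.pi * t) + 0.2 * rng.normal(size=n)

    k, m = 3, 8                               # cubic splines, m knots on [0, 1]
    knots = np.r_[[0.0] * k, np.linspace(0, 1, m), [1.0] * k]
    nb = len(knots) - k - 1                   # number of basis functions

    def bmat(u):                              # B-spline design matrix at points u
        return np.column_stack([BSpline(knots, np.eye(nb)[i], k)(u)
                                for i in range(nb)])

    design = bmat(t) * x[:, None]             # x_i * B(t_i): varying-coefficient design
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    print("beta(0.25) fitted:", (bmat(np.array([0.25])) @ coef)[0],
          "true:", np.sin(0.5 * np.pi))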

    New covariates selection approaches in high dimensional or functional regression models

    In a Big Data context, the number of covariates used to explain a variable of interest, p, is likely to be high, sometimes even higher than the available sample size (p > n). Ordinary procedures for fitting regression models begin to perform poorly in this situation, so other approaches are needed. An initial covariate selection step is of interest, in order to retain only the relevant terms and to reduce the dimensionality of the problem. The purpose of this thesis is the study and development of covariate selection techniques for regression models in complex settings. In particular, we focus on high-dimensional and functional data contexts of current interest. When some model structure is assumed, regularization techniques are widely employed alternatives that perform model estimation and covariate selection simultaneously. Specifically, an extensive and critical review of penalization techniques for covariate selection is carried out, developed in the context of the high-dimensional linear model in the vectorial framework. Conversely, if one does not wish to assume a model structure, state-of-the-art dependence measures based on distances are an attractive option for covariate selection. New specification tests using these ideas are proposed for the functional concurrent model, with the synchronous and the asynchronous cases considered separately. These approaches are based on novel dependence measures derived from the distance covariance coefficient.
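
    The distance covariance coefficient mentioned above is concrete enough to sketch directly. Below is a minimal numpy implementation of the (biased, V-statistic) distance correlation, which detects nonlinear dependence that the Pearson correlation misses; the thesis's specification tests and functional extensions are not included.

    # Distance correlation via double-centered pairwise distance matrices
    # (the simple biased V-statistic version, for univariate inputs).
    import numpy as np

    def dcor(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        a = np.abs(x[:, None] - x[None, :])               # pairwise distances
        b = np.abs(y[:, None] - y[None, :])
        A = a - a.mean(0) - a.mean(1)[:, None] + a.mean() # double centering
        B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
        dcov2 = (A * B).mean()
        return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

    rng = np.random.default_rng(6)
    x = rng.normal(size=500)
    print("linear:     ", round(dcor(x, 2 * x + rng.normal(size=500)), 3))
    print("nonlinear:  ", round(dcor(x, x ** 2), 3))      # Pearson r would be ~ 0
    print("independent:", round(dcor(x, rng.normal(size=500)), 3))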

    A Likelihood Based Framework for Data Integration with Application to eQTL Mapping

    We develop a new way of thinking about and integrating gene expression data (continuous) and genomic information data (binary) by jointly compressing the two data sets and embedding their signals in low-dimensional feature spaces with an information-sharing mechanism, which connects the continuous data to the binary data, under the penalized log-likelihood framework. In particular, the continuous data are modeled by a Gaussian likelihood, and the binary data are modeled by a Bernoulli likelihood formed by transforming the feature space of the genomic information with a logit link. The smoothly clipped absolute deviation (SCAD) penalty is added on the basis vectors of the low-dimensional feature spaces for both data sets. This reflects the assumption that only a small set of genetic variants is associated with a small fraction of gene expression, and the fact that those basis vectors can be interpreted as weights assigned to the genetic variants and gene expression, much as the loading vectors of principal component analysis (PCA) or canonical correlation analysis (CCA) are interpreted. Algorithmically, a Majorization-Minimization (MM) algorithm with a local linear approximation (LLA) to the SCAD penalty is developed to solve the optimization problem effectively and efficiently, producing closed-form updating rules. The effectiveness of our method is demonstrated by simulations in various setups, with comparisons to some popular competing methods, and by an application to eQTL mapping with real data.
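
    The LLA step can be sketched in isolation (a hypothetical linear-model example, not the paper's two-likelihood objective): each MM iteration replaces SCAD by a weighted l1 penalty whose weights are the SCAD derivative at the current estimate, so every update is a weighted lasso.

    # SCAD via MM with local linear approximation: each iteration is a
    # weighted lasso with weights given by the SCAD derivative (Fan & Li, 2001).
    import numpy as np
    from sklearn.linear_model import Lasso

    def scad_deriv(beta, lam, a=3.7):
        b = np.abs(beta)
        return np.where(b <= lam, lam,
                        np.maximum(a * lam - b, 0.0) / (a - 1.0))

    rng = np.random.default_rng(7)
    n, p = 100, 50
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:4] = [2.0, -2.0, 1.5, -1.5]
    y = X @ beta + rng.normal(size=n)

    lam = 0.2
    est = np.linalg.pinv(X) @ y                      # pilot least-squares estimate
    for _ in range(3):                               # a few MM / LLA iterations
        w = np.maximum(scad_deriv(est, lam), 1e-3)   # l1 weights at current estimate
        fit = Lasso(alpha=1.0).fit(X / w, y)         # weighted lasso via rescaling
        est = fit.coef_ / w
    print("selected:", np.flatnonzero(est))

    Large coefficients get near-zero weights and so stay nearly unpenalized, which is how SCAD reduces the bias that a plain lasso would introduce.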