410 research outputs found

    Regularization Methods for Predicting an Ordinal Response using Longitudinal High-dimensional Genomic Data

    Ordinal scales are commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing, and the Glasgow Coma Scale (GCS), which gives a reliable and objective assessment of a patient's state of consciousness. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) is smaller than the sample size (n). As genomic technologies are increasingly applied to obtain more accurate diagnoses and prognoses, they generate high-dimensional data, in which the number of covariates (p) is much larger than the number of samples (n). However, statistical methodologies and computational software are lacking for analyzing high-dimensional data with an ordinal or a longitudinal ordinal response. In this thesis, we develop a regularization algorithm to build a parsimonious model for predicting an ordinal response. In addition, we utilize the classical ordinal model with longitudinal measurements to incorporate the cutting-edge data mining tool for a comprehensive understanding of the causes of complex disease at both the molecular and environmental levels. Moreover, we develop a corresponding R package for general use. The algorithm was applied to several real datasets as well as to simulated data to demonstrate its efficiency in variable selection and its precision in prediction and classification.
The four real datasets are from: 1) the National Institute of Mental Health Schizophrenia Collaborative Study; 2) the San Diego Health Services Research Example; 3) a gene expression experiment, 'Decreased Expression of Intelectin 1 in The Human Airway Epithelium of Smokers Compared to Nonsmokers', by Weill Cornell Medical College; and 4) the National Institute of General Medical Sciences Inflammation and the Host Response to Burn Injury Collaborative Study.

    A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

    In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule.
Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling's T² and the CAR score decomposes the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of CAT and CAR scores to real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
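
The variance decomposition mentioned in this abstract can be illustrated with a minimal sketch. It assumes the plain empirical estimator (CAR scores computed as R^(-1/2) r_xy, where R is the predictor correlation matrix and r_xy the vector of marginal correlations with the outcome) rather than the shrinkage estimator the thesis develops, so it only applies when n > p. The function name `car_scores` and the simulated data are hypothetical and not taken from the thesis.

```python
import numpy as np

def car_scores(X, y):
    """Empirical CAR scores omega = R^{-1/2} r_xy (no shrinkage; assumes n > p)."""
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    ys = (y - y.mean()) / y.std()               # standardize outcome
    R = Xs.T @ Xs / n                           # predictor correlation matrix
    r_xy = Xs.T @ ys / n                        # marginal correlations with y
    # Inverse matrix square root of R via its eigendecomposition
    evals, evecs = np.linalg.eigh(R)
    R_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return R_inv_sqrt @ r_xy

# Hypothetical simulated data with two correlated predictors
rng = np.random.default_rng(0)
n, p = 200, 4
Z = rng.normal(size=(n, p))
X = Z.copy()
X[:, 1] = 0.8 * Z[:, 0] + 0.6 * Z[:, 1]         # column 1 correlates with column 0
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

omega = car_scores(X, y)

# R^2 of the ordinary least-squares fit with intercept, for comparison
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r_squared = 1.0 - resid.var() / y.var()

print("CAR scores:", omega)
print("sum of squared CAR scores:", np.sum(omega ** 2))
print("OLS R^2:", r_squared)
```

In this sketch the squared CAR scores sum to the R² of the least-squares fit, which is the sense in which the scores split the explained variance among the predictors.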

    Statistical Modelling

    The book collects the proceedings of the 19th International Workshop on Statistical Modelling, held in Florence in July 2004. Statistical modelling is an important cornerstone in many scientific disciplines, and the workshop has provided a rich environment for cross-fertilization of ideas from different disciplines. It consists of four invited lectures, 48 contributed papers and 47 posters. The contributions are arranged in sessions: Statistical Modelling; Statistical Modelling in Genomics; Semi-parametric Regression Models; Generalized Linear Mixed Models; Correlated Data Modelling; Missing Data, Measurement of Error and Survival Analysis; Spatial Data Modelling and Time Series and Econometrics.

    Robust Procedures for Estimating and Testing in the Framework of Divergence Measures

    The contributions to this book present new and original research papers based on MPHIE, MHD, and MDPDE, as well as test statistics based on these estimators, from a theoretical and applied point of view in different statistical problems, with special emphasis on robustness. Manuscripts giving solutions to different statistical problems, such as model selection criteria based on divergence measures or statistics for high-dimensional data with divergence measures as loss functions, are considered. Reviews emphasizing the most recent state of the art in solving statistical problems based on divergence measures are also presented.

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units located in Portugal using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, and thus to investigate the main causes of inefficiency. Several suggestions for improving efficiency are offered for each hotel studied.

    Latent class approaches for modelling multiple ordinal items

    The modelling of the latent class structure of multiple Likert items is reviewed. The standard latent class approach is to model the absolute Likert ratings. Commonly, an ordinal latent class model is used where the logits of the profile probabilities for each item have an adjacent category formulation (DeSantis et al., 2008). An alternative developed in this paper is to model the relative orderings, using a mixture model of the relative differences between pairs of Likert items. This produces a paired comparison adjacent category log-linear model (Dittrich et al., 2007; Francis and Dittrich, 2017), with item estimates placed on a (0,1) “worth” scale for each latent class. The two approaches are compared using data on environmental risk from the International Social Survey Programme, and conclusions are presented.