7 research outputs found

    A variant of sparse partial least squares for variable selection and data exploration

    When data are sparse and/or predictors are multicollinear, the current implementation of sparse partial least squares (SPLS) neither gives estimates for non-selected predictors nor provides a measure of inference. In response, an approach termed "all-possible" SPLS is proposed, which fits an SPLS model for every tuning parameter value across a set grid. The percentage of time a given predictor is chosen is recorded, as well as its average non-zero parameter estimate. Using a "large" number of multicollinear predictors, simulation confirmed that variables not associated with the outcome were least likely to be chosen as sparsity increased across the grid of tuning parameters, while the opposite was true for those strongly associated. Variables with a weak association were chosen more often than those with no association, but less often than those with a strong relationship to the outcome. Similarly, predictors most strongly related to the outcome had the largest average parameter estimate magnitude, followed by those with a weak relationship, followed by those with no relationship. Across two independent studies of the relationship between volumetric MRI measures and a cognitive test score, this method confirmed a priori hypotheses about which brain regions would be selected most often and have the largest average parameter estimates. In conclusion, the percentage of time a predictor is chosen is a useful measure for ordering the strength of the relationship between the independent and dependent variables, serving as a form of inference. The average parameter estimates give further insight into the direction and strength of association. As a result, all-possible SPLS gives more information than the dichotomous output of traditional SPLS, making it useful for data exploration and hypothesis generation with a large number of potential predictors. © 2014 Olson Hunt, Weissfeld, Boudreau, Aizenstein, Newman, Simonsick, Van Domelen, Thomas, Yaffe and Rosano
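
The grid procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes a simplified one-component sparse PLS fit (soft-thresholding the covariance direction), three independent simulated predictors rather than multicollinear ones, and hypothetical helper names (`soft_threshold`, `one_component_spls`, `all_possible_spls`).

```python
import random


def soft_threshold(value, cutoff):
    """Shrink value toward zero by cutoff; anything within the cutoff becomes 0."""
    if value > cutoff:
        return value - cutoff
    if value < -cutoff:
        return value + cutoff
    return 0.0


def one_component_spls(X, y, eta):
    """Simplified one-component sparse PLS (an assumption, not the paper's code).

    X is a list of rows with roughly standardised columns, y is centred,
    and eta in [0, 1) controls sparsity. Returns one coefficient per column.
    """
    n, p = len(X), len(X[0])
    # Covariance direction X^T y, one entry per predictor.
    z = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    cutoff = eta * max(abs(v) for v in z)
    # Sparse direction: weakly covarying predictors are set exactly to zero.
    w = [soft_threshold(v, cutoff) for v in z]
    # Latent score t = X w, then regress y on t.
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(ti * ti for ti in t)
    if tt == 0.0:
        return [0.0] * p
    q = sum(ti * yi for ti, yi in zip(t, y)) / tt
    return [q * wj for wj in w]


def all_possible_spls(X, y, etas):
    """Fit one model per eta; return selection frequency and mean non-zero estimate."""
    p = len(X[0])
    counts = [0] * p
    sums = [0.0] * p
    for eta in etas:
        beta = one_component_spls(X, y, eta)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
                sums[j] += b
    freq = [c / len(etas) for c in counts]
    avg = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(p)]
    return freq, avg


# Simulated data (assumption): strong, weak, and null predictors.
rng = random.Random(0)
n = 2000
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [2.0 * row[0] + 0.5 * row[1] + rng.gauss(0, 1) for row in X]
y_mean = sum(y) / n
y = [v - y_mean for v in y]

etas = [k / 10 for k in range(1, 10)]  # grid of sparsity tuning values
freq, avg = all_possible_spls(X, y, etas)
for name, f, a in zip(["strong", "weak", "none"], freq, avg):
    print(f"{name:>6}: selected {f:.0%} of fits, mean non-zero estimate {a:.3f}")
```

In this sketch the selection frequency plays the role of the ordering measure the abstract describes, and the mean non-zero estimate indicates direction and magnitude. A fuller version would also vary the number of components and refit on the selected variables.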

    Canonical Correlation Analysis and Partial Least Squares for identifying brain-behaviour associations: a tutorial and a comparative study

    Canonical Correlation Analysis (CCA) and Partial Least Squares (PLS) are powerful multivariate methods for capturing associations across two modalities of data (e.g., brain and behaviour). However, when the sample size is similar to or smaller than the number of variables in the data, CCA and PLS models may overfit, i.e., find spurious associations that generalise poorly to new data. Dimensionality reduction and regularized extensions of CCA and PLS have been proposed to address this problem, yet most studies using these approaches have some limitations. This work gives a theoretical and practical introduction to the most common CCA/PLS models and their regularized variants. We examine the limitations of standard CCA and PLS when the sample size is similar to or smaller than the number of variables. We discuss how dimensionality reduction and regularization techniques address this problem and explain their main advantages and disadvantages. We highlight crucial aspects of the CCA/PLS analysis framework, including optimising the hyperparameters of the model and testing the identified associations for statistical significance. We apply the described CCA/PLS models to simulated data and real data from the Human Connectome Project and the Alzheimer's Disease Neuroimaging Initiative (both with n>500). We use both low- and high-dimensionality versions of each dataset (i.e., ratios between sample size and number of variables in the range of ∼1-10 and ∼0.1-0.01) to demonstrate the impact of data dimensionality on the models. Finally, we summarize the key lessons of the tutorial.

    Development of a Partial Least Squares algorithm for DNA microarray data analysis using the VIP statistic for gene selection and binary classification

    DNA chip or "microarray" technology has brought a new paradigm to biomedical research, especially in the study of cancer. These chips measure the expression levels of thousands of genes simultaneously, which are then used to characterise the gene profile of diseases, the response to treatments, or the evaluation of prognoses. In this research, a partial least squares (PLS) algorithm has been implemented that performs variable (gene) selection using the VIP (Variable Influence on Projection) statistic, iteratively optimising the selection of variables and the number of PLS factors in the predictive model in order to obtain a minimum error for binary classification, which is the usual kind in biomedicine. The algorithm has been tested systematically with both simulated and real data. The performance of the algorithm with predictive models that combine clinical and gene variables has also been investigated.
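
For orientation, the VIP statistic named in this abstract is commonly defined as below for a single response. The abstract itself gives neither the formula nor a selection threshold, so the notation here is an assumption: $p$ predictors, $A$ PLS factors, weight vectors $w_a$, score vectors $t_a$, and y-loadings $q_a$.

```latex
\mathrm{VIP}_j = \sqrt{\, p \,
  \frac{\sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert w_a \rVert \right)^{2}}
       {\sum_{a=1}^{A} SS_a} },
\qquad SS_a = q_a^{2}\, t_a^{\top} t_a
```

Because the squared VIP values average to 1 across the $p$ predictors, variables with $\mathrm{VIP}_j > 1$ are often treated as influential. Whether this work uses that cut-off is not stated in the abstract.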

    Longitudinal clinical covariates influence on CD4+ cell count after seroconversion.

    Doctoral Degree. University of KwaZulu-Natal, Durban. The Acquired Immunodeficiency Syndrome (AIDS) pandemic is a global challenge. The human immunodeficiency virus (HIV) is notorious for weakening the immune system and opening channels for opportunistic infections. The Cluster of Differentiation 4 (CD4+) cells are the main cells killed by HIV and are hence used as a health indicator for HIV-infected patients. In the past, CD4+ count diagnostics were very expensive and therefore beyond the reach of many in resource-limited settings. Accordingly, the clinical covariates of the CD4+ count were the potential diagnostic tools. From a different angle, it is essential to examine the trail of clinical covariates affecting the CD4+ cell response. That is, inasmuch as the immune system regulates the CD4+ count fluctuations in reaction to the viral invasion, the body's other complex functional systems are bound to adjust too. However, little is known about the corresponding adaptive behavioural patterns of the clinical covariates' influence on the CD4+ cell count. The investigation in this study was carried out on data obtained from the Centre for the AIDS Programme of Research in South Africa (CAPRISA), where, initially, HIV-negative patients were enrolled into different cohorts for different objectives. These HIV-negative patients were then followed up in their respective cohort studies. As soon as a patient seroconverted in any of the cohort studies, the patient was enrolled again into a new cohort of HIV-positive patients only. The follow-up on the seroconverters involved simultaneous recording of repeated measurements of the CD4+ count and 46 clinical covariates. An extensive exploratory analysis was then performed with three variable reduction methods for high-dimensional longitudinal data to identify the strongest clinical covariates. The sparse partial least squares approach proved to be the most appropriate and robust technique to adopt. It identified the 18 strongest clinical covariates, which were subsequently used to fit other sophisticated statistical models, including longitudinal multilevel models for assessing inter-individual variation in the CD4+ count due to each clinical covariate. Generalised additive mixed models were then used to gain insight into the CD4+ count trends and possible adaptive optimal set-points of the clinical covariates. To single out break-points in the change of linear relationships between the CD4+ count and the covariates, segmented regression models were employed. To understand the highly complex and intertwined relationships between the CD4+ count, the clinical covariates, and the time-lagged effects during HIV disease progression, a Structural Equation Model (SEM) was constructed and fitted. The results showed that sodium consistently changed its effects at 132 mEq/L and 140 mEq/L across all the post-HIV-infection phases. Generally, the covariate influence on the CD4+ count varied with infection phase and widely between individuals during anti-retroviral therapy (ART). We conclude that there is evidence of covariate set-point adaptive behaviour to positively influence the CD4+ cell count during HIV disease progression.