183 research outputs found
Regularized-Generalized PLS-DA
Linear Discriminant Analysis leads to unstable models and poor predictions in the presence of quasi-collinearity among variables or when the number of variables is large relative to the number of samples. Partial Least Squares Discriminant Analysis (PLS-DA) was then proposed to overcome the multicollinearity problem and was defined as a straightforward extension of PLS regression. Generalized PLS-DA (GPLS-DA) and “Between” PLS-DA (B-PLS-DA) are two suitable extensions of PLS-DA. A simple regularization procedure is proposed to cope with the problems of quasi-collinearity or multicollinearity. It is shown that GPLS-DA and B-PLS-DA are the two end points of a continuum approach.
PLS dimension reduction for classification of microarray data
PLS dimension reduction is known to give good prediction accuracy in the context of classification with high-dimensional microarray data. In this paper, PLS is compared with some of the best state-of-the-art classification methods. In addition, a simple procedure to choose the number of components is suggested. The connection between PLS dimension reduction and gene selection is examined, and a property of the first PLS component for binary classification is proven. PLS can also be used as a visualization tool for high-dimensional data in the classification framework. The whole study is based on nine real microarray cancer data sets.
Multimodal MRI-based Imputation of the Aβ+ in Early Mild Cognitive Impairment.
Objective: To identify brain atrophy patterns from structural MRI and cerebral blood flow (CBF) patterns from arterial spin labeling perfusion MRI that are the best predictors of the Aβ burden, measured as composite 18F-AV45-PET uptake, in individuals with early mild cognitive impairment (MCI). Furthermore, to assess the relative importance of imaging modalities in the classification of Aβ+/Aβ- early mild cognitive impairment. Methods: Sixty-seven ADNI-GO/2 participants with early MCI were included. Voxel-wise anatomical shape variation measures were computed by estimating the initial diffeomorphic mapping momenta from an unbiased control template. CBF measures normalized to average motor cortex CBF were mapped onto the template space. Using partial least squares regression, we identified the structural and CBF signatures of Aβ after accounting for normal confounding effects of age, sex, and education. Results: 18F-AV45-positive early-MCI individuals could be identified with 83% classification accuracy, 87% positive predictive value, and 84% negative predictive value by multidisciplinary classifiers combining demographic data, ApoE ε4 genotype, and a multimodal MRI-based Aβ score. Interpretation: Multimodal MRI can be used to predict the amyloid status of early-MCI individuals. MRI is a very attractive candidate for the identification of inexpensive and non-invasive surrogate biomarkers of Aβ deposition. Our approach is expected to have value for the identification of individuals likely to be Aβ+ in circumstances where cost or logistical problems prevent Aβ detection using cerebrospinal fluid analysis or Aβ-PET. It can also be used in clinical settings and clinical trials, aiding subject recruitment and evaluation of treatment efficacy. Imputation of the Aβ-positivity status could also complement Aβ-PET by identifying individuals who would benefit the most from this assessment.
Multivariate paired data analysis: multilevel PLSDA versus OPLSDA
Metabolomics data obtained from (human) nutritional intervention studies can have a rather complex structure that depends on the underlying experimental design. In this paper we discuss the complex structure in data caused by a cross-over designed experiment. In such a design, each subject in the study population acts as his or her own control, which makes the data paired. For a single univariate response, a paired t-test or repeated measures ANOVA can be used to test the differences between the paired observations. The same principle holds for multivariate data. In the current paper we compare a method that exploits the paired data structure in cross-over multivariate data (multilevel PLSDA) with a method that is often used by default but that ignores the paired structure (OPLSDA). The results from both methods have been evaluated in a small simulated example as well as in a genuine data set from a cross-over designed nutritional metabolomics study. It is shown that exploiting the paired data structure underlying the cross-over design considerably improves the power and the interpretability of the multivariate solution. Furthermore, the multilevel approach provides complementary information about (I) the diversity and abundance of the treatment effects within the different (subsets of) subjects across the study population, and (II) the intrinsic differences between these study subjects.
Improving stacking methodology for combining classifiers: applications to cosmetic industry
Stacking (Wolpert (1992), Breiman (1996)) is known to be a successful way of linearly combining several models. We modify the usual stacking methodology for the case where the response is binary and the predictions are highly correlated, by combining predictions with PLS-Discriminant Analysis instead of ordinary least squares. For small data sets we develop a strategy based on repeated split samples in order to select relevant variables and ensure the robustness of the final model. Five base (or level-0) classifiers are combined in order to get an improved rule, which is applied to a classical benchmark from the UCI Machine Learning Repository. Our methodology is then applied to the prediction of the dangerousness of 165 chemicals used in the cosmetic industry, described by 35 in vitro and in silico characteristics, since, faced with safety constraints, one cannot rely on a single prediction method, especially when the sample size is low.
Partial Least Squares and Principal Component Analysis with Non-metric Variables for Composite Indices
A composite index is an aggregated variable comprising individual indicators and weights, where the weights represent the relative importance of each indicator. Composite indices are often used to describe latent phenomena or to summarize complex information in a small number of variables. Choosing appropriate weights for the variables that make up a composite index is of great importance. Principal component analysis (PCA) is a popular approach for deriving weights, but it is unsuitable when the informative variation accounts for only a small part of the variance of the variables in a composite index. This study therefore proposes applying partial least squares (PLS), which exploits the relationship between outcome variables and the variables in a composite index. Our simulation study shows that PLS performs as well as PCA or outperforms it considerably. In addition, the variables in a composite index are often non-metric in practice. Such variables require special procedures before PCA or PLS can be applied. This study examines several PCA and PLS algorithms for non-metric variables from the existing literature and compares them in extensive simulation studies in order to give recommendations for practice. Dummy coding often performs satisfactorily compared with more complicated methods. As applications, we consider wealth, globalization, gender equality, and corruption, using PCA- and PLS-based composite indices. PLS produces composite indices tailored to the respective outcome variables, and these often performed better than the PCA-based indices. A comparison of PCA and PLS weights and coefficients shows which variables are particularly relevant for the respective outcome variables.
Kinome-wide interaction modelling using alignment-based and alignment-independent approaches for kinase description and linear and non-linear data analysis techniques
Background: Protein kinases play crucial roles in cell growth, differentiation, and apoptosis. Abnormal function of protein kinases can lead to many serious diseases, such as cancer. Kinase inhibitors have potential for treatment of these diseases. However, current inhibitors interact with a broad variety of kinases and interfere with multiple vital cellular processes, which causes toxic effects. Bioinformatics approaches that can predict inhibitor-kinase interactions from the chemical properties of the inhibitors and the kinase macromolecules might aid in the design of more selective therapeutic agents that show better efficacy and lower toxicity. Results: We applied proteochemometric modelling to correlate the properties of 317 wild-type and mutated kinases and 38 inhibitors (12,046 inhibitor-kinase combinations) to the respective combination's interaction dissociation constant (K_d). We compared six approaches for description of protein kinases and several linear and non-linear correlation methods. The best performing models encoded kinase sequences with amino acid physico-chemical z-scale descriptors and used support vector machines or partial least-squares projections to latent structures for the correlations. Modelling performance was estimated by double cross-validation. The best models showed high predictive ability: the squared correlation coefficient for new kinase-inhibitor pairs ranged P^2 = 0.67-0.73, and for new kinases it ranged P^2_kin = 0.65-0.70. Models could also separate interacting from non-interacting inhibitor-kinase pairs with high sensitivity and specificity, with areas under the ROC curves ranging AUC = 0.92-0.93. We also investigated the relationship between the number of protein kinases in the dataset and the modelling results. Using only 10% of all data, a valid model was still obtained, with P^2 = 0.47, P^2_kin = 0.42 and AUC = 0.83. Conclusions: Our results strongly support the applicability of proteochemometrics for kinome-wide interaction modelling. Proteochemometrics might be used to speed up identification and optimization of protein kinase targeted and multi-targeted inhibitors.
Multivariate Prediction Models for Bio-Analytical Data
Quantitative bio-analytical techniques that enable parallel measurements of large
numbers of biomolecules generate vast amounts of information for studying and
characterising biological systems. These analytical methods are commonly referred
to as omics technologies, and can be applied for measurements of e.g. mRNA transcript,
protein or metabolite abundances in a biological sample.
The work presented in this thesis focuses on the application of multivariate prediction
models for modelling and analysis of biological data generated by omics
technologies. Omics data commonly contain up to tens of thousands of variables,
which are often both noisy and multicollinear. Multivariate statistical methods have
previously been shown to be valuable for visualisation and predictive modelling of
biological and chemical data with similar properties to omics data. In this thesis,
currently available multivariate modelling methods are used in new applications,
and new methods are developed to address some of the specific challenges associated
with modelling of biological data.
Three closely related areas of multivariate modelling of biological data are described
and demonstrated in this thesis. First, a multivariate projection method is
used in a novel application for predictive modelling between omics data sets, demonstrating
how data from two analytical sources can be integrated and modelled together
by exploring covariation patterns between the data sets. This approach is
exemplified by modelling of data from two studies, the first containing proteomic
and metabolic profiling data and the second containing transcriptomic and metabolic
profiling data. Second, a method for piecewise multivariate modelling of short time-series
data is developed and demonstrated by modelling of simulated data as well
as metabolic profiling data from a toxicity study, providing a new method for characterisation
of multivariate bio-analytical time-series data. Third, a kernel-based
method is developed and applied for non-linear multivariate prediction modelling
of omics data, addressing the specific challenge of modelling non-linear variation in
biological data.