183 research outputs found

    Regularized-Generalized PLS-DA

    Linear Discriminant Analysis leads to unstable models and poor predictions in the presence of quasi-collinearity among variables or when the number of variables is large relative to the number of samples. Partial Least Squares Discriminant Analysis (PLS-DA) was then proposed to overcome the multicollinearity problem and defined as a straightforward extension of PLS regression. Generalized PLS-DA (GPLS-DA) and “Between” PLS-DA (B-PLS-DA) are two suitable extensions of PLS-DA. A simple regularization procedure is proposed to cope with the problems of quasi-collinearity or multicollinearity. It is shown that GPLS-DA and B-PLS-DA are the two end points of a continuum approach.

    PLS dimension reduction for classification of microarray data

    PLS dimension reduction is known to give good prediction accuracy in the context of classification with high-dimensional microarray data. In this paper, PLS is compared with some of the best state-of-the-art classification methods. In addition, a simple procedure to choose the number of components is suggested. The connection between PLS dimension reduction and gene selection is examined and a property of the first PLS component for binary classification is proven. PLS can also be used as a visualization tool for high-dimensional data in the classification framework. The whole study is based on 9 real microarray cancer data sets

    Multivariate paired data analysis: multilevel PLSDA versus OPLSDA

    Metabolomics data obtained from (human) nutritional intervention studies can have a rather complex structure that depends on the underlying experimental design. In this paper we discuss the complex structure in data caused by a cross-over designed experiment. In such a design, each subject in the study population acts as his or her own control and makes the data paired. For a single univariate response a paired t-test or repeated measures ANOVA can be used to test the differences between the paired observations. The same principle holds for multivariate data. In the current paper we compare a method that exploits the paired data structure in cross-over multivariate data (multilevel PLSDA) with a method that is often used by default but that ignores the paired structure (OPLSDA). The results from both methods have been evaluated in a small simulated example as well as in a genuine data set from a cross-over designed nutritional metabolomics study. It is shown that exploiting the paired data structure underlying the cross-over design considerably improves the power and the interpretability of the multivariate solution. Furthermore, the multilevel approach provides complementary information about (I) the diversity and abundance of the treatment effects within the different (subsets of) subjects across the study population, and (II) the intrinsic differences between these study subjects
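
The paired structure described in this abstract can be illustrated with a small sketch. It assumes that a multilevel analysis starts by separating each subject's mean (between-subject variation) from the deviations around that mean (within-subject variation), so that the treatment effect can be studied on the within-subject part. The function name `split_within_between` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean

def split_within_between(rows, subjects):
    """Split each observation into its subject mean (between part)
    and its deviation from that mean (within part)."""
    # Group row indices by subject label.
    by_subject = {}
    for i, s in enumerate(subjects):
        by_subject.setdefault(s, []).append(i)

    n_vars = len(rows[0])
    between = [None] * len(rows)
    within = [None] * len(rows)
    for idx in by_subject.values():
        # Per-variable mean over all observations of this subject.
        subj_mean = [mean(rows[i][j] for i in idx) for j in range(n_vars)]
        for i in idx:
            between[i] = subj_mean
            within[i] = [rows[i][j] - subj_mean[j] for j in range(n_vars)]
    return between, within

# Hypothetical toy data: 3 subjects, each measured under control and
# treatment, with 2 variables per observation.
subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
rows = [
    [10.0, 5.0], [12.0, 5.0],
    [20.0, 1.0], [22.0, 1.0],
    [30.0, 9.0], [32.0, 9.0],
]
between, within = split_within_between(rows, subjects)
```

In this toy example the subjects differ strongly in their baseline levels, yet the within part shows the same shift of the first variable for every subject. A method that ignores the pairing would have to find that shift against the much larger between-subject spread.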

    Improving stacking methodology for combining classifiers: applications to cosmetic industry

    Stacking (Wolpert (1992), Breiman (1996)) is known to be a successful way of linearly combining several models. We modify the usual stacking methodology when the response is binary and the predictions are highly correlated, by combining predictions with PLS-Discriminant Analysis instead of ordinary least squares. For small data sets we develop a strategy based on repeated split samples in order to select relevant variables and ensure the robustness of the final model. Five base (or level-0) classifiers are combined in order to get an improved rule, which is applied to a classical benchmark from the UCI Machine Learning Repository. Our methodology is then applied to the prediction of the dangerousness of 165 chemicals used in the cosmetic industry, described by 35 in vitro and in silico characteristics, since, faced with safety constraints, one cannot rely on a single prediction method, especially when the sample size is low.

    Partial Least Squares and Principal Component Analysis with Non-metric Variables for Composite Indices

    A composite index is an aggregated variable made up of individual indicators and weights, where the weights represent the relative importance of each indicator. Composite indices are often used to describe latent phenomena or to summarize complex information in a small number of variables. Choosing appropriate weights for the variables that form a composite index is of great importance. Principal Component Analysis (PCA) is a popular approach for deriving weights, but it is unsuitable when the informative variation accounts for only a small share of the variance of the variables in a composite index. This study therefore proposes applying Partial Least Squares (PLS), which exploits the relationship between target variables and the variables in a composite index. Our simulation study shows that PLS performs as well as PCA or considerably outperforms it. In addition, the variables in a composite index are often non-metric in practice. Such variables require special procedures before PCA or PLS can be applied. This study examines several PCA and PLS algorithms for non-metric variables from the existing literature and compares them through extensive simulation studies in order to give recommendations for practice. Dummy coding often shows satisfactory performance compared with more complicated methods. As applications we consider wealth, globalization, gender equality, and corruption, using PCA- and PLS-based composite indices. PLS produces composite indices tailored to the respective target variables, which often performed better than PCA. A comparison of PCA and PLS weights and coefficients shows which variables are particularly relevant for the respective target variables.

    Kinome-wide interaction modelling using alignment-based and alignment-independent approaches for kinase description and linear and non-linear data analysis techniques

    Background: Protein kinases play crucial roles in cell growth, differentiation, and apoptosis. Abnormal function of protein kinases can lead to many serious diseases, such as cancer. Kinase inhibitors have potential for treatment of these diseases. However, current inhibitors interact with a broad variety of kinases and interfere with multiple vital cellular processes, which causes toxic effects. Bioinformatics approaches that can predict inhibitor-kinase interactions from the chemical properties of the inhibitors and the kinase macromolecules might aid in the design of more selective therapeutic agents that show better efficacy and lower toxicity. Results: We applied proteochemometric modelling to correlate the properties of 317 wild-type and mutated kinases and 38 inhibitors (12,046 inhibitor-kinase combinations) to the respective combination's interaction dissociation constant (K_d). We compared six approaches for description of protein kinases and several linear and non-linear correlation methods. The best performing models encoded kinase sequences with amino acid physico-chemical z-scale descriptors and used support vector machines or partial least-squares projections to latent structures for the correlations. Modelling performance was estimated by double cross-validation. The best models showed high predictive ability: the squared correlation coefficient for new kinase-inhibitor pairs ranged P² = 0.67-0.73, and for new kinases it ranged P²_kin = 0.65-0.70. Models could also separate interacting from non-interacting inhibitor-kinase pairs with high sensitivity and specificity, with areas under the ROC curves ranging AUC = 0.92-0.93. We also investigated the relationship between the number of protein kinases in the dataset and the modelling results. Even when only 10% of all data were used, a valid model was still obtained, with P² = 0.47, P²_kin = 0.42 and AUC = 0.83. Conclusions: Our results strongly support the applicability of proteochemometrics for kinome-wide interaction modelling. Proteochemometrics might be used to speed up identification and optimization of protein kinase targeted and multi-targeted inhibitors.

    Multivariate Prediction Models for Bio-Analytical Data

    Quantitative bio-analytical techniques that enable parallel measurements of large numbers of biomolecules generate vast amounts of information for studying and characterising biological systems. These analytical methods are commonly referred to as omics technologies, and can be applied for measurements of e.g. mRNA transcript, protein or metabolite abundances in a biological sample. The work presented in this thesis focuses on the application of multivariate prediction models for modelling and analysis of biological data generated by omics technologies. Omics data commonly contain up to tens of thousands of variables, which are often both noisy and multicollinear. Multivariate statistical methods have previously been shown to be valuable for visualisation and predictive modelling of biological and chemical data with similar properties to omics data. In this thesis, currently available multivariate modelling methods are used in new applications, and new methods are developed to address some of the specific challenges associated with modelling of biological data. Three closely related areas of multivariate modelling of biological data are described and demonstrated in this thesis. First, a multivariate projection method is used in a novel application for predictive modelling between omics data sets, demonstrating how data from two analytical sources can be integrated and modelled together by exploring covariation patterns between the data sets. This approach is exemplified by modelling of data from two studies, the first containing proteomic and metabolic profiling data and the second containing transcriptomic and metabolic profiling data. Second, a method for piecewise multivariate modelling of short time-series data is developed and demonstrated by modelling of simulated data as well as metabolic profiling data from a toxicity study, providing a new method for characterisation of multivariate bio-analytical time-series data. Third, a kernel-based method is developed and applied for non-linear multivariate prediction modelling of omics data, addressing the specific challenge of modelling non-linear variation in biological data.