Approaches to Sample Size Determination for Multivariate Data: Applications to PCA and PLS-DA of Omics Data
Sample size determination is a fundamental step in the design of experiments. Methods for sample size determination are abundant for univariate analysis methods, but scarce in the multivariate case. Omics data are multivariate in nature and are commonly investigated using multivariate statistical methods, such as principal component analysis (PCA) and partial least-squares discriminant analysis (PLS-DA). No simple approaches to sample size determination exist for PCA and PLS-DA. In this paper we will introduce important concepts and offer strategies for (minimally) required sample size estimation when planning experiments to be analyzed using PCA and/or PLS-DA.
Considering Horn’s parallel analysis from a random matrix theory point of view
Horn’s parallel analysis is a widely used method for assessing the number of principal components and common factors. We discuss the theoretical foundations of parallel analysis for principal components based on a covariance matrix by making use of arguments from random matrix theory. In particular, we show that (i) for the first component, parallel analysis is an inferential method equivalent to the Tracy–Widom test, (ii) its use to test high-order eigenvalues is equivalent to the use of the joint distribution of the eigenvalues, and thus should be discouraged, and (iii) a formal test for higher-order components can be obtained based on a Tracy–Widom approximation. We illustrate the performance of the two testing procedures using simulated data generated under both a principal component model and a common factors model. For the principal component model, the Tracy–Widom test performs consistently in all conditions, while parallel analysis shows unpredictable behavior for higher-order components. For the common factor model, including major and minor factors, both procedures are heuristic approaches, with variable performance. We conclude that the Tracy–Widom procedure is preferred over parallel analysis for statistically testing the number of principal components based on a covariance matrix.
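For readers unfamiliar with the procedure discussed in this abstract, the following is a minimal sketch of Horn's parallel analysis, not the authors' implementation. It assumes standardized columns (so covariance and correlation matrices coincide), Gaussian reference data, and a 95th-percentile threshold; the function name `parallel_analysis` and its parameters are hypothetical. The sequential retention rule is shown only for illustration, since the abstract itself argues that comparisons beyond the first component are problematic.

```python
import numpy as np

def parallel_analysis(X, n_sim=200, quantile=0.95, seed=0):
    """Illustrative parallel analysis on standardized data (assumption)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Standardize columns so every variable has unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Observed eigenvalues, sorted in descending order.
    observed = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    # Eigenvalues of uncorrelated Gaussian noise with the same shape.
    simulated = np.empty((n_sim, p))
    for i in range(n_sim):
        noise = rng.standard_normal((n, p))
        simulated[i] = np.linalg.eigvalsh(np.cov(noise, rowvar=False))[::-1]
    # Per-position threshold taken from the simulated distribution.
    thresholds = np.quantile(simulated, quantile, axis=0)
    exceeds = observed > thresholds
    # Retain leading components until the first one fails the comparison.
    n_components = p if exceeds.all() else int(np.argmin(exceeds))
    return n_components, observed, thresholds
```

Calling `parallel_analysis(X)` on an observations-by-variables array returns the number of retained components together with the observed eigenvalues and the simulated thresholds, which can be plotted against each other in a scree-style plot.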
Group-wise Partial Least Square Regression
This paper introduces the Group-wise Partial Least Squares (GPLS) regression.
GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is
defined in terms of groups of correlated variables, similarly to what is done in
the related Group-wise Principal Component Analysis (GPCA). These groups
are found in correlation maps derived from the data to be analyzed. GPLS is
especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following
this approach, we show GPLS solves an inherent problem of SPLS: its tendency
to confound the data structure as a result of setting its metaparameters using
standard approaches for optimizing prediction, like cross-validation. Results are
shown for both simulated and experimental data.
On the use of the observation-wise k-fold operation in PCA cross-validation
Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the
independence between model testing and calibration, the observation-wise k-fold
operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive and is the main
limitation in applying CV to very large data sets. In this paper we carry out an
empirical and theoretical investigation of the use of this operation in the element-wise
k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that
when very large data sets need to be cross-validated and the computational time
is a matter of concern, the observation-wise k-fold operation can be skipped.
The theoretical properties of the resulting modified algorithm, referred to as
the column-wise k-fold (ckf) algorithm, are derived. Also, its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm
as a valid alternative to the standard ekf for reducing the computational time
needed to cross-validate a data set.
Age and Sex Effects on Plasma Metabolite Association Networks in Healthy Subjects
In the era of precision medicine, the analysis of simple information like sex and age can increase the potential to better diagnose and treat conditions that occur more frequently in one of the two sexes, present sex-specific symptoms and outcomes, or are characteristic of a specific age group. We present here a study of the association networks constructed from an array of 22 plasma metabolites measured on a cohort of 844 healthy blood donors. Through differential network analysis we show that specific association networks can be associated with sex and age: different connectivity patterns were observed, suggesting sex-related variability in several metabolic pathways (branched-chain amino acids, ketone bodies, and propanoate metabolism). Reduction in metabolite hub connectivity was also found to be associated with age in both sex groups. Network analysis was complemented with standard univariate and multivariate statistical analysis that revealed age- and sex-specific metabolic signatures. Our results demonstrate that the characterization of metabolite-metabolite association networks is a promising and powerful tool to investigate the human phenotype at a molecular level.
Comparative transcriptomics reveal developmental turning points during embryogenesis of a hemimetabolous insect, the damselfly Ischnura elegans
Identifying transcriptional changes during embryogenesis is of crucial importance for unravelling evolutionary, molecular and cellular mechanisms that underpin patterning and morphogenesis. However, comparative studies focusing on early/embryonic stages during insect development are limited to a few taxa. Drosophila melanogaster is the paradigm for insect development, whereas comparative transcriptomic studies of embryonic stages of hemimetabolous insects are completely lacking. We reconstructed the first comparative transcriptome covering the daily embryonic developmental progression of the blue-tailed damselfly Ischnura elegans (Odonata), an ancient hemimetabolous representative. We identified a “core” set of 6,794 transcripts – shared by all embryonic stages – which are mainly involved in anatomical structure development and cellular nitrogen compound metabolic processes. We further used weighted gene co-expression network analysis to identify transcriptional changes during Odonata embryogenesis. Based on these analyses, distinct clusters of transcriptionally active sequences could be revealed, indicating that embryos at different developmental stages have their own transcriptomic profile according to the developmental events, leading to sequential reprogramming of metabolic and developmental genes. Interestingly, a major change in transcriptionally active sequences is correlated with katatrepsis (revolution) during mid-embryogenesis, a 180° rotation of the embryo within the egg that is specific to hemimetabolous insects.
Semi-supervised Multivariate Statistical Network Monitoring for Learning Security Threats
This paper presents a semi-supervised approach
for intrusion detection. The method extends the unsupervised
Multivariate Statistical Network Monitoring approach based
on Principal Component Analysis by introducing a supervised
optimization technique to learn the optimum scaling in the input
data. It combines the advantages of the unsupervised strategy,
capable of uncovering new threats, with those of supervised
strategies, able to learn the pattern of a targeted threat. The
supervised learning is based on an extension of the gradient
descent method based on Partial Least Squares (PLS). Moreover,
we enhance this method by using sparse PLS variants. The
practical application of the system is demonstrated on a recently
published real case study, showing relevant improvements in
detection performance and in the interpretation of the attacks.
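As a rough illustration of the unsupervised, PCA-based monitoring that this abstract builds on, the sketch below computes two statistics commonly used in PCA monitoring: the D-statistic (Hotelling's T²) on the scores and the Q-statistic (squared prediction error) on the residuals. The abstract does not specify these statistics, so their use here is an assumption. The `weights` argument is a hypothetical placeholder for a learned per-variable scaling; the supervised PLS-based optimization described in the abstract is not reproduced, and the function names are illustrative only.

```python
import numpy as np

def fit_pca_monitor(X_cal, n_components, weights=None):
    """Fit a PCA monitoring model on calibration (normal-traffic) data."""
    X_cal = np.asarray(X_cal, dtype=float)
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1)
    # Hypothetical per-variable scaling; defaults to plain auto-scaling.
    if weights is None:
        w = np.ones(X_cal.shape[1])
    else:
        w = np.asarray(weights, dtype=float)
    Z = (X_cal - mean) / std * w
    # PCA via SVD of the centered, scaled calibration data.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                      # loadings
    score_var = s[:n_components] ** 2 / (X_cal.shape[0] - 1)
    return {"mean": mean, "std": std, "w": w, "P": P, "score_var": score_var}

def monitor(model, X_new):
    """Return D (Hotelling's T^2) and Q (SPE) for each new observation."""
    Z = (np.asarray(X_new, dtype=float) - model["mean"]) / model["std"] * model["w"]
    T = Z @ model["P"]                           # scores in the model subspace
    D = np.sum(T ** 2 / model["score_var"], axis=1)
    E = Z - T @ model["P"].T                     # residuals outside the subspace
    Q = np.sum(E ** 2, axis=1)
    return D, Q
```

In this sketch, an observation that breaks the correlation structure of the calibration data shows up as a large Q value, while an observation that is extreme along the modeled directions shows up as a large D value. Changing `weights` changes which variables dominate both statistics, which is the quantity a supervised scaling step would tune.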