Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects
Cross-validation has become one of the principal methods to adjust the meta-parameters in predictive models.
Extensions of the cross-validation idea have been proposed to select the number of components in principal
components analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most used algorithms for
principal components analysis cross-validation. This is the method programmed in the PLS_Toolbox, and it has been
stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is
based on missing data imputation, and it can be programmed using any method for this purpose. In this paper,
the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A
theoretical study is conducted to identify in which situations the application of ekf is adequate and, more importantly,
in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to
which a model represents a test set and may lead to discarding principal components with important information. In a
second paper of this series, other imputation methods are studied within the ekf algorithm.
Research in this area is partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grant DPI2011-28112-C04-02. Jose Camacho was funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. This study was carried out when Jose Camacho was at the Universidad Politecnica de Valencia and Universitat de Girona, Spain.
Camacho Páez, J.; Ferrer Riquelme, AJ. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. Journal of Chemometrics. 26(1):361-373. https://doi.org/10.1002/cem.2440
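The abstract describes ekf as cross-validation built on missing data imputation, with trimmed score imputation as the simplest choice. The sketch below is one possible reading of that idea, not the authors' or the PLS_Toolbox implementation: the function name `ekf_press`, the fold counts, and the use of an SVD to obtain the loadings are all assumptions made for illustration. Left-out elements are set to zero after centring, scores are computed from the trimmed vector, and the squared reconstruction error on the left-out elements is accumulated.

```python
import numpy as np

def ekf_press(X, n_components, row_folds=5, col_folds=5, seed=0):
    """Prediction error sum of squares from an ekf-style loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Hypothetical fold layout: random groups of observations and of variables.
    row_groups = np.array_split(rng.permutation(n), row_folds)
    col_groups = np.array_split(rng.permutation(m), col_folds)
    press = 0.0
    for test_rows in row_groups:
        train = np.delete(X, test_rows, axis=0)
        mean = train.mean(axis=0)
        # Loadings from the SVD of the mean-centred training block.
        _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
        P = vt[:n_components].T              # shape (m, n_components)
        test = X[test_rows] - mean
        for cols in col_groups:
            trimmed = test.copy()
            trimmed[:, cols] = 0.0           # treat the left-out elements as missing
            scores = trimmed @ P             # trimmed score imputation
            recon = scores @ P.T
            # Only the left-out elements contribute to the error.
            press += np.sum((test[:, cols] - recon[:, cols]) ** 2)
    return press
```

Calling `ekf_press(X, a)` for a range of `a` and comparing the values is the usual way such a curve would be used to pick the number of components; the abstract's caution about when this is adequate applies to any such use.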
On the Generation of Random Multivariate Data
The simulation of multivariate data is often necessary for assessing the performance
of multivariate analysis techniques. The random generation of multivariate data
when the covariance matrix is completely or partly specified is solved by different
methods, from the Cholesky decomposition to some recent alternatives. However,
many times the covariance matrix has to be generated also at random, so that
the data simulation spans different situations from highly correlated to uncorrelated data. This is the case when assessing a new multivariate analysis technique
in Monte Carlo experiments. In this paper, we introduce a new algorithm for the
generation of random data from covariance matrices of random structure, where
the user only decides the data dimension and the level of correlation. We will illustrate the application of this algorithm in several relevant problems in multivariate
analysis, namely the selection of the number of Principal Components in Principal Component Analysis, the evaluation of the performance of sparse Partial Least
Squares and the calibration of Multivariate Statistical Process Control systems. The
algorithm is available as part of the MEDA Toolbox v1.1.
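For context, the classical case mentioned in the abstract, where the covariance matrix is fully specified, can be handled with the Cholesky decomposition. The sketch below shows only that baseline; it is not the new algorithm introduced in the paper, and the function name `simulate_from_covariance` is a hypothetical label.

```python
import numpy as np

def simulate_from_covariance(cov, n_obs, seed=0):
    """Draw zero-mean Gaussian data whose population covariance is `cov`."""
    rng = np.random.default_rng(seed)
    cov = np.asarray(cov, dtype=float)
    # cov = L @ L.T; np.linalg.cholesky requires a positive definite matrix.
    L = np.linalg.cholesky(cov)
    Z = rng.standard_normal((n_obs, cov.shape[0]))   # uncorrelated unit-variance data
    return Z @ L.T                                   # rows now have covariance cov
```

The sample covariance of the output, `np.cov(X, rowvar=False)`, approaches `cov` as `n_obs` grows.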
Variable-selection ANOVA Simultaneous Component Analysis (VASCA)
Motivation: ANOVA Simultaneous Component Analysis (ASCA) is a popular method for the analysis of multivariate
data yielded by designed experiments. Meaningful associations between factors/interactions of the experimental
design and measured variables in the dataset are typically identified via significance testing, with permutation tests
being the standard go-to choice. However, in settings with large numbers of variables, like omics (genomics,
transcriptomics, proteomics and metabolomics) experiments, the ‘holistic’ testing approach of ASCA (all variables
considered) often overlooks statistically significant effects encoded by only a few variables (biomarkers).
Results: We hereby propose Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable
selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with
simulations and with a real dataset from a multi-omic clinical experiment. We show that VASCA is more powerful
than both ASCA and the widely adopted false discovery rate controlling procedure; the latter is used as a benchmark
for variable selection based on multiple significance testing. We further illustrate the usefulness of VASCA for
exploratory data analysis in comparison to the popular partial least squares discriminant analysis method and its
sparse counterpart.
Funded by Agencia Andaluza del Conocimiento, Regional Government of Andalucia, Spain; European Commission B-TIC-136-UGR20; State Research Agency (AEI) of Spain; European Social Fund (ESF) RYC2020-030536-I; AEI PID2020-118139RB-I0
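The 'holistic' permutation test that the abstract attributes to standard ASCA can be sketched for a single factor as follows. This is a minimal illustration under assumptions, not VASCA and not the authors' code: the statistic is taken to be the sum of squares of the level-mean matrix over all variables, and the names `factor_ssq` and `permutation_p_value` are hypothetical.

```python
import numpy as np

def factor_ssq(X, labels):
    """Sum of squares explained by the factor-level means, over all variables."""
    Xc = X - X.mean(axis=0)
    ssq = 0.0
    for level in np.unique(labels):
        rows = labels == level
        ssq += rows.sum() * np.sum(Xc[rows].mean(axis=0) ** 2)
    return ssq

def permutation_p_value(X, labels, n_perm=999, seed=0):
    """p-value from randomly permuting the factor labels across observations."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = factor_ssq(X, labels)
    hits = sum(factor_ssq(X, rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

Because the statistic pools every variable, an effect carried by only a few variables among many noisy ones contributes little to it, which is the loss of power the abstract describes.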
Group-wise Partial Least Square Regression
This paper introduces the Group-wise Partial Least Squares (GPLS) regression.
GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is
defined in terms of groups of correlated variables, similarly to what is done in
the related Group-wise Principal Component Analysis (GPCA). These groups
are found in correlation maps derived from the data to be analyzed. GPLS is
especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following
this approach, we show GPLS solves an inherent problem of SPLS: its tendency
to confound the data structure as a result of setting its metaparameters using
standard approaches for optimizing prediction, like cross-validation. Results are
shown for both simulated and experimental data.
On the use of the observation-wise k-fold operation in PCA cross-validation
Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the
independence between model testing and calibration, the observation-wise k-fold
operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive and is the main
obstacle to applying CV to very large data sets. In this paper we carry out an
empirical and theoretical investigation of the use of this operation in the element-wise
k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that
when very large data sets need to be cross-validated and the computational time
is a matter of concern, the observation-wise k-fold operation can be skipped.
The theoretical properties of the resulting modified algorithm, referred to as the
column-wise k-fold (ckf) algorithm, are derived. Also, its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm
to be a valid alternative to the standard ekf to reduce the computational time
needed to cross-validate a data set.
Group-Wise Principal Component Analysis for Exploratory Intrusion Detection
Intrusion detection is a relevant layer of cybersecurity to prevent hacking and illegal activities
from happening on the assets of corporations. Anomaly-based Intrusion Detection Systems perform an
unsupervised analysis on data collected from the network and end systems, in order to identify singular
events. While this approach may produce many false alarms, it is also capable of identifying new (zero-day)
security threats. In this context, the use of multivariate approaches such as Principal Component
Analysis (PCA) provided promising results in the past. PCA can be used in exploratory mode or in learning
mode. Here, we propose an exploratory intrusion detection that replaces PCA with Group-wise PCA
(GPCA), a recently proposed data analysis technique with additional exploratory characteristics. A main
advantage of GPCA over PCA is that the former yields simple models, easy to understand by security
professionals not trained in multivariate tools. Besides, the workflow in the intrusion detection with GPCA
is more coherent with dominant strategies in intrusion detection. We illustrate the application of GPCA in
two case studies.
This work was supported in part by the Spanish Government-MINECO (Ministerio de Economía y Competitividad), using the Fondo
Europeo de Desarrollo Regional (FEDER), under Projects TIN2014-60346-R and TIN2017-83494-R.
Evaluation of Diagnosis Methods in PCA-based Multivariate Statistical Process Control
Multivariate Statistical Process Control (MSPC) based on Principal Component
Analysis (PCA) is a well-known methodology in chemometrics that is aimed at testing whether an industrial process is under Normal Operation Conditions (NOC).
As a part of the methodology, once an anomalous behaviour is detected, the root
causes need to be diagnosed to troubleshoot the problem and/or avoid it in the
future. While there have been a number of developments in diagnosis in the past
decades, no sound method for comparing existing approaches has been proposed.
In this paper, we propose such a procedure and use it to compare several diagnosis
methods using both randomly simulated data and data from realistic sources. This is a
general comparative approach that takes into account factors that have not previously been considered in the literature. The results show that univariate diagnosis
is more reliable than its multivariate counterpart.
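The detection step that precedes diagnosis in PCA-based MSPC is commonly built on two statistics: Hotelling's T² in the model subspace (often called the D-statistic) and the squared prediction error of the residuals (the Q-statistic). The sketch below computes both under assumptions: autoscaling on the NOC data, loadings from an SVD, and hypothetical function names. It does not implement any of the diagnosis methods compared in the paper, and control limits are omitted.

```python
import numpy as np

def fit_pca_monitor(X_noc, n_components):
    """Fit an autoscaled PCA model on Normal Operation Conditions (NOC) data."""
    mean = X_noc.mean(axis=0)
    std = X_noc.std(axis=0, ddof=1)
    Z = (X_noc - mean) / std
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    P = vt[:n_components].T                                # loadings
    score_var = s[:n_components] ** 2 / (len(X_noc) - 1)   # variance of each score
    return mean, std, P, score_var

def monitoring_statistics(X_new, mean, std, P, score_var):
    """Return (D, Q): Hotelling's T^2 in the model plane and the residual SPE."""
    Z = (X_new - mean) / std
    T = Z @ P
    d_stat = np.sum(T ** 2 / score_var, axis=1)
    residual = Z - T @ P.T
    q_stat = np.sum(residual ** 2, axis=1)
    return d_stat, q_stat
```

An observation that breaks the correlation structure learned from the NOC data typically shows up as a large Q value, while an extreme but structurally consistent observation shows up in D.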
Semi-supervised Multivariate Statistical Network Monitoring for Learning Security Threats
This paper presents a semi-supervised approach
for intrusion detection. The method extends the unsupervised
Multivariate Statistical Network Monitoring approach based
on Principal Component Analysis by introducing a supervised
optimization technique to learn the optimum scaling in the input
data. It combines the advantages of the unsupervised strategy,
capable of uncovering new threats, with those of supervised
strategies, able to learn the pattern of a targeted threat. The
supervised learning is based on an extension of the gradient
descent method based on Partial Least Squares (PLS). Moreover,
we enhance this method by using sparse PLS variants. The
practical application of the system is demonstrated on a recently
published real case study, showing relevant improvements in
detection performance and in the interpretation of the attacks.
Optimal Relay Placement in Multi-hop Wireless Networks
Relay node placement in wireless environments is a research topic recurrently
studied in the specialized literature. A variety of network performance goals,
such as coverage, data rate and network lifetime, are considered as criteria
to lead the placement of the nodes. In this work, a new relay placement approach to maximize network connectivity in a multi-hop wireless network is
presented. Here, connectivity is defined as a combination of inter-node reachability and network throughput. The nodes are placed following a two-step
procedure: (i) initial distribution, and (ii) solution selection. Additionally,
a third stage for placement optimization is optionally proposed to maximize
throughput. This is intended as a general approach for placement, and several
initialization, selection and optimization algorithms can be used in each of
the steps. For experimentation purposes, a leave-one-out selection procedure and a PSO-related optimization algorithm are employed and evaluated
for the second and third stages, respectively. Other node placement solutions
available in the literature are compared with the proposed one in realistic
simulated scenarios. The results obtained through the properly devised experiments show the improvements achieved by the proposed approach.
Production of structured triacylglycerols by acidolysis catalyzed by lipases immobilized in a packed bed reactor
The aim of this work was to produce structured triacylglycerols (STAGs), with caprylic acid located at positions 1 and 3 of the glycerol backbone and docosahexaenoic acid (DHA) at position 2, by acidolysis of tuna oil and caprylic acid (CA) catalyzed by lipases Rd, from Rhizopus delemar, and Palatase 20000L from Mucor miehei immobilized on Accurel MP1000 in a packed bed reactor (PBR), working in continuous and recirculation modes. First, different lipase/support ratios were tested for the immobilization of lipases and the best results were obtained with ratios of 0.67 (w/w) for lipase Rd and 6.67 (w/w) for Palatase. Both lipases were stable for at least 4 days in the operational conditions. In the storage conditions (5 °C) lipases Rd and Palatase maintained constant activity for 5 months and 1 month, respectively. These catalysts have been used to obtain STAGs by acidolysis of tuna oil and CA in a PBR operating with recirculation of the reaction mixture through the lipase bed. Thus, STAGs with 52-53% CA and 14-15% DHA were obtained. These results were the basis for establishing the operational conditions to obtain STAGs operating in continuous mode. These new conditions were established maintaining constant intensity of treatment (IOT, lipase amount × reaction time/oil amount). In this way STAGs with 44-50% CA and 17-24% DHA were obtained operating in continuous mode. Although the compositions of STAGs obtained with both lipases were similar, Palatase required an IOT about 4 times higher than lipase Rd. To separate the acidolysis products (free fatty acids, FFAs, and STAGs) an extraction method of FFAs by water-ethanol solutions was tested. The following variables were optimized: water/ethanol ratio (the best results were attained with a water/ethanol ratio of 30:70 w/w), the solvent/FFA-STAG mixture ratio (3:1 w/w) and the number of extraction steps (3 to 5). In these conditions highly pure STAGs (93-96%) were obtained with a yield of 85%.
The residual FFAs can be eliminated by neutralization with a hydroethanolic KOH solution to obtain pure STAGs. The positional analysis of these STAGs, carried out by alcoholysis catalyzed by lipase Novozym 435, has shown that CA represents 55% of fatty acids located at positions 1 and 3 and DHA represents 42% of fatty acids at position 2.
Funded by Ministerio de Educación y Ciencia (Spain), FEDER, Projects AGL2003-03335 and CTQ2007-6407