27 research outputs found

    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects

    Cross-validation has become one of the principal methods for adjusting the meta-parameters of predictive models. Extensions of the cross-validation idea have been proposed to select the number of components in principal component analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most widely used algorithms for PCA cross-validation. It is the method implemented in the PLS_Toolbox, and a numerical experiment has reported that it outperforms other methods under most circumstances. The ekf algorithm is based on missing data imputation and can be implemented with any imputation method. This paper analyzes the ekf algorithm with the simplest missing data imputation method, trimmed score imputation. A theoretical study is carried out to identify the situations in which applying ekf is adequate and, more importantly, those in which it is not. The results show that the ekf method may be unable to assess the extent to which a model represents a test set, and may lead to discarding principal components that carry important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm. Research in this area is partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grant DPI2011-28112-C04-02. Jose Camacho was funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. This study was carried out when Jose Camacho was at the Universidad Politecnica de Valencia and Universitat de Girona, Spain. Camacho Páez, J.; Ferrer Riquelme, AJ. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. Journal of Chemometrics. 26(1):361-373. https://doi.org/10.1002/cem.2440
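    The imputation step that the abstract above refers to can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a centered observation and a fixed, already-fitted loading matrix `P` (hypothetical values), and it omits the observation-wise k-fold step in which the model is refitted without the test rows. The function names are illustrative only. With trimmed score imputation, the left-out elements are set to zero (the mean of centered data), the scores are computed from the remaining elements, and the left-out elements are predicted from those scores.

```python
def tri_scores(x, P, missing):
    """Scores of observation x using only the observed elements.

    x: centered observation (list of floats)
    P: loadings, one row per variable, one column per component
    missing: set of variable indices treated as missing (set to zero)
    """
    n_comp = len(P[0])
    return [
        sum(P[j][a] * x[j] for j in range(len(x)) if j not in missing)
        for a in range(n_comp)
    ]


def tri_reconstruct(x, P, missing):
    """Reconstruct every element of x from the trimmed scores."""
    t = tri_scores(x, P, missing)
    return [sum(P[j][a] * t[a] for a in range(len(t))) for j in range(len(x))]


def elementwise_press(X, P):
    """Sum of squared prediction errors, leaving out one element at a time."""
    press = 0.0
    for x in X:
        for j in range(len(x)):
            x_hat = tri_reconstruct(x, P, {j})
            press += (x[j] - x_hat[j]) ** 2
    return press


# Hypothetical one-component model with a unit-norm loading vector.
P = [[0.6], [0.8]]
# This observation lies exactly on the model: x = 5 * [0.6, 0.8].
X = [[3.0, 4.0]]

print(tri_reconstruct(X[0], P, {0})[0])  # prediction of the first element
print(tri_reconstruct(X[0], P, {1})[1])  # prediction of the second element
print(elementwise_press(X, P))
```

    In this toy case the left-out elements are predicted as roughly 1.92 and 1.44 rather than 3 and 4, even though the observation is perfectly described by the model. This shows how the prediction error depends on the imputation method as well as on the model.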

    On the Generation of Random Multivariate Data

    The simulation of multivariate data is often necessary for assessing the performance of multivariate analysis techniques. The random generation of multivariate data when the covariance matrix is completely or partly specified can be solved by different methods, from the Cholesky decomposition to some recent alternatives. However, the covariance matrix often has to be generated at random as well, so that the data simulation spans different situations, from highly correlated to uncorrelated data. This is the case when assessing a new multivariate analysis technique in Monte Carlo experiments. In this paper, we introduce a new algorithm for the generation of random data from covariance matrices of random structure, where the user only decides the data dimension and the level of correlation. We illustrate the application of this algorithm in several relevant problems in multivariate analysis, namely the selection of the number of principal components in Principal Component Analysis, the evaluation of the performance of sparse Partial Least Squares, and the calibration of Multivariate Statistical Process Control systems. The algorithm is available as part of the MEDA Toolbox v1.1.
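    For the fully specified case that the abstract mentions first, the Cholesky approach can be sketched as below. This is not the new algorithm proposed in the paper, which draws the covariance structure itself at random. The covariance matrix `S`, the sample size and the function names are assumptions chosen for illustration. If `L` is the lower-triangular factor with `L L^T = S` and `z` is a vector of independent standard normal draws, then `L z` has covariance `S`.

```python
import math
import random


def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def simulate(S, n_obs, seed=0):
    """Draw n_obs zero-mean Gaussian observations with covariance S."""
    rng = random.Random(seed)
    L = cholesky(S)
    m = len(S)
    data = []
    for _ in range(n_obs):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        data.append([sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)])
    return data


def sample_cov(data):
    """Unbiased sample covariance matrix of a list of observations."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(m)
        ]
        for i in range(m)
    ]


# Hypothetical target: two unit-variance variables with correlation 0.8.
S = [[1.0, 0.8], [0.8, 1.0]]
X = simulate(S, 5000)
C = sample_cov(X)
print(C)  # should be close to S, up to sampling error
```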

    Variable-selection ANOVA Simultaneous Component Analysis (VASCA)

    Motivation: ANOVA Simultaneous Component Analysis (ASCA) is a popular method for the analysis of multivariate data yielded by designed experiments. Meaningful associations between factors/interactions of the experimental design and measured variables in the dataset are typically identified via significance testing, with permutation tests being the standard go-to choice. However, in settings with large numbers of variables, like omics (genomics, transcriptomics, proteomics and metabolomics) experiments, the ‘holistic’ testing approach of ASCA (all variables considered) often overlooks statistically significant effects encoded by only a few variables (biomarkers). Results: We hereby propose Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real dataset from a multi-omic clinical experiment. We show that VASCA is more powerful than both ASCA and the widely adopted false discovery rate controlling procedure; the latter is used as a benchmark for variable selection based on multiple significance testing. We further illustrate the usefulness of VASCA for exploratory data analysis in comparison to the popular partial least squares discriminant analysis method and its sparse counterpart. Funding: Agencia Andaluza del Conocimiento, Regional Government of Andalucia, in Spain; European Commission B-TIC-136-UGR20; State Research Agency (AEI) of Spain; European Social Fund (ESF) RYC2020-030536-I; AEI PID2020-118139RB-I0

    Group-wise Partial Least Square Regression

    This paper introduces the Group-wise Partial Least Squares (GPLS) regression. GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is defined in terms of groups of correlated variables, similarly to what is done in the related Group-wise Principal Component Analysis (GPCA). These groups are found in correlation maps derived from the data to be analyzed. GPLS is especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following this approach, we show that GPLS solves an inherent problem of SPLS: its tendency to confound the data structure as a result of setting its metaparameters using standard approaches for optimizing prediction, like cross-validation. Results are shown for both simulated and experimental data.

    On the use of the observation-wise k-fold operation in PCA cross-validation

    Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the independence between model testing and calibration, the observation-wise k-fold operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive, and it is the main limitation to applying CV on very large data sets. In this paper we carry out an empirical and theoretical investigation of the use of this operation in the element-wise k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that when very large data sets need to be cross-validated and the computational time is a matter of concern, the observation-wise k-fold operation can be skipped. The theoretical properties of the resulting modified algorithm, referred to as the column-wise k-fold (ckf) algorithm, are derived. Its performance is also evaluated with several artificial and real data sets. We suggest the ckf algorithm as a valid alternative to the standard ekf to reduce the computational time needed to cross-validate a data set.

    Group-Wise Principal Component Analysis for Exploratory Intrusion Detection

    Intrusion detection is a relevant layer of cybersecurity to prevent hacking and illegal activities from happening on the assets of corporations. Anomaly-based Intrusion Detection Systems perform an unsupervised analysis on data collected from the network and end systems, in order to identify singular events. While this approach may produce many false alarms, it is also capable of identifying new (zero-day) security threats. In this context, multivariate approaches such as Principal Component Analysis (PCA) have provided promising results in the past. PCA can be used in exploratory mode or in learning mode. Here, we propose an exploratory intrusion detection approach that replaces PCA with Group-wise PCA (GPCA), a recently proposed data analysis technique with additional exploratory characteristics. A main advantage of GPCA over PCA is that the former yields simple models that are easy to understand for security professionals not trained in multivariate tools. In addition, the intrusion detection workflow with GPCA is more coherent with dominant strategies in intrusion detection. We illustrate the application of GPCA in two case studies. This work was supported in part by the Spanish Government-MINECO (Ministerio de Economía y Competitividad), using the Fondo Europeo de Desarrollo Regional (FEDER), under Projects TIN2014-60346-R and TIN2017-83494-R.

    Evaluation of Diagnosis Methods in PCA-based Multivariate Statistical Process Control

    Multivariate Statistical Process Control (MSPC) based on Principal Component Analysis (PCA) is a well-known methodology in chemometrics that is aimed at testing whether an industrial process is under Normal Operation Conditions (NOC). As a part of the methodology, once an anomalous behaviour is detected, the root causes need to be diagnosed to troubleshoot the problem and/or avoid it in the future. While there have been a number of developments in diagnosis in the past decades, no sound method for comparing existing approaches has been proposed. In this paper, we propose such a procedure and use it to compare several diagnosis methods using randomly simulated data and data from realistic sources. This is a general comparative approach that takes into account factors that have not previously been considered in the literature. The results show that univariate diagnosis is more reliable than its multivariate counterpart.

    Semi-supervised Multivariate Statistical Network Monitoring for Learning Security Threats

    This paper presents a semi-supervised approach for intrusion detection. The method extends the unsupervised Multivariate Statistical Network Monitoring approach based on Principal Component Analysis by introducing a supervised optimization technique to learn the optimum scaling of the input data. It combines the advantage of the unsupervised strategy, which is capable of uncovering new threats, with that of supervised strategies, which are able to learn the pattern of a targeted threat. The supervised learning is based on an extension of the gradient descent method based on Partial Least Squares (PLS). Moreover, we enhance this method by using sparse PLS variants. The practical application of the system is demonstrated on a recently published real case study, showing relevant improvements in detection performance and in the interpretation of the attacks.

    Optimal Relay Placement in Multi-hop Wireless Networks

    Relay node placement in wireless environments is a recurrent research topic in the specialized literature. A variety of network performance goals, such as coverage, data rate and network lifetime, are used as criteria to guide the placement of the nodes. In this work, a new relay placement approach to maximize network connectivity in a multi-hop wireless network is presented. Here, connectivity is defined as a combination of inter-node reachability and network throughput. The nodes are placed following a two-step procedure: (i) initial distribution, and (ii) solution selection. Additionally, an optional third stage for placement optimization is proposed to maximize throughput. The approach is intended to be general, so that several initialization, selection and optimization algorithms can be used in each of the steps. For experimentation purposes, a leave-one-out selection procedure and a PSO-related optimization algorithm are employed and evaluated for the second and third stages, respectively. Other node placement solutions available in the literature are compared with the proposed one in realistic simulated scenarios. The results of these experiments show the improvements achieved by the proposed approach.

    Production of structured triacylglycerols by acidolysis catalyzed by lipases immobilized in a packed bed reactor

    The aim of this work was to produce structured triacylglycerols (STAGs), with caprylic acid located at positions 1 and 3 of the glycerol backbone and docosahexaenoic acid (DHA) at position 2, by acidolysis of tuna oil and caprylic acid (CA) catalyzed by lipases Rd, from Rhizopus delemar, and Palatase 20000L, from Mucor miehei, immobilized on Accurel MP1000 in a packed bed reactor (PBR), working in continuous and recirculation modes. First, different lipase/support ratios were tested for the immobilization of lipases, and the best results were obtained with ratios of 0.67 (w/w) for lipase Rd and 6.67 (w/w) for Palatase. Both lipases were stable for at least 4 days in the operational conditions. In the storage conditions (5 °C), lipases Rd and Palatase maintained constant activity for 5 months and 1 month, respectively. These catalysts were used to obtain STAGs by acidolysis of tuna oil and CA in a PBR operating with recirculation of the reaction mixture through the lipase bed. In this way, STAGs with 52-53% CA and 14-15% DHA were obtained. These results were the basis for establishing the operational conditions to obtain STAGs in continuous mode. These new conditions were established by keeping the intensity of treatment constant (IOT, lipase amount × reaction time/oil amount). In this way, STAGs with 44-50% CA and 17-24% DHA were obtained in continuous mode. Although the compositions of the STAGs obtained with both lipases were similar, Palatase required an IOT about 4 times higher than lipase Rd. To separate the acidolysis products (free fatty acids, FFAs, and STAGs), an extraction method of FFAs by water-ethanol solutions was tested. The following variables were optimized: the water/ethanol ratio (the best results were attained with a ratio of 30:70 w/w), the solvent/FFA-STAG mixture ratio (3:1 w/w) and the number of extraction steps (3 to 5). In these conditions, highly pure STAGs (93-96%) were obtained with a yield of 85%. The residual FFAs can be eliminated by neutralization with a hydroethanolic KOH solution to obtain pure STAGs. The positional analysis of these STAGs, carried out by alcoholysis catalyzed by lipase Novozym 435, has shown that CA represents 55% of the fatty acids located at positions 1 and 3, and DHA represents 42% of the fatty acids at position 2. Funding: Ministerio de Educación y Ciencia (Spain), FEDER, Projects AGL2003-03335 and CTQ2007-6407
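    The intensity of treatment defined in the abstract above (lipase amount × reaction time / oil amount) can be turned into a short calculation. The sketch below uses hypothetical quantities and units (grams and hours) that do not come from the paper, and the function names are illustrative. It shows how keeping the IOT constant fixes the oil flow rate in continuous mode: since IOT = lipase × time / oil, the oil processed per unit time is lipase / IOT.

```python
def intensity_of_treatment(lipase_amount, reaction_time, oil_amount):
    """IOT = lipase amount x reaction time / oil amount."""
    return lipase_amount * reaction_time / oil_amount


def oil_flow_for_iot(iot, lipase_amount):
    """Oil flow rate (oil amount per unit time) that gives the target IOT."""
    return lipase_amount / iot


# Hypothetical recirculation run: 10 g lipase, 24 h, 60 g oil.
iot = intensity_of_treatment(lipase_amount=10.0, reaction_time=24.0, oil_amount=60.0)

# Continuous mode with the same 10 g lipase bed and the same IOT.
flow = oil_flow_for_iot(iot, lipase_amount=10.0)

print(iot)   # in g lipase x h / g oil
print(flow)  # in g oil / h
```

    Under the same reading, a catalyst that needs an IOT four times higher would process oil at a quarter of the flow rate for the same amount of lipase.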