106 research outputs found

    On the complementary nature of ANOVA simultaneous component analysis (ASCA+) and Tucker3 tensor decompositions on designed multi-way datasets

    Get PDF
    The complementary nature of analysis of variance (ANOVA) Simultaneous Component Analysis (ASCA+) and Tucker3 tensor decompositions is demonstrated on designed datasets. We show how ASCA+ can be used to (a) identify statistically sufficient Tucker3 models; (b) identify statistically important triads making their interpretation easier; and (c) eliminate non-significant triads making visualization and interpretation simpler. For multivariate datasets with an experimental design of at least two factors, the data matrix can be folded into a multi-way tensor. ASCA+ can be used on the unfolded matrix, and Tucker3 modeling can be used on the folded matrix (tensor). Two novel strategies are reported to determine the statistical significance of Tucker3 models using a previously published dataset. A statistically sufficient model was created by adding factors to the Tucker3 model in a stepwise manner until no ASCA+ detectable structure was observed in the residuals. Bootstrap analysis of the Tucker3 model residuals was used to determine confidence intervals for the loadings and the individual elements of the core matrix and showed that 21 out of 63 core values of the 3 7 3 model were not significant at the 95% confidence level. Exploiting the mutual orthogonality of the 63 triads of the Tucker3 model, these 21 factors (triads) were removed from the model. An ASCA+ backward elimination strategy is reported to further simplify the Tucker3 3 7 3 model to 36 core values and associated triads. ASCA+ was also used to identify individual factors (triads) with selective responses on experimental factors A, B, or interactions, A B, for improved model visualization and interpretation

    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects

    Full text link
    [EN] Cross-validation has become one of the principal methods to adjust the meta-parameters in predictive models. Extensions of the cross-validation idea have been proposed to select the number of components in principal components analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most used algorithms for principal components analysis cross-validation. This is the method programmed in the PLS_Toolbox, and it has been stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation, and it can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is driven to identify in which situations the application of ekf is adequate and, more importantly, in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discard principal components with important information. On a second paper of this series, other imputation methods are studied within the ekf algorithmResearch in this area is partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grant DPI2011-28112-C04-02. Jose Camacho was funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. This study was carried out when Jose Camacho was at the Universidad Politecnica de Valencia and Universitat de Girona, Spain.Camacho Páez, J.; Ferrer Riquelme, AJ. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. Journal of Chemometrics. 26(1):361-373. https://doi.org/10.1002/cem.2440S361373261Wold, S. (1978). Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics, 20(4), 397-405. doi:10.1080/00401706.1978.10489693Eastment, H. T., & Krzanowski, W. J. (1982). Cross-Validatory Choice of the Number of Components From a Principal Component Analysis. Technometrics, 24(1), 73-77. doi:10.1080/00401706.1982.10487712Nomikos, P., & MacGregor, J. F. (1995). Multivariate SPC Charts for Monitoring Batch Processes. Technometrics, 37(1), 41-59. doi:10.1080/00401706.1995.10485888Bro, R., Kjeldahl, K., Smilde, A. K., & Kiers, H. A. L. (2008). Cross-validation of component models: A critical look at current methods. Analytical and Bioanalytical Chemistry, 390(5), 1241-1251. doi:10.1007/s00216-007-1790-1Wise BM Gallagher NB Bro R Shaver JM Windig W Koch RS PLSToolbox 3.5 for use with Matlab 2005Nelson, P. R. C., Taylor, P. A., & MacGregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1), 45-65. doi:10.1016/s0169-7439(96)00007-xArteaga, F., & Ferrer, A. (2002). Dealing with missing data in MSPC: several methods, different interpretations, some examples. Journal of Chemometrics, 16(8-10), 408-418. doi:10.1002/cem.750Arteaga, F., & Ferrer, A. (2005). Framework for regression-based missing data imputation methods in on-line MSPC. Journal of Chemometrics, 19(8), 439-447. doi:10.1002/cem.946Zhang, P. (1993). Model Selection Via Multifold Cross Validation. The Annals of Statistics, 21(1), 299-313. doi:10.1214/aos/1176349027Louwerse, D. J., Smilde, A. K., & Kiers, H. A. L. (1999). Cross-validation of multiway component models. Journal of Chemometrics, 13(5), 491-510. doi:10.1002/(sici)1099-128x(199909/10)13:53.0.co;2-2Lei, F., Rotbøll, M., & Jørgensen, S. B. (2001). A biochemically structured model for Saccharomyces cerevisiae. Journal of Biotechnology, 88(3), 205-221. doi:10.1016/s0168-1656(01)00269-3López, F., Miguel Valiente, J., Manuel Prats, J., & Ferrer, A. (2008). Performance evaluation of soft color texture descriptors for surface grading using experimental design and logistic regression. Pattern Recognition, 41(5), 1744-1755. doi:10.1016/j.patcog.2007.09.011Camacho, J., Picó, J., & Ferrer, A. (2010). Data understanding with PCA: Structural and Variance Information plots. Chemometrics and Intelligent Laboratory Systems, 100(1), 48-56. doi:10.1016/j.chemolab.2009.10.005Mercer, A. M., & Mercer, P. R. (2000). Cauchy’s interlace theorem and lower bounds for the spectral radius. International Journal of Mathematics and Mathematical Sciences, 23(8), 563-566. doi:10.1155/s016117120000257

    On the Generation of Random Multivariate Data

    Get PDF
    The simulation of multivariate data is often necessary for assessing the performance of multivariate analysis techniques. The random generation of multivariate data when the covariance matrix is completely or partly specified is solved by different methods, from the Cholesky decomposition to some recent alternatives. However, many times the covariance matrix has to be generated also at random, so that the data simulation spans different situations from highly correlated to uncorrelated data. This is the case when assessing a new multivariate analysis technique in Montercarlo experiments. In this paper, we introduce a new algorithm for the generation of random data from covariance matrices of random structure, where the user only decides the data dimension and the level of correlation. We will illustrate the application of this algorithm in several relevant problems in multivariate analysis, namely the selection of the number of Principal Components in Principal Component Analysis, the evaluation of the performance of sparse Partial Least Squares and the calibration of Multivariate Statistical Process Control systems. The algorithm is available as part of the MEDA Toolbox v1.1

    Variable-selection ANOVA Simultaneous Component Analysis (VASCA)

    Get PDF
    Motivation: ANOVA Simultaneous Component Analysis (ASCA) is a popular method for the analysis of multivariate data yielded by designed experiments. Meaningful associations between factors/interactions of the experimental design and measured variables in the dataset are typically identified via significance testing, with permutation tests being the standard go-to choice. However, in settings with large numbers of variables, like omics (genomics, transcriptomics, proteomics and metabolomics) experiments, the ‘holistic’ testing approach of ASCA (all variables considered) often overlooks statistically significant effects encoded by only a few variables (biomarkers). Results: We hereby propose Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real dataset from a multi-omic clinical experiment. We show that VASCA is more powerful than both ASCA and the widely adopted false discovery rate controlling procedure; the latter is used as a benchmark for variable selection based on multiple significance testing. We further illustrate the usefulness of VASCA for exploratory data analysis in comparison to the popular partial least squares discriminant analysis method and its sparse counterpart.Agencia Andaluza del Conocimiento, Regional Government of Andalucia , in SpainEuropean Commission B-TIC-136-UGR20State Research Agency (AEI) of SpainEuropean Social Fund (ESF) RYC2020-030536-IAEI PID2020-118139RB-I0

    Present and Future of Network Security Monitoring

    Get PDF
    This work was funded by the Ministry of Science and Innovation through CDTI through the Ayudas Cervera para Centros Tecnologicos grant of the Spanish Centre for the Development of Industrial Technology (CDTI) through the Project EGIDA under Grant CER-20191012, and in part by the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund (ERDF) funds under Project TIN2017-83494-R.Network Security Monitoring (NSM) is a popular term to refer to the detection of security incidents by monitoring the network events. An NSM system is central for the security of current networks, given the escalation in sophistication of cyberwarfare. In this paper, we review the state-of-the-art in NSM, and derive a new taxonomy of the functionalities and modules in an NSM system. This taxonomy is useful to assess current NSM deployments and tools for both researchers and practitioners. We organize a list of popular tools according to this new taxonomy, and identify challenges in the application of NSM in modern network deployments, like Software Defined Network (SDN) and Internet of Things (IoT).Ministry of Science and Innovation through CDTI through the Ayudas Cervera para Centros Tecnologicos grant of the Spanish Centre for the Development of Industrial Technology (CDTI) through the Project EGIDA CER-20191012Spanish Ministry of Economy and CompetitivenessEuropean Regional Development Fund (ERDF) funds TIN2017-83494-

    Group-wise Partial Least Square Regression

    Get PDF
    This paper introduces the Group-wise Partial Least Squares (GPLS) regression. GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is de ned in terms of groups of correlated variables, similarly to what is done in the related Group-wise Principal Component Analysis (GPCA). These groups are found in correlation maps derived from the data to be analyzed. GPLS is especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following this approach, we show GPLS solves an inherent problem of SPLS: its tendency to confound the data structure as a result of setting its metaparameters using standard approaches for optimizing prediction, like cross-validation. Results are shown for both simulated and experimental data

    On the use of the observation-wise k-fold operation in PCA cross-validation

    Get PDF
    Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the independence between model testing and calibration, the observation-wise k-fold operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive and it is the main limitation to apply CV on very large data sets. In this paper we carry out an empirical and theoretical investigation of the use of this operation in the element wise k-fold (ekf ) algorithm, the state-of-the-art CV algorithm. We show that when very large data sets need to be cross-validated and the computational time is a matter of concern, the observation-wise k-fold operation can be skipped. The theoretical properties of the resulting modi ed algorithm, referred to as column wise k-fold (ckf ) algorithm, are derived. Also, its performance is evaluated with several arti cial and real data sets. We suggest the ckf algorithm to be a valid alternative to the standard ekf to reduce the computational time needed to cross-validate a data set

    Monitorización y selección de incidentes en seguridad de redes mediante EDA

    Get PDF
    Uno de los mayores retos a los que se enfrentan los sistemas de monitorización de seguridad en redes es el gran volumen de datos de diversa naturaleza y relevancia que deben procesar para su presentación adecuada al equipo administrador del sistema, tratando de incorporar la información semántica más relevante. En este artículo se propone la aplicación de herramientas derivadas de técnicas de análisis exploratorio de datos para la selección de los eventos críticos en los que el administrador debe focalizar su atención. Adicionalmente, estas herramientas son capaces de proporcionar información semántica en relación a los elementos involucrados y su grado de implicación en los eventos seleccionados. La propuesta se presenta y evalúa utilizando el desafío VAST 2012 como caso de estudio, obteniéndose resultados altamente satisfactorios.Este trabajo ha sido parcialmente financiado por el MICINN a través del proyecto TEC2011-22579

    A Novel Zero-Trust Network Access Control Scheme based on the Security Profile of Devices and Users

    Get PDF
    Security constitutes a principal concern for communication networks and services at present. This way, threats should be under control to minimize risks over time in real environments. With this aim, we introduce here a new approach for access control aimed to strengthen security in corporate networks and service providers related environments. Our proposal, named SADAC (Security Attribute-based Dynamic Access Control) presents three main novel features: (i) security related attributes regarding both configuration and operation are considered for network access control of final devices/users; (ii) a dynamic supervision procedure is implemented to evaluate the security profile associated to devices/users over time and, if so, to apply corresponding access restrictions; and (iii) a supervision procedure that also permits to diagnose the causes of inadequate security behaviours, so that the final devices/users can adapt their configuration and/or operation. We describe the overall access control methodology as well as the aspects for its implementation. In particular, we present and evaluate the specific deployment of SADAC for a corporate WiFi environment supported on a Raspberry Pi-based AP to provide Internet access to mobile devices. Through this experimentation we can conclude the convenience of adopting the approach for improving security by minimizing risks in network and communication environments

    Group-Wise Principal Component Analysis for Exploratory Intrusion Detection

    Get PDF
    Intrusion detection is a relevant layer of cybersecurity to prevent hacking and illegal activities from happening on the assets of corporations. Anomaly-based Intrusion Detection Systems perform an unsupervised analysis on data collected from the network and end systems, in order to identify singular events. While this approach may produce many false alarms, it is also capable of identifying new (zeroday) security threats. In this context, the use of multivariate approaches such as Principal Component Analysis (PCA) provided promising results in the past. PCA can be used in exploratory mode or in learning mode. Here, we propose an exploratory intrusion detection that replaces PCA with Group-wise PCA (GPCA), a recently proposed data analysis technique with additional exploratory characteristics. A main advantage of GPCA over PCA is that the former yields simple models, easy to understand by security professionals not trained in multivariate tools. Besides, the workflow in the intrusion detection with GPCA is more coherent with dominant strategies in intrusion detection. We illustrate the application of GPCA in two case studies.This work was supported in part by the Spanish Government-MINECO (Ministerio de Economía y Competitividad), using the Fondo Europeo de Desarrollo Regional (FEDER), under Projects TIN2014-60346-R and Project TIN2017-83494-R
    corecore