99 research outputs found

    Non-Gaussian component analysis: testing the dimension of the signal subspace

    Dimension reduction is a common strategy in multivariate data analysis: it seeks a subspace that contains all the interesting features needed for the subsequent analysis. Non-Gaussian component analysis approaches this task by dividing the data into a non-Gaussian part, the signal, and a Gaussian part, the noise. We show that the simultaneous use of two scatter functionals can serve this purpose and suggest a bootstrap test for the dimension of the non-Gaussian subspace. Sequential application of the test can then be used, for example, to estimate the signal dimension.
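
    The two-scatter idea can be made concrete by pairing the covariance matrix with a fourth-moment scatter: in the coordinates where both are diagonal, Gaussian components produce a known fourth-moment eigenvalue, and deviation from it flags signal. Below is a minimal base-R sketch of such a bootstrap dimension test; the names (fobi_eig, ngca_boot_test) and the simplified resampling scheme are illustrative only, and the sketch assumes the columns of X have already been rotated to the estimated invariant coordinates with the signal in the first k columns.

        fobi_eig <- function(X) {
          Xc <- sweep(X, 2, colMeans(X))                 # center
          Z  <- Xc %*% solve(chol(cov(Xc)))              # whiten: cov(Z) = I
          C4 <- crossprod(Z * rowSums(Z^2), Z) / nrow(Z) # fourth-moment scatter
          ev <- eigen(C4, symmetric = TRUE)$values       # Gaussian coords give p + 2
          ev[order(abs(ev - (ncol(X) + 2)), decreasing = TRUE)] # most non-Gaussian first
        }

        ngca_boot_test <- function(X, k, n_boot = 200) {
          p <- ncol(X); n <- nrow(X)
          ## H0: only k non-Gaussian components, so the p - k remaining
          ## fourth-moment eigenvalues should cluster around a common value
          stat  <- function(ev) sum((ev[(k + 1):p] - mean(ev[(k + 1):p]))^2)
          t_obs <- stat(fobi_eig(X))
          t_bs  <- replicate(n_boot, {
            Xb <- cbind(X[sample(n, replace = TRUE), seq_len(k), drop = FALSE],
                        matrix(rnorm(n * (p - k)), n))   # regenerate the noise part
            stat(fobi_eig(Xb))
          })
          (1 + sum(t_bs >= t_obs)) / (1 + n_boot)        # bootstrap p-value
        }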

    Robust Nonparametric Inference

    In this article, we provide a personal review of the literature on nonparametric and robust tools for the standard univariate and multivariate location and scatter problems, as well as linear regression, with a special focus on sign and rank methods, their equivariance and invariance properties, and their robustness and efficiency. Beyond parametric models, the population quantities of interest are often formulated as location, scatter, skewness, kurtosis and other functionals. Some old and recent tools for model checking, dimension reduction, and subspace estimation in wide semiparametric models are discussed. We also discuss recent extensions of these procedures to certain nonstandard semiparametric cases, including clustered and matrix-valued data, and close with a personal list of important unsolved problems and future issues.
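
    As a concrete taste of the sign methods reviewed, the spatial median is the location estimate at which the average of the unit direction vectors (the spatial signs) to the observations vanishes; the classical Weiszfeld iteration computes it in a few lines of base R. This is an illustrative sketch only; dedicated packages such as ICSNP provide refined implementations.

        spatial_median <- function(X, tol = 1e-8, max_iter = 500) {
          mu <- colMeans(X)                        # start from the sample mean
          for (i in seq_len(max_iter)) {
            D <- sweep(X, 2, mu)                   # differences x_i - mu
            r <- sqrt(rowSums(D^2))                # distances to current estimate
            r[r < tol] <- tol                      # guard against zero division
            mu_new <- colSums(X / r) / sum(1 / r)  # mean weighted by 1/distance
            if (sum((mu_new - mu)^2) < tol^2) break
            mu <- mu_new
          }
          mu
        }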

    On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses

    Dimension reduction is often a preliminary step in the analysis of data sets with a large number of variables. Most classical dimension reduction methods, both supervised and unsupervised, such as principal component analysis (PCA), independent component analysis (ICA) and sliced inverse regression (SIR), can be formulated using one, two or several different scatter matrix functionals. Scatter matrices can be seen as different measures of multivariate dispersion: they may highlight different features of the data, and comparing them may reveal interesting structures. Such an analysis searches for a projection onto an interesting (signal) part of the data, so it is also important to know the correct dimension of the signal subspace. These approaches usually make either no model assumptions or work in wide classes of semiparametric models. Theoretical results in the literature are, however, limited to the case where the sample size exceeds the number of variables, which is hardly ever true for data sets encountered in bioinformatics. In this paper, we briefly review the relevant literature and explore whether these dimension reduction tools can be used to find relevant and interesting subspaces for small-n-large-p data sets. We illustrate the methods with a microarray data set of prostate cancer patients and healthy controls.
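
    The core computation behind such two-scatter methods fits in a few lines: find a transformation making the first scatter the identity and the second diagonal. The following base-R sketch takes the scatter functionals as pluggable arguments; the fourth-moment scatter cov4 is included purely for illustration, and the small-n-large-p setting requires the modifications the paper discusses.

        ## Simultaneous diagonalization of two scatters: returns B with
        ## B %*% S1(X) %*% t(B) = I and B %*% S2(X) %*% t(B) diagonal.
        two_scatter <- function(X, S1 = cov, S2 = cov4) {
          e1 <- eigen(S1(X), symmetric = TRUE)
          W  <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors) # S1^(-1/2)
          e2 <- eigen(W %*% S2(X) %*% W, symmetric = TRUE)
          B  <- t(e2$vectors) %*% W            # invariant coordinate transform
          list(B = B, coords = sweep(X, 2, colMeans(X)) %*% t(B))
        }

        ## Example second scatter: a scaled matrix of fourth moments,
        ## normalized so that cov4 = cov for Gaussian data.
        cov4 <- function(X) {
          Xc <- sweep(X, 2, colMeans(X))
          r2 <- rowSums((Xc %*% solve(cov(Xc))) * Xc)  # squared Mahalanobis radii
          crossprod(Xc * r2, Xc) / (nrow(Xc) * (ncol(Xc) + 2))
        }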

    Blind Source Separation Based on Joint Diagonalization in R: The Packages JADE and BSSasymp

    Blind source separation (BSS) is a well-known signal processing tool used to solve practical data analysis problems in various fields of science. In BSS, the observed data are assumed to consist of linear mixtures of latent variables, where both the mixing system and the distributions of the latent variables are unknown. The aim is to estimate an unmixing matrix which transforms the observed data back to the latent sources. In this paper we present the R packages JADE and BSSasymp. The package JADE offers several BSS methods based on joint diagonalization. The package BSSasymp contains functions for computing the asymptotic covariance matrices, as well as their data-based estimates, for most of the BSS estimators included in JADE. Several simulated and real data sets are used to illustrate the functions in these two packages.
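
    A small usage sketch on simulated data: mix three independent sources, unmix with JADE, and score the result with the minimum distance index MD from the same package (0 means perfect recovery up to the natural BSS indeterminacies). Function and argument names follow the package documentation, and coef() is assumed here to extract the unmixing matrix estimate from the returned bss object.

        library("JADE")

        set.seed(1)
        n <- 1000
        S <- cbind(rexp(n) - 1, runif(n, -1, 1), rt(n, df = 5)) # independent sources
        A <- matrix(rnorm(9), 3, 3)                             # unknown mixing matrix
        X <- S %*% t(A)                                         # observed mixtures

        res <- JADE(X, n.comp = 3)   # joint diagonalization of cumulant matrices
        MD(coef(res), A)             # minimum distance index of the estimate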

    Sliced average variance estimation for multivariate time series

    Supervised dimension reduction for time series is challenging, as there may be temporal dependence between the response y and the predictors. Recently a time series version of sliced inverse regression, TSIR, was suggested; it applies approximate joint diagonalization of several supervised lagged covariance matrices to account for the temporal nature of the data. In this paper, we develop this concept further and propose a time series version of sliced average variance estimation, TSAVE. As both TSIR and TSAVE have their own advantages and disadvantages, we furthermore consider a hybrid version of the two. Based on examples and simulations, we demonstrate and evaluate the differences between the three methods and show that they are superior to applying their iid counterparts with lagged values of the explanatory variables added as predictors.
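
    To make the construction concrete, the following base-R sketch computes one of the supervised lagged matrices that TSIR-type methods approximately jointly diagonalize: for a given lag, slice the future response into intervals and take the weighted covariance of the slice means of the whitened predictors. The slicing scheme and names are illustrative and assume a continuous response (so the quantile breaks are distinct); the tsBSS package provides implementations of TSIR, TSAVE and their hybrid.

        ## M_lag ~ Cov( E[ x_t | y_{t+lag} ] ), estimated via H response slices
        supervised_lag_cov <- function(X, y, lag, H = 10) {
          n  <- nrow(X)
          Xl <- X[1:(n - lag), , drop = FALSE]  # predictors at time t
          yl <- y[(1 + lag):n]                  # response at time t + lag
          Xc <- sweep(Xl, 2, colMeans(Xl))
          Z  <- Xc %*% solve(chol(cov(Xc)))     # whitened predictors
          slice <- cut(yl, quantile(yl, 0:H / H), include.lowest = TRUE)
          means <- do.call(rbind,
                           lapply(split(seq_len(n - lag), slice),
                                  function(i) colMeans(Z[i, , drop = FALSE])))
          wts <- as.vector(table(slice)) / (n - lag)  # slice proportions
          crossprod(means * sqrt(wts))          # weighted between-slice covariance
        }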

    Independent component analysis for tensor-valued data

    In preprocessing tensor-valued data, e.g., images and videos, a common procedure is to vectorize the observations and subject the resulting vectors to one of the many methods used for independent component analysis (ICA). However, the tensor structure of the original data is lost in the vectorization, and, as a more suitable alternative, we propose matrix and tensor fourth order blind identification (MFOBI and TFOBI). In these tensorial extensions of the classic fourth order blind identification (FOBI) we assume a Kronecker structure for the mixing and perform FOBI simultaneously on each mode of the observed tensors. We discuss the theory and assumptions behind MFOBI and TFOBI and provide two different algorithms and the related estimates of the unmixing matrices, along with their asymptotic properties. Finally, simulations are used to compare the methods' performance with that of classic FOBI for vectorized data, and we end with a real-data clustering example.
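
    Under the Kronecker assumption there is one unmixing matrix per mode, each estimable from that mode's second and fourth moments. Below is a simplified base-R reading of the matrix-variate case (MFOBI) for a d1 x d2 x n array; the scaling constants and the exact fourth-moment form follow one interpretation of the method, and the authors' implementation should be preferred in practice.

        mfobi <- function(X) {                   # X: d1 x d2 x n array
          d1 <- dim(X)[1]; d2 <- dim(X)[2]; n <- dim(X)[3]
          Xc <- sweep(X, 1:2, apply(X, 1:2, mean))        # center elementwise
          SL <- matrix(0, d1, d1); SR <- matrix(0, d2, d2)
          for (i in 1:n) {                       # mode-wise covariances
            SL <- SL + Xc[, , i] %*% t(Xc[, , i]) / (n * d2)
            SR <- SR + t(Xc[, , i]) %*% Xc[, , i] / (n * d1)
          }
          WL <- t(solve(chol(SL))); WR <- t(solve(chol(SR))) # mode whiteners
          Z <- array(0, dim(Xc))
          for (i in 1:n) Z[, , i] <- WL %*% Xc[, , i] %*% t(WR)
          BL <- matrix(0, d1, d1); BR <- matrix(0, d2, d2)
          for (i in 1:n) {                       # mode-wise fourth moments
            BL <- BL + Z[, , i] %*% t(Z[, , i]) %*% Z[, , i] %*% t(Z[, , i]) / n
            BR <- BR + t(Z[, , i]) %*% Z[, , i] %*% t(Z[, , i]) %*% Z[, , i] / n
          }
          UL <- eigen(BL, symmetric = TRUE)$vectors
          UR <- eigen(BR, symmetric = TRUE)$vectors
          S <- array(0, dim(Xc))                 # estimated source matrices
          for (i in 1:n) S[, , i] <- t(UL) %*% Z[, , i] %*% UR
          list(WL = t(UL) %*% WL, WR = t(UR) %*% WR, S = S)
        }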

    Extracting Conditionally Heteroskedastic Components using Independent Component Analysis

    In the independent component model, multivariate data are assumed to be a mixture of mutually independent latent components, and independent component analysis (ICA) aims at estimating these latent components. In this article, we study an ICA method which combines linear and quadratic autocorrelations to enable efficient estimation of various kinds of stationary time series. Statistical properties of the estimator are studied by deriving its limiting distribution under general conditions, and the asymptotic variances are derived in the case of an ARMA-GARCH model. We use the asymptotic results and a finite-sample simulation study to compare different choices of a weight coefficient. As it is often of interest to identify all components which exhibit stochastic volatility features, we suggest a test statistic for this problem. We also show that a slightly modified version of principal volatility component analysis can be seen as an ICA method. Finally, we apply the estimators to a data set consisting of exchange-rate time series of seven currencies against the US dollar. Supporting information, including proofs of the theorems, is available online.
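
    The estimation idea can be sketched compactly: after whitening, each candidate component u_t = w'z_t is scored by a weighted combination of squared linear autocovariances and squared quadratic (squared-series) autocovariances, and the unmixing matrix maximizes the total score over orthogonal rotations. The function below shows one plausible form of such a criterion for a single direction; the exact functional form, lag set, and weight coefficient a are the quantities the paper studies and are not reproduced verbatim here.

        ## score one candidate direction w on a whitened series Z (rows = time)
        combined_acov_crit <- function(w, Z, lags = 1:12, a = 0.8) {
          u <- as.vector(Z %*% w) / sqrt(sum(w^2))  # unit-norm projection
          n <- length(u)
          sum(vapply(lags, function(k) {
            lin  <- mean(u[1:(n - k)] * u[(k + 1):n])                    # linear autocov
            quad <- mean(u[1:(n - k)]^2 * u[(k + 1):n]^2) - mean(u^2)^2  # quadratic autocov
            a * lin^2 + (1 - a) * quad^2
          }, numeric(1)))
        }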

    Tensorial blind source separation for improved analysis of multi-omic data

    There is an increased need for integrative analyses of multi-omic data. We present and benchmark a novel tensorial independent component analysis (tICA) algorithm against current state-of-the-art methods. We find that tICA outperforms competing methods in identifying biological sources of data variation at a reduced computational cost. On epigenetic data, tICA can identify methylation quantitative trait loci at high sensitivity. In the cancer context, tICA identifies gene modules whose expression variation across tumours is driven by copy-number or DNA methylation changes, but whose deregulation relative to normal tissue is independent of such alterations, a result we validate by direct analysis of the individual data types.

    Subgroup detection in genotype data using invariant coordinate selection

    Background: The current gold standard among dimension reduction methods for high-throughput genotype data is Principal Component Analysis (PCA). PCA is so dominant that other methods can rarely be found in the analyst's toolbox and hence are only rarely applied.
    Results: We present a modern dimension reduction method called Invariant Coordinate Selection (ICS) and its application to high-throughput genotype data. The more commonly known Independent Component Analysis (ICA) is, in this framework, just a special case of ICS. We use ICS on both a simulated and a real data set, first to demonstrate some deficiencies of PCA and to show how ICS is capable of recovering the correct subgroups within the simulated data. Second, we apply ICS to a chicken data set and detect two subgroups there as well. These subgroups are then further investigated with respect to their genotype to provide further evidence of the biological relevance of the detected subgroup division. We also compare the performance of ICS to that of five other popular dimension reduction methods.
    Conclusion: ICS was able to detect subgroups in data where PCA fails to detect anything. Hence, we promote the application of ICS to high-throughput genotype data in addition to the established PCA. Especially in statistical programming environments such as R, its application does not add any computational burden to the analysis pipeline.
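
    A usage sketch on simulated data with a hidden minority subgroup, assuming the ics() entry point and the ics.components() extractor of the ICS R package (with the default covariance/fourth-moment scatter pair this transform essentially coincides with FOBI): subgroup structure typically surfaces in the extreme invariant coordinates, where a simple clustering step can pick it up.

        library("ICS")

        set.seed(42)
        n <- 500; p <- 10
        grp <- rep(0:1, c(450, 50))            # hidden 10% subgroup
        X <- matrix(rnorm(n * p), n, p)
        X[grp == 1, 1] <- X[grp == 1, 1] + 4   # subgroup shifted along one axis
        X <- X %*% matrix(rnorm(p * p), p)     # ... then hidden by mixing

        res <- ics(X)                          # default scatters: cov and cov4
        IC  <- ics.components(res)             # the invariant coordinates
        ## a minority subgroup inflates kurtosis, so inspect the first coordinate
        table(grp, kmeans(IC[, 1], centers = 2)$cluster)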