Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA
In High Dimension, Low Sample Size (HDLSS) data situations, where the dimension d is much larger than the sample size n, principal component analysis (PCA) plays an important role in statistical analysis. Under which conditions does the sample PCA well reflect the population covariance structure? We answer this question in a relevant asymptotic context where d grows and n is fixed, under a generalized spiked covariance model. Specifically, we assume the largest population eigenvalues to be of the order d^α, where α ≥ 1. Earlier results give conditions for consistency and strong inconsistency of the eigenvectors of the sample covariance matrix. In the boundary case, α = 1, where the sample PC directions are neither consistent nor strongly inconsistent, we show that the eigenvalues and eigenvectors do not degenerate but have limiting distributions. The result smoothly bridges the phase transition represented by the other two cases, and thus gives a spectrum of limits for the sample PCA in the HDLSS asymptotics. While the results hold in a general setting, the limiting distributions under the Gaussian assumption are illustrated in greater detail. In addition, the geometric representation of HDLSS data is extended to give three different representations, depending on the magnitude of the variances of the first few principal components.
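The phase transition described in the abstract can be probed numerically. The sketch below is illustrative only (a single-spike Gaussian covariance with top eigenvalue d^α; the dimension d, sample size n, and replication count are arbitrary choices, not the paper's): it measures the angle between the sample and population first PC directions, computed via the n × n dual eigenproblem since n ≪ d.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pc1_angle(d, n, alpha, n_rep=20):
    """Average angle (degrees) between the sample and population first PC
    direction under a single-spike covariance: lambda_1 = d**alpha, rest = 1."""
    u = np.zeros(d)
    u[0] = 1.0                                  # population eigenvector e_1
    angles = []
    for _ in range(n_rep):
        # n Gaussian samples with covariance diag(d**alpha, 1, ..., 1)
        X = rng.standard_normal((n, d))
        X[:, 0] *= np.sqrt(d ** alpha)
        # leading eigenvector of the n x n dual matrix, lifted back to R^d
        S = X @ X.T / n
        _, V = np.linalg.eigh(S)                # eigenvalues in ascending order
        v = X.T @ V[:, -1]
        v /= np.linalg.norm(v)
        angles.append(np.degrees(np.arccos(min(1.0, abs(v @ u)))))
    return float(np.mean(angles))

# d large, n fixed: alpha > 1 -> angle near 0 (consistency),
# alpha < 1 -> angle toward 90 (strong inconsistency),
# alpha = 1 -> a non-degenerate intermediate limit
for alpha in (0.5, 1.0, 1.5):
    print(alpha, round(sample_pc1_angle(d=2000, n=10, alpha=alpha), 1))
```

The dual trick keeps the cost at O(n²d) per replication, so the experiment stays cheap even for large d.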
Sampling algorithms for validation of supervised learning models for Ising-like systems
In this paper, we build and explore supervised learning models of
ferromagnetic system behavior, using Monte Carlo sampling of the spin
configuration space generated by the 2D Ising model. Given the enormous size of
the space of all possible Ising model realizations, the question arises as to
how to choose a reasonable number of samples that will form physically
meaningful and non-intersecting training and testing datasets. Here, we propose
a sampling technique called ID-MH that uses the Metropolis-Hastings algorithm
to create a Markov process across energy levels within the predefined
configuration subspace. We show that application of this method retains phase
transitions in both training and testing datasets and serves the purpose of
validation of a machine learning algorithm. For larger lattice dimensions,
ID-MH is not feasible as it requires knowledge of the complete configuration
space. As such, we develop a new "block-ID" sampling strategy: it decomposes
the given structure into square blocks with lattice dimension no greater than 5
and uses ID-MH sampling of candidate blocks. Further comparison of the
performance of commonly used machine learning methods such as random forests,
decision trees, k-nearest neighbors, and artificial neural networks shows that
the PCA-based Decision Tree regressor is the most accurate predictor of
magnetizations of the Ising model. For energies, however, the accuracy of
prediction is not satisfactory, highlighting the need to consider more
algorithmically complex methods (e.g., deep learning).
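ID-MH and block-ID are this paper's own constructions; as a minimal sketch of the underlying Metropolis-Hastings kernel they build on, here is a standard single-spin-flip sampler for a small 2D Ising lattice (lattice size, temperatures, and sweep counts are arbitrary illustrative choices, and this is not the paper's ID-MH procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_sweep(s, beta):
    """One Metropolis-Hastings sweep over an L x L Ising lattice
    (single-spin-flip proposals, periodic boundaries, coupling J = 1)."""
    L = s.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        nb = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
              + s[i, (j + 1) % L] + s[i, (j - 1) % L])
        dE = 2.0 * s[i, j] * nb          # energy change if spin (i, j) flips
        # accept downhill moves always, uphill moves with prob exp(-beta*dE)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i, j] = -s[i, j]
    return s

def magnetization(beta, L=8, sweeps=300):
    """|mean spin| after equilibrating from a random configuration."""
    s = rng.choice([-1, 1], size=(L, L))
    for _ in range(sweeps):
        metropolis_sweep(s, beta)
    return abs(s.mean())

# beta above the critical value (~0.44) gives an ordered lattice;
# beta far below it leaves the lattice disordered
print(magnetization(beta=0.6), magnetization(beta=0.05))
```

The two printed magnetizations show the phase transition that, per the abstract, ID-MH is designed to preserve in the training and testing datasets.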
Asymptotics of hierarchical clustering for growing dimension
Modern day science presents many challenges to data analysts. Advances in data collection provide very large data sets, in both the number of observations and the number of dimensions. In many areas of data analysis an informative task is to find natural separations of the data into homogeneous groups, i.e. clusters. In this paper we study the asymptotic behavior of hierarchical clustering in situations where both the sample size and the dimension grow to infinity. We derive explicit signal vs. noise boundaries between different types of clustering behavior. We also show that the clustering behavior within the boundaries is the same across a wide spectrum of asymptotic settings.
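The signal-vs-noise effect the abstract describes can be seen in a small simulation. The sketch below is illustrative rather than the paper's exact regime (two spherical Gaussian clusters, Ward linkage as one representative hierarchical method, and arbitrary choices of sample size, dimension, and per-coordinate mean shift): the between-cluster signal grows like shift·√d while the noise in pairwise distances grows like √d as well, so a sufficiently large shift yields exact recovery in high dimension, while shift = 0 leaves the merge order noise-driven.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)

def cluster_accuracy(d, shift, n=20):
    """Hierarchically cluster two spherical Gaussian samples in R^d whose
    means differ by `shift` in every coordinate; cut the Ward dendrogram
    into 2 clusters and score agreement with the true labels."""
    X = rng.standard_normal((2 * n, d))
    X[n:, :] += shift                       # per-coordinate mean shift
    labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
    truth = np.repeat([1, 2], n)
    acc = (labels == truth).mean()
    return max(acc, 1.0 - acc)              # invariant to label permutation

# with signal (shift > 0) accuracy approaches 1 as d grows;
# without signal the split is essentially arbitrary
for d in (10, 1000):
    print(d, cluster_accuracy(d, shift=0.5), cluster_accuracy(d, shift=0.0))
```

Varying the shift against d in this sketch traces out an empirical analogue of the explicit boundaries derived in the paper.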