
    Breakdown point of model selection when the number of variables exceeds the number of observations

    Abstract: The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = Xβ + z, where z ∼ N(0, σ²). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p ≫ n. We find that 1) the breakdown point is well-defined for random X-models and low noise, 2) increasing noise shifts the breakdown point to lower levels of sparsity, and reduces the model recovery ability of the algorithm in a systematic way, and 3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.
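The breakdown behaviour described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes a Gaussian random design, unit-size nonzero coefficients, and a simplified greedy forward selection (each step adds the variable most correlated with the current residual and refits least squares) that is told the true number of nonzero coefficients. The function names and the values of n, p, k and sigma are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n, p, k, sigma, rng):
    """Draw y = X beta + z with k nonzero coefficients (assumed all equal to 1)."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    beta[support] = 1.0
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta


def forward_select(X, y, k):
    """Greedy forward selection with k steps (a simplification of Forward Stepwise)."""
    n, p = X.shape
    used = np.zeros(p, dtype=bool)
    active = []
    coef = np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        # Score every unused variable by its absolute correlation with the residual.
        score = np.abs(X.T @ residual)
        score[used] = -np.inf
        j = int(np.argmax(score))
        used[j] = True
        active.append(j)
        # Refit ordinary least squares on the active set.
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ coef
    beta_hat = np.zeros(p)
    beta_hat[active] = coef
    return beta_hat


def relative_error(n, p, k, sigma=0.1, seed=0):
    """Relative l2 error of the selected model for one simulated data set."""
    rng = np.random.default_rng(seed)
    X, y, beta = simulate(n, p, k, sigma, rng)
    beta_hat = forward_select(X, y, k)
    return np.linalg.norm(beta_hat - beta) / np.linalg.norm(beta)


if __name__ == "__main__":
    # p > n in both runs; only the number of nonzero coefficients changes.
    for k in (5, 150):
        print(k, relative_error(n=200, p=400, k=k))
```

Under these assumptions, the sparse setting should give a small relative error and the dense setting a much larger one. Sweeping k over a grid and averaging over seeds would trace out the transition between the two regimes.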

    Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

    We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables. These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems. Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are universal across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-n universality across matrix ensembles; finite-sample universality can be rejected. Comment: 47 pages, 24 figures, 10 tables
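The kind of experiment this abstract describes can be sketched on a very small scale. The code below is an illustrative assumption rather than the authors' experimental design: it solves the l1-minimization (basis pursuit) problem as a linear program with SciPy's `linprog`, and compares empirical recovery rates for two matrix ensembles, Gaussian and random-sign (Rademacher). The function names, problem sizes, trial count and success tolerance are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def draw_matrix(ensemble, n, N, rng):
    """Draw an n-by-N measurement matrix from the named ensemble."""
    if ensemble == "gaussian":
        return rng.standard_normal((n, N))
    if ensemble == "rademacher":
        return rng.choice([-1.0, 1.0], size=(n, N))
    raise ValueError(f"unknown ensemble: {ensemble}")


def basis_pursuit(A, y):
    """Solve min ||x||_1 subject to A x = y by splitting x = u - v with u, v >= 0."""
    n, N = A.shape
    cost = np.ones(2 * N)
    A_eq = np.hstack([A, -A])
    result = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:N] - result.x[N:]


def success_rate(ensemble, n, N, k, trials=10, seed=0, tol=1e-3):
    """Fraction of trials in which a k-sparse vector is recovered from n measurements."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        A = draw_matrix(ensemble, n, N, rng)
        x0 = np.zeros(N)
        support = rng.choice(N, size=k, replace=False)
        x0[support] = rng.standard_normal(k)
        x_hat = basis_pursuit(A, A @ x0)
        if np.linalg.norm(x_hat - x0) <= tol * np.linalg.norm(x0):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Fixed undersampling ratio n/N = 0.5; vary the sparsity k.
    for ensemble in ("gaussian", "rademacher"):
        for k in (3, 8, 12, 16, 24):
            print(ensemble, k, success_rate(ensemble, n=30, N=60, k=k))
```

At sizes this small the transition is blurred by finite-sample effects, so the sketch can only suggest that both ensembles switch from reliable recovery to failure over a similar range of k. It cannot test universality in the formal sense of the abstract.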

    Advances in the analysis of event-related potential data with factor analytic methods

    Researchers are often interested in comparing brain activity between experimental contexts. Event-related potentials (ERPs) are a common electrophysiological measure of brain activity that is time-locked to an event (e.g., a stimulus presented to the participant). A variety of decomposition methods have been used for ERP data, among them temporal exploratory factor analysis (EFA). Essentially, temporal EFA decomposes the ERP waveform into a set of latent factors, where the factor loadings reflect the time courses of the latent factors and the factor scores represent their amplitudes. An important methodological concern is to ensure that the estimates of the condition effects are unbiased; the term "variance misallocation" has been introduced to refer to the case of biased estimates. The aim of the present thesis was to explore how exploratory factor analytic methods can be made less prone to variance misallocation. These efforts resulted in a series of three publications in which variance misallocation in EFA was described as a consequence of the properties of ERP data, exploratory structural equation modeling (ESEM) was proposed as an extension of EFA that acknowledges the structure of ERP data sets, and regularized estimation was suggested as an alternative to simple structure rotation with desirable properties. The presence of multiple sources of (co-)variance, the factor scoring step, and high temporal overlap of the factors were identified as major causes of variance misallocation in EFA for ERP data. It was shown that ESEM is capable of separating the (co-)variance sources and that it avoids biases due to factor scoring. Further, regularized estimation was shown to be a suitable alternative to factor rotation that is able to recover factor loading patterns in which only a subset of the variables follows a simple structure. Based on these results, regularized structural equation models (regSEMs) and ESEMs with ERP-specific rotation have been proposed as promising extensions of the EFA approach that might be less prone to variance misallocation. Future research should provide a direct comparison of regSEM and ESEM, and conduct simulation studies with more physiologically motivated data generation algorithms.

    Studying the ability of finding single and interaction effects with Random Forest, and its application in psychiatric genetics

    Psychotic disorders such as schizophrenia and bipolar disorder have a strong genetic component. The aetiology of psychoses is known to be complex, including additive effects from multiple susceptibility genes, interactions between genes, environmental risk factors, and gene by environment interactions. With the development of new technologies such as genome-wide association studies and imputation of ungenotyped variants, the amount of genomic data has increased dramatically, making the use of machine learning techniques necessary. Random Forest has been widely used to study the underlying genetic factors of psychiatric disorders, such as epistasis and gene-gene interactions. Several authors have investigated the ability of this algorithm to find single and interaction effects, but have reported contradictory results. Therefore, in order to examine Random Forest's ability to detect single and interaction effects based on different variable importance measures, I conducted a simulation study assessing whether the algorithm was able to detect single and interaction models under different correlation conditions. The results suggest that the optimal variable importance measure to use in real situations under correlation is the unconditional unscaled permutation variable importance measure. Several studies have shown bias in one of the most popular variable importance measures, the Gini importance. Hence, in a second simulation study I examined whether the Gini variable importance is influenced by the variability of predictors, the precision of measuring them, and the variability of the error. Evidence of other biases in this variable importance measure was found. The results from the first simulation study were used to study whether genes related to 29 molecular biomarkers, which have been associated with schizophrenia, influence risk for schizophrenia in a case-control study of 26476 cases and 31804 controls from 39 different European ancestry cohorts. Single effects from the ACAT2 and TNC genes were found to contribute to risk for schizophrenia. ACAT2 is a gene on chromosome 6 that is related to energy metabolism. Transcriptional differences have been shown in schizophrenia brain tissue studies. TNC is expressed in the brain, where it is involved in the migration of neurons and axons. In addition, we also used the simulation results to examine whether interactions between genes associated with abnormal emotion/affect behaviour influence risk for psychosis and cognition in humans, in a case-control study of 2049 cases and 1794 controls. Before correcting for multiple testing, significant interactions between CRHR1 and ESR1, between MAPT and ESR1, among CRHR1, ESR1 and TOM1L2, and among MAPT, ESR1 and TOM1L2 were observed in the abnormal fear/anxiety-related behaviour pathway. There was no evidence for epistasis after Bonferroni correction.
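The contrast between Gini importance and permutation importance discussed in this abstract can be sketched with scikit-learn. This is an assumed toy setup, not the simulation design of the thesis: one binary predictor carries the signal, and two pure-noise predictors differ only in how many distinct values they take. The function names and all settings are hypothetical. Note also that scikit-learn's `permutation_importance` is computed on whatever data it is given (here a held-out half), whereas some Random Forest implementations compute it on out-of-bag samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def simulate(n=1000, seed=0):
    """Columns: 0 = binary signal, 1 = continuous noise, 2 = binary noise."""
    rng = np.random.default_rng(seed)
    signal = rng.integers(0, 2, size=n)
    noise_continuous = rng.normal(size=n)
    noise_binary = rng.integers(0, 2, size=n)
    # The outcome copies the signal, with 20% of the labels flipped at random.
    flip = rng.random(n) < 0.2
    y = np.where(flip, 1 - signal, signal)
    X = np.column_stack([signal, noise_continuous, noise_binary]).astype(float)
    return X, y


def compare_importances(seed=0):
    """Return (Gini importances, mean permutation importances) for the three columns."""
    X, y = simulate(seed=seed)
    half = len(y) // 2
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X[:half], y[:half])
    # Impurity-based (Gini) importance, computed from the training data.
    gini = forest.feature_importances_
    # Unscaled permutation importance: raw mean drop in accuracy on held-out data.
    perm = permutation_importance(
        forest, X[half:], y[half:], n_repeats=20, random_state=seed
    )
    return gini, perm.importances_mean


if __name__ == "__main__":
    gini, perm = compare_importances()
    for name, g, p in zip(["signal", "noise_continuous", "noise_binary"], gini, perm):
        print(f"{name:17s} gini={g:.3f} permutation={p:.3f}")
```

In this setup the continuous noise predictor would be expected to receive a noticeably larger Gini importance than the binary noise predictor even though neither is related to the outcome, while the permutation importance of both stays close to zero.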