The Scientific Method in Practice: Reproducibility in the Computational Sciences
Since the 1660s, the scientific method has included reproducibility as a mainstay of its effort to root out error in scientific discovery. With the explosive growth of digitization in scientific research and communication, it is easier than ever to satisfy this requirement. In computational research, experimental details and methods can be recorded in code and scripts, data are digital, and papers are frequently online; the result is the potential for "really reproducible research." Imagine the ability to routinely inspect code and data and recreate others' results: every step taken to achieve the findings can potentially be transparent. Now imagine anyone with an Internet connection and the capability of running the code being able to do this. This paper investigates the obstacles blocking the sharing of code and data in order to understand the conditions under which computational scientists reveal their full research compendium. A survey of registrants at a top machine learning conference (NIPS) was used to measure the strength of the underlying factors that affect the decision to reveal code, data, and ideas. Sharing of code and data is becoming more common: about a third of respondents post some on their websites, and about 85% self-report having some code or data publicly available on the web. Contrary to theoretical expectations, the decision to share work is grounded in communitarian norms, although when work remains hidden, private incentives dominate the decision. We find that code, data, and ideas are each regarded differently in terms of how they are revealed, and that the guidance provided by scientific norms varies with the pervasiveness of computation in the field. The largest barriers to sharing are the time involved in preparing work for release and the legal intellectual-property framework scientists face. This paper does two things: it provides evidence in the debate about whether scientists' research-revealing behavior is wholly governed by considerations of personal impact or whether the reasoning behind the revealing decision involves larger scientific ideals, and it describes the actual sharing behavior in the machine learning community.
Breakdown point of model selection when the number of variables exceeds the number of observations
Abstract — The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = Xβ + z, where z ∼ N(0, σ²). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p ≫ n. We find that (1) the breakdown point is well-defined for random X-models and low noise, (2) increasing noise shifts the breakdown point to lower levels of sparsity and reduces the model recovery ability of the algorithm in a systematic way, and (3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.
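The breakdown phenomenon can be illustrated numerically. Below is a minimal sketch (with assumed parameters, not the paper's experiment): a Gaussian random design with n = 100 and p = 400, a k-sparse coefficient vector, and scikit-learn's Lasso, judging recovery by whether the k largest coefficients in magnitude match the true support.

```python
# Minimal sketch of the p >> n breakdown point (assumed parameters,
# not the paper's experiment): Gaussian design, k-sparse coefficients,
# support recovery judged by the k largest Lasso coefficients.
import numpy as np
from sklearn.linear_model import Lasso

def recovery_rate(n=100, p=400, k=5, sigma=0.1, trials=10, seed=0):
    """Fraction of trials in which the top-k Lasso coefficients
    coincide exactly with the true k-sparse support."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        support = rng.choice(p, size=k, replace=False)
        beta = np.zeros(p)
        beta[support] = 1.0
        y = X @ beta + sigma * rng.standard_normal(n)
        coef = Lasso(alpha=0.1, max_iter=50_000).fit(X, y).coef_
        top = np.argsort(np.abs(coef))[-k:]
        hits += set(top) == set(support)
    return hits / trials

print(recovery_rate(k=5))   # sparse regime, below the breakdown point
print(recovery_rate(k=60))  # dense regime, past the breakdown point
```

Sweeping k from sparse to dense traces the abrupt drop in recovery rate that the abstract calls the breakdown point; raising `sigma` shifts the drop toward lower k, matching finding (2).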
Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing
We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as the abrupt breakdown of linear model selection, robust data fitting, or compressed sensing reconstructions when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry, these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables.
These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems.
Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are universal across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-n universality across matrix ensembles; finite-sample universality can be rejected.
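The phase-transition and universality claims can be probed in miniature. The sketch below (with assumed problem sizes and ensembles, far smaller than the paper's experiment) solves basis pursuit, min ||x||₁ subject to Ax = b, by linear programming and compares recovery rates for Gaussian and Rademacher (±1) matrices at the same undersampling and sparsity levels.

```python
# Tiny version of the universality experiment (assumed sizes, not the
# paper's design): l1-recovery rates for Gaussian vs Rademacher matrices.
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, b):
    """Basis pursuit min ||x||_1 s.t. Ax = b, via the split x = u - v."""
    n, N = A.shape
    res = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]

def success_rate(ensemble, n=40, N=80, k=6, trials=5, seed=0):
    """Fraction of trials in which basis pursuit recovers a k-sparse x0."""
    rng = np.random.default_rng(seed)
    ok = 0
    for _ in range(trials):
        if ensemble == "gaussian":
            A = rng.standard_normal((n, N))
        else:                                  # Rademacher: entries +/-1
            A = rng.choice([-1.0, 1.0], size=(n, N))
        x0 = np.zeros(N)
        x0[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
        ok += np.allclose(l1_recover(A, A @ x0), x0, atol=1e-4)
    return ok / trials

# Below the threshold both ensembles succeed; well above it both fail,
# at roughly the same sparsity level -- the universality being tested.
for ens in ("gaussian", "rademacher"):
    print(ens, success_rate(ens, k=6), success_rate(ens, k=30))
```

Here n/N = 0.5 fixes the undersampling ratio, and sweeping k/n traces the transition; observing the same transition location for both ensembles is a toy version of the universality hypothesis.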
Advances in the analysis of event-related potential data with factor analytic methods
Researchers are often interested in comparing brain activity between experimental contexts. Event-related potentials (ERPs) are a common electrophysiological measure of brain activity that is time-locked to an event (e.g., a stimulus presented to the participant). A variety of decomposition methods has been used for ERP data, among them temporal exploratory factor analysis (EFA). Essentially, temporal EFA decomposes the ERP waveform into a set of latent factors, where the factor loadings reflect the time courses of the latent factors and the amplitudes are represented by the factor scores.
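As a toy illustration of that decomposition (a sketch with simulated data, not the thesis' analysis pipeline), one can generate trials as noisy mixtures of two Gaussian-shaped time courses and recover them with a varimax-rotated factor analysis, where the loadings estimate the time courses and the scores the single-trial amplitudes:

```python
# Sketch of temporal EFA on simulated ERP-like data (illustrative only):
# rows = trials, columns = time points; factor loadings estimate the
# components' time courses, factor scores their single-trial amplitudes.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)                   # 100 time points
course1 = np.exp(-((t - 0.3) / 0.05) ** 2)       # "early" component
course2 = np.exp(-((t - 0.6) / 0.08) ** 2)       # "late" component

n_trials = 200
amp = rng.standard_normal((n_trials, 2)) + 3.0   # single-trial amplitudes
erp = amp @ np.vstack([course1, course2]) \
      + 0.1 * rng.standard_normal((n_trials, 100))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(erp)
loadings = fa.components_      # shape (2, 100): estimated time courses
scores = fa.transform(erp)     # shape (200, 2): estimated amplitudes
```

The two loading curves peak near the latencies of the generating components, which is the sense in which loadings "reflect the time courses"; the variance-misallocation problem discussed below concerns the conditions under which this correspondence breaks down.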
An important methodological concern is to ensure that the estimates of the condition effects are unbiased; the term variance misallocation has been introduced for the case of biased estimates. The aim of the present thesis was to explore how exploratory factor analytic methods can be made less prone to variance misallocation. These efforts resulted in a series of three publications in which variance misallocation in EFA was described as a consequence of the properties of ERP data, exploratory structural equation modeling (ESEM) was proposed as an extension of EFA that acknowledges the structure of ERP data sets, and regularized estimation was suggested as an alternative to simple structure rotation with desirable properties.
The presence of multiple sources of (co-)variance, the factor scoring step, and high temporal overlap of the factors were identified as major causes of variance misallocation in EFA for ERP data. It was shown that ESEM is capable of separating the (co-)variance sources and that it avoids biases due to factor scoring. Further, regularized estimation was shown to be a suitable alternative to factor rotation that is able to recover factor loading patterns in which only a subset of the variables follows a simple structure. Based on these results, regSEMs and ESEMs with ERP-specific rotation have been proposed as promising extensions of the EFA approach that might be less prone to variance misallocation. Future research should provide a direct comparison of regSEM and ESEM, and conduct simulation studies with more physiologically motivated data generation algorithms.
Studying the ability of finding single and interaction effects with Random Forest, and its application in psychiatric genetics
Psychotic disorders such as schizophrenia and bipolar disorder have a strong genetic component. The aetiology of psychoses is known to be complex, including additive effects from multiple susceptibility genes, interactions between genes, environmental risk factors, and gene-by-environment interactions. With the development of new technologies such as genome-wide association studies and the imputation of ungenotyped variants, the amount of genomic data has increased dramatically, necessitating the use of machine learning techniques. Random Forest has been widely used to study the underlying genetic factors of psychiatric disorders, including epistasis and gene-gene interactions. Several authors have investigated the ability of this algorithm to find single and interaction effects, but have reported contradictory results.
Therefore, in order to examine Random Forest's ability to detect single and interaction effects based on different variable importance measures, I conducted a simulation study assessing whether the algorithm was able to detect single and interaction models under different correlation conditions. The results suggest that the optimal variable importance measure to use in real situations with correlated predictors is the unconditional unscaled permutation variable importance measure. Several studies have shown bias in one of the most popular variable importance measures, the Gini importance. Hence, in a second simulation study I examined whether the Gini variable importance is influenced by the variability of the predictors, the precision with which they are measured, and the variability of the error. Evidence of further biases in this variable importance measure was found.
The results from the first simulation study were then used to examine whether genes related to 29 molecular biomarkers that have been associated with schizophrenia influence risk for schizophrenia in a case-control study of 26476 cases and 31804 controls from 39 European-ancestry cohorts. Single effects from the ACAT2 and TNC genes were found to contribute to risk for schizophrenia. ACAT2 is a gene on chromosome 6 that is related to energy metabolism; transcriptional differences have been shown in schizophrenia brain tissue studies. TNC is expressed in the brain, where it is involved in the migration of neurons and axons. In addition, we also used the simulation results to examine whether interactions between genes associated with abnormal emotion/affect behaviour influence risk for psychosis and cognition in humans, in a case-control study of 2049 cases and 1794 controls. Before correcting for multiple testing, significant interactions between CRHR1 and ESR1, between MAPT and ESR1, among CRHR1, ESR1, and TOM1L2, and among MAPT, ESR1, and TOM1L2 were observed in the abnormal fear/anxiety-related behaviour pathway. There was no evidence for epistasis after Bonferroni correction.
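The Gini-importance bias discussed above can be reproduced in a few lines. The sketch below (an assumed setup in the spirit of, but not identical to, the thesis' simulations) fits a random forest to one informative binary predictor and two pure-noise predictors that differ only in cardinality; the impurity-based (Gini) importance inflates the many-valued noise predictor relative to the binary one, while permutation importance does not.

```python
# Sketch of the Gini-importance cardinality bias (assumed simulation,
# not the thesis' design): one informative binary predictor and two
# noise predictors, one binary and one continuous.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
x_signal = rng.integers(0, 2, n)        # informative, binary
x_noise_bin = rng.integers(0, 2, n)     # noise, 2 unique values
x_noise_cont = rng.standard_normal(n)   # noise, ~n unique values
X = np.column_stack([x_signal, x_noise_bin, x_noise_cont]).astype(float)
# y depends only on x_signal, plus label noise
y = np.where(x_signal == 1, 1, (rng.random(n) < 0.3).astype(int))

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gini = rf.feature_importances_                      # impurity-based
perm = permutation_importance(rf, X, y, n_repeats=10,
                              random_state=0).importances_mean

print("Gini importance:", gini)  # continuous noise outranks binary noise
print("Permutation   :", perm)   # signal dominates, noise near zero
```

The continuous noise feature offers many more candidate split points than the binary one, so it accumulates spurious impurity reductions; this is the kind of bias that motivates the permutation-based importance measure recommended in the first simulation study.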