The Scientific Method in Practice: Reproducibility in the Computational Sciences
Since the 1660s, the scientific method has included reproducibility as a mainstay of its effort to root out error in scientific discovery. With the explosive growth of digitization in scientific research and communication, it is easier than ever to satisfy this requirement. In computational research, experimental details and methods can be recorded in code and scripts, data are digital, and papers are frequently online; the result is the potential for "really reproducible research." Imagine the ability to routinely inspect code and data and recreate others' results: every step taken to achieve the findings can potentially be transparent. Now imagine anyone with an Internet connection and the capability of running the code being able to do this. This paper investigates the obstacles blocking the sharing of code and data in order to understand the conditions under which computational scientists reveal their full research compendium. A survey of registrants at a top machine learning conference (NIPS) was used to measure the strength of the underlying factors that affect the decision to reveal code, data, and ideas. Sharing of code and data is becoming more common: about a third of respondents post some on their websites, and about 85% self-report having some code or data publicly available on the web. Contrary to theoretical expectations, the decision to share work is grounded in communitarian norms, although when work remains hidden, private incentives dominate the decision. We find that code, data, and ideas are each regarded differently in terms of how they are revealed, and that the guidance provided by scientific norms varies with the pervasiveness of computation in the field. The largest barriers to sharing are the time involved in preparing work for release and the legal intellectual-property framework scientists face. This paper does two things: it provides evidence in the debate about whether scientists' research-revealing behavior is wholly governed by considerations of personal impact or whether the reasoning behind the revealing decision involves larger scientific ideals, and it describes the actual sharing behavior in the machine learning community.
Breakdown point of model selection when the number of variables exceeds the number of observations
Abstract — The classical multivariate linear regression problem assumes p variables X1, X2, ..., Xp and a response vector y, each with n observations, and a linear relationship between the two: y = Xβ + z, where z ∼ N(0, σ²). We point out that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. We apply this notion to some standard model selection algorithms (Forward Stepwise, LASSO, LARS) in the case where p ≫ n. We find that (1) the breakdown point is well-defined for random X-models and low noise, (2) increasing noise shifts the breakdown point to lower levels of sparsity and reduces the model recovery ability of the algorithm in a systematic way, and (3) below breakdown, the size of coefficient errors follows the theoretical error distribution for the classical linear model.
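The breakdown phenomenon can be illustrated numerically. Below is a minimal sketch (with assumed parameters, not the paper's experiment): a Gaussian random design with n = 100 and p = 400, a k-sparse coefficient vector, and scikit-learn's Lasso, judging recovery by whether the k largest coefficients in magnitude match the true support.

```python
# Minimal sketch of the p >> n breakdown point (assumed parameters,
# not the paper's experiment): Gaussian design, k-sparse coefficients,
# support recovery judged by the k largest Lasso coefficients.
import numpy as np
from sklearn.linear_model import Lasso

def recovery_rate(n=100, p=400, k=5, sigma=0.1, trials=10, seed=0):
    """Fraction of trials in which the top-k Lasso coefficients
    coincide exactly with the true k-sparse support."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        support = rng.choice(p, size=k, replace=False)
        beta = np.zeros(p)
        beta[support] = 1.0
        y = X @ beta + sigma * rng.standard_normal(n)
        coef = Lasso(alpha=0.1, max_iter=50_000).fit(X, y).coef_
        top = np.argsort(np.abs(coef))[-k:]
        hits += set(top) == set(support)
    return hits / trials

print(recovery_rate(k=5))   # sparse regime, below the breakdown point
print(recovery_rate(k=60))  # dense regime, past the breakdown point
```

Sweeping k from sparse to dense traces the abrupt drop in recovery rate that the abstract calls the breakdown point; raising `sigma` shifts the drop toward lower k, matching finding (2).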
Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing
We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as the abrupt breakdown of linear model selection, robust data fitting, or compressed sensing reconstructions when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry, these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables.
These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems.
Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are universal across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-n universality across matrix ensembles; finite-sample universality can be rejected.
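The phase-transition and universality claims can be probed in miniature. The sketch below (with assumed problem sizes and ensembles, far smaller than the paper's experiment) solves basis pursuit, min ||x||₁ subject to Ax = b, by linear programming and compares recovery rates for Gaussian and Rademacher (±1) matrices at the same undersampling and sparsity levels.

```python
# Tiny version of the universality experiment (assumed sizes, not the
# paper's design): l1-recovery rates for Gaussian vs Rademacher matrices.
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, b):
    """Basis pursuit min ||x||_1 s.t. Ax = b, via the split x = u - v."""
    n, N = A.shape
    res = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]

def success_rate(ensemble, n=40, N=80, k=6, trials=5, seed=0):
    """Fraction of trials in which basis pursuit recovers a k-sparse x0."""
    rng = np.random.default_rng(seed)
    ok = 0
    for _ in range(trials):
        if ensemble == "gaussian":
            A = rng.standard_normal((n, N))
        else:                                  # Rademacher: entries +/-1
            A = rng.choice([-1.0, 1.0], size=(n, N))
        x0 = np.zeros(N)
        x0[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
        ok += np.allclose(l1_recover(A, A @ x0), x0, atol=1e-4)
    return ok / trials

# Below the threshold both ensembles succeed; well above it both fail,
# at roughly the same sparsity level -- the universality being tested.
for ens in ("gaussian", "rademacher"):
    print(ens, success_rate(ens, k=6), success_rate(ens, k=30))
```

Here n/N = 0.5 fixes the undersampling ratio, and sweeping k/n traces the transition; observing the same transition location for both ensembles is a toy version of the universality hypothesis.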
Advances in the analysis of event-related potential data with factor analytic methods
Researchers are often interested in comparing brain activity between experimental contexts. Event-related potentials (ERPs) are a common electrophysiological measure of brain activity that is time-locked to an event (e.g., a stimulus presented to the participant). A variety of decomposition methods has been used for ERP data, among them temporal exploratory factor analysis (EFA). Essentially, temporal EFA decomposes the ERP waveform into a set of latent factors, where the factor loadings reflect the time courses of the latent factors and the amplitudes are represented by the factor scores.
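As a toy illustration of that decomposition (a sketch with simulated data, not the thesis' analysis pipeline), one can generate trials as noisy mixtures of two Gaussian-shaped time courses and recover them with a varimax-rotated factor analysis, where the loadings estimate the time courses and the scores the single-trial amplitudes:

```python
# Sketch of temporal EFA on simulated ERP-like data (illustrative only):
# rows = trials, columns = time points; factor loadings estimate the
# components' time courses, factor scores their single-trial amplitudes.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)                   # 100 time points
course1 = np.exp(-((t - 0.3) / 0.05) ** 2)       # "early" component
course2 = np.exp(-((t - 0.6) / 0.08) ** 2)       # "late" component

n_trials = 200
amp = rng.standard_normal((n_trials, 2)) + 3.0   # single-trial amplitudes
erp = amp @ np.vstack([course1, course2]) \
      + 0.1 * rng.standard_normal((n_trials, 100))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(erp)
loadings = fa.components_      # shape (2, 100): estimated time courses
scores = fa.transform(erp)     # shape (200, 2): estimated amplitudes
```

The two loading curves peak near the latencies of the generating components, which is the sense in which loadings "reflect the time courses"; the variance-misallocation problem discussed below concerns the conditions under which this correspondence breaks down.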
An important methodological concern is to ensure that the estimates of the condition effects are unbiased; the term variance misallocation has been introduced for the case of biased estimates. The aim of the present thesis was to explore how exploratory factor analytic methods can be made less prone to variance misallocation. These efforts resulted in a series of three publications in which variance misallocation in EFA was described as a consequence of the properties of ERP data, exploratory structural equation modeling (ESEM) was proposed as an extension of EFA that acknowledges the structure of ERP data sets, and regularized estimation was suggested as an alternative to simple structure rotation with desirable properties.
The presence of multiple sources of (co-)variance, the factor scoring step, and high temporal overlap of the factors were identified as major causes of variance misallocation in EFA for ERP data. It was shown that ESEM is capable of separating the (co-)variance sources and that it avoids biases due to factor scoring. Further, regularized estimation was shown to be a suitable alternative to factor rotation that is able to recover factor loading patterns in which only a subset of the variables follows a simple structure. Based on these results, regSEMs and ESEMs with ERP-specific rotation have been proposed as promising extensions of the EFA approach that might be less prone to variance misallocation. Future research should provide a direct comparison of regSEM and ESEM, and conduct simulation studies with more physiologically motivated data generation algorithms.
Studying the ability of finding single and interaction effects with Random Forest, and its application in psychiatric genetics
Psychotic disorders such as schizophrenia and bipolar disorder have a strong genetic component. The aetiology of psychoses is known to be complex, including additive effects from multiple susceptibility genes, interactions between genes, environmental risk factors, and gene-by-environment interactions. With the development of new technologies such as genome-wide association studies and the imputation of ungenotyped variants, the amount of genomic data has increased dramatically, necessitating the use of machine learning techniques. Random Forest has been widely used to study the underlying genetic factors of psychiatric disorders, including epistasis and gene-gene interactions. Several authors have investigated the ability of this algorithm to find single and interaction effects, but have reported contradictory results.
Therefore, in order to examine Random Forest's ability to detect single and interaction effects based on different variable importance measures, I conducted a simulation study assessing whether the algorithm was able to detect single and interaction models under different correlation conditions. The results suggest that the optimal variable importance measure to use in real situations with correlated predictors is the unconditional unscaled permutation variable importance measure. Several studies have shown bias in one of the most popular variable importance measures, the Gini importance. Hence, in a second simulation study I examined whether the Gini variable importance is influenced by the variability of the predictors, the precision with which they are measured, and the variability of the error. Evidence of further biases in this variable importance measure was found.
The results from the first simulation study were then used to examine whether genes related to 29 molecular biomarkers that have been associated with schizophrenia influence risk for schizophrenia in a case-control study of 26476 cases and 31804 controls from 39 European-ancestry cohorts. Single effects from the ACAT2 and TNC genes were found to contribute to risk for schizophrenia. ACAT2 is a gene on chromosome 6 that is related to energy metabolism; transcriptional differences have been shown in schizophrenia brain tissue studies. TNC is expressed in the brain, where it is involved in the migration of neurons and axons. In addition, we also used the simulation results to examine whether interactions between genes associated with abnormal emotion/affect behaviour influence risk for psychosis and cognition in humans, in a case-control study of 2049 cases and 1794 controls. Before correcting for multiple testing, significant interactions between CRHR1 and ESR1, between MAPT and ESR1, among CRHR1, ESR1, and TOM1L2, and among MAPT, ESR1, and TOM1L2 were observed in the abnormal fear/anxiety-related behaviour pathway. There was no evidence for epistasis after Bonferroni correction.
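The Gini-importance bias discussed above can be reproduced in a few lines. The sketch below (an assumed setup in the spirit of, but not identical to, the thesis' simulations) fits a random forest to one informative binary predictor and two pure-noise predictors that differ only in cardinality; the impurity-based (Gini) importance inflates the many-valued noise predictor relative to the binary one, while permutation importance does not.

```python
# Sketch of the Gini-importance cardinality bias (assumed simulation,
# not the thesis' design): one informative binary predictor and two
# noise predictors, one binary and one continuous.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
x_signal = rng.integers(0, 2, n)        # informative, binary
x_noise_bin = rng.integers(0, 2, n)     # noise, 2 unique values
x_noise_cont = rng.standard_normal(n)   # noise, ~n unique values
X = np.column_stack([x_signal, x_noise_bin, x_noise_cont]).astype(float)
# y depends only on x_signal, plus label noise
y = np.where(x_signal == 1, 1, (rng.random(n) < 0.3).astype(int))

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gini = rf.feature_importances_                      # impurity-based
perm = permutation_importance(rf, X, y, n_repeats=10,
                              random_state=0).importances_mean

print("Gini importance:", gini)  # continuous noise outranks binary noise
print("Permutation   :", perm)   # signal dominates, noise near zero
```

The continuous noise feature offers many more candidate split points than the binary one, so it accumulates spurious impurity reductions; this is the kind of bias that motivates the permutation-based importance measure recommended in the first simulation study.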