11,216 research outputs found

Consistent Estimation of Low-Dimensional Latent Structure in High-Dimensional Data

Authors: Chen Xiongzhi; Storey John D.
Publication date: 12/10/2015

We consider the problem of extracting a low-dimensional, linear latent variable structure from high-dimensional random variables. Specifically, we show that under mild conditions and when this structure manifests itself as a linear space that spans the conditional means, it is possible to consistently recover the structure using only information up to the second moments of these random variables. This finding, specialized to one-parameter exponential families whose variance function is quadratic in their means, allows for the derivation of an explicit estimator of such latent structure. This approach serves as a latent variable model estimator and as a tool for dimension reduction for a high-dimensional matrix of data composed of many related variables. Our theoretical results are verified by simulation studies and an application to genomic data.

arXiv.org e-Print Archive

Princeton University Open Access Repository

Statistical significance of variables driving systematic variation

Authors: Chung Neo Christopher; Storey John D.
Publication venue: Oxford University Press (OUP)
Publication date: 27/08/2013

There are a number of well-established methods such as principal components analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of principal components (PCs). The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be utilized to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify statistically significant genes that are cell-cycle regulated. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly-driven phenotype. We find a greater enrichment for inflammatory-related gene sets compared to using a clinically defined phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.

Comment: 35 pages, 1 table, 6 main figures, 7 supplementary figures

arXiv.org e-Print Archive

Princeton University Open Access Repository

Crossref

PubMed Central

Effects of electrostatic correlations on electrokinetic phenomena

Authors: Bazant Martin Z.; Storey Brian D.
Publication venue: American Physical Society (APS)
Publication date: 31/08/2012

Classical theory of the electric double layer is based on the fundamental assumption of a dilute solution of point ions. There are a number of situations such as high applied voltages, high concentration of electrolytes, systems with multivalent ions, or solvent-free ionic liquids where the classical theory is often applied but the fundamental assumptions cannot be justified. Perhaps the most basic assumption underlying continuum models in electrokinetics is the mean-field approximation, that the electric field acting on each discrete ion is self-consistently determined by the local mean charge density. This paper considers situations where the mean-field approximation breaks down and electrostatic correlations become important. A fourth-order modified Poisson equation is developed that accounts for electrostatic correlations and captures the essential features in a simple continuum framework. The theory is derived variationally as a gradient approximation for non-local electrostatics, in which the dielectric permittivity becomes a differential operator. The only new parameter is a characteristic length scale for correlated ion pairs. The model is able to capture subtle aspects of more detailed simulations based on Monte Carlo, molecular dynamics, or density functional theory and allows for the straightforward calculation of electrokinetic flows in correlated liquids, for the first time. Departures from classical Helmholtz-Smoluchowski theory are controlled by the dimensionless ratio of the correlation length to the Debye screening length. Charge-density oscillations tend to reduce electro-osmotic flow and streaming current, and over-screening of the surface charge can lead to flow reversal. These effects also help to explain the apparent charge-induced thickening of double layers in induced-charge electrokinetic phenomena.

arXiv.org e-Print Archive

DSpace@MIT

Crossref

Multiple locus linkage analysis of genomewide expression in yeast

Authors: Akey Joshua M; Kruglyak Leonid; Storey John D
Publication venue: eScholarship, University of California
Publication date: 26/07/2005

With the ability to measure thousands of related phenotypes from a single biological sample, it is now feasible to genetically dissect systems-level biological phenomena. The genetics of transcriptional regulation and protein abundance are likely to be complex, meaning that genetic variation at multiple loci will influence these phenotypes. Several recent studies have investigated the role of genetic variation in transcription by applying traditional linkage analysis methods to genomewide expression data, where each gene expression level was treated as a quantitative trait and analyzed separately from one another. Here, we develop a new, computationally efficient method for simultaneously mapping multiple gene expression quantitative trait loci that directly uses all of the available data. Information shared across gene expression traits is captured in a way that makes minimal assumptions about the statistical properties of the data. The method produces easy-to-interpret measures of statistical significance for both individual loci and the overall joint significance of multiple loci selected for a given expression trait. We apply the new method to a cross between two strains of the budding yeast Saccharomyces cerevisiae, and estimate that at least 37% of all gene expression traits show two simultaneous linkages, where we have allowed for epistatic interactions. Pairs of jointly linking quantitative trait loci are identified with high confidence for 170 gene expression traits, where it is expected that both loci are true positives for at least 153 traits. In addition, we are able to show that epistatic interactions contribute to gene expression variation for at least 14% of all traits. We compare the proposed approach to an exhaustive two-dimensional scan over all pairs of loci. Surprisingly, we demonstrate that an exhaustive two-dimensional scan is less powerful than the sequential search used here. In addition, we show that a two-dimensional scan does not truly allow one to test for simultaneous linkage, and the statistical significance measured from this existing method cannot be interpreted among many traits.

CiteSeerX

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

The Effect of Business Regulations on Nascent and Young Business Entrepreneurship

Authors: Stel A.J. van; Storey D.; Thurik A.R.

We examine the relationship, across 39 countries, between regulation and entrepreneurship using a new two-equation model. We find the minimum capital requirement required to start a business lowers entrepreneurship rates across countries, as do labour market regulations. However, the administrative considerations of starting a business, such as the time, the cost, or the number of procedures required, are unrelated to the formation rate of either nascent or young businesses. Given the explicit link made by Djankov et al. (2002) between the speed and ease with which businesses may be established in a country and its economic performance, and the enthusiasm with which this link has been grasped by European Union policy makers, our findings imply this link needs reconsidering.

Keywords: Global Entrepreneurship Monitor; Nascent Entrepreneurship; Business Regulations; World Bank Doing Business; Young Businesses

Research Papers in Economics

Craton-scale variations in crustal evolution: new insights from Scottish Highland detrital zircon

Authors: Hawkesworth Chris J.; Lancaster Penelope J.; Storey Craig D.
Publication date: 01/01/2010

Portsmouth University Research Portal (Pure)

Tau-aggregation inhibitor therapy for Alzheimer's disease

Authors: Harrington Charles R; Storey John M D; Wischik Claude M
Publication venue: Elsevier BV
Publication date: 19/12/2013

Aberdeen University Research

Elsevier - Publisher Connector

Crossref

The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing

Author: Storey John D.
Publication venue: Collection of Biostatistics Research Archive
Publication date: 06/09/2005

Significance testing is one of the main objectives of statistics. The Neyman-Pearson lemma provides a simple rule for optimally testing a single hypothesis when the null and alternative distributions are known. This result has played a major role in the development of significance testing strategies that are used in practice. Most of the work extending single testing strategies to multiple tests has focused on formulating and estimating new types of significance measures, such as the false discovery rate. These methods tend to be based on p-values that are calculated from each test individually, ignoring information from the other tests. As shrinkage estimation borrows strength across point estimates to improve their overall performance, I show here that borrowing strength across multiple significance tests can improve their performance as well. The optimal discovery procedure (ODP) is introduced, which shows how to maximize the number of expected true positives for each fixed number of expected false positives. The optimality achieved by this procedure is shown to be closely related to optimality in terms of the false discovery rate. The ODP motivates a new approach to testing multiple hypotheses, especially when the tests are related. As a simple example, a new simultaneous procedure for testing several Normal means is defined; this is surprisingly demonstrated to outperform the optimal single test procedure, showing that an optimal method for single tests may no longer be optimal in the multiple test setting. Connections to other concepts in statistics are discussed, including Stein's paradox, shrinkage estimation, and Bayesian classification theory.

Collection Of Biostatistics Research Archive