
    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features drive paradigm changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in high-confidence sets and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; violations of these assumptions can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
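
    A quick simulation makes the spurious-correlation point concrete: even when every predictor is independent of the response, the maximum absolute sample correlation over many predictors grows with the dimensionality purely by chance. The sketch below is a generic illustration of this phenomenon; the sample sizes and variable names are our own assumptions, not taken from the article:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 60                                # small sample size
        for p in (100, 1_000, 10_000):        # growing dimensionality
            X = rng.standard_normal((n, p))   # predictors, independent of y
            y = rng.standard_normal(n)        # response: pure noise
            Xc = (X - X.mean(0)) / X.std(0)   # column-standardize
            yc = (y - y.mean()) / y.std()
            corr = Xc.T @ yc / n              # sample correlation of each column with y
            print(p, np.abs(corr).max())      # the maximum grows with p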

    Generic continuous spectrum for multi-dimensional quasi-periodic Schrödinger operators with rough potentials

    We study the multi-dimensional operator $(H_x u)_n = \sum_{|m-n|=1} u_m + f(T^n(x))\,u_n$, where $T$ is the shift of the torus $\mathbb{T}^d$. When $d = 2$, we show that the spectrum of $H_x$ is almost surely purely continuous for a.e. $\alpha$ and generic continuous potentials. When $d \geq 3$, the same result holds for frequencies under an explicit arithmetic criterion. We also show that general multi-dimensional operators with measurable potentials have no eigenvalues for generic $\alpha$.
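
    For concreteness, the operator can be written in display form. The identification of $T^n$ with a multi-dimensional rotation is the standard quasi-periodic convention and is our reading of the abstract, not something it states explicitly:

        \[
          (H_x u)_n \;=\; \sum_{|m-n|=1} u_m \;+\; f\bigl(x + n\alpha\bigr)\,u_n,
          \qquad u \in \ell^2(\mathbb{Z}^d),\quad x \in \mathbb{T}^d,\ n \in \mathbb{Z}^d,
        \]
        where $n\alpha = (n_1\alpha_1, \dots, n_d\alpha_d)$ and $\alpha \in \mathbb{T}^d$ is the frequency vector.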

    Robust Inference of Risks of Large Portfolios

    We propose a bootstrap-based robust high-confidence level upper bound (Robust H-CLUB) for assessing the risks of large portfolios. The proposed approach exploits rank-based and quantile-based estimators and can be viewed as a robust extension of the H-CLUB method (Fan et al., 2015). Such an extension allows us to handle possibly misspecified models and heavy-tailed data. Under mixing conditions, we analyze the proposed approach and demonstrate its advantage over the H-CLUB. We further provide thorough numerical results to back up the developed theory. We also apply the proposed method to analyze a stock market dataset.
    Comment: 45 pages, 2 figures
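
    A minimal sketch of the two ingredients named above, under our own simplifications: Kendall's tau is mapped to a correlation estimate via sin(πτ/2) (valid for elliptical distributions), marginal scales come from quantiles, and an i.i.d. bootstrap over the resulting risk estimates gives a high-confidence upper bound. This is not the paper's Robust H-CLUB procedure (which, among other things, is analyzed under mixing conditions, where a block bootstrap would be more appropriate than the i.i.d. resampling shown here); function names and parameters are hypothetical:

        import numpy as np
        from scipy.stats import kendalltau

        def rank_based_cov(R):
            """Covariance via sin-transformed Kendall's tau and quantile-based scales."""
            n, p = R.shape
            corr = np.eye(p)
            for i in range(p):
                for j in range(i + 1, p):
                    tau, _ = kendalltau(R[:, i], R[:, j])
                    corr[i, j] = corr[j, i] = np.sin(np.pi * tau / 2)
            # interquartile range, rescaled to be consistent for a normal scale
            scale = (np.quantile(R, 0.75, axis=0) - np.quantile(R, 0.25, axis=0)) / 1.349
            return corr * np.outer(scale, scale)

        def risk_upper_bound(R, w, B=500, alpha=0.05, seed=0):
            """Bootstrap (1 - alpha) upper confidence bound for the risk w' Sigma w."""
            rng = np.random.default_rng(seed)
            n = R.shape[0]
            stats = [w @ rank_based_cov(R[rng.integers(0, n, n)]) @ w
                     for _ in range(B)]
            return np.quantile(stats, 1 - alpha)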

    Estimating False Discovery Proportion Under Arbitrary Covariance Dependence

    Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to determine whether any SNPs are associated with some traits, and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a novel method based on principal factor approximation, which successfully subtracts the common dependence and significantly weakens the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for the false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and provide a consistent estimate of the realized FDP. This result has important applications in controlling the FDR and FDP. Our estimate of the realized FDP compares favorably with Efron's (2007) approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure, which is more powerful than the fixed-threshold procedure.
    Comment: 51 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1012.439
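
    The principal factor approximation lends itself to a compact sketch: eigendecompose the correlation matrix of the test statistics, treat the leading eigenvectors as common factors, estimate the realized factors from the observed z-scores, and plug the factor-adjusted quantities into an approximate FDP formula at a two-sided threshold. The toy version below uses a plain least-squares factor estimate and our own variable names; the paper's estimator is more careful (for instance, about how the factor realizations are estimated), so treat this as an illustration of the idea only:

        import numpy as np
        from scipy.stats import norm

        def fdp_principal_factor(z, Sigma, t, k=2):
            """Toy estimate of realized FDP at two-sided p-value threshold t.

            z     : vector of test statistics (z-scores), length p
            Sigma : their correlation matrix, p x p
            k     : number of principal factors to subtract
            """
            vals, vecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
            B = vecs[:, -k:] * np.sqrt(vals[-k:])         # p x k factor loadings
            W, *_ = np.linalg.lstsq(B, z, rcond=None)     # least-squares factor estimate
            eta = B @ W                                   # common-factor component of z
            a = 1.0 / np.sqrt(np.clip(1.0 - (B ** 2).sum(1), 1e-12, None))
            z_half = norm.ppf(t / 2)                      # negative normal quantile
            fd = (norm.cdf(a * (z_half + eta)) + norm.cdf(a * (z_half - eta))).sum()
            R = max(int((np.abs(z) >= -z_half).sum()), 1) # total rejections at threshold t
            return fd / R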