Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Fast Inference for Quantile Regression with Tens of Millions of Observations
Big data analytics has opened new avenues in economic research, but the
challenge of analyzing datasets with tens of millions of observations is
substantial. Conventional econometric methods based on extremum estimators
require large amounts of computing resources and memory, which are often not
readily available. In this paper, we focus on linear quantile regression
applied to "ultra-large" datasets, such as U.S. decennial censuses. A fast
inference framework is presented, utilizing stochastic sub-gradient descent
(S-subGD) updates. The inference procedure handles cross-sectional data
sequentially: (i) updating the parameter estimate with each incoming "new
observation", (ii) aggregating it as a Polyak-Ruppert average, and (iii)
computing a pivotal statistic for inference using only a solution path. The
methodology draws from time series regression to create an asymptotically
pivotal statistic through random scaling. Our proposed test statistic is
calculated in a fully online fashion and critical values are calculated without
resampling. We conduct extensive numerical studies to showcase the
computational merits of our proposed inference. For inference problems that are
large in both the sample size and the number of regressors, our method generates
new insights, surpassing current
inference methods in computation. Our method specifically reveals trends in the
gender gap in the U.S. college wage premium using millions of observations,
while controlling for covariates to mitigate confounding effects.
Comment: 45 pages, 6 figures
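Steps (i) and (ii) of the sequential procedure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `ssubgd_quantile`, the parameters `gamma0` and `a`, the step-size schedule `gamma0 * t**(-a)`, and the zero starting value are all assumptions made for illustration. Step (iii), the random-scaling pivotal statistic, is omitted.

```python
import numpy as np

def ssubgd_quantile(X, y, tau=0.5, gamma0=1.0, a=0.501):
    """One sequential pass over the data for linear quantile regression.

    Hypothetical sketch: (i) a stochastic subgradient update per
    observation, (ii) a running Polyak-Ruppert average of the iterates.
    The step size gamma0 * t**(-a) is an assumed schedule.
    """
    n, d = X.shape
    beta = np.zeros(d)      # current iterate (assumed start at zero)
    beta_bar = np.zeros(d)  # Polyak-Ruppert average of the iterates
    for t in range(1, n + 1):
        x_t, y_t = X[t - 1], y[t - 1]
        # Subgradient of the check loss rho_tau(y - x'b) with respect to b:
        # (1{y < x'b} - tau) * x
        indicator = 1.0 if y_t < x_t @ beta else 0.0
        grad = (indicator - tau) * x_t
        beta = beta - gamma0 * t ** (-a) * grad
        # Online update of the average; no past iterates are stored.
        beta_bar += (beta - beta_bar) / t
    return beta_bar
```

Each observation is used once and only the current iterate and its running average are kept in memory, which is what makes a single pass over a very large dataset feasible.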
The role of learning on industrial simulation design and analysis
The capability of modeling real-world system operations has turned simulation into an indispensable problem-solving methodology for business system design and analysis. Today, simulation supports decisions ranging
from sourcing to operations to finance, starting at the strategic level and proceeding towards tactical and
operational levels of decision-making. In such a dynamic setting, the practice of simulation goes beyond
being a static problem-solving exercise and requires integration with learning. This article discusses the role
of learning in simulation design and analysis motivated by the needs of industrial problems and describes
how selected tools of statistical learning can be utilized for this purpose.