2,745 research outputs found
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promises for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottlenecks, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require a new computational and statistical paradigm. This
article gives an overview of the salient features of Big Data and how these
features drive a paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
the exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. When violated, they can lead to wrong
statistical inferences and consequently wrong scientific conclusions.
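The spurious correlation named in this abstract can be illustrated with a small simulation. The sketch below is not taken from the article: the function names, the sample size of 50, and the dimensions 10, 100, and 1000 are assumptions chosen for illustration. It draws a response and a set of predictors that are all independent noise, then reports the largest absolute sample correlation found among the first p predictors.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, dims, seed=0):
    """Largest |correlation| between a noise response and the first p of a
    nested family of independent noise predictors, for each p in dims.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    abs_corr = []
    for _ in range(max(dims)):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of y
        abs_corr.append(abs(pearson(x, y)))
    # The predictor sets are nested, so the maximum cannot shrink as p grows.
    return {p: max(abs_corr[:p]) for p in dims}


result = max_spurious_correlation(n=50, dims=[10, 100, 1000])
for p, r in result.items():
    print(f"p = {p:5d}  max |corr| = {r:.3f}")
```

Because no predictor is truly related to the response, every nonzero correlation printed here is spurious. The maximum grows with p simply because more candidates are searched.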
Massively-Parallel Break Detection for Satellite Data
The field of remote sensing is nowadays faced with huge amounts of data.
While this offers a variety of exciting research opportunities, it also yields
significant challenges regarding both computation time and space requirements.
In practice, the sheer data volumes render existing approaches too slow for
processing and analyzing all the available data. This work aims at accelerating
BFAST, one of the state-of-the-art methods for break detection in satellite
image time series. In particular, we propose a massively-parallel
implementation for BFAST that can effectively make use of modern parallel
compute devices such as GPUs. Our experimental evaluation shows that the
proposed GPU implementation is up to four orders of magnitude faster than the
existing publicly available implementation and up to ten times faster than a
corresponding multi-threaded CPU execution. The dramatic decrease in running
time renders the analysis of significantly larger datasets possible in seconds
or minutes instead of hours or days. We demonstrate the practical benefits of
our implementations on both artificial and real datasets.
GlobalSearchRegression.jl: Building bridges between Machine Learning and Econometrics in Fat-Data scenarios
This paper has two aims. The first is to describe a novel research project designed to build bridges between the machine learning and econometrics worlds (ModelSelection.jl). The second is to introduce the main characteristics and comparative performance of the first Julia-native all-subset regression algorithm, included in GlobalSearchRegression.jl (v1.0.5). Like other available alternatives, this algorithm allows researchers to obtain the best model specification among all possible covariate combinations, in terms of user-defined information criteria, but up to 3165 and 197 times faster than the STATA and R alternatives, respectively.
Fil: Panigo, Demian Tupac. Universidad Nacional de la Plata. Facultad de Ingenieria. Instituto Malvinas; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Saavedra 15. Centro de Innovación de los Trabajadores. Universidad Metropolitana para la Educación y el Trabajo. Centro de Innovación de los Trabajadores; Argentina
Fil: Gluzmann, Pablo Alfredo. Universidad Nacional de La Plata. Facultad de Ciencias Económicas. Departamento de Ciencias Económicas. Centro de Estudios Distributivos Laborales y Sociales; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Fil: Mocskos, Esteban Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina
Fil: Mauri Ungaro, Adán. Universidad Nacional de La Plata; Argentina
Fil: Mari, Valentin. Universidad Nacional de La Plata; Argentina
Fil: Monzon, Nicolás. Universidad Nacional de La Plata; Argentina. Universidad Nacional de Avellaneda; Argentin
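The idea of all-subset regression described in this abstract can be sketched in a few lines. The code below is not the GlobalSearchRegression.jl API and is written in Python rather than Julia. The helper names, the simulated data, and the choice of BIC as the information criterion are all assumptions made for illustration. It fits an ordinary least squares model with an intercept for every non-empty covariate subset and keeps the subset with the lowest BIC.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss_ols(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def best_subset(covariates, y):
    """Return (bic, subset) with the lowest BIC over all non-empty subsets."""
    n = len(y)
    best = None
    for size in range(1, len(covariates) + 1):
        for subset in itertools.combinations(range(len(covariates)), size):
            rss = rss_ols([covariates[j] for j in subset], y)
            # Gaussian BIC up to a constant; size + 1 counts the intercept.
            bic = n * math.log(rss / n) + (size + 1) * math.log(n)
            if best is None or bic < best[0]:
                best = (bic, subset)
    return best


# Simulated data (assumption): only covariates 0 and 2 drive the response.
rng = random.Random(1)
n = 200
covariates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [2.0 * covariates[0][i] - 3.0 * covariates[2][i] + rng.gauss(0, 0.1)
     for i in range(n)]
bic, subset = best_subset(covariates, y)
print("selected covariates:", subset)
```

The number of candidate models doubles with each added covariate, which is why the running-time comparison in the abstract matters for this class of algorithm.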
Parallel Deterministic and Stochastic Global Minimization of Functions with Very Many Minima
The optimization of three problems with high dimensionality and many local minima is investigated
under five different optimization algorithms: DIRECT, simulated annealing, Spall’s SPSA algorithm, the KNITRO
package, and QNSTOP, a new algorithm developed at Indiana University
Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware
Here, we test the performance and scalability of fully-asynchronous,
best-effort communication on existing, commercially-available HPC hardware.
A first set of experiments tested whether best-effort communication
strategies can benefit performance compared to the traditional perfect
communication model. At high CPU counts, best-effort communication improved
both the number of computational steps executed per unit time and the solution
quality achieved within a fixed-duration run window.
Under the best-effort model, characterizing the distribution of quality of
service across processing components and over time is critical to understanding
the actual computation being performed. Additionally, a complete picture of
scalability under the best-effort model requires analysis of how such quality
of service fares at scale. To answer these questions, we designed and measured
a suite of quality of service metrics: simulation update period, message
latency, message delivery failure rate, and message delivery coagulation. Under
a lower communication-intensivity benchmark parameterization, we found that
median values for all quality of service metrics were stable when scaling from
64 to 256 processes. Under maximal communication intensivity, we found only minor
-- and, in most cases, nil -- degradation in median quality of service.
In an additional set of experiments, we tested the effect of an apparently
faulty compute node on performance and quality of service. Despite extreme
quality of service degradation among that node and its clique, median
performance and quality of service remained stable
Minimum Epistasis Interpolation for Sequence-Function Relationships
Massively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While such assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so that there is a need for techniques to impute values for genotypes whose phenotypes have not been directly assayed. Here, we present an imputation method based on inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction where mutational effects change as little as possible across adjacent genetic backgrounds. The resulting models can capture complex higher-order genetic interactions near the data, but approach additivity where data is sparse or absent. We apply the method to high-throughput transcription factor binding assays and use it to explore a fitness landscape for protein G
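The principle in this abstract, that mutational effects should change as little as possible across adjacent genetic backgrounds, can be illustrated on a toy landscape. The sketch below is a simplified reading of that principle and not the authors' implementation. It assumes biallelic sites, a single missing genotype, and a cost equal to the sum of squared pairwise epistatic coefficients over the faces of the genotype hypercube. All function names are hypothetical.

```python
import itertools


def faces(num_sites):
    """Yield each 2-D face of the binary hypercube as four genotypes
    (g00, g01, g10, g11) that differ only at sites i and j."""
    for i, j in itertools.combinations(range(num_sites), 2):
        others = [s for s in range(num_sites) if s not in (i, j)]
        for background in itertools.product((0, 1), repeat=len(others)):
            base = [0] * num_sites
            for s, allele in zip(others, background):
                base[s] = allele
            face = []
            for a, b in ((0, 0), (0, 1), (1, 0), (1, 1)):
                g = base[:]
                g[i], g[j] = a, b
                face.append(tuple(g))
            yield tuple(face)


def impute_one(phenotypes, missing, num_sites):
    """Value for `missing` that minimises the sum of squared local epistatic
    coefficients, where epistasis on a face is f00 - f01 - f10 + f11."""
    signs = (1.0, -1.0, -1.0, 1.0)
    num = 0.0  # sum of sign * constant over faces containing `missing`
    den = 0.0  # sum of sign ** 2 over the same faces
    for face in faces(num_sites):
        if missing not in face:
            continue
        sign, const = 0.0, 0.0
        for s, g in zip(signs, face):
            if g == missing:
                sign = s
            else:
                const += s * phenotypes[g]
        # On this face, epistasis = sign * x + const for the unknown x.
        num += sign * const
        den += sign * sign
    return -num / den  # minimiser of sum (sign * x + const) ** 2


# Toy additive landscape (assumption): baseline 3.0, site effects 1.0, 0.5, -2.0.
effects = (1.0, 0.5, -2.0)
missing = (1, 1, 1)
phenotypes = {
    g: 3.0 + sum(e * a for e, a in zip(effects, g))
    for g in itertools.product((0, 1), repeat=3)
    if g != missing
}
imputed = impute_one(phenotypes, missing, num_sites=3)
print("imputed phenotype:", imputed)
```

When the observed phenotypes are perfectly additive, as in this toy example, the imputed value coincides with the additive prediction, which matches the abstract's statement that the models approach additivity where data is sparse.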
Modelling the transcriptional regulation of androgen receptor in prostate cancer
Transcription of genes and production of proteins are essential functions of a normal cell. When these functions are disturbed, misregulation of crucial genes leads to aberrant cell behaviour and, in some cases, to the development of disease states such as cancer. One major mechanism of transcriptional regulation is the binding of transcription factors to enhancer sequences, which encourages or represses transcription depending on the role of the transcription factor. In prostate cells, misregulation of the androgen receptor (AR), a key transcriptional regulator, leads to the development and maintenance of prostate cancer. The androgen receptor binds to numerous locations in the genome, but it is still unclear how, and which, other key transcription factors aid or repress AR-mediated transcription. Here I analyzed data containing the transcriptional activity of 4139 putative AR binding sites (ARBS) in the genome, in the presence and absence of hormone, measured using the STARR-seq assay. Only a small fraction of ARBS showed significant differential expression when treated with hormone. To understand the essential factors underlying hormone-dependent behaviour, we developed both machine learning and biophysical models to identify active enhancers in prostate cancer cells. We also identify potentially crucial transcription factors for androgen-dependent behaviour and discuss the benefits and shortcomings of each modelling method