Challenges of Big Data Analysis

Author: Fan Jianqing
Han Fang
Liu Han
Publication venue: Oxford University Press (OUP)
Publication date: 06/02/2014

Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. Such unvalidated assumptions can lead to wrong statistical inferences and consequently wrong scientific conclusions.

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

Massively-Parallel Break Detection for Satellite Data

Author: Gieseke Fabian
Horion Stéphanie
Rosca Sabina
Verbesselt Jan
von Mehren Malte
Zeileis Achim
Publication date: 01/01/2018

The field of remote sensing is nowadays faced with huge amounts of data. While this offers a variety of exciting research opportunities, it also yields significant challenges regarding both computation time and space requirements. In practice, the sheer data volumes render existing approaches too slow for processing and analyzing all the available data. This work aims at accelerating BFAST, one of the state-of-the-art methods for break detection given satellite image time series. In particular, we propose a massively-parallel implementation for BFAST that can effectively make use of modern parallel compute devices such as GPUs. Our experimental evaluation shows that the proposed GPU implementation is up to four orders of magnitude faster than the existing publicly available implementation and up to ten times faster than a corresponding multi-threaded CPU execution. The dramatic decrease in running time renders the analysis of significantly larger datasets possible in seconds or minutes instead of hours or days. We demonstrate the practical benefits of our implementations given both artificial and real datasets.

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

GlobalSearchRegression.jl: Building bridges between Machine Learning and Econometrics in Fat-Data scenarios

Author: Gluzmann Pablo Alfredo
Mari Valentin
Mauri Ungaro Adán
Mocskos Esteban Eduardo
Monzon Nicolás
Panigo Demian Tupac
Publication venue: JuliaCon
Publication date: 01/06/2020

The aim of this paper is twofold. The first aim is to describe a novel research project designed to build bridges between the machine learning and econometrics worlds (ModelSelection.jl). The second is to introduce the main characteristics and comparative performance of the first Julia-native all-subset regression algorithm, included in GlobalSearchRegression.jl (v1.0.5). Like other available alternatives, this algorithm allows researchers to obtain the best model specification among all possible covariate combinations (in terms of user-defined information criteria), but up to 3165 and 197 times faster than the STATA and R alternatives, respectively.

Affiliation: Panigo, Demian Tupac. Universidad Nacional de la Plata. Facultad de Ingenieria. Instituto Malvinas; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Saavedra 15. Centro de Innovación de los Trabajadores. Universidad Metropolitana para la Educación y el Trabajo. Centro de Innovación de los Trabajadores; Argentina
Affiliation: Gluzmann, Pablo Alfredo. Universidad Nacional de La Plata. Facultad de Ciencias Económicas. Departamento de Ciencias Económicas. Centro de Estudios Distributivos Laborales y Sociales; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Affiliation: Mocskos, Esteban Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina
Affiliation: Mauri Ungaro, Adán. Universidad Nacional de La Plata; Argentina
Affiliation: Mari, Valentin. Universidad Nacional de La Plata; Argentina
Affiliation: Monzon, Nicolás. Universidad Nacional de La Plata; Argentina. Universidad Nacional de Avellaneda; Argentin

CONICET Digital

Parallel Deterministic and Stochastic Global Minimization of Functions with Very Many Minima

Author: Castle Brent S.
Easterling David R.
Madigan Michael L.
Trosset Michael W.
Watson Layne T.
Publication date: 01/01/2011

The optimization of three problems with high dimensionality and many local minima is investigated under five different optimization algorithms: DIRECT, simulated annealing, Spall’s SPSA algorithm, the KNITRO package, and QNSTOP, a new algorithm developed at Indiana University.

Computer Science Technical Reports @Virginia Tech

Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware

Author: Moreno Matthew Andres
Ofria Charles
Publication date: 20/11/2022

Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensivity, we found only minor (and, in most cases, nil) degradation in median quality of service. In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable.

arXiv.org e-Print Archive

Minimum Epistasis Interpolation for Sequence-Function Relationships

Author: McCandlish D. M.
Zhou J.
Publication venue: Springer Science and Business Media LLC
Publication date: 14/04/2020

Massively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While such assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so there is a need for techniques to impute values for genotypes whose phenotypes have not been directly assayed. Here, we present an imputation method based on inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction where mutational effects change as little as possible across adjacent genetic backgrounds. The resulting models can capture complex higher-order genetic interactions near the data, but approach additivity where data is sparse or absent. We apply the method to high-throughput transcription factor binding assays and use it to explore a fitness landscape for protein G.

Cold Spring Harbor Laboratory Institutional Repository

Modelling the transcriptional regulation of androgen receptor in prostate cancer

Author: Hu Yuqian
Publication date: 26/04/2016

Transcription of genes and production of proteins are essential functions of a normal cell. If these processes are disturbed, misregulation of crucial genes leads to aberrant cell behaviour and, in some cases, to the development of diseased states such as cancer. One major mechanism of transcriptional regulation involves the binding of transcription factors to enhancer sequences, which encourages or represses transcription depending on the role of the transcription factor. In prostate cells, misregulation of the androgen receptor (AR), a key transcriptional regulator, leads to the development and maintenance of prostate cancer. The androgen receptor binds to numerous locations in the genome, but it is still unclear how and which other key transcription factors aid and repress AR-mediated transcription. Here I analyzed data on the transcriptional activity of 4139 putative AR binding sites (ARBS) in the genome, with and without the presence of hormone, measured using the STARR-seq assay. Only a small fraction of ARBS showed significant differential expression when treated with hormone. To understand the essential factors underlying hormone-dependent behaviour, we developed both machine learning and biophysical models to identify active enhancers in prostate cancer cells. We also identify potentially crucial transcription factors for androgen-dependent behaviour and discuss the benefits and shortcomings of each modelling method.

Simon Fraser University Institutional Repository