2,745 research outputs found

    Challenges of Big Data Analysis

    Full text link
    Big Data bring new opportunities to modern society and challenges to data scientists. On one hand, Big Data hold great promises for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinguished and require new computational and statistical paradigm. This article give overviews on the salient features of Big Data and how these features impact on paradigm change on statistical and computational methods as well as computing architectures. We also provide various new perspectives on the Big Data analysis and computation. In particular, we emphasis on the viability of the sparsest solution in high-confidence set and point out that exogeneous assumptions in most statistical methods for Big Data can not be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions

    Massively-Parallel Break Detection for Satellite Data

    Full text link
    The field of remote sensing is nowadays faced with huge amounts of data. While this offers a variety of exciting research opportunities, it also yields significant challenges regarding both computation time and space requirements. In practice, the sheer data volumes render existing approaches too slow for processing and analyzing all the available data. This work aims at accelerating BFAST, one of the state-of-the-art methods for break detection given satellite image time series. In particular, we propose a massively-parallel implementation for BFAST that can effectively make use of modern parallel compute devices such as GPUs. Our experimental evaluation shows that the proposed GPU implementation is up to four orders of magnitude faster than the existing publicly available implementation and up to ten times faster than a corresponding multi-threaded CPU execution. The dramatic decrease in running time renders the analysis of significantly larger datasets possible in seconds or minutes instead of hours or days. We demonstrate the practical benefits of our implementations given both artificial and real datasets.Comment: 10 page

    GlobalSearchRegression.jl: Building bridges between Machine Learning and Econometrics in Fat-Data scenarios

    Get PDF
    The aim of this paper is twofold. The first one is to describe a novel research-project designed for building bridges between machine learning and econometric worlds (ModelSelection.jl).The second one is to introduce the main characteristics and comparative performance of the first Julia-native all-subset regression algorithm included in GlobalSearchRegression.jl (v1.0.5). As other available alternatives, this algorithm allows researchers to obtain the best model specification among all possible covariate combinations - in terms of user defined information criteria-, but up to 3165 and 197 times faster than STATA and R alternatives, respectively.Fil: Panigo, Demian Tupac. Universidad Nacional de la Plata. Facultad de Ingenieria. Instituto Malvinas.; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Saavedra 15. Centro de Innovación de los Trabajadores. Universidad Metropolitana para la Educación y el Trabajo. Centro de Innovación de los Trabajadores; ArgentinaFil: Gluzmann, Pablo Alfredo. Universidad Nacional de La Plata. Facultad de Ciencias Económicas. Departamento de Ciencias Económicas. Centro de Estudios Distributivos Laborales y Sociales; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; ArgentinaFil: Mocskos, Esteban Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; ArgentinaFil: Mauri Ungaro, Adán. Universidad Nacional de La Plata; ArgentinaFil: Mari, Valentin. Universidad Nacional de La Plata; ArgentinaFil: Monzon, Nicolás. Universidad Nacional de La Plata; Argentina. Universidad Nacional de Avellaneda; Argentin

    Parallel Deterministic and Stochastic Global Minimization of Functions with Very Many Minima

    Get PDF
    The optimization of three problems with high dimensionality and many local minima are investigated under five different optimization algorithms: DIRECT, simulated annealing, Spall’s SPSA algorithm, the KNITRO package, and QNSTOP, a new algorithm developed at Indiana University

    Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware

    Full text link
    Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 process. Under maximal communication intensivity, we found only minor -- and, in most cases, nil -- degradation in median quality of service. In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable

    Minimum Epistasis Interpolation for Sequence-Function Relationships

    Get PDF
    Massively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While such assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so that there is a need for techniques to impute values for genotypes whose phenotypes have not been directly assayed. Here, we present an imputation method based on inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction where mutational effects change as little as possible across adjacent genetic backgrounds. The resulting models can capture complex higher-order genetic interactions near the data, but approach additivity where data is sparse or absent. We apply the method to high-throughput transcription factor binding assays and use it to explore a fitness landscape for protein G

    Modelling the transcriptional regulation of androgen receptor in prostate cancer

    Get PDF
    Transcription of genes and production of proteins are essential functions of a normal cell. If disturbed, misregulation of crucial genes leads to aberrant cell behaviour and in some cases, leads to the development of diseased states such as cancer. One major transcriptional regulation tool involves the binding of transcription factor onto enhancer sequences that will encourage or repress transcription depending on the role of the transcription factor. In prostate cells, misregulation of the androgen receptor(AR), a key transcriptional regulator, leads to the development and maintenance of prostate cancer. Androgen receptor binds to numerous locations in the genome, but it is still unclear how and which other key transcription factors aid and repress AR-mediated transcription. Here I analyzed the data that contained the transcriptional activity of 4139 putative AR binding sites (ARBS) in the genome with and without the presence of hormone using the STARR-seq assay. Only a small fraction of ARBS showed significant differential expression when treated with hormone. To understand the underlying essential factors behind hormone-dependent behaviour, we developed both machine learning and biophysical models to identify active enhancers in prostate cancer cells. We also identify potentially crucial transcription factors for androgen-dependent behaviour and discuss the benefits and shortcomings of each modelling method
    • …
    corecore