
    Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures

    Generalized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high-performance in-core and out-of-core shared-memory algorithms for GWAS: by taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size.
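    The kernel of each of these problems is a single generalized least-squares (GLS) solve. The following sketch, my own illustration in Python with NumPy/SciPy rather than the paper's routine, shows the standard way such a solve can be carried out via a Cholesky factorization of the covariance matrix; all names and sizes are made up for the example.

```python
# A minimal sketch (not the paper's code) of one generalized least-squares
# solve as it appears in GWAS: the phenotype y is regressed on a design
# matrix X under a covariance matrix Sigma.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gls(X, y, Sigma):
    """Solve b = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y via Cholesky."""
    c = cho_factor(Sigma, lower=True)      # Sigma = L L', computed once per Sigma
    SiX = cho_solve(c, X)                  # Sigma^{-1} X
    Siy = cho_solve(c, y)                  # Sigma^{-1} y
    return np.linalg.solve(X.T @ SiX, X.T @ Siy)

# Tiny example: n individuals, p covariates (all data synthetic).
rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)            # symmetric positive definite
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.standard_normal(n)
print(gls(X, y, Sigma))
```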

    Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies

    In many scientific and engineering applications, one has to solve not one but a sequence of instances of the same problem. Oftentimes, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example of this class of applications is given by genome-wide association studies (GWAS), a widespread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data. In this paper, we design an algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose the computation into tasks to be distributed among the available cores, to attain high performance and scalability. With our algorithm, a GWAS that currently requires the use of a supercomputer may now be performed in a matter of hours on a single multi-core node. The discussion centers on the methodology to develop the algorithm rather than on the specific application. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.
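    To make the reuse of intermediate results concrete, here is a hedged sketch (again NumPy/SciPy, not the paper's algorithm) of a GWAS-like sequence of GLS problems in which only one column of the design matrix changes from problem to problem: the covariance factorization and the whitened fixed covariates are computed once and shared by every problem in the sequence.

```python
# A minimal sketch of how successive GLS problems in a GWAS-like sequence can
# share work: the covariance factor and the whitened fixed covariates are
# computed once; only the single SNP column is whitened inside the loop.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gwas_sequence(X_fixed, snps, y, Sigma):
    """Return one coefficient vector per SNP column in `snps`."""
    L = cholesky(Sigma, lower=True)                 # factor once
    Xt = solve_triangular(L, X_fixed, lower=True)   # whiten fixed covariates once
    yt = solve_triangular(L, y, lower=True)
    results = []
    for j in range(snps.shape[1]):
        sj = solve_triangular(L, snps[:, j], lower=True)  # only this column changes
        Zt = np.column_stack([Xt, sj])
        b, *_ = np.linalg.lstsq(Zt, yt, rcond=None)       # ordinary LS on whitened data
        results.append(b)
    return np.array(results)

rng = np.random.default_rng(1)
n, p, m = 300, 3, 10
X_fixed = rng.standard_normal((n, p))
snps = rng.integers(0, 3, size=(n, m)).astype(float)      # toy 0/1/2 genotypes
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
y = rng.standard_normal(n)
print(gwas_sequence(X_fixed, snps, y, Sigma).shape)        # (m, p+1)
```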

    Large-scale linear regression: Development of high-performance routines

    In statistics, series of ordinary least squares (OLS) problems are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of large-scale datasets increases, so does the challenge of dealing with them. Indeed, traditional solvers---which rely on "black-box" routines optimized for one single OLS problem---are highly inefficient and fail to provide a viable solution for big-data analyses. As a case study, in this paper we consider a linear regression consisting of two-dimensional grids of related OLS problems that arise in the context of genome-wide association analyses, and give a careful walkthrough for the development of {\sc ols-grid}, a high-performance routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc ols-grid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.
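    As a toy illustration of why a grid of related OLS problems admits far cheaper solutions than independent black-box solves, the sketch below (my own example in NumPy, not {\sc ols-grid}) computes the slopes of every (SNP, trait) cell of a single-regressor grid from one matrix product.

```python
# A minimal sketch: for single-SNP regressions on centered data, all numerators
# of the whole (SNP x trait) grid come from a single GEMM, and the denominators
# are shared across every trait.
import numpy as np

rng = np.random.default_rng(2)
n, m, t = 1000, 200, 50          # individuals, SNPs, traits (illustrative sizes)
S = rng.standard_normal((n, m))  # genotype matrix (assumed centered), one SNP per column
Y = rng.standard_normal((n, t))  # phenotype matrix (assumed centered), one trait per column

num = S.T @ Y                    # (m, t): x_i' y_j for every grid cell, one GEMM
den = np.sum(S * S, axis=0)      # (m,):   x_i' x_i, shared by every trait
B = num / den[:, None]           # slope of trait j on SNP i, for all cells at once

# A naive loop calling a black-box solver would redo the same O(n) work m*t times.
i, j = 5, 7
b_naive, *_ = np.linalg.lstsq(S[:, [i]], Y[:, j], rcond=None)
assert np.allclose(B[i, j], b_naive[0])
```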

    Algorithms for Large-scale Whole Genome Association Analysis

    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) millions of polymorphisms come at the cost of hundreds of gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: the genotype data is streamed from and to secondary storage using a double-buffering technique, while the covariance matrix is kept across the main memory of a distributed-memory system. We show that these methods sustain high performance and allow the analysis of enormous datasets.
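    The double-buffering idea can be sketched in a few lines: one block of genotype data is processed while the next block is already being fetched from disk on a background thread. The following Python sketch is only an illustration under an assumed file layout (a flat binary file of float64 values read by `read_block`, a hypothetical helper); it is not the authors' implementation.

```python
# A minimal double-buffering sketch: a dedicated I/O thread prefetches block b+1
# while block b is being processed, overlapping computation with disk transfers.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def read_block(path, block, block_rows, cols):
    """Read one block of float64 genotypes from a flat binary file (assumed layout)."""
    data = np.fromfile(path, dtype=np.float64,
                       count=block_rows * cols,
                       offset=block * block_rows * cols * 8)   # 8 bytes per float64
    return data.reshape(-1, cols)

def process(block_data):
    return block_data.sum()              # stand-in for the real per-block computation

def stream(path, n_blocks, block_rows, cols):
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, path, 0, block_rows, cols)
        for b in range(n_blocks):
            current = pending.result()                        # wait for block b
            if b + 1 < n_blocks:                              # prefetch block b+1
                pending = io.submit(read_block, path, b + 1, block_rows, cols)
            total += process(current)                         # compute while I/O runs
    return total

# Tiny demo with a temporary file standing in for the genotype store.
rows, cols, blocks = 4, 3, 5
demo = np.arange(rows * cols * blocks, dtype=np.float64)
demo.tofile("geno_demo.bin")
print(stream("geno_demo.bin", blocks, rows, cols), demo.sum())
```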

    High Performance Solutions for Big-data GWAS

    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as clouds or clusters, we address both challenges: the genotype and phenotype data is streamed from secondary storage using a double-buffering technique, while the relationship matrix is kept across the main memory of a distributed-memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets. Comment: Submitted to Parallel Computing. arXiv admin note: substantial text overlap with arXiv:1304.227
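    For the multi-trait case, one natural consequence of factoring the relationship matrix once is that all phenotypes can be whitened and solved together as a single multiple-right-hand-side problem. The sketch below (NumPy/SciPy, my own simplification rather than the paper's algorithm) illustrates this batching on a single shared-memory node; the distributed and out-of-core aspects are omitted.

```python
# A minimal multi-trait sketch: factor the relationship matrix once, whiten the
# design and all traits with BLAS-3 triangular solves, then solve one least-squares
# problem with many right-hand sides instead of looping over traits.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
n, p, t = 400, 5, 64
X = rng.standard_normal((n, p))            # design matrix (covariates + SNP)
Y = rng.standard_normal((n, t))            # one column per phenotype/trait
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                # relationship/covariance matrix

L = cholesky(K, lower=True)                # factor once
Xt = solve_triangular(L, X, lower=True)    # whiten design once
Yt = solve_triangular(L, Y, lower=True)    # whiten all traits in one call
B, *_ = np.linalg.lstsq(Xt, Yt, rcond=None)   # p x t coefficients in one solve
print(B.shape)
```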

    Parallel Parity Games: A Multicore Attractor for the Zielonka Recursive Algorithm

    Get PDF
    Parity games are abstract infinite-duration two-player games, widely studied in computer science. Several solution algorithms have been proposed and implemented in PGSolver, the community tool of choice, which has declared the Zielonka Recursive (ZR) algorithm the best performing on randomly generated games. With the aim of scaling to and solving wider classes of parity games, several improvements and optimizations have been proposed over the existing algorithms. However, no one has yet explored the benefit of using the full computational power of which even common modern multicore processors are capable. This is all the more surprising considering that most of the advanced algorithms in PGSolver are sequential. In this paper we introduce and implement, on a multicore architecture, a parallel version of the Attractor algorithm, which is the main kernel of the ZR algorithm; this choice follows from our finding that more than 99% of the execution time of the ZR algorithm is spent in this module. We provide tests on graphs with up to 20K nodes generated through PGSolver and discuss performance in terms of strong and weak scaling.
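    To make the kernel concrete, here is a small level-synchronous sketch of an attractor computation in Python, with the predecessors of each frontier examined in parallel. It is my own simplification for illustration, not PGSolver's or the paper's implementation, and the game, vertex ownership, and helper names are invented for the example.

```python
# A minimal attractor sketch: each round, the predecessors of the current
# frontier are checked in parallel, and every vertex that player `p` can force
# into the attractor is added; iteration stops when no new vertex is added.
from concurrent.futures import ThreadPoolExecutor

def attractor(owner, succ, pred, target, p, workers=4):
    """owner[v] in {0, 1}; succ/pred map a vertex to its successor/predecessor lists."""
    attr = set(target)
    frontier = set(target)

    def pulled_in(v):
        # v joins the attractor if p owns it and some edge leads into attr,
        # or the opponent owns it and *all* of its edges lead into attr.
        if v in attr:
            return None
        if owner[v] == p:
            return v if any(w in attr for w in succ[v]) else None
        return v if all(w in attr for w in succ[v]) else None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            candidates = {u for v in frontier for u in pred[v]}
            added = {v for v in pool.map(pulled_in, candidates) if v is not None}
            attr |= added               # attr is only mutated between parallel rounds
            frontier = added
    return attr

# Toy game: edges 0->1, 0->2, 1->2, 2->2; vertex 2 is the target set.
succ = {0: [1, 2], 1: [2], 2: [2]}
pred = {0: [], 1: [0], 2: [0, 1, 2]}
owner = {0: 0, 1: 1, 2: 0}
print(attractor(owner, succ, pred, {2}, p=0))   # {0, 1, 2}
```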

    Performance Modeling and Prediction for Dense Linear Algebra

    This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executing such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions make it possible to quickly select the fastest algorithm configuration from the available alternatives. We consider two scenarios that cover a wide range of computations. To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform; for various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms; at merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.
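    As a toy rendition of the measurement-based idea (not the dissertation's framework), the sketch below times a GEMM kernel for a few sizes, fits a simple flop-rate model, and then uses the model alone to rank two parenthesizations of a matrix chain product without executing either; all sizes and the linear model are illustrative assumptions.

```python
# A minimal sketch: benchmark a kernel a few times, fit t ~ alpha + beta * flops,
# then compare candidate algorithm configurations using only the model.
import time
import numpy as np

def timeit(f):
    t0 = time.perf_counter(); f(); return time.perf_counter() - t0

def time_gemm(m, n, k, reps=3):
    A, B = np.random.rand(m, k), np.random.rand(k, n)
    return min(timeit(lambda: A @ B) for _ in range(reps))

# Fit the model on a handful of benchmarked sizes (flops of GEMM = 2*m*n*k).
sizes = [(128, 128, 128), (256, 256, 256), (384, 384, 384), (512, 512, 512)]
flops = np.array([2.0 * m * n * k for m, n, k in sizes])
times = np.array([time_gemm(*s) for s in sizes])
(alpha, beta), *_ = np.linalg.lstsq(
    np.column_stack([np.ones_like(flops), flops]), times, rcond=None)
predict = lambda m, n, k: alpha + beta * 2.0 * m * n * k

# Rank two parenthesizations of A @ B @ C without executing them:
# A: 2000x50, B: 50x2000, C: 2000x50.
a, b, c, d = 2000, 50, 2000, 50
t_left  = predict(a, c, b) + predict(a, d, c)   # (A @ B) first, then times C
t_right = predict(b, d, c) + predict(a, d, b)   # (B @ C) first, then A times it
print("predicted (AB)C:", t_left, " A(BC):", t_right,
      " -> choose", "A(BC)" if t_right < t_left else "(AB)C")
```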