25 research outputs found
Application-tailored Linear Algebra Algorithms: A search-based Approach
In this paper, we tackle the problem of automatically generating algorithms
for linear algebra operations by taking advantage of problem-specific
knowledge. In most situations, users possess much more information about the
problem at hand than what current libraries and computing environments accept;
evidence shows that if properly exploited, such information leads to
uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra
compiler that allows users to input matrix equations together with properties
about the operands and the problem itself; for instance, they can specify that
the equation is part of a sequence, and how successive instances are related to
one another. The compiler exploits all this information to guide the generation
of algorithms, to limit the size of the search space, and to avoid redundant
computations. We applied the compiler to equations arising as part of
sensitivity and genome studies; the algorithms produced exhibit, respectively,
100- and 1000-fold speedups.
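The kind of operand knowledge such a compiler exploits can be illustrated with a small, self-contained sketch (a hypothetical example, not the compiler's actual input language or output): declaring that a matrix is symmetric positive definite allows a Cholesky-based solve in place of a general LU-based one, roughly halving the flop count.

```python
import numpy as np

# Illustrative only: a property the user declares (A is symmetric
# positive definite) selects a cheaper algorithm for the same equation.
rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T + 500 * np.eye(500)   # construct an SPD matrix
b = rng.standard_normal(500)

# Generic solver: ignores the SPD property (LU factorization).
x_generic = np.linalg.solve(A, b)

# Property-aware solver: exploits SPD via a Cholesky factorization,
# A = L L^T, followed by two triangular solves.
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)         # forward substitution
x_spd = np.linalg.solve(L.T, y)   # backward substitution

assert np.allclose(x_generic, x_spd)
```

Both paths return the same solution; the point is that only the declared property makes the cheaper path legal to choose automatically.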
Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies
In many scientific and engineering applications, one has to solve not one but
a sequence of instances of the same problem. Oftentimes, the problems in the
sequence are linked in a way that allows intermediate results to be reused. A
characteristic example for this class of applications is given by the
Genome-Wide Association Studies (GWAS), a widespread tool in computational
biology. GWAS entails the solution of up to trillions (10^12) of correlated
generalized least-squares problems, posing a daunting challenge: the
performance of petaflops (10^15 floating-point operations) over terabytes
of data.
In this paper, we design an algorithm for performing GWAS on multi-core
architectures. This is accomplished in three steps. First, we show how to
exploit the relation among successive problems, thus reducing the overall
computational complexity. Then, through an analysis of the required data
transfers, we identify how to eliminate any overhead due to input/output
operations. Finally, we study how to decompose computation into tasks to be
distributed among the available cores, to attain high performance and
scalability. With our algorithm, a GWAS that currently requires the use of a
supercomputer may now be performed in a matter of hours on a single multi-core
node.
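The first step, exploiting the relation among successive problems, can be sketched in a few lines (an illustrative simplification, not the paper's algorithm): when all generalized least-squares problems in the sequence share the same covariance matrix, its Cholesky factor is computed once and reused, instead of being recomputed for every problem.

```python
import numpy as np

# Hedged sketch: a sequence of GLS problems sharing one covariance
# matrix M. Factoring M once and whitening with the cached factor
# turns each GLS problem into a cheap ordinary least-squares solve.
rng = np.random.default_rng(1)
n, p, num_problems = 200, 4, 50
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)        # shared SPD covariance matrix
y = rng.standard_normal(n)

L = np.linalg.cholesky(M)          # computed once, reused below
yt = np.linalg.solve(L, y)         # whitened right-hand side, also reused

betas = []
for _ in range(num_problems):
    X = rng.standard_normal((n, p))      # per-problem design matrix
    Xt = np.linalg.solve(L, X)           # whiten: L^{-1} X
    beta, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    betas.append(beta)
```

Since minimizing (y - Xb)^T M^{-1} (y - Xb) equals minimizing the whitened residual norm, each loop iteration delivers the exact GLS solution without ever touching M again.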
The discussion centers around the methodology to develop the algorithm rather
than the specific application. We believe the paper contributes valuable
guidelines of general applicability for computational scientists on how to
develop and optimize numerical algorithms.
Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code
This dissertation focuses on the design and the implementation of
domain-specific compilers for linear algebra matrix equations. The development
of efficient libraries for such equations, which lie at the heart of most
software for scientific computing, is a complex process that requires expertise
in a variety of areas, including the application domain, algorithms, numerical
analysis and high-performance computing. Moreover, the process involves the
collaboration of several people for a considerable amount of time. With our
compilers, we aim to relieve the developers from both designing algorithms and
writing code, and to generate routines that match or even surpass the
performance of those written by human experts.
Large-scale linear regression: Development of high-performance routines
In statistics, series of ordinary least squares problems (OLS) are used to
study the linear correlation among sets of variables of interest; in many
studies, the number of such variables is at least in the millions, and the
corresponding datasets occupy terabytes of disk space. As the availability of
large-scale datasets increases regularly, so does the challenge in dealing with
them. Indeed, traditional solvers---which rely on the use of "black-box"
routines optimized for one single OLS---are highly inefficient and fail to
provide a viable solution for big-data analyses. As a case study, in this paper
we consider a linear regression consisting of two-dimensional grids of related
OLS problems that arise in the context of genome-wide association analyses, and
give a careful walkthrough for the development of {\sc ols-grid}, a
high-performance routine for shared-memory architectures; analogous steps are
relevant for tailoring OLS solvers to other applications. In particular, we
first illustrate the design of efficient algorithms that exploit the structure
of the OLS problems and eliminate redundant computations; then, we show how to
effectively deal with datasets that do not fit in main memory; finally, we
discuss how to cast the computation in terms of efficient kernels and how to
achieve scalability. Importantly, each design decision along the way is
justified by simple performance models. {\sc ols-grid} enables the solution of
correlated OLS problems operating on terabytes of data in a matter of
hours.
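The redundancy elimination mentioned above can be sketched minimally (an illustration of the idea only; ols-grid itself is far more elaborate): every OLS problem in the grid shares a block of fixed covariates, so the cross-products involving that block are computed once and combined with small per-problem updates.

```python
import numpy as np

# Hypothetical grid of OLS problems: each pairs the shared covariates
# XL with one extra column (a "variant") and one right-hand side (a
# "trait"). The expensive shared cross-products are computed once.
rng = np.random.default_rng(2)
n, p = 300, 3
XL = rng.standard_normal((n, p))          # shared fixed covariates
variants = rng.standard_normal((n, 10))   # one extra column per problem
traits = rng.standard_normal((n, 5))      # multiple right-hand sides

StS = XL.T @ XL                           # shared work, computed once
StY = XL.T @ traits                       # likewise shared

betas = np.empty((10, 5, p + 1))
for i in range(10):
    x = variants[:, i:i+1]
    # Assemble the normal equations from cached and per-problem parts.
    A = np.block([[StS, XL.T @ x], [x.T @ XL, x.T @ x]])
    B = np.vstack([StY, x.T @ traits])
    betas[i] = np.linalg.solve(A, B).T
```

Per grid point, only the thin products involving the single extra column remain, which is where the asymptotic savings over a black-box per-problem solver come from.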
High Performance Solutions for Big-data GWAS
In order to associate complex traits with genetic polymorphisms, genome-wide
association studies process huge datasets involving tens of thousands of
individuals genotyped for millions of polymorphisms. When handling these
datasets, which exceed the main memory of contemporary computers, one faces two
distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes
come at the cost of hundreds of gigabytes of data, which can only be kept in
secondary storage; 2) the relatedness of the test population is represented by
a relationship matrix, which, for large populations, can only fit in the
combined main memory of a distributed architecture. In this paper, by using
distributed resources such as Cloud or clusters, we address both challenges:
The genotype and phenotype data is streamed from secondary storage using a
double-buffering technique, while the relationship matrix is kept across the
main memory of a distributed memory system. With the help of these solutions,
we develop separate algorithms for studies involving only one or a multitude of
traits. We show that these algorithms sustain high performance and allow the
analysis of enormous datasets.
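The double-buffering idea can be sketched as follows (a simplified, hypothetical illustration in which disk reads are simulated by a producer thread): a reader fills a two-slot queue while the main thread computes on the block it already holds, so I/O and computation overlap.

```python
import threading
import queue
import numpy as np

# Illustrative stand-in for streaming genotype blocks from secondary
# storage: the producer "reads" blocks while the consumer computes.
def read_blocks(num_blocks, block_shape, out_queue):
    rng = np.random.default_rng(3)
    for _ in range(num_blocks):
        out_queue.put(rng.standard_normal(block_shape))  # simulated disk read
    out_queue.put(None)                                  # end-of-stream marker

buffers = queue.Queue(maxsize=2)   # two slots: the "double" buffer
reader = threading.Thread(target=read_blocks, args=(8, (100, 50), buffers))
reader.start()

totals = []
while (block := buffers.get()) is not None:
    totals.append(block.sum())     # stand-in for the real GWAS computation
reader.join()
```

The bounded queue is what keeps memory use constant: the reader blocks as soon as it is one buffer ahead of the computation.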
Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures
Generalized linear mixed-effects models in the context of genome-wide
association studies (GWAS) represent a formidable computational challenge: the
solution of millions of correlated generalized least-squares problems, and the
processing of terabytes of data. We present high performance in-core and
out-of-core shared-memory algorithms for GWAS: By taking advantage of
domain-specific knowledge, exploiting multi-core parallelism, and handling data
efficiently, our algorithms attain unequalled performance. When compared to
GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor
we obtain 50-fold speedups. As a consequence, our routines enable genome
studies of unprecedented size.
Accelerating scientific codes by performance and accuracy modeling
Scientific software is often driven by multiple parameters that affect both
accuracy and performance. Since finding the optimal configuration of these
parameters is a highly complex task, it is extremely common that the software is
used suboptimally. In a typical scenario, accuracy requirements are imposed,
and attained through suboptimal performance. In this paper, we present a
methodology for the automatic selection of parameters for simulation codes, and
a corresponding prototype tool. To be amenable to our methodology, the target
code must expose the parameters affecting accuracy and performance, and there
must be formulas available for error bounds and computational complexity of the
underlying methods. As a case study, we consider the particle-particle
particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and
use our tool to identify configurations of the input parameters that achieve a
given accuracy in the shortest execution time. When compared with the
configurations suggested by expert users, the parameters selected by our tool
yield reductions in the time-to-solution ranging between 10% and 60%. In other
words, for the typical scenario where a fixed number of core-hours are granted
and simulations of a fixed number of timesteps are to be run, usage of our tool
may allow up to twice as many simulations. While we develop our ideas using
LAMMPS as computational framework and use the PPPM method for dispersion as
case study, the methodology is general and valid for a range of software tools
and methods.
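The selection methodology can be illustrated with a toy model (both formulas below are hypothetical placeholders, not the actual PPPM error and cost models): enumerate candidate configurations, discard those violating the accuracy requirement, and return the one with the lowest predicted cost.

```python
import itertools

# Hypothetical closed-form models of the kind the methodology assumes
# are available for the target method's parameters.
def error_bound(grid_size, cutoff):
    return 1.0 / (grid_size * cutoff**2)      # placeholder error model

def predicted_cost(grid_size, cutoff):
    return grid_size**3 + 50 * cutoff**3      # placeholder cost model

def select_parameters(target_accuracy):
    # Enumerate candidate (grid_size, cutoff) configurations, keep the
    # feasible ones, and pick the cheapest according to the cost model.
    candidates = itertools.product(range(8, 65, 8), range(2, 11))
    feasible = [(g, c) for g, c in candidates
                if error_bound(g, c) <= target_accuracy]
    return min(feasible, key=lambda gc: predicted_cost(*gc))

best = select_parameters(1e-2)
```

Because both models are cheap to evaluate, exhaustive enumeration is affordable here; the paper's tool applies the same accuracy-constrained minimization to real model formulas.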
Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions
Mathematical operators whose transformation rules constitute the building
blocks of a multi-linear algebra are widely used in physics and engineering
applications where they are very often represented as tensors. In the last
century, thanks to the advances in tensor calculus, it was possible to uncover
new research fields and make remarkable progress in the existing ones, from
electromagnetism to the dynamics of fluids and from the mechanics of rigid
bodies to quantum mechanics of many atoms. By now, the formal mathematical and
geometrical properties of tensors are well defined and understood; conversely,
in the context of scientific and high-performance computing, many
tensor-related problems are still open. In this paper, we address the problem of
efficiently computing contractions among two tensors of arbitrary dimension by
using kernels from the highly optimized BLAS library. In particular, we
establish precise conditions to determine if and when GEMM, the kernel for
matrix products, can be used. Such conditions take into consideration both the
nature of the operation and the storage scheme of the tensors, and induce a
classification of the contractions into three groups. For each group, we
provide a recipe to guide the users towards the most effective use of BLAS.
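The most favorable case can be made concrete with a small sketch (illustrative, not the paper's classification): when the contracted index sits at the boundary of both operands' storage, the contraction maps onto a single GEMM after reshapes of contiguous memory.

```python
import numpy as np

# Contraction C[a,b,c,d] = sum_k A[a,b,k] * B[k,c,d], with the
# contracted index k last in A and first in B (row-major storage).
rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4, 6))    # indices (a, b, k)
B = rng.standard_normal((6, 5, 2))    # indices (k, c, d)

# Reference contraction.
C_ref = np.einsum('abk,kcd->abcd', A, B)

# GEMM formulation: fold (a,b) and (c,d) into single matrix dimensions;
# the reshapes are metadata-only for contiguous arrays.
C_gemm = (A.reshape(12, 6) @ B.reshape(6, 10)).reshape(3, 4, 5, 2)

assert np.allclose(C_ref, C_gemm)
```

When the contracted index is not in such a boundary position, extra transpositions (or batched calls) are needed, which is precisely what separates the remaining groups of contractions.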
Automatic Generation of Efficient Linear Algebra Programs
The level of abstraction at which application experts reason about linear
algebra computations and the level of abstraction used by developers of
high-performance numerical linear algebra libraries do not match. The former is
conveniently captured by high-level languages and libraries such as Matlab and
Eigen, while the latter is expressed by the kernels included in the BLAS and LAPACK
libraries. Unfortunately, the translation from a high-level computation to an
efficient sequence of kernels is a far-from-trivial task that requires
extensive knowledge of both linear algebra and high-performance computing.
Internally, almost all high-level languages and libraries use efficient
kernels; however, the translation algorithms are too simplistic and thus lead
to a suboptimal use of said kernels, with significant performance losses. In
order to both achieve the productivity that comes with high-level languages,
and make use of the efficiency of low level kernels, we are developing Linnea,
a code generator for linear algebra problems. As input, Linnea takes a
high-level description of a linear algebra problem and produces as output an
efficient sequence of calls to high-performance kernels. In 25 application
problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen
and Armadillo, with speedups up to and exceeding 10x.
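The translation gap can be made concrete with a hand-worked example (the mapping below is the kind a generator like Linnea automates; the snippet itself is only illustrative): translated literally, the normal-equations formula x = (A^T A)^{-1} A^T b forms an explicit inverse, whereas an expert maps it to a factorization plus triangular solves.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((400, 50))
b = rng.standard_normal(400)

# Literal translation of the formula: explicit inverse (inefficient
# and numerically less robust).
x_naive = np.linalg.inv(A.T @ A) @ (A.T @ b)

# Kernel-oriented version: Cholesky factorization of A^T A followed by
# two triangular solves -- the sequence a code generator aims to emit.
L = np.linalg.cholesky(A.T @ A)
x_kernels = np.linalg.solve(L.T, np.linalg.solve(L, A.T @ b))

assert np.allclose(x_naive, x_kernels)
```

Both variants compute the same least-squares solution; automating the rewrite from the first form to the second, across arbitrary matrix expressions and operand properties, is exactly the translation problem described above.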