382 research outputs found
Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database
The Apache Accumulo database excels at distributed storage and indexing and
is ideally suited for storing graph data. Many big data analytics compute on
graph data and persist their results back to the database. These graph
calculations are often best performed inside the database server. The GraphBLAS
standard provides a compact and efficient basis for a wide range of graph
applications through a small number of sparse matrix operations. In this
article, we implement GraphBLAS sparse matrix multiplication server-side by
leveraging Accumulo's native, high-performance iterators. We compare the
mathematics and performance of inner and outer product implementations, and
show how an outer product implementation achieves optimal performance near
Accumulo's peak write rate. We offer our work as a core component to the
Graphulo library that will deliver matrix math primitives for graph analytics
within Accumulo.Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
Format Abstraction for Sparse Tensor Algebra Compilers
This paper shows how to build a sparse tensor algebra compiler that is
agnostic to tensor formats (data layouts). We develop an interface that
describes formats in terms of their capabilities and properties, and show how
to build a modular code generator where new formats can be added as plugins. We
then describe six implementations of the interface that compose to form the
dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants
thereof. With these implementations at hand, our code generator can generate
code to compute any tensor algebra expression on any combination of the
aforementioned formats.
To demonstrate our technique, we have implemented it in the taco tensor
algebra compiler. Our modular code generator design makes it simple to add
support for new tensor formats, and the performance of the generated code is
competitive with hand-optimized implementations. Furthermore, by extending taco
to support a wider range of formats specialized for different application and
data characteristics, we can improve end-user application performance. For
example, if input data is provided in the COO format, our technique allows
computing a single matrix-vector multiplication directly with the data in COO,
which is up to 3.6 faster than by first converting the data to CSR.Comment: Presented at OOPSLA 201
Distributed Triangle Counting in the Graphulo Matrix Math Library
Triangle counting is a key algorithm for large graph analysis. The Graphulo
library provides a framework for implementing graph algorithms on the Apache
Accumulo distributed database. In this work we adapt two algorithms for
counting triangles, one that uses the adjacency matrix and another that also
uses the incidence matrix, to the Graphulo library for server-side processing
inside Accumulo. Cloud-based experiments show a similar performance profile for
these different approaches on the family of power law Graph500 graphs, for
which data skew increasingly bottlenecks. These results motivate the design of
skew-aware hybrid algorithms that we propose for future work.Comment: Honorable mention in the 2017 IEEE HPEC's Graph Challeng
Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors
Traditional solution approaches for problems in quantum mechanics scale as
, where is the number of electrons. Various methods have
been proposed to address this issue and obtain linear scaling .
One promising formulation is the direct minimization of energy. Such methods
take advantage of physical localization of the solution, namely that the
solution can be sought in terms of non-orthogonal orbitals with local support.
In this work a numerically efficient implementation of sparse parallel vectors
within the open-source finite element library deal.II is proposed. The main
algorithmic ingredient is the matrix-free evaluation of the Hamiltonian
operator by cell-wise quadrature. Based on an a-priori chosen support for each
vector we develop algorithms and data structures to perform (i) matrix-free
sparse matrix multivector products (SpMM), (ii) the projection of an operator
onto a sparse sub-space (inner products), and (iii) post-multiplication of a
sparse multivector with a square matrix. The node-level performance is analyzed
using a roofline model. Our matrix-free implementation of finite element
operators with sparse multivectors achieves the performance of 157 GFlop/s on
Intel Cascade Lake architecture. Strong and weak scaling results are reported
for a typical benchmark problem using quadratic and quartic finite element
bases.Comment: 29 pages, 12 figure
Doctor of Philosophy
dissertationSparse matrix codes are found in numerous applications ranging from iterative numerical solvers to graph analytics. Achieving high performance on these codes has however been a significant challenge, mainly due to array access indirection, for example, of the form A[B[i]]. Indirect accesses make precise dependence analysis impossible at compile-time, and hence prevent many parallelizing and locality optimizing transformations from being applied. The expert user relies on manually written libraries to tailor the sparse code and data representations best suited to the target architecture from a general sparse matrix representation. However libraries have limited composability, address very specific optimization strategies, and have to be rewritten as new architectures emerge. In this dissertation, we explore the use of the inspector/executor methodology to accomplish the code and data transformations to tailor high performance sparse matrix representations. We devise and embed abstractions for such inspector/executor transformations within a compiler framework so that they can be composed with a rich set of existing polyhedral compiler transformations to derive complex transformation sequences for high performance. We demonstrate the automatic generation of inspector/executor code, which orchestrates code and data transformations to derive high performance representations for the Sparse Matrix Vector Multiply kernel in particular. We also show how the same transformations may be integrated into sparse matrix and graph applications such as Sparse Matrix Matrix Multiply and Stochastic Gradient Descent, respectively. The specific constraints of these applications, such as problem size and dependence structure, necessitate unique sparse matrix representations that can be realized using our transformations. Computations such as Gauss Seidel, with loop carried dependences at the outer most loop necessitate different strategies for high performance. Specifically, we organize the computation into level sets or wavefronts of irregular size, such that iterations of a wavefront may be scheduled in parallel but different wavefronts have to be synchronized. We demonstrate automatic code generation of high performance inspectors that do explicit dependence testing and level set construction at runtime, as well as high performance executors, which are the actual parallelized computations. For the above sparse matrix applications, we automatically generate inspector/executor code comparable in performance to manually tuned libraries
Polyhedral+Dataflow Graphs
This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains.
The intermediate representation can be integrated into existing compiler infrastructures. A specification language implemented as a domain specific language in C++ describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representation. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, or sparse matrix operations, tensor decomposition, and structured multigrid methods
DOLFIN: Automated Finite Element Computing
We describe here a library aimed at automating the solution of partial differential equations using the finite element method. By employing novel techniques for automated code generation, the library combines a high level of expressiveness with efficient computation. Finite element variational forms may be expressed in near mathematical notation, from which low-level code is automatically generated, compiled and seamlessly integrated with efficient implementations of
computational meshes and high-performance linear algebra. Easy-to-use object-oriented interfaces to the library are provided in the form of a C++ library and a Python module. This paper discusses the mathematical abstractions and methods used in the design of the library and its implementation. A number of examples are presented to demonstrate the use of the library in application code
- ā¦