Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems
We propose two new methods to address the weak scaling problems of KRR: the
Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative
ways to partition the input dataset into p different parts, generating p
different models, and then selecting the best model among them. Compared to a
conventional implementation, KKRR2 (optimized version of KKRR) improves the
weak scaling efficiency from 0.32% to 38% and achieves a 591 times speedup for
getting the same accuracy by using the same data and the same hardware (1536
processors). BKRR2 (optimized version of BKRR) achieves a higher accuracy than
the current fastest method using less training time for a variety of datasets.
For the applications requiring only approximate solutions, BKRR2 improves the
weak scaling efficiency to 92% and achieves 3505 times speedup (theoretical
speedup: 4096 times). Comment: This paper has been accepted by ACM International Conference on
Supercomputing (ICS) 201
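The partition-then-select idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' BKRR or KKRR implementation: the random equal-size partition, the Gaussian kernel, the parameters `lam` and `sigma`, the validation-error selection rule, and all function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise squared distances, clipped at zero to absorb round-off.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_krr(X, y, lam=1e-3, sigma=1.0):
    # Solve (K + lam * n * I) alpha = y for the dual coefficients.
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def predict_krr(model, X_new, sigma=1.0):
    X_train, alpha = model
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def best_partition_model(X, y, X_val, y_val, p, seed=0):
    # Assumption: a random, equal-size split stands in for the
    # balanced / k-means partitioning named in the abstract.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), p)
    # One independent model per part; each solve is n/p-sized.
    models = [fit_krr(X[idx], y[idx]) for idx in parts]
    # Assumption: "best" means lowest mean squared error on held-out data.
    errors = [float(np.mean((predict_krr(m, X_val) - y_val) ** 2)) for m in models]
    return models[int(np.argmin(errors))], errors
```

Because each of the p models is trained on only n/p samples, the p dense solves are independent and could in principle run on separate processors, which is the property the abstract's scaling claims rely on.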
A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters
In this work, we consider the solution of boundary integral equations by
means of a scalable hierarchical matrix approach on clusters equipped with
graphics hardware, i.e. graphics processing units (GPUs). To this end, we
extend our existing single-GPU hierarchical matrix library hmglib such that it
is able to scale on many GPUs and such that it can be coupled to arbitrary
application codes. Using a model GPU implementation of a boundary element
method (BEM) solver, we are able to achieve more than 67 percent relative
parallel speed-up going from 128 to 1024 GPUs for a model geometry test case
with 1.5 million unknowns and a real-world geometry test case with almost 1.2
million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6
minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the
setup phase and 20 seconds for the iterative solver. To the best of the
authors' knowledge, we here discuss the first fully GPU-based
distributed-memory parallel hierarchical matrix Open Source library using the
traditional H-matrix format and adaptive cross approximation with an
application to BEM problems
Algorithmic patterns for H-matrices on many-core processors
In this work, we consider the reformulation of hierarchical (H-)
matrix algorithms for many-core processors with a model implementation on
graphics processing units (GPUs). H-matrices approximate specific
dense matrices, e.g., from discretized integral equations or kernel ridge
regression, leading to log-linear time complexity in dense matrix-vector
products. The parallelization of H-matrix operations on many-core
processors is difficult due to the complex nature of the underlying algorithms.
While previous algorithmic advances for many-core hardware focused on
accelerating existing H-matrix CPU implementations by many-core
processors, we here aim at totally relying on that processor type. As main
contribution, we introduce the necessary parallel algorithmic patterns allowing
to map the full H-matrix construction and the fast H-matrix-vector
product to many-core hardware. Here, crucial ingredients are space filling
curves, parallel tree traversal and batching of linear algebra operations. The
resulting model GPU implementation hmglib is, to the best of the authors'
knowledge, the first entirely GPU-based Open Source H-matrix library of
this kind. We conclude this work by an in-depth performance analysis and a
comparative performance study against a standard H-matrix library,
highlighting profound speedups of our many-core parallel approach
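To make the role of space filling curves concrete, here is a small sketch of ordering 2D points along a Z-order (Morton) curve. The abstract does not state which curve is used or how it is implemented, so the choice of Z-order, the 16-bit quantization, and the function names below are assumptions for illustration. Sorting points by such a key places spatially nearby points in contiguous index ranges, which is one common way to build a cluster tree without pointer-based recursion.

```python
def morton_code_2d(ix, iy, bits=16):
    # Interleave the bits of two non-negative integers:
    # bit b of ix goes to position 2b, bit b of iy to position 2b + 1.
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code

def morton_order(points, bits=16):
    # points: sequence of (x, y) pairs with coordinates in [0, 1].
    # Returns the indices of the points sorted along the Z-order curve.
    scale = (1 << bits) - 1
    keys = [morton_code_2d(int(x * scale), int(y * scale), bits)
            for x, y in points]
    return sorted(range(len(points)), key=keys.__getitem__)
```

For example, `morton_order([(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)])` visits the four quadrants in the order lower-left, lower-right, upper-left, upper-right. On a GPU the key computation is independent per point, so it maps naturally to one thread per point followed by a parallel sort.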
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
Training very large scale nonlinear SVMs using Alternating Direction Method of Multipliers coupled with the Hierarchically Semi-Separable kernel approximations
Typically, nonlinear Support Vector Machines (SVMs) produce significantly
higher classification quality when compared to linear ones but, at the same
time, their computational complexity is prohibitive for large-scale datasets:
this drawback is essentially related to the necessity to store and manipulate
large, dense and unstructured kernel matrices. Despite the fact that at the
core of training an SVM there is a simple convex optimization problem,
the presence of kernel matrices is responsible for dramatic performance
reduction, making SVMs unworkably slow for large problems. Aiming at an
efficient solution of large-scale nonlinear SVM problems, we propose the use of
the Alternating Direction Method of Multipliers coupled with
Hierarchically Semi-Separable (HSS) kernel approximations. As shown in
this work, the detailed analysis of the interaction among their algorithmic
components unveils a particularly efficient framework and indeed, the presented
experimental results demonstrate a significant speed-up when compared to the
state-of-the-art nonlinear SVM libraries (without significantly
affecting the classification accuracy)