GPU Acceleration of ADMM for Large-Scale Quadratic Programming
The alternating direction method of multipliers (ADMM) is a powerful operator
splitting technique for solving structured convex optimization problems. Due to
its relatively low per-iteration computational cost and ability to exploit
sparsity in the problem data, it is particularly suitable for large-scale
optimization. However, the method may still take prohibitively long to compute
solutions to very large problem instances. Although ADMM is known to be
parallelizable, this feature is rarely exploited in real implementations. In
this paper we exploit the parallel computing architecture of a graphics
processing unit (GPU) to accelerate ADMM. We build our solver on top of OSQP, a
state-of-the-art implementation of ADMM for quadratic programming. Our
open-source CUDA C implementation has been tested on many large-scale problems
and was shown to be up to two orders of magnitude faster than the CPU
implementation.
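The low per-iteration cost the abstract refers to is easiest to see in the box-constrained special case of a QP. The sketch below is illustrative only — it is not OSQP's algorithm, which handles general constraints l ≤ Ax ≤ u with more careful linear algebra — but it shows the structure: one linear solve with a matrix that can be factored once, one cheap projection, and one vector update per iteration.

```python
import numpy as np

def admm_box_qp(P, q, lo, hi, rho=1.0, iters=500):
    """Minimise 0.5*x@P@x + q@x subject to lo <= x <= hi via ADMM,
    using the splitting x = z with z constrained to the box [lo, hi]."""
    n = len(q)
    M = P + rho * np.eye(n)        # in a real solver this is factored once
    x = np.zeros(n); z = np.zeros(n); y = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(M, rho * z - y - q)  # x-update: linear solve
        z = np.clip(x + y / rho, lo, hi)         # z-update: box projection
        y = y + rho * (x - z)                    # dual ascent step
    return z

# Example: with P = I and q = -c the QP is projection of c onto the box.
c = np.array([2.0, -1.0, 0.5])
x_star = admm_box_qp(np.eye(3), -c, 0.0, 1.0)    # converges to [1, 0, 0.5]
```

All three steps are built from matrix-vector products, vector operations, and a reusable factorization, which is why the method maps well onto GPUs.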
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
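The basic object behind kernels of this kind is the compressed sparse row (CRS/CSR) matrix; GHOST itself favours a SIMD- and GPU-friendly variant (SELL-C-σ), so the following is a generic illustrative sketch rather than GHOST's implementation. Note that every row is independent, which is exactly the property that MPI+X implementations exploit when distributing work across ranks, threads, and GPU blocks:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x with A in CSR form: row i owns the nonzeros
    data[indptr[i]:indptr[i+1]] in columns indices[indptr[i]:indptr[i+1]]."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):  # rows are independent: the natural unit of parallelism
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
y = csr_spmv(np.array([0, 2, 3, 5]),
             np.array([0, 2, 1, 0, 2]),
             np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
             np.ones(3))
```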
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids
This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify the primary factors responsible for this relatively poor performance as: insufficient available memory bandwidth, a low ratio of work to data size (a by-product of good algorithmic efficiency), and the nonscaling cost of synchronization and gather/scatter operations (under fixed-size problem scaling). This dissertation also illustrates how to reuse legacy scientific and engineering software within a library framework.
Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 10^5 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates.
Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousands (the expected level of concurrency on terascale architectures). We also evaluate the hybrid programming model (mixed distributed/shared) from a performance standpoint.
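The flavour of a memory-bandwidth-based performance model can be sketched in a few lines: a bandwidth-bound kernel cannot run faster than the machine's bandwidth divided by the bytes it must move per floating-point operation. The numbers below are purely illustrative, not measurements from the dissertation:

```python
def attainable_gflops(bytes_per_flop, peak_bw_gb_s, peak_gflop_s):
    """Roofline-style bound: performance is capped either by the
    floating-point peak or by how fast memory can feed the kernel."""
    return min(peak_gflop_s, peak_bw_gb_s / bytes_per_flop)

# A sparse matrix-vector product moves roughly 6 bytes per flop
# (an 8-byte value plus a 4-byte column index per 2 flops, ignoring
# vector traffic), so on a hypothetical 100 GB/s, 1000 GFLOP/s machine
# it is bandwidth-bound:
spmv_bound = attainable_gflops(6.0, 100.0, 1000.0)    # ~16.7 GFLOP/s
dense_bound = attainable_gflops(0.1, 100.0, 1000.0)   # compute-bound: 1000
```

Models of this kind explain why PDE codes with sparse, low-reuse kernels run far below peak, and why locality-improving transformations such as blocking and reordering pay off.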
Parallel implementation of the finite element method on shared memory multiprocessors
PhD Thesis
The work presented in this thesis concerns parallel methods for finite element
analysis. The research has been funded by British Gas and some of the presented
material involves work on their software. Practical problems involving the finite
element method can use a large amount of processing power and the execution
times can be very large. It is consequently important to investigate the possibilities
for the parallel implementation of the method. The research has been carried out
on an Encore Multimax, a shared memory multiprocessor with 14 identical CPU's.
We firstly experimented on autoparallelising a large British Gas finite element
program (GASP4) using Encore's parallelising Fortran compiler (epf). The parallel program generated by epf proved not to be efficient. The main reasons are the complexity of the code and the small grain of the parallelism. Since the program is hard to analyse for the compiler at high levels, only small-grain parallelism has been inserted automatically into the code. This involves a great deal of low-level synchronisations which produce large overheads and cause inefficiency. A detailed
analysis of the autoparallelised code has been made with a view to determining
the reasons for the inefficiency. Suggestions have also been made about writing
programs such that they are suitable for efficient autoparallelisation.
The finite element method consists of the assembly of a stiffness matrix and
the solution of a set of simultaneous linear equations. A sparse representation of
the stiffness matrix has been used to allow experimentation on large problems.
Parallel assembly techniques for the sparse representation have been developed.
Some of these methods have proved to be very efficient giving speed ups that are
near ideal.
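The assembly step lends itself to element-level parallelism: each element contributes a small dense matrix that is scatter-added into the global sparse matrix. The serial sketch below is a generic illustration (not the thesis's scheme); parallel versions process elements concurrently and must avoid races on entries shared between elements, e.g. by colouring the elements so that no two elements of one colour share a node.

```python
import numpy as np
from collections import defaultdict

def assemble(elements, element_stiffness):
    """Assemble a global sparse stiffness matrix (COO-style dictionary,
    (row, col) -> value) from per-element dense contributions."""
    K = defaultdict(float)
    for nodes in elements:                  # candidate unit of parallel work
        ke = element_stiffness(nodes)       # local dense element matrix
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                K[(i, j)] += ke[a, b]       # scatter-add: the shared-entry race
    return dict(K)

# Two 1D bar elements sharing node 1; both contribute to K[1, 1].
bar = lambda nodes: np.array([[1.0, -1.0], [-1.0, 1.0]])
K = assemble([(0, 1), (1, 2)], bar)
```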
For the solution phase, we have used the preconditioned conjugate gradient
method (PCG). An incomplete LU factorization of the stiffness matrix with no fill-in (ILU(0)) has been found to be an effective preconditioner. The factors can be
obtained at a low cost. We have parallelised all the steps of the PCG method. The
main bottleneck is the triangular solves (preconditioning operations) at each step.
Two parallel methods of triangular solution have been implemented. One is based
on level scheduling (row-oriented parallelism) and the other is a new approach
called independent columns (column-oriented parallelism). The algorithms have
been tested for row and red-black orderings of the nodal unknowns in the finite
element meshes considered.
The best speed ups obtained are 7.29 (on 12 processors) for level scheduling
and 7.11 (on 12 processors) for independent columns. Red-black ordering gives
rise to better parallel performance than row ordering in general. An analysis of
methods for the improvement of the parallel efficiency has been made.
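The row-oriented level-scheduling idea mentioned above can be sketched compactly: row i of a lower-triangular solve depends only on the earlier rows in which it has nonzeros, and rows assigned to the same level are mutually independent. This is a generic illustration, not the thesis's implementation:

```python
def level_schedule(indptr, indices):
    """Assign each row of a sparse lower-triangular matrix (CSR, diagonal
    included) to a level: level(i) = 1 + max level of the rows it depends on.
    All rows within one level can be solved in parallel; levels run in order."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        deps = [level[j] + 1
                for j in indices[indptr[i]:indptr[i + 1]] if j < i]
        level[i] = max(deps, default=0)
    return level

# Structure: row 1 depends on row 0, row 3 depends on row 1, row 2 is free.
lv = level_schedule([0, 1, 3, 4, 6], [0, 0, 1, 2, 1, 3])   # [0, 1, 0, 2]
```

The depth of the level structure bounds the exposed parallelism, which is why orderings such as red-black, which shorten the dependency chains, improve the speedups reported above.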
Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing
Book of Abstracts of CSC14, edited by Bora Uçar. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France, on 21st to 23rd July, 2014. This two-and-a-half-day event marked the sixth in a series that started ten years ago in San Francisco, USA. The CSC14 workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. All three invited talks focused on two fields of research: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program "Investissements d'Avenir" ANR-11-IDEX-0007 operated by the French National Research Agency), and by SIAM.