13,977 research outputs found
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
Fast Hardware Implementations of Static P Systems
In this article we present a simulator of non-deterministic static P systems
using Field Programmable Gate Array (FPGA) technology. Its major feature
is a high performance, achieving a constant processing time for each transition. Our
approach is based on representing all possible applications as words of some regular
context-free language. Then, using formal power series it is possible to obtain the
number of possibilities and select one of them following a uniform distribution, in
a fair and non-deterministic way. According to these ideas, we yield an implementation
whose results show an important speed-up, with a strong independence from
the size of the P system.Ministry of Science and Innovation of the Spanish Government under the project TEC2011-27936 (HIPERSYS)European Regional Development Fund (ERDF)Ministry of Education of Spain (FPU grant AP2009-3625)ANR project SynBioTI
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Extensible Component Based Architecture for FLASH, A Massively Parallel, Multiphysics Simulation Code
FLASH is a publicly available high performance application code which has
evolved into a modular, extensible software system from a collection of
unconnected legacy codes. FLASH has been successful because its capabilities
have been driven by the needs of scientific applications, without compromising
maintainability, performance, and usability. In its newest incarnation, FLASH3
consists of inter-operable modules that can be combined to generate different
applications. The FLASH architecture allows arbitrarily many alternative
implementations of its components to co-exist and interchange with each other,
resulting in greater flexibility. Further, a simple and elegant mechanism
exists for customization of code functionality without the need to modify the
core implementation of the source. A built-in unit test framework providing
verifiability, combined with a rigorous software maintenance process, allow the
code to operate simultaneously in the dual mode of production and development.
In this paper we describe the FLASH3 architecture, with emphasis on solutions
to the more challenging conflicts arising from solver complexity, portable
performance requirements, and legacy codes. We also include results from user
surveys conducted in 2005 and 2007, which highlight the success of the code.Comment: 33 pages, 7 figures; revised paper submitted to Parallel Computin
- …