1,908 research outputs found
Julia: A Fresh Approach to Numerical Computing
Bridging cultures that have often been distant, Julia combines expertise from
the diverse fields of computer science and computational science to create a
new approach to numerical computing. Julia is designed to be easy and fast.
Julia questions notions generally held as "laws of nature" by practitioners of
numerical computing:
1. High-level dynamic programs have to be slow.
2. One must prototype in one language and then rewrite in another language
for speed or deployment, and
3. There are parts of a system for the programmer, and other parts best left
untouched as they are built by the experts.
We introduce the Julia programming language and its design --- a dance
between specialization and abstraction. Specialization allows for custom
treatment. Multiple dispatch, a technique from computer science, picks the
right algorithm for the right circumstance. Abstraction, what good computation
is really about, recognizes what remains the same after differences are
stripped away. Abstractions in mathematics are captured as code through another
technique from computer science, generic programming.
Julia shows that one can have machine performance without sacrificing human
convenience.Comment: 37 page
Execution models for mapping programs onto distributed memory parallel computers
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program
Parameterized Construction of Program Representations for Sparse Dataflow Analyses
Data-flow analyses usually associate information with control flow regions.
Informally, if these regions are too small, like a point between two
consecutive statements, we call the analysis dense. On the other hand, if these
regions include many such points, then we call it sparse. This paper presents a
systematic method to build program representations that support sparse
analyses. To pave the way to this framework we clarify the bibliography about
well-known intermediate program representations. We show that our approach, up
to parameter choice, subsumes many of these representations, such as the SSA,
SSI and e-SSA forms. In particular, our algorithms are faster, simpler and more
frugal than the previous techniques used to construct SSI - Static Single
Information - form programs. We produce intermediate representations isomorphic
to Choi et al.'s Sparse Evaluation Graphs (SEG) for the family of data-flow
problems that can be partitioned per variables. However, contrary to SEGs, we
can handle - sparsely - problems that are not in this family
Accelerating Homomorphic Encryption in the Cloud Environment through High-Level Synthesis and Reconfigurable Resources
The recent surge in cloud services is revolutionizing the way that data is stored and processed. Everyone with an internet connection, from large corporations to small companies and private individuals, now have access to cutting-edge processing power and vast amounts of data storage. This rise in cloud computing and storage, however, has brought with it a need for a new type of security. In order to have access to cloud services, users must allow the service provider to have full access to their private, unencrypted data. Users are required to trust the integrity of the service provider and the security of its data centers. The recent development of fully homomorphic encryption schemes can offer a solution to this dilemma. These algorithms allow encrypted data to be used in computations without ever stripping the data of the protection of encryption. Unfortunately, the demanding memory requirements and computational complexity of the proposed schemes has hindered their wide-scale use. Custom hardware accelerators for homomorphic encryption could be implemented on the increasing number of reconfigurable hardware resources in the cloud, but the long development time required for these processors would lead to high production costs. This research seeks to develop a strategy for faster development of homomorphic encryption hardware accelerators using the process of High-Level Synthesis. Insights from existing number theory software libraries and custom hardware accelerators are used to develop a scalable, proof-of-concept software implementation of Karatsuba modular polynomial multiplication. This implementation was designed to be used with High-Level Synthesis to accelerate the large modular polynomial multiplication operations required by homomorphic encryption. The accelerator generated from this implementation by the High-Level Synthesis tool Vivado HLS achieved significant speedup over the implementations available in the highly-optimized FLINT software library
Static Analysis of OpenStream Programs
International audienceThis paper studies the applicability of polyhedral techniques to the parallel language Open- Stream [25]. When applicable, polyhedral techniques are invaluable for compile-time debugging and for generating efficient code well suited to a target architecture. OpenStream is a two-level language in which a control program directs the initialization of parallel task instances that communicate through streams, with possibly multiple writers and readers. It has a fairly complex semantics in its most general setting, but we restrict ourselves to the case where the control program is sequential, which is representative of the majority of the OpenStream applications. This restriction offers deterministic concurrency by construction, but deadlocks are still possible. We show that, if the control program is polyhedral, one may statically compute, for each task instance, the read and write indices to each of its streams, and thus reason statically about the dependences among task instances (the only scheduling constraints in this polyhedral subset). These indices may be polynomials of arbitrary degree, thus requiring to extend to polynomials the standard polyhedral techniques for dependence analysis, scheduling, and deadlock detection. Modern SMT allow to solve polynomial problems, albeit with no guarantee of success; the approach of Feautrier [10] may offer an alternative solution. We also establish two important results related to deadlocks in OpenStream: 1) a characterization of deadlocks in terms of dependence paths, which implies that streams can be safely bounded as soon as a schedule exists with such sizes, 2) the proof that deadlock detection is undecidable, even for polyhedral OpenStream
Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures.
A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks.
In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications.
Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes.
Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable.
El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks.
En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores.
Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones.
Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version
Partitioning SKA Dataflows for Optimal Graph Execution
Optimizing data-intensive workflow execution is essential to many modern
scientific projects such as the Square Kilometre Array (SKA), which will be the
largest radio telescope in the world, collecting terabytes of data per second
for the next few decades. At the core of the SKA Science Data Processor is the
graph execution engine, scheduling tens of thousands of algorithmic components
to ingest and transform millions of parallel data chunks in order to solve a
series of large-scale inverse problems within the power budget. To tackle this
challenge, we have developed the Data Activated Liu Graph Engine (DALiuGE) to
manage data processing pipelines for several SKA pathfinder projects. In this
paper, we discuss the DALiuGE graph scheduling sub-system. By extending
previous studies on graph scheduling and partitioning, we lay the foundation on
which we can develop polynomial time optimization methods that minimize both
workflow execution time and resource footprint while satisfying resource
constraints imposed by individual algorithms. We show preliminary results
obtained from three radio astronomy data pipelines.Comment: Accepted in HPDC ScienceCloud 2018 Worksho
Exploring the acceleration of Nekbone on reconfigurable architectures
Hardware technological advances are struggling to match scientific ambition,
and a key question is how we can use the transistors that we already have more
effectively. This is especially true for HPC, where the tendency is often to
throw computation at a problem whereas codes themselves are commonly bound,
at-least to some extent, by other factors. By redesigning an algorithm and
moving from a Von Neumann to dataflow style, then potentially there is more
opportunity to address these bottlenecks on reconfigurable architectures,
compared to more general-purpose architectures.
In this paper we explore the porting of Nekbone's AX kernel, a widely popular
HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation
is an important part of this code, it is also memory bound on CPUs, and a key
question is whether one can ameliorate this by leveraging FPGAs. We first
explore optimisation strategies for obtaining good performance, with over a
4000 times runtime difference between the first and final version of our kernel
on FPGAs. Subsequently, performance and power efficiency of our approach on an
Alveo U280 are compared against a 24 core Xeon Platinum CPU and NVIDIA V100
GPU, with the FPGA outperforming the CPU by around four times, achieving almost
three quarters the GPU performance, and significantly more power efficient than
both. The result of this work is a comparison and set of techniques that both
apply to Nekbone on FPGAs specifically and are also of interest more widely in
accelerating HPC codes on reconfigurable architectures.Comment: Pre-print of paper accepted to IEEE/ACM International Workshop on
Heterogeneous High-performance Reconfigurable Computing (H2RC
Inferring Energy Bounds via Static Program Analysis and Evolutionary Modeling of Basic Blocks
The ever increasing number and complexity of energy-bound devices (such as
the ones used in Internet of Things applications, smart phones, and mission
critical systems) pose an important challenge on techniques to optimize their
energy consumption and to verify that they will perform their function within
the available energy budget. In this work we address this challenge from the
software point of view and propose a novel parametric approach to estimating
tight bounds on the energy consumed by program executions that are practical
for their application to energy verification and optimization. Our approach
divides a program into basic (branchless) blocks and estimates the maximal and
minimal energy consumption for each block using an evolutionary algorithm. Then
it combines the obtained values according to the program control flow, using
static analysis, to infer functions that give both upper and lower bounds on
the energy consumption of the whole program and its procedures as functions on
input data sizes. We have tested our approach on (C-like) embedded programs
running on the XMOS hardware platform. However, our method is general enough to
be applied to other microprocessor architectures and programming languages. The
bounds obtained by our prototype implementation can be tight while remaining on
the safe side of budgets in practice, as shown by our experimental evaluation.Comment: Pre-proceedings paper presented at the 27th International Symposium
on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur,
Belgium, 10-12 October 2017 (arXiv:1708.07854). Improved version of the one
presented at the HIP3ES 2016 workshop (v1): more experimental results (added
benchmark to Table 1, added figure for new benchmark, added Table 3),
improved Fig. 1, added Fig.
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts.
In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example.
Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth.
We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations
- …