2,724 research outputs found
APOLLO: Automatic speculative POLyhedral Loop Optimizer
International audienceA few weeks ago, we were glad to announce the first release of Apollo, the Automatic speculative POLyhedral Loop Opti-mizer. Apollo applies polyhedral optimizations on-the-fly to loop nests, whose control flow and memory access patterns cannot be determined at compile-time. In contrast to existing tools, Apollo can handle any kind of loop nest, whose memory accesses can be performed through pointers and in-directions. At runtime, Apollo builds a predictive polyhedral model, which is used for speculative optimization including parallelization. Being a dynamic system, Apollo can even apply the polyhedral model to nonlinear loops. This paper describes Apollo from the perspective of a user, as well as some of its main contributions and mechanisms, including the just-in-time polyhedral compilation, that significantly extends the scope of polyhedral techniques
Loo.py: transformation-based code generation for GPUs and CPUs
Today's highly heterogeneous computing landscape places a burden on
programmers wanting to achieve high performance on a reasonably broad
cross-section of machines. To do so, computations need to be expressed in many
different but mathematically equivalent ways, with, in the worst case, one
variant per target machine.
Loo.py, a programming system embedded in Python, meets this challenge by
defining a data model for array-style computations and a library of
transformations that operate on this model. Offering transformations such as
loop tiling, vectorization, storage management, unrolling, instruction-level
parallelism, change of data layout, and many more, it provides a convenient way
to capture, parametrize, and re-unify the growth among code variants. Optional,
deep integration with numpy and PyOpenCL provides a convenient computing
environment where the transition from prototype to high-performance
implementation can occur in a gradual, machine-assisted form
AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests
The last improvements in programming languages, programming models, and
frameworks have focused on abstracting the users from many programming issues.
Among others, recent programming frameworks include simpler syntax, automatic
memory management and garbage collection, which simplifies code re-usage
through library packages, and easily configurable tools for deployment. For
instance, Python has risen to the top of the list of the programming languages
due to the simplicity of its syntax, while still achieving a good performance
even being an interpreted language. Moreover, the community has helped to
develop a large number of libraries and modules, tuning them to obtain great
performance.
However, there is still room for improvement when preventing users from
dealing directly with distributed and parallel computing issues. This paper
proposes and evaluates AutoParallel, a Python module to automatically find an
appropriate task-based parallelization of affine loop nests to execute them in
parallel in a distributed computing infrastructure. This parallelization can
also include the building of data blocks to increase task granularity in order
to achieve a good execution performance. Moreover, AutoParallel is based on
sequential programming and only contains a small annotation in the form of a
Python decorator so that anyone with little programming skills can scale up an
application to hundreds of cores.Comment: Accepted to the 8th Workshop on Python for High-Performance and
Scientific Computing (PyHPC 2018
Plastic deformation of dense nanocrystalline yttrium oxide at elevated temperatures
Nanocrystalline yttrium oxide, Y2O3 with 110 nm average grain size was plastically deformed between 800 ◦C and 1100 ◦C by compression at different strain rates and by creep at different stresses. The onset temperature for plasticity was at 1000 ◦C. Yield stress was strongly temperature dependent and the strain hardening disappeared at 1100 ◦C. The polyhedral and equiaxed grain morphology were preserved in the deformed specimens. The experimentally measured and theoretically calculated stress exponent n = 2 was consistent with the plastic deformation by grain boundary sliding. Decrease in the grain size was consistent with decrease in the brittle to ductile transition temperature
Symbolic and analytic techniques for resource analysis of Java bytecode
Recent work in resource analysis has translated the idea of amortised resource analysis to imperative languages using a program logic that allows mixing of assertions about heap shapes, in the tradition of separation logic, and assertions about consumable resources. Separately, polyhedral methods have been used to calculate bounds on numbers of iterations in loop-based programs. We are attempting to combine these ideas to deal with Java programs involving both data structures and loops, focusing on the bytecode level rather than on source code
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs whose programming is conventionally a RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve the FPGA
programmability, it still leaves programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels
- …