435 research outputs found
MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators
Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent. That is, optimization preferred on one machine may be not suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies, in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and the pipelining parallelism. (3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile-time
Loo.py: From Fortran to performance via transformation and substitution rules
A large amount of numerically-oriented code is written and is being written
in legacy languages. Much of this code could, in principle, make good use of
data-parallel throughput-oriented computer architectures. Loo.py, a
transformation-based programming system targeted at GPUs and general
data-parallel architectures, provides a mechanism for user-controlled
transformation of array programs. This transformation capability is designed to
not just apply to programs written specifically for Loo.py, but also those
imported from other languages such as Fortran. It eases the trade-off between
achieving high performance, portability, and programmability by allowing the
user to apply a large and growing family of transformations to an input
program. These transformations are expressed in and used from Python and may be
applied from a variety of settings, including a pragma-like manner from other
languages.Comment: ARRAY 2015 - 2nd ACM SIGPLAN International Workshop on Libraries,
Languages and Compilers for Array Programming (ARRAY 2015
Automatic translation of non-repetitive OpenMP to MPI
Cluster platforms with distributed-memory architectures are becoming increasingly available low-cost solutions for high performance computing. Delivering a productive programming environment that hides the complexity of clusters and allows writing efficient programs is urgently needed. Despite multiple efforts to provide shared memory abstraction, message-passing (MPI) is still the state-of-the-art programming model for distributed-memory architectures. ^ Writing efficient MPI programs is challenging. In contrast, OpenMP is a shared-memory programming model that is known for its programming productivity. Researchers introduced automatic source-to-source translation schemes from OpenMP to MPI so that programmers can use OpenMP while targeting clusters. Those schemes limited their focus on OpenMP programs with repetitive communication patterns (where the analysis of communication can be simplified). This dissertation reduces this limitation and presents a novel OpenMP-to-MPI translation scheme that covers OpenMP programs with both repetitive and non-repetitive communication patterns. We target laboratory-size clusters of ten to hundred nodes (commonly found in research laboratories and small enterprises). ^ With our translation scheme, six non-repetitive and four repetitive OpenMP benchmarks have been efficiently scaled to a cluster of 64 cores. By contrast, the state-of-the-art translator scaled only the four repetitive benchmarks. In addition, our translation scheme was shown to outperform or perform as well as the state-of-the-art translator. We also compare the translation scheme with available hand-coded MPI and Unified Parallel C (UPC) programs
Improving Utility of GPU in Accelerating Industrial Applications with User-centred Automatic Code Translation
SMEs (Small and medium-sized enterprises), particularly those whose business is focused on developing innovative produces, are limited by a major bottleneck on the speed of computation in many applications. The recent developments in GPUs have been the marked increase in their versatility in many computational areas. But due to the lack of specialist GPU (Graphics processing units) programming skills, the explosion of GPU power has not been fully utilized in general SME applications by inexperienced users. Also, existing automatic CPU-to-GPU code translators are mainly designed for research purposes with poor user interface design and hard-to-use. Little attentions have been paid to the applicability, usability and learnability of these tools for normal users. In this paper, we present an online automated CPU-to-GPU source translation system, (GPSME) for inexperienced users to utilize GPU capability in accelerating general SME applications. This system designs and implements a directive programming model with new kernel generation scheme and memory management hierarchy to optimize its performance. A web-service based interface is designed for inexperienced users to easily and flexibly invoke the automatic resource translator. Our experiments with non-expert GPU users in 4 SMEs reflect that GPSME system can efficiently accelerate real-world applications with at least 4x and have a better applicability, usability and learnability than existing automatic CPU-to-GPU source translators
CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures
The use of graphics processing units (GPUs) in
high-performance parallel computing continues to become more
prevalent, often as part of a heterogeneous system. For years,
CUDA has been the de facto programming environment for
nearly all general-purpose GPU (GPGPU) applications. In spite
of this, the framework is available only on NVIDIA GPUs,
traditionally requiring reimplementation in other frameworks
in order to utilize additional multi- or many-core devices.
On the other hand, OpenCL provides an open and vendorneutral
programming environment and runtime system. With
implementations available for CPUs, GPUs, and other types of
accelerators, OpenCL therefore holds the promise of a “write
once, run anywhere” ecosystem for heterogeneous computing.
Given the many similarities between CUDA and OpenCL,
manually porting a CUDA application to OpenCL is typically
straightforward, albeit tedious and error-prone. In response
to this issue, we created CU2CL, an automated CUDA-to-
OpenCL source-to-source translator that possesses a novel design
and clever reuse of the Clang compiler framework. Currently,
the CU2CL translator covers the primary constructs found in
CUDA runtime API, and we have successfully translated many
applications from the CUDA SDK and Rodinia benchmark suite.
The performance of our automatically translated applications via
CU2CL is on par with their manually ported countparts
Towards A Quasi High Level Compiler Comparative and Attributive Model for OpenMP Programs
In order to understand the behavior of OpenMP programs, special tools and adaptive techniques are needed for performance analysis. However, these tools provide low level profile information at the assembly and functions boundaries via instrumentation at the binary or code level, which are very hard to interpret. Hence, this thesis proposes a new model for OpenMP enabled compilers that assesses the performance differences in well defined formulations by dividing OpenMP program conditions into four distinct states which account for all the possible cases that an OpenMP program can take. An improved version of the standard performance metrics is proposed: speedup, overhead and efficiency based on the model categorization that is state\u27s aware. Moreover, an algorithmic approach to find patterns between OpenMP compilers is proposed which is verified along with the model formulations experimentally. Finally, the thesis reveals the mathematical model behind the optimum performance for any OpenMP program
- …