101 research outputs found

Transforming non textually aligned SPMD programs into textually aligned SPMD programs by using rewriting rules

Author: Bousdira Wadoud
Publication venue: HAL CCSD
Publication date: 15/07/2019
The problem of analyzing parallel programs that access shared memory and use barrier synchronization is known to be hard. For a special case of those programs with minimal SPMD (Single Program Multiple Data) constructs, a formal definition of textually aligned barriers with an operational semantics has been proposed in previous work. Textual alignment of the synchronization barriers, as defined there, prevents deadlocks. However, not all SPMD programs satisfy the textual alignment property. We propose a set of transformation rules, based on rewriting techniques, that turn a non-textually aligned program into a textually aligned one, so that a simple static analysis for deadlock detection can be applied. We show that the rewrite rules form a terminating confluent system and we prove that the transformation rules preserve the semantics of the programs.
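As a rough illustration of the property this abstract discusses (not the paper's rewrite rules), the sketch below uses Python's `threading.Barrier`. In `non_aligned`, processes synchronize through two textually different barrier calls, one in each branch of a conditional on the process id. In `aligned`, the barrier is hoisted out of the conditional so that every process synchronizes at the same program point. The helper `run_spmd`, the function names, and the hoisting step are all assumptions made for illustration.

```python
import threading

def run_spmd(body, nprocs=4):
    """Run body(pid, barrier, record) on nprocs threads, SPMD style."""
    barrier = threading.Barrier(nprocs)
    log = []
    lock = threading.Lock()

    def record(entry):
        # Append to the shared log under a lock.
        with lock:
            log.append(entry)

    threads = [threading.Thread(target=body, args=(pid, barrier, record))
               for pid in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

def non_aligned(pid, barrier, record):
    # Two textual barrier sites: which one a process reaches depends on pid.
    if pid % 2 == 0:
        record(("work_even", pid))
        barrier.wait()  # barrier site A
    else:
        record(("work_odd", pid))
        barrier.wait()  # barrier site B
    record(("after", pid))

def aligned(pid, barrier, record):
    # Same behavior, but all processes reach one textual barrier site.
    if pid % 2 == 0:
        record(("work_even", pid))
    else:
        record(("work_odd", pid))
    barrier.wait()  # single barrier site
    record(("after", pid))
```

Both versions terminate and produce the same synchronization behavior here; the point is only that the second form makes it syntactically evident that every process executes the same barrier.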

Language Constructs for Data Partitioning and Distribution

Publication venue: 'Hindawi Limited'
Publication date: 01/01/1995
Crossref

An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

Author: Samuel J. Parker (7203041)
Publication date: 01/01/2015
Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration).
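Work-item coalescing, mentioned in this abstract, can be pictured as wrapping a per-work-item kernel body in a loop so that one core executes a whole work-group sequentially. The sketch below is a minimal Python model of that idea, not the thesis's LLVM implementation; `saxpy_item` and `run_coalesced` are hypothetical names.

```python
def saxpy_item(gid, a, x, y, out):
    """Kernel body for a single work-item with global id gid."""
    out[gid] = a * x[gid] + y[gid]

def run_coalesced(kernel, global_size, local_size, *args):
    """Execute each work-group as one sequential loop over its work-items."""
    for group_start in range(0, global_size, local_size):
        # The last group may be partial if global_size is not a multiple
        # of local_size.
        group_len = min(local_size, global_size - group_start)
        for lid in range(group_len):
            kernel(group_start + lid, *args)

out = [0.0] * 6
run_coalesced(saxpy_item, 6, 4, 2.0,
              [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [10.0] * 6, out)
```

In this model the outer loop is what would be spread across cores. Kernels that contain barriers need extra care, since a single loop over work-items cannot simply run past a barrier.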

Loughborough University Institutional Repository

Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

Author: Dietrich Robert
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 15/11/2019
The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. 
The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
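The wait-state idea in this abstract can be illustrated on the simplest case, a single barrier-like synchronization: every execution stream that arrives early waits for the last one, and the last arriver lies on the critical path. The sketch below is an assumption-laden toy, not the thesis's analysis; `barrier_wait_states` and the stream names are hypothetical.

```python
def barrier_wait_states(arrivals):
    """Given {stream: arrival time at a barrier}, return per-stream
    waiting time and the stream that arrived last."""
    release = max(arrivals.values())  # the barrier opens at the last arrival
    waits = {stream: release - t for stream, t in arrivals.items()}
    critical = max(arrivals, key=arrivals.get)  # last arriver
    return waits, critical

waits, critical = barrier_wait_states(
    {"rank0": 4.0, "rank1": 7.5, "gpu0": 6.0})
```

Here `rank1` causes the waiting time of the other two streams, which is the kind of attribution the abstract's ranking of program regions relies on.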

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa

Visual programming in a heterogeneous multi-core environment

Author: Guerreiro Pedro Miguel Rito
Publication venue: 'Universidade de Evora'
Publication date: 01/01/2009
It is well known that technology nowadays evolves rapidly. New architectures are created to overcome particular limitations and problems. Sometimes this evolution is smooth and requires no adaptation; at other times it forces changes. Programming languages have always been the main communication link between the programmer and the computer. New languages keep appearing and existing ones keep evolving to adapt to new concepts and paradigms. This demands extra effort from programmers, who must stay aware of these changes. Visual programming may be a solution to this problem. Expressing functions as module boxes that receive a given input and return a given output may help programmers across the world by letting them abstract away from low-level details tied to a specific architecture. This thesis not only shows how the capabilities of the Cell/B.E. (which has a heterogeneous multi-core architecture) can be combined with OpenDX (which has a visual programming environment), but also demonstrates that this can be done without losing much performance.

Repositório Científico da Universidade de Évora

A visual programming tool for Fortran D

Author: Kondapaneni Prasanna K.; Pancake Cherri M.; Ward Christopher
Publication venue: 'Oregon State University'
Visual Fortran D (VFD) is a graphical tool to assist parallel programmers in specifying data distributions. Its target is Fortran D, an extension to Fortran 77 or Fortran 90 which supports data parallelism. VFD provides an intuitive framework where the user employs simple, fast graphical manipulations to specify how data is to be organized for distribution across multiple processors. The corresponding Fortran D statement is generated automatically from the graphical representation and displayed alongside it. Initial experimentation by users indicates that VFD improves the accuracy of data distributions. The ability to observe how a specification statement varies as the graphical representation is changed appears to make VFD a useful tool for teaching Fortran D concepts as well.
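To make "data distribution" concrete, the sketch below computes which processor owns each element of a one-dimensional array under block and cyclic layouts, two schemes commonly found in data-parallel Fortran dialects. The abstract does not say which layouts VFD handles, so the choice of schemes, the 0-based indexing, and the function names are assumptions for illustration only.

```python
def block_owner(i, n, p):
    """Owner of element i (0-based) when n elements are split into
    p contiguous blocks of size ceil(n / p)."""
    block = -(-n // p)  # ceiling division
    return i // block

def cyclic_owner(i, p):
    """Owner of element i (0-based) when elements are dealt out
    round-robin to p processors."""
    return i % p

def layout(n, p, kind):
    """List the owning processor of each of the n elements."""
    if kind == "block":
        return [block_owner(i, n, p) for i in range(n)]
    if kind == "cyclic":
        return [cyclic_owner(i, p) for i in range(n)]
    raise ValueError(f"unknown distribution kind: {kind}")
```

For example, `layout(8, 4, "block")` groups neighbouring elements on the same processor, while `layout(8, 4, "cyclic")` interleaves them.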

ScholarsArchive@OSU

FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis

Author: Oninda Mohammad Abdul Moin
Publication venue: 'University of Windsor Leddy Library'
Publication date: 17/11/2019
Machine Learning (ML) is one of the fastest growing and most demanding areas of computer science. The Self-Organizing Map (SOM), categorized as unsupervised ML, is a popular data-mining algorithm widely used in Artificial Neural Networks (ANNs) for mapping high-dimensional data into low-dimensional feature maps. SOM is computationally intensive and requires considerable computation time and power when dealing with large datasets. Many computationally intensive algorithms can be accelerated using Field-Programmable Gate Arrays (FPGAs), but this requires extensive hardware knowledge and longer development time under the traditional Hardware Description Language (HDL) based design methodology. Open Computing Language (OpenCL) is a standard framework for writing parallel computing programs that execute on heterogeneous computing systems. Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of the SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Units (CPUs) and Graphics Processing Units (GPUs), our OpenCL implementation on FPGAs provides superior speed performance and power consumption results. Stratix V achieves a speedup of 1.41x - 16.55x compared to AMD and Intel CPUs and 2.18x compared to an Nvidia GPU, whereas Arria 10 achieves a speedup of 1.63x - 19.15x compared to AMD and Intel CPUs and 2.52x compared to an Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x whereas Arria 10 is 15.82x and 15.93x more power efficient compared to CPU and GPU respectively.
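For readers unfamiliar with SOM, one training step in its textbook form consists of finding the best-matching unit (the neuron whose weight vector is closest to the input) and pulling nearby neurons toward the input. The pure-Python sketch below shows that step under common assumptions (squared Euclidean distance, Gaussian neighbourhood); it is not the OpenCL kernel from the thesis, and all names and parameters are hypothetical.

```python
import math

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - xi) ** 2 for w, xi in zip(weights[j], x)),
    )

def som_step(weights, coords, x, lr, sigma):
    """Apply one SOM update in place and return the BMU index.

    weights: list of weight vectors, one per neuron
    coords:  list of grid positions, one per neuron
    lr:      learning rate; sigma: neighbourhood width
    """
    bmu = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        # Squared grid distance between neuron j and the BMU.
        d2 = sum((a - b) ** 2 for a, b in zip(coords[j], coords[bmu]))
        h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighbourhood weight
        weights[j] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu
```

The distance search over all neurons is the part that dominates the cost for large maps, which is consistent with the abstract's motivation for hardware acceleration.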

Scholarship at UWindsor

Automatic parallelization by pattern-matching

Author: C. Lawson; Christopher W. W. Fraser; J. J. Dongarra; J. Saltz; K. Knobe; S. Wholey; V. Balasundaram
Publication venue: 'Springer Science and Business Media LLC'
Crossref

Eastern Michigan University Graduate Catalog, 2013-2014

Author: Office of the Registrar
Publication venue: DigitalCommons@EMU
Publication date: 01/01/2013
Eastern Michigan University: Digital Commons@EMU