33 research outputs found
A Compiler and Runtime Infrastructure for Automatic Program Distribution
This paper presents the design and the implementation of a compiler and runtime infrastructure for automatic program distribution. We are building a research infrastructure that enables experimentation with various program partitioning and mapping strategies and the study of automatic distribution's effect on resource consumption (e.g., CPU, memory, communication). Since many optimization techniques are faced with conflicting optimization targets (e.g., memory and communication), we believe that it is important to be able to study their interaction.
We present a set of techniques that enable flexible resource modeling and program distribution. These are: dependence analysis, weighted graph partitioning, code and communication generation, and profiling. We have developed these ideas in the context of the Java language. We present in detail the design and implementation of each of the techniques as part of our compiler and runtime infrastructure. Then, we evaluate our design and present preliminary experimental data for each component, as well as for the entire system
Performance Evaluation of Automatically Generated Data Parallel Programs
International audienceIn this paper, the problem of evaluating the performance of parallel programs generated by data parallel compilers is studied. These compilers take as input an application written in a sequential language augmented with data distribution directives and produce a parallel version based on the specifed partitioning of data. A methodology for evaluating the relationships existing among the program characteristics, the data distribution adopted, and the performance indices measured during the program execution is described. It consists of three phases: a "static" description of the program under study, a "dynamic" description, based on the measurement and the analysis of its execution on a real system, and the construction of a workload model, by using workload characterization techniques. Following such a methodology, decisions related to the selection of the data distribution to be adopted can be facilitated. The approach is exposed through the use of the Pandore environment, designed for the execution of sequential programs on distributed memory parallel computers. It is composed of a compiler, a runtime system and tools for trace and profile generation. The results of an experiment explaining the methodology are presented
A Survey on Parallel Architecture and Parallel Programming Languages and Tools
In this paper, we have presented a brief review on the evolution of parallel computing to multi - core architecture. The survey briefs more than 45 languages, libraries and tools used till date to increase performance through parallel programming. We ha ve given emphasis more on the architecture of parallel system in the survey
Automatic Runtime Calculation of Communications for Data-Parallel Expressions with Periodic Conditions
Producción CientíficaMany real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated to the data dependences on each case is cumbersome and error-prone. It is increasingly interesting to support these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes transparent to the programmer the management of aggregated communications for the chosen data partition. Our technique moves to runtime part of the compile-time analysis typically used to generate the communication code for affine expressions, introducing a complete new technique that also supports the periodic boundary conditions. We present an experimental study to evaluate our proposal using several study cases. Our experimental results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, reducing the development effort.2019-01-01MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 Network (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS). By the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain
ARTICLE NO. PC971367 A Library-Based Approach to Task Parallelism in a Data-Parallel Language
Pure data-parallel languages such as High Performance Fortran version 1 (HPF) do not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how these common parallel program structures can be represented, with only minor extensions to the HPF model, by using a coordination library based on the Message Passing Interface (MPI). This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/ MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compile
Reducing fine-grain communication overhead in multithread code generation for heterogeneous MPSoC
Heterogeneous MPSoCs present unique opportunities
for emerging embedded applications, which require both
high-performance and programmability. Although,
software programming for these MPSoC architectures
requires tedious and error-prone tasks, thereby automatic
code generation tools are required. A code generation
method based on fine-grain specification can provide
more design space and optimization opportunities, such as
exploiting fine-level parallelism and more efficient
partitions. However, when partitioned, fine-grain models
may require a large number of inter-processor communications,
decreasing the overall system performance. This
paper presents a Simulink-based multithread code
generation method, which applies Message Aggregation
optimization technique to reduce the number of interprocessor
communications. This technique reduces the
communication overheads in terms of execution time by
reduction on the number of messages exchanged and in
terms of memory size by the reduction on the number of
channels. The paper also presents experiment results for
one multimedia application, showing performance
improvements and memory reduction obtained with
Message Aggregation technique
A framework for argument-based task synchronization with automatic detection of dependencies
[Abstract] Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow to express arbitrary task-level parallel computations without a predefined structure request in turn that the user explicitly specifies the synchronization needed among the parallel tasks.
In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is the first generic approach to express parallelism we know of that requires neither explicit synchronizations nor a detail of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP.Xunta de Galicia; INCITE08PXIB105161PRMinisterio de Ciencia e Innovación; TIN2010-16735Ministerio de Educación de España; AP2009-475