Automation of Determination of Optimal Intra-Compute Node Parallelism

Author: Brown James C.
Gómez-Iglesias Antonio
Publication date: 01/01/2016

Maximizing the productivity of modern multicore and manycore chips requires optimizing parallelism at the compute node level. This is, however, a complex, multi-step process: an iterative method that requires determining the optimal degree of parallel scalability and optimizing memory access behavior. Further, there are multiple cases to consider: programs that use only MPI or only OpenMP, and hybrid (MPI+OpenMP) programs. This paper presents a set of three coordinated workflows for determining the optimal parallelism at the program level for MPI programs and at the loop level for hybrid (MPI+OpenMP) cases. The paper also details mostly automated implementations of these workflows using the PerfExpert infrastructure. Finally, the paper presents case studies demonstrating both the applicability and the effectiveness of optimizing parallelism at the compute node level. The results shown in the paper provide valuable information for advancing toward full automation of the workflows. The software implementing the parallelism scalability optimization is open source and available for download.

Texas Advanced Computing Center (TACC); Computer Science

Texas ScholarWorks

An efficient algorithm for pointer-to-array access conversion for compiling and optimizing DSP applications

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2001

Crossref

Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter

Author: Kotha Aparna
Publication date: 01/01/2013

Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect by manually rewriting source code in parallel is prohibitively expensive. Automatic parallelization technology is therefore an attractive alternative. We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program, and the output is a parallel binary executable. The advantages of parallelization in a binary rewriter over a compiler include (i) compatibility with all compilers and languages; (ii) high economic feasibility from avoiding repeated compiler implementation; (iii) applicability to legacy binaries; and (iv) applicability to assembly-language programs. Adapting existing parallelizing compiler methods that work on source code to work on binary programs instead is a significant challenge, primarily because the symbolic and array index information used in existing compiler parallelizers is not available in a binary. We show how to adapt existing parallelization methods to achieve equivalent parallelization from a binary without such information. We have also designed an affine cache reuse model that works inside a binary rewriter, building on the parallelization techniques. It quantifies cache reuse in terms of the number of cache lines that will be required when a loop dimension is considered for the innermost position in a loop nest. This cache metric can be used to reason about the affine code that results when affine code is transformed using affine transformations. Hence, it can be used to evaluate candidate transformation sequences to improve run time directly from a binary.
Results using our x86 binary rewriter, SecondWrite, on a suite of dense-matrix regular programs from the Polybench benchmark suite show a geometric-mean speedup of 6.81X from binary and 8.9X from source with 8 threads compared to the input serial binary on an x86 Xeon E5530 machine, and 8.31X from binary and 9.86X from source with 24 threads compared to the input serial binary on an x86 E7450 machine. Such regular loops are an important component of scientific and multimedia workloads, and are even present to a limited extent in otherwise non-regular programs. Further, in this thesis we present a novel algorithm that significantly enhances past techniques for loops with unknown loop bounds by guessing the loop bounds using only the memory expressions present in a loop. It then inserts run-time checks to test whether these guesses were correct; if they were, the parallel version of the loop executes, otherwise the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001, and unlike previous methods, the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Among the 8 affine benchmarks in these suites, the best existing binary parallelization method achieves a geometric-mean speedup of 1.33X, whereas our method achieves a speedup of 2.96X. This is close to the speedup from source code of 2.8X.

CiteSeerX

Digital Repository at the University of Maryland

Aplicaciones del cómputo científico: mantenimiento del software heredado (Scientific Computing Applications: Maintaining Legacy Software)

Author: Méndez Mariano
Publication venue: Universidad Nacional de La Plata
Publication date: 21/04/2016

Scientific computing applications can be considered the longest-lived type of software ever created. Today, prominent examples of this kind of software can be found across many scientific disciplines, such as physics, chemistry, mathematics, biology, and economics. Among the most relevant current examples are the Global Climate Models used for climate research. Scientists have been developing software since the first programming languages appeared more than 76 years ago. Fortran was the first high-level language created, the first language to have its own standard, and, together with C, the most widely used in HPC. This thesis introduces a new software development methodology called Change Driven Development (CDD), initially created for the maintenance process and based on three aspects: essential aspects of software (change), highly integrated development tools, and source code transformations (restructuring and refactoring). The thesis describes the methodology in detail and validates it through 4 case studies of diverse kinds.

Facultad de Informátic

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Servicio de Difusión de la Creación Intelectual

Restructuring Fortran Programs for Cedar

Author: David Padua
Greg Jaxon
Jay Hoeflinger
Rudolf Eigenmann
Zhiyuan Li
Publication date: 01/01/1993

This paper reports on the status of the Fortran translator for the Cedar computer at the end of March, 1991. A brief description of the Cedar Fortran language is followed by a discussion of the Fortran 77 to Cedar Fortran parallelizer that describes the techniques currently being implemented. A collection of experiments illustrates the effectiveness of the current implementation and points toward new approaches to be incorporated into the system in the near future.

1 Introduction

The University of Illinois has been a pioneer in the development of program translation techniques for vector and parallel computers since the late 1960s, when Illiac IV was developed. It is therefore natural that automatic parallelization has become one of the major concerns of the Cedar project, the latest machine building effort of the University of Illinois. The Cedar machine is a hierarchical multi-processor. It supports several levels of parallelism and provides data storage at the processor, cluster, an..

CiteSeerX
