59,308 research outputs found
Analysis of Memory Latencies in Multi-Processor Systems
Predicting timing behavior is key to efficient embedded
real-time system design and verification. Current approaches
to determine end-to-end latencies in parallel heterogeneous
architectures focus on performance analysis either
on task or system level. Especially memory accesses,
basic operations of embedded application, cannot be accurately
captured on a single level alone: While task level
methods simplify system behavior, system level methods
simplify task behavior. Both perspectives lead to overly pessimistic
estimations.
To tackle these complex interactions we integrate task
and system level analysis. Each analysis level is provided
with the necessary data to allow precise computations,
while adequate abstraction prevents high time complexity
Application Development using Compositional Performance Analysis
A parallel programming archetype [Cha94, CMMM95] is an abstraction that captures the common features of a class of problems with a similar computational structure and combines them with a parallelization strategy to produce a pattern of dataflow and communication. Such abstractions are useful in application development, both as a conceptual framework and as a basis for tools and techniques. The efficiency of a parallel program can depend a great deal on how its data and tasks are decomposed and distributed. This thesis describes a simple performance evaluation methodology that includes an analytic model for predicting the performance of parallel and distributed computations developed for multicomputer machines and networked personal computers. This analytic model can be supplemented by a simulation infrastructure for application writers to use when developing parallel programs using archetypes. These performance evaluation tools were developed with the following restricted goal in mind: We require accuracy of the analytic model and simulation infrastructure only to the extent that they suggest directions for the programmer to make the appropriate optimizations. This restricted goal sacrifices some accuracy, but makes the tools simpler and easier to use. A programmer can use these tools to design programs with decomposition and distribution specialized to a given machine configuration. By instantiating a few architecture-based parameters, the model can be employed in the performance analysis of data-parallel applications, guiding process generation, communication, and mapping decisions. The model is language-independent and machine-independent; it can be applied to help programmers make decisions about performance-affecting parameters as programs are ported across architectures and languages. Furthermore, the model incorporates both platform-specific and application-specific aspects, and it allows programmers to experiment with tradeoffs better than either strictly simulation-based or purely theoretical models. In addition, the model was designed to be simple. In summary, this thesis outlines a simple method for benchmarking a parallel communication library and for using the results to model the performance of applications developed with that communication library. We use compositional performance analysis - decomposing a parallel program into its modular parts and analyzing their respective performances - to gain perspective on the performance of the whole program. This model is useful for predicting parallel program execution times for different types of program archetypes (e.g., mesh and mesh-spectral), using communication libraries built with different message-passing schemes (e.g., Fortran M and Fortran with MPI) running on different architectures (e.g., IBM SP2 and a network of Pentium personal computers)
Towards an Achievable Performance for the Loop Nests
Numerous code optimization techniques, including loop nest optimizations,
have been developed over the last four decades. Loop optimization techniques
transform loop nests to improve the performance of the code on a target
architecture, including exposing parallelism. Finding and evaluating an
optimal, semantic-preserving sequence of transformations is a complex problem.
The sequence is guided using heuristics and/or analytical models and there is
no way of knowing how close it gets to optimal performance or if there is any
headroom for improvement. This paper makes two contributions. First, it uses a
comparative analysis of loop optimizations/transformations across multiple
compilers to determine how much headroom may exist for each compiler. And
second, it presents an approach to characterize the loop nests based on their
hardware performance counter values and a Machine Learning approach that
predicts which compiler will generate the fastest code for a loop nest. The
prediction is made for both auto-vectorized, serial compilation and for
auto-parallelization. The results show that the headroom for state-of-the-art
compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to
1.71x for the auto-parallelized code. These results are based on the Machine
Learning predictions.Comment: Accepted at the 31st International Workshop on Languages and
Compilers for Parallel Computing (LCPC 2018
Recommended from our members
Solving large scale linear programming
The interior point method (IPM) is now well established as a competitive technique for solving very large scale linear programming problems. The leading variant of the interior point method is the primal dual - predictor corrector algorithm due to Mehrotra. The main computational steps of this algorithm are the repeated calculation and solution of a large sparse positive definite system of equations.
We describe an implementation of the predictor corrector IPM algorithm on MasPar, a massively parallel SIMD computer. At the heart of the implemen-tation is a parallel Cholesky factorization algorithm for sparse matrices. Our implementation uses a new scheme of mapping the matrix onto the processor grid of the MasPar, that results in a more efficient Cholesky factorization than previously suggested schemes.
The IPM implementation uses the parallel unit of MasPar to speed up the factorization and other computationally intensive parts of the IPM. An impor-tant part of this implementation is the judicious division of data and computation between the front-end computer, that runs the main IPM algorithm, and the par-allel unit. Performanc
Futility Analysis in the Cross-Validation of Machine Learning Models
Many machine learning models have important structural tuning parameters that
cannot be directly estimated from the data. The common tactic for setting these
parameters is to use resampling methods, such as cross--validation or the
bootstrap, to evaluate a candidate set of values and choose the best based on
some pre--defined criterion. Unfortunately, this process can be time consuming.
However, the model tuning process can be streamlined by adaptively resampling
candidate values so that settings that are clearly sub-optimal can be
discarded. The notion of futility analysis is introduced in this context. An
example is shown that illustrates how adaptive resampling can be used to reduce
training time. Simulation studies are used to understand how the potential
speed--up is affected by parallel processing techniques.Comment: 22 pages, 5 figure
HAEC-SIM: A Simulation Framework for Highly Adaptive Energy-Efficient Computing Platforms
This work presents a new trace-based parallel discrete event simulation framework designed for predicting the behavior of a novel computing platform running energy-aware parallel applications. Discrete event traces capture the runtime be- havior of parallel applications on existing systems and form the basis for the simulation. The simulation framework pro- cesses the events of the input trace by applying simulation models that modify event properties. Thus, the output are again event traces that describe the predicted application behavior on the simulated target platform. Both input and simulated traces can be visualized and analyzed with estab- lished tools. The modular design of the framework enables the simulation of different aspects such as temporal perfor- mance and energy efficiency by applying distinct simulation models e.g.: (i) A performance model for communication that allows to evaluate the target communication topology and link properties. (ii) An energy model for computations that is based on measurements of current hardware. We showcase the potential of this simulation by simulating the execution of benchmark applications to explore design al- ternatives of highly adaptive and energy-efficient computing applications and platforms
- …