29 research outputs found

    Revisiting Matrix Product on Master-Worker Platforms

    This paper aims at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), three key hypotheses render our work original and innovative:
    - Centralized data. We assume that all matrix files originate from, and must be returned to, the master.
    - Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers.
    - Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK).
    We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at the École Normale Supérieure de Lyon and the University of Tennessee. We point out, however, that in this first version of the report the experiments are limited to homogeneous platforms.
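
    To make the memory constraint concrete, the following sketch (an illustration under assumed parameters, not the paper's actual algorithm) simulates the outer-product scheme sequentially in Python: the master streams one column block of A and one row block of B per step, so a worker never holds a full panel, and block rows of C are shared among workers in proportion to hypothetical relative speeds.

        import numpy as np

        def master_worker_matmul(A, B, speeds, block=64):
            # Toy sequential simulation of the bounded-memory outer-product
            # scheme. Only one column block of A and one row block of B are
            # "in flight" per step, so no full panel is ever resident, and
            # block rows of C are shared in proportion to worker speed.
            n = A.shape[0]
            nb = n // block
            shares = np.floor(nb * np.asarray(speeds, float) / sum(speeds)).astype(int)
            shares[0] += nb - shares.sum()             # remainder goes to worker 0
            owner = np.repeat(np.arange(len(speeds)), shares)  # block row -> worker
            C = np.zeros((n, n))
            for k in range(nb):                        # k-th outer-product step
                Acol = A[:, k*block:(k+1)*block]       # streamed from the master
                Brow = B[k*block:(k+1)*block, :]
                for i in range(nb):                    # worker owner[i] updates block row i
                    C[i*block:(i+1)*block] += Acol[i*block:(i+1)*block] @ Brow
            return C, owner

        A = np.random.rand(256, 256); B = np.random.rand(256, 256)
        C, owner = master_worker_matmul(A, B, speeds=[3, 2, 1])
        assert np.allclose(C, A @ B)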

    Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance

    The use of a network of shared, heterogeneous workstations, each harboring a Reconfigurable Computing (RC) system, offers high-performance users an inexpensive platform for a wide range of computationally demanding problems. However, effectively using the full potential of these systems can be challenging without knowledge of the system's performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of Reconfigurable Computing systems. This dissertation develops and validates an analytic performance modeling methodology for a class of fork-join algorithms executing on a High Performance Reconfigurable Computing (HPRC) platform. The model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message-passing communication, and processor heterogeneity. Three fork-join applications (a Boolean Satisfiability Solver, a Matrix-Vector Multiplication algorithm, and an Advanced Encryption Standard algorithm) are used to validate the model with homogeneous and simulated heterogeneous workstations. A synthetic load is used to validate the model under various loading conditions, including simulated heterogeneity in which background loading makes some workstations appear slower than others. The performance modeling methodology proves to be accurate in characterizing the effects of reconfigurable devices, application load imbalance, background user load, and heterogeneity for applications running on shared, homogeneous and heterogeneous HPRC resources. The model error in all cases was found to be less than five percent for application runtimes greater than thirty seconds and less than fifteen percent for runtimes under thirty seconds. The performance modeling methodology enables us to characterize applications running on shared HPRC resources. Cost functions are used to impose system usage policies, and the results of the modeling methodology are utilized to find the optimal (or near-optimal) set of workstations to use for a given application. The usage policies investigated include determining the computational costs for the workstations and balancing the priority of the background user load with the parallel application. The applications studied fall within the Master-Worker paradigm and are well suited for a grid computing approach. A method for using NetSolve, a grid middleware, with the model and cost functions is introduced whereby users can produce optimal workstation sets and schedules for Master-Worker applications running on shared HPRC resources.
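
    The dissertation's model itself is not reproduced in the abstract above; as a rough illustration of the fork-join structure it captures, the sketch below predicts makespan as the maximum per-worker completion time under assumed parameters (the function, its arguments, and the linear background-load slowdown are assumptions, not the validated model).

        def predicted_runtime(work, speeds, bg_load, t_comm, t_rc_config=0.0):
            # Hypothetical fork-join makespan predictor.
            #   work[i]     -- operations assigned to workstation i
            #   speeds[i]   -- delivered rate of workstation i (ops/s), incl. RC device
            #   bg_load[i]  -- average number of competing background processes
            #   t_comm      -- per-worker message-passing overhead (s)
            #   t_rc_config -- one-time FPGA configuration cost (s)
            # The join completes when the slowest worker finishes, hence the max.
            per_worker = [t_rc_config + t_comm + (w / s) * (1.0 + b)
                          for w, s, b in zip(work, speeds, bg_load)]
            return max(per_worker)

        # Three identical workstations, one slowed by a background user.
        print(predicted_runtime(work=[1e9] * 3, speeds=[5e8] * 3,
                                bg_load=[0.0, 0.0, 1.0], t_comm=0.05))  # 4.05 s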

    Evaluation of the PlayStation 2 as a cluster computing node

    Cluster computing is currently a popular, cost-effective solution to the increasing computational demands of many applications in scientific computing and image processing. A cluster computer comprises several networked computers known as nodes. Since the goal of cluster computing is to provide a cost-effective means of processing computationally demanding applications, nodes that can be obtained at a low price with minimal performance tradeoff are always attractive. Presently, the most common cluster computers are networks of workstations constructed from commodity components. Recent trends have shown that computers developed and deployed for purposes other than traditional personal computers or workstations present new candidates for cluster computing nodes, ones that may offer a competitive, and even less expensive, alternative to the nodes used today. Machines such as video game consoles, whose prices are kept extremely low by intense marketplace competition, are a prime example. The Sony PlayStation 2, in particular, provides the user with low-level hardware devices that are often found in more expensive machines. This work presents an evaluation of the PlayStation 2 video game console as a cluster computing node for scientific and image processing applications. From this evaluation, a determination is made as to whether the PlayStation 2 is a viable alternative to the cluster computing nodes being used today.

    Adaptive heterogeneous parallelism for semi-empirical lattice dynamics in computational materials science

    With the variability in performance of the multitude of parallel environments available today, the conceptual overhead created by the need to anticipate runtime information to make design-time decisions has become overwhelming. Performance-critical applications and libraries carry implicit assumptions based on incidental metrics that are not portable to emerging computational platforms or even to alternative contemporary architectures. Furthermore, the significance of runtime concerns such as makespan, energy efficiency and fault tolerance depends on the situational context. This thesis presents a case study in the application of both Mattson's prescriptive pattern-oriented approach and the more principled structured parallelism formalism to the computational simulation of inelastic neutron scattering spectra on hybrid CPU/GPU platforms. The original ad hoc implementation as well as new pattern-based and structured implementations are evaluated for relative performance and scalability. Two new structural abstractions are introduced to facilitate adaptation by lazy optimisation and runtime feedback. A deferred-choice abstraction represents a unified space of alternative structural program variants, allowing static adaptation through model-specific exhaustive calibration with regard to the extrafunctional concerns of runtime, average instantaneous power and total energy usage. Instrumented queues serve as a mechanism for structural composition and provide a representation of extrafunctional state that allows realisation of a market-based decentralised coordination heuristic for competitive resource allocation and of the Lyapunov drift algorithm for cooperative scheduling.
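
    As a rough illustration of what a deferred-choice abstraction provides (the thesis's own abstractions are richer), the sketch below holds a space of interchangeable structural variants and commits to one by exhaustive calibration; the extrafunctional metric here is wall-clock runtime only, since power and energy measurements would need external instrumentation.

        import time
        from functools import reduce

        def calibrate(variants, sample_input, trials=3):
            # Time each variant on representative input and pick the cheapest.
            def timed(fn):
                start = time.perf_counter()
                fn(sample_input)
                return time.perf_counter() - start
            costs = {name: min(timed(fn) for _ in range(trials))  # best-of-N damps noise
                     for name, fn in variants.items()}
            return min(costs, key=costs.get), costs

        # Toy variant space: two ways to sum a list of numbers.
        variants = {"builtin": sum,
                    "reduce": lambda xs: reduce(lambda a, b: a + b, xs)}
        best, costs = calibrate(variants, list(range(100_000)))
        print(best, costs)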

    Toward Message Passing Failure Management

    As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside. Moving from result checking, to rollback recovery, and on to algorithm-based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1994. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol, called Checkpoint-on-Failure, that uses the existing MPI Standard to perform limited fault tolerance within the current framework of MPI, and proposes a new platform titled User Level Failure Mitigation (ULFM) to build more complete and complex fault tolerance solutions with a truly fault-tolerant MPI implementation. We demonstrate the overhead involved in using these fault-tolerant solutions and give examples of applications and libraries which construct other fault tolerance mechanisms based on the constructs provided in ULFM.
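
    A minimal sketch of the Checkpoint-on-Failure idea, using the mpi4py bindings for illustration: errors are set to be returned rather than aborting the job, so when a peer's failure surfaces as an MPI error, the survivors write a checkpoint and exit, and an external wrapper (not shown) relaunches the job from the checkpoints. Whether survivors actually receive an error after a failure depends on the MPI implementation, and the loop body and file names here are illustrative.

        import os, pickle, sys
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        comm.Set_errhandler(MPI.ERRORS_RETURN)   # raise on error instead of aborting,
                                                 # so survivors get to checkpoint
        CKPT = f"ckpt.{rank}.pkl"

        def load_state():
            if os.path.exists(CKPT):             # resuming after a relaunch
                with open(CKPT, "rb") as f:
                    return pickle.load(f)
            return {"step": 0}                   # fresh start

        state = load_state()
        try:
            while state["step"] < 100:
                state["step"] += 1               # stand-in for real computation
                comm.Barrier()                   # a failed peer surfaces here
        except MPI.Exception:
            with open(CKPT, "wb") as f:          # Checkpoint-on-Failure: survivors
                pickle.dump(state, f)            # save their state and exit; the
            sys.exit(1)                          # job is then relaunched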

    Requirement-Aware Strategies for Scheduling Multiple Divisible Loads in Cluster Environments

    Ph.D. (Doctor of Philosophy)

    Library-based solutions for algorithms with complex patterns of parallelism

    With the arrival of multi-core processors and the slowdown in the growth of per-core processing power with each new generation, parallelization is becoming increasingly critical for improving the performance of every kind of application. Also, while simple patterns of parallelism are well understood and supported, this is not the case for complex and irregular patterns, whose parallelization requires either low-level tools that hurt programmers' productivity or transaction-based approaches that need specific hardware or imply potentially large overheads. This is an increasingly important problem, as the number of applications that exhibit these latter patterns is steadily growing. This thesis tries to better understand and support three kinds of complex patterns through the identification of abstractions and clear semantics that help bring structure to them, and through the development of libraries based on our observations that facilitate their parallelization in shared-memory environments. The library approach was chosen for its advantages in code reuse, reduced compiler requirements, and a relatively short learning curve. C++ was selected as the implementation language for its good performance and its capability to express the necessary abstractions. The examples and evaluations in this thesis show that our proposals allow the applications that present these patterns to be expressed elegantly, improving their programmability while providing performance similar to or better than that of existing approaches.
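
    The thesis libraries are written in C++; purely as an illustration of the skeleton style such a library exposes, the Python sketch below wraps a divide-and-conquer pattern behind four user-supplied functions (all names are illustrative).

        from concurrent.futures import ThreadPoolExecutor

        def parallel_dac(problem, is_base, base_solve, split, combine, depth=2):
            # Recurse sequentially at base cases or once deep enough.
            if depth == 0 or is_base(problem):
                if is_base(problem):
                    return base_solve(problem)
                return combine([parallel_dac(p, is_base, base_solve, split, combine, 0)
                                for p in split(problem)])
            # Otherwise solve the subproblems as parallel tasks.
            with ThreadPoolExecutor() as pool:
                futures = [pool.submit(parallel_dac, p, is_base, base_solve,
                                       split, combine, depth - 1)
                           for p in split(problem)]
                return combine([f.result() for f in futures])

        # Toy use: merge sort expressed against the skeleton
        # (the combine sorts rather than merges, for brevity).
        print(parallel_dac([5, 3, 8, 1, 9, 2],
                           is_base=lambda xs: len(xs) <= 1,
                           base_solve=lambda xs: xs,
                           split=lambda xs: [xs[:len(xs)//2], xs[len(xs)//2:]],
                           combine=lambda parts: sorted(parts[0] + parts[1])))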

    FPGA Based Acceleration of Matrix Decomposition and Clustering Algorithm Using High Level Synthesis

    FPGAs have shown great promise for accelerating computationally intensive algorithms. However, FPGA-based accelerator design is tedious and time-consuming if we rely on traditional HDL-based design methods. The recent introduction of the Altera SDK for OpenCL (AOCL) high-level synthesis tool enables developers to tap the potential of FPGAs without long development times and extensive hardware knowledge. AOCL is used in this thesis to accelerate computationally intensive algorithms in the fields of machine learning and scientific computing. The algorithms studied are k-means clustering, k-nearest neighbor search, N-body simulation and LU decomposition. The performance and power consumption of the algorithms synthesized using AOCL for the FPGA are evaluated against state-of-the-art CPU and GPU implementations. The k-means clustering and k-nearest neighbor kernels designed for the FPGA significantly outperformed optimized CPU implementations while achieving similar or better power efficiency than that of the GPU.
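
    As an illustration of the kind of hot loop such accelerators target, the sketch below performs one k-means iteration (assignment plus centroid update) in NumPy; an AOCL design would express the distance computation as an OpenCL kernel, and nothing here is taken from the thesis's actual kernels.

        import numpy as np

        def kmeans_step(points, centroids):
            # Squared Euclidean distance from every point to every centroid.
            d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                  # assign nearest centroid
            new_centroids = np.stack([
                points[labels == k].mean(axis=0) if (labels == k).any()
                else centroids[k]                       # keep empty clusters in place
                for k in range(len(centroids))])
            return labels, new_centroids

        rng = np.random.default_rng(0)
        pts = rng.random((1000, 2))
        cents = pts[rng.choice(len(pts), 4, replace=False)]
        for _ in range(10):
            labels, cents = kmeans_step(pts, cents)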