19 research outputs found

    Computational Methods in Science and Engineering: Proceedings of the Workshop SimLabs@KIT, November 29-30, 2010, Karlsruhe, Germany

    This proceedings volume compiles article contributions covering applications from a range of research fields, spanning capacity to capability computing. Besides classical computing aspects such as parallelization, the proceedings focus on multi-scale approaches and on methods for tackling algorithm and data complexity. Practical aspects of using the HPC infrastructure, as well as the tools and software available at the SCC, are also presented.

    Methodology for malleable applications on distributed memory systems

    The dominant programming approach for scientific and industrial computing on clusters is MPI+X. While there is a variety of approaches within the node, denoted by the "X", the Message Passing Interface (MPI) is the standard for programming multiple nodes with distributed memory. This thesis argues that the OmpSs-2 tasking model can be extended beyond the node to naturally support distributed memory, with three benefits. First, at small to medium scale the tasking model is a simpler and more productive alternative to MPI: it eliminates the need to distribute the data explicitly and to convert all dependencies into explicit message passing, and it avoids the complexity of hybrid programming with MPI+X. Second, the ability to offload parts of the computation among the nodes enables the runtime to automatically balance the load in a full-scale MPI+X program. This approach does not require a cost model, and it can transparently balance the computational load across the whole program, on all its nodes. Third, because the runtime handles all low-level aspects of data distribution and communication, it can change the resource allocation dynamically, in a way that is transparent to the application.

    This thesis describes the design, development and evaluation of OmpSs-2@Cluster, a programming model and runtime system that extends the OmpSs-2 model to allow a virtually unmodified OmpSs-2 program to run across multiple distributed-memory nodes. For well-balanced applications it provides performance similar to MPI+OpenMP on up to 16 nodes, and it improves performance by up to 2x for irregular and unbalanced applications such as Cholesky factorization. This work also extended OmpSs-2@Cluster for interoperability with MPI and with the Barcelona Supercomputing Center (BSC)'s state-of-the-art Dynamic Load Balancing (DLB) library, in order to dynamically balance MPI+OmpSs-2 applications by transparently offloading tasks among nodes. This approach reduces the execution time of a micro-scale solid mechanics application by 46% on 64 nodes, and on a synthetic benchmark it is within 10% of perfect load balance on up to 8 nodes.

    Finally, the runtime was extended to transparently support malleability for pure OmpSs-2@Cluster programs and to interoperate with the Resource Management System (RMS). The only change to the application is an explicit call to an API function that controls the addition or removal of nodes. The runtime was additionally given the ability to semi-transparently save and recover part of the application state in order to perform checkpoint and restart. This feature hides the complexity of data redistribution and parallel I/O from the user while allowing the program to recover and continue previous executions, and it is a starting point for future research on fault tolerance.

    In summary, OmpSs-2@Cluster expands the OmpSs-2 programming model to encompass distributed-memory clusters. It allows an existing OmpSs-2 program, with few if any changes, to run across multiple nodes; it supports transparent multi-node dynamic load balancing for MPI+OmpSs-2 programs; and it enables semi-transparent malleability for pure OmpSs-2@Cluster programs. The runtime system has a high level of stability and performance, and it opens several avenues for future work.
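    To make the programming model concrete, the following is a minimal sketch of what OmpSs-2-style task annotations look like: each task declares the data it reads and writes as array sections, and the runtime builds the dependency graph from those declarations. Under OmpSs-2@Cluster, the abstract's claim is that the same annotated code can run across distributed-memory nodes, with the runtime handling data placement and communication. The kernel, array size, and block size below are illustrative assumptions, not taken from the thesis.

        #include <stdlib.h>

        #define N  1024
        #define BS 256

        int main(void) {
            double *x = malloc(N * sizeof(double));
            double *y = malloc(N * sizeof(double));

            for (int i = 0; i < N; i += BS) {
                /* Each task names its accesses as array sections a[start;size];
                   the runtime derives task dependencies (and, on a cluster,
                   data movement) from these annotations. */
                #pragma oss task out(x[i;BS])
                for (int j = i; j < i + BS; j++)
                    x[j] = (double)j;

                #pragma oss task in(x[i;BS]) out(y[i;BS])
                for (int j = i; j < i + BS; j++)
                    y[j] = 2.0 * x[j];
            }

            /* Wait for all tasks to finish before using the results. */
            #pragma oss taskwait

            free(x);
            free(y);
            return 0;
        }

    Note that the source contains no explicit message passing or data distribution: turning the in/out annotations into communication is what the OmpSs-2@Cluster runtime is responsible for, which is also what allows it to rebalance load or resize the node set without changes to the application code.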

    An FPGA implementation of an investigative many-core processor, Fynbos: in support of a Fortran autoparallelising software pipeline

    In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. To achieve greater efficiency in an environment in which Moore's law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading hardware control complexity for hundreds to thousands of simple parallel processing elements, operating at a clock speed sufficiently low to allow the efficiency gains of near-threshold-voltage operation. Performance is therefore dependent on exploiting a degree of fine-grained parallelism currently found only in GPGPUs, but in a manner that is less restrictive in its range of application domains. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons, this work chooses to replace this control largely with static scheduling, pushing the burden of control primarily onto the software, and specifically the compiler, rather than onto the programmer or an application-specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core already exists; this work implements a many-core architecture to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow. Comparing theoretical peak performance with real performance in a case-study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement tactics that mitigate the costs of static scheduling, owing to a lack of support for such tactics in the compiler.

    Laboratory Directed Research and Development Annual Report - Fiscal Year 2000


    Laboratory Directed Research and Development FY 1998 Progress Report
