19 research outputs found
Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany
In this proceedings volume we provide a compilation of article contributions equally covering applications from different research fields and ranging from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithm and data complexity. Also practical aspects regarding the usage of the HPC infrastructure and available tools and software at the SCC are presented
Methodology for malleable applications on distributed memory systems
A la portada logo BSC(English) The dominant programming approach for scientific and industrial computing on clusters is MPI+X. While there are a variety of approaches within the node, denoted by the ``X'', Message Passing interface (MPI) is the standard for programming multiple nodes with distributed memory. This thesis argues that the OmpSs-2 tasking model can be extended beyond the node to naturally support distributed memory, with three benefits:
First, at small to medium scale the tasking model is a simpler and more productive alternative to MPI. It eliminates the need to distribute the data explicitly and convert all dependencies into explicit message passing. It also avoids the complexity of hybrid programming using MPI+X.
Second, the ability to offload parts of the computation among the nodes enables the runtime to automatically balance the loads in a full-scale MPI+X program. This approach does not require a cost model, and it is able to transparently balance the computational loads across the whole program, on all its nodes.
Third, because the runtime handles all low-level aspects of data distribution and communication, it can change the resource allocation dynamically, in a way that is transparent to the application.
This thesis describes the design, development and evaluation of OmpSs-2@Cluster, a programming model and runtime system that extends the OmpSs-2 model to allow a virtually unmodified OmpSs-2 program to run across multiple distributed memory nodes. For well-balanced applications it provides similar performance to MPI+OpenMP on up to 16 nodes, and it improves performance by up to 2x for irregular and unbalanced applications like Cholesky factorization.
This work also extended OmpSs-2@Cluster for interoperability with MPI and Barcelona Supercomputing Center (BSC)'s state-of-the-art Dynamic Load Balance (DLB) library in order to dynamically balance MPI+OmpSs-2 applications by transparently offloading tasks among nodes. This approach reduces the execution time of a microscale solid mechanics application by 46% on 64 nodes and on a synthetic benchmark, it is within 10% of perfect load balancing on up to 8 nodes.
Finally, the runtime was extended to transparently support malleability for pure OmpSs-2@Cluster programs and interoperate with the Resources Management System (RMS). The only change to the application is to explicitly call an API function to control the addition or removal of nodes. In this regard we additionally provide the runtime with the ability to semi-transparently save and recover part of the application status to perform checkpoint and restart. Such a feature hides the complexity of
data redistribution and parallel IO from the user while allowing the program to recover and continue previous executions. Our work is a starting point for future research on fault tolerance.
In summary, OmpSs-2@Cluster expands the OmpSs-2 programming model to encompass distributed memory clusters. It allows an existing OmpSs-2 program, with few if any changes, to run across multiple nodes. OmpSs-2@Cluster supports transparent multi-node dynamic load balancing for MPI+OmpSs-2 programs, and enables semi-transparent malleability for OmpSs-2@Cluster programs. The runtime system has a high level of stability and performance, and it opens several avenues for future work.(Espa帽ol) El modelo de programaci贸n dominante para clusters tanto en ciencia como industria es actualmente MPI+X. A pesar de que hay alguna variedad de alternativas para programar dentro de un nodo (indicado por la "X"), el estandar para programar m煤ltiples nodos con memoria distribuida sigue siendo Message Passing Interface (MPI). Esta tesis propone la extensi贸n del modelo de programaci贸n basado en tareas OmpSs-2 para su funcionamiento en sistemas de memoria distribuida, destacando 3 beneficios principales: En primer lugar; a peque帽a y mediana escala, un modelo basado en tareas es m谩s simple y productivo que MPI y elimina la necesidad de distribuir los datos expl铆citamente y convertir todas las dependencias en mensajes. Adem谩s, evita la complejidad de la programacion h铆brida MPI+X. En segundo lugar; la capacidad de enviar partes del c谩lculo entre los nodos permite a la librer铆a balancear la carga de trabajo en programas MPI+X a gran escala. Este enfoque no necesita un modelo de coste y permite equilibrar cargas transversalmente en todo el programa y todos los nodos. En tercer lugar; teniendo en cuenta que es la librer铆a quien maneja todos los aspectos relacionados con distribuci贸n y transferencia de datos, es posible la modificaci贸n din谩mica y transparente de los recursos que utiliza la aplicaci贸n. Esta tesis describe el dise帽o, desarrollo y evaluaci贸n de OmpSs-2@Cluster; un modelo de programaci贸n y librer铆a que extiende OmpSs-2 permitiendo la ejecuci贸n de programas OmpSs-2 existentes en m煤ltiples nodos sin pr谩cticamente necesidad de modificarlos. Para aplicaciones balanceadas, este modelo proporciona un rendimiento similar a MPI+OpenMP hasta 16 nodos y duplica el rendimiento en aplicaciones irregulares o desbalanceadas como la factorizaci贸n de Cholesky. Este trabajo incluye la extensi贸n de OmpSs-2@Cluster para interactuar con MPI y la librer铆a de balanceo de carga Dynamic Load Balancing (DLB) desarrollada en el Barcelona Supercomputing Center (BSC). De este modo es posible equilibrar aplicaciones MPI+OmpSs-2 mediante la transferencia transparente de tareas entre nodos. Este enfoque reduce el tiempo de ejecuci贸n de una aplicaci贸n de mec谩nica de s贸lidos a micro-escala en un 46% en 64 nodos; en algunos experimentos hasta 8 nodos se pudo equilibrar perfectamente la carga con una diferencia inferior al 10% del equilibrio perfecto. Finalmente, se implement贸 otra extensi贸n de la librer铆a para realizar operaciones de maleabilidad en programas OmpSs-2@Cluster e interactuar con el Sistema de Manejo de Recursos (RMS). El 煤nico cambio requerido en la aplicaci贸n es la llamada explicita a una funci贸n de la interfaz que controla la adici贸n o eliminaci贸n de nodos. Adem谩s, se agreg贸 la funcionalidad de guardar y recuperar parte del estado de la aplicaci贸n de forma semitransparente con el objetivo de realizar operaciones de salva-reinicio. Dicha funcionalidad oculta al usuario la complejidad de la redistribuci贸n de datos y las operaciones de lectura-escritura en paralelo, mientras permite al programa recuperar y continuar ejecuciones previas. Este es un punto de partida para futuras investigaciones en tolerancia a fallos. En resumen, OmpSs-2@Cluster ampl铆a el modelo de programaci贸n de OmpSs-2 para abarcar sistemas de memoria distribuida. El modelo permite la ejecuci贸n de programas OmpSs-2 en m煤ltiples nodos pr谩cticamente sin necesidad de modificarlos. OmpSs-2@Cluster permite adem谩s el balanceo din谩mico de carga en aplicaciones h铆bridas MPI+OmpSs-2 ejecutadas en varios nodos y es capaz de realizar maleabilidad semi-transparente en programas OmpSs-2@Cluster puros. La librer铆a tiene un niveles de rendimiento y estabilidad altos y abre varios caminos para trabajos futuro.Arquitectura de computador
An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline
Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore鈥檚 law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler
Recommended from our members
Laboratory Directed Research and Development Program FY 2004 Annual Report
The Oak Ridge National Laboratory (ORNL) Laboratory Directed Research and Development (LDRD) Program reports its status to the U.S. Department of Energy (DOE) in March of each year. The program operates under the authority of DOE Order 413.2A, 'Laboratory Directed Research and Development' (January 8, 2001), which establishes DOE's requirements for the program while providing the Laboratory Director broad flexibility for program implementation. LDRD funds are obtained through a charge to all Laboratory programs. This report describes all ORNL LDRD research activities supported during FY 2004 and includes final reports for completed projects and shorter progress reports for projects that were active, but not completed, during this period. The FY 2004 ORNL LDRD Self-Assessment (ORNL/PPA-2005/2) provides financial data about the FY 2004 projects and an internal evaluation of the program's management process. ORNL is a DOE multiprogram science, technology, and energy laboratory with distinctive capabilities in materials science and engineering, neutron science and technology, energy production and end-use technologies, biological and environmental science, and scientific computing. With these capabilities ORNL conducts basic and applied research and development (R&D) to support DOE's overarching national security mission, which encompasses science, energy resources, environmental quality, and national nuclear security. As a national resource, the Laboratory also applies its capabilities and skills to the specific needs of other federal agencies and customers through the DOE Work For Others (WFO) program. Information about the Laboratory and its programs is available on the Internet at <http://www.ornl.gov/>. LDRD is a relatively small but vital DOE program that allows ORNL, as well as other multiprogram DOE laboratories, to select a limited number of R&D projects for the purpose of: (1) maintaining the scientific and technical vitality of the Laboratory; (2) enhancing the Laboratory's ability to address future DOE missions; (3) fostering creativity and stimulating exploration of forefront science and technology; (4) serving as a proving ground for new research; and (5) supporting high-risk, potentially high-value R&D. Through LDRD the Laboratory is able to improve its distinctive capabilities and enhance its ability to conduct cutting-edge R&D for its DOE and WFO sponsors. To meet the LDRD objectives and fulfill the particular needs of the Laboratory, ORNL has established a program with two components: the Director's R&D Fund and the Seed Money Fund. As outlined in Table 1, these two funds are complementary. The Director's R&D Fund develops new capabilities in support of the Laboratory initiatives, while the Seed Money Fund is open to all innovative ideas that have the potential for enhancing the Laboratory's core scientific and technical competencies. Provision for multiple routes of access to ORNL LDRD funds maximizes the likelihood that novel and seminal ideas with scientific and technological merit will be recognized and supported
Recommended from our members
Laboratory directed research and development. Annual report, fiscal year 1995
This document is a compilation of the several research and development programs having been performed at the Pacific Northwest National Laboratory for the fiscal year 1995