154 research outputs found

    Dynamic Multigrain Parallelization on the Cell Broadband Engine

    Get PDF
    This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods, that selectively exploit the layers of parallelism in the system, in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5--10\% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor

    Resource-aware Programming in a High-level Language - Improved performance with manageable effort on clustered MPSoCs

    Get PDF
    Bis 2001 bedeutete Moores und Dennards Gesetz eine Verdoppelung der Ausführungszeit alle 18 Monate durch verbesserte CPUs. Heute ist Nebenläufigkeit das dominante Mittel zur Beschleunigung von Supercomputern bis zu mobilen Geräten. Allerdings behindern neuere Phänomene wie "Dark Silicon" zunehmend eine weitere Beschleunigung durch Hardware. Um weitere Beschleunigung zu erreichen muss sich auch die Soft­ware mehr ihrer Hardware Resourcen gewahr werden. Verbunden mit diesem Phänomen ist eine immer heterogenere Hardware. Supercomputer integrieren Beschleuniger wie GPUs. Mobile SoCs (bspw. Smartphones) integrieren immer mehr Fähigkeiten. Spezialhardware auszunutzen ist eine bekannte Methode, um den Energieverbrauch zu senken, was ein weiterer wichtiger Aspekt ist, welcher mit der reinen Geschwindigkeit abgewogen werde muss. Zum Beispiel werden Supercomputer auch nach "Performance pro Watt" bewertet. Zur Zeit sind systemnahe low-level Programmierer es gewohnt über Hardware nachzudenken, während der gemeine high-level Programmierer es vorzieht von der Plattform möglichst zu abstrahieren (bspw. Cloud). "High-level" bedeutet nicht, dass Hardware irrelevant ist, sondern dass sie abstrahiert werden kann. Falls Sie eine Java-Anwendung für Android entwickeln, kann der Akku ein wichtiger Aspekt sein. Irgendwann müssen aber auch Hochsprachen resourcengewahr werden, um Geschwindigkeit oder Energieverbrauch zu verbessern. Innerhalb des Transregio "Invasive Computing" habe ich an diesen Problemen gearbeitet. In meiner Dissertation stelle ich ein Framework vor, mit dem man Hochsprachenanwendungen resourcengewahr machen kann, um so die Leistung zu verbessern. Das könnte beispielsweise erhöhte Effizienz oder schnellerer Ausführung für das System als Ganzes bringen. Ein Kerngedanke dabei ist, dass Anwendungen sich nicht selbst optimieren. Stattdessen geben sie alle Informationen an das Betriebssystem. Das Betriebssystem hat eine globale Sicht und trifft Entscheidungen über die Resourcen. Diesen Prozess nennen wir "Invasion". Die Aufgabe der Anwendung ist es, sich an diese Entscheidungen anzupassen, aber nicht selbst welche zu fällen. Die Herausforderung besteht darin eine Sprache zu definieren, mit der Anwendungen Resourcenbedingungen und Leistungsinformationen kommunizieren. So eine Sprache muss ausdrucksstark genug für komplexe Informationen, erweiterbar für neue Resourcentypen, und angenehm für den Programmierer sein. Die zentralen Beiträge dieser Dissertation sind: Ein theoretisches Modell der Resourcen-Verwaltung, um die Essenz des resourcengewahren Frameworks zu beschreiben, die Korrektheit der Entscheidungen des Betriebssystems bezüglich der Bedingungen einer Anwendung zu begründen und zum Beweis meiner Thesen von Effizienz und Beschleunigung in der Theorie. Ein Framework und eine Übersetzungspfad resourcengewahrer Programmierung für die Hochsprache X10. Zur Bewertung des Ansatzes haben wir Anwendungen aus dem High Performance Computing implementiert. Eine Beschleunigung von 5x konnte gemessen werden. Ein Speicherkonsistenzmodell für die X10 Programmiersprache, da dies ein notwendiger Schritt zu einer formalen Semantik ist, die das theoretische Modell und die konkrete Implementierung verknüpft. Zusammengefasst zeige ich, dass resourcengewahre Programmierung in Hoch\-sprachen auf zukünftigen Architekturen mit vielen Kernen mit vertretbarem Aufwand machbar ist und die Leistung verbessert

    Dynamic multigrain parallelization on the cell broadband engine

    Full text link

    Methodology for malleable applications on distributed memory systems

    Get PDF
    A la portada logo BSC(English) The dominant programming approach for scientific and industrial computing on clusters is MPI+X. While there are a variety of approaches within the node, denoted by the ``X'', Message Passing interface (MPI) is the standard for programming multiple nodes with distributed memory. This thesis argues that the OmpSs-2 tasking model can be extended beyond the node to naturally support distributed memory, with three benefits: First, at small to medium scale the tasking model is a simpler and more productive alternative to MPI. It eliminates the need to distribute the data explicitly and convert all dependencies into explicit message passing. It also avoids the complexity of hybrid programming using MPI+X. Second, the ability to offload parts of the computation among the nodes enables the runtime to automatically balance the loads in a full-scale MPI+X program. This approach does not require a cost model, and it is able to transparently balance the computational loads across the whole program, on all its nodes. Third, because the runtime handles all low-level aspects of data distribution and communication, it can change the resource allocation dynamically, in a way that is transparent to the application. This thesis describes the design, development and evaluation of OmpSs-2@Cluster, a programming model and runtime system that extends the OmpSs-2 model to allow a virtually unmodified OmpSs-2 program to run across multiple distributed memory nodes. For well-balanced applications it provides similar performance to MPI+OpenMP on up to 16 nodes, and it improves performance by up to 2x for irregular and unbalanced applications like Cholesky factorization. This work also extended OmpSs-2@Cluster for interoperability with MPI and Barcelona Supercomputing Center (BSC)'s state-of-the-art Dynamic Load Balance (DLB) library in order to dynamically balance MPI+OmpSs-2 applications by transparently offloading tasks among nodes. This approach reduces the execution time of a microscale solid mechanics application by 46% on 64 nodes and on a synthetic benchmark, it is within 10% of perfect load balancing on up to 8 nodes. Finally, the runtime was extended to transparently support malleability for pure OmpSs-2@Cluster programs and interoperate with the Resources Management System (RMS). The only change to the application is to explicitly call an API function to control the addition or removal of nodes. In this regard we additionally provide the runtime with the ability to semi-transparently save and recover part of the application status to perform checkpoint and restart. Such a feature hides the complexity of data redistribution and parallel IO from the user while allowing the program to recover and continue previous executions. Our work is a starting point for future research on fault tolerance. In summary, OmpSs-2@Cluster expands the OmpSs-2 programming model to encompass distributed memory clusters. It allows an existing OmpSs-2 program, with few if any changes, to run across multiple nodes. OmpSs-2@Cluster supports transparent multi-node dynamic load balancing for MPI+OmpSs-2 programs, and enables semi-transparent malleability for OmpSs-2@Cluster programs. The runtime system has a high level of stability and performance, and it opens several avenues for future work.(Español) El modelo de programación dominante para clusters tanto en ciencia como industria es actualmente MPI+X. A pesar de que hay alguna variedad de alternativas para programar dentro de un nodo (indicado por la "X"), el estandar para programar múltiples nodos con memoria distribuida sigue siendo Message Passing Interface (MPI). Esta tesis propone la extensión del modelo de programación basado en tareas OmpSs-2 para su funcionamiento en sistemas de memoria distribuida, destacando 3 beneficios principales: En primer lugar; a pequeña y mediana escala, un modelo basado en tareas es más simple y productivo que MPI y elimina la necesidad de distribuir los datos explícitamente y convertir todas las dependencias en mensajes. Además, evita la complejidad de la programacion híbrida MPI+X. En segundo lugar; la capacidad de enviar partes del cálculo entre los nodos permite a la librería balancear la carga de trabajo en programas MPI+X a gran escala. Este enfoque no necesita un modelo de coste y permite equilibrar cargas transversalmente en todo el programa y todos los nodos. En tercer lugar; teniendo en cuenta que es la librería quien maneja todos los aspectos relacionados con distribución y transferencia de datos, es posible la modificación dinámica y transparente de los recursos que utiliza la aplicación. Esta tesis describe el diseño, desarrollo y evaluación de OmpSs-2@Cluster; un modelo de programación y librería que extiende OmpSs-2 permitiendo la ejecución de programas OmpSs-2 existentes en múltiples nodos sin prácticamente necesidad de modificarlos. Para aplicaciones balanceadas, este modelo proporciona un rendimiento similar a MPI+OpenMP hasta 16 nodos y duplica el rendimiento en aplicaciones irregulares o desbalanceadas como la factorización de Cholesky. Este trabajo incluye la extensión de OmpSs-2@Cluster para interactuar con MPI y la librería de balanceo de carga Dynamic Load Balancing (DLB) desarrollada en el Barcelona Supercomputing Center (BSC). De este modo es posible equilibrar aplicaciones MPI+OmpSs-2 mediante la transferencia transparente de tareas entre nodos. Este enfoque reduce el tiempo de ejecución de una aplicación de mecánica de sólidos a micro-escala en un 46% en 64 nodos; en algunos experimentos hasta 8 nodos se pudo equilibrar perfectamente la carga con una diferencia inferior al 10% del equilibrio perfecto. Finalmente, se implementó otra extensión de la librería para realizar operaciones de maleabilidad en programas OmpSs-2@Cluster e interactuar con el Sistema de Manejo de Recursos (RMS). El único cambio requerido en la aplicación es la llamada explicita a una función de la interfaz que controla la adición o eliminación de nodos. Además, se agregó la funcionalidad de guardar y recuperar parte del estado de la aplicación de forma semitransparente con el objetivo de realizar operaciones de salva-reinicio. Dicha funcionalidad oculta al usuario la complejidad de la redistribución de datos y las operaciones de lectura-escritura en paralelo, mientras permite al programa recuperar y continuar ejecuciones previas. Este es un punto de partida para futuras investigaciones en tolerancia a fallos. En resumen, OmpSs-2@Cluster amplía el modelo de programación de OmpSs-2 para abarcar sistemas de memoria distribuida. El modelo permite la ejecución de programas OmpSs-2 en múltiples nodos prácticamente sin necesidad de modificarlos. OmpSs-2@Cluster permite además el balanceo dinámico de carga en aplicaciones híbridas MPI+OmpSs-2 ejecutadas en varios nodos y es capaz de realizar maleabilidad semi-transparente en programas OmpSs-2@Cluster puros. La librería tiene un niveles de rendimiento y estabilidad altos y abre varios caminos para trabajos futuro.Arquitectura de computador

    Memory Efficient Scheduling for Multicore Real-time Systems

    Get PDF
    Modern real-time systems are becoming increasingly complex and requiring significant computational power to meet their demands. Since the increase in uniprocessor speed has slowed down in the last decade, multicore processors are now the preferred way to supply the increased performance demand of real-time systems. A significant amount of work in the real-time community has focused on scheduling solutions for multicore processors for both sequential and parallel real-time tasks. Even though such solutions are able to provide strict timing guarantees on the overall response time of real-time tasks, they rely on the assumption that the worst-case execution time (WCET) of each individual task is known. However, physical shared resources such as main memory and I/O are heavily employed in multicore processors. These resources are limited and therefore subject to contention. In fact, the execution time of one task when run in parallel with other tasks is significantly larger than the execution time of the same task when run in isolation. In addition, the presence of shared resources increases the timing unpredictability due to the conflicts generated by multiple cores. As a result, the adoption of multicore processors for real-time systems is dependent upon solving such sources of unpredictability. In this dissertation, we investigate memory bus contention. In particular, two main problems are associated with memory contention: (1) unpredictable behavior and (2) hindrance of performance. We show how to mitigate these two problems through scheduling. Scheduling is an attractive tool that can be easily integrated into the system without the need for hardware modifications. We adopt an execution model that exposes memory as a resource to the scheduling algorithm. Thus, the theory of real-time multiprocessor scheduling, that has seen significant advances in recent years, can be utilized to schedule both processor cores and memory. Since the real-time workload on multicore processors can be modeled as sequential or parallel tasks, we also study parallel task scheduling by taking memory time into account

    Market-Based Resourse Management for Many-Core Systems

    Get PDF
    101 σ.Αντικείμενο της διπλωματικής αποτελεί η μελέτη και η ανάπτυξη μιας κλιμακώσιμης και κατανεμημένης πλατφόρμας (framework) διαχείρισης πόρων σε χρόνο εκτέλεσης για συστήματα πολλαπλών πυρήνων. Σε αυτήν την πλατφόρμα η διαχείριση πόρων είναι βασισμένη σε μοντέλα, τα οποία είναι εμπνευσμένα από την οικονομία. Παρουσιάζεται ένας διαχειριστής πόρων, ο οποίος προσφέρει ένα περιβάλλον διαχείρισης πόρων και εφαρμογών καθ ́ όλη τη διάρκεια ζωής τους, στο οποίο η κατανομή και δρομολόγηση των εφαρμογών στους πόρους πραγματοποιείται με αλγόριθμους βασισμένους σε κανόνες αγοράς. Η αποδοτικότητα κάθε μοντέλου αξιολογείται βάσει της πτώσης της αξιοπιστίας των πόρων (μετρική MTTF-Mean Time To Failure).The purpose of this diploma thesis is the design and development of a scalable and distributed run-time resource management framework for Many-core systems. In this framework, resource management is based on economy-inspired models. The presented resource management framework offers an environment that manages both application tasks and resources at run-time, matches and distributes application tasks across resources with algorithms which are based on market principles. The efficiency of each model is evaluated with respect to resource reliability degradation (metric MTTF-Mean Time to Failure).Θεμιστοκλής Γ. Μελισσάρη

    Scratchpad Memory Management For Multicore Real-Time Embedded Systems

    Get PDF
    Multicore systems will continue to spread in the domain of real-time embedded systems due to the increasing need for high-performance applications. This research discusses some of the challenges associated with employing multicore systems for safety-critical real-time applications. Mainly, this work is concerned with providing: 1) efficient inter-core timing isolation for independent tasks, and 2) predictable task communication for communicating tasks. Principally, we introduce a new task execution model, based on the 3-phase execution model, that exploits the Direct Memory Access (DMA) controllers available in modern embedded platforms along with ScratchPad Memories (SPMs) to enforce strong timing isolation between tasks. The DMA and the SPMs are explicitly managed to pre-load tasks from main memory into the local (private) scratchpad memories. Tasks are then executed from the local SPMs without accessing main memory. This model allows CPU execution to be overlapped with DMA loading/unloading operations from and to main memory. We show that by co-scheduling task execution on CPUs and using DMA to access memory and I/O, we can efficiently hide access latency to physical resources. In turn, this leads to significant improvements in system schedulability, compared to both the case of unregulated contention for access to physical resources and to previous cache and SPM management techniques for real-time systems. The presented SPM-centric scheduling algorithms and analyses cover single-core, partitioned, and global real-time systems. The proposed scheme is also extended to support large tasks that do not fit entirely into the local SPM. Moreover, the schedulability analysis considers the case of recovering from transient soft errors (bit flips caused by a single event upset) in several levels of memories, that cannot be automatically corrected in hardware by the ECC unit. The proposed SPM-centric scheduling is integrated at the OS level; thus it is transparent to applications. The proposed scheme is implemented and evaluated on an FPGA platform and a Commercial-Off-The-Shelf (COTS) platform. In regards to real-time task communication, two types of communication are considered. 1) Asynchronous inter-task communication, between either sequential tasks (single-threaded) or parallel tasks (multi-threaded). 2) Intra-task communication, where parallel threads of the same application exchange data. A new task scheduling model for parallel tasks (Bundled Scheduling) is proposed to facilitate intra-task communication and reduce synchronization overheads. We show that the proposed bundled scheduling model can be applied to several parallel programming models, such as fork-join and DAG-based applications, leading to improved system schedulability. Finally, intra-task communication is governed by a predictable inter-core communication platform. Specifically, we propose HopliteRT, a lean and predictable Network-on-Chip that connects the private SPMs

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Get PDF
    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    A Scalable and Adaptive Network on Chip for Many-Core Architectures

    Get PDF
    In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept allows to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs is introduced
    corecore