30 research outputs found

    Real-time analysis of MPI programs for NoC-based many-cores using time division multiplexing

    Get PDF
    Worst-case execution time (WCET) analysis is crucial for designing hard real-time systems. While the WCET of tasks in a single core system can be upper bounded in isolation, the tasks in a many-core system are subject to shared memory interferences which impose high overestimation of the WCET bounds. However, many-core-based massively parallel applications will enter the area of real-time systems in the years ahead. Explicit message-passing and a clear separation of computation and communication facilitates WCET analysis for those programs. A standard programming model for message-based communication is the message passing interface (MPI). It provides an application independent interface for different standard communication operations (e.g. broadcast, gather, ...). Thereby, it uses efficient communication patterns with deterministic behaviour. In applying these known structures, we target to provide a WCET analysis for communication that is reusable for different applications if the communication is executed on the same underlying platform. Hence, the analysis must be performed once per hardware platform and can be reused afterwards with only adapting several parameters such as the number of nodes participating in that communication. Typically, the processing elements of many-core platforms are connected via a Network-on-Chip (NoC) and apply techniques such as time-division multiplexing (TDM) to provide guaranteed services for the network. Hence, the hardware and the applied technique for guaranteed service needs to facilitate this reusability of the analysis as well. In this work we review different general-purpose TDM schedules that enable a WCET approximation independent of the placement of tasks on processing elements of a many-core which uses a NoC with torus topology. Furthermore, we provide two new schedules that show a similar performance as the state-of-the-art schedules but additionally serve situations where the presented state-of-the-art schedules perform poorly. Based on these schedules a procedure for the WCET analysis of the communication patterns used in MPI is proposed. Finally, we show how to apply the results of the analysis to calculate the WCET upper bound for a complete MPI program. Detailed insights in the performance of the applied TDM schedules are provided by comparing the schedules to each other in terms of timing. Additionally, we discuss the exhibited timing of the general-purpose schedules compared to a state-of-the-art application specific TDM schedule to put in relation both types of schedules. We apply the proposed procedure to several standard types of communication provided in MPI and compare different patterns that are used to implement a specific communication. Our evaluation investigates the communications’ building blocks of the timing bounds and shows the tremendous impact of choosing the appropriate communication pattern. Finally, a case study demonstrates the application of the presented procedure to a complete MPI program. With the method proposed in this work it is possible to perform a reusable WCET timing analysis for the communication in a NoC that is independent of the placement of tasks on the chip. Moreover, as the applied schedules are not optimized for a specific application but can be used for all applications in the same way, there are only marginal changes in the timing of the communication when the software is adapted or updated. Thus, there is no need to perform the timing analysis from scratch in such cases

    Non-minimal adaptive routing for efficient interconnection networks

    Get PDF
    RESUMEN: La red de interconexión es un concepto clave de los sistemas de computación paralelos. El primer aspecto que define una red de interconexión es su topología. Habitualmente, las redes escalables y eficientes en términos de coste y consumo energético tienen bajo diámetro y se basan en topologías que encaran el límite de Moore y en las que no hay diversidad de caminos mínimos. Una vez definida la topología, quedando implícitamente definidos los límites de rendimiento de la red, es necesario diseñar un algoritmo de enrutamiento que se acerque lo máximo posible a esos límites y debido a la ausencia de caminos mínimos, este además debe explotar los caminos no mínimos cuando el tráfico es adverso. Estos algoritmos de enrutamiento habitualmente seleccionan entre rutas mínimas y no mínimas en base a las condiciones de la red. Las rutas no mínimas habitualmente se basan en el algoritmo de balanceo de carga propuesto por Valiant, esto implica que doblan la longitud de las rutas mínimas y por lo tanto, la latencia soportada por los paquetes se incrementa. En cuanto a la tecnología, desde su introducción en entornos HPC a principios de los años 2000, Ethernet ha sido usado en un porcentaje representativo de los sistemas. Esta tesis introduce una implementación realista y competitiva de una red escalable y sin pérdidas basada en dispositivos de red Ethernet commodity, considerando topologías de bajo diámetro y bajo consumo energético y logrando un ahorro energético de hasta un 54%. Además, propone un enrutamiento sobre la citada arquitectura, en adelante QCN-Switch, el cual selecciona entre rutas mínimas y no mínimas basado en notificaciones de congestión explícitas. Una vez implementada la decisión de enrutar siguiendo rutas no mínimas, se introduce un enrutamiento adaptativo en fuente capaz de adaptar el número de saltos en las rutas no mínimas. Este enrutamiento, en adelante ACOR, es agnóstico de la topología y mejora la latencia en hasta un 28%. Finalmente, se introduce un enrutamiento dependiente de la topología, en adelante LIAN, que optimiza el número de saltos de las rutas no mínimas basado en las condiciones de la red. Los resultados de su evaluación muestran que obtiene una latencia cuasi óptima y mejora el rendimiento de algoritmos de enrutamiento actuales reduciendo la latencia en hasta un 30% y obteniendo un rendimiento estable y equitativo.ABSTRACT: Interconnection network is a key concept of any parallel computing system. The first aspect to define an interconnection network is its topology. Typically, power and cost-efficient scalable networks with low diameter rely on topologies that approach the Moore bound in which there is no minimal path diversity. Once the topology is defined, the performance bounds of the network are determined consequently, so a suitable routing algorithm should be designed to accomplish as much as possible of those limits and, due to the lack of minimal path diversity, it must exploit non-minimal paths when the traffic pattern is adversarial. These routing algorithms usually select between minimal and non-minimal paths based on the network conditions, where the non-minimal paths are built according to Valiant load-balancing algorithm. This implies that these paths double the length of minimal ones and then the latency supported by packets increases. Regarding the technology, from its introduction in HPC systems in the early 2000s, Ethernet has been used in a significant fraction of the systems. This dissertation introduces a realistic and competitive implementation of a scalable lossless Ethernet network for HPC environments considering low-diameter and low-power topologies. This allows for up to 54% power savings. Furthermore, it proposes a routing upon the cited architecture, hereon QCN-Switch, which selects between minimal and non-minimal paths per packet based on explicit congestion notifications instead of credits. Once the miss-routing decision is implemented, it introduces two mechanisms regarding the selection of the intermediate switch to develop a source adaptive routing algorithm capable of adapting the number of hops in the non-minimal paths. This routing, hereon ACOR, is topology-agnostic and improves average latency in all cases up to 28%. Finally, a topology-dependent routing, hereon LIAN, is introduced to optimize the number of hops in the non-minimal paths based on the network live conditions. Evaluations show that LIAN obtains almost-optimal latency and outperforms state-of-the-art adaptive routing algorithms, reducing latency by up to 30.0% and providing stable throughput and fairness.This work has been supported by the Spanish Ministry of Education, Culture and Sports under grant FPU14/02253, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2010-21291-C02-02, TIN2013-46957-C2-2-P, and TIN2013-46957-C2-2-P (AEI/FEDER, UE), the Spanish Research Agency under contract PID2019-105660RBC22/AEI/10.13039/501100011033, the European Union under agreements FP7-ICT-2011- 7-288777 (Mont-Blanc 1) and FP7-ICT-2013-10-610402 (Mont-Blanc 2), the University of Cantabria under project PAR.30.P072.64004, and by the European HiPEAC Network of Excellence through an internship grant supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. H2020-ICT-2015-687689

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    On Near Optimal Time and Dynamic Delay and Delay Variation Multicast Algorithms

    Get PDF
    Multicast is one of the most prevalent communication modes in computer networks. A plethora of systems and applications today rely on multicast communication to disseminate traffic including but not limited to teleconferencing, videoconferencing, stock exchanges, supercomputers, software update distribution, distributed database systems, and gaming. This dissertation elaborates and addresses key research challenges and problems related to the design and implementation of multicast algorithms. In particular, it investigates the problems of (1) Designing near optimal multicast time algorithms for mesh and torus connected systems and (2) Designing efficient algorithms for Delay and Delay Variation Bounded Multicast (DVBM). To achieve the first goal, improvements on four tree based multicast algorithms are made: Modified PAIR (MPAIR), Modified DIAG (MDIAG), Modified MIN (MMIN), and Modified DIST (MDIST). The proof that MDIAG generates optimal or optimal plus one multicast time in 2-Dimensional (2D) mesh networks is provided. The hybrid version of MDIAG (HMDIAG) is designed, that gives a 3-additive approximation algorithm on multicast time in 2D torus networks. To make HMDIAG applicable on systems using higher dimensional meshes and tori, it is extended and the proof that it gives a (2n-1)-additive approximation algorithm on multicast time in nD torus networks is given. To address the second goal, Directional Core Selection (DCS) algorithm for core selection and DVBM Tree generation is designed. To further reduce the delay variation of trees generated by DCS, a k-shortest-path based algorithm, Build Lower Variation Tree (BLVT) is designed. To tackle dynamic join/leave requests to the ongoing multicast session, the dynamic version of both algorithms is given that responds to requests by reorganizing the tree and avoiding session disruption. To solve cases where single-core based algorithms fail to construct a DVBM tree, a dynamic three-phase algorithm, Multi-core DVBM Trees (MCDVBMT) is designed, that semi-matches group members to core nodes

    Optimization of communication intensive applications on HPC networks

    Get PDF
    Communication is a necessary but overhead inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve communication performance of large scale applications. Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed. The primary contribution of this thesis is development of scalable diagnostic and prediction methods that can assist in the process of network designing, adapting applications to future systems, and optimizing execution of applications on existing systems. These meth- ods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet level simulator (called TraceR), all of which are described in this thesis

    A Scalable and Adaptive Network on Chip for Many-Core Architectures

    Get PDF
    In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept allows to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs is introduced

    Topology Agnostic Methods for Routing, Reconfiguration and Virtualization of Interconnection Networks

    Get PDF
    Modern computing systems, such as supercomputers, data centers and multicore chips, generally require efficient communication between their different system units; tolerance towards component faults; flexibility to expand or merge; and a high utilization of their resources. Interconnection networks are used in a variety of such computing systems in order to enable communication between their diverse system units. Investigation and proposal of new or improved solutions to topology agnostic routing and reconfiguration of interconnection networks are main objectives of this thesis. In addition, topology agnostic routing and reconfiguration algorithms are utilized in the development of new and flexible approaches to processor allocation. The thesis aims to present versatile solutions that can be used for the interconnection networks of a number of different computing systems. No particular routing algorithm was specified for an interconnection network technology which is now incorporated in Dolphin Express. The thesis states a set of criteria for a suitable routing algorithm, evaluates a number of existing routing algorithms, and recommend that one of the algorithms – which fulfils all of the criteria – is used. Further investigations demonstrate how this routing algorithm inherently supports fault-tolerance, and how it can be optimized for some network topologies. These considerations are also relevant for the InfiniBand interconnection network technology. Reconfiguration of interconnection networks (change of routing function) is a deadlock prone process. Some existing reconfiguration strategies include deadlock avoidance mechanisms that significantly reduce the network service offered to running applications. The thesis expands the area of application for one of the most versatile and efficient reconfiguration algorithms available in the literature, and proposes an optimization of this algorithm that improves the network service offered to running applications. Moreover, a new reconfiguration algorithm is presented that supports a replacement of the routing function without causing performance penalties. Processor allocation strategies that guarantee traffic-containment commonly pose strict requirements on the shape of partitions, and thus achieve only a limited utilization of a system’s computing resources. The thesis introduces two new approaches that are more flexible. Both approaches utilize the properties of a topology agnostic routing algorithm in order to enforce traffic-containment within arbitrarily shaped partitions. Consequently, a high resource utilization as well as isolation of traffic between different partitions is achieved

    Hardware Support for Efficient Packet Processing

    Full text link
    Scalability is the key ingredient to further increase the performance of today’s supercomputers. As other approaches like frequency scaling reach their limits, parallelization is the only feasible way to further improve the performance. The time required for communication needs to be kept as small as possible to increase the scalability, in order to be able to further parallelize such systems. In the first part of this thesis ways to reduce the inflicted latency in packet based interconnection networks are analyzed and several new architectural solutions are proposed to solve these issues. These solutions have been tested and proven in a field programmable gate array (FPGA) environment. In addition, a hardware (HW) structure is presented that enables low latency packet processing for financial markets. The second part and the main contribution of this thesis is the newly designed crossbar architecture. It introduces a novel way to integrate the ability to multicast in a crossbar design. Furthermore, an efficient implementation of adaptive routing to reduce the congestion vulnerability in packet based interconnection networks is shown. The low latency of the design is demonstrated through simulation and its scalability is proven with synthesis results. The third part concentrates on the improvements and modifications made to EXTOLL, a high performance interconnection network specifically designed for low latency and high throughput applications. Contributions are modules enabling an efficient integration of multiple host interfaces as well as the integration of the on-chip interconnect. Additionally, some of the already existing functionality has been revised and improved to reach better performance and a lower latency. Micro-benchmark results are presented to underline the contribution of the made modifications

    Efficient Multicast Algorithms for Mesh and Torus Networks

    Get PDF
    With the increasing popularity of multicomputers, efficient way of communication within its processors has become a popular area of research. Multicomputers refer to a computer system that has multiple processors, they have high computational power and they can perform multiple tasks concurrently. Mesh and Torus are some of the commonly used network topologies in building multicomputer systems. Their performance highly depends on the underlying network communication such as multicast. Multicast is a communication method in which a message is sent from a source node to a certain number of destinations. Two major parameters used to evaluate multicast are time that a multicast process takes to deliver the message to all destinations and traffic that indicates the number of links used for this process. Research indicates that in general, it is NP- complete to find an optimal multicasting algorithm which is efficient on both time and traffic. This thesis suggests two new algorithms to achieve multicast in mesh and torus networks. Extensive simulations of these algorithms show that in practice they perform better than existing ones
    corecore