26 research outputs found
Optimizing Communication for Massively Parallel Processing
The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications.
This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects.
The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis.
We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication.
The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication.
In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks
High performance communication on reconfigurable clusters
High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface.
Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications.
We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm.
We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud
Topology Agnostic Methods for Routing, Reconfiguration and Virtualization of Interconnection Networks
Modern computing systems, such as supercomputers, data centers and multicore chips, generally require efficient communication between their different system units; tolerance towards component faults; flexibility to expand or merge; and a high utilization of their resources. Interconnection networks are used in a variety of such computing systems in order to enable communication between their diverse system units.
Investigation and proposal of new or improved solutions to topology agnostic routing and reconfiguration of interconnection networks are main objectives of this thesis. In addition, topology agnostic routing and reconfiguration algorithms are utilized in the development of new and flexible approaches to processor allocation. The thesis aims to present versatile solutions that can be used for the interconnection networks of a number of different computing systems.
No particular routing algorithm was specified for an interconnection network technology which is now incorporated in Dolphin Express. The thesis states a set of criteria for a suitable routing algorithm, evaluates a number of existing routing algorithms, and recommend that one of the algorithms – which fulfils all of the criteria – is used. Further investigations demonstrate how this routing algorithm inherently supports fault-tolerance, and how it can be optimized for some network topologies. These considerations are also relevant for the InfiniBand interconnection network technology.
Reconfiguration of interconnection networks (change of routing function) is a deadlock prone process. Some existing reconfiguration strategies include deadlock avoidance mechanisms that significantly reduce the network service offered to running applications. The thesis expands the area of application for one of the most versatile and efficient reconfiguration algorithms available in the literature, and proposes an optimization of this algorithm that improves the network service offered to running applications. Moreover, a new reconfiguration algorithm is presented that supports a replacement of the routing function without causing performance penalties.
Processor allocation strategies that guarantee traffic-containment commonly pose strict requirements on the shape of partitions, and thus achieve only a limited utilization of a system’s computing resources. The thesis introduces two new approaches that are more flexible. Both approaches utilize the properties of a topology agnostic routing algorithm in order to enforce traffic-containment within arbitrarily shaped partitions. Consequently, a high resource utilization as well as isolation of traffic between different partitions is achieved
Cost Effective Routing Implementations for On-chip Networks
Arquitecturas de múltiples núcleos como multiprocesadores (CMP) y soluciones multiprocesador para sistemas dentro del chip (MPSoCs) actuales se basan en la eficacia de las redes dentro del chip (NoC) para la comunicación entre los diversos núcleos. Un diseño eficiente de red dentro del chip debe ser escalable y al mismo tiempo obtener valores ajustados de área, latencia y consumo de energÃa. Para diseños de red dentro del chip de propósito general se suele usar topologÃas de malla 2D ya que se ajustan a la distribución del chip. Sin embargo, la aparición de nuevos retos debe ser abordada por los diseñadores. Una mayor probabilidad de defectos de fabricación, la necesidad de un uso optimizado de los recursos para aumentar el paralelismo a nivel de aplicación o la necesidad de técnicas eficaces de ahorro de energÃa, puede ocasionar patrones de irregularidad en las topologÃas. Además, el soporte para comunicación colectiva es una caracterÃstica buscada para abordar con eficacia las necesidades de comunicación de los protocolos de coherencia de caché. En estas condiciones, un encaminamiento eficiente de los mensajes se convierte en un reto a superar.
El objetivo de esta tesis es establecer las bases de una nueva arquitectura para encaminamiento distribuido basado en lógica que es capaz de adaptarse a cualquier topologÃa irregular derivada de una estructura de malla 2D, proporcionando asà una cobertura total para cualquier caso resultado de soportar los retos mencionados anteriormente. Para conseguirlo, en primer lugar, se parte desde una base, para luego analizar una evolución de varios mecanismos, y finalmente llegar a una implementación, que abarca varios módulos para alcanzar el objetivo mencionado anteriormente. De hecho, esta última implementación tiene por nombre eLBDR (effective Logic-Based Distributed Routing). Este trabajo cubre desde el primer mecanismo, LBDR, hasta el resto de mecanismos que han surgido progresivamente.Rodrigo MocholÃ, S. (2010). Cost Effective Routing Implementations for On-chip Networks [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8962Palanci
Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters
Actualmente, los clusters de PC son un alternativa rentable a los computadores paralelos.
En estos sistemas, miles de componentes (procesadores y/o discos duros) se conectan a través de redes de interconexión de altas prestaciones.
Entre las tecnologÃas de red actualmente disponibles para construir clusters, InfiniBand (IBA) ha emergido como un nuevo estándar de interconexión para clusters.
De hecho, ha sido adoptado por muchos de los sistemas más potentes construidos actualmente (lista top500).
A medida que el número de nodos aumenta en estos sistemas, la red de interconexión también crece.
Junto con el aumento del número de componentes la probabilidad de averÃas aumenta dramáticamente, y asÃ, la tolerancia a fallos en el sistema en general, y de la red de interconexión en particular, se convierte en una necesidad.
Desafortunadamente, la mayor parte de las estrategias de encaminamiento tolerantes a fallos propuestas para los computadores masivamente paralelos no pueden ser aplicadas porque el encaminamiento y las transiciones de canal virtual son deterministas en IBA, lo que impide que los paquetes eviten los fallos.
Por lo tanto, son necesarias nuevas estrategias para tolerar fallos.
Por ello, esta tesis se centra en proporcionar los niveles adecuados de tolerancia a fallos a los clusters de PC, y en particular a las redes IBA.
En esta tesis proponemos y evaluamos varios mecanismos adecuados para las redes de interconexión para clusters.
El primer mecanismo para proporcionar tolerancia a fallos en IBA (al que nos referimos como encaminamiento tolerante a fallos basado en transiciones; TFTR) consiste en usar varias rutas disjuntas entre cada par de nodos origen-destino y seleccionar la ruta apropiada en el nodo fuente usando el mecanismo APM proporcionado por IBA.
Consiste en migrar las rutas afectadas por el fallo a las rutas alternativas sin fallos.
Sin embargo, con este fin, es necesario un algoritmo eficiente de encaminamiento capaz de proporcionar suficientesMontañana Aliaga, JM. (2008). Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2603Palanci
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
Accelerating Network Communication and I/O in Scientific High Performance Computing Environments
High performance computing has become one of the major drivers behind technology inventions and science discoveries. Originally driven through the increase of operating frequencies and technology scaling, a recent slowdown in this evolution has led to the development of multi-core architectures, which are supported by accelerator devices such as graphics processing units (GPUs). With the upcoming exascale era, the overall power consumption and the gap between compute capabilities and I/O bandwidth have become major challenges. Nowadays, the system performance is dominated by the time spent in communication and I/O, which highly depends on the capabilities of the network interface. In order to cope with the extreme concurrency and heterogeneity of future systems, the software ecosystem of the interconnect needs to be carefully tuned to excel in reliability, programmability, and usability.
This work identifies and addresses three major gaps in today's interconnect software systems. The I/O gap describes the disparity in operating speeds between the computing capabilities and second storage tiers. The communication gap is introduced through the communication overhead needed to synchronize distributed large-scale applications and the mixed workload. The last gap is the so called concurrency gap, which is introduced through the extreme concurrency and the inflicted learning curve posed to scientific application developers to exploit the hardware capabilities.
The first contribution is the introduction of the network-attached accelerator approach, which moves accelerators into a "stand-alone" cluster connected through the Extoll interconnect. The novel communication architecture enables the direct accelerators communication without any host interactions and an optimal application-to-compute-resources mapping. The effectiveness of this approach is evaluated for two classes of accelerators: Intel Xeon Phi coprocessors and NVIDIA GPUs.
The next contribution comprises the design, implementation, and evaluation of the support of legacy codes and protocols over the Extoll interconnect technology. By providing TCP/IP protocol support over Extoll, it is shown that the performance benefits of the interconnect can be fully leveraged by a broader range of applications, including the seamless support of legacy codes.
The third contribution is twofold. First, a comprehensive analysis of the Lustre networking protocol semantics and interfaces is presented. Afterwards, these insights are utilized to map the LNET protocol semantics onto the Extoll networking technology. The result is a fully functional Lustre network driver for Extoll. An initial performance evaluation demonstrates promising bandwidth and message rate results.
The last contribution comprises the design, implementation, and evaluation of two easy-to-use load balancing frameworks, which transparently distribute the I/O workload across all available storage system components. The solutions maximize the parallelization and throughput of file I/O. The frameworks are evaluated on the Titan supercomputing systems for three I/O interfaces. For example for large-scale application runs, POSIX I/O and MPI-IO can be improved by up to 50% on a per job basis, while HDF5 shows performance improvements of up to 32%
Recommended from our members
Performance Modelling and Evaluation of Network On Chip Under Bursty Traffic. Performance evaluation of communication networks using analytical and simulation models in NOCs with Fat tree topology under Bursty Traffic with virtual channels.
Physical constrains of integrated circuits (commonly called chip) in regards to size and finite number of wires, has made the design of System-on-Chip (SoC) more interesting to study in terms of finding better solutions for the complexity of the chip-interconnections. The SoC has hundreds of Processing Elements (PEs), and a single shared bus can no longer be acceptable due to poor scalability with the system size. Networks on Chip (NoC) have been proposed as a solution to mitigate complex on-chip communication problems for complex SoCs. They consists of computational resources in the form of PE cores and switching nodes which allow PEs to communicate with each other.
In the design and development of Networks on Chip, performance modelling and analysis has great theoretical and practical importance. This research is devoted to developing efficient and cost-effective analytical tools for the performance analysis and enhancement of NoCs with m-port n-tree topology under bursty traffic.
Recent measurement studies have strongly verified that the traffic generated by many real-world applications in communication networks exhibits bursty and self-similar properties in nature and the message destinations are uniformly distributed. NoC's performance is generally affected by different traffic patterns generated by the processing elements. As the first step in the research, a new analytical model is developed to capture the burstiness and self-similarity characteristics of the traffic within NoCs through the use of Markov Modulated Poisson Process. The performance results of the developed model highlight the importance of accurate traffic modelling in the study and performance evaluation of NoCs.
Having developed an efficient analytical tool to capture the traffic behaviour with a higher accuracy, in the next step, the research focuses on the effect of topology on the performance of NoCs. Many important challenges still remain as vulnerabilities within the design of NoCs with topology being the most important. Therefore a new analytical model is developed to investigate the performance of NoCs with the m-port n-tree topology under bursty traffic. Even though it is broadly proved in practice that fat-tree topology and its varieties result in lower latency, higher throughput and bandwidth, still most studies on NoCs adopt Mesh, Torus and Spidergon topologies. The results gained from the developed model and advanced simulation experiments significantly show the effect of fat-tree topology in reducing latency and increasing the throughput of NoCs.
In order to obtain deeper understanding of NoCs performance attributes and for further improvement, in the final stage of the research, the developed analytical model was extended to consider the use of virtual channels within the architecture of NoCs. Extensive simulation experiments were carried out which show satisfactory improvements in the throughput of NoCs with fat-tree topology and VCs under bursty traffic. The analytical results and those obtained from extensive simulation experiments have shown a good degree of accuracy for predicting the network performance under different design alternatives and various traffic conditions.Libyan Ministry of Higher Educatio
MultiPaths Revisited - A novel approach using OpenFlow-enabled devices
This thesis presents novel approaches enhancing the performance of computer networks using multipaths. Our enhancements take the form of congestion- aware routing protocols. We present three protocols called MultiRoute, Step- Route, and finally PathRoute. Each of these protocols leverage both local and remote congestion statistics and build different representations (or views) of the network congestion by using an innovative representation of congestion for router-router links. These congestion statistics are then distributed via an aggregation protocol to other routers in the network. For many years, multipath routing protocols have only been used in simple situations, such as Link Aggregation and/or networks where paths of equal cost (and therefore equal delay) exist. But, paths of unequal costs are often discarded to the benefit of shortest path only routing because it is known that paths of unequal length present different delays and therefore cause out of order packets which cause catastrophic network performances. Further, multipaths become highly beneficial when alternative paths are selected based on the network congestion. But, no realistic solution has been proposed for congestion-aware multipath networks. We present in this thesis a method which selects alternative paths based on network congestion and completely avoids the issue of out of order packets by grouping packets into flows and binding them to a single path for a limited duration. The implementation of these protocols relies heavily on OpenFlow and NOX. OpenFlow enables network researchers to control the behavior of their network equipment by specifying rules in the routers flow table. NOX provides a simple Application Programming Interface (API) to program a routers flow table. Therefore by using OpenFlow and NOX, we are able to define new routing protocols like the ones which we will present in this thesis. We show in this thesis that grouping packets together, while not optimal, still provides a significant increase in network performance. More precisely we show that our protocols can, in some cases, achieve up to N times the throughput of Shortest Path (SP), where N is the number of distinct paths of identical throughput from source to destination. We also show that our protocols provide more predictable throughput than simple hash-based routing algorithms. Todays networks provide more and more connections between any source- destination pair. Most of these connections remain idle until some failure occurs. Using the protocols proposed in this thesis, networks could leverage the added bandwidth provided by these currently idle connections. Therefore, we could increase the overall performance of current networks without replacing the existing hardware
Scaling High-Performance Interconnect Architectures to Many-Core Systems.
The ever-increasing demand for performance scaling has made multi-core (2-8 cores) chips prevalent in today’s computing systems and foreshadows the shift toward many-core (10s- 100s of cores) chips in the near future. Although the potential performance gains from many-core systems remain appealing, the widespread adoption of these systems hinges on their ability to scale performance while simultaneously satisfying Quality-of-Service (QoS) and energy-efficiency constraints.
This work makes the case that the interconnect for these many-core systems has a significant impact on the aforementioned scalability issues. The impact of interconnects on many-core systems is illustrated by observing that the degree of the interconnect has a signicant effect on system scalability and demonstrating that the architecture of high-radix, many-core systems are feasible, energy-efficient, and high-performance.
The feasibility of high-radix crossbars for many-core systems is first shown through a new circuit-level building block called the Swizzle-Switch which can operate at frequencies up to 1.5GHz for 128-bit, radix-64 crossbars.
This work then shows how a many-core system called the Swizzle-Switch Network (SSN) can use the Swizzle-Switch as the central building block for a flat crossbar interconnect. The SSN is shown to be advantageous to traditional Network-on-Chip (NoC) for systems up to 64 cores. The SSN performance by 21% relative to a Mesh while also providing a 25% energy savings over the Mesh.
The Swizzle-Switch is also leveraged as a building block for high-radix NoC topologies that can support many-core architectures. The Swizzle-Switch-based Flattened Butterfly topology is demonstrated to provide a 15% speedup and 10% energy savings over the Mesh.
Finally, the impact that 3D stacking technology has on many-core scalability is evaluated for bus and crossbar interconnects. A 3D-optimized Swizzle-Switch Network is able to leverage frequency gains to achieve a 15-28% speedup over a 2D-Swizzle-Switch Network when using memory- intensive benchmarks. Additionally, a bus-based 64-core architecture is shown to provide an average speedup of 49× over a baseline uniprocessor system when using 3D technology.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/93980/1/ksewell_1.pd