19 research outputs found

    Optimizing Communication for Massively Parallel Processing

    Get PDF
    The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks

    On packet switch design

    Get PDF

    Performance and policy dimensions in internet routing

    Get PDF
    The Internet Routing Project, referred to in this report as the 'Highball Project', has been investigating architectures suitable for networks spanning large geographic areas and capable of very high data rates. The Highball network architecture is based on a high speed crossbar switch and an adaptive, distributed, TDMA scheduling algorithm. The scheduling algorithm controls the instantaneous configuration and swell time of the switch, one of which is attached to each node. In order to send a single burst or a multi-burst packet, a reservation request is sent to all nodes. The scheduling algorithm then configures the switches immediately prior to the arrival of each burst, so it can be relayed immediately without requiring local storage. Reservations and housekeeping information are sent using a special broadcast-spanning-tree schedule. Progress to date in the Highball Project includes the design and testing of a suite of scheduling algorithms, construction of software reservation/scheduling simulators, and construction of a strawman hardware and software implementation. A prototype switch controller and timestamp generator have been completed and are in test. Detailed documentation on the algorithms, protocols and experiments conducted are given in various reports and papers published. Abstracts of this literature are included in the bibliography at the end of this report, which serves as an extended executive summary

    Design and performance evaluation of switching architectures for high-speed Internet

    Get PDF
    The motivation for this thesis is the desire to build faster and scalable routers that efficiently handle the exponential traffic growth in the Internet. The Internet forwards information through a mesh of routers and switches, which has to keep up with the increasing demands of traffic. Shared-memory based switches are known to provide the best throughput-delay performance for a given memory size. In this thesis performance of commonly used memory-sharing schemes for the shared memory switches are evaluated under balanced and unbalanced bursty traffic. The scalability of shared-memory switches has been a research issue for quite sometime. One approach is to employ multiple memory modules and use them in parallel to enhance the capacity. The two well-known architectures in this category are (i) shared-multibuffer (SMB) switch architecture invented by Yamanaka et al. of Mitsubishi Electric Corporation, Japan; and (ii) the sliding-window (SW) switch architecture invented by Dr. Kumar of UTPA, Texas, USA. In this thesis, performance of these two architectures are evaluated and compared. Furthermore, in this thesis, the SW switch architecture is extended to enable priority switching to provide differentiated Quality of Service (QoS) for different traffic classes

    Efficient Q. S support for higt-performance interconnects

    Get PDF
    Las redes de interconexi贸n son un componente clave en un gran n煤mero de sistemas. Los mecanismos de calidad de servicio (qos) son responsables de asegurar que se alcanza un cierto rendimiento en la red. Las soluciones tradicionales para ofrecer qos en redes de interconexi贸n de altas prestaciones normalmente se basan en arquitecturas complejas. El principal objetivo de esta tesis es investigar si podemos ofrecer mecanismos eficientes de qos. Nuestro prop贸sito es alcanzar un soporte completo de qos con el m铆nimo de recursos. Para ello, se identifican redundancias en los mecanismos propuestos de qos y son eliminados sin afectar al rendimiento. Esta tesis consta de tres partes. En la primera comenzamos con las propuestas tradicionales de qos a nivel de clase de tr谩fico. En la segunda parte, proponemos como adaptar los mecanismos de qos basados en deadlines para redes de interconexi贸n de altas prestaciones. Por 煤ltimo, tambi茅n investigamos la interacci贸n de los mecanismos de qos con el control de congesti贸n
    corecore