
    Optimizing Communication for Massively Parallel Processing

    Current trends in high-performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM BlueGene/L machine with 128K processors, which is currently being deployed, is an important step in this direction. In this scenario, manually scaling applications becomes a significant burden for the programmer. Scaling involves addressing issues such as load imbalance and communication overhead. In this thesis, we explore several communication optimizations that help parallel applications scale easily to a large number of processors. We also present automatic runtime techniques that relieve the programmer from the burden of optimizing communication in their applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computing and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is effective only if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces that have co-processors, this happens to be true, and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers, so improving their performance is of great value. Using the techniques presented in this thesis, we have successfully scaled NAMD to 1 TF of peak performance on 3000 processors of PSC Lemieux.
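The overlap of communication with computation described above can be illustrated with a small scheduling sketch. This is not code from the thesis (which builds on a Charm++-style runtime); the generator-based VPs and round-robin scheduler below are purely illustrative assumptions.

```python
# Conceptual sketch (not the thesis runtime): several virtual processors (VPs)
# are mapped onto one physical processor. When a VP blocks waiting on a
# message, the scheduler runs another VP, overlapping communication with
# computation.

def virtual_processor(vp_id, steps):
    """A VP modeled as a generator: it yields whenever it waits on a message."""
    for step in range(steps):
        # ... compute phase for this step (placeholder) ...
        yield ("send", vp_id, step)  # hand control back while the message is in flight

def run_scheduler(num_vps, steps):
    """Round-robin over VPs: while one VP's message is in flight, another computes."""
    ready = [virtual_processor(i, steps) for i in range(num_vps)]
    order = []
    while ready:
        still_ready = []
        for vp in ready:
            try:
                event = next(vp)        # run this VP until it waits on communication
                order.append(event)
                still_ready.append(vp)  # it resumes once its message completes
            except StopIteration:
                pass                    # this VP has finished all its work
        ready = still_ready
    return order

# The interleaving shows VP 1 computing while VP 0's message is in flight:
schedule = run_scheduler(num_vps=2, steps=2)
# schedule == [("send", 0, 0), ("send", 1, 0), ("send", 0, 1), ("send", 1, 1)]
```

With more VPs per physical processor, each communication wait is filled by another VP's compute phase, which is the effect the thesis exploits.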
We study both point-to-point communication and collective communication (specifically, all-to-all communication). On a large number of processors, all-to-all communication can take several milliseconds to finish. With the synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. We therefore present an asynchronous collective communication framework that lets the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, the number of processors, and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamically migrating objects. We present the streaming strategy to optimize fine-grained object-to-object communication. We also motivate the need for hardware collectives, as processor-based collectives can be delayed by intermediate processors that are busy with computation. We explore a next-generation interconnect that supports collectives in the switching hardware, and show the performance gains of hardware collectives through synthetic benchmarks.
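Adaptive strategy switching of this kind can be sketched as a runtime decision function. The strategy names and thresholds below are hypothetical placeholders, not the actual rules from the thesis, which derives its choices from runtime observation.

```python
# Illustrative sketch: pick an all-to-all strategy from runtime-observed
# parameters. Strategy names and cutoff values are assumptions for
# illustration only.

def choose_all_to_all_strategy(message_size_bytes, num_procs):
    """Select an all-to-all strategy by message size and processor count."""
    if message_size_bytes < 1024:
        # Many tiny messages: combine them via intermediates to cut
        # per-message software overhead.
        return "mesh-combining"
    if num_procs > 1024:
        # Large machine with larger messages: stage through a virtual
        # 2D mesh to limit the number of simultaneous flows.
        return "2d-mesh"
    # Few processors or large messages: send directly to each destination.
    return "direct"

small_msg = choose_all_to_all_strategy(256, 4096)    # -> "mesh-combining"
large_msg = choose_all_to_all_strategy(65536, 64)    # -> "direct"
```

Because the decision runs at call time, the framework can switch strategies as the observed message sizes and processor counts change during execution.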

    Programming model abstractions for optimizing I/O intensive applications

    This thesis contributes, from the perspective of task-based programming models, to the efforts of optimizing I/O-intensive applications. Throughout this thesis, we propose programming model abstractions and mechanisms that target a twofold objective: on the one hand, improve the I/O and total performance of applications on today's complex storage infrastructures; on the other hand, achieve this performance improvement without increasing the complexity of application programming. The following paragraphs briefly summarize each of our contributions. First, to exploit the compute-I/O patterns of I/O-intensive applications and transparently improve I/O and total performance, we propose a number of abstractions that we refer to as I/O Awareness abstractions. An I/O-aware task-based programming model is able to separate the handling of I/O and computation by supporting I/O Tasks, whose execution can overlap with the execution of compute tasks. Moreover, we provide programming model support to improve I/O performance by addressing the issue of I/O congestion. This is achieved by using Storage Bandwidth Constraints to control the level of task parallelism. We support two types of such constraints: (i) static storage bandwidth constraints that are manually set by application developers, and (ii) auto-tunable constraints that are automatically set and tuned throughout the execution of the application. Second, in order to exploit the heterogeneity of modern storage systems to improve performance in a transparent manner, we propose a set of capabilities that we refer to as Storage Heterogeneity Awareness. A storage-heterogeneity-aware task-based programming model builds on the concepts and abstractions introduced in the first contribution to improve the I/O performance of applications on heterogeneous storage systems.
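A storage bandwidth constraint of the kind described above can be sketched as a bound on the number of concurrently running I/O tasks. This is a minimal sketch, not the thesis runtime; the class and parameter names are assumptions, and only the static variant is shown (an auto-tunable variant would adjust the bound from observed throughput during execution).

```python
import threading

# Conceptual sketch: a storage bandwidth constraint realized as a cap on
# concurrent I/O tasks, so compute-task parallelism is untouched while
# I/O congestion is bounded.

class IOConstraint:
    def __init__(self, max_io_tasks):
        # Static constraint: set once, e.g. by the application developer.
        self._slots = threading.Semaphore(max_io_tasks)

    def run_io_task(self, task, *args):
        self._slots.acquire()      # block if the constraint is saturated
        try:
            return task(*args)     # run the I/O task itself
        finally:
            self._slots.release()  # free a slot for the next waiting I/O task

constraint = IOConstraint(max_io_tasks=2)
result = constraint.run_io_task(lambda path: f"wrote {path}", "out.dat")
```

In a real runtime the acquire would happen in the scheduler before launching the task on a worker, so excess I/O tasks simply wait in the ready queue.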
More specifically, such programming models support the following features: (i) abstracting the heterogeneity of the storage devices and exposing them as one hierarchical storage resource; (ii) supporting dedicated I/O scheduling; and (iii) a mechanism that automatically and periodically flushes obsolete data from higher storage layers to lower storage layers. Third, targeting the increasing parallelism levels of applications, we propose a Hybrid Programming Model that combines task-based programming models and MPI. In this programming model, tasks are used to achieve coarse-grained parallelism on large-scale distributed infrastructures, whereas MPI is used to gain fine-grained parallelism by parallelizing task execution. Such a hybrid programming model offers the possibility of enabling parallel I/O and high-level I/O libraries in tasks. We enable this hybrid programming model by supporting Native MPI Tasks. These tasks are native to the programming model for two reasons: they execute task code, as opposed to calling external MPI binaries or scripts; and the data transfers and input/output handling are done in a manner completely transparent to application developers. This increases parallelism levels while easing the design and programming of applications. Finally, to exploit the inherent parallelism opportunities in applications and overlap computation with I/O, we propose an Eager mechanism for releasing data dependencies. Unlike the traditional approach to releasing dependencies, eagerly releasing data dependencies allows successor tasks to be released for execution as soon as their data dependencies are ready, without having to wait for the predecessor task(s) to completely finish execution.
In order to support the eager release of data dependencies, we describe the following core modifications to the design of task-based programming models: (i) defining and managing data dependency relationships as parameter-aware dependencies, and (ii) a mechanism for notifying the programming model that an output datum has been generated before the execution of the producer task ends.
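The eager-release mechanism can be sketched with a minimal dependency tracker. The class and method names below are illustrative assumptions, not the thesis API; the key point is that the producer notifies the tracker per output datum, mid-execution, rather than once at task completion.

```python
# Conceptual sketch: with eager release, a successor becomes ready as soon
# as the specific parameter it depends on has been produced, rather than
# when the producer task fully finishes.

class EagerDependencyTracker:
    def __init__(self):
        self._waiting = {}  # data name -> list of successor task names

    def add_dependency(self, data, successor):
        """Record that `successor` needs `data` (a parameter-aware dependency)."""
        self._waiting.setdefault(data, []).append(successor)

    def data_produced(self, data):
        """Called by the producer the moment an output datum is written,
        possibly long before the producer task itself returns. Returns the
        successors that can be released for execution right now."""
        return self._waiting.pop(data, [])

tracker = EagerDependencyTracker()
tracker.add_dependency("matrix_A", "multiply_task")
tracker.add_dependency("matrix_A", "transpose_task")
# The producer notifies mid-execution; both successors are released at once:
released = tracker.data_produced("matrix_A")
# released == ["multiply_task", "transpose_task"]
```

Under the traditional scheme, `data_produced` would only be invoked after the producer task returned, delaying both successors until then even though their input was ready earlier.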