13 research outputs found

    Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

    [Abstract] The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-to-learn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally suffer from inefficient communication middleware, most of which use socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several low-level message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a production-quality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, significantly improving the scalability of MPJ applications.
Furthermore, this Thesis has analyzed the potential of cloud computing for extending the reach of HPC, where Infrastructure as a Service (IaaS) offerings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which specifically target HPC workloads, have been thoroughly assessed. The experimental results have shown the significant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of proper virtualization support, such as direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis.

    Performance analysis of HPC applications in the cloud

    [Abstract] The scalability of High Performance Computing (HPC) applications depends heavily on the efficient support of network communications in virtualized environments. However, Infrastructure as a Service (IaaS) providers are more focused on deploying systems with higher computational power interconnected via high-speed networks than on improving the scalability of the communication middleware. This paper analyzes the main performance bottlenecks in HPC application scalability on the Amazon EC2 Cluster Compute platform by: (1) evaluating the communication performance on shared memory and a virtualized 10 Gigabit Ethernet network; (2) assessing the scalability of representative HPC codes, the NAS Parallel Benchmarks, using a large number of cores, up to 512; (3) analyzing the new cluster instances (CC2) in terms of single-instance performance, scalability and cost-efficiency; (4) suggesting techniques for reducing the impact of the virtualization overhead on the scalability of communication-intensive HPC codes, such as giving the Virtual Machine direct access to the network and reducing the number of processes per instance; and (5) proposing the combination of message-passing with multithreading as the most scalable and cost-effective option for running HPC applications on the Amazon EC2 Cluster Compute platform. Funding: Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Economía y Competitividad; AP2010-4348.

    The readying of applications for heterogeneous computing

    High performance computing is approaching a potentially significant change in architectural design. With pressures on the cost and sheer amount of power, additional architectural features are emerging which require a rethink of the programming models deployed over the last two decades. Today's emerging high performance computing (HPC) systems maximise performance per unit of power consumed, so that the system is made up of a range of different specialised building blocks, each with its own purpose. This heterogeneity is not limited to the hardware components but extends to the mechanisms that exploit them. These multiple levels of parallelism, instruction sets and memory hierarchies result in truly heterogeneous computing in all aspects of the global system. These emerging architectural solutions will require the software to exploit tremendous amounts of on-node parallelism, and programming models to address this are indeed emerging. In theory, the application developer can design new software using these models to exploit emerging low power architectures. In practice, however, real industrial scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next generation supercomputing platforms. Identifying that migration path is non-trivial: with applications spanning many decades and consisting of many millions of lines of code and multiple scientific algorithms, any changes to the programming model will be extensive and invasive, and the chosen model may turn out to be the wrong one for the application in question. This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available because of commercial or security sensitivity constraints.
This thesis highlights this problem by assessing current and emerging hardware with an industrial strength code, and demonstrating the issues described. In turn it looks at the methodology of using proxy applications in place of real industry applications to assess their suitability on the next generation of low power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address. The maturity and performance portability of a number of alternative programming methodologies are evaluated on a number of architectures, and the broader adoption of these proxy applications is highlighted, both within the author's own organisation and across the industry as a whole.

    Acceleration of the hardware-software interface of a communication device for parallel systems

    During the last decades the ever-growing need for computational power fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to be able to exploit modern system architectures. Today, scalability of applications is more and more limited both by development resources, as programming of complex parallel applications becomes increasingly demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed memory systems. Lowering the latency of communication is mandatory to increase scalability and serves as an enabling technology for programming of distributed memory systems at a higher abstraction layer using higher degrees of compiler-driven automation. At the same time it can increase performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, are presented. Several new architectural contributions are made in this thesis, namely a new efficient method for virtual-to-physical address translation named ATU and a novel method to issue operations to a virtual device in an optimal way, which has been termed Transactional I/O. This new method needs changes in the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are developed and employed in the development of the EXTOLL hardware to facilitate usage together with contemporary CPUs. These new methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations. 
The complete architecture has been prototyped using FPGA technology enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the different clock domains of the design. The EXTOLL software stack is developed and a performance evaluation of the software using the EXTOLL hardware is performed. The performance evaluation shows an excellent start-up latency value of 1.3 μs, which competes with the most advanced networks available, in spite of the technological performance handicap encountered by FPGA technology. The resulting network is, to the best of the knowledge of the author, the fastest FPGA-based interconnection network for commodity processors ever built

    Tuning Parallel Applications in Parallel

    Auto-tuning has recently received significant attention from the High Performance Computing community. Most auto-tuning approaches are specialized to work either on specific domains such as dense linear algebra and stencil computations, or only at certain stages of program execution such as compile time and runtime. Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, we describe a unified end-to-end approach to auto-tuning scientific applications. Our system, Active Harmony, takes a search-based collaborative approach to auto-tuning. Application programmers, library writers and compilers collaborate to describe and export a set of performance related tunable parameters to the Active Harmony system. These parameters define a tuning search-space. The auto-tuner monitors the program performance and suggests adaptation decisions. The decisions are made by a central controller using a parallel search algorithm. The algorithm leverages parallel architectures to search across a set of optimization parameter values. Different nodes of a parallel system evaluate different configurations at each timestep. Active Harmony supports runtime adaptive code-generation and tuning for parameters that require new code (e.g. unroll factors). Effectively, we merge traditional feedback directed optimization and just-in-time compilation. This feature also enables application developers to write applications once and have the auto-tuner adjust the application behavior automatically when run on new systems. We evaluated our system on multiple large-scale parallel applications and showed that our system can improve the execution time by up to 46% compared to the original version of the program. 
Finally, we believe that the success of any auto-tuning research depends on how effectively application developers, domain-experts and auto-tuners communicate and work together. To that end, we have developed and released a simple and extensible language that standardizes the parameter space representation. Using this language, developers and researchers can collaborate to export tunable parameters to the tuning frameworks. Relationships (e.g. ordering, dependencies, constraints, ranking) between tunable parameters and search-hints can also be expressed
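
The search loop described in this abstract (several nodes each evaluating a different configuration at every timestep, with a central controller picking the next step) can be illustrated with a small sketch. This is not Active Harmony's actual algorithm or API: the neighbourhood search, the `measure` cost model, and the parameter names `unroll` and `tile` are all hypothetical, and threads stand in for the nodes of a parallel system.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def measure(config):
    """Stand-in for timing one application timestep run with `config`.

    Hypothetical cost model whose minimum is at unroll=4, tile=32.
    """
    return (config["unroll"] - 4) ** 2 + ((config["tile"] - 32) / 8) ** 2


def neighbours(best, space, count, rng):
    """Propose `count` configurations near `best`, one per node."""
    proposals = []
    for _ in range(count):
        candidate = {}
        for name, values in space.items():
            # Move at most one position along each parameter's value list.
            i = values.index(best[name]) + rng.choice((-1, 0, 1))
            candidate[name] = values[max(0, min(len(values) - 1, i))]
        proposals.append(candidate)
    return proposals


def tune(space, nodes=4, timesteps=20, seed=0):
    """Central controller: keep the best configuration seen so far."""
    rng = random.Random(seed)
    best = {name: values[0] for name, values in space.items()}
    best_cost = measure(best)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        for _ in range(timesteps):
            candidates = neighbours(best, space, nodes, rng)
            # Each "node" evaluates a different configuration this timestep.
            costs = list(pool.map(measure, candidates))
            step_cost, step_best = min(zip(costs, candidates), key=lambda p: p[0])
            if step_cost < best_cost:
                best, best_cost = step_best, step_cost
    return best, best_cost


# Hypothetical tuning search space exported by the application.
SPACE = {"unroll": [1, 2, 4, 8], "tile": [8, 16, 32, 64]}

if __name__ == "__main__":
    print(tune(SPACE))
```

Because the controller only accepts a candidate that improves on the current best, the tuned cost can never be worse than that of the starting configuration; a real tuner would also have to cope with measurement noise.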

    Scalable system software for high performance large-scale applications

    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available to a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as new requirements to handle data-intensive computations. Data-intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, the next generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis starts at the Operating System level, first studying general-purpose OSes (e.g., Linux) and then lightweight kernels (e.g., CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally, we focus on hardware features that can be exploited at user level to improve application performance, and potentially included in our advanced runtime system. The thesis contributions are the following. Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such as FTQ (Fixed Time Quantum). 
Evaluation of the address translation management for a lightweight kernel: We provide a performance evaluation of different TLB management approaches (dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries, i.e. no TLB misses) on an IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular applications on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation, and aggregation of small messages to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large-scale irregular applications: breadth-first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of a controllable hardware-thread priority mechanism that controls the rate at which each hardware thread decodes instructions on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploit cache locality and the network-on-chip on the Tilera many-core architecture to improve intra-core scalability.


    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists a need for power monitoring and regulation at runtime to achieve savings in energy. This dissertation highlights an adaptive runtime framework that enables processors with core-specific power control by dynamically adapting to workload characteristics to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and waiting on memory. Monitoring of performance and power regulation is performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in presence of workload imbalance, the runtime reduces Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, the lowering of frequency on non-critical cores reduces run-to-run performance variation and improves performance. 
For the full application, an average speedup of 11% is seen, while the power is lowered by about 31% for an energy savings of up to 42%. Another experiment on 16 nodes (256 cores) that are power capped also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, the memory metrics identified facilitate lowering of power by up to 32% without adversely impacting performance. Doctor of Philosophy
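
The figures quoted for the full application can be related through energy = average power × execution time. The sketch below is only a back-of-the-envelope check, and it assumes that "speedup of 11%" means old time / new time = 1.11 and that the 31% power reduction applies to average power over the run; neither definition is stated in the abstract, and the function name is hypothetical.

```python
def energy_ratio(speedup, power_reduction):
    """Return E_new / E_old, using E = average power * execution time.

    Assumes `speedup` is old_time / new_time and `power_reduction`
    is the fractional drop in average power (both are assumptions).
    """
    time_ratio = 1.0 / speedup
    power_ratio = 1.0 - power_reduction
    return power_ratio * time_ratio


if __name__ == "__main__":
    ratio = energy_ratio(speedup=1.11, power_reduction=0.31)
    print(f"energy saving implied by the averages: {1 - ratio:.1%}")
```

Under these assumptions the average figures imply an energy saving of about 38%, which is consistent with the reported 42% being a best case ("up to") and not an average.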

    Performance modelling and optimisation of inertial confinement fusion simulation codes

    Legacy code performance has failed to keep up with that of modern hardware. Many new hardware features remain under-utilised, with the majority of code bases still unable to make use of accelerated or heterogeneous architectures. Code maintainers now accept that they can no longer rely solely on hardware improvements to drive code performance, and that changes at the software engineering level need to be made. The principal focus of the work presented in this thesis is an analysis of the changes legacy Inertial Confinement Fusion (ICF) codes need to make in order to efficiently use current and future parallel architectures. We discuss the process of developing a performance model, and demonstrate the ability of such a model to make accurate predictions about code performance for code variants on a range of architectures. We build on the knowledge gained from such a process, and examine how Particle-in-Cell (PIC) codes must change in order to reach the levels of portable and future-proof performance needed to leverage the capabilities of modern hardware. As part of this investigation, we present an OpenCL port of the legacy code EPOCH, as well as a fully featured mini-app representing EPOCH. Finally, as a direct consequence of these investigations, we apply these performance optimisations to the production version of EPOCH, culminating in a speedup of over 2x for the core algorithm.