
    Design of scalable Java message-passing communications over InfiniBand

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-011-0654-9
    [Abstract] This paper presents ibvdev, a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments aggregate a significant number of cores through advanced network technologies such as InfiniBand, which increases the complexity of communication protocols, especially on hybrid shared/distributed memory architectures such as clusters. Here, Java is an attractive choice for developing communication middleware for these systems, as it provides built-in networking and multithreading support. As the performance gap between Java and compiled languages has been narrowing in recent years, Java is an emerging option for High Performance Computing (HPC). The developed communication middleware, ibvdev, increases the performance of Java applications on clusters of multicore processors interconnected via InfiniBand by: (1) providing Java with direct access to InfiniBand through the InfiniBand Verbs API, somewhat restricted so far to MPI libraries; (2) implementing an efficient and scalable communication protocol that obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express. The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits: up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving performance increases of up to one order of magnitude compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increases the scalability of communication-intensive Java HPC applications.
    Ministerio de Ciencia e Innovación; TIN2010-1673

    Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

    [Abstract] The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-to-learn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally suffer from inefficient communication middleware, most of which uses socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several low-level message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a production-quality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, significantly improving the scalability of MPJ applications. Furthermore, this Thesis has analyzed the potential of cloud computing for extending the reach of HPC, where Infrastructure as a Service (IaaS) offerings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which specifically target HPC workloads, have been thoroughly assessed. The experimental results have shown the significant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of proper virtualization support, such as direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis.

    X10 for high-performance scientific computing

    High performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture. Scientific applications typically manifest physical locality, which means that interactions between entities or events that are nearby in space or time are stronger than more distant interactions. Linear-scaling methods exploit physical locality by approximating distant interactions, to reduce computational complexity so that cost is proportional to system size. In these methods, the computation required for each portion of the system is different depending on that portion’s contribution to the overall result. To support productive development, application programmers need programming models that cleanly map aspects of the physical system being simulated to the underlying computer architecture while also supporting the irregular workloads that arise from the fragmentation of a physical system. X10 is a new programming language for high-performance computing that uses the asynchronous partitioned global address space (APGAS) model, which combines explicit representation of locality with asynchronous task parallelism. This thesis argues that the X10 language is well suited to expressing the algorithmic properties of locality and irregular parallelism that are common to many methods for physical simulation. The work reported in this thesis was part of a co-design effort involving researchers at IBM and ANU in which two significant computational chemistry codes were developed in X10, with an aim to improve the expressiveness and performance of the language. The first is a Hartree–Fock electronic structure code, implemented using the novel Resolution of the Coulomb Operator approach. 
The second evaluates electrostatic interactions between point charges, using either the smooth particle mesh Ewald method or the fast multipole method, with the latter used to simulate ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer. We compare the performance of both X10 applications to state-of-the-art software packages written in other languages. This thesis presents improvements to the X10 language and runtime libraries for managing and visualizing the data locality of parallel tasks, communication using active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples. This work demonstrates that X10 can achieve performance comparable to established programming languages when running on a single core. More importantly, X10 programs can achieve high parallel efficiency on a multithreaded architecture, given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local data. For distributed memory architectures, X10 supports the use of active messages to construct local, asynchronous communication patterns which outperform global, synchronous patterns. Although point-to-point active messages may be implemented efficiently, productive application development also requires collective communications; more work is required to integrate both forms of communication in the X10 language. The exploitation of locality is the key insight in both linear-scaling methods and the APGAS programming model; their combination represents an attractive opportunity for future co-design efforts.

    Adaptive heterogeneous parallelism for semi-empirical lattice dynamics in computational materials science.

    With the variability in performance of the multitude of parallel environments available today, the conceptual overhead created by the need to anticipate runtime information to make design-time decisions has become overwhelming. Performance-critical applications and libraries carry implicit assumptions based on incidental metrics that are not portable to emerging computational platforms or even alternative contemporary architectures. Furthermore, the significance of runtime concerns such as makespan, energy efficiency and fault tolerance depends on the situational context. This thesis presents a case study in the application of both Mattson's prescriptive pattern-oriented approach and the more principled structured parallelism formalism to the computational simulation of inelastic neutron scattering spectra on hybrid CPU/GPU platforms. The original ad hoc implementation as well as new pattern-based and structured implementations are evaluated for relative performance and scalability. Two new structural abstractions are introduced to facilitate adaptation by lazy optimisation and runtime feedback. A deferred-choice abstraction represents a unified space of alternative structural program variants, allowing static adaptation through model-specific exhaustive calibration with regard to the extrafunctional concerns of runtime, average instantaneous power and total energy usage. Instrumented queues serve as a mechanism for structural composition and provide a representation of extrafunctional state that allows realisation of a market-based decentralised coordination heuristic for competitive resource allocation and the Lyapunov drift algorithm for cooperative scheduling.

    Visualization challenges in distributed heterogeneous computing environments

    Large-scale computing environments are important for many aspects of modern life. They drive scientific research in biology and physics, facilitate industrial rapid prototyping, and provide information relevant to everyday life such as weather forecasts. Their computational power grows steadily to provide faster response times and to satisfy the demand for higher complexity in simulation models as well as more details and higher resolutions in visualizations. For some years now, the prevailing trend for these large systems is the utilization of additional processors, like graphics processing units. These heterogeneous systems, that employ more than one kind of processor, are becoming increasingly widespread since they provide many benefits, like higher performance or increased energy efficiency. At the same time, they are more challenging and complex to use because the various processing units differ in their architecture and programming model. This heterogeneity is often addressed by abstraction but existing approaches often entail restrictions or are not universally applicable. As these systems also grow in size and complexity, they become more prone to errors and failures. Therefore, developers and users become more interested in resilience besides traditional aspects, like performance and usability. While fault tolerance is well researched in general, it is mostly dismissed in distributed visualization or not adapted to its special requirements. Finally, analysis and tuning of these systems and their software is required to assess their status and to improve their performance. The available tools and methods to capture and evaluate the necessary information are often isolated from the context or not designed for interactive use cases. These problems are amplified in heterogeneous computing environments, since more data is available and required for the analysis. 
Additionally, real-time feedback is required in distributed visualization to correlate user interactions to performance characteristics and to decide on the validity and correctness of the data and its visualization. This thesis presents contributions to all of these aspects. Two approaches to abstraction are explored for general purpose computing on graphics processing units and visualization in heterogeneous computing environments. The first approach hides details of different processing units and allows using them in a unified manner. The second approach employs per-pixel linked lists as a generic framework for compositing and simplifying order-independent transparency for distributed visualization. Traditional methods for fault tolerance in high performance computing systems are discussed in the context of distributed visualization. On this basis, strategies for fault-tolerant distributed visualization are derived and organized in a taxonomy. Example implementations of these strategies, their trade-offs, and resulting implications are discussed. For analysis, local graph exploration and tuning of volume visualization are evaluated. Challenges in dense graphs like visual clutter, ambiguity, and inclusion of additional attributes are tackled in node-link diagrams using a lens metaphor as well as supplementary views. An exploratory approach for performance analysis and tuning of parallel volume visualization on a large, high-resolution display is evaluated. This thesis takes a broader look at the issues of distributed visualization on large displays and heterogeneous computing environments for the first time. While the presented approaches all solve individual challenges and are successfully employed in this context, their joint utility forms a solid basis for future research in this young field. 
In its entirety, this thesis presents building blocks for robust distributed visualization on current and future heterogeneous visualization environments.

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of general-purpose GPUs (GPGPUs), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

    Performance Observability and Monitoring of High Performance Computing with Microservices

    Traditionally, High Performance Computing (HPC) software has been built and deployed as bulk-synchronous, parallel executables based on the message-passing interface (MPI) programming model. The rise of data-oriented computing paradigms and an explosion in the variety of applications that need to be supported on HPC platforms have forced a re-think of the appropriate programming and execution models to integrate this new functionality. In situ workflows mark a paradigm shift in HPC software development methodologies, enabling a range of new applications, from user-level data services to machine learning (ML) workflows that run alongside traditional scientific simulations. By tracing the evolution of HPC software development over the past 30 years, this dissertation identifies the key elements and trends responsible for the emergence of coupled, distributed, in situ workflows. This dissertation's focus is on coupled in situ workflows involving composable, high-performance microservices. After outlining the motivation to enable performance observability of these services and why existing HPC performance tools and techniques cannot be applied in this context, this dissertation proposes a solution wherein a set of techniques gathers, analyzes, and orients performance data from different sources to generate observability. By leveraging microservice components initially designed to build high performance data services, this dissertation demonstrates their broader applicability for building and deploying performance monitoring and visualization as services within an in situ workflow. 
The results from this dissertation suggest that: (1) integration of performance data from different sources is vital to understanding the performance of service components, (2) the in situ (online) analysis of this performance data is needed to enable the adaptivity of distributed components and manage monitoring data volume, (3) statistical modeling combined with performance observations can help generate better service configurations, and (4) services are a promising architecture choice for deploying in situ performance monitoring and visualization functionality. This dissertation includes previously published and co-authored material and unpublished co-authored material

    Rocmeu: orientação ao recurso na modelação de aplicações paralelas e exploração cooperativa de clusters multi-SAN

    The development of parallel solutions for computationally demanding problems has been limited to the exploitation of specific computer systems and to the use of abstractions closely tied to the architecture of these systems. These limitations are a strong obstacle to the use of heterogeneous clusters, that is, clusters that integrate multiple interconnection technologies, when adequate answers are needed in terms of both productivity and performance. This work presents resource orientation as a new approach to parallel programming, unifying in the resource concept the logical entities spread across cluster nodes by running applications and the physical resources that represent the computation and communication power of the target architecture. The paradigm introduces new abstractions for (i) the communication among logical resources and (ii) the manipulation of physical resources from applications. The former give the programmer a more convenient interface without compromising the intrinsic performance of modern SAN communication technologies. The latter allow the programmer to explicitly establish an effective mapping between logical entities and physical resources, in order to exploit the different levels of locality found in the hierarchy of resources that results from using distinct SAN technologies and multiple SMP nodes. The proposed paradigm corresponds to a programming methodology materialized in the Meu platform, which aims to integrate the design/development of parallel applications and the process of selecting/allocating physical resources at execution time in multi-application, multi-user environments. The basis for this platform is RoCl, another platform, developed to offer a single system image. The first layer of the resulting architecture, which corresponds to RoCl, guarantees the connectivity among logical resources instantiated at different cluster nodes, while the second, corresponding to Meu, allows these logical resources to be organized and manipulated, starting from an initial administrative specification of the available physical resources. In the context of parallel/distributed programming, Meu integrates adaptations and extensions of the shared memory, message passing and global memory programming paradigms. Basic capabilities for the manipulation of physical resources, along with facilities for the creation and discovery of entities that support interoperability and cooperation between applications, are also available.