
    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only at the inter-node level, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the most widely adopted programming model in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication: first, there is a gap between logical collective topologies and the underlying hardware topologies; second, current MPI implementations lack efficient shared-memory message delivery approaches; last, on distributed memory machines such as multicore clusters, a single approach cannot encompass the extreme variations not only in bandwidth and latency, but also in features such as the ability to perform multiple concurrent copies simultaneously. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy that maps onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader (non-root) processes to evenly distribute copy workloads among the available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide variety of platforms, but also deliver a level of performance immune to modifications of the underlying process-core binding.
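    A minimal sketch of the multi-level idea described above, using mpi4py: processes are first grouped by shared-memory domain, one leader per node joins an inter-node communicator, and an allreduce is composed from intra-node and inter-node steps. This only illustrates hierarchy-aware composition; it does not reproduce the dissertation's distance-aware framework or its KNEM-based single-copy delivery.

```python
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
# Group ranks that share a node (shared-memory domain).
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=world.rank)
# One leader per node joins the inter-node communicator.
is_leader = (node_comm.rank == 0)
leader_comm = world.Split(0 if is_leader else MPI.UNDEFINED, key=world.rank)

def hierarchical_allreduce(sendbuf):
    """Two-level allreduce: intra-node reduce, inter-node allreduce, intra-node bcast."""
    local = np.empty_like(sendbuf)
    node_comm.Reduce(sendbuf, local, op=MPI.SUM, root=0)        # step 1: within the node
    if is_leader:
        leader_comm.Allreduce(MPI.IN_PLACE, local, op=MPI.SUM)  # step 2: across nodes
    node_comm.Bcast(local, root=0)                              # step 3: back within the node
    return local

data = np.ones(4) * world.rank
print(world.rank, hierarchical_allreduce(data))
```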

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to make progress when subjected to faults and is an integral component of large-scale systems, but it incurs significant time and energy overhead. Future exascale systems are expected to have higher power consumption and higher fault rates. Sparse data computation is a fundamental kernel in many scientific applications, and its computational characteristics make it well suited to studies of scalability and resilience on heterogeneous systems. To deliver the promised performance within a given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems because of the heterogeneous compute capability, power consumption, and varying failure rates of CPUs and GPUs. Scalability and resilience have traditionally been studied in isolation, and optimizing one typically impacts the other detrimentally. While prior works have proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address these research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and we develop and prototype performance-optimization and power-management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and that forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes and allows them to run on heterogeneous resources with reduced power. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that uses an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems, and we show that our method works well on heterogeneous systems. Our results show that our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform state-of-the-art distributed implementations. We demonstrate that understanding the different bottlenecks for various types of tensors plays a critical role in improving scalability.
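    The dissertation's analytical model is not spelled out in this abstract, so the following is only an illustrative sketch of how the time and energy of two recovery schemes (checkpoint/restart versus forward recovery) can be compared as a function of fault rate and power; all formulas and parameter values are assumptions made for the example.

```python
def checkpoint_restart_time(work, ckpt_interval, ckpt_cost, restart_cost, mtbf):
    """First-order model: periodic checkpoints plus expected rework after each failure."""
    n_ckpts = work / ckpt_interval
    expected_failures = (work + n_ckpts * ckpt_cost) / mtbf
    rework = expected_failures * (restart_cost + ckpt_interval / 2.0)
    return work + n_ckpts * ckpt_cost + rework

def forward_recovery_time(work, correction_cost, mtbf):
    """Forward recovery: no checkpoints, a bounded correction step per fault."""
    expected_failures = work / mtbf
    return work + expected_failures * correction_cost

def cheaper_scheme(work, mtbf, power):
    t_cr = checkpoint_restart_time(work, ckpt_interval=600, ckpt_cost=30,
                                   restart_cost=60, mtbf=mtbf)
    t_fr = forward_recovery_time(work, correction_cost=45, mtbf=mtbf)
    # Energy is modelled simply as power * time; which scheme wins depends on the fault rate.
    return ("forward", t_fr * power) if t_fr < t_cr else ("checkpoint/restart", t_cr * power)

# One hour of work, a fault every 30 minutes, a 200 W node (all illustrative numbers).
print(cheaper_scheme(work=3600.0, mtbf=1800.0, power=200.0))
```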

    A job response time prediction method for production Grid computing environments

    A major obstacle to the widespread adoption of Grid computing in both the scientific community and the industry sector is the difficulty of knowing, in advance of submission, the running cost of a job, which could be used to plan a correct allocation of resources. Traditional distributed computing solutions take advantage of homogeneous and open environments to propose prediction methods that use a detailed analysis of the hardware and software components. However, production Grid computing environments, which are large and use a complex and dynamic set of resources, present a different challenge. In Grid computing, the source code of applications, program libraries, and third-party software are not always available. In addition, Grid security policies may not permit running hardware or software analysis tools to generate models of Grid components. The objective of this research is the prediction of job response time in production Grid computing environments. The solution is inspired by the concept of predicting future Grid behaviour from previous experience learned from heterogeneous Grid workload trace data. The research objective was selected with the aim of improving Grid resource usability and the administration of Grid environments. The predicted data can be used to allocate resources in advance and to report the forecasted finishing time and running cost before submission. The proposed Grid Computing Response Time Prediction (GRTP) method implements several internal stages in which the workload traces are mined to produce a response time prediction for a given job. In addition, the GRTP method assesses the predicted result against the target job's actual response time to infer information that is used to tune the method's setting parameters. The GRTP method was implemented and tested using a cross-validation technique to assess how the proposed solution generalises to independent data sets. The training set was taken from the Grid environment DAS (Distributed ASCI Supercomputer). The two testing sets were taken from the AuverGrid and Grid5000 Grid environments. Three consecutive tests, assuming stable jobs, unstable jobs, and using a job-type method to select the most appropriate prediction function, were carried out. The tests showed a significant increase in prediction performance for data-mining-based methods applied in Grid computing environments. For instance, in Grid5000 the GRTP method answered 77 percent of job prediction requests with an error of less than 10 percent, while, in the same environment, the most effective and accurate existing method using workload traces was only able to predict 32 percent of the cases within the same range of error. The GRTP method was able to handle unexpected changes in resources and services that affect job response time trends and was able to adapt to new scenarios. The tests showed that the proposed GRTP method is capable of answering job response time prediction requests and that it also improves prediction quality when compared to other current solutions.
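    The internal stages of GRTP are not detailed in this abstract; the sketch below shows one plausible trace-based predictor in the same spirit: a nearest-neighbour estimate over historical workload-trace features, plus a crude feedback step that tunes a setting parameter from the observed prediction error. Feature names, thresholds, and the tuning rule are illustrative assumptions, not the GRTP method itself.

```python
import numpy as np

class TraceBasedPredictor:
    """Illustrative nearest-neighbour response-time predictor over workload traces."""
    def __init__(self, k=5):
        self.k = k
        self.features = None   # e.g. requested CPUs, queue id, submit hour (assumed features)
        self.runtimes = None

    def fit(self, features, runtimes):
        self.features = np.asarray(features, dtype=float)
        self.runtimes = np.asarray(runtimes, dtype=float)

    def predict(self, job):
        d = np.linalg.norm(self.features - np.asarray(job, dtype=float), axis=1)
        nearest = np.argsort(d)[: self.k]
        return float(np.median(self.runtimes[nearest]))

    def tune(self, job, actual):
        """Feedback step: widen or narrow the neighbourhood from the observed error."""
        err = abs(self.predict(job) - actual) / max(actual, 1e-9)
        self.k = min(self.k + 1, 50) if err > 0.10 else max(self.k - 1, 1)

# Toy usage with made-up trace rows: [cpus, queue, submit_hour] -> runtime (s)
p = TraceBasedPredictor()
p.fit([[8, 1, 9], [8, 1, 10], [64, 2, 3]], [1200, 1350, 9800])
print(p.predict([8, 1, 11]))
```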

    Software Approaches to Manage Resource Tradeoffs of Power and Energy Constrained Applications

    Power and energy efficiency have become increasingly important design metrics for a wide spectrum of computing devices. Battery efficiency, which requires a mixture of energy and power efficiency, is exceedingly important, especially since there have been no groundbreaking advances in battery capacity recently. The need for energy and power efficiency stretches from small embedded devices to portable computers to large-scale data centers. The projected future of computing, referred to as exascale computing, demands that researchers find ways to perform exaFLOPs of computation within a power bound much lower than would be required by simply scaling today's standards. There is a large body of work on power and energy efficiency for a wide range of applications and at different levels of abstraction. However, there is a lack of work studying the nuances of the different tradeoffs that arise when operating under a power/energy budget. Moreover, there is no work on constructing a generalized model of applications running under power/energy constraints that allows the designer to optimize their resource consumption, be it power, energy, time, bandwidth, or space. There is a need for an efficient model that can provide bounds on the optimality of an application's resource consumption, becoming a basis against which online resource management heuristics can be measured. In this thesis, we tackle the problem of managing resource tradeoffs of power/energy constrained applications. We begin by studying the nuances of power/energy tradeoffs with the response time and throughput of stream processing applications. We then study the power-performance tradeoff of batch processing applications to identify a power configuration that maximizes performance under a power bound. Next, we study the tradeoff of power/energy with network bandwidth and precision. Finally, we study how to combine tradeoffs into a generalized model of applications running under resource constraints. The work in this thesis presents detailed studies of the power/energy tradeoff with response time, throughput, performance, network bandwidth, and precision of stream and batch processing applications. To that end, we present an adaptive algorithm that manages stream processing tradeoffs of response time and throughput at the CPU level. At the task level, we present an online heuristic that adaptively distributes a bounded power budget across a cluster to improve performance, as well as an offline approach to optimally bound performance. We demonstrate how power can be used to reduce bandwidth bottlenecks and extend our offline approach to model bandwidth tradeoffs. Moreover, we present a tool that identifies parts of a program that can be downgraded in precision with minimal impact on accuracy and maximal impact on energy consumption. Finally, we combine all the above tradeoffs into a flexible model that is efficient to solve and allows for bounding and/or optimizing the consumption of different resources.
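    As an illustration of the kind of online heuristic mentioned above, the sketch below greedily shifts a fixed cluster-wide power budget toward the node with the best measured throughput per watt. It is a hypothetical example under assumed parameter names, not the thesis's actual algorithm.

```python
def rebalance_power(caps, throughput, budget, step=5.0, floor=50.0):
    """Shift a slice of the power budget from the least to the most efficient node."""
    assert abs(sum(caps) - budget) < 1e-6          # the budget is conserved
    eff = [t / c for t, c in zip(throughput, caps)]  # throughput per watt
    donor = eff.index(min(eff))
    receiver = eff.index(max(eff))
    if donor != receiver and caps[donor] - step >= floor:
        caps[donor] -= step
        caps[receiver] += step
    return caps

# One rebalancing step for a 4-node cluster under a 400 W bound.
caps = [100.0, 100.0, 100.0, 100.0]
throughput = [50.0, 80.0, 65.0, 90.0]   # e.g. tasks/s measured in the last interval
print(rebalance_power(caps, throughput, budget=400.0))
```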

    Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

    The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-to-learn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally suffer from inefficient communication middleware, most of which uses socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several low-level message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a production-quality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, significantly improving the scalability of MPJ applications. Furthermore, this Thesis has analyzed the potential of cloud computing towards spreading the outreach of HPC, where Infrastructure as a Service (IaaS) offerings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which specifically target HPC workloads, have been thoroughly assessed. The experimental results have shown the significant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of proper virtualization support, such as direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis.
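    FastMPJ itself is a Java middleware; purely as a language-neutral illustration of how the latency of such communication primitives is typically evaluated, the sketch below runs a standard ping-pong microbenchmark between two ranks (here with mpi4py) and reports an estimated one-way latency. It is not part of the FastMPJ codebase.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
buf = np.zeros(1, dtype='b')   # 1-byte message to expose latency rather than bandwidth
reps = 10000                   # run with at least 2 processes; extra ranks stay idle

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
comm.Barrier()

if rank == 0:
    # Half the average round-trip time approximates the one-way latency.
    print("latency:", (MPI.Wtime() - t0) / (2 * reps) * 1e6, "us")
```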

    New Sequential and Scalable Parallel Algorithms for Incomplete Factor Preconditioning

    The solution of large, sparse, linear systems of equations Ax = b is an important kernel, and the dominant term with regard to execution time, in many applications in scientific computing. The large size of the systems of equations being solved currently (millions of unknowns and equations) requires iterative solvers on parallel computers. Preconditioning, which is the process of translating a linear system into a related system that is easier to solve, is widely used to reduce solution time and is sometimes required to ensure convergence. Level-based preconditioning (ILU(ℓ)) has long been used in serial contexts and is widely recognized as robust and effective for a wide range of problems. However, the method has long been regarded as an inherently sequential technique; parallelism, it has been thought, can be achieved primarily at the expense of increased iterations. We dispute these claims. The first half of this dissertation takes an in-depth look at structurally based ILU(ℓ) symbolic factorization. Two definitions of fill level, "sum" and "max," have been proposed. Hitherto, these definitions have been cast in terms of matrix terminology. We develop a sequence of lemmas and theorems that provide graph-theoretic characterizations of both definitions; these characterizations are based on the static graph of a matrix, G(A). Our Incomplete Fill Path Theorem characterizes fill levels per the sum definition, which is the definition used in most library implementations of the "classic" ILU(ℓ) factorization algorithm. Our theorem leads to several new graph-search algorithms that compute factors identical, or nearly identical, to those computed by the "classic" algorithm. Our analyses show that the new algorithms have lower run-time complexity than the previously existing algorithms for certain classes of matrices that are commonly encountered in scientific applications. The second half of this dissertation presents a Parallel ILU algorithmic framework (PILU). This framework enables scalable parallel ILU preconditioning by combining concepts from domain decomposition and graph ordering. The framework can accommodate ILU(ℓ) factorization as well as threshold-based ILUT methods. A model implementation of the framework, the Euclid library, was developed as part of this dissertation. This library was used to obtain experimental results for Poisson's equation, the convection-diffusion equation, and a nonlinear radiative transfer problem. The experiments, which were conducted on a variety of platforms with up to 400 CPUs, demonstrate that our approach is highly scalable for arbitrary ILU(ℓ) fill levels.
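    Euclid and the ILU(ℓ) algorithms described above are parallel library codes; purely as an illustration of why incomplete-factor preconditioning pays off, the sketch below solves a 2-D Poisson system with and without SciPy's incomplete LU (which is threshold/fill controlled, closer to ILUT than to level-based ILU(ℓ)) and counts solver iterations.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 2-D Poisson matrix (5-point stencil) as a stand-in for a typical sparse system Ax = b.
n = 64
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsc()
b = np.ones(A.shape[0])

# SciPy's incomplete LU is drop-tolerance/fill controlled, not level-based ILU(l).
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, ilu.solve)

counts = {"plain": 0, "ilu": 0}
def counter(key):
    def cb(xk):
        counts[key] += 1
    return cb

spla.bicgstab(A, b, callback=counter("plain"))
spla.bicgstab(A, b, M=M, callback=counter("ilu"))
print(counts)   # the ILU-preconditioned solve should need far fewer iterations
```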

    Conjugate heat transfer coupling relying on large eddy simulation with complex geometries in massively parallel environments

    Progress in scientific computing has led to major advances in the simulation and understanding of the different physical phenomena that exist in industrial gas turbines. However, most of these advances have focused on solving one problem at a time: the combustion problem is solved independently from the thermal or radiation problems, and so on. In reality all these problems interact; one speaks of coupled problems. Performing coupled computations can therefore improve the quality of simulations and provide gas turbine engineers with new design tools. Recently, solutions have been developed to handle multiple physics simultaneously using generic solvers. However, due to their genericity, these solutions prove ineffective on expensive problems such as Large Eddy Simulation (LES). Another solution is to perform code coupling: specialized codes, one for each problem, are connected together and exchange data periodically. In this thesis a conjugate heat transfer problem is considered: a fluid domain solved by a combustion LES solver is coupled with a solid domain in which the conduction problem is solved. Implementing this coupled problem raises multiple issues which are addressed in this thesis. First, the specific problem of coupling an LES solver to a conduction solver is considered: the impact of the inter-solver exchange frequency on convergence, possible temporal aliasing, and the stability of the coupled system is studied. Then, interpolation and geometrical issues are addressed: a conservative interpolation method is developed and compared to other methods. These methods are then applied to an industrial configuration, highlighting the problems and solutions specific to complex geometry. Finally, high performance computing (HPC) is considered: an efficient method to perform data exchange and interpolation between parallel codes is developed. This work has been applied to an aeronautical combustion chamber configuration.
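    A toy sketch of the partitioned coupling loop described above: a stand-in "fluid" model and an explicit 1-D conduction solver advance independently and exchange wall temperature and heat flux every few steps, the inter-solver exchange frequency being the coupling parameter studied in the thesis. The physical models and constants below are placeholders, not the actual LES and conduction solvers.

```python
import numpy as np

def advance_fluid(T_wall, steps, T_gas=1500.0, h=200.0):
    """Stand-in for the LES side: returns the wall heat flux seen by the solid (W/m^2)."""
    # Convective flux toward the wall from a fixed gas temperature (toy model).
    return h * (T_gas - T_wall)

def advance_solid(T, q_wall, steps, dt=1e-4, alpha=1e-5, dx=1e-3, k=15.0):
    """Explicit 1-D conduction in the solid with the coupled flux as boundary condition."""
    for _ in range(steps):
        T_new = T.copy()
        T_new[1:-1] += alpha * dt / dx**2 * (T[2:] - 2 * T[1:-1] + T[:-2])
        T_new[0] = T[1] + q_wall * dx / k        # flux boundary on the coupled face
        T = T_new
    return T

T_solid = np.full(50, 300.0)
exchange_every = 20           # inter-solver exchange frequency (coupling period)
for cycle in range(100):
    q = advance_fluid(T_wall=T_solid[0], steps=exchange_every)
    T_solid = advance_solid(T_solid, q_wall=q, steps=exchange_every)
print("coupled wall temperature:", T_solid[0])
```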

    On Distributed Verification and Verified Distribution

    Fokkink, W.J. [Promotor]; Pol, J.C. van de [Copromotor]

    Modelling and numerical simulation of combustion and multi-phase flows using finite volume methods on unstructured meshes

    The present thesis is devoted to the development and implementation of mathematical models and numerical methods for carrying out computational simulations of complex heat and mass transfer phenomena. Several areas and topics in the field of Computational Fluid Dynamics (CFD) have been treated and covered during the development of this thesis, especially combustion and dispersed multi-phase flows. These types of simulations require the implementation and coupling of different physics. The numerical simulation of multiphysics phenomena is challenging due to the wide range of spatial and temporal scales that can characterize each of the physics involved in the problem. Moreover, when solving turbulent flows, turbulence itself is a very complex physical phenomenon that can demand a huge computational effort. Hence, in order to make turbulent flow simulations computationally affordable, the turbulence must be modelled. Therefore, throughout this thesis different numerical methods and algorithms have been developed and implemented with the aim of performing multiphysics simulations of turbulent flows. The first topic addressed is turbulent combustion. Chapter 2 presents a combustion model able to notably reduce the computational cost of the simulation. The model, namely the Progress-Variable (PV) model, relies on a separation of the spatio-temporal scales between the flow and the chemistry. Moreover, in order to account for the influence of the sub-grid species concentration and energy fluctuations, the PV model is coupled to the Presumed Conditional Moment (PCM) model. Chapter 2 also presents the development of a smart load-balancing method for the evaluation of chemical reaction rates in parallel combustion simulations. Chapter 3 is devoted to dispersed multiphase flows. These flows are composed of a continuous phase and a dispersed phase in the form of unconnected particles or droplets. In this thesis, the Eulerian-Lagrangian approach has been selected. This type of model is best suited for dispersed multiphase flows with thousands or millions of particles and with flow regimes ranging from very dilute up to relatively dense. In Chapter 4, a new method capable of performing parallel numerical simulations using non-overlapping disconnected mesh domains with adjacent boundaries is presented. The presented algorithm stitches the independent meshes together at each iteration and solves them as a single domain. Finally, Chapter 5 addresses an aspect transversal to the topics covered throughout the thesis: a self-adaptive strategy for the maximisation of the time step in the numerical solution of convection-diffusion equations. The method is capable of dynamically determining, at each iteration, the maximum allowable time step that ensures a stable time integration. Moreover, the method also smartly modifies the temporal integration scheme in order to maximize its stability region depending on the properties of the system matrix.
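    A small sketch of the self-adaptive time-step idea from Chapter 5, under the assumption of an explicit integration of dC/dt = AC: the spectral radius of the system matrix is bounded cheaply with Gershgorin circles and the step is chosen so that it stays inside the scheme's stability region. The actual method also adapts the integration scheme itself, which this sketch omits; the stability limit and the toy operator below are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def max_stable_dt(A, stability_limit=2.0):
    """Largest dt such that dt * rho(A) stays inside the explicit scheme's stability region.

    rho(A) is bounded with Gershgorin circles, so the estimate is cheap and conservative
    (it can only under-estimate the allowable step). The limit of 2 corresponds to
    forward Euler on the negative real axis; other schemes have other limits.
    """
    A = A.tocsr()
    diag = np.abs(A.diagonal())
    row_abs_sum = np.asarray(np.abs(A).sum(axis=1)).ravel()
    rho_bound = row_abs_sum.max()          # Gershgorin bound on the spectral radius
    return stability_limit / rho_bound

# Toy 1-D convection-diffusion operator: -u * dC/dx + D * d2C/dx2, upwind convection.
n, dx, u, D = 200, 0.01, 1.0, 1e-3
conv = u / dx * sp.diags([1.0, -1.0], [-1, 0], shape=(n, n))
diff = D / dx**2 * sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
A = conv + diff

dt = max_stable_dt(A)                      # re-evaluated each iteration as A changes
print("maximum stable explicit time step:", dt)
```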