196 research outputs found

    Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms

    The study of parallel and distributed applications and platforms, whether in the cluster, grid, peer-to-peer, volunteer, or cloud computing domain, often mandates empirical evaluation of proposed algorithmic and system solutions via simulation. Unlike direct experimentation via application deployment on a real-world testbed, simulation enables fully repeatable and configurable experiments for arbitrary hypothetical scenarios. Two key concerns are accuracy (so that simulation results are scientifically sound) and scalability (so that simulation experiments can be fast and memory-efficient). While the scalability of a simulator is easily measured, the accuracy of many state-of-the-art simulators is largely unknown because they have not been sufficiently validated. In this work we describe recent accuracy and scalability advances made in the context of the SimGrid simulation framework. A design goal of SimGrid is that it should be versatile, i.e., applicable across all the aforementioned domains. We present quantitative results showing that SimGrid compares favorably to state-of-the-art domain-specific simulators in terms of scalability, accuracy, or the trade-off between the two. An important implication is that, contrary to popular wisdom, striving for versatility in a simulator is not an impediment but is instead conducive to improving both accuracy and scalability.
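    As a rough illustration of the flow-level modeling approach that such simulators use to stay fast while remaining accurate (this is a generic sketch, not SimGrid's actual API or model), network transfers can be simulated analytically: flows sharing a link split its bandwidth fairly, and the simulator jumps directly to the next flow-completion event instead of simulating packets.

```python
# Hypothetical sketch of flow-level network simulation: N flows share one
# link, bandwidth is split equally among active flows, and time advances
# event by event to each flow completion. Illustration only, not SimGrid.

def simulate_flows(sizes, bandwidth):
    """Return the finish time of each flow (same order as `sizes`)."""
    remaining = list(sizes)
    finish = [0.0] * len(sizes)
    active = set(range(len(sizes)))
    now = 0.0
    while active:
        share = bandwidth / len(active)                 # fair share per flow
        step = min(remaining[i] / share for i in active)  # next completion
        now += step
        for i in list(active):
            remaining[i] -= share * step
            if remaining[i] <= 1e-9:                    # flow is done
                finish[i] = now
                active.discard(i)
    return finish
```

    For example, two 100 MB flows on a 10 MB/s link both finish at t = 20 s, while a 50 MB and a 100 MB flow finish at t = 10 s and t = 15 s: the smaller flow completes first, after which the survivor gets the full bandwidth.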

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause, as well as the critical-path share of each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their cause, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization.
    The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
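    The core of critical-path detection can be sketched as a longest-path computation over a task dependency graph extracted from the trace (the function and task names below are illustrative, not the thesis's actual tool interface):

```python
# Minimal sketch of critical-path detection on a task dependency graph.
# durations: {task: seconds}; deps: {task: [prerequisite tasks]}.
# Hypothetical example code, not the framework described in the thesis.

def critical_path(durations, deps):
    """Return (total_time, path) for the longest dependency chain."""
    best = {}  # task -> (finish_time, predecessor on the critical path)

    def finish(task):
        if task not in best:
            preds = deps.get(task, [])
            start, via = max(((finish(p)[0], p) for p in preds),
                             default=(0.0, None))
            best[task] = (start + durations[task], via)
        return best[task]

    total, _ = max((finish(t) for t in durations), key=lambda x: x[0])
    # walk predecessor links back from the task that finishes last
    last = max(durations, key=lambda t: best[t][0])
    path = []
    while last is not None:
        path.append(last)
        last = best[last][1]
    return total, list(reversed(path))
```

    For a toy trace where "init" (2 s) feeds "compA" (5 s) and "compB" (3 s), both of which feed "reduce" (1 s), the critical path is init → compA → reduce with a length of 8 s; compB's 3 s lie entirely off the critical path and show up as waiting time before the reduction.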

    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis, and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These systems are composed of various complex software and hardware components, and when a performance or correctness problem occurs, developers struggle to understand its root cause and fix it in a timely manner. In my thesis, I address these three aspects of performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers, and we developed techniques to localize the problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running on supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message-passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which first creates a logical progress-dependency graph of the tasks to highlight how the problem spread through the system, manifesting as a system-wide performance issue; second, uses this graph to identify the task where the problem originated; and finally pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool chain that detects performance anomalies using machine-learning techniques while achieving a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data-collection framework that gathers performance-related metrics from code regions at different granularities, an offline model-creation and prediction-error characterization technique, and a threshold-based anomaly-detection engine for production runs.
    Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly-detection threshold according to the characteristics of the input data and of the models' prediction error. Third, we developed a performance-problem mitigation scheme for erasure-coded distributed storage systems. Repairing failed blocks in an erasure-coded distributed storage system takes a very long time in network-constrained data centers: during repair, a large amount of data from multiple nodes is gathered at a single node, where a mathematical operation reconstructs the missing part. This process severely congests the links toward the destination where the newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks, thereby greatly speeding up the repair process. Fourth, we study how, for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation in the later phases has very little impact on the final quality of the result. Based on this observation, we developed an optimization framework that, for a given quality-loss budget, finds the best approximation settings for each phase of the execution.
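    The intuition behind PPR can be sketched with XOR parity as a stand-in for the erasure code's actual linear combination (real deployments use Reed-Solomon arithmetic, and all names here are illustrative): conventional repair funnels every surviving block into one node, while PPR combines partial results pairwise along a reduction tree, so no single link carries all the traffic.

```python
# Hedged sketch of Partial-Parallel-Repair (PPR) vs. conventional repair,
# using XOR parity as a stand-in for the erasure code's math.
from functools import reduce

def star_repair(blocks):
    """Conventional repair: every surviving block travels to one node,
    so the link into that node carries len(blocks) transfers."""
    transfers_into_repair_node = len(blocks)
    lost = reduce(lambda a, b: a ^ b, blocks)
    return lost, transfers_into_repair_node

def ppr_repair(blocks):
    """PPR-style repair: nodes combine partial results pairwise in a
    reduction tree, so repair takes ~log2(n) rounds of one transfer
    each per link instead of n transfers into a single link."""
    rounds = 0
    layer = list(blocks)
    while len(layer) > 1:
        layer = [layer[i] ^ layer[i + 1] if i + 1 < len(layer) else layer[i]
                 for i in range(0, len(layer), 2)]
        rounds += 1
    return layer[0], rounds
```

    With four surviving blocks, both schemes recover the same lost block, but the star scheme pushes four blocks over one congested link while the tree scheme finishes in two parallel rounds.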

    Impact of network interconnection in cloud computing environments for high-performance computing applications

    The availability of computational resources has changed significantly due to the use of the cloud computing paradigm. Aiming at potential advantages, such as cost savings through the pay-per-use method and scalable/elastic resource allocation, we have witnessed efforts to execute high-performance computing (HPC) applications in the cloud. Due to the distributed nature of these environments, performance is highly dependent on two primary components of the system: processing power and network interconnection. While allocating more powerful hardware theoretically increases performance, it also increases the allocation cost. Allocation exclusivity guarantees space for memory, storage, and CPU. This is not the case for the network interconnection, since several simultaneous instances (multi-tenants) share the same communication channel, making the network a bottleneck. Therefore, this dissertation aims to analyze the impact of network interconnection on the execution of workloads from the HPC domain. We carried out two different assessments. The first concentrates on different network interconnections (GbE and InfiniBand) in the Microsoft Azure public cloud and the costs related to their use. The second focuses on different network configurations using NIC aggregation methodologies in a controlled private-cloud environment. The results showed that network interconnection is a crucial aspect and can significantly impact the performance of HPC applications executed in the cloud. In the Azure public cloud, the accelerated-networking approach, which allows an instance to have a high-performance interconnection without additional charges, enables significant performance improvements for HPC applications with better cost efficiency. Finally, in the private-cloud environment, the NIC aggregation approach outperformed the baseline in up to ≈98% of the executions with applications that make intensive use of the network. Also, the Balance Round-Robin aggregation mode performed better than the 802.3ad aggregation mode in the majority of the executions.
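    The difference between the two aggregation modes can be sketched as follows (this is an illustrative model, not Linux bonding-driver code): 802.3ad hashes each flow onto a single NIC, so one heavy flow never exceeds one NIC's bandwidth, whereas Balance Round-Robin rotates successive packets of the same flow across all NICs.

```python
# Illustrative sketch of two NIC-aggregation policies. Hypothetical
# helper names; not the Linux bonding driver's implementation.
import zlib

def ad_802_3_nic(flow_id, n_nics):
    """802.3ad-style: a flow is pinned to one NIC by a hash of its ID,
    so a single flow can use at most one NIC's bandwidth."""
    return zlib.crc32(flow_id.encode()) % n_nics

def balance_rr_schedule(n_packets, n_nics):
    """balance-rr: packets of any flow rotate over all NICs, so even a
    single flow can aggregate the bandwidth of every NIC."""
    return [p % n_nics for p in range(n_packets)]

def nic_utilization(schedule, n_nics):
    """Count how many packets each NIC carries under a schedule."""
    counts = [0] * n_nics
    for nic in schedule:
        counts[nic] += 1
    return counts
```

    Under round-robin, eight packets over two NICs land four on each, which is why network-intensive applications benefited; the trade-off (not modeled here) is possible packet reordering, which 802.3ad's flow pinning avoids.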

    Message Passing with Communication Structures

    Institute for Computing Systems Architecture
    Abstraction concepts based on process groups have largely dominated the design and implementation of communication patterns in message passing systems. Although such an approach seems pragmatic—given that participating processes form a ‘group’—in this dissertation, we discuss subtle issues that affect the qualitative and quantitative aspects of this approach. To address these issues, we introduce the concept of a ‘communication structure,’ which defines a communication pattern as an implicit runtime composition of localised patterns, known as ‘roles.’ During application development, communication structures are derived from the algorithm being implemented. These are then translated to an executable form by defining process-specific data structures, known as ‘branching channels.’ The qualitative advantages of the communication structure approach are that the resulting programming model is unambiguous, uniform, expressive, and extensible. To use a pattern is to access the corresponding branching channels; to define a new pattern is simply to combine appropriate roles. The communication structure approach therefore allows immediate implementation of ad hoc patterns. Furthermore, it is guaranteed that every newly added role interfaces correctly with all of the existing roles, therefore scaling the benefit of every new addition. Quantitatively, branching channels improve performance by automatically overlapping computations and communications. The runtime system uses a receiver-initiated communication protocol that allows senders to continue immediately without waiting for the receivers to respond. The advantage is that, unlike split-phase asynchronous communications, senders need not check whether the send operations were successful. Another property of branching channels is that they allow communications to be grouped, identified, and referenced.
    Communication structure specific parameters, such as message buffering, can therefore be specified immediately. Furthermore, a ‘commit’-based interface optimisation for send-and-forget type communications—where senders do not reuse sent data—is presented. This uses the referencing property of branching channels, allowing message buffering without incurring performance degradation due to intermediate memory copy.
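    The send-and-forget idea behind the ‘commit’ interface can be sketched in a few lines (names and structure are illustrative, not the dissertation's actual interface): the sender deposits a message reference and returns immediately, and the receiver initiates the actual transfer when it is ready.

```python
# Hedged sketch of a 'commit'-based send-and-forget channel. The class
# and method names are hypothetical illustrations, not the dissertation's
# branching-channel implementation.
import queue

class BranchingChannel:
    def __init__(self):
        self._buf = queue.Queue()   # buffered message references

    def commit(self, message):
        """Sender side: buffer a reference and return immediately,
        without waiting for the receiver or checking for success."""
        self._buf.put(message)

    def receive(self):
        """Receiver side: initiate the transfer only when ready, pulling
        the next committed message in order."""
        return self._buf.get()
```

    Because the sender never reuses committed data, the runtime can buffer by reference instead of copying, which is the property the optimisation exploits.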

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

    Book of Abstracts of CSC14, edited by Bora Uçar. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France, on July 21-23, 2014. This two-and-a-half-day event marked the sixth in a series that started ten years earlier in San Francisco, USA. The CSC14 workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks, and eight poster presentations. All three invited talks focused on two interesting fields of research: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program ''Investissements d'Avenir'' ANR-11-IDEX-0007 operated by the French National Research Agency) and by SIAM.
