860 research outputs found

    MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

    Full text link
    Along with the fast evolution of deep neural networks, the hardware system is also developing rapidly. As a promising solution achieving high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers, cloud platforms, and SoCs. Thus, a challenging problem arises in multi-accelerator systems: selecting a proper combination of accelerators from available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that can perform computation-aware accelerator selection, and apply communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS can achieve 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.Comment: Accepted by 60th DA

    Energy Demand Response for High-Performance Computing Systems

    Get PDF
    The growing computational demand of scientific applications has greatly motivated the development of large-scale high-performance computing (HPC) systems in the past decade. To accommodate the increasing demand of applications, HPC systems have been going through dramatic architectural changes (e.g., introduction of many-core and multi-core systems, rapid growth of complex interconnection network for efficient communication between thousands of nodes), as well as significant increase in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and size, the energy consumption by these systems has increased significantly. With the advent of exascale supercomputers in the next few years, power consumption of the HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help the energy service providers to stabilize the power system by reducing the energy consumption of participating systems during the time periods of high demand power usage or temporary shortage in power supply. This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC system\u27s demand response participation. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnected topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models provide good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance. In the second part, we present an energy-efficient demand-response model to reduce HPC systems\u27 energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes to enable HPC system\u27s emergency demand response participation. In the final part, we propose an economic demand-response model to allow both HPC operator and HPC users to jointly reduce HPC system\u27s energy cost. Our proposed model allows the participation of HPC systems in economic demand-response programs through a contract-based rewarding scheme that can incentivize HPC users to participate in demand response

    Programming model abstractions for optimizing I/O intensive applications

    Get PDF
    This thesis contributes from the perspective of task-based programming models to the efforts of optimizing I/O intensive applications. Throughout this thesis, we propose programming model abstractions and mechanisms that target a twofold objective: from the one hand, improve the I/O and total performance of applications on nowadays complex storage infrastructures. From the other hand, achieve such performance improvement without increasing the complexity of applications programming. The following paragraphs briefly summarize each of our contributions. First, towards exploiting compute-I/O patterns of I/O intensive applications and transparently improving I/O and total performance, we propose a number of abstractions that we refer to as I/O Awareness abstractions. An I/O aware task-based programming model is able to separate the handling of I/O and computations by supporting I/O Tasks. The execution of such tasks can overlap with compute tasks execution. Moreover, we provide programming model support to improve I/O performance by addressing the issue of I/O congestion. This is achieved by using Storage Bandwidth Constraints to control the level of task parallelism. We support two types of such constraints: (i) Static storage bandwidth constraints that are manually set by application developers. (ii) Auto-tunable constraints that are automatically set and tuned throughout the execution of application. Second, in order to exploit the heterogeneity of modern storage systems to improve performance in a transparent manner, we propose a set of capabilities that we refer to as Storage heterogeneity Awareness. A storage-heterogeneity aware task-based programming model builds on the concepts and abstractions that are introduced in the first contribution to improve the I/O performance of applications on heterogeneous storage systems. More specifically, such programming models support the following features: (i) abstracting the heterogeneity of the storage devices and exposing them as one hierarchical storage resource. (ii) supporting dedicated I/O scheduling. (iii) Finally, we introduce a mechanism that automatically and periodically flushes obsolete data from higher storage layers to lower storage layers. Third, targeting increasing parallelism levels of applications, we propose a Hybrid Programming Model that combines task-based programming models and MPI. In this programming model, tasks are used to achieve coarse-grained parallelism on large-scale distributed infrastructures, whereas MPI is used to gain fine-grained parallelism by parallelizing tasks execution. Such a hybrid programming model offers the possibility to enable parallel I/O and high-level I/O libraries in tasks. We enable such a hybrid programming model by supporting Native MPI Tasks. These tasks are native to the programming model for two reasons: they execute task code as opposed to calling external MPI binaries or scripts. Also, the data transfers and input/output handling is done in a completely transparent manner to application developers. Therefore, increasing parallelism levels while easing the design and programming of applications. Finally, to exploit the inherent parallelism opportunities in applications and overlap computation with I/O, we propose an Eager mechanism for releasing data dependencies. Unlike the traditional approach for releasing dependencies, eagerly releasing data dependencies allows successor tasks to be released for execution as soon as their data dependencies are ready, without having to wait for predecessor task(s) to completely finish execution. In order to support the eager-release of data dependencies, we describe the following core modifications to the design of task-based programming models: (i) defining and managing data dependency relationships as parameter-aware dependencies (ii) a mechanism for notifying the programming model that an output data has been generated before the execution of the producer task ends.Aquesta tesi contribueix des de la perspectiva dels models de programació basats en tasques als esforços d’optimitzar les aplicacions intensives de I/O. Al llarg d'aquesta tesi, proposem abstraccions i mecanismes del model de programació que persegueixen un doble objectiu: per una banda, millorar la I/O i el rendiment total de les aplicacions a les complexes infraestructures d'emmagatzematge de l'actualitat. D'altra banda, aconsegueixi aquesta millora del rendiment sense augmentar la complexitat de la programació d'aplicacions. Els paràgrafs següents resumeixen cadascuna de les nostres contribucions. En primer lloc, proposem una sèrie d'abstraccions a què ens referim com a abstraccions de consciència de I/O. Un model de programació basat en tasques amb reconeixement d'I/O pot separar el maneig d'I/O i els càlculs en admetre Tasques d'I/O. L'execució d'aquestes tasques es pot superposar amb l'execució de tasques de càlcul. A més, proporcionem suport de model de programació per millorar el rendiment d'I/O en abordar el problema de la congestió d'I/O. Això s'aconsegueix mitjançant l'ús de restriccions d'amplada de banda d'emmagatzematge per controlar el nivell de paral·lelisme de tasques. Admetem dos tipus d'aquestes restriccions: estàtic i autoajustable. En segon lloc, proposem un conjunt de capacitats a què ens referim com a Consciència d'heterogeneïtat d'emmagatzematge. Un model de programació basat en tasques conscient de l'heterogeneïtat de l'emmagatzematge es basa en els conceptes i les abstraccions que s'introdueixen en la primera contribució per millorar el rendiment d'I/O de les aplicacions en sistemes d'emmagatzematge heterogenis. Més específicament, aquests models de programació admeten les característiques següents: (i) abstreure l'heterogeneïtat dels dispositius d'emmagatzematge i exposar-los com a recurs d'emmagatzematge jeràrquic. (ii) admetre la programació d'I/O dedicada. (iii) Finalment, presentem un mecanisme que descarrega automàticament i periòdicament les dades obsoletes de les capes d'emmagatzematge superiors a les capes d'emmagatzematge inferiors. En tercer lloc, proposem un model de programació híbrid que combina models de programació basats en tasques i MPI. En aquest model de programació, les tasques s'utilitzen per aconseguir un paral·lelisme de gra gruixut en infraestructures distribuïdes a gran escala, mentre que MPI es fa servir per obtenir un paral·lelisme de gra fi en paral·lelitzar l'execució de tasques. Un model d'aquest tipus de programació híbrid ofereix la possibilitat d'habilitar I/O paral·leles i biblioteques d'I/O d'alt nivell en tasques. Habilitem un model de programació híbrid d'aquest tipus en admetre tasques MPI natives que executen codi de tasca en lloc de trucar a binaris o scripts MPI externs. A més, la transferència de dades i el maneig d’entrada / sortida es realitza d’una manera completament transparent per als desenvolupadors d’aplicacions. Per tant, augmenta els nivells de paral·lelisme alhora que se'n facilita el disseny i la programació d'aplicacions. Finalment proposem un mecanisme Eager per alliberar dependències de dades. A diferència de l'enfocament tradicional per alliberar dependències, alliberar amb entusiasme les dependències de dades permet que les tasques successores s'alliberin per a la seva execució tan aviat com les dependències de dades estiguin llestes, sense haver d'esperar que les tasques predecessores acabin completament l'execució. Per tal de donar suport a l'alliberament ansiós de les dependències de dades, descrivim les següents modificacions centrals al disseny de models de programació basats en tasques: (i) definir i administrar les relacions de dependència de dades com a dependències conscients de paràmetres (ii ) un mecanisme per notificar la model de programació que s'ha generat una dada de sortida abans que finalitzi l'execució de la tasca de productor.Postprint (published version

    Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs

    Get PDF
    A major trend in high performance computer architecture over the last two decades is the migration of memory in the form of high speed caches onto the microprocessor semiconductor die. Where temporal locality in the computation is high, caches prove very effective at hiding memory access latency and contention for communication resources. However where temporal locality is absent, caches may exhibit low hit rates resulting in poor operational efficiency. Vector computing exploiting pipelined arithmetic units and memory access address this challenge for certain forms of data access patterns, for example involving long contiguous data sets exhibiting high spatial locality. But for many advanced applications for science, technology, and national security at least some data access patterns are not consistent to the restricted forms well handled by either caches or vector processing. An important alternative is the reverse strategy; that of migrating logic in to the main memory (DRAM) and performing those operations directly on the data stored there. Processor in Memory (PIM) architecture has advanced to the point where it may fill this role and provide an important new mechanism for improving performance and efficiency of future supercomputers for a broad range of applications. One important project considering both the role of PIM in supercomputer architecture and the design of such PIM components is the Cray Cascade Project sponsored by the DARPA High Productivity Computing Program. Cascade is a Petaflops scale computer targeted for deployment at the end of the decade that merges the raw speed of an advanced custom vector architecture with the high memory bandwidth processing delivered by an innovative class of PIM architecture. The work represented here was performed under the Cascade project to explore critical design space issues that will determine the value of PIM in supercomputers and contribute to the optimization of its design. But this work also has strong relevance to hybrid systems comprising a combination of conventional microprocessors and advanced PIM based intelligent main memory

    Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

    Full text link
    The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the biological data amount is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. At last we discuss the open issues in big biological data analytics

    Understanding and Optimizing Flash-based Key-value Systems in Data Centers

    Get PDF
    Flash-based key-value systems are widely deployed in today’s data centers for providing high-speed data processing services. These systems deploy flash-friendly data structures, such as slab and Log Structured Merge(LSM) tree, on flash-based Solid State Drives(SSDs) and provide efficient solutions in caching and storage scenarios. With the rapid evolution of data centers, there appear plenty of challenges and opportunities for future optimizations. In this dissertation, we focus on understanding and optimizing flash-based key-value systems from the perspective of workloads, software, and hardware as data centers evolve. We first propose an on-line compression scheme, called SlimCache, considering the unique characteristics of key-value workloads, to virtually enlarge the cache space, increase the hit ratio, and improve the cache performance. Furthermore, to appropriately configure increasingly complex modern key-value data systems, which can have more than 50 parameters with additional hardware and system settings, we quantitatively study and compare five multi-objective optimization methods for auto-tuning the performance of an LSM-tree based key-value store in terms of throughput, the 99th percentile tail latency, convergence time, real-time system throughput, and the iteration process, etc. Last but not least, we conduct an in-depth, comprehensive measurement work on flash-optimized key-value stores with recently emerging 3D XPoint SSDs. We reveal several unexpected bottlenecks in the current key-value store design and present three exemplary case studies to showcase the efficacy of removing these bottlenecks with simple methods on 3D XPoint SSDs. Our experimental results show that our proposed solutions significantly outperform traditional methods. Our study also contributes to providing system implications for auto-tuning the key-value system on flash-based SSDs and optimizing it on revolutionary 3D XPoint based SSDs

    An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures

    Get PDF
    Finding the shortest paths from a single source to all other vertices is a common problem in graph analysis. The Bellman-Ford's algorithm is the solution that solves such a single-source shortest path (SSSP) problem and better applies to be parallelized for many-core architectures. Nevertheless, the high degree of parallelism is guaranteed at the cost of low work efficiency, which, compared to similar algorithms in literature (e.g., Dijkstra's) involves much more redundant work and a consequent waste of power consumption. This article presents a parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency. The article presents different optimizations to the implementation, which are oriented both to the algorithm and to the architecture. The experimental results show that the proposed implementation provides an average speedup of 5x higher than the existing most efficient parallel implementations for SSSP, that it works on graphs where those implementations cannot work or are inefficient (e.g., graphs with negative weight edges, sparse graphs), and that it sensibly reduces the redundant work caused by the parallelization process
    corecore