    Scheduling Heterogeneous HPC Applications in Next-Generation Exascale Systems

    Next-generation HPC applications will increasingly time-share system resources with emerging workloads such as in-situ analytics, resilience tasks, runtime adaptation services, and power management activities. HPC systems must carefully schedule these co-located codes to reduce their impact on application performance. Among the techniques traditionally used to mitigate the performance effects of time-shared systems is gang scheduling. This approach, however, relies on global synchronization and time-agreement mechanisms that will become hard to support as systems grow in size, so alternative interference-mitigation approaches must be explored for future HPC systems. This dissertation evaluates the impact of workload concurrency in future HPC systems. It uses simulation and modeling techniques to study the performance impact of existing and emerging interference sources on a selection of HPC benchmarks, mini-applications, and applications. It also quantifies the costs and benefits of different approaches to scheduling co-located workloads, studies performance-interference mitigation solutions based on gang scheduling, and examines their synchronization requirements. To do so, the dissertation presents and leverages a new Extreme Value Theory-based model to characterize interference sources and investigate their impact on Bulk Synchronous Parallel (BSP) applications. It demonstrates how this model can be used to analyze the interference-attenuation effects of alternative fine-grained OS scheduling approaches based on periodic real-time schedulers. This analysis can, in turn, guide the design of those mitigation techniques by providing tools to understand the tradeoffs of selecting scheduling parameters.
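    The scaling argument at the heart of this model can be illustrated with a small Monte-Carlo sketch (not the dissertation's actual EVT model; the exponential noise distribution, 10 ms superstep, and process counts below are illustrative assumptions): because a BSP superstep ends only when the slowest process reaches the barrier, the expected superstep time tracks the maximum of the per-process interference, a quantity governed by extreme-value statistics that grows with the number of processes.

```cpp
// Monte-Carlo sketch: mean BSP superstep time under i.i.d. OS interference.
// The barrier waits for the slowest rank, so the expected slowdown follows
// the extreme-value statistics of the noise (for exponential noise, roughly
// ln(P) growth). All parameters are illustrative assumptions.
#include <algorithm>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    const double compute_ms = 10.0;                    // noiseless work per superstep
    std::exponential_distribution<double> noise(1.0);  // interference, mean 1 ms
    const int supersteps = 10000;

    for (int procs : {16, 256, 4096}) {
        double total = 0.0;
        for (int s = 0; s < supersteps; ++s) {
            double slowest = 0.0;  // barrier releases when the last rank arrives
            for (int p = 0; p < procs; ++p)
                slowest = std::max(slowest, compute_ms + noise(rng));
            total += slowest;
        }
        const double mean = total / supersteps;
        std::printf("%5d procs: mean superstep %.2f ms (slowdown %.2fx)\n",
                    procs, mean, mean / compute_ms);
    }
}
```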

    Vectorizing unstructured mesh computations for many-core architectures

    Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi for a key class of irregular applications: unstructured mesh computations. Using the single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library, and we explore how irregular memory accesses and race conditions can be handled on different hardware. We benchmark Intel Xeon CPUs and the Xeon Phi using a tsunami simulation and a representative CFD benchmark, and compare the results with previous work on CPUs and NVIDIA GPUs to assess achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, but it results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon Phi does not provide good performance for these applications, but it is still comparable with a pair of mid-range Xeon chips.
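    To make the contrast concrete, here is a minimal sketch (not OP2's generated code) of the intrinsics approach for the gather side of an unstructured-mesh edge loop, assuming AVX2 hardware, double-precision data, and an edge count divisible by four; the scatter/increment side, where the race conditions arise, would additionally need edge coloring or conflict-free reordering.

```cpp
// Sketch of vectorizing an unstructured-mesh edge loop with AVX2 intrinsics
// (compile with -mavx2). Indirect reads through the edge-to-node mapping
// become vector gathers; four edges are processed per iteration.
#include <immintrin.h>
#include <cstdio>
#include <vector>

// Scalar reference: a per-edge average of the two endpoint values.
void edge_kernel_scalar(const int* n0, const int* n1, const double* x,
                        double* out, int nedges) {
    for (int e = 0; e < nedges; ++e)
        out[e] = 0.5 * (x[n0[e]] + x[n1[e]]);
}

// AVX2 version: gathers replace the irregular x[n0[e]], x[n1[e]] accesses.
void edge_kernel_avx2(const int* n0, const int* n1, const double* x,
                      double* out, int nedges) {
    const __m256d half = _mm256_set1_pd(0.5);
    for (int e = 0; e < nedges; e += 4) {
        __m128i i0 = _mm_loadu_si128((const __m128i*)(n0 + e));
        __m128i i1 = _mm_loadu_si128((const __m128i*)(n1 + e));
        __m256d a = _mm256_i32gather_pd(x, i0, 8);  // x[n0[e..e+3]]
        __m256d b = _mm256_i32gather_pd(x, i1, 8);  // x[n1[e..e+3]]
        _mm256_storeu_pd(out + e, _mm256_mul_pd(half, _mm256_add_pd(a, b)));
    }
}

int main() {
    const int N = 8, E = 8;
    std::vector<int> n0 = {0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<int> n1 = {1, 2, 3, 4, 5, 6, 7, 0};  // a ring of edges
    std::vector<double> x(N), simd(E), ref(E);
    for (int i = 0; i < N; ++i) x[i] = i;
    edge_kernel_scalar(n0.data(), n1.data(), x.data(), ref.data(), E);
    edge_kernel_avx2(n0.data(), n1.data(), x.data(), simd.data(), E);
    for (int e = 0; e < E; ++e)
        std::printf("edge %d: scalar %.1f, avx2 %.1f\n", e, ref[e], simd[e]);
}
```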

    An analysis of using Raspberry single-board computers in teaching distributed and parallel computing

    This article considers the possibility of using single-board computers in teaching courses related to parallel and distributed computing. For practical classes in parallel computing, a personal computer with a multi-core processor and a suitable operating system is sufficient, but practical classes in distributed computing require a computing cluster. Building such a cluster requires specific software configuration, which can make it difficult to use the computer lab for teaching other courses, and purchasing personal computers for the cluster can lead to significant costs. It is therefore worthwhile to investigate the use of single-board computers to build a computing cluster and to use it in practical distributed computing assignments. An analysis of modern single-board computers and of various sources on building computing clusters showed that the Raspberry Pi 3 Model B+ is well suited for this purpose. A comparative analysis covered several Raspberry Pi models and, for reference, a single-board computer from another manufacturer, the ASUS Tinker Board. The Raspberry Pi 3 Model B+ was found to offer the best price/quality ratio for building an educational computing cluster for distributed and parallel computing. Such a cluster is comparatively inexpensive and scales very well. The analysis also identified problems that must be solved when designing such a cluster: the lack of standardized chassis for mounting the boards, the lack of standard means of removing heat from single-board computers, and the need to design a custom protected power supply. Building such a cluster will significantly improve students' skills in developing software for distributed computing and in constructing and configuring cluster systems, and will help develop these competencies in future software engineers.
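    A first practical exercise on such a cluster might look like the following minimal MPI sketch (an assumption, not part of the article: it presumes an MPI implementation such as MPICH or Open MPI installed on every board), in which each rank integrates part of 4/(1+x^2) over [0,1] and rank 0 reduces the partial sums into an estimate of pi.

```cpp
// Minimal MPI exercise for a Raspberry Pi teaching cluster: distributed
// numerical integration of 4/(1+x^2) over [0,1], which converges to pi.
// Each rank handles a strided share of the interval; rank 0 sums the parts.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long steps = 10000000;
    const double h = 1.0 / steps;
    double partial = 0.0;
    for (long i = rank; i < steps; i += size) {
        const double x = (i + 0.5) * h;  // midpoint rule
        partial += 4.0 / (1.0 + x * x);
    }
    double pi = 0.0;
    MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("pi ~= %.10f (on %d ranks)\n", pi * h, size);
    MPI_Finalize();
}
```

    Run with, for example, mpic++ pi.cpp -o pi followed by mpirun -np 8 -hostfile hosts ./pi, where hosts is a file listing the addresses of the cluster's boards.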

    ALMA: ALgorithm Modeling Application

    As of today, the most recent trend in information technology is the employment of large-scale data analytics powered by Artificial Intelligence (AI), influencing the priorities of businesses and research centers all over the world. However, due both to the lack of specialized talent and to the need for greater compute resources, less established businesses struggle to adopt such endeavors, with major technological corporations such as Microsoft, Facebook, and Google gaining the upper hand on this uneven playing field. Therefore, in an attempt to promote the democratization of AI and increase the efficiency of data scientists, this work proposes a novel no-code/low-code AI platform: the ALgorithm Modeling Application (ALMA). Moreover, as the state of the art of such platforms is still maturing, current solutions often fail to incorporate security and safety aspects directly into their process. In that respect, the solution proposed in this thesis aims not only to achieve greater development and deployment efficiency in building machine learning applications, but also to improve on existing platforms by addressing the inherent pitfalls of AI through a "secure by design" philosophy.

    Enabling efficient graph computing with near-data processing techniques

    With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. However, when mapped to modern computing systems, graph computing typically suffers from poor performance because of inefficiencies in memory subsystems. At the same time, emerging technologies such as the Hybrid Memory Cube (HMC) enable processing-in-memory (PIM) functionality, a promising form of near-data processing (NDP), by integrating compute units into the 3D-stacked logic layer. PIM units allow operation offloading at the instruction level, which has considerable potential to overcome the performance bottleneck of graph computing. Nevertheless, prior studies have not fully explored this functionality for graph workloads or identified its applications and shortcomings. The main objective of this dissertation is to enable NDP techniques for efficient graph computing; specifically, it investigates PIM offloading at the instruction level. To achieve this goal, it presents a graph benchmark suite for understanding graph computing behavior and then proposes architectural techniques for PIM offloading on various host platforms.

    The dissertation first presents GraphBIG, a comprehensive graph benchmark suite. To cover the major graph computation types and data sources, GraphBIG selects representative data representations, workloads, and datasets from 21 real-world use cases across multiple application domains. Characterizing the benchmarks on real machines reveals extremely irregular memory patterns and significantly diverse behavior across computation types. GraphBIG helps users understand the behavior of modern graph computing on hardware architectures and enables future architecture and system research for graph computing.

    To achieve better graph computing performance, the dissertation proposes GraphPIM, a full-stack NDP solution for graph computing. It analyzes modern graph workloads to assess the applicability of PIM offloading and presents hardware and software mechanisms to make efficient use of the PIM functionality. Following the real-world HMC 2.0 specification, GraphPIM provides performance benefits for graph applications without any user code modification or ISA changes, and it further proposes an extension to PIM operations that brings benefits to more graph applications. The evaluation results show that GraphPIM achieves up to a 2.4x speedup with a 37% reduction in energy consumption.

    Finally, to effectively utilize NDP systems with GPU-based host architectures, which can consume hundreds of gigabytes per second of bandwidth, the dissertation explores managing the thermal constraints of 3D-stacked memory cubes. Experiments with an HMC prototype show that its operating temperature is much higher than that of conventional DRAM and can even cause thermal shutdown under a passive cooling solution. Even with a commodity-server cooling solution, HMC can fail to keep the memory dies within the normal operating range when in-memory processing is heavily utilized, resulting in higher energy consumption and performance overhead. To this end, the dissertation proposes CoolPIM, a thermal-aware source-throttling mechanism that controls the intensity of PIM offloading at runtime, keeping the memory dies of HMC within the normal operating temperature using software-based techniques. The evaluation results show that CoolPIM achieves up to 1.4x and 1.37x speedups compared with non-offloading and naïve offloading scenarios.
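    The source-throttling idea admits a compact schematic sketch (an illustration in the spirit of CoolPIM, not the dissertation's mechanism; read_hmc_temp() and the temperature thresholds are hypothetical placeholders): lower the fraction of offloadable instructions actually sent to the memory cube when the dies run hot, and raise it again as they cool.

```cpp
// Schematic sketch of thermal-aware source throttling in the spirit of
// CoolPIM. read_hmc_temp() and the thresholds are hypothetical placeholders;
// a real system would query the memory cube's thermal sensor.
#include <algorithm>
#include <cstdio>

double read_hmc_temp() {   // stub sensor: ramps upward to exercise the logic
    static double t = 70.0;
    return t += 4.0;
}

// Fraction of offloadable atomics actually sent to PIM units, in [0, 1].
double update_offload_intensity(double intensity) {
    const double hot_c = 85.0, cool_c = 75.0, step = 0.1;
    const double t = read_hmc_temp();
    if (t > hot_c)       intensity -= step;  // throttle: keep ops on the host
    else if (t < cool_c) intensity += step;  // headroom: offload more to PIM
    return std::clamp(intensity, 0.0, 1.0);
}

int main() {
    double intensity = 0.5;
    for (int epoch = 0; epoch < 6; ++epoch) {
        intensity = update_offload_intensity(intensity);
        std::printf("epoch %d: offload intensity %.1f\n", epoch, intensity);
    }
}
```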

    Domain-specific Architectures for Data-intensive Applications

    Graphs' versatile ability to represent diverse relationships makes them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results, and medical centers use them to aid in patient diagnosis. Most recently, graphs have also been employed to support the management of viral pandemics. Looking forward, they show promise of being critical in unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all of these applications require more computational power than conventional computing systems can provide. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time they access data in seemingly unpredictable patterns and thus cannot benefit from traditional on-chip storage.

    In this dissertation, we set out to address the performance limitations of existing computing systems so as to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing the memory architecture, 2) processing data near its storage, and 3) coalescing messages in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, an architecture that optimizes the processing of infrequently accessed data. Overall, these solutions provide 2x performance improvements with negligible hardware overheads across a wide range of applications. Finally, we demonstrate the applicability of our strategies to other data-intensive domains by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads.
    Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163186/1/abrahad_1.pd
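    The message-coalescing strategy can be illustrated with a small software analogue (an illustration only; MessageFusion performs this combining in the on-chip network hardware, and the PageRank-style updates below are an assumed workload): contributions headed for the same destination vertex are merged with an associative operator before being sent, so each vertex receives a single message.

```cpp
// Software analogue of message coalescing in a graph workload: updates that
// target the same destination vertex are combined (here, summed) before
// traveling further, shrinking the message count in the network.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Update { int dst; double value; };

// Coalesce PageRank-style contributions: one accumulated message per vertex.
std::unordered_map<int, double> coalesce(const std::vector<Update>& raw) {
    std::unordered_map<int, double> merged;
    for (const auto& u : raw) merged[u.dst] += u.value;  // associative combine
    return merged;
}

int main() {
    std::vector<Update> raw = {{7, 0.1}, {3, 0.2}, {7, 0.4}, {3, 0.3}, {9, 0.5}};
    auto merged = coalesce(raw);
    std::printf("%zu raw updates -> %zu coalesced messages\n",
                raw.size(), merged.size());
}
```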