    On benchmarking of deep learning systems: software engineering issues and reproducibility challenges

    Since AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, Deep Learning (and Machine Learning/AI in general) has attracted exponentially growing interest. Nowadays, its adoption spreads over numerous sectors, such as automotive, robotics, healthcare and finance. The advancement of ML goes hand in hand with the quality improvements delivered by those solutions. However, these improvements are not free: ML algorithms require ever-increasing computational power, which pushes computer engineers to develop new devices capable of coping with this demand for performance. To foster the evolution of domain-specific architectures (DSAs), and thus ML research, it is key to make it easy to experiment with and compare them. This can be challenging since, even if the software built around these devices simplifies their usage, obtaining the best performance is not always straightforward. The situation gets even worse when the experiments are not conducted in a reproducible way. Even though the importance of reproducibility for research is evident, it does not directly translate into reproducible experiments. In fact, as already shown by previous studies in other research fields, ML is also facing a reproducibility crisis. Our work addresses the reproducibility of ML applications. Reproducibility in this context has two aspects: reproducibility of results and reproducibility of performance. While reproducibility of results is mandatory, reproducibility of performance cannot be neglected, because the use of high-performance devices incurs costs. To understand the current state of performance reproducibility in ML, we reproduce results published for the MLPerf suite, which appears to be the most widely used machine learning benchmark. Because of the wide range of devices and frameworks used in different benchmark submissions, we focus on a subset of accuracy and performance results submitted to the MLPerf Inference benchmark, presenting a detailed analysis of the difficulties a scientist may encounter when trying to reproduce such a benchmark, and a possible solution using our workflow tool for experiment reproducibility, PROVA!. We designed PROVA! to support reproducibility in traditional HPC experiments, but we show how we extended it to be used as a 'driver' for MLPerf benchmark applications. The PROVA! driver mode allows us to experiment with different versions of the MLPerf Inference benchmark, switching among different hardware and software combinations, and to compare them in a reproducible way. In the last part, we present the results of our reproducibility study, demonstrating the importance of having a support tool to reproduce and extend the original experiments and to gain deeper knowledge about performance behaviours.
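
    The abstract does not describe PROVA!'s interface, so the sketch below is only a hedged illustration of the underlying idea: performance reproducibility requires pinning the hardware and software environment next to every benchmark result so that runs on different device/framework combinations can later be compared. All function and file names here are hypothetical and are not part of PROVA! or MLPerf.

```python
# Illustrative sketch (not PROVA!'s actual interface): record the software and
# hardware environment next to a benchmark result so a run can later be
# compared or reproduced under the same conditions.
import json
import platform
import subprocess
from datetime import datetime, timezone

def capture_environment() -> dict:
    """Collect basic hardware/software metadata for a benchmark run."""
    env = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    try:
        # pip freeze pins the framework versions (e.g. TensorFlow/PyTorch builds).
        env["packages"] = subprocess.check_output(
            ["pip", "freeze"], text=True
        ).splitlines()
    except (OSError, subprocess.CalledProcessError):
        env["packages"] = []
    return env

def record_run(result: dict, path: str = "run_metadata.json") -> None:
    """Store the benchmark result together with the captured environment."""
    with open(path, "w") as f:
        json.dump({"environment": capture_environment(), "result": result}, f, indent=2)

if __name__ == "__main__":
    # A hypothetical accuracy/latency result from one MLPerf-style inference run.
    record_run({"benchmark": "image-classification", "accuracy": 0.761, "latency_ms": 4.2})
```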

    Seamless optimization of the GEMM kernel for task-based programming models

    The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited to task-based applications because they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic that selects the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. Evaluating performance and energy consumption on two modern multi-core systems, we show that our implementations provide significant performance improvements over an optimized OpenMP fork-join implementation and can beat vendor implementations of GEMM (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance.
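
    The paper's implementation uses OmpSs-2 task and dependency pragmas in C, which are not reproduced here; the NumPy sketch below only illustrates the blocked decomposition that such a task-based GEMM parallelizes, with each output tile acting as an independent unit of work and the successive k-updates on a tile forming the dependency chain a task-based runtime would track instead of a fork-join barrier.

```python
# A minimal sketch (assuming NumPy) of the blocked decomposition behind a
# task-based GEMM: C[i,j] += A[i,k] @ B[k,j] over tiles.
import numpy as np

def blocked_gemm(A: np.ndarray, B: np.ndarray, bs: int = 128) -> np.ndarray:
    """Compute C = A @ B tile by tile with block size `bs`."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, bs):
        for j in range(0, n, bs):
            # Each (i, j) tile of C is one "task"; the k-updates on the same
            # tile depend on each other, not on the rest of the matrix.
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((300, 200)), rng.standard_normal((200, 250))
    assert np.allclose(blocked_gemm(A, B, bs=64), A @ B)
```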

    Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

    The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the volume of biological data is becoming so large that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in the life sciences. As a result, both biologists and computer scientists face the challenge of gaining profound insight into the deepest biological functions from big biological data, which in turn requires massive computational resources. Therefore, high-performance computing (HPC) platforms are highly needed, as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of how they have been mapped onto various computing platforms. After that, we present a case study comparing the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
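
    As a hedged illustration of the case-study workload, the sketch below implements the classical Smith-Waterman local-alignment recurrence; the scoring parameters are made up, and the survey itself does not prescribe this particular algorithm or scoring scheme.

```python
# Toy Smith-Waterman local alignment score with linear gap cost
# (match=2, mismatch=-1, gap=-2 are illustrative values only).
import numpy as np

def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the best local alignment score between sequences a and b."""
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell depends on its left, upper and diagonal neighbours,
            # which is why anti-diagonals (or tiles) are the natural unit of
            # parallelism on GPUs, FPGAs and multi-core CPUs.
            H[i, j] = max(0, H[i - 1, j - 1] + s, H[i - 1, j] + gap, H[i, j - 1] + gap)
    return int(H.max())

if __name__ == "__main__":
    print(smith_waterman_score("GATTACA", "GCATGCU"))
```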

    Task Oriented Programming for the RC64 Manycore DSP

    RC64 is a rad-hard manycore DSP combining 64 VLIW/SIMD DSP cores, lock-free shared memory, a hardware scheduler and a task-based programming model. The hardware scheduler enables fast scheduling and allocation of fine-grained tasks to all cores. Parallel programming is based on tasks.
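
    The following toy sketch is not RC64's actual programming interface; it merely illustrates the task-based execution model described above, with workers repeatedly pulling fine-grained tasks from a shared queue much as a hardware scheduler allocates ready tasks to free cores.

```python
# Illustrative only: a software analogue of a scheduler dispatching
# fine-grained tasks to a pool of cores (here, threads).
import queue
import threading

def run_tasks(tasks, num_workers=4):
    """Execute a list of zero-argument callables on `num_workers` workers."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

if __name__ == "__main__":
    print(sorted(run_tasks([lambda i=i: i * i for i in range(16)])))
```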

    Enabling high performance dynamic language programming for micro-core architectures

    Micro-core architectures are intended to deliver high performance at a low overall power consumption by combining many simple central processing unit (CPU) cores, each with a small amount of associated memory, onto a single chip. This technology is not only of great interest for embedded, Edge and IoT applications but also for High-Performance Computing (HPC) accelerators. However, micro-core architectures are difficult to program and exploit, not only because each technology is different, with its own idiosyncrasies, but also because each presents a different low-level interface to the programmer. Furthermore, micro-cores have very constrained amounts of on-chip scratchpad memory (often around 32 KB), further hampering programmer productivity by requiring the programmer to manually manage the regular loading and unloading of data between the host and the device during program execution. To help address these issues, dynamic languages such as Python have been ported to several micro-core architectures, but these are often delivered as interpreters, with the associated performance penalty relative to natively compiled languages such as C. The research questions for this thesis target four areas of concern for dynamic programming languages on micro-core architectures: (RQ1) how to manage the limited on-chip memory for data, (RQ2) how to manage the limited on-chip memory for code, (RQ3) how to address the low runtime performance of virtual machines, and (RQ4) how to manage the idiosyncrasies of micro-core architectures. The focus of this work is to address these concerns whilst maintaining the programmer productivity benefits of dynamic programming languages, using ePython as the research vehicle. Therefore, key areas of design (such as abstractions for offload) and implementation (novel compiler and runtime techniques for these technologies) are considered, resulting in a number of approaches that are applicable not only to the compilation of Python codes but also, more generally, to other dynamic languages on micro-core architectures. RQ1 was addressed by providing support for kernels with arbitrary data sizes through high-level programming abstractions that enable access to the memory hierarchies of micro-core devices, allowing the deployment of real-world applications, such as a machine learning code that detects cancer cells in full-sized scan images. A new abstract machine, Olympus, addressed RQ2 by supporting the compilation of dynamic languages, such as Python, to micro-core native code. Olympus enables ePython to close the kernel runtime performance gap with native C, matching C for the LINPACK and an iterative Fibonacci benchmark, and providing, on average, around 75% of native C runtime performance across four benchmarks running on a set of eight CPU architectures. Olympus also addresses RQ3 by providing dynamic function loading, supporting kernel codes larger than the on-chip memory, whilst still retaining the runtime performance benefits of native code generation. Finally, RQ4 was addressed by the Eithne benchmarking framework, which not only enabled a single benchmarking code to be deployed, unchanged, across different CPU architectures, but also provided the underlying communications framework for Olympus. The portability of end-user ePython codes and the underlying Olympus abstract machine were validated by running a set of four benchmarks on eight different CPU architectures from a single codebase.
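
    ePython's real offload abstractions are not reproduced in the abstract, so the sketch below is a hypothetical illustration of the idea behind RQ1: a wrapper that streams an arbitrarily large array through a kernel in chunks small enough to fit a micro-core's scratchpad memory. The names offload and SCRATCHPAD_BYTES are invented for this example.

```python
# A minimal sketch, assuming nothing about ePython's real decorators: stream a
# large array through a kernel in scratchpad-sized chunks, as a device runtime
# would copy slices to on-chip memory, run the kernel, and copy results back.
import numpy as np

SCRATCHPAD_BYTES = 32 * 1024  # typical on-chip memory budget cited for micro-cores

def offload(kernel):
    """Wrap `kernel` so it is applied chunk-by-chunk, as a device runtime would."""
    def run(data: np.ndarray) -> np.ndarray:
        chunk = max(1, SCRATCHPAD_BYTES // data.itemsize)
        out = np.empty_like(data)
        for start in range(0, data.size, chunk):
            # On a real device this slice would be loaded to the scratchpad,
            # processed on-core, and the result unloaded back to the host.
            out[start:start + chunk] = kernel(data[start:start + chunk])
        return out
    return run

@offload
def scale_and_threshold(x):
    return np.where(x > 0.5, x * 2.0, 0.0)

if __name__ == "__main__":
    x = np.random.default_rng(1).random(1_000_000)
    y = scale_and_threshold(x)
    print(y.shape, float(y.max()))
```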

    Heterogeneity, High Performance Computing, Self-Organization and the Cloud

    application; blueprints; self-management; self-organisation; resource management; supply chain; big data; PaaS; SaaS; HPCaaS

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g., due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are now part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share of each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their causes, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to demonstrate the applicability of the analysis.
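
    A toy sketch of the core idea behind critical-path analysis follows: given a DAG of program activities with durations, the critical path is the longest chain of dependent activities, and only shortening activities on it shortens the overall run. The activity graph below is invented for illustration and is unrelated to the thesis's trace-based implementation.

```python
# Longest dependency chain (critical path) in a small, made-up activity DAG.
def critical_path(durations, deps):
    """Return (length, path) of the longest dependency chain.

    durations: {activity: time}; deps: {activity: [activities it waits for]}.
    """
    best = {}  # activity -> (finish_time, predecessor on longest chain)

    def finish(a):
        if a not in best:
            preds = deps.get(a, [])
            start, pred = max(((finish(p)[0], p) for p in preds), default=(0.0, None))
            best[a] = (start + durations[a], pred)
        return best[a]

    end = max(durations, key=lambda a: finish(a)[0])
    length = finish(end)[0]
    path, a = [], end
    while a is not None:
        path.append(a)
        a = best[a][1]
    return length, list(reversed(path))

if __name__ == "__main__":
    durations = {"init": 1.0, "compute_gpu": 4.0, "compute_cpu": 2.5, "reduce": 0.5}
    deps = {"compute_gpu": ["init"], "compute_cpu": ["init"], "reduce": ["compute_gpu", "compute_cpu"]}
    print(critical_path(durations, deps))  # the GPU branch dominates the critical path
```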

    Auto-tuning compiler options for HPC


    Implementation Techniques and Performance Analysis of Jacobi SVD Methods Suited to High-Performance Computers

    The massive parallelism of recent high-performance computers (HPC) requires all algorithms running on them, including singular value decomposition (SVD) algorithms, to have a high level of scalability. To achieve this goal, we focus on the one-sided Jacobi SVD (OSJ) algorithm. The OSJ, with extension techniques such as blocking and parallelization, has the potential for high parallel efficiency due to its large-grained parallelism, but some of its theoretical properties are unknown. Furthermore, implementation techniques of the OSJ for HPC have not been studied well. This thesis aims to analyze the blocking and parallelization techniques of the OSJ both theoretically and experimentally and provides new parallelization techniques for HPC. It consists of seven chapters. In the first chapter, the motivation and contributions of the thesis are described. In the second and third chapters, as background, the current trend of HPC and applications of the SVD are described, together with the idea and detailed algorithm of the OSJ and the existing extension techniques. In the fourth chapter, a new bound on the orthogonality error of Hari's V2 method is provided. The V2 method is a blocking method for the OSJ suitable for HPC. The bound is tighter than the existing one due to Drmač, thanks to the exploitation of the diagonally scaled structure of the matrix. In the fifth chapter, a new implementation method for HPC based on a 2D blocked data distribution and all-reduce type communication is described. A theoretical analysis of the method shows that the number of communications is reduced compared with existing data distributions. Experimental results on highly parallel machines support the theoretical prediction and show good strong scalability of the method. Bečka's dynamic ordering method, which can reduce the number of iterations of the OSJ, is also analyzed, and a new bound on its global convergence rate is provided. In the sixth chapter, implementation techniques of DSYRK for a many-core CPU, a new class of CPU architecture used in HPC, are considered. DSYRK is a variant of matrix-matrix multiplication used in the V2 method. A new parallelization technique for DSYRK, which exploits all three dimensions of parallelism in matrix-matrix multiplication, accelerates DSYRK on the Xeon Phi, Intel's many-core CPU, to up to 75% of the theoretical peak performance. The last chapter concludes the thesis and outlines future work.

    Modern high-performance computers are highly parallel, and fundamental matrix computations used in scientific computing, such as the singular value decomposition (SVD), must cope with this degree of parallelism. The one-sided Jacobi method is one of the SVD algorithms implemented in the widely used matrix computation library LAPACK, and it combines high accuracy with practical speed. Combined with extension techniques such as blocking and parallelization, it can also attain high computational efficiency and coarse-grained parallelism. The one-sided Jacobi method is therefore considered well suited to high-performance computers, but its blocking and parallelization had not been sufficiently verified theoretically or experimentally, and implementations targeting modern high-performance computers had not been well studied. This thesis provides theoretical and experimental analyses of extension techniques for the one-sided Jacobi method, together with the development and analysis of implementation techniques with high parallel performance, and it consists of seven chapters. Chapter 1 explains the purpose of this research by illustrating, with concrete examples, the accuracy advantages and performance problems of the current one-sided Jacobi method, and outlines the contributions and structure of the thesis. Chapter 2 summarizes the background: first, the characteristics of modern high-performance computers, showing that supporting a high degree of parallelism is essential for SVD algorithms; second, the SVD and its applications in scientific computing, together with recently clarified error properties of the SVD; and finally, an overview of SVD algorithms and the basic ideas behind the Jacobi method and its one-sided variant for the SVD. Chapter 3 describes the one-sided Jacobi method and existing extension techniques in detail: it first presents the detailed procedure of the method and points out its problems to motivate the extensions, and then explains the cyclic orderings of the Jacobi method, which are closely related to parallelization, the blocking techniques needed for high performance, and the recently developed preprocessing techniques that led to dramatic speed-ups of the Jacobi method. Chapter 4 presents one of the contributions of the thesis, an error analysis of the blocked one-sided Jacobi method. Among the possible blocking schemes, the thesis focuses on Hari's V2 method, which is particularly suited to high-performance computing, and studies the orthogonality error that affects the convergence of the one-sided Jacobi method. The analysis extends the earlier work by Drmač and obtains a tighter bound by exploiting the diagonally scaled structure of the matrix; experiments with a large number of matrices show that the matrix-dependent coefficient appearing in the bound is small in practice. Chapter 5 first discusses the second contribution, a parallelization method for distributed-memory machines based on a two-dimensional blocked data distribution and AllReduce-type computation. Almost all modern high-performance computers are distributed-memory parallel machines, and on such machines the data distribution determines the characteristics of computation and communication. Using the computation pattern of the V2 method, the thesis presents a new parallelization scheme that combines a two-dimensional blocked distribution with AllReduce-type computation, and a theoretical analysis of the computation and communication costs shows that the scheme reduces the number of communications, in an asymptotic sense, compared with existing data distributions. Experiments on high-performance computers confirm that the performance of the scheme follows the theoretical prediction, and a large-scale evaluation on the K computer shows high strong scalability, with performance improving up to about 25,000 cores when computing the SVD of a 10,000-dimensional matrix. Second, as the third contribution, the chapter analyzes the parallelization technique known as Bečka's dynamic ordering and derives a new upper bound on its linear convergence rate; unlike previous bounds, this bound is inversely proportional to the degree of parallelism, reflecting a property of dynamic ordering that is well suited to parallel computation. Chapter 6 discusses the fourth contribution, a high-performance implementation technique for DSYRK. DSYRK is a symmetric matrix-matrix product defined in BLAS and one of the main computations in the V2 method. The thesis examines the problems that arise on the Xeon Phi, a many-core CPU architecture expected to be widely used in the future: existing DSYRK implementations could not exploit the high performance of the Xeon Phi because of DSYRK's symmetric structure, so the thesis presents a new parallelization algorithm that uses all three dimensions of parallelism in matrix-matrix multiplication to cope with the high parallelism of many-core CPUs. Performance evaluation on the Xeon Phi shows high performance, reaching approximately 76% of the theoretical peak. Chapter 7 concludes the thesis and discusses future work.
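
    Neither abstract includes pseudocode, so the following NumPy sketch shows only the plain, unblocked, sequential one-sided Jacobi SVD on which the thesis builds; Hari's V2 blocking, the 2D blocked distribution with all-reduce communication, and the dynamic ordering are not reproduced here.

```python
# One-sided Jacobi SVD: columns of A are repeatedly orthogonalized in pairs by
# plane rotations until A = U * diag(sigma) * V^T with orthonormal U columns.
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    U = A.astype(float).copy()
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Gram entries of the current column pair.
                app, aqq = U[:, p] @ U[:, p], U[:, q] @ U[:, q]
                apq = U[:, p] @ U[:, q]
                off = max(off, abs(apq) / np.sqrt(app * aqq))
                if apq == 0.0:
                    continue
                # Jacobi rotation that zeroes the (p, q) Gram entry.
                tau = (aqq - app) / (2.0 * apq)
                t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.sqrt(1.0 + tau * tau))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = t * c
                R = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ R
                V[:, [p, q]] = V[:, [p, q]] @ R
        if off < tol:
            break
    sigma = np.linalg.norm(U, axis=0)  # singular values are the column norms
    U = U / sigma
    return U, sigma, V

if __name__ == "__main__":
    A = np.random.default_rng(0).standard_normal((8, 5))
    U, s, V = one_sided_jacobi_svd(A)
    assert np.allclose(U * s @ V.T, A)
    assert np.allclose(U.T @ U, np.eye(5), atol=1e-8)
```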