    Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems

    This article studies mini-applications (Minisweep, GenASiS, GPP, and FF) that use computational methods commonly encountered in HPC. We ported these applications to OpenACC and OpenMP and evaluated their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM AC922 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are to make these new ports useful to both application and compiler developers, to document the lessons learned and the methodology for creating optimized OpenMP and OpenACC versions, and to describe possible migration paths between the two specifications. We highlight cases where specific directives or code patterns improve performance on a given architecture. We also discuss the functionality and maturity of the latest compilers available on these platforms with respect to their OpenACC and OpenMP implementations.

    vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

    The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN. Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 201
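The memory argument above can be illustrated with a toy accounting model. This is only a sketch under stated assumptions, not vDNN's actual memory manager: the function names and the per-layer sizes are hypothetical, and the offload policy (only two adjacent feature maps resident on the GPU at a time) is a simplification chosen for illustration. Only the 28 GB and 12 GB figures come from the abstract.

```python
from typing import List


def host_memory_needed(total_gb: float, gpu_gb: float) -> float:
    """Memory that does not fit on the GPU and must live elsewhere (e.g. host DRAM)."""
    return max(0.0, total_gb - gpu_gb)


def peak_resident_all(feature_maps_mb: List[float]) -> float:
    """Peak GPU usage if every feature map stays on the GPU until backpropagation."""
    return sum(feature_maps_mb)


def peak_offloaded(feature_maps_mb: List[float]) -> float:
    """Peak GPU usage under the ASSUMED policy that only two adjacent
    feature maps (a layer's input and output) are resident at once and
    all other feature maps are kept in host memory."""
    if len(feature_maps_mb) < 2:
        return sum(feature_maps_mb)
    return max(a + b for a, b in zip(feature_maps_mb, feature_maps_mb[1:]))


if __name__ == "__main__":
    # Figures from the abstract: VGG-16 at batch size 256 needs 28 GB, the GPU has 12 GB.
    print("GB that must live off-GPU:", host_memory_needed(28.0, 12.0))

    # Hypothetical per-layer feature map sizes in MB (not taken from the paper).
    sizes = [100.0, 400.0, 300.0, 200.0, 50.0]
    print("all resident:", peak_resident_all(sizes), "MB")
    print("offloaded:   ", peak_offloaded(sizes), "MB")
```

The model ignores weights, workspace buffers, and transfer time, so it says nothing about the 18% performance loss reported above; it only shows why the peak footprint can drop when inactive feature maps leave the GPU.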

    Study of Parallel Programming Models on Computer Clusters with Accelerators

    Accelerators have become a crucial part of supercomputer design on the path to exascale computing. This work examines the potential of two recent acceleration technologies: the Intel Many Integrated Core (MIC) architecture and graphics processing units (GPUs). The thesis runs three benchmarks under three configurations: MPI+CPU, MPI+GPU, and MPI+MIC. The benchmarks comprise an intensely communicating application, a loosely communicating application, and an embarrassingly parallel application. The thesis also presents a detailed study of the scalability and performance of MIC processors under two programming models, the offload model and the native model, on the Beacon computer cluster. The results show that the relative performance and scalability of GPUs and MICs depend on the benchmark. (1) For the embarrassingly parallel case, the GPU-based parallel implementation on the Keeneland computer cluster performs better than the other accelerators, but the MIC-based parallel implementation scales better than the GPU implementation. The native and offload models on MIC perform very similarly. (2) For the loosely communicating case, GPU and MIC performance is very close, and the MIC-based parallel implementation still scales strongly when 120 MIC processors are used. (3) For the intensely communicating case, the MPI implementations on CPUs and GPUs both scale strongly, and GPUs consistently outperform the other accelerators. The MIC-based implementation, however, does not scale well. Unlike in the embarrassingly parallel case, the two MIC programming models perform differently: the native model consistently outperforms the offload model by about 10 times. Allocating more MIC processors yields little performance gain, because the increased communication cost offsets the gain from the reduced workload on each MIC core.
    This work also tests performance and scalability by varying the number of threads on each MIC card from 10 to 60. For the intensely communicating case, the behavior of the MIC-based offload model depends on the thread count. Scalability holds as the number of threads increases from 10 to 30, and the computation time decreases at a smaller rate from 30 to 50 threads. With 60 threads, the computation time increases, because the communication overhead offsets the performance gain when 60 threads are deployed on a single MIC card.
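The trade-off described above, where adding threads first helps and then hurts, can be sketched with a simple cost model. This is an illustrative assumption, not the thesis's model or its measurements: the function `modeled_time` and the constants `work` and `comm_per_thread` are hypothetical and were picked only so that the curve has the same qualitative shape.

```python
def modeled_time(threads: int, work: float = 2500.0, comm_per_thread: float = 1.0) -> float:
    """Toy run time: compute cost shrinks as work/threads, while
    communication overhead grows linearly with the thread count.
    Both constants are hypothetical, not fitted to any measurement."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return work / threads + comm_per_thread * threads


if __name__ == "__main__":
    for t in (10, 30, 50, 60):
        print(f"{t:2d} threads -> modeled time {modeled_time(t):.2f}")
```

In this model the minimum lies where the two terms balance, at `sqrt(work / comm_per_thread)` threads; past that point every extra thread adds more overhead than it removes in compute time.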

    A functional approach to heterogeneous computing in embedded systems

    Developing programs for embedded systems presents quite a challenge: not only should programs be resource efficient, as they operate under memory and timing constraints, but they should also take full advantage of the hardware to achieve maximum performance. Since performance is such a significant factor in the design of embedded systems, modern systems typically incorporate more than one kind of processing element to benefit from specialized processing capabilities. For such heterogeneous systems, the challenge in developing programs is even greater. In this thesis we explore a functional approach to heterogeneous system development as a means to address many of the modularity problems that are typically found in the application of low-level imperative programming for embedded systems. In particular, we explore a staged hardware-software co-design language that we name Co-Feldspar and embed in Haskell. The staged approach enables designers to build their applications from reusable components and skeletons while retaining control over much of the generated source code. Furthermore, by embedding the language in Haskell we can exploit its type classes to write not only hardware and software programs, but also generic programs with overloaded instructions and expressions. We demonstrate the usefulness of the functional approach for co-design on a cryptographic example and signal processing filters, and benchmark software and mixed hardware-software implementations. Co-Feldspar currently adopts a monadic interface, which provides an imperative functional programming style that is suitable for explicit memory management and algorithms that rely on a certain evaluation order. For algorithms that are better defined as pure functions operating on immutable values, we provide a signal and array library that extends a monadic language, like Co-Feldspar. These extensions permit a functional style of programming by composing high-level combinators.
    Our compiler transforms such high-level code into efficient programs with mutating code. In particular, we show how to execute an FFT safely in-place, and how to describe FIR and IIR filters efficiently as streams. Co-Feldspar’s monadic interface is, however, quite invasive: not only is the burden of explicit memory management quite heavy on the user, it is also quite easy to shoot oneself in the foot. It is for these reasons that we also explore a dynamic memory management discipline that is based on regions but predictable enough to be of use for embedded systems. Specifically, this thesis introduces a program analysis that annotates values with dynamically allocated memory regions. By limiting our efforts to functional languages that target embedded software, we manage to define a region inference algorithm that is considerably simpler than traditional approaches.
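To make the idea of describing a FIR filter as a stream concrete, here is a minimal sketch in Python rather than in Co-Feldspar or Haskell. It is not the thesis's signal library: the function name `fir` and its signature are assumptions for illustration. The filter consumes samples lazily and keeps only a fixed-size window of past inputs, which is the property that makes a stream formulation attractive under memory constraints.

```python
from collections import deque
from typing import Iterable, Iterator, Sequence


def fir(coeffs: Sequence[float], samples: Iterable[float]) -> Iterator[float]:
    """Lazily apply a FIR filter: y[n] = sum_k coeffs[k] * x[n - k].

    Inputs before the start of the stream are treated as zero. Only the
    last len(coeffs) samples are kept, so memory use is constant.
    """
    if not coeffs:
        raise ValueError("at least one coefficient is required")
    # window[0] is the newest sample, window[-1] the oldest one still needed.
    window = deque([0.0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        window.appendleft(x)  # the oldest sample is dropped automatically
        yield sum(c * w for c, w in zip(coeffs, window))


if __name__ == "__main__":
    # Two-tap moving average over a short input stream.
    print(list(fir([0.5, 0.5], [1.0, 3.0, 5.0])))
```

Because `fir` is a generator, filters compose by nesting calls, for example `fir(b, fir(a, xs))`, without materializing intermediate arrays.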


    Heterogeneous multi/many-core chips are commonly used in today’s top-tier supercomputers. Similar heterogeneous processing elements, or computation accelerators, are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, the on-chip network is commonly used for point-to-point communication between on-chip components and the memory interface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip network and the off-chip network, which can make communication the performance bottleneck. This research proposes an MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides a direct interface to the heterogeneous processing elements, which can eliminate the data path through the OS and libraries. Detailed experimental results show that the MPE can significantly reduce communication time and improve overall performance, especially for heterogeneous computing systems, because of its tight coupling with the network. Additionally, a hybrid “MPI+X” computing system is tested, and the results show that the MPE can effectively offload communication and let the processing elements play to their strengths in computation.

    XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments

    Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Instead of keeping all of those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. In particular, this allows resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, defining these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low-memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks, taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate [12], a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find valid schedules for networks that make use of both central processing units and graphics processing units when memory limitations do not allow scheduling exclusively to the graphics processing unit.
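The basic trade-off behind checkpointing, less memory in exchange for recomputation, can be shown on a scalar chain of layers. This sketch uses plain uniform checkpointing on a single device; it is not XEngine's MIQP formulation or Checkmate's MILP, and the names `grad_store_all` and `grad_checkpointed` are hypothetical. Memory is counted as the number of stored layer inputs, and checkpoints are never freed in this simple accounting.

```python
import math
from typing import Callable, List, Tuple

# A layer is a pair (f, df): the forward function and its derivative.
Layer = Tuple[Callable[[float], float], Callable[[float], float]]


def grad_store_all(layers: List[Layer], x: float) -> Tuple[float, int, int]:
    """Keep every layer input. Returns (gradient, forward evals, peak stored inputs)."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    g = 1.0
    for (_, df), xin in zip(reversed(layers), reversed(inputs)):
        g *= df(xin)  # chain rule, last layer first
    return g, len(layers), len(inputs)


def grad_checkpointed(layers: List[Layer], x: float, every: int) -> Tuple[float, int, int]:
    """Keep only every `every`-th layer input and recompute the rest per segment."""
    n = len(layers)
    checkpoints = {}
    evals = 0
    for i, (f, _) in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x  # saved tensor (checkpoint)
        x = f(x)
        evals += 1
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, n)
        # Rematerialize the inputs of layers start..end-1 from the checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end - 1):
            seg.append(layers[i][0](seg[-1]))
            evals += 1
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for i in range(end - 1, start - 1, -1):
            g *= layers[i][1](seg[i - start])
    return g, evals, peak


if __name__ == "__main__":
    tanh_layer: Layer = (math.tanh, lambda v: 1.0 - math.tanh(v) ** 2)
    net = [tanh_layer] * 9
    print("store all:    ", grad_store_all(net, 0.3))
    print("checkpointed: ", grad_checkpointed(net, 0.3, every=3))
```

With nine layers and a checkpoint every three layers, the sketch stores at most 5 inputs instead of 9 and pays for it with 6 extra forward evaluations, while producing the same gradient. Choosing which tensors to keep, and on which device to run each operator, is the scheduling problem the abstract describes.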