Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing
On GPU-equipped clusters, keeping the load balanced between GPUs and CPU cores is a critical issue for work sharing among these heterogeneous computing resources. We have been developing a runtime system for this problem on the PGAS language XcalableMP, named XcalableMP-dev/StarPU [1]. Through this development, we found that adaptive load balancing for GPU/CPU work sharing is necessary to achieve the best performance across various application codes. In this paper, we enhance our language system XcalableMP-dev/StarPU with a new feature that dynamically controls the task size assigned to these heterogeneous resources during application execution. Performance evaluation on several benchmarks confirms that the proposed feature works correctly and that heterogeneous work sharing provides up to about 40% higher performance than GPU-only execution, even for relatively small problem sizes.
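The dynamic task-size control described above can be sketched as a simple feedback loop. This is a minimal illustration under assumed constant per-item cost models; the function and parameter names (`adaptive_split`, `cpu_cost`, `gpu_cost`) are hypothetical stand-ins for measured kernel timings, not part of the actual XcalableMP-dev/StarPU API:

```python
def adaptive_split(total_items, steps, cpu_cost=4e-6, gpu_cost=1e-6):
    """Rebalance the GPU/CPU work split over several steps based on
    measured per-item execution times, so both resources finish together."""
    ratio = 0.5  # fraction of the work assigned to the GPU
    for _ in range(steps):
        gpu_items = max(int(total_items * ratio), 1)
        cpu_items = max(total_items - gpu_items, 1)
        # Stand-ins for real kernel timings (constant per-item costs here).
        t_gpu = gpu_items * gpu_cost
        t_cpu = cpu_items * cpu_cost
        # Give the faster resource proportionally more work next step.
        ratio = (t_cpu / cpu_items) / (t_cpu / cpu_items + t_gpu / gpu_items)
    return ratio

# With a CPU four times slower per item, the split converges to 80% GPU.
ratio = adaptive_split(1_000_000, 5)
```

In the real runtime the per-item costs would come from profiling each kernel at execution time rather than from fixed constants.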
White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing
In numerical computations, the precision of floating-point operations is a key
factor determining both performance (speed and energy efficiency) and
reliability (accuracy and reproducibility). However, raising precision improves
reliability at the cost of performance, and vice versa. The ultimate concept
for maximizing both at the same time is therefore minimal-precision computing
through precision-tuning, which selects the optimal precision for each
operation and datum. Several studies have already addressed this (e.g.
Precimonious and Verrou), but their scope is limited to precision-tuning alone.
Hence, we aim to propose a broader concept: a minimal-precision computing
system with precision-tuning that involves both the hardware and software
stack.
In 2019, we started the Minimal-Precision Computing project to pursue this
concept. Specifically, our system combines (1) a precision-tuning method based
on Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic
libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable
Gate Arrays (FPGAs) with High-Level Synthesis (HLS).
In this white paper, we provide an overview of technologies related to
minimal- and mixed-precision computing, outline the future direction of the
project, and discuss current challenges together with our project members and
guest speakers at the LSPANC 2020 workshop:
https://www.r-ccs.riken.jp/labs/lpnctrt/lspanc2020jan/
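As a toy illustration of the precision-tuning idea (not the project's DSA-based method), the sketch below searches for the smallest working precision of a whole computation using Python's `decimal` module; real tools such as Precimonious tune precision per variable rather than globally:

```python
from decimal import Decimal, getcontext

def minimal_precision(f, reference, tol, max_digits=50):
    """Return the smallest number of significant decimal digits at which
    f() still agrees with a high-precision reference within tol."""
    for digits in range(1, max_digits + 1):
        getcontext().prec = digits  # lower the working precision globally
        if abs(f() - reference) <= tol:
            return digits
    return None

# Reference value computed at high precision.
getcontext().prec = 50
reference = Decimal(1) / 7 + Decimal(1) / 11

digits = minimal_precision(lambda: Decimal(1) / 7 + Decimal(1) / 11,
                           reference, Decimal("1e-6"))
print(digits)  # 6 digits suffice for an absolute error below 1e-6
```

The same search structure applies when the tunable knob is a hardware format (e.g. an FPGA datapath width) instead of a software precision setting.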
CP-PACS : A massively parallel processor for large scale scientific calculations
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processor with 2048 processing units built at the Center for Computational Physics, University of Tsukuba. It has an MIMD architecture with distributed memory. The node processor of CP-PACS is a RISC microprocessor enhanced with a Pseudo Vector Processing feature, which realizes high-performance vector processing. The interconnection network is a 3-dimensional Hyper-Crossbar Network, which offers high flexibility and embeddability for various network topologies and communication patterns. The theoretical peak performance of the whole system is 614.4 GFLOPS. In this paper, we give an overview of the CP-PACS architecture and several of its special architectural characteristics. We then describe performance evaluations of both the single node processor and the parallel system, based on LINPACK and the Kernel CG of the NAS Parallel Benchmarks. Through these evaluations, the effectiveness of Pseudo Vector Processing…
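A quick arithmetic cross-check of the figures above (our derivation, not stated explicitly in the abstract): 614.4 GFLOPS spread over 2048 processing units implies a per-node theoretical peak of 300 MFLOPS.

```python
# Per-node theoretical peak implied by the system-wide figures.
nodes = 2048
peak_total_gflops = 614.4
per_node_mflops = peak_total_gflops * 1e3 / nodes
print(per_node_mflops)  # ≈ 300 MFLOPS per processing unit
```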
Performance Improvement for Matrix Calculation on CP-PACS Node Processor
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processing system with 2048 node processors for large-scale scientific calculations. Each node processor of CP-PACS has a special hardware feature called PVP-SW (Pseudo Vector Processor based on Slide Window), which realizes efficient vector processing on a superscalar processor without depending on the cache. In this paper, we present the effectiveness of PVP-SW through performance measurements of the LINPACK benchmark on a single node processor. Utilizing loop unrolling techniques and the Block-TLB feature, the PVP-SW function improves the basic performance by up to 3.3 times for 1000 × 1000 LINPACK. This performance corresponds to 73% of the theoretical peak.

1 Introduction

For efficient large-scale scientific calculations on massively parallel processors (MPPs), the sustained performance of each node processor must be high enough, in addition to increasing the number of node processors. CP-PACS [1] (Computational Physics by Parallel Array Computer System)…
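The reported figures can be cross-checked with a little arithmetic: the CP-PACS peak of 614.4 GFLOPS over 2048 nodes gives 300 MFLOPS per node, so 73% of peak is about 219 MFLOPS sustained for 1000 × 1000 LINPACK, and a 3.3× speedup then implies a baseline of roughly 66 MFLOPS without PVP-SW. The baseline figure is our inference, not stated in the paper:

```python
# Figures from the abstract, combined into a rough consistency check.
per_node_peak_mflops = 614.4e3 / 2048            # ~300 MFLOPS per node
sustained_mflops = 0.73 * per_node_peak_mflops   # 73% of peak
baseline_mflops = sustained_mflops / 3.3         # implied pre-PVP-SW baseline
print(round(sustained_mflops), round(baseline_mflops))  # ≈ 219 and ≈ 66
```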