swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture
The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among existing deep learning compilers, TVM is well known for its efficient code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself a competitive candidate for its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning applications on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The code automatically generated by swTVM for AlexNet and VGG-19 achieves 6.71x and 2.45x average speedup over hand-optimized OpenACC implementations on convolution and fully connected layers, respectively. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and high-performance architecture, particularly with productivity and efficiency in mind. We would like to open-source the implementation so that more people can embrace the power of deep learning compilers and the Sunway many-core processor.
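The ahead-of-time idea above can be illustrated with a toy code generator: instead of JIT-compiling on the target device, the compiler emits portable C source that can later be cross-compiled for an architecture such as Sunway. This is a hypothetical simplification for illustration only, not the actual swTVM implementation; the function name `emit_dense_layer_c` and the layer shapes are invented.

```python
# Toy sketch of ahead-of-time (AoT) code generation: emit C source for a
# fully connected layer with shapes fixed at compile time, so the output
# can be cross-compiled offline for the target architecture.

def emit_dense_layer_c(name, in_dim, out_dim):
    """Emit a C function computing y = W @ x + b for fixed layer shapes."""
    return f"""\
void {name}(const float *W, const float *x, const float *b, float *y) {{
    for (int i = 0; i < {out_dim}; ++i) {{
        float acc = b[i];
        for (int j = 0; j < {in_dim}; ++j)
            acc += W[i * {in_dim} + j] * x[j];
        y[i] = acc;
    }}
}}
"""

# Generate code for one (hypothetical) fully connected layer of a CNN.
src = emit_dense_layer_c("fc1", in_dim=4096, out_dim=1000)
print(src)
```

Because the loop bounds are baked in as constants, a cross-compiler for the target can fully unroll, tile, or vectorize them without any runtime code generation on the device.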
Exponentially Complex Quantum Many-Body Simulation via Scalable Deep Learning Method
For decades, people have been developing efficient numerical methods for solving the challenging quantum many-body problem, whose Hilbert space grows exponentially with the size of the problem. However, this journey is far from over, as previous methods all have serious limitations. Recently developed deep learning methods provide a very promising new route to solving long-standing quantum many-body problems. We report that a deep learning based simulation protocol can achieve solutions with state-of-the-art precision in Hilbert spaces as large as for the spin system and for the fermion system, using an HPC-AI hybrid framework on the new Sunway supercomputer. With high scalability up to 40 million heterogeneous cores, our applications have measured 94% weak-scaling efficiency and 72% strong-scaling efficiency. The accomplishment of this work opens the door to simulating spin models and fermion models on unprecedented lattice sizes with extremely high precision.
Comment: Massive ground-state optimizations of CNN-based wave functions for the - model and - model, carried out on a heterogeneous-core supercomputer
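The variational principle underlying such neural-network wave-function methods can be sketched on a toy problem: for a small system the full Hilbert space can be enumerated, and the energy of any parameterized ansatz must upper-bound the exact ground-state energy. The Jastrow-style product ansatz below is an illustrative stand-in, not the paper's CNN wave function, and the parameter values are arbitrary.

```python
# Minimal sketch of a variational energy evaluation for a small classical
# Ising chain H = -sum_i s_i s_{i+1} (open boundary). The exact ground
# energy is -(N-1); any ansatz energy must lie at or above it.
from itertools import product
import math

N = 6                              # spins; Hilbert space size 2^N = 64
a = [0.1 * i for i in range(N)]    # toy variational parameters

def log_psi(s):
    """Log-amplitude of a simple product-style ansatz (illustrative)."""
    return sum(a[i] * s[i] for i in range(N))

def energy(s):
    """Diagonal Ising energy of a spin configuration s in {-1, +1}^N."""
    return -sum(s[i] * s[i + 1] for i in range(N - 1))

num = den = 0.0
for s in product((-1, 1), repeat=N):
    w = math.exp(2 * log_psi(s))   # |psi(s)|^2
    num += w * energy(s)
    den += w

e_var = num / den                  # variational energy <H>
print(e_var)
```

At scale, the exact enumeration is replaced by Monte Carlo sampling of |psi|^2, which is where the massive parallelism of a machine like the new Sunway supercomputer comes in.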
MG3MConv: Multi-Grained Matrix-Multiplication-Mapping Convolution Algorithm toward the SW26010 Processor
As the core of artificial intelligence applications, research on convolution has become a hot topic in high-performance computing. With the rapid development of the emerging SW26010 processor for artificial intelligence, there is an urgent need for high-performance convolution algorithms on the processor. However, current support for convolution on SW26010 is still rudimentary. Existing studies deliver sufficient peak runtime performance but lack adaptability to diverse convolution scenarios. To improve convolution algorithms on SW26010, we propose a multi-grained matrix-multiplication-mapping convolution algorithm called MG3MConv, which targets the architectural features of SW26010. MG3MConv supports diversified mapping schemes for convolution tasks based on the concept of the thread block proposed in this paper. All the architecture-oriented optimization methods are elaborately designed at four levels to fully exploit the hardware efficiency of SW26010. Experiments show that the hardware efficiency of MG3MConv can reach 84.78% at its peak, 1.75 times that of cuDNN on an NVIDIA K80m GPU. Moreover, MG3MConv outperforms cuDNN in most convolution scenarios. We also use six representative CNNs as real-world cases; the hardware efficiency of MG3MConv reaches up to 67.04% on the VGG network model, which is 1.37 times and 1.96 times that of cuDNN and swDNN, respectively.
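The core matrix-multiplication-mapping idea can be sketched with the generic im2col + GEMM lowering: a convolution is unfolded into one matrix multiply so it can reuse a highly tuned GEMM kernel. This is a minimal single-channel sketch of the general technique, not the SW26010-specific multi-grained scheme of MG3MConv.

```python
# im2col: unfold each kernel-sized patch of the input into a row, so the
# convolution becomes one matrix multiply with the flattened kernel.

def im2col(x, kh, kw):
    """Unfold a 2-D input (list of rows) into patch rows."""
    h, w = len(x), len(x[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([x[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows                         # (out_h * out_w) x (kh * kw)

def conv2d_as_gemm(x, k):
    """Single-channel cross-correlation lowered to a matrix product."""
    kh, kw = len(k), len(k[0])
    kflat = [k[di][dj] for di in range(kh) for dj in range(kw)]
    return [sum(c * f for c, f in zip(row, kflat)) for row in im2col(x, kh, kw)]

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
print(conv2d_as_gemm(x, k))  # [6, 8, 12, 14]
```

The "multi-grained" contribution of MG3MConv lies in how such GEMM work is partitioned across the SW26010's core groups and local device memory; the lowering itself is the standard starting point.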
Heterogeneity-aware scheduling and data partitioning for system performance acceleration
Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity.
Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity.
This thesis presents approaches to designing general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer, and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity in on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative approach to co-design the task selector and core allocator of the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations, and asymmetric physical connections, it proposes an application-specific automatic data-partitioning method for a modern supercomputer, and a topological-ranking heuristic based scheduler for a multi-FPGA based reconfigurable cluster.
Experiments on both a full-system simulator (GEM5) and real systems (the Sunway TaihuLight supercomputer and Xilinx multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on a variety of workloads.
"This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement
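The topological-ranking style of scheduling heuristic mentioned above can be sketched in a few lines: each task in a dependency DAG is given an upward rank (its cost plus the longest path of costs through its successors), and tasks with larger ranks, which lie on longer critical paths, are scheduled first. This is a generic HEFT-style sketch on an invented toy DAG; the thesis's exact heuristic for multi-FPGA clusters may differ.

```python
# Toy task DAG: task -> (execution cost, list of successor tasks).
dag = {
    "A": (3, ["B", "C"]),
    "B": (2, ["D"]),
    "C": (4, ["D"]),
    "D": (1, []),
}

def upward_rank(task):
    """Cost of task plus the longest-cost path through its successors."""
    cost, succs = dag[task]
    return cost + max((upward_rank(s) for s in succs), default=0)

# Higher rank = deeper on the critical path = scheduled earlier.
order = sorted(dag, key=upward_rank, reverse=True)
print(order)  # ['A', 'C', 'B', 'D']
```

Note that sorting by decreasing upward rank always yields a valid topological order, since a task's rank strictly exceeds that of every successor.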
A survey of network-based hardware accelerators
Many practical data-processing algorithms fail to execute efficiently on general-purpose CPUs (Central Processing Units) due to the sequential nature of their operations and memory bandwidth limitations. To achieve the desired performance levels, reconfigurable (FPGA (Field-Programmable Gate Array)-based) hardware accelerators are frequently explored, as they permit the processing units' architectures to be better adapted to the specific problem/algorithm requirements. In particular, network-based data-processing algorithms are very well suited to implementation in reconfigurable hardware because several data-independent operations can easily and naturally be executed in parallel over as many processing blocks as actually required and technically possible. GPUs (Graphics Processing Units) have also demonstrated good results in this area, but they tend to use significantly more power than FPGAs, which could be a limiting factor in embedded applications. Moreover, GPUs employ a Single Instruction, Multiple Threads (SIMT) execution model and are therefore optimized for SIMD (Single Instruction, Multiple Data) operations, while in FPGAs fully custom datapaths can be built, eliminating much of the control overhead. This review paper aims to analyze, compare, and discuss different approaches to implementing network-based hardware accelerators in FPGAs and programmable SoCs (Systems-on-Chip). The performed analysis and the derived recommendations should be useful to hardware designers of future network-based hardware accelerators.
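A classic example of the network-based algorithms the survey discusses is a sorting network, where every compare-and-swap within a stage is data-independent and can therefore run in parallel hardware. The software model below, an odd-even transposition network, is an illustrative sketch of the general technique, not an example taken from the survey.

```python
# Software model of an odd-even transposition sorting network: n stages,
# each containing only independent compare-and-swap operations, so an FPGA
# implementation can execute each stage fully in parallel.

def odd_even_transposition_sort(values):
    v = list(values)
    n = len(v)
    for stage in range(n):            # n stages suffice to sort n items
        start = stage % 2             # alternate between even and odd pairs
        # Every compare-swap in this stage touches disjoint element pairs,
        # so all of them could be wired as parallel comparators in hardware.
        for i in range(start, n - 1, 2):
            if v[i] > v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
    return v

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```

The fixed, data-independent wiring of the comparators is exactly what makes such networks map so naturally onto reconfigurable fabric, in contrast to data-dependent sequential algorithms on a CPU.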
Supercomputing Frontiers
This open access book constitutes the refereed proceedings of the 6th Asian Supercomputing Conference, SCFA 2020, which was planned to be held in February 2020, but unfortunately the physical conference was cancelled due to the COVID-19 pandemic. The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platforms, container image configuration workflows, large-scale applications, and scheduling.