191 research outputs found

    swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

    Full text link
    The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the exiting deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient code for deep learning application on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The performance of automatically generated code for AlexNet and VGG-19 by swTVM achieves 6.71x and 2.45x speedup on average than hand-optimized OpenACC implementations on convolution and fully connected layers respectively. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and high performance architecture particularly with productivity and efficiency in mind. We would like to open source the implementation so that more people can embrace the power of deep learning compiler and Sunway many-core processor

    Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

    Get PDF
    Funding: UK EPSRC grants ”Discovery” EP/P020631/1, ”ABC: Adaptive Brokerage for the Cloud” EP/R010528/1.This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that not only partitions by dataflow and centroid, but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 targeting centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering, by automatically generating and executing the clustering tasks with a set of candidate hyper-parameter, and then determining the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution can not only achieve high performance and scalability for problems with massive high-dimensional data, but also support clustering without sufficient prior knowledge for the number of targeted clusters, which can potentially increase the scope of k-means algorithm to new application areas.PostprintPeer reviewe

    Supercomputing Frontiers

    Get PDF
    This open access book constitutes the refereed proceedings of the 6th Asian Supercomputing Conference, SCFA 2020, which was planned to be held in February 2020, but unfortunately, the physical conference was cancelled due to the COVID-19 pandemic. The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platform, container image configuration workflow, large-scale applications, and scheduling

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Get PDF
    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    Large-scale hierarchical k-means for heterogeneous many-core supercomputers

    Get PDF
    Funding: J.Thomson and T.Yu are supported by the EPSRC grants ”Discovery” EP/P020631/1, ”ABC: Adaptive Brokerage for the Cloud” EP/R010528/1, and EU Horizon 2020 grant Team-Play: ”Time, Energy and security Analysis for Multi/Many-core heterogenous PLAtforms” (ICT-779882, https://teamplay- h2020.eu)This paper presents a novel design and implementation of k-means clustering algorithm targeting the Sunway TaihuLight supercomputer. We introduce a multi-level parallel partition approach that not only partitions by dataflow and centroid, but also by dimension. Our multi-level (nkd) approach unlocks the potential of the hierarchical parallelism in the SW26010 heterogeneous many-core processor and the system architecture of the supercomputer. Our design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 targeting centroids, while maintaining high performance and high scalability, significantly improving the capability of k-means over previous approaches. The evaluation shows our implementation achieves performance of less than 18 seconds per iteration for a large-scale clustering case with 196,608 data dimensions and 2,000 centroids by applying 4,096 nodes (1,064,496 cores) in parallel, making k-means a more feasible solution for complex scenarios.Postprin

    212962^{1296} Exponentially Complex Quantum Many-Body Simulation via Scalable Deep Learning Method

    Full text link
    For decades, people are developing efficient numerical methods for solving the challenging quantum many-body problem, whose Hilbert space grows exponentially with the size of the problem. However, this journey is far from over, as previous methods all have serious limitations. The recently developed deep learning methods provide a very promising new route to solve the long-standing quantum many-body problems. We report that a deep learning based simulation protocol can achieve the solution with state-of-the-art precision in the Hilbert space as large as 212962^{1296} for spin system and 31443^{144} for fermion system , using a HPC-AI hybrid framework on the new Sunway supercomputer. With highly scalability up to 40 million heterogeneous cores, our applications have measured 94% weak scaling efficiency and 72% strong scaling efficiency. The accomplishment of this work opens the door to simulate spin models and Fermion models on unprecedented lattice size with extreme high precision.Comment: Massive ground state optimizations of CNN-based wave-functions for J1J1-J2J2 model and tt-JJ model carried out on a heterogeneous cores supercompute

    A Survey of Parallel Message Authentication and Hashing Methods

    Get PDF
    مقدمة: الإنترنت، وتبادل المعلومات، والتواصل الاجتماعي، وغيرها من الأنشطة التي ازدادت بشكل كبير في السنوات الأخيرة. لذلك، يتطلب الأمر زيادة السرية والخصوصية. في الأيام الأخيرة، كان الاحتيال عبر الإنترنت واحدًا من العوائق الرئيسية لنشر استخدام تطبيقات الأعمال. وبالتالي، تحدث الثلاث مخاوف الأمنية الهامة بشكل يومي في عالم الأزياء الشفافة لدينا، وهي: الهوية، والمصادقة، والترخيص. التعرف هو إجراء يسمح بتحديد هوية كيان ما، والذي يمكن أن يكون شخصًا أو جهاز كمبيوتر أو أصل آخر مثل مبرمج برامج. طرق العمل: في أنظمة الأمان، المصادقة والترخيص هما إجراءان مكملان لتحديد من يمكنه الوصول إلى موارد المعلومات عبر الشبكة. تم تقديم العديد من الحلول في الأدبيات. وللحصول على أداء أفضل في خوارزميات المصادقة، استخدم الباحثون التوازي لزيادة الإنتاجية لخوارزمياتهم. من جهة، تم استخدام مجموعة من الطرق لزيادة مستوى الأمان في الأنظمة التشفيرية، بما في ذلك زيادة عدد الجولات، واستخدام جداول الاستبدال ودمج آليات الأمان الأخرى لتشفير الرسائل والمصادقة عليها. النتائج: أظهرت الدراسات الحديثة حول مصادقة الرسائل المتوازية وخوارزميات التجزئة أن وحدات معالجة الرسومات تتفوق في الأداء على الأنظمة الأساسية المتوازية الأخرى من حيث الأداء. الاستنتاجات: يقدم هذا العمل تنفيذًا متوازيًا لتقنيات مصادقة الرسائل على العديد من الأنظمة الأساسية. تدرس وتعرض الأعمال التي تناقش المصادقة والتجزئة وتنفيذها على منصة موازية كهدف رئيسي.Background: Currently, there are approximately 4.95 billion people who use the Internet. This massive audience desires internet shopping, information exchange, social networking, and other activities that have grown dramatically in recent years. Therefore, it creates the need for greater confidentiality and privacy. In recent days, fraud via the Internet has been one of the key impediments to the dissemination of the use of business apps. Therefore, the three important security concerns actually occur daily in our world of transparent fashion, more accurately: identity, authentication, and authorization. Identification is a procedure that permits the recognition of an entity, which may be a person, a computer, or another asset such as a software programmer. Materials and Methods: In security systems, authentication and authorization are two complementary procedures for deciding who may access the information resources across a network. Many solutions have been presented in the literature. To get more performance on the authentication algorithmic, researchers used parallelism to increase the throughput of their algorithms.  On the one hand, various approaches have been employed to enhance the security of cryptographic systems, including increasing the number of rounds, utilizing substitution tables, and integrating other security primitives for encryption and message authentication. Results: Recent studies on parallel message authentication and hashing algorithms have demonstrated that GPUs outperform other parallel platforms in terms of performance. Conclusion: This work presents a parallel implementation of message authentication techniques on several platforms. It is studying and demonstrating works which discuss authentication, hashing, and their implementation on a parallel platform as a main objective
    corecore