
    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
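
    The transformation catalog is easiest to see on a concrete kernel. Below is a minimal, hypothetical before/after sketch of one such transformation class, combining pipelining, loop unrolling, and memory partitioning; the pragmas are Vitis/Xilinx-style HLS directives and the vector-add kernel is illustrative, not taken from the paper.

```cpp
constexpr int N = 1024;
constexpr int W = 8;  // unroll (vectorization) width, an assumed tuning knob

// Naive version: the pipelined loop issues at most one addition per cycle.
void vadd_naive(const float a[N], const float b[N], float c[N]) {
  for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
    c[i] = a[i] + b[i];
  }
}

// Transformed version: the loop is tiled and the inner tile fully unrolled,
// so W additions are issued per pipeline stage. The arrays are partitioned
// cyclically so the W parallel reads do not contend on a single memory port
// (resolving the interface contention the paper lists as an objective).
void vadd_unrolled(const float a[N], const float b[N], float c[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=c cyclic factor=8
  for (int i = 0; i < N; i += W) {
#pragma HLS PIPELINE II=1
    for (int w = 0; w < W; ++w) {
#pragma HLS UNROLL
      c[i + w] = a[i + w] + b[i + w];
    }
  }
}
```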

    Large-Scale Optical Neural Networks based on Photoelectric Multiplication

    Recent success in deep neural networks has generated strong interest in hardware accelerators to improve speed and energy consumption. This paper presents a new type of photonic accelerator based on coherent detection that is scalable to large (N ≳ 10^6) networks and can be operated at high (GHz) speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using the massive spatial multiplexing enabled by standard free-space optical components. In contrast to previous approaches, both weights and inputs are optically encoded so that the network can be reprogrammed and trained on the fly. Simulations of the network using models for digit- and image-classification reveal a "standard quantum limit" for optical neural networks, set by photodetector shot noise. This bound, which can be as low as 50 zJ/MAC, suggests that performance below the thermodynamic (Landauer) limit for digital irreversible computation is theoretically possible in this device. The proposed accelerator can implement both fully-connected and convolutional networks. We also present a scheme for back-propagation and training that can be performed in the same hardware. This architecture will enable a new class of ultra-low-energy processors for deep learning. Comment: Text: 10 pages, 5 figures, 1 table. Supplementary: 8 pages, 5 figures, 2 tables.
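
    The phrase "photoelectric multiplication" refers to products formed directly in detection. As a hedged sketch (a textbook balanced-homodyne identity, not an equation quoted from the paper): if an input x and a weight w are encoded in the amplitudes of two coherent fields E_x and E_w mixed on a 50/50 beamsplitter, the differential photocurrent of the two detectors is proportional to their product,

```latex
\[
  I_{\pm} \propto \tfrac{1}{2}\,\lvert E_x \pm E_w \rvert^{2},
  \qquad
  \Delta I = I_{+} - I_{-} \propto 2\,\operatorname{Re}\!\left(E_x E_w^{*}\right) \propto x\,w,
\]
```

    so each MAC costs only the detected photons, and shot noise on these photocurrents is what sets the "standard quantum limit" mentioned in the abstract.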

    Accelerating Reduction and Scan Using Tensor Core Units

    Full text link
    Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCUs to express both reduction and scan with matrix multiplication and show the benefits in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs, which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes, which are common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan. Comment: In Proceedings of the ACM International Conference on Supercomputing (ICS '19).
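
    The key observation is algebraic: a reduction is a matrix product with an all-ones operand, and an inclusive scan is a product with a triangular all-ones matrix. The sketch below shows these identities in plain C++ with a 4-element vector for readability; an actual implementation would instead feed 16x16 tiles to the TCUs (e.g., via CUDA's wmma API), which this sketch does not attempt.

```cpp
#include <array>
#include <cstdio>

constexpr int N = 4;

int main() {
  const std::array<float, N> v = {1, 2, 3, 4};

  // Reduction: sum(v) = ones^T * v, a matrix product with an all-ones row.
  float sum = 0;
  for (int j = 0; j < N; ++j) sum += 1.0f * v[j];

  // Inclusive scan: p = L * v, where L is the lower-triangular all-ones
  // matrix, so p[i] = v[0] + ... + v[i].
  std::array<float, N> p{};
  for (int i = 0; i < N; ++i)
    for (int j = 0; j <= i; ++j)
      p[i] += 1.0f * v[j];  // L[i][j] = 1 for j <= i, else 0

  std::printf("sum = %g\n", sum);                               // 10
  std::printf("scan = %g %g %g %g\n", p[0], p[1], p[2], p[3]);  // 1 3 6 10
}
```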

    Methodology and Analysis for Efficient Custom Architecture Design Using Machine Learning

    Machine learning algorithms, especially Deep Neural Networks (DNNs), have revolutionized the arena of computing in the last decade. Alongside these algorithmic advances, DNNs also bring an unprecedented appetite for compute and parallel processing. Computer architects have risen to the challenge by creating novel custom architectures called accelerators. However, given the rapid pace of algorithmic development, accelerator architects are playing catch-up, churning out optimized designs each time new algorithmic changes are published. It is also worth noting that the accelerator design cycle is expensive: it requires multiple iterations of design-space optimization as well as expert knowledge of both digital design and the workload domain itself. It is therefore imperative to build scalable, flexible architectures that adapt to work well for a variety of workloads. It is equally important to develop tools and design methodologies that lower design-time overheads so that subsequent design iterations are fast and sustainable. This thesis takes a three-pronged approach to these problems and pushes the frontiers of the DNN accelerator design process. First, it presents a now-popular cycle-accurate DNN accelerator simulator, built with the goal of obtaining detailed metrics as quickly as possible. A detailed analytical model is also presented, which enables the designer to understand the interactions of workload and architecture parameters; the information from the model can be used directly to prune the design search space for faster convergence. Second, the thesis details a pair of flexible yet scalable DNN accelerator architectures. Finally, it describes the use of machine learning to capture the design space of DNN accelerators and to train a model that predicts optimum configurations when queried with workload parameters and design constraints. The novelty of this work is that it systematically formulates traditional design optimization as a machine learning problem and describes the quality and components of a model that works well across various architecture design tasks.
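
    To illustrate the flavor of such an analytical model, here is a hedged, generic first-order estimate for an R x C output-stationary systolic array running an M x K by K x N matrix multiplication; the tiling and fill/drain terms are common textbook assumptions, not the thesis's actual model.

```cpp
#include <cstdio>

struct Estimate { long cycles; double utilization; };

// The output is tiled into R x C blocks ("folds"); each fold keeps its
// partial sums in place while K operand pairs stream through, plus about
// R + C - 2 cycles of skew to fill and drain the diagonal wavefront.
Estimate estimate_gemm(long M, long N, long K, int R, int C) {
  long folds = ((M + R - 1) / R) * ((N + C - 1) / C);
  long cycles = folds * (K + R + C - 2);
  // Utilization: useful MACs divided by peak MACs over the estimated runtime.
  double util = double(M) * N * K / (double(R) * C * cycles);
  return {cycles, util};
}

int main() {
  Estimate e = estimate_gemm(/*M=*/256, /*N=*/256, /*K=*/64, /*R=*/32, /*C=*/32);
  std::printf("cycles=%ld utilization=%.2f\n", e.cycles, e.utilization);
  // Sweeping R and C with a model like this prunes the design space before
  // resorting to slower cycle-accurate simulation.
}
```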