1,532 research outputs found

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    Full text link
    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202

    Preliminary Report on High-Performance Computational Structures for Robot Control

    Get PDF
    In this report we present some initial results of our work completed thus far on Computational Structures for Robot Control . A SIMD architecture with the crossbar interprocessor network which achieves the parallel processing execution time lower bound of o( [a1n ]), where a1 is a constant and n is the number of manipulator joints, for the computation of the inverse dynamics problem, is discussed. A novel SIMD task scheduling algorithm that optimizes the parallel processing performance on the indicated architecture is also delineated. Simulations performed on this architecture show speedup factor of 3.4 over previous related work completed for the evaluation of the specified problem, is achieved. Parallel processing of PUMA forward and inverse kinematics solutions is next investigated using a particular scheduling algorithm. In addition, a custom bit-serial array architecture is designed for the computation of the inverse dynamics problem within the bit-serial execution time lower bound of o(c1k + c2kn), where c1 and c2 are specified constants, k is the word length, and n is the number of manipulator joints. Finally, mapping of the Newton-Euler equations onto a fixed systolic array is investigated. A balanced architecture for the inverse dynamics problem which achieves the systolic execution time lower bound for the specified problem is depicted. Please note again that these results are only preliminary and improvements to our algorithms and architectures are currently still being made

    Efficient Mapping of Neural Network Models on a Class of Parallel Architectures.

    Get PDF
    This dissertation develops a formal and systematic methodology for efficient mapping of several contemporary artificial neural network (ANN) models on k-ary n-cube parallel architectures (KNC\u27s). We apply the general mapping to several important ANN models including feedforward ANN\u27s trained with backpropagation algorithm, radial basis function networks, cascade correlation learning, and adaptive resonance theory networks. Our approach utilizes a parallel task graph representing concurrent operations of the ANN model during training. The mapping of the ANN is performed in two steps. First, the parallel task graph of the ANN is mapped to a virtual KNC of compatible dimensionality. This involves decomposing each operation into its atomic tasks. Second, the dimensionality of the virtual KNC architecture is recursively reduced through a sequence of transformations until a desired metric is optimized. We refer to this process as folding the virtual architecture. The optimization criteria we consider in this dissertation are defined in terms of the iteration time of the algorithm on the folded architecture. If necessary, the mapping scheme may utilize a subset of the processors of a given KNC architecture if it results in the most efficient simulation. A unique feature of our mapping is that it systematically selects an appropriate degree of parallelism leading to a highly efficient realization of the ANN model on KNC architectures. A novel feature of our work is its ability to efficiently map unit-allocating ANN\u27s. These networks possess a dynamic structure which grows during training. We present a highly efficient scheme for simulating such networks on existing KNC parallel architectures. We assume an upper bound on size of the neural network We perform the folding such that the iteration time of the largest network is minimized. We show that our mapping leads to near-optimal simulation of smaller instances of the neural network. In addition, based on our mapping no data migration or task rescheduling is needed as the size of network grows

    Throughput-optimal systolic arrays from recurrence equations

    Get PDF
    Many compute-bound software kernels have seen order-of-magnitude speedups on special-purpose accelerators built on specialized architectures such as field-programmable gate arrays (FPGAs). These architectures are particularly good at implementing dynamic programming algorithms that can be expressed as systems of recurrence equations, which in turn can be realized as systolic array designs. To efficiently find good realizations of an algorithm for a given hardware platform, we pursue software tools that can search the space of possible parallel array designs to optimize various design criteria. Most existing design tools in this area produce a design that is latency-space optimal. However, we instead wish to target applications that operate on a large collection of small inputs, e.g. a database of biological sequences. For such applications, overall throughput rather than latency per input is the most important measure of performance. In this work, we introduce a new procedure to optimize throughput of a systolic array subject to resource constraints, in this case the area and bandwidth constraints of an FPGA device. We show that the throughput of an array is dependent on the maximum number of lattice points executed by any processor in the array, which to a close approximation is determined solely by the array’s projection vector. We describe a bounded search process to find throughput-optimal projection vectors and a tool to perform automated design space exploration, discovering a range of array designs that are optimal for inputs of different sizes. We apply our techniques to the Nussinov RNA folding algorithm to generate multiple mappings of this algorithm into systolic arrays. By combining our library of designs with run-time reconfiguration of an FPGA device to dynamically switch among them, we predict significant speedup over a single, latency-space optimal array

    Formal synthesis of control signals for systolic arrays

    Get PDF

    A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays.

    Get PDF
    This dissertation provides a fairly comprehensive treatment of a broad class of algorithms as it pertains to systolic implementation. We describe some formal algorithmic transformations that can be utilized to map regular and some irregular compute-bound algorithms into the best fit time-optimal systolic architectures. The resulted architectures can be one-dimensional, two-dimensional, three-dimensional or nonplanar. The methodology detailed in the dissertation employs, like other methods, the concept of dependence vector to order, in space and time, the index points representing the algorithm. However, by differentiating between two types of dependence vectors, the ordering procedure is allowed to be flexible and time optimal. Furthermore, unlike other methodologies, the approach reported here does not put constraints on the topology or dimensionality of the target architecture. The ordered index points are represented by nodes in a diagram called Systolic Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into account the systolic operation requirements of strictly local communications and regular data flow. Therefore, any algorithm with variable dependence vectors has to be transformed into a regular indexed set of computations with local dependencies. This can be done by replacing variable dependence vectors with sets of fixed dependence vectors. The SPD is transformed into an acyclic, labeled, directed graph called the Systolic Directed Graph (SDG). The SDG models the data flow as well as the timing for the execution of the given algorithm on a time-optimal array. The target architectures are obtained by projecting the SDG along defined directions. If more than one valid projection direction exists, different designs are obtained. The resulting architectures are then evaluated to determine if an improvement in the performance can be achieved by increasing PE fan-out. If so, the methodology provides the corresponding systolic implementation. By employing a new graph transformation, the SDG is manipulated so that it can be mapped into fixed-size and fixed-depth multi-linear arrays. The latter is a new concept of systolic arrays that is adaptable to changes in the state of technology. It promises a bonded clock skew, higher throughput and better performance than the linear implementation

    On the synthesis of integral and dynamic recurrences

    Get PDF
    PhD ThesisSynthesis techniques for regular arrays provide a disciplined and well-founded approach to the design of classes of parallel algorithms. The design process is guided by a methodology which is based upon a formal notation and transformations. The mathematical model underlying synthesis techniques is that of affine Euclidean geometry with embedded lattice spaces. Because of this model, computationally powerful methods are provided as an effective way of engineering regular arrays. However, at present the applicability of such methods is limited to so-called affine problems. The work presented in this thesis aims at widening the applicability of standard synthesis methods to more general classes of problems. The major contributions of this thesis are the characterisation of classes of integral and dynamic problems, and the provision of techniques for their systematic treatment within the framework of established synthesis methods. The basic idea is the transformation of the initial algorithm specification into a specification with data dependencies of increased regularity, so that corresponding regular arrays can be obtained by a direct application of the standard mapping techniques. We will complement the formal development of the techniques with the illustration of a number of case studies from the literature.EPSR

    Systolic Array Implementations With Reduced Compute Time.

    Get PDF
    The goal of the research is the establishment of a formal methodology to develop computational structures more suitable for the changing nature of real-time signal processing and control applications. A major effort is devoted to the following question: Given a systolic array designed to execute a particular algorithm, what other algorithms can be executed on the same array? One approach for answering this question is based on a general model of array operations using graph-theoretic techniques. As a result, a systematic procedure is introduced that models array operations as a function of the compute cycle. As a consequence of the analysis, the dissertation develops the concept of fast algorithm realizations. This concept characterizes specific realizations that can be evaluated in a reduced number of cycles. It restricts the operations to remain in the same class but with reduced execution time. The concept takes advantage of the data dependencies of the algorithm at hand. This feature allows the modification of existing structures by reordering the input data. Applications of the principle allows optimum time band and triangular matrix product on arrays designed for dense matrices. A second approach for analyzing the families of algorithms implementable in an array, is based on the concept of array time constrained operation. The principle uses the number of compute cycle as an additional degree of freedom to expand the class of transformations generated by a single array. A mathematical approach, based on concepts from multilinear algebra, is introduced to model the recursive transformations implemented in linear arrays at each compute cycle. The proposed representation is general enough to encompass a large class of signal processing and control applications. A complete analytical model of the linear maps implementable by the array at each compute cycle is developed. The proposed methodology results in arrays that are more adaptable to the changing nature of operations. Lessons learned from analyzing existing arrays are used to design smart arrays for special algorithm realizations. Applications of the methodology include the design of flexible time structures and the ability to decompose a full size array into subarrays implementing smaller size problems

    METHODOLOGY AND ANALYSIS FOR EFFICIENT CUSTOM ARCHITECTURE DESIGN USING MACHINE LEARNING

    Get PDF
    Machine learning algorithms especially Deep Neural Networks (DNNs) have revolutionized the arena of computing in the last decade. DNNs along the with the computational advancements also bring an unprecedented appetite for compute and parallel processing. Computer architects have risen to challenge by creating novel custom architectures called accelerators. However, given the ongoing rapid advancements in algorithmic development accelerators architects are playing catch- up to churn out optimized designs each time new algorithmic changes are published. It is also worth noting that the accelerator design cycle is expensive. It requires multiple iteration of design space optimization and expert knowledge of both digital design as well as domain knowledge of the workload itself. It is therefore imperative to build scalable and flexible architectures which are adaptive to work well for a variety of workloads. Moreover, it is also important to develop relevant tools and design methodologies which lower the overheads incurred at design time such that subsequent design iterations are fast and sustainable. This thesis takes a three-pronged approach to address these problems and push the frontiers for DNN accelerator design process. First, the thesis presents the description of a now popular cycle accurate DNN accelerator simulator. This simulator is built with the goal of obtaining detailed metrics as fast as possible. A detailed analytical model is also presented in this thesis which enables the designer to understand the interactions of the workload and architecture parameters. The information from the model can be directly used to prune the design search space to achieve faster convergence. Second, the thesis details a couple of flexible yet scalable DNN accelerator architectures. Finally, this thesis describes the use of machine learning to capture the design space of DNN accelerators and train a model to predict optimum configurations when queried with workload parameters and design constraints. The novelty of this piece of work is that it systematically lays out the formulation of traditional design optimization into a machine learning problem and describes the quality and components of a model which works well across various architecture design tasks.Ph.D

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces
    • …
    corecore