1,881 research outputs found

    A survey of the state of the art and focused research in range systems, task 2

    Get PDF
    Contract generated publications are compiled which describe the research activities for the reporting period. Study topics include: equivalent configurations of systolic arrays; least squares estimation algorithms with systolic array architectures; modeling and equilization of nonlinear bandlimited satellite channels; and least squares estimation and Kalman filtering by systolic arrays

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    Full text link
    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202

    A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays.

    Get PDF
    This dissertation provides a fairly comprehensive treatment of a broad class of algorithms as it pertains to systolic implementation. We describe some formal algorithmic transformations that can be utilized to map regular and some irregular compute-bound algorithms into the best fit time-optimal systolic architectures. The resulted architectures can be one-dimensional, two-dimensional, three-dimensional or nonplanar. The methodology detailed in the dissertation employs, like other methods, the concept of dependence vector to order, in space and time, the index points representing the algorithm. However, by differentiating between two types of dependence vectors, the ordering procedure is allowed to be flexible and time optimal. Furthermore, unlike other methodologies, the approach reported here does not put constraints on the topology or dimensionality of the target architecture. The ordered index points are represented by nodes in a diagram called Systolic Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into account the systolic operation requirements of strictly local communications and regular data flow. Therefore, any algorithm with variable dependence vectors has to be transformed into a regular indexed set of computations with local dependencies. This can be done by replacing variable dependence vectors with sets of fixed dependence vectors. The SPD is transformed into an acyclic, labeled, directed graph called the Systolic Directed Graph (SDG). The SDG models the data flow as well as the timing for the execution of the given algorithm on a time-optimal array. The target architectures are obtained by projecting the SDG along defined directions. If more than one valid projection direction exists, different designs are obtained. The resulting architectures are then evaluated to determine if an improvement in the performance can be achieved by increasing PE fan-out. If so, the methodology provides the corresponding systolic implementation. By employing a new graph transformation, the SDG is manipulated so that it can be mapped into fixed-size and fixed-depth multi-linear arrays. The latter is a new concept of systolic arrays that is adaptable to changes in the state of technology. It promises a bonded clock skew, higher throughput and better performance than the linear implementation

    Synthesis, structure and power of systolic computations

    Get PDF
    AbstractA variety of problems related to systolic architectures, systems, models and computations are discussed. The emphases are on theoretical problems of a broader interest. Main motivations and interesting/important applications are also presented. The first part is devoted to problems related to synthesis, transformations and simulations of systolic systems and architectures. In the second part, the power and structure of tree and linear array computations are studied in detail. The goal is to survey main research directions, problems, methods and techniques in not too formal a way

    Bit-level pipelined digit-serial array processors

    Get PDF
    A new architecture for high performance digit-serial vector inner product (VIP) which can be pipelined to the bit-level is introduced. The design of the digit-serial vector inner product is based on a new systematic design methodology using radix-2n arithmetic. The proposed architecture allows a high level of bit-level pipelining to increase the throughput rate with minimum initial delay and minimum area. This will give designers greater flexibility in finding the best tradeoff between hardware cost and throughput rate. It is shown that sub-digit pipelined digit-serial structure can achieve a higher throughput rate with much less area consumption than an equivalent bit-parallel structure. A twin-pipe architecture to double the throughput rate of digit-serial multipliers and consequently that of the digit-serial vector inner product is also presented. The effect of the number of pipelining levels and the twin-pipe architecture on the throughput rate and hardware cost are discussed. A two's complement digit-serial architecture which can operate on both negative and positive numbers is also presented

    Multidimensional Systolic Arrays of LMS AlgorithmAdaptive (FIR) Digital Filters

    Get PDF
    A multidimensional systolic arrays realization of LMS algorithm by a method of mapping regular algorithm onto processor array, are designed. They are based on appropriately selected 1-D systolic array filter that depends on the inner product sum systolic implementation. Various arrays may be derived that exhibit a regular arrangement of the cells (processors) and local interconnection pattern, which are important for VLSI implementation. It reduces latency time and increases the throughput rate in comparison to classical 1-D systolic arrays. The 3-D multilayered array consists of 2-D layers, which are connected with each other only by edges. Such arrays for LMS-based adaptive (FIR) filter may be opposed the fundamental requirements of fast convergence rate in most adaptive filter applications

    Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators

    Get PDF
    We propose a lightweight scheme where the formation of a data block is changed in such a way that it can tolerate soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, and this leaves one bit unused in half-precision floating-point representation. By taking advantage of the unused bit, we create a backup for the most significant bit to protect it against the soft errors. Also, considering the fact that in MLC STT-RAMs the cost of memory operations (read and write), and reliability of a cell are content-dependent (some patterns take larger current and longer time, while they are more susceptible to soft error), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy compared to an error-free baseline while improving the read and write energy by 9% and 6%, respectively

    An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload

    Get PDF
    Since their breakthrough, complexity of Deep Neural Networks (DNNs) is rising steadily. As a result, accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that meets the requirements of a given application perfectly is a challenging task. In this paper, we therefore present our approach to support the accelerator design process. With an analytical model of a systolic array we can estimate performance, energy consumption and area for each design option. To determine these metrics, usually a cycle accurate simulation is performed, which is a time-consuming task. Hence, the design space has to be restricted heavily. Analytical modelling, however, allows for fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs, this works especially well since the dataflow and memory accesses have high regularity. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conducted a design space exploration, showing the analytical model’s capabilities to support an accelerator design. In a case study on ResNet-34, we can demonstrate that our model and DSE tool reduces the time to find the best-fitting solution by four or two orders of magnitude compared to a cycle-accurate simulation or state-of-the-art modelling tools, respectively
    corecore