8,155 research outputs found

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    Full text link
    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because loop blocking can ensure that most data references stay on-chip with good locality and that the processing units achieve high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 2020
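
    The correspondence between loop transformations and accelerator dataflows can be made concrete in a few lines of Halide. The sketch below is illustrative only (not the paper's modified compiler; all names are ours): it expresses a small convolution and a schedule whose tiling and loop order stand in for an accelerator's blocking and dataflow choices.

    ```cpp
    #include "Halide.h"
    using namespace Halide;

    int main() {
        // A small 2-D convolution: 3x3 kernel over 64 input channels,
        // producing output channels indexed by k.
        ImageParam input(Float(32), 3), weights(Float(32), 4);
        Func conv("conv");
        Var x("x"), y("y"), k("k");

        RDom r(0, 3, 0, 3, 0, 64);  // kernel width, kernel height, input channels
        conv(x, y, k) = 0.0f;
        conv(x, y, k) += input(x + r.x, y + r.y, r.z) * weights(r.x, r.y, r.z, k);

        // Schedule: tile the output and pick a loop order. Each choice of
        // tile sizes and reorder corresponds to a different on-chip
        // blocking / hardware dataflow in the paper's taxonomy.
        Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
        conv.update()
            .tile(x, y, xo, yo, xi, yi, 16, 16)
            .reorder(xi, yi, r.z, xo, yo, k)
            .parallel(k);

        conv.compile_jit();
        return 0;
    }
    ```

    Swapping the arguments of `reorder` or the tile sizes yields a different dataflow (e.g., more output- or weight-stationary behavior) without touching the algorithm, which is exactly the separation the paper exploits.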

    Polly's Polyhedral Scheduling in the Presence of Reductions

    Full text link
    The polyhedral model provides a powerful mathematical abstraction to enable effective optimization of loop nests with respect to a given optimization goal, e.g., exploiting parallelism. Unexploited reduction properties are a frequent reason for polyhedral optimizers to assume parallelism-prohibiting dependences. To our knowledge, no polyhedral loop optimizer available in any production compiler provides support for reductions. In this paper, we show that leveraging the parallelism of reductions can lead to a significant performance increase. We give a precise, dependence-based definition of reductions and discuss ways to extend polyhedral optimization to exploit the associativity and commutativity of reduction computations. We have implemented a reduction-enabled scheduling approach in the Polly polyhedral optimizer and evaluate it on the standard PolyBench 3.2 benchmark suite. We were able to detect and model all 52 arithmetic reductions and achieve speedups of up to 2.21× on a quad-core machine by exploiting the multidimensional reduction in the BiCG benchmark. Comment: Presented at the IMPACT15 workshop
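
    To make the reduction dependences concrete, here is a simplified C++ rendering of the PolyBench BiCG kernel (our own sketch, not Polly's input or output). The accumulation into s[j] carries a dependence across the outer i loop that a naive polyhedral model must treat as serializing; recognizing + as associative and commutative makes the parallelization legal, shown here with the equivalent OpenMP transformation.

    ```cpp
    #include <cstddef>

    // BiCG-style kernel: q = A * p and s = A^T * r, flattened row-major.
    void bicg(std::size_t n, std::size_t m, const double* A /* n x m */,
              const double* r, const double* p, double* s, double* q) {
        for (std::size_t j = 0; j < m; ++j) s[j] = 0.0;

        // q[i] = sum_j A[i][j] * p[j]: rows are independent, trivially parallel.
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (std::size_t j = 0; j < m; ++j) acc += A[i * m + j] * p[j];
            q[i] = acc;
        }

        // s[j] = sum_i A[i][j] * r[i]: the multidimensional reduction the
        // paper exploits. The reduction clause privatizes s per thread and
        // combines the partial sums afterwards -- the effect a reduction-
        // enabled scheduler achieves automatically.
        #pragma omp parallel for reduction(+ : s[:m])
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < m; ++j) s[j] += A[i * m + j] * r[i];
    }
    ```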

    Dynamic Time-domain Duplexing for Self-backhauled Millimeter Wave Cellular Networks

    Full text link
    Millimeter wave (mmW) bands between 30 and 300 GHz have attracted considerable attention for next-generation cellular networks due to vast quantities of available spectrum and the possibility of very high-dimensional antenna arrays. However, a key issue in these systems is range: mmW signals are extremely vulnerable to shadowing and suffer from poor high-frequency propagation. Multi-hop relaying is therefore a natural technology for such systems to improve cell range and cell-edge rates without the addition of wired access points. This paper studies the problem of scheduling for a simple infrastructure cellular relay system where communication between wired base stations and User Equipment follows a hierarchical tree structure through fixed relay nodes. Such a system builds naturally on existing cellular mmW backhaul by adding mmW in the access links. A key feature of the proposed system is that TDD duplexing selections can be made on a link-by-link basis due to directional isolation from other links. We devise an efficient, greedy algorithm for centralized scheduling that maximizes network utility by jointly optimizing the duplexing schedule and resource allocation for dense, relay-enhanced OFDMA/TDD mmW networks. The proposed algorithm can dynamically adapt to loading, channel conditions and traffic demands. Significant throughput gains and improved resource utilization offered by our algorithm over static, globally-synchronized TDD patterns are demonstrated through simulations based on empirically-derived channel models at 28 GHz. Comment: IEEE Workshop on Next Generation Backhaul/Fronthaul Networks - BackNets 2015
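
    The following C++ sketch conveys the flavor of such a greedy, utility-driven per-slot decision. The structure and names are our assumptions (not the authors' code), and a proportional-fair utility is used only as an example; the key ingredients are the marginal-utility ordering and the half-duplex constraint at each node, with directional isolation justifying the neglect of cross-link interference.

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Link { int a, b; double rate; };  // endpoint nodes, achievable rate

    // Pick the set of links to activate in one slot, greedily by marginal
    // proportional-fair utility gain, subject to half-duplex endpoints.
    std::vector<int> scheduleSlot(const std::vector<Link>& links,
                                  const std::vector<double>& servedSoFar,
                                  int numNodes) {
        auto gain = [&](int l) {  // marginal gain of sum-log utility
            return std::log(servedSoFar[l] + links[l].rate) -
                   std::log(std::max(servedSoFar[l], 1e-9));
        };
        std::vector<int> order(links.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
        std::sort(order.begin(), order.end(),
                  [&](int x, int y) { return gain(x) > gain(y); });

        std::vector<bool> busy(numNodes, false);
        std::vector<int> active;
        for (int l : order) {
            if (!busy[links[l].a] && !busy[links[l].b]) {
                busy[links[l].a] = busy[links[l].b] = true;  // half-duplex
                active.push_back(l);
            }
        }
        return active;  // link-by-link TDD: each link's direction is free
    }
    ```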

    Simulation-based high-level synthesis of Nyquist-rate data converters using MATLAB/SIMULINK

    Get PDF
    This paper presents a toolbox for the simulation, optimization and high-level synthesis of Nyquist-rate Analog-to-Digital (A/D) and Digital-to-Analog (D/A) converters in MATLAB®. The embedded simulator uses SIMULINK® C-coded S-functions to model all required subcircuits, including their main error mechanisms. This approach cuts simulation CPU time by up to two orders of magnitude compared with previous approaches based on elementary SIMULINK® blocks. Moreover, S-functions are more suitable for implementing a more detailed description of the circuit. For all subcircuits, the accuracy of the behavioral models has been verified by electrical simulation using HSPICE. For synthesis purposes, the simulator is used for performance evaluation and combined with a hybrid optimizer for design parameter selection. The optimizer combines an adaptive statistical optimization algorithm, inspired by simulated annealing, with a design-oriented formulation of the cost function. It has been integrated in the MATLAB/SIMULINK® platform by using the MATLAB® engine library, so that the optimization core runs in the background while MATLAB® acts as a computation engine. The implementation on the MATLAB® platform brings numerous advantages in terms of signal processing, high flexibility for tool expansion and simulation with other electronic subsystems. Additionally, the presented toolbox comprises a friendly graphical user interface that allows the designer to browse through all steps of the simulation, synthesis and post-processing of results. In order to illustrate the capabilities of the toolbox, a 0.13 µm CMOS 12-bit, 80 MS/s analog front-end for broadband power line communications, made up of a pipeline ADC and a current-steering DAC, is synthesized and high-level sized. Different experiments show the effectiveness of the proposed methodology. Funding: Ministerio de Ciencia y Tecnología TIC2003-02355; RAICONI
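
    As a flavor of what a C-coded behavioral model captures, the sketch below models a 1.5-bit pipeline ADC stage with a few dominant error mechanisms. This is plain C++ rather than an actual SIMULINK® S-function, and the class, parameter names and values are our assumptions for illustration; it shows the level of abstraction at which behavioral simulation runs orders of magnitude faster than transistor-level HSPICE.

    ```cpp
    #include <random>

    // Behavioral model of one 1.5-bit pipeline stage: decision thresholds
    // at +/- Vref/4, residue = 2*(vin) - d*Vref, plus error mechanisms.
    struct PipelineStage {
        double gainError = 0.001;   // relative interstage gain error (assumed)
        double offset    = 0.0005;  // input-referred offset, volts (assumed)
        double noiseRms  = 0.0002;  // thermal noise, volts rms (assumed)
        std::mt19937 rng{42};

        // Returns the 1.5-bit decision (-1, 0, +1) and writes the residue
        // that feeds the next stage.
        int process(double vin, double vref, double& residue) {
            std::normal_distribution<double> noise(0.0, noiseRms);
            double v = vin + offset + noise(rng);
            int d = (v > vref / 4) ? 1 : (v < -vref / 4) ? -1 : 0;
            residue = 2.0 * (1.0 + gainError) * v - d * vref;
            return d;
        }
    };
    ```

    Chaining such stages and recombining the digital outputs reproduces the pipeline ADC transfer characteristic, which an optimizer can then evaluate against specifications on every iteration of design-parameter selection.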

    Space Station communications and tracking systems modeling and RF link simulation

    Get PDF
    In this final report, the effort spent on Space Station Communications and Tracking System Modeling and RF Link Simulation is described in detail. The effort is divided into three main parts: frequency division multiple access (FDMA) system simulation modeling and software implementation; a study on the design and evaluation of a functional computerized RF link simulation/analysis system for Space Station; and a study on the design and evaluation of the simulation system architecture. This report documents the results of these studies. In addition, a separate User's Manual on the Space Communications Simulation System (SCSS) (Version 1) documents the software developed for the Space Station FDMA communications system simulation. The final report, the SCSS user's manual, and the software, located on the NASA JSC System Analysis Division's VAX 750 computer, together serve as the deliverables from LinCom for this project effort.

    Introducing Molly: Distributed Memory Parallelization with LLVM

    Get PDF
    Programming for distributed memory machines has always been a tedious but necessary task, because compilers have not been sufficiently able to optimize for such machines themselves. Molly is an extension to the LLVM compiler toolchain that is able to distribute and reorganize workload and data if the program is organized in statically determined loop control flows. These are represented as polyhedral integer-point sets that allow program transformations to be applied to them. Memory distribution and layout can be declared by the programmer as needed, and the necessary asynchronous MPI communication is generated automatically. The primary motivation is to run Lattice QCD simulations on IBM Blue Gene/Q supercomputers, but since the implementation is not yet complete, this paper demonstrates its capabilities on Conway's Game of Life.
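
    For intuition, the sketch below shows in hand-written C++/MPI the kind of boundary (halo) exchange that Molly derives automatically from a declared data distribution. This is illustrative only: Molly emits the communication from the polyhedral model (asynchronously, per the abstract), and all names here are ours. It assumes Game of Life rows block-distributed across ranks with periodic boundaries.

    ```cpp
    #include <mpi.h>
    #include <vector>

    // One distributed Game of Life step: exchange boundary rows with the
    // neighboring ranks, then (elided) apply the 8-neighbor update rule.
    void lifeStep(std::vector<std::vector<char>>& grid,  // local rows
                  int rank, int nprocs, int width) {
        int up   = (rank + nprocs - 1) % nprocs;
        int down = (rank + 1) % nprocs;
        std::vector<char> haloTop(width), haloBot(width);

        // Send our first row up, receive our bottom halo from below.
        MPI_Sendrecv(grid.front().data(), width, MPI_CHAR, up, 0,
                     haloBot.data(), width, MPI_CHAR, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send our last row down, receive our top halo from above.
        MPI_Sendrecv(grid.back().data(), width, MPI_CHAR, down, 1,
                     haloTop.data(), width, MPI_CHAR, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // ... update each cell from its neighbors, using haloTop/haloBot
        // for the first and last local rows (omitted for brevity).
    }
    ```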