8,155 research outputs found
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202
Polly's Polyhedral Scheduling in the Presence of Reductions
The polyhedral model provides a powerful mathematical abstraction to enable
effective optimization of loop nests with respect to a given optimization goal,
e.g., exploiting parallelism. Unexploited reduction properties are a frequent
reason for polyhedral optimizers to assume parallelism prohibiting dependences.
To our knowledge, no polyhedral loop optimizer available in any production
compiler provides support for reductions. In this paper, we show that
leveraging the parallelism of reductions can lead to a significant performance
increase. We give a precise, dependence based, definition of reductions and
discuss ways to extend polyhedral optimization to exploit the associativity and
commutativity of reduction computations. We have implemented a
reduction-enabled scheduling approach in the Polly polyhedral optimizer and
evaluate it on the standard Polybench 3.2 benchmark suite. We were able to
detect and model all 52 arithmetic reductions and achieve speedups up to
2.21 on a quad core machine by exploiting the multidimensional
reduction in the BiCG benchmark.Comment: Presented at the IMPACT15 worksho
Dynamic Time-domain Duplexing for Self-backhauled Millimeter Wave Cellular Networks
Millimeter wave (mmW) bands between 30 and 300 GHz have attracted
considerable attention for next-generation cellular networks due to vast
quantities of available spectrum and the possibility of very high-dimensional
antenna ar-rays. However, a key issue in these systems is range: mmW signals
are extremely vulnerable to shadowing and poor high-frequency propagation.
Multi-hop relaying is therefore a natural technology for such systems to
improve cell range and cell edge rates without the addition of wired access
points. This paper studies the problem of scheduling for a simple
infrastructure cellular relay system where communication between wired base
stations and User Equipment follow a hierarchical tree structure through fixed
relay nodes. Such a systems builds naturally on existing cellular mmW backhaul
by adding mmW in the access links. A key feature of the proposed system is that
TDD duplexing selections can be made on a link-by-link basis due to directional
isolation from other links. We devise an efficient, greedy algorithm for
centralized scheduling that maximizes network utility by jointly optimizing the
duplexing schedule and resources allocation for dense, relay-enhanced OFDMA/TDD
mmW networks. The proposed algorithm can dynamically adapt to loading, channel
conditions and traffic demands. Significant throughput gains and improved
resource utilization offered by our algorithm over the static,
globally-synchronized TDD patterns are demonstrated through simulations based
on empirically-derived channel models at 28 GHz.Comment: IEEE Workshop on Next Generation Backhaul/Fronthaul Networks -
BackNets 201
Simulation-based high-level synthesis of Nyquist-rate data converters using MATLAB/SIMULINK
This paper presents a toolbox for the simulation, optimization and high-level synthesis of Nyquist-rate Analog-to-Digital (A/D) and Digital-to-Analog (D/A) Converters in MATLAB®. The embedded simulator uses SIMULINK® C-coded S-functions to model all required subcircuits including their main error mechanisms. This approach allows to drastically speed up the simulation CPU-time up to 2 orders of magnitude as compared with previous approaches - based on the use of SIMULINK® elementary blocks. Moreover, S-functions are more suitable for implementing a more detailed description of the circuit. For all subcircuits, the accuracy of the behavioral models has been verified by electrical simulation using HSPICE. For synthesis purposes, the simulator is used for performance evaluation and combined with an hybrid optimizer for design parameter selection. The optimizer combines adaptive statistical optimization algorithm inspired in simulated annealing with a design-oriented formulation of the cost function. It has been integrated in the MATLAB/SIMULINK® platform by using the MATLAB® engine library, so that the optimization core runs in background while MATLAB® acts as a computation engine. The implementation on the MATLAB® platform brings numerous advantages in terms of signal processing, high flexibility for tool expansion and simulation with other electronic subsystems. Additionally, the presented toolbox comprises a friendly graphical user interface to allow the designer to browse through all steps of the simulation, synthesis and post-processing of results. In order to illustrate the capabilities of the toolbox, a 0.13)im CMOS 12bit@80MS/s analog front-end for broadband power line communications, made up of a pipeline ADC and a current steering DAC, is synthesized and high-level sized. Different experiments show the effectiveness of the proposed methodology.Ministerio de Ciencia y Tecnología TIC2003-02355RAICONI
Space Station communications and tracking systems modeling and RF link simulation
In this final report, the effort spent on Space Station Communications and Tracking System Modeling and RF Link Simulation is described in detail. The effort is mainly divided into three parts: frequency division multiple access (FDMA) system simulation modeling and software implementation; a study on design and evaluation of a functional computerized RF link simulation/analysis system for Space Station; and a study on design and evaluation of simulation system architecture. This report documents the results of these studies. In addition, a separate User's Manual on Space Communications Simulation System (SCSS) (Version 1) documents the software developed for the Space Station FDMA communications system simulation. The final report, SCSS user's manual, and the software located in the NASA JSC system analysis division's VAX 750 computer together serve as the deliverables from LinCom for this project effort
Introducing Molly: Distributed Memory Parallelization with LLVM
Programming for distributed memory machines has always been a tedious task,
but necessary because compilers have not been sufficiently able to optimize for
such machines themselves. Molly is an extension to the LLVM compiler toolchain
that is able to distribute and reorganize workload and data if the program is
organized in statically determined loop control-flows. These are represented as
polyhedral integer-point sets that allow program transformations applied on
them. Memory distribution and layout can be declared by the programmer as
needed and the necessary asynchronous MPI communication is generated
automatically. The primary motivation is to run Lattice QCD simulations on IBM
Blue Gene/Q supercomputers, but since the implementation is not yet completed,
this paper shows the capabilities on Conway's Game of Life
- …