637 research outputs found
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
Parallel Simulation for VLSI Power Grid
Due to the increasing complexity of VLSI circuits, power grid simulation has become more and more time-consuming. Hence, there is a need for fast and accurate power grid simulator. In order to perform power grid simulation in a timely manner, parallel algorithms have been developed to accelerate the simulation. In this dissertation, we present parallel algorithms and software for power grid simulation on CPU-GPU platforms. The power grid is divided into disjoint partitions. The partitions are enlarged using Breath First Search (BFS) method. In the partition enlarging process, a portion of edges are ignored to make the matrix factorization light-weight. Solving the enlarged partitions using a direct solver serves as a preconditioner for the Preconditioned Conjugate Gradient (PCG) method that is used to solve the power grid. This work combines the advantages of direct solvers and iterative solvers to obtain an efficient hybrid parallel solver. Two-tier parallelism is harnessed using MPI for partitions and CUDA within each partition. The experiments conducted on supercomputing clusters demonstrate significant speed improvements over a state-of-the-art direct solver in both static and transient analysis
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advancements. Given the increasing complexity of deep neural
networks, the need for efficient hardware accelerators has become more and more
pressing to design heterogeneous HPC platforms. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
proposed by the same authors in [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
- …