176 research outputs found

    Compiler-directed energy reduction using dynamic voltage scaling and voltage Islands for embedded systems

    Get PDF
    Cataloged from PDF version of article.Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters

    Performance and Memory Space Optimizations for Embedded Systems

    Get PDF
    Embedded systems have three common principles: real-time performance, low power consumption, and low price (limited hardware). Embedded computers use chip multiprocessors (CMPs) to meet these expectations. However, one of the major problems is lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore various ways to increase performance, as well as reducing resource usage and energy consumption for embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. The majority of our work is focused on loop scheduling. Main contributions of our work are: We propose a memory saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and also automatically restructures the code. We propose and evaluate a compiler-directed code scheduling scheme, which considers both parallelism and data locality. It analyzes the code using a locality parallelism graph representation, and assigns the nodes of this graph to processors.We also introduce an Integer Linear Programming based formulation of the scheduling problem. We propose a compiler-based SPM conscious loop scheduling strategy for array/loop based embedded applications. The method is to distribute loop iterations across parallel processors in an SPM-conscious manner. The compiler identifies potential SPM hits and misses, and distributes loop iterations such that the processors have close execution times. We present an SPM management technique using Markov chain based data access. We propose a compiler directed integrated code and data placement scheme for 2-D mesh based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns the sets of loop iterations to processing cores and sets of data blocks to on-chip memories. We present a memory bank aware dynamic loop scheduling scheme for array intensive applications.The goal is to minimize the number of memory banks needed for executing the group of loop iterations

    Data locality and parallelism optimization using a constraint-based approach

    Get PDF
    Cataloged from PDF version of article.Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization. © 2010 Elsevier Inc. All rights reserved

    Code scheduling for optimizing parallelism and data locality

    Get PDF
    As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution. © 2010 Springer-Verlag

    Compiling for parallel multithreaded computation on symmetric multiprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 145-149).by Andrew Shaw.Ph.D

    Parallel simulation techniques for telecommunication network modelling

    Get PDF
    In this thesis, we consider the application of parallel simulation to the performance modelling of telecommunication networks. A largely automated approach was first explored using a parallelizing compiler to speed up the simulation of simple models of circuit-switched networks. This yielded reasonable results for relatively little effort compared with other approaches. However, more complex simulation models of packet- and cell-based telecommunication networks, requiring the use of discrete event techniques, need an alternative approach. A critical review of parallel discrete event simulation indicated that a distributed model components approach using conservative or optimistic synchronization would be worth exploring. Experiments were therefore conducted using simulation models of queuing networks and Asynchronous Transfer Mode (ATM) networks to explore the potential speed-up possible using this approach. Specifically, it is shown that these techniques can be used successfully to speed-up the execution of useful telecommunication network simulations. A detailed investigation has demonstrated that conservative synchronization performs very well for applications with good look ahead properties and sufficient message traffic density and, given such properties, will significantly outperform optimistic synchronization. Optimistic synchronization, however, gives reasonable speed-up for models with a wider range of such properties and can be optimized for speed-up and memory usage at run time. Thus, it is confirmed as being more generally applicable particularly as model development is somewhat easier than for conservative synchronization. This has to be balanced against the more difficult task of developing and debugging an optimistic synchronization kernel and the application models

    Run-time optimization of adaptive irregular applications

    Get PDF
    Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p

    Algorithm Libraries for Multi-Core Processors

    Get PDF
    By providing parallelized versions of established algorithm libraries, we ease the exploitation of the multiple cores on modern processors for the programmer. The Multi-Core STL provides basic algorithms for internal memory, while the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk. Some parallelized geometric algorithms are introduced into CGAL. Further, we design and implement sorting algorithms for huge data in distributed external memory
    • …
    corecore