80 research outputs found

    Thermal-Aware Compilation for Register Window-Based Embedded Processors

    Get PDF
    The development of compiler-based mechanisms to optimize the thermal profile of large register files to improve the processor reliability has become an important issue. Thermal hotspots have been known to cause severe reliability issues, while the thermal profile of the devices is also related to the leakage power consumption and the cooling cost. Register window-based architectures provide a relatively large register files. However, such large register files are not designed or utilized for thermal balancing or reliability enhancement. In this paper, we propose a compilation flow that utilizes the register windows to reduce optimize the thermal profile and to reduce the hotspots. As a result, the thermal profile and reliability of the device is clearly improved. Simulation results show that the proposed flow achieves up to 91% reduction of hotspots and 11% reduction of the peak temperature in embedded processors

    Thermal-Aware Compilation for System-on-Chip Processing Architectures

    Get PDF
    The development of compiler-based mechanisms to reduce the percentage of hotspots and optimize the thermal profile of large register files has become an important issue. Thermal hotspots have been known to cause severe reliability issues, while the thermal profile of the devices is also related to the leakage power consumption and the cooling cost. In this paper we propose several compilation techniques that, based on an efficient register allocation mechanism, reduce the percentage of hotspots in the register file and uniformly distribute the heat. As a result, the thermal profile and reliability of the device is clearly improved. Simulation results show that the proposed flow achieved 91% reduction of hotspots and 11% reduction of the peak temperature

    Thermal-aware compilation for system-on-chip processing architectures

    Full text link

    Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs

    Get PDF
    Modern embedded systems are parallel, component-based, heterogeneous and finely tuned on the basis of the workload that must be executed on them. To improve design reuse, Application Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, as a solution capable of satisfying the required functional and physical constraints (e.g. throughput, latency, power or energy consumption etc.), while providing, at the same time, high flexibility and adaptability. Composing a multi-processor architecture including ASIPs and mapping parallel applications onto it is a design activity that require an extensive Design Space Exploration process (DSE), to result in cost-effective systems. The work described here aims at defining novel methodologies for the application-driven customizations of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools. High-level event-based simulation is a widely used technique that offers speed and flexibility as main points of strength, but needs, as a preliminary input and periodically during the iteration process, calibration data that must be acquired by means of more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycleaccurate simulators that, however, turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the grain of the calibration is too fine, while FPGA approaches have shown to performbetter for this particular applications. FPGA-based emulation techniques have been proposed in the recent past as an alternative solution to the software-based simulation approach, but some further steps are needed before they can be effectively exploitedwithin architectural design space exploration. Firstly, some kind of technology-awareness must be introduced, to enable the translation of the emulation results into a pre-estimation of a prospective ASIC implementation of the design. Moreover, when performing architectural DSE, a significant number of different candidate design points has to be evaluated and compared. In this case, if no countermeasures are taken, the advantages achievable with FPGAs, in terms of emulation speed, are counterbalanced by the overhead introduced by the time needed to go through the physical synthesis and implementation flow. Developed FPGA-based prototyping platform overcomes such limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIP processors. In this approach, to increase the emulation speed-up, two different methods are proposed: the first is based on automatic instantiation of additional hardware modules, able to reconfigure at runtime the prototype, while the second leverages manipulation of application binary code, compiled for a custom VLIW ASIP architecture, that is transformed into code executable on a different configuration. This allows to prototype a whole set of ASIP solutions after one single FPGA implementation flow, mitigating the afore-mentioned overhead.A short overview on the tools used throughout the work will also be offered, covering basic aspects of Intel-Silicon Hive ASIP development toolchain, SESAME framework general description, along with a review of state-of-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach will be validated through a real-world use case, confirming the validity of this solution

    Combining FPGA prototyping and high-level simulation approaches for Design Space Exploration of MPSoCs

    Get PDF
    Modern embedded systems are parallel, component-based, heterogeneous and finely tuned on the basis of the workload that must be executed on them. To improve design reuse, Application Specific Instruction-set Processors (ASIPs) are often employed as building blocks in such systems, as a solution capable of satisfying the required functional and physical constraints (e.g. throughput, latency, power or energy consumption etc.), while providing, at the same time, high flexibility and adaptability. Composing a multi-processor architecture including ASIPs and mapping parallel applications onto it is a design activity that require an extensive Design Space Exploration process (DSE), to result in cost-effective systems. The work described here aims at defining novel methodologies for the application-driven customizations of such highly heterogeneous embedded systems. The issue is tackled at different levels, integrating different tools. High-level event-based simulation is a widely used technique that offers speed and flexibility as main points of strength, but needs, as a preliminary input and periodically during the iteration process, calibration data that must be acquired by means of more accurate evaluation methods. Typically, this calibration is performed using instruction-level cycleaccurate simulators that, however, turn out to be very slow, especially when complete multiprocessor systems must be evaluated or when the grain of the calibration is too fine, while FPGA approaches have shown to performbetter for this particular applications. FPGA-based emulation techniques have been proposed in the recent past as an alternative solution to the software-based simulation approach, but some further steps are needed before they can be effectively exploitedwithin architectural design space exploration. Firstly, some kind of technology-awareness must be introduced, to enable the translation of the emulation results into a pre-estimation of a prospective ASIC implementation of the design. Moreover, when performing architectural DSE, a significant number of different candidate design points has to be evaluated and compared. In this case, if no countermeasures are taken, the advantages achievable with FPGAs, in terms of emulation speed, are counterbalanced by the overhead introduced by the time needed to go through the physical synthesis and implementation flow. Developed FPGA-based prototyping platform overcomes such limitations, enabling the use of FPGA-based prototyping for micro-architectural design space exploration of ASIP processors. In this approach, to increase the emulation speed-up, two different methods are proposed: the first is based on automatic instantiation of additional hardware modules, able to reconfigure at runtime the prototype, while the second leverages manipulation of application binary code, compiled for a custom VLIW ASIP architecture, that is transformed into code executable on a different configuration. This allows to prototype a whole set of ASIP solutions after one single FPGA implementation flow, mitigating the afore-mentioned overhead.A short overview on the tools used throughout the work will also be offered, covering basic aspects of Intel-Silicon Hive ASIP development toolchain, SESAME framework general description, along with a review of state-of-art simulation and prototyping techniques for complex multi-processor systems. Each proposed approach will be validated through a real-world use case, confirming the validity of this solution

    Libra: Achieving Efficient Instruction- and Data- Parallel Execution for Mobile Applications.

    Full text link
    Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But, the energy budgets are increasing at a much lower rate, thus fundamental improvements in computing efficiency must be garnered. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required to be able to deploy resources as needed: exploit DLP on SIMD when it’s available, but fall back to ILP on the same hardware when necessary. In this thesis, we first focus on various compiler solutions that solve inefficiency problem in both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, called SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification in SIMD accelerators. CGRA express effectively accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages stream-graph modulo scheduling technique for scheduling of multiple code regions in multi-core accelerators. Second, we propose the new scalable multicore accelerator referred to as Libra for mobile systems, which can support execution of code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry requires higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet the performance and power efficiency requirements.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd

    MLGOPerf: An ML Guided Inliner to Optimize Performance

    Full text link
    For the past 25 years, we have witnessed an extensive application of Machine Learning to the Compiler space; the selection and the phase-ordering problem. However, limited works have been upstreamed into the state-of-the-art compilers, i.e., LLVM, to seamlessly integrate the former into the optimization pipeline of a compiler to be readily deployed by the user. MLGO was among the first of such projects and it only strives to reduce the code size of a binary with an ML-based Inliner using Reinforcement Learning. This paper presents MLGOPerf; the first end-to-end framework capable of optimizing performance using LLVM's ML-Inliner. It employs a secondary ML model to generate rewards used for training a retargeted Reinforcement learning agent, previously used as the primary model by MLGO. It does so by predicting the post-inlining speedup of a function under analysis and it enables a fast training framework for the primary model which otherwise wouldn't be practical. The experimental results show MLGOPerf is able to gain up to 1.8% and 2.2% with respect to LLVM's optimization at O3 when trained for performance on SPEC CPU2006 and Cbench benchmarks, respectively. Furthermore, the proposed approach provides up to 26% increased opportunities to autotune code regions for our benchmarks which can be translated into an additional 3.7% speedup value.Comment: Version 2: Added the missing Table 6. The short version of this work is accepted at ACM/IEEE CASES 202

    An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

    Get PDF
    Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler

    Dynamically reconfigurable architecture for embedded computer vision systems

    Get PDF
    The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that the execution is efficient without compromising the power consumption while keeping a reduced cost. For this purpose, an analysis and classification of different mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as a in-depth review of the image processing capabilities of current-generation hardware devices. This permits to determine the requirements and the key aspects for an efficient architecture. A representative set of algorithms is employed as benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses

    Energy autonomous systems : future trends in devices, technology, and systems

    Get PDF
    The rapid evolution of electronic devices since the beginning of the nanoelectronics era has brought about exceptional computational power in an ever shrinking system footprint. This has enabled among others the wealth of nomadic battery powered wireless systems (smart phones, mp3 players, GPS, …) that society currently enjoys. Emerging integration technologies enabling even smaller volumes and the associated increased functional density may bring about a new revolution in systems targeting wearable healthcare, wellness, lifestyle and industrial monitoring applications
    • …
    corecore