226 research outputs found
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
Recommended from our members
AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs
Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs.
This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor.
The dissertation furthers discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimizing away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.
To complete this thesis, we augmented a GPGPU cycle accurate simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks, however, we also show that 64% and 45% of benchmarks exhibited performance decreases when L1D cache was enabled for the 1 SMP and 2 SMP configurations, and only one benchmark showed performance improvement when the L2 cache was enabled
From Parallel Programs to Customized Parallel Processors
The need for fast time to market of new embedded processor-based designs calls for a rapid design methodology of the included processors. The call for such a methodology is even more emphasized in the context of so called soft cores targeted to reconfigurable fabrics where per-design processor customization is commonplace.
The C language has been commonly used as an input to hardware/software co-design flows. However, as C is a sequential language, its potential to generate parallel operations to utilize naturally parallel hardware constructs is far from optimal, leading to a customized processor design space with limited parallel resource scalability. In contrast, when utilizing a parallel programming language as an input, a wider processor design space can be explored to produce customized processors with varying degrees of utilized parallelism.
This Thesis proposes a novel Multicore Application-Specific Instruction Set Processor (MCASIP) co-design methodology that exploits parallel programming languages as the application input format. In the methodology, the designer can explicitly capture the parallelism of the algorithm and exploit specialized instructions using a parallel programming language in contrast to being on the mercy of the compiler or the hardware to extract the parallelism from a sequential input. The Thesis proposes a multicore processor template based on the Transport Triggered Architecture, compiler techniques involved in static parallelization of computation kernels with barriers and a datapath integrated hardware accelerator for low overhead software synchronization implementation. These contributions enable scaling the customized processors both at the instruction and task levels to efficiently exploit the parallelism in the input program up to the implementation constraints such as the memory bandwidth or the chip area. The different contributions are validated with case studies, comparisons and design examples
Performance and area evaluations of processor-based benchmarks on FPGA devices
The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology.
A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area.
For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs
Cache Coherency for Symmetric Multiprocessor Systems on Programmable Chips
Rapid progress in the area of Field-Programmable Gate Arrays (FPGAs) has led to the availability of softcore processors that are simple to use, and can enable the development of a fully working system in minutes. This has lead to the enormous popularity of System-On-Programmable-Chip (SOPC) computing platforms. These softcore processors, while relatively simple compared to their leading-edge hardcore counterparts, are often designed with a number of advanced performance-enhancing features, such as instruction and data caches. Moreover, they are designed to be used in a uniprocessor or uncoupled multiprocessor architecture, and not in a tightly-coupled multiprocessing architecture. As a result, traditional cache-coherency protocols are not suitable for use with such systems. This thesis describes a system for enforcing cache coherency on symmetric multiprocessing (SMP) systems using softcore processors. A hybrid protocol that incorporates hardware and software to enforce cache coherency is presented
Design and resource management of reconfigurable multiprocessors for data-parallel applications
FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems.
This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing.
The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology
On the use of embedded debug features for permanent and transient fault resilience in microprocessors
Microprocessor-based systems are employed in an increasing number of applications where dependability is a major constraint. For this reason detecting faults arising during normal operation while introducing the least possible penalties is a main concern. Different forms of redundancy have been employed to ensure error-free behavior, while error detection mechanisms can be employed where some detection latency is tolerated. However, the high complexity and the low observability of microprocessors internal resources make the identification of adequate on-line error detection strategies a very challenging task, which can be tackled at circuit or system level. Concerning system-level strategies, a common limitation is in the mechanism used to monitor program execution and then detect errors as soon as possible, so as to reduce their impact on the application. In this work, an on-line error detection approach based on the reuse of available debugging infrastructures is proposed. The approach can be applied to different system architectures profiting from the debug trace port available in most of current microprocessors to observe possible misbehaviors. Two microprocessors have been used to study the applicability of the solution. LEON3 and ARM7TDMI. Results show that the presented fault detection technique enhances observability and thus error detection abilities in microprocessor-based systems without requiring modifications on the core architecture
- …