2 research outputs found

    Low-power architecture with scratch-pad memory for accelerating embedded applications with run-time reuse

    Current embedded systems are usually designed for data-dominated applications, yet they operate under tight energy and time budgets. Scratch-pad memories are completely software-controlled memories with predictable behaviour and good performance and energy characteristics, so they are becoming a standard feature in many embedded systems. However, this predictability does not help when the application accesses its data dynamically, i.e. when the addresses of the accessed data depend on the application's input. In such cases, predetermining the scratch-pad contents at design time is not always possible, because the compiler cannot predict the runtime input. Moreover, both data reuse and data placement in the scratch-pad become inefficient, because chunks of data already stored cannot be efficiently reused and combined with the data blocks accessed at runtime. State-of-the-art techniques copy each new data block to the scratch-pad without considering whether portions of it are already there; such dynamic temporal locality cannot be predicted or exploited by the compiler. The authors present a system architecture, tightly coupled to the system's scratch-pad and the processor's compiler, that efficiently exploits run-time data reuse in the scratch-pad: it holds valuable information, such as the exact data contents of the scratch-pad at runtime, and uses it to perform all the operations needed to place each new data block in the scratch-pad. It is fine-tuned for applications with run-time reuse between rectangular data blocks. The application domain of the proposed architecture is multimedia applications with run-time reuse, certain applications with linked lists, and multi-threaded applications. It operates in a time- and energy-efficient manner compared with existing scratch-pad architectures lacking the authors' scratch-pad accelerator engine, showing higher normalised performance and lower normalised energy consumption. Experimental results show up to 2.5 times higher performance than existing scratch-pad architectures and 5 times higher than cache architectures, with energy reductions of up to 1.9 and 3.9 times, respectively.
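
    To make the reuse idea concrete, the sketch below illustrates, under assumptions of our own, how a run-time engine that knows the exact scratch-pad contents could fetch only the cells of a requested rectangular block that the resident block does not already cover. The names (Rect, dma_row, load_block), the single-resident-block bookkeeping, and the row-wise transfer granularity are hypothetical, not taken from the paper.

        #include <stdio.h>

        /* Illustrative sketch only: Rect, dma_row and the single
         * resident-block bookkeeping are assumptions for this example,
         * not the authors' scratch-pad accelerator engine. */

        typedef struct { int x, y, w, h; } Rect;   /* rectangular block */

        static Rect resident = {0, 0, 0, 0};       /* block the scratch-pad holds */

        /* Stand-in for a DMA transfer of one row segment from main memory
         * into the scratch-pad; here it just reports what it would copy. */
        static void dma_row(int src_x, int src_y, int len)
        {
            printf("fetch %d cells at (%d,%d)\n", len, src_x, src_y);
        }

        /* Load `req`, fetching only the cells the resident block does not
         * cover. Simplification: reused cells are assumed to already sit at
         * the right scratch-pad offset; the intra-scratch-pad moves a real
         * engine would perform are omitted. */
        static void load_block(Rect req)
        {
            /* intersection [ix,ix2) x [iy,iy2) of requested and resident blocks */
            int ix  = req.x > resident.x ? req.x : resident.x;
            int iy  = req.y > resident.y ? req.y : resident.y;
            int ix2 = req.x + req.w < resident.x + resident.w
                          ? req.x + req.w : resident.x + resident.w;
            int iy2 = req.y + req.h < resident.y + resident.h
                          ? req.y + req.h : resident.y + resident.h;

            for (int y = req.y; y < req.y + req.h; y++) {
                if (y < iy || y >= iy2 || ix >= ix2) {
                    dma_row(req.x, y, req.w);            /* whole row missing */
                } else {
                    if (req.x < ix)                      /* missing left part */
                        dma_row(req.x, y, ix - req.x);
                    if (req.x + req.w > ix2)             /* missing right part */
                        dma_row(ix2, y, req.x + req.w - ix2);
                }
            }
            resident = req;   /* scratch-pad now holds the requested block */
        }

        int main(void)
        {
            load_block((Rect){0, 0, 8, 8});   /* cold start: 8 full rows fetched */
            load_block((Rect){4, 4, 8, 8});   /* overlaps: only the new L-shape */
            return 0;
        }

    A compiler-only approach would have to fetch the second block in full, since the overlap depends on runtime addresses; tracking the resident contents is what lets the hardware skip the shared sub-block.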

    Vector coprocessor sharing techniques for multicores: performance and energy gains

    Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago, and all commercial computing architectures on the market today contain some form of vector or SIMD processing. Yet many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors for various reasons: a limited percentage of sustained vector code due to substantial flow control; inherently small parallelism or the frequent involvement of operating-system tasks; varying vector length across applications or within a single application; and data dependencies within short sequences of instructions, a problem further exacerbated without loop unrolling or other compiler optimizations. Additionally, existing rigid SIMD architectures cannot efficiently tolerate dynamic application environments with many cores that may require the runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels. To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also releasing on-chip real estate for other important design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors; sharing an expensive resource among multiple cores increases the efficiency of the functional units and the overall system throughput. The second part of the dissertation evaluates and characterizes the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector-capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research focuses on the development of a fine-to-coarse-grain power-management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (the number of vector lanes) in order to minimize energy consumption for applications with dynamic workloads. Finally, to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work implements an ASIC (Application-Specific Integrated Circuit) version of the shared VP, enabling precise performance-energy studies of high-performance vector processing in multicore environments.
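
    As a rough illustration of the kind of estimation model such adaptivity relies on, the sketch below picks the lane count that minimizes modelled energy for a program phase with a given scalar/vector workload split. The Amdahl-style timing model, the power terms, and every constant are assumptions invented for illustration, not the dissertation's calibrated estimators.

        #include <stdio.h>

        /* Hypothetical first-order model: execution time splits into scalar
         * work and vector work that scales with the lane count; energy
         * combines static power (core plus per-lane leakage over the whole
         * run) with dynamic vector energy, which is roughly constant because
         * the total vector work is fixed. All constants are invented. */

        static const int lane_configs[] = {1, 2, 4, 8};

        static double exec_time(double t_scalar, double t_vector, int lanes)
        {
            return t_scalar + t_vector / lanes;   /* vector part scales with lanes */
        }

        static double energy(double t_scalar, double t_vector, int lanes,
                             double p_core, double p_lane_stat, double p_lane_dyn)
        {
            double t = exec_time(t_scalar, t_vector, lanes);
            return (p_core + lanes * p_lane_stat) * t   /* leakage, whole run  */
                 + p_lane_dyn * t_vector;               /* dynamic vector work */
        }

        /* Return the lane count with the lowest modelled energy: more lanes
         * shorten the vector phase but leak during the scalar phase, so an
         * intermediate configuration can win. */
        static int best_lanes(double t_scalar, double t_vector,
                              double p_core, double p_lane_stat, double p_lane_dyn)
        {
            int best = lane_configs[0];
            double best_e = 1e300;
            for (unsigned i = 0; i < sizeof lane_configs / sizeof *lane_configs; i++) {
                double e = energy(t_scalar, t_vector, lane_configs[i],
                                  p_core, p_lane_stat, p_lane_dyn);
                if (e < best_e) { best_e = e; best = lane_configs[i]; }
            }
            return best;
        }

        int main(void)
        {
            /* phase with 50 ms of scalar work, 40 ms of single-lane vector work */
            printf("allocate %d lanes\n", best_lanes(0.050, 0.040, 0.5, 0.1, 0.2));
            return 0;
        }

    With these invented constants the model settles on two lanes: going wider keeps shaving the vector phase, but the extra per-lane leakage during the 50 ms scalar phase outweighs the savings, which is the trade-off a runtime power manager exploits when workloads shift.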