
    WCET Optimizations and Architectural Support for Hard Real-Time Systems

    As time predictability is critical to hard real-time systems, it is not only necessary to accurately estimate the worst-case execution time (WCET) of the real-time tasks but also desirable to improve either the WCET of the tasks or the time predictability of the system, because real-time tasks with lower WCETs are easier to schedule and more likely to meet their deadlines. As a real-time system is an integration of software and hardware, optimization can be pursued in two ways: software optimization and time-predictable architectural support. In terms of software optimization, we first propose a loop-based instruction prefetching approach that can enhance both average-case and worst-case performance, further improving the WCET compared with simple prefetching techniques such as Next-N-Line prefetching. Our prefetching approach exploits program control-flow information to intelligently prefetch the instructions that are most likely needed. Second, as inter-thread interferences in shared caches can significantly affect the WCET of real-time tasks running on multicore processors, we study three multicore-aware code positioning methods to reduce the inter-core L2 cache interferences between co-running real-time threads. One strategy focuses on decreasing the longest WCET among the co-running threads, and the other two aim at achieving fairness in terms of the amount or percentage of WCET reduction among co-running threads. In terms of time-predictable architectural support, we introduce the concept of architectural time predictability (ATP) to separate timing uncertainty concerns caused by hardware from those caused by software, which greatly facilitates the advancement of time-predictable processor design. We also propose a metric called the Architectural Time-predictability Factor (ATF) to measure architectural time predictability quantitatively. Furthermore, while cache memories can generally improve average-case performance, they are harmful to time predictability and thus are undesirable for hard real-time and safety-critical systems. In contrast, scratch-pad memories (SPMs) are time predictable, but they may lead to inferior performance. Guided by ATF, we propose and evaluate a variety of hybrid on-chip memory architectures that combine caches and SPMs intelligently to achieve both good time predictability and high performance. A detailed implementation and a discussion of experimental results are presented in this dissertation.
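
    To make the contrast concrete, the sketch below is a toy model, not the dissertation's actual mechanism: Next-N-Line prefetching blindly fetches sequential cache lines, while a control-flow-aware policy recognizes a backward (loop) branch and prefetches the loop head instead. The line size, the branch-map representation, and all function names are assumptions for illustration.

    # Illustrative sketch only: contrasts Next-N-Line prefetching with a
    # control-flow-aware (loop-based) policy. All names and the branch-map
    # representation are assumptions, not the dissertation's design.

    LINE_SIZE = 64  # bytes per instruction cache line (assumed)

    def next_n_line(pc: int, n: int = 2) -> list[int]:
        """Prefetch the next n sequential cache lines after the one holding pc."""
        line = pc // LINE_SIZE
        return [(line + i) * LINE_SIZE for i in range(1, n + 1)]

    def loop_aware_prefetch(pc: int, branch_map: dict[int, int], n: int = 2) -> list[int]:
        """If pc is a backward branch (a loop), prefetch lines at the loop head
        rather than the sequential fall-through, since the loop body is the
        most likely next demand."""
        target = branch_map.get(pc)
        if target is not None and target < pc:       # backward branch => loop
            line = target // LINE_SIZE
            return [(line + i) * LINE_SIZE for i in range(n)]
        return next_n_line(pc, n)                    # default: sequential

    if __name__ == "__main__":
        branches = {0x1040: 0x1000}   # a loop branch at 0x1040 back to 0x1000
        print(next_n_line(0x1040))                   # sequential lines after the branch
        print(loop_aware_prefetch(0x1040, branches)) # lines at the loop head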

    Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy

    Microprocessor performance has been increasing at an exponential rate while memory system performance has improved at only a linear rate. This widening gap increasingly renders advances in computer architecture less useful, as more instructions spend more time waiting for data to be fetched from memory after a cache miss. Data prefetching is a technique that avoids some cache misses by bringing data into the cache before it is actually needed. Different approaches to data prefetching have been developed; however, existing prefetch schemes do not eliminate all cache misses, and even with a smaller cache miss ratio, miss latency remains an important performance limiter. In this thesis, we propose a technique called Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy (CAPPH). It is a hardware-software prefetch technique that uses a compiler to provide information pertaining to data-structure layout, data flow, and the procedure-call hierarchy of the program to a mechanism that prefetches linked data structures (LDS). Using this statically generated information, it can prefetch data for procedures even before they are called. It is also capable of issuing prefetches for recursive functions that access LDS, and for arbitrary access sequences which are otherwise difficult to prefetch. The scheme is simulated using RSIML, a SPARC v8 simulator. The benchmarks em3d, health, and mst from the Olden suite were used. The scheme was compared with an otherwise identical system with no prefetch and with one using sequential prefetch. Simulations were performed to measure CAPPH performance and the decrease in the miss ratio of loads accessing LDS. Statistics of individual loads were collected, and accuracy, coverage, and timeliness were measured against varying cache size and latency. Results for individual loads accessing linked data structures show a considerable decrease in their miss ratios and average access times. CAPPH is found to be more accurate than sequential prefetch, although its coverage and timeliness are lower. We suggest heuristics to further enhance the effectiveness of the prefetch technique.
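
    As a rough illustration of why knowing data-structure layout helps with linked data structures, the following hypothetical software model (not the CAPPH hardware, whose compiler-generated tables and prefetch engine are far richer) runs a lookahead pointer a few nodes ahead of the demand traversal, which is what a prefetcher can do once the compiler has told it where the next-pointer field lives.

    # Hypothetical software model of linked-data-structure (LDS) prefetching.
    # Only the core idea is modeled: a prefetcher that knows the offset of the
    # 'next' field can run ahead of the demand stream.

    from dataclasses import dataclass

    @dataclass
    class Node:
        payload: int
        next: "Node | None" = None

    def prefetch(node: Node) -> None:
        """Stand-in for a hardware prefetch; touching the object models
        bringing its cache line in early."""
        _ = node.payload

    def traverse_with_prefetch(head: "Node | None", distance: int = 3) -> int:
        """Walk the list, issuing a (simulated) prefetch 'distance' nodes ahead.
        In hardware the prefetcher would dereference node + offsetof(next)."""
        ahead = head
        for _ in range(distance):          # prime the lookahead pointer
            if ahead is not None:
                ahead = ahead.next
        total = 0
        node = head
        while node is not None:
            if ahead is not None:
                prefetch(ahead)            # simulated prefetch of a future node
                ahead = ahead.next
            total += node.payload          # demand access
            node = node.next
        return total

    if __name__ == "__main__":
        head = None
        for v in range(10, 0, -1):         # build the list 1 -> 2 -> ... -> 10
            head = Node(v, head)
        print(traverse_with_prefetch(head))  # 55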

    Iterative Compilation and Performance Prediction for Numerical Applications

    As the current rate of improvement in processor performance far exceeds the rate of improvement in memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited, and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefit of different optimisations, and there are no simple criteria for deciding when to stop optimising, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring, using a new, reasonably fast and accurate performance prediction technique to guide optimisations. New strategies for searching the optimisation space by means of profiling, in order to find the best possible program variant, have been developed. These strategies have been evaluated using a range of kernels and programs on different platforms and operating systems. A significant performance improvement has been achieved with the new approaches when compared to state-of-the-art native static and platform-specific feedback-directed compilers.
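
    A minimal sketch of such an iterative, feedback-directed loop, under assumed names and a synthetic cost model rather than the thesis's actual predictor: candidate program variants (here, an unroll factor and a tile size) are ranked by a cheap analytical model, and only the most promising few are actually profiled.

    # Minimal sketch of iterative, feedback-directed optimisation search.
    # The variant space, cost model, and profiling hook are placeholders.

    import itertools
    import random

    def compile_and_run(unroll: int, tile: int) -> float:
        """Placeholder for 'compile the variant, run it, return execution time'.
        A deterministic synthetic function stands in for a real profiling run."""
        random.seed(unroll * 31 + tile)
        return abs(unroll - 4) * 0.2 + abs(tile - 64) * 0.01 + random.random() * 0.05

    def predicted_cost(unroll: int, tile: int) -> float:
        """Placeholder analytical model: fast and roughly correlated with reality."""
        return abs(unroll - 4) * 0.25 + abs(tile - 64) * 0.008

    def search(unrolls, tiles, budget: int = 5):
        """Rank all variants with the cheap model, then profile only the top few."""
        space = list(itertools.product(unrolls, tiles))
        space.sort(key=lambda v: predicted_cost(*v))   # model-guided pruning
        best, best_time = None, float("inf")
        for unroll, tile in space[:budget]:            # feedback via profiling
            t = compile_and_run(unroll, tile)
            if t < best_time:
                best, best_time = (unroll, tile), t
        return best, best_time

    if __name__ == "__main__":
        print(search(unrolls=[1, 2, 4, 8], tiles=[16, 32, 64, 128]))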

    Hardware-based generation of independent subtraces of instructions in clustered processors

    Multicore chips currently dominate the microprocessor market as designs that improve performance while sustaining power consumption. However, complex core features must still be considered to provide good performance for existing sequential applications. An effective approach to reducing core complexity without dramatically sacrificing performance is to distribute critical processor structures by using clustered microarchitectures. In these designs, communication latency among clusters is a critical performance bottleneck, and a good steering algorithm is required to reduce intercluster communication. In this paper, we propose a new energy-efficient microarchitectural approach that reduces intercluster communication by detecting and generating independent chains of instructions, referred to as subtraces, from the execution of sequential programs. The devised mechanism has been modeled on an x86-based trace-cache processor, where subtraces are built in the fill unit, stored in a trace cache, and individually steered to different clusters. Experimental results show that the proposal reaches performance speedups of around 7 and 15 percent for point-to-point and bus-based interconnects, respectively, while achieving energy savings of up to 12 percent. This work was supported by the Spanish MICINN, Consolider Programme, and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01.
    Ubal Tena, R.; Sahuquillo Borrás, J.; Petit Martí, S. V.; López Rodríguez, P. J.; Duato Marín, J. F. (2013). Hardware-based generation of independent subtraces of instructions in clustered processors. IEEE Transactions on Computers, 62(5), 944-955. https://doi.org/10.1109/TC.2012.42
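
    The core idea of subtrace formation, grouping a trace into independent dependence chains so that each chain can be steered to its own cluster and values rarely cross the interconnect, can be sketched in software as below. This toy register-dependence model is illustrative only; the paper builds subtraces in the fill unit of a trace-cache processor.

    # Toy sketch of subtrace formation: group the instructions of a trace into
    # independent dependence chains and steer each chain to one cluster.
    # The instruction format and union-find grouping are illustrative only.

    def find(parent: list[int], x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def build_subtraces(trace):
        """trace: list of (dest_reg, [src_regs]). Instructions related by a
        producer/consumer dependence end up in the same subtrace."""
        parent = list(range(len(trace)))
        last_writer: dict[str, int] = {}
        for i, (dst, srcs) in enumerate(trace):
            for s in srcs:                  # union with each source's producer
                if s in last_writer:
                    ra, rb = find(parent, i), find(parent, last_writer[s])
                    parent[ra] = rb
            last_writer[dst] = i
        groups: dict[int, list[int]] = {}
        for i in range(len(trace)):
            groups.setdefault(find(parent, i), []).append(i)
        return list(groups.values())

    if __name__ == "__main__":
        trace = [("r1", []), ("r2", ["r1"]),   # chain A: instr 0 -> 1
                 ("r3", []), ("r4", ["r3"])]   # chain B: instr 2 -> 3
        for cluster, chain in enumerate(build_subtraces(trace)):
            print(f"cluster {cluster}: instructions {chain}")  # one chain per cluster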

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations that are the subject of humanity's many endeavors. Catering to this demand, chip multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important last-level cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, it is the L1 data cache that has a pronounced impact on GPGPU performance, by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which exploits the GPU's latency tolerance and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedups, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore's Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs and demonstrate that the inherent non-uniform memory access side effects form the key energy-efficiency bottleneck of the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects toward developing high-performance and energy-efficient general-purpose processors.
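
    The LLC insight, choosing victims with an eye on main-memory cost rather than recency alone, might be sketched as below. The line metadata and the two-step scoring are invented for illustration and do not reproduce ReMAP's actual policy.

    # Rough sketch in the spirit of ReMAP's stated goal: prefer evicting lines
    # that have shown no reuse, or whose re-fetch from main memory is cheap.
    # The metadata fields and scoring are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Line:
        tag: int
        reused: bool          # observed reuse since insertion
        refetch_cost: float   # estimated DRAM cost if evicted and re-fetched

    def pick_victim(cache_set: list[Line]) -> Line:
        """Evict a never-reused line first; among candidates, evict the one
        that is cheapest to bring back from memory."""
        dead = [l for l in cache_set if not l.reused]
        candidates = dead if dead else cache_set
        return min(candidates, key=lambda l: l.refetch_cost)

    if __name__ == "__main__":
        ways = [Line(0xA, True, 40.0), Line(0xB, False, 60.0), Line(0xC, True, 25.0)]
        print(hex(pick_victim(ways).tag))   # 0xb: never reused, so it goes first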

    Compilation and Scheduling Techniques for Embedded Systems

    Embedded applications are constantly increasing in size, which has placed increasing demands on designers of digital signal processors (DSPs) to meet tight memory, size, and cost constraints. With this trend, memory requirement reduction through code compaction and variable coalescing techniques is gaining ground. Also, as the trend in complex embedded systems of using multiprocessor systems-on-chip (MPSoCs) grows, problems like mapping, memory management, and scheduling are gaining more attention. The first part of the dissertation deals with problems related to digital signal processors. Most modern DSPs provide multiple address registers and a dedicated address generation unit (AGU) that performs address generation in parallel with instruction execution. A careful placement of variables in memory is important in decreasing the number of address arithmetic instructions, leading to compact and efficient code. Chapters 2 and 3 present effective heuristics for the simple and the general offset assignment problems with variable coalescing; a solution based on simulated annealing is also presented. Chapter 4 presents an optimal integer linear programming (ILP) solution to the offset assignment problem with variable coalescing and operand permutation, and introduces a new approach to the general offset assignment problem. Chapter 5 presents an optimal ILP formulation and a genetic algorithm solution to the address register allocation (ARA) problem with code transformation techniques; the ARA problem is used to generate compact code for array-intensive embedded applications. In the second part of the dissertation, we study problems related to MPSoCs. MPSoCs provide the flexibility to meet the performance requirements of multimedia applications while respecting tight embedded-system constraints. MPSoC-based embedded systems often employ software-managed memories called scratch-pad memories (SPMs). Scheduling the tasks of an application on the processors and partitioning the available SPM budget among those processors are two critical issues in reducing the overall computation time. Traditionally, the task scheduling step is applied separately from the memory partitioning step; such a decoupled approach may miss better-quality schedules. Chapters 6 and 7 present effective heuristics that integrate task allocation and SPM partitioning to further reduce the execution time of embedded applications, for single- and multi-application scenarios.
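
    For the offset assignment problems these chapters build on, the classic greedy baseline (in the spirit of Liao's heuristic, shown here without the variable coalescing the dissertation adds) selects heavy access-graph edges first while keeping the selection a set of simple paths, so that variables accessed consecutively end up adjacent in memory and reachable by the AGU's auto-increment.

    # Sketch of the classic greedy heuristic for Simple Offset Assignment (SOA).
    # Edge weight = number of consecutive accesses to a variable pair; heavy
    # edges are kept as long as the chosen edges remain simple paths.

    from collections import Counter

    def soa_layout(access_sequence: list[str]) -> list[tuple[str, str]]:
        weights = Counter()
        for a, b in zip(access_sequence, access_sequence[1:]):
            if a != b:
                weights[frozenset((a, b))] += 1
        degree = Counter()
        parent = {v: v for v in set(access_sequence)}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        chosen = []
        for edge, _ in weights.most_common():   # heaviest edges first
            a, b = tuple(edge)
            # Keep the selection a set of simple paths: degree <= 2, no cycles.
            if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
                chosen.append((a, b))
                degree[a] += 1
                degree[b] += 1
                parent[find(a)] = find(b)
        return chosen   # adjacency pairs defining the memory-layout paths

    if __name__ == "__main__":
        seq = list("abcdacdab")   # toy access sequence over variables a..d
        print(soa_layout(seq))    # heavy pairs placed adjacent in memory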