101 research outputs found

    Cache behavior prediction by abstract interpretation

    Get PDF
    Abstract interpretation is a technique for the static detection of dynamic properties of programs. It is semantics based, that is, it computes approximative properties of the semantics of programs. On this basis, it allows for correctness proofs of analyses. It replaces commonly used ad hoc techniques by systematic, provable ones, and it allows the automatic generation of analyzers from specifications as in the Program Analyzer Generator, PAG. In this paper, abstract interpretation is applied to the problem of predicting the cache behavior of programs. Abstract semantics of machine programs are defined which determine the contents of caches. For interprocedural analysis, existing methods are examined and a new approach that is especially tailored for the cache analysis is presented. This allows for a static classification of the cache behavior of memory references of programs. The calculated information can be used to sharpen worst case execution time estimations. It is possible to analyze instruction, data, and combined instruction/data caches for common (re)placement and write strategies. Experimental results are presented that demonstrate the applicability of the analysis

    A general guide to applying machine learning to computer architecture

    Get PDF
    The resurgence of machine learning since the late 1990s has been enabled by significant advances in computing performance and the growth of big data. The ability of these algorithms to detect complex patterns in data which are extremely difficult to achieve manually, helps to produce effective predictive models. Whilst computer architects have been accelerating the performance of machine learning algorithms with GPUs and custom hardware, there have been few implementations leveraging these algorithms to improve the computer system performance. The work that has been conducted, however, has produced considerably promising results. The purpose of this paper is to serve as a foundational base and guide to future computer architecture research seeking to make use of machine learning models for improving system efficiency. We describe a method that highlights when, why, and how to utilize machine learning models for improving system performance and provide a relevant example showcasing the effectiveness of applying machine learning in computer architecture. We describe a process of data generation every execution quantum and parameter engineering. This is followed by a survey of a set of popular machine learning models. We discuss their strengths and weaknesses and provide an evaluation of implementations for the purpose of creating a workload performance predictor for different core types in an x86 processor. The predictions can then be exploited by a scheduler for heterogeneous processors to improve the system throughput. The algorithms of focus are stochastic gradient descent based linear regression, decision trees, random forests, artificial neural networks, and k-nearest neighbors.This work has been supported by the European Research Council (ERC) Advanced Grant RoMoL (Grant Agreemnt 321253) and by the Spanish Ministry of Science and Innovation (contract TIN 2015-65316P).Peer ReviewedPostprint (published version

    Area-efficient near-associative memories on FPGAs

    Full text link

    Establishing Confidence and Understanding Uncertainty in Real-Time Systems

    Get PDF

    Last-touch correlated data streaming

    Get PDF
    Recent research advocates address-correlating predictors to identify cache block addresses for prefetch. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage requirements or limited in prediction lookaheaddue to long off-chip correlation data lookup. In this paper, we propose Last-Touch Correlated Data Streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order they will be used and stream them into a practicallysized on-chip table shortly before they are needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied. © 2007 IEEE

    ANALYTICAL MODEL FOR CHIP MULTIPROCESSOR MEMORY HIERARCHY DESIGN AND MAMAGEMENT

    Get PDF
    Continued advances in circuit integration technology has ushered in the era of chip multiprocessor (CMP) architectures as further scaling of the performance of conventional wide-issue superscalar processor architectures remains hard and costly. CMP architectures take advantageof Moore¡¯s Law by integrating more cores in a given chip area rather than a single fastyet larger core. They achieve higher performance with multithreaded workloads. However,CMP architectures pose many new memory hierarchy design and management problems thatmust be addressed. For example, how many cores and how much cache capacity must weintegrate in a single chip to obtain the best throughput possible? Which is more effective,allocating more cache capacity or memory bandwidth to a program?This thesis research develops simple yet powerful analytical models to study two newmemory hierarchy design and resource management problems for CMPs. First, we considerthe chip area allocation problem to maximize the chip throughput. Our model focuses onthe trade-off between the number of cores, cache capacity, and cache management strategies.We find that different cache management schemes demand different area allocation to coresand cache to achieve their maximum performance. Second, we analyze the effect of cachecapacity partitioning on the bandwidth requirement of a given program. Furthermore, ourmodel considers how bandwidth allocation to different co-scheduled programs will affect theindividual programs¡¯ performance. Since the CMP design space is large and simulating only one design point of the designspace under various workloads would be extremely time-consuming, the conventionalsimulation-based research approach quickly becomes ineffective. We anticipate that ouranalytical models will provide practical tools to CMP designers and correctly guide theirdesign efforts at an early design stage. Furthermore, our models will allow them to betterunderstand potentially complex interactions among key design parameters

    Exploring Hybrid SPM-Cache Architectures to Improve Performance and Energy Efficiency for Real-time Computing

    Get PDF
    Real-time computing is not just fast computing but time-predictable computing. Many tasks in safety-critical embedded real-time systems have hard real-time characteristics. Failure to meet deadlines may result in the loss of life or in large damages. Known of Worst Case Execution Time (WCET) is important for reliability or correct functional behavior of the system. As multi-core processors are increasingly adopted in industry, it has become a great challenge to accurately bound the worst-case execution time (WCET) for real-time systems running on multi-core chips. This is particularly true because of the inter-thread interferences in accessing shared resources on multi-cores, such as shared L2 caches, which can significantly affect the performance but are very difficult to be estimate statically. We propose an approach to analyzing Worst Case Execution Time (WCET) for multi-core processors with shared L2 instruction caches by using a model checking based method. Our experiments indicate that compared to the static analysis technique based on extended ILP (Integer Linear Programming), our approach improves the tightness of WCET estimation more than 31.1% for the benchmarks we studied. However, due to the inherent complexity of multi-core timing analysis and the state explosion problem, the model checking based approach currently can only work with small real-time kernels for dual-core processors. At the same time, improving the average-case performance and energy efficiency has also been important for real-time systems. Recently, Hybrid SPM-Cache (HSC) architectures by combining caches and Scratch-Pad Memories (SPMs) have been increasingly used in commercial processors and research prototypes. Our research explores HSC architectures for real-time systems to reconcile time predictability, performance, and energy consumption. We study the energy dissipation of a number of hybrid on-chip memory architectures by combining both caches and Scratch-Pad Memories (SPM) without increasing the total on-chip memory size. Our experimental results indicate that with the equivalent total on-chip memory size, several hybrid SPM-Cache architectures are more energy-efficient than either pure software controlled SPMs or pure hardware-controlled caches. In particular, using the hybrid SPM-cache to store both instructions and data can achieve the best energy efficiency. However, the SPM allocation for the HSC architecture must be aware of the cache to harness the full potential of the HSC architecture. First, we propose and evaluate four SPM allocation strategies to reduce WCET for hybrid SPM-Caches with different complexities. These algorithms differ by whether or not they can cooperate with the cache or be aware of the WCET. Our evaluation shows that the cache aware and WCET-oriented SPM allocation can maximally reduce the WCET with minimum or even positive impact on the average-case execution time (ACET). Moreover, we explore four SPM allocation algorithms to maximize performance on the HSC architecture, including three heuristic-based algorithms, and an optimal algorithm based on model checking. Our experiments indicate that the Greedy Stack Distance based Allocation (GSDA) can run efficiently while achieving performance either the same as or close to the optimal results got by the Optimal Stack Distance based Allocation (OSDA). Last but not the least, we extend the two stack distance based allocation algorithms to GSDA-E and OSDA-E to minimize the energy consumption of the HSC architecture. Our experimental results show that the GSDA-E can also reduce the energy either the same as or close to the optimal results attained by the OSDA-E, while achieving performance close to the OSDA and GSDA

    Speeding up architectural simulation through high-level core abstractions and sampling

    Get PDF

    Memory Optimizations for Time-Predictable Embedded Software

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    corecore