108 research outputs found

    Hardware Architectural Support for Caching Partitioned Reconfigurations in Reconfigurable Systems

    Get PDF
    The efficiency of the reconfiguration process in modern FPGAs can improve drastically if an on-chip configuration memory is included in the system because it can reduce both the reconfiguration latency and its energy consumption. However, FPGA on-chip memory resources are very limited. Thus, it is very important to manage them effectively in order to improve the reconfiguration process as much as possible even when the size of the on-chip configuration memory is small. This paper presents a hardware implementation of an on-chip configuration memory controller that efficiently manages run-time reconfigurations. In order to optimize the use of the on-chip memory, this controller includes support to deal with configurations that have been divided into blocks of customizable size. When a reconfiguration must be carried out, our controller provides the blocks stored on-chip and looks for the remaining blocks by accessing to the off-chip configuration memory. Moreover, it dynamically decides which blocks must be stored on-chip. To this end, the designed controller implements a simple but efficient technique that allows maximizing the benefits of the on-chip memories. Experimental results will demonstrate that its implementation cost is very affordable and that it introduces negligible run-time management overheads

    A Survey of Probabilistic Timing Analysis Techniques for Real-Time Systems

    Get PDF
    This survey covers probabilistic timing analysis techniques for real-time systems. It reviews and critiques the key results in the field from its origins in 2000 to the latest research published up to the end of August 2018. The survey provides a taxonomy of the different methods used, and a classification of existing research. A detailed review is provided covering the main subject areas: static probabilistic timing analysis, measurement-based probabilistic timing analysis, and hybrid methods. In addition, research on supporting mechanisms and techniques, case studies, and evaluations is also reviewed. The survey concludes by identifying open issues, key challenges and possible directions for future research

    Improving cache Behavior in CMP architectures throug cache partitioning techniques

    Get PDF
    The evolution of microprocessor design in the last few decades has changed significantly, moving from simple inorder single core architectures to superscalar and vector architectures in order to extract the maximum available instruction level parallelism. Executing several instructions from the same thread in parallel allows significantly improving the performance of an application. However, there is only a limited amount of parallelism available in each thread, because of data and control dependences. Furthermore, designing a high performance, single, monolithic processor has become very complex due to power and chip latencies constraints. These limitations have motivated the use of thread level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors allow executing different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit the TLP, such as chip multiprocessors (CMP), coarse grain multithreading, fine grain multithreading, simultaneous multithreading (SMT), and combinations of them.To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (combined sometimes with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design re-use and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores.Furthermore, simpler cores have less power hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. Higher resource utilization improves aggregate performance and enables lower cost design alternatives.One of the resources that impacts most on the final performance of an application is the cache hierarchy. Caches store data recently used by the applications in order to take advantage of temporal and spatial locality of applications. Caches provide fast access to data, improving the performance of applications. Caches with low latencies have to be small, which prompts the design of a cache hierarchy organized into several levels of cache.In CMPs, the cache hierarchy is normally organized in a first level (L1) of instruction and data caches private to each core. A last level of cache (LLC) is normally shared among different cores in the processor (L2, L3 or both). Shared caches increase resource utilization and system performance. Large caches improve performance and efficiency by increasing the probability that each application can access data from a closer level of the cache hierarchy. It also allows an application to make use of the entire cache if needed.A second advantage of having a shared cache in a CMP design has to do with the cache coherency. In parallel applications, different threads share the same data and keep a local copy of this data in their cache. With multiple processors, it is possible for one processor to change the data, leaving another processor's cache with outdated data. Cache coherency protocol monitors changes to data and ensures that all processor caches have the most recent data. When the parallel application executes on the same physical chip, the cache coherency circuitry can operate at the speed of on-chip communications, rather than having to use the much slower between-chip communication, as is required with discrete processors on separate chips. These coherence protocols are simpler to design with a unified and shared level of cache onchip.Due to the advantages that multicore architectures offer, chip vendors use CMP architectures in current high performance, network, real-time and embedded systems. Several of these commercial processors have a level of the cache hierarchy shared by different cores. For example, the Sun UltraSPARC T2 has a 16-way 4MB L2 cache shared by 8 cores each one up to 8-way SMT. Other processors like the Intel Core 2 family also share up to a 12MB 24-way L2 cache. In contrast, the AMD K10 family has a private L2 cache per core and a shared L3 cache, with up to a 6MB 64-way L3 cache.As the long-term trend of increasing integration continues, the number of cores per chip is also projected to increase with each successive technology generation. Some significant studies have shown that processors with hundreds of cores per chip will appear in the market in the following years. The manycore era has already begun. Although this era provides many opportunities, it also presents many challenges. In particular, higher hardware resource sharing among concurrently executing threads can cause individual thread's performance to become unpredictable and might lead to violations of the individual applications' performance requirements. Current resource management mechanisms and policies are no longer adequate for future multicore systems.Some applications present low re-use of their data and pollute caches with data streams, such as multimedia, communications or streaming applications, or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently Used (LRU), pseudo LRU or random are demand-driven, that is, they tend to give more space to the application that has more accesses to the cache hierarchy.When no direct control over shared resources is exercised (the last level cache in this case), it is possible that a particular thread allocates most of the shared resources, degrading other threads performance. As a consequence, high resource sharing and resource utilization can cause systems to become unstable and violate individual applications' requirements. If we want to provide a Quality of Service (QoS) to applications, we need to enhance the control over shared resources and enrich the collaboration between the OS and the architecture.In this thesis, we propose software and hardware mechanisms to improve cache sharing in CMP architectures. We make use of a holistic approach, coordinating targets of software and hardware to improve system aggregate performance and provide QoS to applications. We make use of explicit resource allocation techniques to control the shared cache in a CMP architecture, with resource allocation targets driven by hardware and software mechanisms.The main contributions of this thesis are the following:- We have characterized different single- and multithreaded applications and classified workloads with a systematic method to better understand and explain the cache sharing effects on a CMP architecture. We have made a special effort in studying previous cache partitioning techniques for CMP architectures, in order to acquire the insight to propose improved mechanisms.- In CMP architectures with out-of-order processors, cache misses can be served in parallel and share the miss penalty to access main memory. We take this fact into account to propose new cache partitioning algorithms guided by the memory-level parallelism (MLP) of each application. With these algorithms, the system performance is improved (in terms of throughput and fairness) without significantly increasing the hardware required by previous proposals.- Driving cache partition decisions with indirect indicators of performance such as misses, MLP or data re-use may lead to suboptimal cache partitions. Ideally, the appropriate metric to drive cache partitions should be the target metric to optimize, which is normally related to IPC. Thus, we have developed a hardware mechanism, OPACU, which is able to obtain at run-time accurate predictions of the performance of an application when running with different cache assignments.- Using performance predictions, we have introduced a new framework to manage shared caches in CMP architectures, FlexDCP, which allows the OS to optimize different IPC-related target metrics like throughput or fairness and provide QoS to applications. FlexDCP allows an enhanced coordination between the hardware and the software layers, which leads to improved system performance and flexibility.- Next, we have made use of performance estimations to reduce the load imbalance problem in parallel applications. We have built a run-time mechanism that detects parallel applications sensitive to cache allocation and, in these situations, the load imbalance is reduced by assigning more cache space to the slowest threads. This mechanism, helps reducing the long optimization time in terms of man-years of effort devoted to large-scale parallel applications.- Finally, we have stated the main characteristics that future multicore processors with thousands of cores should have. An enhanced coordination between the software and hardware layers has been proposed to better manage the shared resources in these architectures

    State Management for Efficient Event Pattern Detection

    Get PDF
    Event Stream Processing (ESP) Systeme überwachen kontinuierliche Datenströme, um benutzerdefinierte Queries auszuwerten. Die Herausforderung besteht darin, dass die Queryverarbeitung zustandsbehaftet ist und die Anzahl von Teilübereinstimmungen mit der Größe der verarbeiteten Events exponentiell anwächst. Die Dynamik von Streams und die Notwendigkeit, entfernte Daten zu integrieren, erschweren die Zustandsverwaltung. Erstens liefern heterogene Eventquellen Streams mit unvorhersehbaren Eingaberaten und Queryselektivitäten. Während Spitzenzeiten ist eine erschöpfende Verarbeitung unmöglich, und die Systeme müssen auf eine Best-Effort-Verarbeitung zurückgreifen. Zweitens erfordern Queries möglicherweise externe Daten, um ein bestimmtes Event für eine Query auszuwählen. Solche Abhängigkeiten sind problematisch: Das Abrufen der Daten unterbricht die Stream-Verarbeitung. Ohne eine Eventauswahl auf Grundlage externer Daten wird das Wachstum von Teilübereinstimmungen verstärkt. In dieser Dissertation stelle ich Strategien für optimiertes Zustandsmanagement von ESP Systemen vor. Zuerst ermögliche ich eine Best-Effort-Verarbeitung mittels Load Shedding. Dabei werden sowohl Eingabeeevents als auch Teilübereinstimmungen systematisch verworfen, um eine Latenzschwelle mit minimalem Qualitätsverlust zu garantieren. Zweitens integriere ich externe Daten, indem ich das Abrufen dieser von der Verwendung in der Queryverarbeitung entkoppele. Mit einem effizienten Caching-Mechanismus vermeide ich Unterbrechungen durch Übertragungslatenzen. Dazu werden externe Daten basierend auf ihrer erwarteten Verwendung vorab abgerufen und mittels Lazy Evaluation bei der Eventauswahl berücksichtigt. Dabei wird ein Kostenmodell verwendet, um zu bestimmen, wann welche externen Daten abgerufen und wie lange sie im Cache aufbewahrt werden sollen. Ich habe die Effektivität und Effizienz der vorgeschlagenen Strategien anhand von synthetischen und realen Daten ausgewertet und unter Beweis gestellt.Event stream processing systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. However, the challenge is that query processing is stateful and it maintains partial matches that grow exponentially in the size of processed events. State management is complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems shall resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: Fetching the remote data interrupts the stream processing. Yet, without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model is used to determine when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data. I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency

    Improving Measurement-Based Timing Analysis through Randomisation and Probabilistic Analysis

    Get PDF
    The use of increasingly complex hardware and software platforms in response to the ever rising performance demands of modern real-time systems complicates the verification and validation of their timing behaviour, which form a time-and-effort-intensive step of system qualification or certification. In this paper we relate the current state of practice in measurement-based timing analysis, the predominant choice for industrial developers, to the proceedings of the PROXIMA project in that very field. We recall the difficulties that the shift towards more complex computing platforms causes in that regard. Then we discuss the probabilistic approach proposed by PROXIMA to overcome some of those limitations. We present the main principles behind the PROXIMA approach as well as the changes it requires at hardware or software level underneath the application. We also present the current status of the project against its overall goals, and highlight some of the principal confidence-building results achieved so far
    • …
    corecore