1,068 research outputs found

    Hardware-only stream prediction + cache prefetching + dynamic access ordering

    Get PDF
    Journal ArticleThe speed gap between processors and memory system is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being, predictable, which can be exploited to improve the efficiency of the memory subsystem. As a promising technique to relieve memory system bottleneck, prefetching has been studied in its various forms, and so is dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    Full text link
    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this direction, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup is required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201

    State Management for Efficient Event Pattern Detection

    Get PDF
    Event Stream Processing (ESP) Systeme überwachen kontinuierliche Datenströme, um benutzerdefinierte Queries auszuwerten. Die Herausforderung besteht darin, dass die Queryverarbeitung zustandsbehaftet ist und die Anzahl von Teilübereinstimmungen mit der Größe der verarbeiteten Events exponentiell anwächst. Die Dynamik von Streams und die Notwendigkeit, entfernte Daten zu integrieren, erschweren die Zustandsverwaltung. Erstens liefern heterogene Eventquellen Streams mit unvorhersehbaren Eingaberaten und Queryselektivitäten. Während Spitzenzeiten ist eine erschöpfende Verarbeitung unmöglich, und die Systeme müssen auf eine Best-Effort-Verarbeitung zurückgreifen. Zweitens erfordern Queries möglicherweise externe Daten, um ein bestimmtes Event für eine Query auszuwählen. Solche Abhängigkeiten sind problematisch: Das Abrufen der Daten unterbricht die Stream-Verarbeitung. Ohne eine Eventauswahl auf Grundlage externer Daten wird das Wachstum von Teilübereinstimmungen verstärkt. In dieser Dissertation stelle ich Strategien für optimiertes Zustandsmanagement von ESP Systemen vor. Zuerst ermögliche ich eine Best-Effort-Verarbeitung mittels Load Shedding. Dabei werden sowohl Eingabeeevents als auch Teilübereinstimmungen systematisch verworfen, um eine Latenzschwelle mit minimalem Qualitätsverlust zu garantieren. Zweitens integriere ich externe Daten, indem ich das Abrufen dieser von der Verwendung in der Queryverarbeitung entkoppele. Mit einem effizienten Caching-Mechanismus vermeide ich Unterbrechungen durch Übertragungslatenzen. Dazu werden externe Daten basierend auf ihrer erwarteten Verwendung vorab abgerufen und mittels Lazy Evaluation bei der Eventauswahl berücksichtigt. Dabei wird ein Kostenmodell verwendet, um zu bestimmen, wann welche externen Daten abgerufen und wie lange sie im Cache aufbewahrt werden sollen. Ich habe die Effektivität und Effizienz der vorgeschlagenen Strategien anhand von synthetischen und realen Daten ausgewertet und unter Beweis gestellt.Event stream processing systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. However, the challenge is that query processing is stateful and it maintains partial matches that grow exponentially in the size of processed events. State management is complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems shall resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: Fetching the remote data interrupts the stream processing. Yet, without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model is used to determine when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data. I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency

    Sandbox prefetching: safe run-time evaluation of aggressive prefetchers

    Get PDF
    pre-printMemory latency is a major factor in limiting CPU per- formance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose a new mechanism to deter-mine at run-time the appropriate prefetching mechanism for the currently executing program, called Sandbox Prefetching. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run-time by adding the prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see if the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of evaluated prefetchers exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6% compared to not using any prefetching, and by 18.7% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4% compared to the Access Map Pattern Matching Prefetcher, while incurring consid- erably less logic and storage overheads

    Neighbor cache prefetching for multimedia image and video processing

    Full text link
    Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, able to improve exploitation of bidimensional spatial locality. A performance comparison is provided against other assessed prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results prove that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4 with respect to 75% of the second performing technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching
    corecore