Communion: a new strategy for memory management in high-performance computer systems
Modern computers present a large gap between peak performance and sustained performance. There are many reasons for this situation, most of them involving inefficient usage of computational resources. Nowadays the memory system is the most critical component because of its growing inability to keep up with processor requests. Technological trends have produced a large and growing gap between CPU speeds and DRAM speeds.
Much research has addressed this memory system problem, including program optimization techniques, data locality enhancement, hardware and software prefetching, decoupled architectures, multithreading, and speculative loads and execution. These techniques have had relative success, but each focuses on only one component of the hardware or software system.
We present here a new strategy for memory management in high-performance computer systems, named COMMUNION. The basic idea behind this strategy is cooperation. We introduce some interaction possibilities among the system programs responsible for generating and executing application programs, and investigate two specific interactions: between the compiler and the operating system, and among the compiling system components.
The experimental results show that it is possible to obtain improvements of about 10 times in execution time and about 5 times in memory demand by enhancing the interaction between the compiling system components. In the interaction between the compiler and the operating system, named Compiler-Aided Page Replacement (CAPR), we achieved a reduction of about 10% in the space-time product, with an increase of only 0.5% in total execution time. All these results show that it is possible to manage main memory more efficiently than current systems do.
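To make the CAPR idea concrete, here is a minimal sketch of a page-replacement policy that prefers to evict pages the compiler has flagged as having no further uses; the hint interface and the LRU fallback are illustrative assumptions, since the abstract does not describe the actual algorithm.

```python
# Illustrative sketch only: the real CAPR policy and its compiler/OS interface
# are not described in the abstract; the "dead page" hint format is assumed.

class HintedPageReplacer:
    """LRU replacement that first evicts pages the compiler marked as dead."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = []          # resident pages, least recently used first
        self.dead_hints = set()   # pages the compiler says will not be reused

    def hint_dead(self, page):
        # Hint emitted by the compiler, e.g. after the last use of an array.
        self.dead_hints.add(page)

    def access(self, page):
        if page in self.frames:
            self.frames.remove(page)       # refresh LRU position
        elif len(self.frames) >= self.num_frames:
            # Prefer a compiler-declared dead page; fall back to plain LRU.
            victim = next((p for p in self.frames if p in self.dead_hints),
                          self.frames[0])
            self.frames.remove(victim)
        self.frames.append(page)

# Example: page 3 is hinted dead, so it is evicted before older live pages.
r = HintedPageReplacer(num_frames=3)
for p in [1, 2, 3]:
    r.access(p)
r.hint_dead(3)
r.access(4)
print(r.frames)   # [1, 2, 4]
```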
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.
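As a minimal illustration of the first problem, optimization selection, the sketch below picks a flag set for a new program by nearest-neighbour lookup over previously measured programs; the program features, flag sets, and training data are invented for the example and are not taken from the survey.

```python
# Minimal sketch of ML-driven optimization selection: choose compiler flags for
# a new program by finding the most similar previously measured program.
# Feature choice, flag sets, and training data below are illustrative assumptions.

import math

# (program features, best flag set) pairs that would come from offline measurement.
training = [
    ({"loop_nest_depth": 3, "mem_ops_ratio": 0.6, "branch_ratio": 0.05},
     ["-O3", "-funroll-loops"]),
    ({"loop_nest_depth": 1, "mem_ops_ratio": 0.2, "branch_ratio": 0.30},
     ["-O2"]),
    ({"loop_nest_depth": 2, "mem_ops_ratio": 0.5, "branch_ratio": 0.10},
     ["-O3", "-ftree-vectorize"]),
]

def distance(a, b):
    """Euclidean distance over the shared numeric features."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def predict_flags(features):
    """1-nearest-neighbour prediction of a promising flag set."""
    _, flags = min(training, key=lambda t: distance(features, t[0]))
    return flags

new_program = {"loop_nest_depth": 3, "mem_ops_ratio": 0.55, "branch_ratio": 0.08}
print(predict_flags(new_program))   # ['-O3', '-funroll-loops']
```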
A SIMD architecture for hard real-time systems
Emerging safety-critical systems require high-performance data-parallel architectures and, problematically, ones that can guarantee tight and safe worst-case execution times. Given the complexity of existing architectures like GPUs, it is unlikely that sufficiently accurate models and algorithms for timing analysis will emerge in the foreseeable future. This motivates a clean-slate approach to designing a real-time data-parallel architecture.
In this work I present Sim-D: a wide-SIMD architecture for hard real-time systems. Similar to GPUs, Sim-D performs hardware strip-mining to schedule the work for a compute kernel in entities called work-groups. Sim-D schedules the work for each work-group as a sequence of uninterruptible access and execute program phases, interleaving the phases of two work-groups. By providing performance isolation between the memory and compute resources, the execution time of each phase can be tightly bound through static analysis.
I present a predictable closed-page DRAM controller that processes requests for large 1D- and 2D blocks of data, as well as indirect indexed transfers. These large transfers coalesce the data requests of a whole work-group. For a linear 4KiB transfer over a 64-bit data bus, the utilisation provably exceeds 78% for DDR4-3200AA DRAM. For 2D blocks, a well-chosen tiling configuration can achieve near-similar efficiency. I show that bounds on the execution time of indexed transfers are pessimistic by nature, but propose a novel snoopy indexed transfer mechanism that permits more reasonable bounds when the buffer size is limited.
Finally, I present a worst-case execution time calculation algorithm for Sim-D. This algorithm is paired with two hardware work-group scheduling policies that deterministically reduce run-time variance. The worst-case execution time analysis algorithm combines static control flow analysis with a simulation-based cost model for execution and DRAM transfers. Its key novelty is the addition of a stage that considers work-group scheduling effects. I show that the work-group scheduling policies degrade performance on average by 8.9%, but permit the calculation of worst-case execution time bounds that are tight within 14.3% on average for benchmarks that avoid inefficient indexed transfers.
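The performance-isolation argument can be illustrated with a toy timing model in which two work-groups are in flight and each access phase overlaps the previous work-group's execute phase; this is a simplified sketch, not the Sim-D worst-case execution time algorithm, and the cycle counts are assumed.

```python
# Toy model of interleaved access/execute phases with two work-groups in flight:
# while one work-group executes, the next one performs its DRAM access phase.
# This is an illustrative simplification, not the Sim-D WCET analysis.

def interleaved_bound(access, execute):
    """Bound on total time for work-groups with per-work-group access-phase and
    execute-phase worst-case times, processed as a two-stage pipeline
    (each access overlapped with the previous execute)."""
    assert len(access) == len(execute)
    total = access[0]                      # first access cannot be hidden
    for i in range(1, len(access)):
        # Each subsequent access overlaps the previous execute phase;
        # the slower of the two determines the stage time.
        total += max(execute[i - 1], access[i])
    total += execute[-1]                   # last execute cannot be hidden
    return total

# Example: 4 work-groups, access 100 cycles each, execute 150 cycles each.
print(interleaved_bound([100] * 4, [150] * 4))   # 100 + 3*150 + 150 = 700
```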
Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes
High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC), to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy efficiency, which is 3.5x better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
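The abstract does not detail the DRAM-friendly tiling mechanism, but the basic capacity constraint such a scheme must satisfy can be sketched as follows; the tile shape, data widths, and scratchpad budget are assumptions chosen for illustration.

```python
# Illustrative tile-size selection for a convolutional layer: choose the largest
# square output tile whose input tile (with halo), weights, and output tile fit
# together in the cluster scratchpad. Sizes and the scratchpad budget are assumed.

def pick_tile(scratchpad_bytes, in_ch, out_ch, k, bytes_per_elem=2):
    weights = in_ch * out_ch * k * k * bytes_per_elem   # resident weight slice
    tile = 1
    while True:
        nxt = tile + 1
        in_tile = in_ch * (nxt + k - 1) ** 2 * bytes_per_elem   # halo included
        out_tile = out_ch * nxt ** 2 * bytes_per_elem
        if weights + in_tile + out_tile > scratchpad_bytes:
            return tile
        tile = nxt

# Example: 64 -> 64 channels, 3x3 kernel, 128 KiB scratchpad, 16-bit data.
print(pick_tile(128 * 1024, in_ch=64, out_ch=64, k=3))   # largest fitting tile edge
```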
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
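As a small illustration of the space and schedule optimisation built on loop fusion and array contraction, the sketch below fuses a two-pass pixel pipeline into one pass and contracts the full-size intermediate buffer to a scalar; the pipeline itself is invented for the example rather than taken from the thesis.

```python
# Before: two passes with a full-size intermediate buffer.
def brighten_then_clamp(pixels, gain):
    tmp = [p * gain for p in pixels]          # intermediate array (full size)
    return [min(t, 255) for t in tmp]

# After loop fusion + array contraction: one pass, and the intermediate
# array is contracted to a per-pixel scalar.
def brighten_then_clamp_fused(pixels, gain):
    out = []
    for p in pixels:
        t = p * gain                          # contracted intermediate
        out.append(min(t, 255))
    return out

pixels = [10, 100, 200]
assert brighten_then_clamp(pixels, 1.5) == brighten_then_clamp_fused(pixels, 1.5)
```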
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation.
Evaluating the Impact of Memory System Performance on Software Prefetching and Locality Optimizations
Software prefetching and locality optimizations are two techniques for
overcoming the speed gap between processor and memory known as the
memory wall, as described by Wulf and McKee. This
thesis evaluates the impact of memory trends on the effectiveness
of software prefetching and locality optimizations for three types
of applications: regular scientific codes, irregular scientific
codes, and pointer-chasing codes. For many applications, software
prefetching outperforms locality optimizations when there is
sufficient bandwidth in the underlying memory system, but locality
optimizations outperform software prefetching when the underlying
memory system doesn't provide sufficient bandwidth. The break-even
point, or equivalently the crossover bandwidth point, occurs at
roughly 2.4 GBytes/sec for 1 GHz processors on today's memory systems, and will increase on future memory systems. This thesis
also studies the interactions between software prefetching and
locality optimizations when applied in concert. Naively combining
the two techniques provides a more robust application performance
in the face of variations in memory bandwidth and/or latency, but
does not yield additional performance gains. In other words, the
performance won't be better than the best performance of the two
techniques alone. Also, several algorithms are proposed and
evaluated to better combine software prefetching and locality
optimizations, including an enhanced tiling algorithm, padding for software prefetching, and index prefetching.
(Also UMIACS-TR-2002-72)
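A back-of-envelope calculation illustrates why such a crossover bandwidth exists: once prefetching hides latency, a streaming loop is limited by how fast the memory system can deliver cache lines. The loop parameters below are assumed for illustration and are not the thesis's measured workloads.

```python
# Illustrative back-of-envelope, not the thesis's model: a prefetched loop that
# consumes one cache line per iteration needs sustained bandwidth of
# line_size / iteration_time. If the memory system cannot supply it, prefetches
# stall and locality optimizations become the better choice.

def required_bandwidth_gbs(line_bytes, cycles_per_iter, clock_ghz):
    """Sustained bandwidth (GB/s) needed to keep a streaming loop fed."""
    iteration_time_ns = cycles_per_iter / clock_ghz
    return line_bytes / iteration_time_ns          # bytes per ns == GB/s

# Example: 64-byte lines, one line every 25 cycles, 1 GHz core.
print(required_bandwidth_gbs(64, 25, 1.0))   # 2.56 GB/s
```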