Sophie, an FDTD code on the way to multicore, getting rid of the memory bandwidth bottleneck better using cache.
21 pages

FDTD codes, such as Sophie developed at CEA/DAM, no longer take full advantage of the processor's increased computing power, especially with the recent rise of multicore technology. The root cause is that low-order numerical schemes need substantial memory bandwidth to load and store the computed fields. The aim of this article is to present a programming method, at the software-architecture level, that improves the memory access pattern so that data are reused from cache instead of constantly being fetched from RAM. We demonstrate a more than twofold computing-time improvement in practical applications. The target audience of this article is computational scientists and electrical engineers who develop simulation codes without specific expertise in computer science or electronics.
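Since the abstract targets readers without a systems background, a minimal sketch may help convey the idea. The loop-tiled field update below (plain C++; the grid size N, tile size B, and update coefficients are illustrative assumptions, not Sophie's actual scheme) keeps each tile's working set in cache across the inner loops instead of streaming the whole grid from RAM on every sweep:

```cpp
// Minimal sketch of spatial blocking (loop tiling) for a 2D stencil update.
// Not Sophie's actual code: N, B, and the coefficients are illustrative.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4096;  // grid points per dimension (assumed)
constexpr std::size_t B = 64;    // tile edge chosen so ~2*B*B doubles fit in cache

void update_tiled(std::vector<double>& e, const std::vector<double>& h) {
    for (std::size_t ti = 1; ti + 1 < N; ti += B)
        for (std::size_t tj = 1; tj + 1 < N; tj += B)
            // All accesses inside one tile stay within a cache-sized window,
            // so h values are reused from cache rather than refetched from RAM.
            for (std::size_t i = ti; i < std::min(ti + B, N - 1); ++i)
                for (std::size_t j = tj; j < std::min(tj + B, N - 1); ++j)
                    e[i * N + j] += 0.5 * (h[i * N + j] - h[i * N + j - 1])
                                  + 0.5 * (h[i * N + j] - h[(i - 1) * N + j]);
}
```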
Project Report on DOE Young Investigator Grant (Contract No. DE-FG02-02ER25525) Dynamic Scheduling and Fusion of Irregular Computation (August 15, 2002 to August 14, 2005)
Computer simulation has become increasingly important in many scientific disciplines, but its performance and scalability are severely limited by the memory throughput of today's computer systems. With the support of this grant, we first designed training-based prediction, which accurately predicts the memory performance of large applications before their execution. Then we developed optimization techniques using dynamic computation fusion and large-scale data transformation.

The research work has three major components. The first is modeling and prediction of cache behavior. We have developed a new technique that uses reuse-distance information from training inputs to extract a parameterized model of the program's cache miss rates for any input size and for any size of fully associative cache. Using the model, we have built a web-based tool with three-dimensional visualization. The new model can help to build cost-effective computer systems, design better benchmark suites, and improve task scheduling on heterogeneous systems.

The second component is global computation fusion for improving cache performance. We have developed an algorithm for dynamic data partitioning using sampling theory and probability distributions. Recent work from a number of groups shows that manual or semi-manual computation fusion has significant benefits in physical, mechanical, and biological simulations as well as in information retrieval and machine verification. We have developed an automatic tool that measures the potential of computation fusion. The new system can be used by high-performance application programmers to estimate the potential locality improvement of a program before attempting complex transformations for a specific cache system.

The last component studies models of spatial locality and the problem of data layout. In scientific programs, most data are stored in arrays. Grand-challenge problems such as hydrodynamics simulation and data mining may use an enormous number of data elements. To optimize the layout across multiple arrays, we have developed a formal model called reference affinity. We collaborated with the IBM production compiler group and designed an efficient compiler analysis that performs as well as data or code profiling. Based on these results, the IBM group has filed a patent and is including this technique in their product compiler.

A major part of the project is the development of software tools. We have developed web-based visualization for program locality. In addition, we have implemented a prototype of array regrouping in the IBM compiler. The full implementation is expected to come out of IBM in the near future and to benefit scientific applications running on IBM supercomputers. We have also developed a test environment for studying the limit of computation fusion. Finally, our work has directly influenced the design of the Intel Itanium compiler.

The project has strengthened the research relationship between the PI's group and groups in DoE labs. The PI was an invited speaker at the Center for Applied Scientific Computing Seminar Series at an early stage of the project. The question the audience was most curious about was the limit of computation fusion, which has been studied in depth in this research. In addition, the seminar directly helped a group at Lawrence Livermore to achieve a four-times speedup on an important DoE code.
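As an illustration of the reuse-distance idea underlying the cache model, the sketch below (a textbook LRU stack simulation in C++, not the project's actual tool; the trace and cache size are made up) computes per-access reuse distances; for a fully associative LRU cache of C lines, an access hits exactly when its reuse distance is below C, so one histogram yields miss rates for every cache size at once:

```cpp
// Reuse distance of an access = number of DISTINCT addresses touched since
// the previous access to the same address (-1 if never seen before).
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <list>
#include <unordered_map>
#include <vector>

std::vector<long> reuse_distances(const std::vector<std::uint64_t>& trace) {
    std::list<std::uint64_t> stack;  // LRU stack, most recent at front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos;
    std::vector<long> dist;
    for (auto addr : trace) {
        auto it = pos.find(addr);
        if (it == pos.end()) {
            dist.push_back(-1);  // cold (infinite) distance
        } else {
            dist.push_back(std::distance(stack.begin(), it->second));
            stack.erase(it->second);
        }
        stack.push_front(addr);  // addr is now the most recently used
        pos[addr] = stack.begin();
    }
    return dist;
}

int main() {
    std::vector<std::uint64_t> trace = {1, 2, 3, 1, 2, 3, 4, 1};
    long cache_lines = 4, misses = 0;
    for (long d : reuse_distances(trace))
        if (d < 0 || d >= cache_lines) ++misses;  // LRU miss condition
    std::printf("miss rate: %ld/%zu\n", misses, trace.size());  // 4/8 here
}
```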
The PI helped to organize a number of high-performance computing forums, including the founding of a workshop on memory system performance (MSP). In the past two years, one fourth of the papers in the workshop came from researchers at the Lawrence Livermore, Argonne, Los Alamos, and Lawrence Berkeley national laboratories. The PI lectured frequently on DoE-funded research. In a broader context, high-performance computing is central to America's scientific and economic stature in the world, and addresses many of the most scientifically and socially important problems of our day. This research has improved the programming support for a variety of computational paradigms, including dynamic mesh, hydrodynamics, molecular dynamics, multi-grid methods, matrix algebra, and sequential and parallel sorting. In the process, the PI's group has developed and strengthened relationships with DoE laboratories and major hardware and software vendors.
Effective Cache Apportioning for Performance Isolation Under Compiler Guidance
With a growing number of cores in modern high-performance servers, effective sharing of the last-level cache (LLC) is more critical than ever. The primary goal of such systems is to maximize performance by efficiently supporting multi-tenancy of diverse workloads. However, this can be particularly challenging to achieve in practice, because modern workloads exhibit dynamic phase behavior, which causes their cache requirements and sensitivities to vary at fine granularities during execution. Unfortunately, existing systems are oblivious to application phase behavior and cannot detect and react quickly enough to these rapidly changing cache requirements, often incurring significant performance degradation. In this paper, we propose Com-CAS, a new apportioning system that provides dynamic cache allocations for co-executing applications. Com-CAS differs from existing cache-partitioning systems by adapting to the dynamic cache requirements of applications just in time, rather than reactively, and without any hardware modifications. The front-end of Com-CAS consists of compiler analysis equipped with machine-learning mechanisms to predict cache requirements, while the back-end consists of a proactive scheduler that dynamically apportions the LLC among co-executing applications by leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning scheme uses the compiler-generated information at fine granularities to predict rapidly changing dynamic application behavior while simultaneously maintaining data locality. Our experiments show that Com-CAS improves average weighted throughput by 15% over an unpartitioned cache system and outperforms the state-of-the-art partitioning system KPart by 20%, while bounding the worst individual application's completion-time degradation to meet various Service-Level Agreement (SLA) requirements.
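For readers unfamiliar with CAT, the sketch below shows one common way a userspace scheduler can apportion LLC ways through the Linux resctrl interface; the group name, way mask, and socket number are illustrative assumptions, and this is not Com-CAS's actual back-end code:

```cpp
// Minimal sketch: create a resctrl group, restrict it to the LLC ways in
// way_mask, and bind a process to it. Assumes resctrl is mounted at
// /sys/fs/resctrl (mount -t resctrl resctrl /sys/fs/resctrl) and that the
// caller has the needed privileges.
#include <fstream>
#include <string>
#include <sys/stat.h>   // mkdir; pulls in pid_t via <sys/types.h>

bool apportion(const std::string& group, unsigned way_mask, pid_t pid) {
    std::string dir = "/sys/fs/resctrl/" + group;
    mkdir(dir.c_str(), 0755);                      // new allocation (CLOS) group
    std::ofstream sch(dir + "/schemata");
    sch << "L3:0=" << std::hex << way_mask << "\n"; // cache-bit-mask on socket 0
    std::ofstream tasks(dir + "/tasks");
    tasks << pid << "\n";                           // move the process into the group
    return sch.good() && tasks.good();
}
```

A proactive scheduler in this spirit would rewrite the schemata file whenever the predicted phase of a co-running application changes, shrinking or growing each group's way mask just in time.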
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude compared with the same computation on plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE.

In this work, we leverage GPUs to accelerate FHE, capitalizing on the well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute-unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction, one of the most commonly executed operations in FHE. Third, by integrating the MOD-unit with our novel pipelined integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations yields a synergistic approach achieving substantial average speedups over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations.
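To make the MOD-unit motivation concrete, here is a minimal software sketch of Barrett reduction, a standard technique for the modular-reduction step that dominates FHE kernels; the 31-bit modulus bound and word sizes are our assumptions for illustration, not details from the paper:

```cpp
// Barrett reduction: reduce x mod q without a hardware divide.
#include <cstdint>

// Precompute mu = floor(2^62 / q) once per modulus (requires q < 2^31).
// __uint128_t is a GCC/Clang extension.
uint64_t barrett_mu(uint32_t q) {
    return (uint64_t)(((__uint128_t)1 << 62) / q);
}

// Valid for x < 2^62, e.g. x the product of two residues mod q.
uint32_t barrett_reduce(uint64_t x, uint32_t q, uint64_t mu) {
    uint64_t approx = (uint64_t)(((__uint128_t)x * mu) >> 62);  // ~ floor(x / q)
    uint64_t r = x - approx * q;                                // r in [0, 2q)
    return (uint32_t)(r >= q ? r - q : r);                      // one conditional subtract
}
```

The multiply-shift-subtract structure is exactly the kind of fixed dataflow that lends itself to a dedicated pipelined hardware unit.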
IMPROVING THE PERFORMANCE AND TIME-PREDICTABILITY OF GPUs
Graphics Processing Units (GPUs) were originally designed mainly to accelerate graphics applications. Today, their capability to accelerate applications that can be parallelized into a massive number of threads makes GPUs the ideal accelerator for boosting the performance of such general-purpose applications. GPUs are also very promising for embedded and real-time applications, where high throughput and intensive computation are likewise needed.
However, due to the different architecture and programming model of GPUs, two problems must be addressed before GPUs can be exploited further in embedded and real-time applications: how to fully utilize their advanced architectural features to boost performance, and how to analyze the worst-case execution time (WCET) of GPU applications. We propose to apply both architectural modifications and static analysis methods to address these problems.

First, we study GPU cache behavior and use bypassing to reduce unnecessary memory traffic and improve performance. The results show that the proposed bypassing method can reduce global memory traffic by about 22% and improve performance by about 13% on average.

Second, we propose a cache-access reordering framework, based on both an architectural extension and static analysis, to improve the predictability of GPU L1 data caches. The evaluation results show that the proposed method provides good predictability in GPU L1 data caches while still allowing dynamic warp scheduling for good performance.

Third, based on an analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model built on a predictable warp scheduling policy to enable WCET estimation on GPUs. The experimental results show that the proposed WCET analyzer can effectively provide WCET estimates for both soft and hard real-time applications.

Last, we analyze the shared last-level cache (LLC) in integrated CPU-GPU architectures and integrate this analysis into the WCET analysis of GPU kernels in such systems. The results show that the proposed shared-data LLC analysis method can improve the accuracy of shared-LLC miss-rate estimates, which in turn improves the WCET estimates of the GPU kernels.
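To illustrate the flavor of such a WCET timing model, the following deliberately pessimistic back-of-the-envelope bound (all parameters are assumed; this is not the thesis's analyzer) charges every memory access not proven a hit by static cache analysis with the worst-case latency, under a predictable warp schedule in which resident warps are bounded one at a time:

```cpp
// Toy WCET bound for one GPU SM under a predictable warp scheduling policy.
// Every number below is an illustrative assumption.
#include <cstdio>

int main() {
    const long warps           = 48;    // resident warps per SM
    const long inst_per_warp   = 2000;  // instructions per warp
    const long mem_per_warp    = 200;   // memory instructions per warp
    const long issue_cycles    = 1;     // cycles per issued instruction
    const long hit_latency     = 28;    // L1 hit latency (cycles)
    const long miss_latency    = 400;   // worst-case DRAM latency (cycles)
    const long guaranteed_hits = 120;   // hits proven by static cache analysis

    // Accesses not proven hits are charged the full miss latency.
    long mem_cycles = guaranteed_hits * hit_latency
                    + (mem_per_warp - guaranteed_hits) * miss_latency;
    long wcet = warps * (inst_per_warp * issue_cycles + mem_cycles);
    std::printf("WCET bound: %ld cycles\n", wcet);
}
```

Tightening `guaranteed_hits` via a better L1 or shared-LLC analysis directly tightens the bound, which is why the thesis's cache analyses improve the WCET estimates.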