Sophie, an FDTD code on the way to multicore, getting rid of the memory bandwidth bottleneck better using cache.
21 pages

FDTD codes, such as Sophie developed at CEA/DAM, no longer take full advantage of the processor's increased computing power, especially with the recent rise of multicore technology. The root cause is that low-order numerical schemes need substantial memory bandwidth to load and store the computed fields. The aim of this article is to present a programming method, at the software-architecture level, that improves the memory access pattern so that data are reused from cache instead of constantly being fetched from RAM. We demonstrate a more than twofold computing-time improvement in practical applications. The target audience of this article is computational scientists and electrical engineers who develop simulation codes without specific expertise in computer science or electronics.
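Since the abstract targets readers without a systems background, a minimal sketch may help convey the idea. The loop-tiled field update below (plain C++; the grid size N, tile size B, and update coefficients are illustrative assumptions, not Sophie's actual scheme) keeps each tile's working set in cache across the inner loops instead of streaming the whole grid from RAM on every sweep:

```cpp
// Minimal sketch of spatial blocking (loop tiling) for a 2D stencil update.
// Not Sophie's actual code: N, B, and the coefficients are illustrative.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4096;  // grid points per dimension (assumed)
constexpr std::size_t B = 64;    // tile edge chosen so ~2*B*B doubles fit in cache

void update_tiled(std::vector<double>& e, const std::vector<double>& h) {
    for (std::size_t ti = 1; ti + 1 < N; ti += B)
        for (std::size_t tj = 1; tj + 1 < N; tj += B)
            // All accesses inside one tile stay within a cache-sized window,
            // so h values are reused from cache rather than refetched from RAM.
            for (std::size_t i = ti; i < std::min(ti + B, N - 1); ++i)
                for (std::size_t j = tj; j < std::min(tj + B, N - 1); ++j)
                    e[i * N + j] += 0.5 * (h[i * N + j] - h[i * N + j - 1])
                                  + 0.5 * (h[i * N + j] - h[(i - 1) * N + j]);
}
```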
Project Report on DOE Young Investigator Grant (Contract No. DE-FG02-02ER25525) Dynamic Scheduling and Fusion of Irregular Computation (August 15, 2002 to August 14, 2005)
Computer simulation has become increasingly important in many scientific disciplines, but its performance and scalability are severely limited by the memory throughput of today's computer systems. With the support of this grant, we first designed training-based prediction, which accurately predicts the memory performance of large applications before their execution. Then we developed optimization techniques using dynamic computation fusion and large-scale data transformation.

The research work has three major components. The first is modeling and prediction of cache behavior. We have developed a new technique that uses reuse-distance information from training inputs to extract a parameterized model of the program's cache miss rates for any input size and for any size of fully associative cache. Using the model, we have built a web-based tool with three-dimensional visualization. The new model can help to build cost-effective computer systems, design better benchmark suites, and improve task scheduling on heterogeneous systems.

The second component is global computation fusion for improving cache performance. We have developed an algorithm for dynamic data partitioning using sampling theory and probability distributions. Recent work from a number of groups shows that manual or semi-manual computation fusion has significant benefits in physical, mechanical, and biological simulations as well as in information retrieval and machine verification. We have developed an automatic tool that measures the potential of computation fusion. The new system can be used by high-performance application programmers to estimate the potential locality improvement of a program before attempting complex transformations for a specific cache system.

The last component studies models of spatial locality and the problem of data layout. In scientific programs, most data are stored in arrays. Grand-challenge problems such as hydrodynamics simulation and data mining may use an enormous number of data elements. To optimize the layout across multiple arrays, we have developed a formal model called reference affinity. We collaborated with the IBM production compiler group and designed an efficient compiler analysis that performs as well as data or code profiling. Based on these results, the IBM group has filed a patent and is including this technique in their product compiler.

A major part of the project is the development of software tools. We have developed web-based visualization for program locality. In addition, we have implemented a prototype of array regrouping in the IBM compiler. The full implementation is expected to come out of IBM in the near future and to benefit scientific applications running on IBM supercomputers. We have also developed a test environment for studying the limit of computation fusion. Finally, our work has directly influenced the design of the Intel Itanium compiler.

The project has strengthened the research relationship between the PI's group and groups in DoE labs. The PI was an invited speaker at the Center for Applied Scientific Computing Seminar Series at an early stage of the project. The question the audience was most curious about was the limit of computation fusion, which has been studied in depth in this research. In addition, the seminar directly helped a group at Lawrence Livermore to achieve a four-times speedup on an important DoE code.
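As an illustration of the reuse-distance idea underlying the cache model, the sketch below (a textbook LRU stack simulation in C++, not the project's actual tool; the trace and cache size are made up) computes per-access reuse distances; for a fully associative LRU cache of C lines, an access hits exactly when its reuse distance is below C, so one histogram yields miss rates for every cache size at once:

```cpp
// Reuse distance of an access = number of DISTINCT addresses touched since
// the previous access to the same address (-1 if never seen before).
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <list>
#include <unordered_map>
#include <vector>

std::vector<long> reuse_distances(const std::vector<std::uint64_t>& trace) {
    std::list<std::uint64_t> stack;  // LRU stack, most recent at front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos;
    std::vector<long> dist;
    for (auto addr : trace) {
        auto it = pos.find(addr);
        if (it == pos.end()) {
            dist.push_back(-1);  // cold (infinite) distance
        } else {
            dist.push_back(std::distance(stack.begin(), it->second));
            stack.erase(it->second);
        }
        stack.push_front(addr);  // addr is now the most recently used
        pos[addr] = stack.begin();
    }
    return dist;
}

int main() {
    std::vector<std::uint64_t> trace = {1, 2, 3, 1, 2, 3, 4, 1};
    long cache_lines = 4, misses = 0;
    for (long d : reuse_distances(trace))
        if (d < 0 || d >= cache_lines) ++misses;  // LRU miss condition
    std::printf("miss rate: %ld/%zu\n", misses, trace.size());  // 4/8 here
}
```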
The PI helped to organize a number of high-performance computing forums, including the founding of a workshop on memory system performance (MSP). In the past two years, one fourth of the papers in the workshop came from researchers at the Lawrence Livermore, Argonne, Los Alamos, and Lawrence Berkeley national laboratories. The PI lectured frequently on DoE-funded research. In a broader context, high-performance computing is central to America's scientific and economic stature in the world, and addresses many of the most scientifically and socially important problems of our day. This research has improved the programming support for a variety of computational paradigms, including dynamic mesh, hydrodynamics, molecular dynamics, multi-grid methods, matrix algebra, and sequential and parallel sorting. In the process, the PI's group has developed and strengthened relationships with DoE laboratories and major hardware and software vendors.
Effective Cache Apportioning for Performance Isolation Under Compiler Guidance
With a growing number of cores in modern high-performance servers, effective sharing of the last-level cache (LLC) is more critical than ever. The primary goal of such systems is to maximize performance by efficiently supporting multi-tenancy of diverse workloads. However, this can be particularly challenging to achieve in practice, because modern workloads exhibit dynamic phase behavior, which causes their cache requirements and sensitivities to vary at fine granularities during execution. Unfortunately, existing systems are oblivious to application phase behavior and cannot detect and react quickly enough to these rapidly changing cache requirements, often incurring significant performance degradation. In this paper, we propose Com-CAS, a new apportioning system that provides dynamic cache allocations for co-executing applications. Com-CAS differs from existing cache-partitioning systems by adapting to the dynamic cache requirements of applications just in time, rather than reactively, and without any hardware modifications. The front-end of Com-CAS consists of compiler analysis equipped with machine-learning mechanisms to predict cache requirements, while the back-end consists of a proactive scheduler that dynamically apportions the LLC among co-executing applications by leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning scheme uses the compiler-generated information at fine granularities to predict rapidly changing dynamic application behavior while simultaneously maintaining data locality. Our experiments show that Com-CAS improves average weighted throughput by 15% over an unpartitioned cache system and outperforms the state-of-the-art partitioning system KPart by 20%, while bounding the worst individual application's completion-time degradation to meet various Service-Level Agreement (SLA) requirements.
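For readers unfamiliar with CAT, the sketch below shows one common way a userspace scheduler can apportion LLC ways through the Linux resctrl interface; the group name, way mask, and socket number are illustrative assumptions, and this is not Com-CAS's actual back-end code:

```cpp
// Minimal sketch: create a resctrl group, restrict it to the LLC ways in
// way_mask, and bind a process to it. Assumes resctrl is mounted at
// /sys/fs/resctrl (mount -t resctrl resctrl /sys/fs/resctrl) and that the
// caller has the needed privileges.
#include <fstream>
#include <string>
#include <sys/stat.h>   // mkdir; pulls in pid_t via <sys/types.h>

bool apportion(const std::string& group, unsigned way_mask, pid_t pid) {
    std::string dir = "/sys/fs/resctrl/" + group;
    mkdir(dir.c_str(), 0755);                      // new allocation (CLOS) group
    std::ofstream sch(dir + "/schemata");
    sch << "L3:0=" << std::hex << way_mask << "\n"; // cache-bit-mask on socket 0
    std::ofstream tasks(dir + "/tasks");
    tasks << pid << "\n";                           // move the process into the group
    return sch.good() && tasks.good();
}
```

A proactive scheduler in this spirit would rewrite the schemata file whenever the predicted phase of a co-running application changes, shrinking or growing each group's way mask just in time.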
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude compared with the same computation on plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE.

In this work, we leverage GPUs to accelerate FHE, capitalizing on the well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute-unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction, one of the most commonly executed operations in FHE. Third, by integrating the MOD-unit with our novel pipelined integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations yields a synergistic approach achieving substantial average speedups over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations.
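To make the MOD-unit motivation concrete, here is a minimal software sketch of Barrett reduction, a standard technique for the modular-reduction step that dominates FHE kernels; the 31-bit modulus bound and word sizes are our assumptions for illustration, not details from the paper:

```cpp
// Barrett reduction: reduce x mod q without a hardware divide.
#include <cstdint>

// Precompute mu = floor(2^62 / q) once per modulus (requires q < 2^31).
// __uint128_t is a GCC/Clang extension.
uint64_t barrett_mu(uint32_t q) {
    return (uint64_t)(((__uint128_t)1 << 62) / q);
}

// Valid for x < 2^62, e.g. x the product of two residues mod q.
uint32_t barrett_reduce(uint64_t x, uint32_t q, uint64_t mu) {
    uint64_t approx = (uint64_t)(((__uint128_t)x * mu) >> 62);  // ~ floor(x / q)
    uint64_t r = x - approx * q;                                // r in [0, 2q)
    return (uint32_t)(r >= q ? r - q : r);                      // one conditional subtract
}
```

The multiply-shift-subtract structure is exactly the kind of fixed dataflow that lends itself to a dedicated pipelined hardware unit.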
IMPROVING THE PERFORMANCE AND TIME-PREDICTABILITY OF GPUs
Graphics Processing Units (GPUs) were originally designed mainly to accelerate graphics applications. Today, their capability to accelerate applications that can be parallelized into a massive number of threads makes GPUs the ideal accelerator for boosting the performance of such general-purpose applications. GPUs are also very promising for embedded and real-time applications, where high throughput and intensive computation are likewise needed.
However, due to the different architecture and programming model of GPUs, two problems must be addressed before GPUs can be exploited further in embedded and real-time applications: how to fully utilize their advanced architectural features to boost performance, and how to analyze the worst-case execution time (WCET) of GPU applications. We propose to apply both architectural modifications and static analysis methods to address these problems.

First, we study GPU cache behavior and use bypassing to reduce unnecessary memory traffic and improve performance. The results show that the proposed bypassing method can reduce global memory traffic by about 22% and improve performance by about 13% on average.

Second, we propose a cache-access reordering framework, based on both an architectural extension and static analysis, to improve the predictability of GPU L1 data caches. The evaluation results show that the proposed method provides good predictability in GPU L1 data caches while still allowing dynamic warp scheduling for good performance.

Third, based on an analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model built on a predictable warp scheduling policy to enable WCET estimation on GPUs. The experimental results show that the proposed WCET analyzer can effectively provide WCET estimates for both soft and hard real-time applications.

Last, we analyze the shared last-level cache (LLC) in integrated CPU-GPU architectures and integrate this analysis into the WCET analysis of GPU kernels in such systems. The results show that the proposed shared-data LLC analysis method can improve the accuracy of shared-LLC miss-rate estimates, which in turn improves the WCET estimates of the GPU kernels.
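To illustrate the flavor of such a WCET timing model, the following deliberately pessimistic back-of-the-envelope bound (all parameters are assumed; this is not the thesis's analyzer) charges every memory access not proven a hit by static cache analysis with the worst-case latency, under a predictable warp schedule in which resident warps are bounded one at a time:

```cpp
// Toy WCET bound for one GPU SM under a predictable warp scheduling policy.
// Every number below is an illustrative assumption.
#include <cstdio>

int main() {
    const long warps           = 48;    // resident warps per SM
    const long inst_per_warp   = 2000;  // instructions per warp
    const long mem_per_warp    = 200;   // memory instructions per warp
    const long issue_cycles    = 1;     // cycles per issued instruction
    const long hit_latency     = 28;    // L1 hit latency (cycles)
    const long miss_latency    = 400;   // worst-case DRAM latency (cycles)
    const long guaranteed_hits = 120;   // hits proven by static cache analysis

    // Accesses not proven hits are charged the full miss latency.
    long mem_cycles = guaranteed_hits * hit_latency
                    + (mem_per_warp - guaranteed_hits) * miss_latency;
    long wcet = warps * (inst_per_warp * issue_cycles + mem_cycles);
    std::printf("WCET bound: %ld cycles\n", wcet);
}
```

Tightening `guaranteed_hits` via a better L1 or shared-LLC analysis directly tightens the bound, which is why the thesis's cache analyses improve the WCET estimates.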