Rewriting System for Profile-Guided Data Layout Transformations on Binaries
Careful data layout design is crucial for achieving high performance. However, exploring data layouts is time-consuming and error-prone, and assessing the impact of a layout transformation on performance is difficult without performing it. We propose to guide application programmers through data layout restructuring by providing a comprehensive multidimensional description of the initial layout, built from trace analysis, and then by giving a performance evaluation of the transformations tested and an expression of each transformed layout. The programmer can limit the exploration to layouts matching some patterns. We apply this method to two multithreaded applications. The performance prediction of multiple transformations matches within 5% the performance of hand-transformed layout code.
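As a rough illustration of the kind of trace-based layout description above, the sketch below (all names hypothetical, not the paper's actual tool) counts how often pairs of fields appear near each other in an access trace; high-affinity pairs are candidates to share a cache block in a restructured layout:

```python
from collections import Counter
from itertools import combinations

def field_affinity(trace, window=8):
    """Count how often each pair of fields appears within `window`
    consecutive accesses of each other in the trace."""
    pairs = Counter()
    for i in range(len(trace)):
        seen = set(trace[i:i + window])
        for a, b in combinations(sorted(seen), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy trace over fields of one record type: x and y are always touched
# together, while z appears only occasionally alongside them.
trace = ["x", "y", "x", "y", "z", "x", "y", "x", "y"]
aff = field_affinity(trace)
# The (x, y) pair dominates, suggesting x and y belong in the same group.
```

A real tool would derive the trace from instrumented runs and restrict candidate layouts to the programmer-supplied patterns mentioned in the abstract.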
Project Report on DOE Young Investigator Grant (Contract No. DE-FG02-02ER25525) Dynamic Scheduling and Fusion of Irregular Computation (August 15, 2002 to August 14, 2005)
Computer simulation has become increasingly important in many scientific disciplines, but its performance and scalability are severely limited by the memory throughput on today's computer systems. With the support of this grant, we first designed training-based prediction, which accurately predicts the memory performance of large applications before their execution. Then we developed optimization techniques using dynamic computation fusion and large-scale data transformation. The research work has three major components. The first is modeling and prediction of cache behavior. We have developed a new technique, which uses reuse distance information from training inputs and then extracts a parameterized model of the program's cache miss rates for any input size and for any size of fully associative cache. Using the model, we have built a web-based tool with three-dimensional visualization. The new model can help to build cost-effective computer systems, design better benchmark suites, and improve task scheduling on heterogeneous systems. The second component is global computation fusion for improving cache performance. We have developed an algorithm for dynamic data partitioning using sampling theory and probability distribution. Recent work from a number of groups shows that manual or semi-manual computation fusion has significant benefits in physical, mechanical, and biological simulations, as well as in information retrieval and machine verification. We have developed an automatic tool that measures the potential of computation fusion. The new system can be used by high-performance application programmers to estimate the potential of locality improvement for a program before trying complex transformations for a specific cache system. The last component studies models of spatial locality and the problem of data layout. In scientific programs, most data are stored in arrays.
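The reuse-distance model mentioned above can be sketched naively (the grant work uses far more efficient algorithms and a parameterized cross-input model; the function names here are illustrative): an element's reuse distance is the number of distinct addresses touched since its previous access, and a fully associative LRU cache of size C misses exactly when that distance is at least C.

```python
def reuse_distances(trace):
    """LRU reuse distance per access: number of distinct addresses touched
    since the previous access to the same address (inf on first use).
    This is the quadratic textbook version, not a production algorithm."""
    last_pos = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses strictly between the two accesses
            dists.append(len(set(trace[last_pos[addr] + 1:i])))
        else:
            dists.append(float("inf"))
        last_pos[addr] = i
    return dists

def miss_rate(trace, cache_size):
    """A fully associative LRU cache of `cache_size` blocks misses exactly
    when the reuse distance is >= cache_size."""
    d = reuse_distances(trace)
    return sum(1 for x in d if x >= cache_size) / len(d)

trace = ["a", "b", "c", "a", "b", "c", "a"]
# distances: inf, inf, inf, 2, 2, 2, 2 -> only cold misses once C > 2
```

The key property exploited by the prediction work is that these distances are a hardware-independent program signature: one histogram yields the miss rate for every fully associative cache size at once.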
Grand challenge problems such as hydrodynamics simulation and data mining may use an enormous number of data elements. To optimize the layout across multiple arrays, we have developed a formal model called reference affinity. We collaborated with the IBM production compiler group and designed an efficient compiler analysis that performs as well as data or code profiling does. Based on these results, the IBM group has filed a patent and is including this technique in their product compiler. A major part of the project is the development of software tools. We have developed web-based visualization for program locality. In addition, we have implemented a prototype of array regrouping in the IBM compiler. The full implementation is expected to come out of IBM in the near future and to benefit scientific applications running on IBM supercomputers. We have also developed a test environment for studying the limit of computation fusion. Finally, our work has directly influenced the design of the Intel Itanium compiler. The project has strengthened the research relation between the PI's group and groups in DoE labs. The PI was an invited speaker at the Center for Applied Scientific Computing Seminar Series at the early stage of the project. The question the audience was most curious about was the limit of computation fusion, which has been studied in depth in this research. In addition, the seminar directly helped a group at Lawrence Livermore to achieve a four-times speedup on an important DoE code. The PI helped to organize a number of high-performance computing forums, including the founding of a workshop on memory system performance (MSP). In the past two years, one fourth of the papers in the workshop came from researchers at the Lawrence Livermore, Argonne, Los Alamos, and Lawrence Berkeley national laboratories. The PI lectured frequently on DoE-funded research.
In a broader context, high performance computing is central to America's scientific and economic stature in the world, and addresses many of the most scientifically and socially important problems of our day. This research has improved the programming support for a variety of computational paradigms, including dynamic mesh, hydrodynamics, molecular dynamics, multi-grid methods, matrix algebra, and sequential and parallel sorting. In the process, the PI's group has developed and strengthened relationships with DoE laboratories and major hardware and software vendors.
Inter-array Data Regrouping
As the speed gap between CPU and memory widens, memory hierarchy has become the performance bottleneck for most applications because of both the high latency and low bandwidth of direct memory access. With the recent introduction of latency hiding on modern machines, the limited memory bandwidth has become the primary constraint and, consequently, the effective use of available memory bandwidth has become critical to a program. Since memory data are transferred one cache block at a time, improving the utilization of cache blocks can directly improve memory bandwidth utilization and program performance. However, existing optimizations do not maximize cache-block utilization because they are intra-array; that is, they improve only data reuse within single arrays, and they do not group useful data of multiple arrays into the same cache block. In this paper, we present inter-array data regrouping, a global data transformation that first splits and then selectively regroups all data arrays…
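A minimal sketch of the regrouping idea, assuming two arrays that are always accessed at the same index (the helper names are hypothetical, not the paper's interface): interleaving the arrays puts a[i] and b[i] in the same cache block, so one block transfer serves both accesses instead of two.

```python
import array

n = 8
# Original layout: two separate arrays of doubles. Accessing a[i] and
# b[i] together touches two different cache blocks.
a = array.array("d", range(n))
b = array.array("d", range(n, 2 * n))

# Regrouped layout: interleave a and b so that a[i] and b[i] are
# adjacent in memory and land in the same cache block.
grouped = array.array("d", (v for i in range(n) for v in (a[i], b[i])))

def get_a(i):
    return grouped[2 * i]

def get_b(i):
    return grouped[2 * i + 1]
```

The compiler transformation does this globally and only for arrays whose accesses are provably correlated, which is where the selectivity and profitability guarantees come in.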
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth.
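The bandwidth waste from large-stride access can be quantified with a back-of-the-envelope model (an illustration, not the dissertation's model; alignment effects are ignored and the default sizes are assumptions): with 8-byte elements and 64-byte cache blocks, any stride of 8 elements or more transfers a whole block per useful element.

```python
def cache_block_utilization(stride, elem_size=8, block_size=64):
    """Approximate fraction of each transferred cache block that a
    fixed-stride scan actually uses (alignment effects ignored).
    stride is measured in elements of `elem_size` bytes."""
    if stride * elem_size >= block_size:
        # Strides spanning a block: one useful element per fetched block.
        return elem_size / block_size
    # Smaller strides: 1 out of every `stride` elements in a block is used.
    return 1.0 / stride

# Sequential access uses every byte; stride-2 wastes half of each block;
# stride-8 and beyond waste 7/8 of all transferred bandwidth.
```

This is why the abstract calls large-stride access "communicating useless data": the unused bytes of every block still consume the scarce memory bandwidth.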
This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data, then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are:
- Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop.
- Inter-array data regrouping: the first technique to selectively group global data structures, and to do so with guaranteed profitability and compile-time optimality.
- Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time.
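Loop fusion, the first of these transformations, can be illustrated with a toy example (plain Python for clarity; the dissertation's algorithm operates on compiler IR, not source like this):

```python
n = 1000
a = list(range(n))

# Unfused: two separate loops. b[i] is produced in the first loop but
# not consumed until the second, a reuse distance of about n elements,
# so b is evicted from cache and re-fetched from memory.
b = [0.0] * n
c = [0.0] * n
for i in range(n):
    b[i] = a[i] * 2.0
for i in range(n):
    c[i] = b[i] + 1.0

# Fused: one loop. Each intermediate value is consumed immediately
# after it is produced, so the reuse distance is bounded by a constant
# and the intermediate never round-trips through memory.
c_fused = [0.0] * n
for i in range(n):
    t = a[i] * 2.0
    c_fused[i] = t + 1.0
```

Maximal loop fusion generalizes this across all statements of the program while keeping reuse distances within each fused loop bounded, which is the property that translates into saved memory transfers.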
These optimizations have been implemented in a research compiler and evaluated on real-world applications on an SGI Origin2000. The results show that, on average, the new strategy eliminates 41% of memory loads in regular applications and 63% in irregular and dynamic programs. As a result, overall execution time is shortened by 12% to 77%.
In addition to compiler optimizations, this research has developed a performance model and designed a performance tool. The former allows precise measurement of the memory bandwidth bottleneck; the latter enables effective user tuning and accurate performance prediction for large applications. Neither goal was achieved before this thesis.