Search CORE

26 research outputs found

USING HARDWARE MONITORS TO AUTOMATICALLY IMPROVE MEMORY PERFORMANCE

Author: Tikir Mustafa Murat
Publication venue
Publication date: 19/09/2005
Field of study

In this thesis, we propose and evaluate several techniques to dynamically increase the memory access locality of scientific and Java server applications running on cache-coherent non-uniform memory access(cc-NUMA) servers. We first introduce a user-level online page migration scheme where applications are profiled using hardware monitors to determine the preferred locations of the memory pages. The pages are then migrated to memory units via system calls. In our approach, both profiling and page migrations are conducted online while the application runs. We also investigate the use of several potential sources of profiles gathered from hardware monitors in dynamic page migration and compare their effectiveness to using profiles from centralized hardware monitors. In particular, we evaluate using profiles from on-chip CPU monitors, valid TLB content and a hypothetical hardware feature. We also introduce a set of techniques to both measure and optimize the memory access locality in Java server applications running on cc-NUMA servers. In particular, we propose the use of several NUMA-aware Java heap layouts for initial object allocation and use of dynamic object migration during garbage collection to move objects local to the processors accessing them most. To evaluate these techniques, we also introduce a new hybrid simulation approach to simulate memory behavior of parallel applications based on gathering a partial trace of memory accesses from hardware monitors during an actual run of an application and extrapolating it to a representative full trace. Our dynamic page migration approach achieved reductions up to 90% in the number of non-local accesses, which resulted in up to a 16% performance improvement. Our results demonstrated that the combinations of inexpensive hardware monitors and a simple migration policy can be effectively used to improve the performance of real scientific applications. Our simulation study demonstrated that cache miss profiles gathered from on-chip hardware monitors, which are typically available in current micro-processors, can be effectively used to guide dynamic page migrations in an application. Our NUMA-aware heap layouts reduced the total number of non-local object accesses in SPECjbb2000 up to 41%, which resulted in up to a 40% reduction in the memory wait time of the workload

Digital Repository at the University of Maryland

Efficient instrumentation for code coverage testing

Author: Jeffrey K. Hollingsworth
Mustafa M. Tikir
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2004
Field of study

Crossref

Efficient instrumentation for code coverage testing

Author: Ball T
Jeffrey K. Hollingsworth
Mustafa M. Tikir
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date
Field of study

Crossref

High-frequency simulations of global seismic wave propagation using SPECFEM3D_GLOBE on 62K processors

Author: Carrington Laura
Komatitsch Dimitri
Laurenzano Michael
Michéa David
Snavely Allan
Tikir Mustafa M.
Tromp Jeroen
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/11/2008
Field of study

SPECFEM3D_GLOBE is a spectral element application enabling the simulation of global seismic wave propagation in 3D anelastic, anisotropic, rotating and self-gravitating Earth models at unprecedented resolution. A fundamental challenge in global seismology is to model the propagation of waves with periods between 1 and 2 seconds, the highest frequency signals that can propagate clear across the Earth. These waves help reveal the 3D structure of the Earth's deep interior and can be compared to seismographic recordings. We broke the 2 second barrier using the 62K processor Ranger system at TACC. Indeed we broke the barrier using just half of Ranger, by reaching a period of 1.84 seconds with sustained 28.7 Tflops on 32K processors. We obtained similar results on the XT4 Franklin system at NERSC and the XT4 Kraken system at University of Tennessee Knoxville, while a similar run on the 28K processor Jaguar system at ORNL, which has better memory bandwidth per processor, sustained 35.7 Tflops (a higher flops rate) with a 1.94 shortest period. Thus we have enabled a powerful new tool for seismic wave simulation, one that operates in the same frequency regimes as nature; in seismology there is no need to pursue periods much smaller because higher frequency signals do not propagate across the entire globe. We employed performance modeling methods to identify performance bottlenecks and worked through issues of parallel I/O and scalability. Improved mesh design and numbering results in excellent load balancing and few cache misses. The primary achievements are not just the scalability and high teraflops number, but a historic step towards understanding the physics and chemistry of the Earth's interior at unprecedented resolution

Recommended from our members

Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team

Author: Alam Sadaf
Bailey David H.
Carrington Laura
Daley Chris
de Supinski Bronis R.
Dubey Anshu
Gamblin Todd
Gunter Dan
Hovland Paul D.
Jagode Heike
Karavanic Karen
Marin Gabriel
Mellor-Crummey John
Moore Shirley
Norris Boyana
Oliker Leonid
Olschanowsky Catherine
Roth Philip C.
Schulz Martin
Shende Sameer
Snavely Allan
Spear Wyatt
Tikir Mustafa
Vetter Jeff
Worley Pat
Wright Nicholas
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 26/06/2009
Field of study

The Performance Engineering Institute (PERI) originally proposed a tiger team activity as a mechanism to target significant effort optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, the Office of Science requested a new focus beginning in 2008: assistance in forming its ten year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of anticipated systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions, and will enable DOE applications to exploit machines in future facilities fully

UNT Digital Library

Generating Efficient Stack Code for Java

Author: Mustafa Tikir
Tatiana Shpeisman
Publication venue
Publication date: 09/10/1999
Field of study

Optimizing Java byte code is complicated by the fact that it uses a stack-based execution model. Changing the intermediate representation from the stack-based to the register-based one brings the problem of Java byte code optimizations into well-studied domain of compiler optimizations for registerbased codes. In this paper we describe the technique to convert a register-based code into the Java byte code. The code generation techniques developed for the stack-based computers are not directly applicable to this problem as the comparative cost of the local memory and stack manipulation instructions in JVM is quite different from that in the stack-based computers. Naive verbose translation of the registerbased code into the Java byte code produces the code with many redundant store and load instructions. The tool that we have developed allows to remove 90-100 % of the stores to the local (i.e., non-global) variables. It produces the Java byte code that is slightly faster and shorter than..

CiteSeerX

Digital Repository at the University of Maryland