24 research outputs found

Feasibility of optically interconnected parallel processors using wavelength division multiplexing

Publication venue: 'Office of Scientific and Technical Information (OSTI)'

Crossref

Exploring Shared Memory Protocols in FLASH

Author: Chame Jacqueline; Hall Mary; Horowitz Mark; Kunz Robert; Lucas Robert
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 01/04/2007

ABSTRACT The goal of this project was to improve the performance of large scientific and engineering applications through collaborative hardware and software mechanisms to manage the memory hierarchy of non-uniform memory access time (NUMA) shared-memory machines, as well as their component individual processors. In spite of the programming advantages of shared-memory platforms, obtaining good performance for large scientific and engineering applications on such machines can be challenging. Because communication between processors is managed implicitly by the hardware, rather than expressed by the programmer, application performance may suffer from unintended communication, that is, communication that the programmer did not consider when developing the application. In this project, we developed and evaluated a collection of hardware, compiler, language, and performance-monitoring tools to obtain high performance on scientific and engineering applications on NUMA platforms by managing communication through alternative coherence mechanisms. Alternative coherence mechanisms have often been discussed as a means of reducing unintended communication, although architecture implementations of such mechanisms are quite rare. This report describes an actual implementation of a set of coherence protocols that support coherent, non-coherent, and write-update accesses for a CC-NUMA shared-memory architecture, the Stanford FLASH machine. Such an approach has the advantage of using alternative coherence only where it is beneficial, and it also provides an evolutionary migration path for improving application performance. We present data on two computations, RandomAccess from the HPC Challenge benchmarks and a forward solver derived from LS-DYNA, showing the performance advantages of the alternative coherence mechanisms. For RandomAccess, the non-coherent and write-update versions can outperform the coherent version by factors of 5 and 2.5, respectively. In LS-DYNA, we obtain improvements of 18% on average using the non-coherent version. We also present data on the SpecOMP benchmarks, showing that the protocols have a modest overhead of less than 3% in applications where the alternative mechanisms are not needed. In addition to the selective coherence studies on the FLASH machine, in the last six months of this project ISI performed research on compiler technology for the transactional memory (TM) programming model being developed at Stanford. As part of this research, ISI developed a compiler that recognizes transactional memory "pragmas" and automatically generates parallel code for the TM programming model.

UNT Digital Library

Cache-coherent distributed shared memory: perspectives on its development and future challenges

Author: A. Gupta; J. Hennessy; M. Heinrich
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Lock Prediction to Reduce the Overhead of Synchronization Primitives

Author: Shankar Anusha
Publication date: 28/04/2015

The advent of chip multiprocessors has led to an increase in computational performance in recent years. Employing efficient parallel algorithms has become important to harness the full potential of multiple cores. One of the major productivity limitations in parallel programming arises from the use of synchronization primitives. These primitives are used to enforce mutual exclusion on critical-section data. Most shared-memory multiprocessor architectures provide hardware support for mutually exclusive access to shared data structures using lock and unlock operations. These operations are implemented in hardware as a set of instructions that atomically read and then write a single memory location. Good synchronization techniques should reduce network bandwidth consumption, have low lock-acquisition latency, and be fair in granting requests. In a typical directory-controller-based locking scheme, each thread communicates with the directory controller for lock request and lock release. The overhead of this design includes communication with the directory controller for each step of lock acquisition, which causes high-latency transactions. Thus, a significant amount of time is spent in communication compared to the actual operation. Previous work has focused on reducing communication with the home node through various techniques. One such technique of interest is the Implicit Queue on Lock Bit technique (IQOLB). In this technique, the lock is forwarded directly to the requestor from the thread currently holding the lock, without communication through the home node. The method has two limitations: forwarding can take place only after the thread currently holding the lock has received information about the new lock requestor from the home node, and the cache coherence protocol must be modified to distinguish a regular memory read request from a synchronization request. Very little research has been performed in the area of lock prediction.
We believe, based on data analysis, that lock communication is predictable and that prediction can improve performance significantly. This research focuses on predicting the sequence in which locks are acquired so that the thread currently holding the lock can preemptively invalidate the locked cache line and forward it to subsequent requestors, reducing the time taken to acquire a lock. The predictor is adaptive: whenever a lock is biased towards a thread, it remains in the cache of that thread and no invalidation takes place. The benefits of the technique include a reduction in the number of messages exchanged with the home node without any modification to the cache coherence protocol (it does not distinguish a regular memory read request from a synchronization request). Evaluation of the lock predictor on the PARSEC benchmark suite shows an average improvement in overall performance of 9% over the base case.
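The idea of predicting the order of lock acquisitions can be illustrated with a small software model. The sketch below is not the predictor evaluated in the thesis, whose design the abstract does not specify; it is a hypothetical "most frequent successor" table, and the class name `LockSuccessorPredictor` and its methods are assumptions made for illustration. It records which thread acquired each lock after each holder, and it declines to forward when the lock appears biased towards the current holder.

```python
from collections import Counter, defaultdict


class LockSuccessorPredictor:
    """Hypothetical per-lock predictor (illustration only, not the thesis design).

    For every lock it remembers which thread acquired the lock right after
    each previous holder, and predicts the most frequent successor.
    """

    def __init__(self):
        # history[lock][holder] counts the threads that acquired `lock` next.
        self.history = defaultdict(lambda: defaultdict(Counter))
        # last_holder[lock] is the thread that most recently acquired `lock`.
        self.last_holder = {}

    def record_acquire(self, lock, thread):
        """Update the table when `thread` acquires `lock`."""
        prev = self.last_holder.get(lock)
        if prev is not None:
            self.history[lock][prev][thread] += 1
        self.last_holder[lock] = thread

    def predict_next(self, lock, holder):
        """Return the most frequent successor of `holder`, or None if unknown."""
        successors = self.history[lock][holder]
        if not successors:
            return None
        return successors.most_common(1)[0][0]

    def should_forward(self, lock, holder):
        """Forward the line only if another thread is predicted to be next.

        When the lock is biased towards `holder` (it is its own predicted
        successor), the line stays in the holder's cache.
        """
        nxt = self.predict_next(lock, holder)
        return nxt is not None and nxt != holder


if __name__ == "__main__":
    predictor = LockSuccessorPredictor()
    for thread in [0, 1, 2, 0, 1, 2, 0]:      # round-robin contention on lock "A"
        predictor.record_acquire("A", thread)
    for thread in [3, 3, 3]:                  # lock "B" is biased towards thread 3
        predictor.record_acquire("B", thread)
    print(predictor.predict_next("A", 0), predictor.should_forward("B", 3))
```

In hardware, such a table would sit beside the cache controller and trigger early forwarding of the lock line; the software model only shows the bookkeeping.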

Texas A&M Repository

Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

Author: Bodin Francois; Gannon Dennis; Mehrotra Piyush; Priol Thierry

Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism in Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach, in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++, which is based on a concurrent aggregate parallel model.
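To make the notion of user-directed data distribution concrete, the sketch below computes which processor owns each array element under the two standard HPF distribution formats, BLOCK and CYCLIC. The abstract does not name these formats, so their use here is an assumption, and the Python helper names (`block_owner`, `cyclic_owner`, `layout`) are hypothetical. Indices are 0-based, unlike Fortran.

```python
import math


def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks.

    Mirrors the BLOCK format: each processor gets ceil(n / p) consecutive
    elements, and the last processors may hold fewer.
    """
    block_size = math.ceil(n / p)
    return i // block_size


def cyclic_owner(i, p):
    """Owner of element i under the CYCLIC format (round-robin dealing)."""
    return i % p


def layout(n, p, kind):
    """Return the owning processor of every element for the given format."""
    if kind == "block":
        return [block_owner(i, n, p) for i in range(n)]
    if kind == "cyclic":
        return [cyclic_owner(i, p) for i in range(n)]
    raise ValueError(f"unknown distribution: {kind}")


if __name__ == "__main__":
    # Ten elements on four processors.
    print(layout(10, 4, "block"))   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
    print(layout(10, 4, "cyclic"))  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```

BLOCK keeps neighbouring elements on the same processor, which suits stencil-like access, while CYCLIC spreads work more evenly when the cost per element varies.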

NASA Technical Reports Server

Mobile Home Node: Improving Directory Cache Coherence Performance in NoCs via Exploitation of Producer-Consumer Relationships

Author: Soni Tarun

The implementation of multiple processors on a single chip has been made possible by advancements in process technology. The benefits of having multiple cores on a single chip bring with them a new set of constraints for maintaining fast and consistent memory accesses. Cache coherence protocols are needed to maintain the consistency of shared memory across individual caches. Current cache coherence protocols are either snoop based, which is not scalable but provides fast access for a small number of cores, or directory based, in which a directory acts as the ordering point, providing scalability at the cost of relatively slower access. Our focus is on improving the memory access time of the scalable directory protocol. We have observed that most memory requests follow a pattern wherein one of the processors, which we dub the Producer, repeatedly writes to a particular memory location. A subset of the remaining cores, which we dub the Consumers, repeatedly read the data from that same memory location. In our implementation we utilize this relationship to provide direct cache-to-cache transfers and minimize the access time by avoiding the indirection through the directory. We move the directory temporarily to the Producer node so that the Consumer can request the cache line directly from the Producer. Our technique improves the memory access time by 13 percent and reduces network traffic by 30 percent over the standard directory coherence protocol, with very little area overhead.
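The producer-consumer observation can be sketched as a small per-line detector. This is a simplified model under stated assumptions, not the mechanism from the thesis: the class `LineTracker`, the `threshold` parameter, and the message counts (three messages for a read miss routed through the home directory, two when the request goes straight to the producer) are illustrative choices, and the sketch does not attempt to reproduce the reported 13 percent and 30 percent figures.

```python
class LineTracker:
    """Hypothetical detector of a producer-consumer pattern on one cache line."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # writes by one core needed to call it a producer
        self.writer = None          # core that issued the most recent write
        self.writes = 0             # consecutive writes issued by `writer`
        self.readers = set()        # other cores that read since `writer` took over

    def on_write(self, core):
        if core == self.writer:
            self.writes += 1
        else:
            # A different core wrote: restart detection for the new writer.
            self.writer, self.writes = core, 1
            self.readers.clear()

    def on_read(self, core):
        if core != self.writer:
            self.readers.add(core)

    def producer(self):
        """Return the producer core once the pattern is established, else None."""
        if self.writes >= self.threshold and self.readers:
            return self.writer
        return None


def read_miss_messages(tracker):
    """Messages for one consumer read miss in this simplified model.

    Baseline: consumer -> home directory -> producer -> consumer (3 messages).
    Directory moved to the producer: consumer -> producer -> consumer (2 messages).
    """
    return 2 if tracker.producer() is not None else 3


if __name__ == "__main__":
    line = LineTracker(threshold=2)
    line.on_write(0)
    line.on_read(1)
    print(read_miss_messages(line))  # pattern not yet established
    line.on_write(0)
    line.on_read(2)
    print(line.producer(), read_miss_messages(line))
```

A write by any other core resets the detector, which stands in for moving the directory back to its original home when the pattern breaks.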

Texas A&M Repository

A two-level directory architecture for highly scalable cc-NUMA multiprocessors

Author: J. Duato; J. Gonzalez; J.M. Garcia; M.E. Acacio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref